Verifiable Linear Regression Model
Orion is an open-source framework explicitly designed for the development of Provable Machine Learning models. It achieves this by providing a new ONNX runtime in Cairo to run STARK-provable ML programs.
The following tutorial is a short guide on how to utilise the Orion framework to implement our very first fully Verifiable Linear Regression Model in Cairo.
This will enable us to add an extra layer of transparency to our model, ensuring each inference can be verified as well as all the steps executed during the model’s construction phase.
Content overview:
Simple Linear Regression with Python: Our starting point is a basic implementation of a Simple Linear Regression model using the Ordinary Least Squares (OLS) method in Python.
Transitioning to Cairo: In the subsequent stage, we will create a new Scarb project and replicate our model in Cairo, a language for creating STARK-provable programs.
Implementing OLS functions using Orion: To speed up development, we will utilise the Orion framework to construct the OLS functions needed to build our Verifiable Linear Regression model.
A Regression model is a foundational technique used to determine the relationship between independent variables (predictors) and dependent variables (outcomes). This relationship is typically represented by a straight line and is often termed the “line of best fit”. By mapping how variations in one variable X may influence changes in another variable y, we can make highly accurate predictions on new, unseen data points. The mathematical representation of this linear relationship is given by the equation: y = beta * X + a, where beta is the gradient of the line and a is the y-intercept.
In the following notebook, we will create a synthetic dataset that will serve as the backbone throughout our tutorial.
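The notebook code to generate such a dataset might look like the following sketch (the sample size, noise level, and random seed here are assumptions, not taken from the original notebook):

```python
import numpy as np

# Hypothetical parameters: 150 samples, unit Gaussian noise, fixed seed.
rng = np.random.default_rng(seed=42)
X = rng.uniform(0, 10, size=150)
noise = rng.normal(0, 1, size=150)
y = 2 * X + 5 + noise  # the underlying relationship used throughout the tutorial
```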
Upon inspecting the plot, it is readily apparent that there exists a positive correlation between the X and y values, consistent with our underlying equation. Our goal in this tutorial is to quantify this relationship using a regression model, using only the data points provided. By utilising the Ordinary Least Squares (OLS) method, we aim to derive a linear equation that closely approximates the original equation from which the dataset was generated: y = 2 * X + 5 + noise
The OLS method can help us decipher the linear relationship between the X and y variables by calculating the gradient (beta) and corresponding y-intercept (a) to find the optimal "line of best fit": beta = Σ(Xᵢ − X̄)(yᵢ − ȳ) / Σ(Xᵢ − X̄)²
The formula’s numerator quantifies the covariance of X and y, revealing their joint variability. Think of it as an expression to measure how both variables move together. Conversely, the denominator calculates the variance of X, which gauges the distribution of X values around its mean.
By dividing the covariance by the variance of X, we essentially measure the average change in y for a unit change in X. This allows us to capture the strength and direction of the linear relationship between the X and y variables. A positive beta value suggests that as X increases, y also tends to increase, and vice versa for negative values. The magnitude of the beta value indicates how sensitive y is to changes in X.
Implementing the formula in Python, we get a gradient value of 2.03. This is very close to the true coefficient of 2 in the equation y = 2 * X + 5 + noise used to generate our synthetic dataset, which is a good sign.
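A minimal NumPy sketch of this gradient calculation (illustrative only; the function name is ours, not the notebook's):

```python
import numpy as np

def ols_beta(X: np.ndarray, y: np.ndarray) -> float:
    """OLS gradient: covariance of X and y divided by the variance of X."""
    x_dev = X - X.mean()
    y_dev = y - y.mean()
    return float(np.sum(x_dev * y_dev) / np.sum(x_dev ** 2))
```

On data generated from y = 2 * X + 5 plus noise, this returns a value close to 2.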
Having determined the beta value, our next step is to calculate the y-intercept. This can be achieved by substituting the known beta, y mean, and X mean values into our line equation. The rationale behind using the y mean and X mean is grounded in the principle that the "line of best fit" must pass through these central points.
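In Python this substitution is a one-liner (a sketch; the function name is our own):

```python
import numpy as np

def ols_intercept(beta: float, X: np.ndarray, y: np.ndarray) -> float:
    # The line of best fit passes through (mean(X), mean(y)),
    # so substituting into y = beta * X + a gives a = mean(y) - beta * mean(X).
    return float(y.mean() - beta * X.mean())
```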
Looking at the above plot, we can see that we have successfully implemented our Linear Regression model and captured the “line of best fit” using the OLS method.
To assess the efficacy of our regression model, we compute the mse and r_squared_score values, which yield an R-squared score of 0.83, indicating robust predictive performance for the model.
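Both metrics are straightforward to express in NumPy (a sketch; the function names are ours):

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R-squared: 1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)
```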
Now that we have a good understanding of the OLS functions used, we will replicate the full linear regression model in Cairo to turn it into a fully verifiable model. Since we will be rebuilding the model from scratch, this will serve as a good opportunity to get familiar with Orion’s built-in functions and operators, making the transition to Cairo seamless.
Scarb is the Cairo package manager, specifically created to streamline our Cairo and Starknet development process. Scarb manages project dependencies, the compilation process (both pure Cairo and Starknet contracts), and the downloading and building of external libraries, accelerating our development with Orion. You can find all information about Scarb and Cairo installation here.
To create a new Scarb project, open your terminal and run the scarb new command with your chosen project name (for example, scarb new verifiable_lr):
A new project folder will be created for you. Make sure to replace the content of the Scarb.toml file with the following code:
Now let’s generate the files required to begin our transition to Cairo. In our Jupyter Notebook, we will execute the code required to convert our synthetic dataset into fixed-point values and represent our X and y values as fixed-point tensors in Orion.
The X_values and y_values tensors will now be generated under the src/generated directory.
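The conversion itself is a scaling by 2^16. A Python sketch of how a float becomes an FP16x16 (magnitude, sign) pair (the helper names are ours, not Orion's):

```python
ONE = 2 ** 16  # FP16x16 scale factor: 16 fractional bits

def to_fp16x16(x: float) -> tuple[int, bool]:
    """Encode a float as an FP16x16 magnitude plus a sign flag."""
    return round(abs(x) * ONE), x < 0

def from_fp16x16(mag: int, sign: bool) -> float:
    """Decode an FP16x16 (magnitude, sign) pair back into a float."""
    return (-mag if sign else mag) / ONE
```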
In src/lib.cairo, replace the content with the following code:
This will tell our compiler to include the separate modules listed above during the compilation of our code. We will be covering each module in detail in the following section, but let’s first review the generated folder files.
Since Cairo does not come with built-in signed integers, we have to define them explicitly for our X and y values. Luckily, this is already implemented in Orion as a struct, as shown below:
For this tutorial, we will use FP16x16 numbers, where the magnitude represents the absolute value and the boolean indicates whether the number is negative or positive. To replicate the OLS functions, we will conduct our operations using FP16x16 tensors, which are also represented as structs in Orion.
A Tensor in Orion takes a shape and a span array of the data. We work with a Tensor<FP16x16>. In 16x16 fixed-point format, 16 bits are dedicated to the integer part of the number and 16 bits to the fractional part. This format allows us to work with a wide range of values at a high degree of precision when conducting the OLS tensor operations.
At this stage, we will reproduce the OLS functions, now that we have generated our X and y fixed-point tensors. We will begin by creating a separate file named lin_reg_func.cairo to host all of our linear regression functions.
The above function takes an FP16x16 tensor and computes its mean. We break the computation down by first calculating the cumulative sum of the tensor values using the cumsum built-in Orion operator. We then divide the result by the length of the tensor and return the output as a fixed-point number.
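The same logic expressed in NumPy for reference (a sketch of the Cairo steps, not the actual Orion code):

```python
import numpy as np

def tensor_mean(tensor: np.ndarray) -> float:
    # Mirror the Cairo implementation: take the cumulative sum,
    # read off its last element (the total), and divide by the length.
    cumulative = np.cumsum(tensor)
    return float(cumulative[-1] / len(tensor))
```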
The following deviation_from_mean function calculates the deviation from the mean for each element of a given tensor. We initially calculate the tensor's mean value and store it in the variable mean_value. We then loop over each element in the tensor, calculate its deviation from the mean, and append the result to the deviation_values array. Finally, we create a new tensor named distance_from_mean_tensor by passing in the deviation_values array and the tensor shape.
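For reference, the equivalent NumPy sketch (broadcasting replaces the explicit Cairo loop):

```python
import numpy as np

def deviation_from_mean(tensor: np.ndarray) -> np.ndarray:
    """Return a tensor of each element's deviation from the mean."""
    mean_value = tensor.mean()
    # In Cairo this is a loop appending (element - mean) to an array;
    # NumPy broadcasting performs the same element-wise subtraction.
    return tensor - mean_value
```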
The OLS gradient (beta) formula: beta = Σ(Xᵢ − X̄)(yᵢ − ȳ) / Σ(Xᵢ − X̄)²
We can now compute the beta value for our linear regression utilising the previous deviation_from_mean function. We first calculate the deviations of the x values and y values from their means and store them as separate tensors. To calculate the covariance, we use the built-in Orion matmul operator to multiply the x_deviation and y_deviation tensors. Similarly, we compute the x variance by multiplying the x_deviation tensor by itself. Finally, we divide x_y_covariance by x_variance to get an approximate gradient value for our regression model.
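In NumPy terms, the same sequence of steps might read as follows (matmul on two 1-D tensors reduces to a dot product; this is a sketch, not Orion's code):

```python
import numpy as np

def compute_beta(X: np.ndarray, y: np.ndarray) -> float:
    x_deviation = X - X.mean()
    y_deviation = y - y.mean()
    # matmul on 1-D tensors is a dot product: sum of element-wise products.
    x_y_covariance = x_deviation @ y_deviation
    x_variance = x_deviation @ x_deviation
    return float(x_y_covariance / x_variance)
```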
Calculating the y-intercept is fairly simple: we substitute the calculated beta, y_mean, and x_mean values and rearrange for the intercept value, as previously shown in the Python implementation section.
Now that we have implemented all the necessary functions for the OLS method, we can finally test our linear regression model. We begin by creating a new test file named test.cairo and importing all the necessary Orion libraries, including the X_values and y_values found in the generated folder. We also import all the OLS functions from the lin_reg_func.cairo file, as we will be relying upon them to construct the regression model.
Our model will be tested in the linear_regression_test() function, which follows these steps:
Data Retrieval: The function initiates by fetching the independent X values and dependent y values sourced from the generated folder.
Beta Calculation: With the data on board, it proceeds to determine the gradient (beta_value) of the linear regression line using the compute_beta function.
Intercept Calculation: The y-intercept (intercept_value) of the regression line is calculated utilising the previously calculated beta value.
Prediction Phase: At this stage, we have all the components needed for our linear regression model. We make new predictions using our X values to see how well it generalises.
Evaluation: The main part of the function is dedicated to model evaluation. It calculates the Mean Squared Error (mse), a measure of the average squared difference between the observed outcomes and the outcomes predicted by the model (between y_pred and y_values). It also calculates the R-squared value (r_score), which measures the goodness of fit of the model on a scale of 0 to 1.
The test function will also perform basic checks, making sure our beta value is positive, the R-squared value is between 0 and 1, and our model accuracy is above 50%.
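Pulled together in plain Python, the test flow and its sanity checks might look like this (a sketch mirroring the Cairo test, with our own names):

```python
import numpy as np

def linear_regression_test(X: np.ndarray, y: np.ndarray) -> float:
    # Beta calculation: OLS gradient from the deviations.
    x_dev, y_dev = X - X.mean(), y - y.mean()
    beta_value = (x_dev @ y_dev) / (x_dev @ x_dev)
    # Intercept calculation from the means.
    intercept_value = y.mean() - beta_value * X.mean()
    # Prediction phase.
    y_pred = beta_value * X + intercept_value
    # Evaluation: mean squared error and R-squared.
    mse = np.mean((y - y_pred) ** 2)
    r_score = 1 - np.sum((y - y_pred) ** 2) / np.sum(y_dev ** 2)
    # Basic checks mirroring the Cairo test.
    assert beta_value > 0
    assert 0 <= r_score <= 1
    assert r_score > 0.5  # model accuracy above 50%
    return float(r_score)
```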
Finally, we can execute the test file by running scarb cairo-test -f linear_regression_test
And as we can see, our test cases have passed!
If you've made it this far, well done! You are now capable of building verifiable ML models, making them more reliable and transparent than ever before.
We invite the community to join us in forging a future where AI is a transparent and reliable resource for all.