Linear Ridge Regression from Scratch Using C# with Stochastic Gradient Descent

A simple linear regression problem predicts a numeric y value from a single numeric x value. For example, you might want to predict a person’s income from their college GPA. A multiple linear regression problem predicts a numeric y value from two or more numeric x values. For example, you might want to predict y = income from x1 = GPA, x2 = age, x3 = height. The term linear regression by itself usually refers to multiple regression.

The linear regression prediction equation is y = b0 + (b1*x1) + (b2*x2) + . . . The bi values are usually called the coefficients or weights. The b0 term is called the constant, or intercept, or bias.
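For example, here is a minimal worked sketch of the prediction equation in C#. The coefficient and input values are made up for illustration; they are not values from the demo:

double b0 = 0.30;                    // constant / bias
double[] b = { 0.20, -0.50, 0.10 };  // coefficients
double[] x = { 0.5, -0.6, 0.7 };     // predictor inputs

double y = b0;
for (int i = 0; i < b.Length; ++i)
  y += b[i] * x[i];
// y = 0.30 + (0.20 * 0.5) + (-0.50 * -0.6) + (0.10 * 0.7)
//   = 0.30 + 0.10 + 0.30 + 0.07 = 0.77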

If you have a set of training data, it is possible to compute the values of the coefficients and bias exactly (so that the sum of the squared errors is minimized) using matrix algebra. This is called a closed-form solution. It is also possible to estimate the values of the coefficients using an iterated gradient descent approach. In pseudo-code:

init coeffs, constant to random values
loop maxEpochs times
  loop each data item (random order)
    get x input
    get yActual
    compute yPred using coeffs, constant
    delta = yPred - yActual
    loop each coeff [j]
      coeffs[j] -= learnRate * delta * x[j]
    end-loop
    constant -= learnRate * delta * 1
  end-loop
  optional: decay coeffs and constant here
end-loop

It’s important to visit the training data in random order so that the values of the coefficients and constant don’t simply oscillate back and forth. In my demo, I scramble the order of the data indices at the start of each epoch so that each item is visited exactly once.

I implemented a demo using C#. I generated some synthetic data with three predictors (between -1 and +1) and a target between 0 and 1. I made 40 training items and 10 test items. The data looks like:

 x0       x1       x2      y
---------------------------------
 0.7562, -0.9452,  0.3409, 0.2839
-0.1654,  0.1174, -0.7192, 0.4079
 0.9365, -0.3732,  0.3846, 0.2787
. . .

The stochastic gradient descent approach for finding the values of the linear regression coefficients and constant is highly sensitive to the choice of the learning rate, the number of training epochs, and the ridge regression alpha parameter. Finding good values requires trial and error; a simple grid search, like the sketch below, is one common approach.
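Here is a minimal grid-search sketch, using the LinearRidgeRegressSGD class and its RootMSE() method defined in the demo code below. The candidate hyperparameter values are arbitrary:

double[] lrnRates = new double[] { 0.01, 0.005, 0.001 };
int[] epochList = new int[] { 500, 1000, 2000 };
double[] alphas = new double[] { 0.0, 0.0001, 0.001 };

double bestRmse = double.MaxValue;
foreach (double lr in lrnRates)
  foreach (int me in epochList)
    foreach (double a in alphas)
    {
      LinearRidgeRegressSGD model =
        new LinearRidgeRegressSGD(a, 1);  // alpha, seed
      model.Train(trainX, trainY, me, lr);
      double rmse = model.RootMSE(trainX, trainY);
      // ideally, evaluate on a held-out validation set instead
      if (rmse < bestRmse)
      {
        bestRmse = rmse;
        Console.WriteLine("lrnRate = " + lr + "  epochs = " + me +
          "  alpha = " + a + "  RMSE = " + rmse.ToString("F4"));
      }
    }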

I cheated a bit with my implementation when doing the L2 regularization. When you use linear ridge regression with a closed-form solution for the coefficients, you use matrix algebra. With that approach, you add the alpha/noise value to the diagonal of one of the intermediate matrices before doing a matrix inverse (actually, the details are a bit more complicated). Although it's not obvious, this process adds L2 regularization, which in theory modifies the underlying gradient, which in turn reduces the delta that is added during the coefficient update.
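Concretely, the standard closed-form ridge solution, in math notation (X is the design matrix with a leading column of 1s, y is the column vector of target values, I is the identity matrix):

b = (X^T X + \alpha I)^{-1} X^T y

Note that the demo's commented-out closed-form Train() method adds alpha to every diagonal entry, including the one associated with the bias; textbook treatments usually exclude the bias term from regularization, which is one of the complications mentioned above.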

So when using stochastic gradient descent training for linear ridge regression, it is possible, but tricky, to implement L2 regularization by modifying the coefficient update statements. Instead, I used a simpler technique called weight decay. It has essentially the same effect as L2 regularization but is easier to implement and tune. At the end of each training epoch, all coefficients are decayed (reduced in magnitude). For example, if alpha/noise = 0.005, after each epoch, all coefficients are multiplied by (1 - alpha) = 0.995, which reduces their magnitudes.
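Here is a sketch of the two alternatives, using the variable names from the demo's Train() method. The first form folds the L2 gradient term directly into the per-item update (this is not what the demo does); the second form is the per-epoch weight decay that the demo actually uses:

// alternative 1: explicit L2 term in the per-item update (not used in demo)
for (int j = 0; j < dim; ++j)
  this.coeffs[j + 1] -= lrnRate *
    ((yPred - yActual) * x[j] + this.alpha * this.coeffs[j + 1]);

// alternative 2: weight decay once at the end of each epoch (used in demo)
for (int j = 0; j < this.coeffs.Length; ++j)
  this.coeffs[j] = (1 - this.alpha) * this.coeffs[j];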

Other than when you have a huge amount of training data and matrix inversion isn't feasible, there are very few scenarios where training a linear ridge regression model using SGD is a better choice than using a closed-form matrix approach. One small advantage of SGD is that you can monitor how the loss (root mean squared error in the demo) changes during training, which gives you some insight into the nature of the data.

It's sometimes said that using SGD is a good option when the training dataset is large. This is only partially true: an indirect technique called QR decomposition uses less memory than direct matrix inversion and can often be used instead.

Good fun.



Linear regression is simple because it's based on data that lies on a straight line or hyperplane. But I still get a lot of satisfaction when I create a simple machine learning linear model.

Hobbyist Patrick Acton creates wonderful models made from only matchsticks. One of his models is of Hogwarts School, made from over 600,000 matchsticks. I have to believe that Acton must get a lot of satisfaction from creating his models from strictly linear components, because of the huge effort involved.


Demo code:

using System;
using System.IO;

namespace LinearRidgeRegressStochastic
{
  internal class LinearRidgeRegressStochasticProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nLinear ridge regression SGD ");

      // 1. load data from file to memory
      Console.WriteLine("\nLoading train and test data ");
      string trainFile =
        "..\\..\\..\\Data\\synthetic_train.txt";
      double[][] trainX =
        Utils.MatLoad(trainFile,
        new int[] { 0, 1, 2 }, ',', "#");
      double[] trainY =
        Utils.MatToVec(Utils.MatLoad(trainFile,
        new int[] { 3 }, ',', "#"));

      string testFile =
        "..\\..\\..\\Data\\synthetic_test.txt";
      double[][] testX =
        Utils.MatLoad(testFile,
        new int[] { 0, 1, 2 }, ',', "#");
      double[] testY =
        Utils.MatToVec(Utils.MatLoad(testFile,
        new int[] { 3 }, ',', "#"));

      Console.WriteLine("\nFirst three train X inputs: ");
      for (int i = 0; i "lt" 3; ++i)
        Utils.VecShow(trainX[i], 4, 9, true);

      Console.WriteLine("\nFirst three train y targets: ");
      for (int i = 0; i "lt" 3; ++i)
        Console.WriteLine(trainY[i].ToString("F4"));

      // 2. create LR model
      double alpha = 0.0001;
      Console.WriteLine("\nCreating and training LRR model");
      LinearRidgeRegressSGD lrrModel = 
        new LinearRidgeRegressSGD(alpha, 1);

      // 3. train model
      double lrnRate = 0.001;
      int maxEpochs = 1000;
      lrrModel.Train(trainX, trainY, maxEpochs, lrnRate);
      Console.WriteLine("Model constant," +
        " coefficients = ");
      Utils.VecShow(lrrModel.coeffs, 4, 9, true);

      // 4. evaluate model
      Console.WriteLine("\nComputing model accuracy ");
      double accTrain =
        Accuracy(lrrModel, trainX, trainY, 0.10);
      Console.WriteLine("Accuracy on train (0.10) = "
        + accTrain.ToString("F4"));

      double accTest =
        Accuracy(lrrModel, testX, testY, 0.10);
      Console.WriteLine("Accuracy on test (0.10) = " +
        accTest.ToString("F4"));

      // 5. use model
      Console.WriteLine("\nPredicting x = 0.5, -0.6, 0.7 ");
      double[] x = new double[] { 0.5, -0.6, 0.7 };
      double predY = lrrModel.Predict(x);
      Console.WriteLine("\nPredicted y = " +
        predY.ToString("F4"));

      // 6. TODO: save model to file

      Console.WriteLine("\nEnd linear ridge regression ");
      Console.ReadLine();
    } // Main

    static double Accuracy(LinearRidgeRegressSGD model,
      double[][] dataX, double[] dataY, double pctClose)
    {
      int numCorrect = 0; int numWrong = 0;
      int n = dataX.Length;
      for (int i = 0; i "lt" n; ++i)
      {
        double[] x = dataX[i];
        double yActual = dataY[i];
        double yPred = model.Predict(x);
        if (Math.Abs(yActual - yPred) <
          Math.Abs(pctClose * yActual))
          ++numCorrect;
        else
          ++numWrong;
      }
      return (numCorrect * 1.0) / n;
    }

  } // Program

  public class LinearRidgeRegressSGD
  {
    public double[]? coeffs;
    public double alpha;
    public Random rnd;

    public LinearRidgeRegressSGD(double alpha, int seed)
    {
      this.coeffs = null;  // bias at [0]
      this.alpha = alpha;  // ridge regression noise
      this.rnd = new Random(seed);
    }

    //public void Train(double[][] trainX, double[] trainY)
    //{
    //  double[][] DX =
    //    Utils.MatMakeDesign(trainX);  // add 1s column

    //  // coeffs = inv(DXt * DX) * DXt * Y 
    //  double[][] a = Utils.MatTranspose(DX);
    //  double[][] b = Utils.MatProduct(a, DX);

    //  for (int i = 0; i < b.Length; ++i)  // ridge regression
    //    b[i][i] += this.alpha;

    //  double[][] c = Utils.MatInverse(b);
    //  double[][] d = Utils.MatProduct(c, a);
    //  double[][] Y = Utils.VecToMat(trainY, DX.Length, 1);
    //  double[][] e = Utils.MatProduct(d, Y);
    //  this.coeffs = Utils.MatToVec(e);
    //}

    public void Train(double[][] trainX, double[] trainY,
      int maxEpochs, double lrnRate)
    {
      // init coeffs, bias
      int n = trainX.Length;  // number training items
      int dim = trainX[0].Length;  // number predictors
      int interval = maxEpochs / 10;  // progress display frequency
      if (interval == 0) interval = 1;  // guard when maxEpochs < 10

      this.coeffs = new double[dim + 1];  // bias at [0]

      double lo = -1.0; double hi = +1.0;
      for (int j = 0; j < this.coeffs.Length; ++j)  // all cells, incl. bias
        this.coeffs[j] = (hi - lo) * this.rnd.NextDouble() + lo;

      int[] indices = new int[trainX.Length];  // for random order
      for (int i = 0; i "lt" trainX.Length; ++i)
        indices[i] = i;

      for (int epoch = 0; epoch < maxEpochs; ++epoch)
      {
        Shuffle(indices, this.rnd);
        for (int i = 0; i "lt" n; ++i)  // each data item
        {
          int idx = indices[i];
          double[] x = trainX[idx];
          double yActual = trainY[idx];
          double yPred = this.Predict(x);

          // update coeffs
          for (int j = 0; j "lt" dim; ++j)  // each coefficient 
            this.coeffs[j+1] -=
              lrnRate * (yPred - yActual) * x[j];

          // update constant/bias
          this.coeffs[0] -=
            lrnRate * (yPred - yActual) * 1;  // dummy 1
        }

        // decay coeffs and bias after each epoch
        for (int j = 0; j "lt" this.coeffs.Length; ++j)
          this.coeffs[j] = (1 - this.alpha) * this.coeffs[j];

        if (epoch % interval == 0)
        {
          double rmse = this.RootMSE(trainX, trainY);
          Console.WriteLine("epoch = " + epoch.ToString().
            PadLeft(4) + "   RMSE = " +
            rmse.ToString("F4").PadLeft(8));
          //Console.ReadLine();
        }
        //Console.ReadLine();
      }
      Console.WriteLine("Done ");
    } // Train

    public double RootMSE(double[][] dataX, double[] dataY)
    {
      double sum = 0.0;
      int n = dataX.Length;
      for (int i = 0; i "lt" n; ++i)
      {
        double[] x = dataX[i];
        double yActual = dataY[i];
        double yPred = this.Predict(x);
        sum += (yActual - yPred) * (yActual - yPred);
      }
      return Math.Sqrt(sum / n);
    }

    public static void Shuffle(int[] arr, Random rnd)
    {
      // Fisher-Yates algorithm
      int n = arr.Length;
      for (int i = 0; i "lt" n; ++i)
      {
        int ri = rnd.Next(i, n);  // random index
        int tmp = arr[ri];
        arr[ri] = arr[i];
        arr[i] = tmp;
      }
    }

    public double Predict(double[] x)
    {
      // constant at coeffs[0]
      double sum = 0.0;
      int dim = x.Length;  // number predictors
      for (int i = 0; i "lt" dim; ++i)
        sum += x[i] * this.coeffs[i + 1];

      sum += this.coeffs[0];  // add the bias
      return sum;
    }

  } // class LinearRidgeRegressSGD

  public class Utils
  {
    public static double[][] MatCreate(int rows, int cols)
    {
      double[][] result = new double[rows][];
      for (int i = 0; i "lt" rows; ++i)
        result[i] = new double[cols];
      return result;
    }

    public static double[] MatToVec(double[][] m)
    {
      int rows = m.Length;
      int cols = m[0].Length;
      double[] result = new double[rows * cols];
      int k = 0;
      for (int i = 0; i "lt" rows; ++i)
        for (int j = 0; j "lt" cols; ++j)
        {
          result[k++] = m[i][j];
        }
      return result;
    }

    private static int NumNonCommentLines(string fn,
      string comment)
    {
      int ct = 0;
      string line = "";
      FileStream ifs = new FileStream(fn,
        FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      while ((line = sr.ReadLine()) != null)
        if (line.StartsWith(comment) == false)
          ++ct;
      sr.Close(); ifs.Close();
      return ct;
    }

    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment)
    {
      // count number of non-comment lines
      int nRows = NumNonCommentLines(fn, comment);
      int nCols = usecols.Length;
      double[][] result = MatCreate(nRows, nCols);
      string line = "";
      string[] tokens = null;
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);

      int i = 0;
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true)
          continue;
        tokens = line.Split(sep);
        for (int j = 0; j "lt" nCols; ++j)
        {
          int k = usecols[j];  // into tokens
          result[i][j] = double.Parse(tokens[k]);
        }
        ++i;
      }
      sr.Close(); ifs.Close();
      return result;
    }

    public static void VecShow(double[] vec,
      int dec, int wid, bool newLine)
    {
      for (int i = 0; i "lt" vec.Length; ++i)
      {
        double x = vec[i];
        if (Math.Abs(x) "lt" 1.0e-8) x = 0.0;  // hack
        Console.Write(x.ToString("F" +
          dec).PadLeft(wid));
      }
      if (newLine == true)
        Console.WriteLine("");
    }

  } // class Utils

} // ns

Train data:

-0.1660, 0.4406, -0.9998, 0.4487
-0.8153, -0.6275, -0.3089, 0.2757
-0.2065, 0.0776, -0.1616, 0.3096
0.7562, -0.9452, 0.3409, 0.2839
-0.1654, 0.1174, -0.7192, 0.4079
0.9365, -0.3732, 0.3846, 0.2787
0.7528, 0.7892, -0.8299, 0.4489
0.0663, 0.3838, -0.3690, 0.3740
0.3730, 0.6693, -0.9634, 0.4525
0.5786, -0.7935, -0.1042, 0.3348
0.8172, -0.4128, -0.4244, 0.3928
-0.4689, -0.0169, -0.8933, 0.4036
0.1482, -0.7065, 0.1786, 0.2809
-0.1719, 0.3888, -0.1716, 0.3385
-0.9001, 0.0718, 0.3276, 0.2309
0.1731, 0.8068, -0.7251, 0.4236
-0.7214, 0.6148, -0.2046, 0.3311
0.4520, 0.7666, 0.2473, 0.3204
0.5019, -0.3022, -0.4601, 0.3885
0.9297, 0.3269, 0.2434, 0.3353
-0.7705, 0.8990, -0.1002, 0.2941
-0.5259, 0.8068, 0.1474, 0.3022
-0.9943, 0.2343, -0.3467, 0.3144
-0.2855, 0.8171, 0.2467, 0.2934
-0.9684, 0.8589, 0.3818, 0.2229
0.5109, 0.5078, 0.8460, 0.2719
0.4230, -0.7515, -0.9602, 0.4135
0.0777, 0.1056, 0.6841, 0.2428
-0.7517, -0.4416, 0.1715, 0.2525
-0.9627, 0.6013, -0.5341, 0.3701
0.6142, -0.2243, 0.7271, 0.2510
-0.7271, -0.8802, -0.7573, 0.3173
-0.9109, -0.7850, -0.5486, 0.2943
-0.9749, -0.8561, 0.9346, 0.2487
0.1362, -0.5934, -0.4953, 0.3394
0.1627, 0.9400, 0.6937, 0.3020
-0.5203, -0.0125, 0.2399, 0.2501
-0.9628, -0.8600, -0.0273, 0.2715
0.2127, 0.1377, -0.3653, 0.3758
-0.2397, 0.1019, 0.4907, 0.2496

Test data:

0.3385, -0.4702, -0.8673, 0.4303
-0.5797, 0.5055, -0.8669, 0.4163
-0.4794, 0.6095, -0.6131, 0.4025
0.8496, -0.4734, -0.8681, 0.4909
0.4701, 0.5444, 0.8156, 0.2796
0.8980, 0.9004, 0.1133, 0.3295
0.8312, 0.2831, -0.2200, 0.4104
0.0991, 0.8524, 0.8375, 0.2744
-0.2102, 0.9265, -0.6521, 0.4078
0.8959, 0.6542, -0.9700, 0.4744