Lasso regression is just linear regression with L1 regularization. Let me explain.

A linear regression problem is one where the goal is to predict a single numeric value from one or more numeric predictor values. For example, you might want to predict the murder rate in a city based on x0 = city population, x1 = percent poor families in city, and x2 = percent unemployed in city. A regression equation might look like:

y = (0.24 * x0) + (0.50 * x1) + (0.28 * x2) - 0.14

Here y is the predicted murder rate, the (0.24, 0.50, 0.28) are the coefficients/weights, and the -0.14 is the constant/bias.
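For example, evaluating the equation above on hypothetical inputs (the x values here are made up for illustration):

```python
# evaluate the example regression equation on made-up normalized inputs
x0, x1, x2 = 0.10, 0.20, 0.70  # population, poverty, unemployment
y = (0.24 * x0) + (0.50 * x1) + (0.28 * x2) - 0.14
print(round(y, 4))  # 0.18 -- the predicted (normalized) murder rate
```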

Lasso (“least absolute shrinkage and selection operator”) regression is just a slight modification of standard linear regression. Lasso applies L1 regularization to the regression coefficients. This means the coefficients are nudged towards zero during training to avoid huge coefficient values. Large coefficients often lead to a prediction model that is overfitted — the model predicts well on the data used to create the model but predicts poorly on new, previously unseen data (poor generalization).

Just for fun, I decided to implement lasso regression from scratch. I found a nice, simple dataset at people.sc.fsu.edu/~jburkardt/datasets/regression/x08.txt. There are four columns: city population, city poverty, city unemployment, murder rate. I normalized the data so that all values are between 0.0 and 1.0 by dividing population by 10,000,000, poverty by 100, unemployment by 10, and murder rate by 100.
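As a concrete check, normalizing one raw data row with those divisors looks like this (the raw values below are recovered by reversing the first normalized row of the demo data):

```python
# normalize one raw x08.txt row using the per-column divisors above
raw = [587000, 16.5, 6.2, 11.2]        # pop., poverty %, unemp. %, murder rate
divisors = [10_000_000, 100, 10, 100]  # per-column divisors
norm = [round(r / d, 4) for r, d in zip(raw, divisors)]
print(norm)  # [0.0587, 0.165, 0.62, 0.112] -- matches the first demo row
```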

Lasso regression is almost too simple. L1 regularization adds a penalty equal to a fraction (lambda) of the sum of the absolute values of the coefficients. The gradient of the absolute value function is just +1 or -1, and therefore when adjusting coefficients, if a coefficient is positive you just subtract lambda, or if the coefficient is negative you just add lambda. (The math is a bit tricky but the result is extraordinarily simple).

The key code for lasso regression training looks like:

  . . .
  for j in range(3):   # each coeff
    if a[j] > 0.0:
      a[j] -= lamda    # nudge towards 0
    elif a[j] < 0.0:
      a[j] += lamda    # nudge towards 0
  . . .

In some ways, L1 / lasso doesn’t seem to make sense because it doesn’t take into account the magnitude of each weight. If you use L2 regularization, called “ridge regression”, you nudge coefficients towards zero in a way that takes the magnitude of each coefficient into account.
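The difference between the two update rules can be sketched side by side. Here `l1_nudge` and `l2_nudge` are hypothetical helper names, and the lr/lamda values are arbitrary:

```python
def l1_nudge(w, lamda):
    # lasso: fixed-size step towards zero, same for every coefficient
    if w > 0.0:
        return w - lamda
    if w < 0.0:
        return w + lamda
    return w

def l2_nudge(w, lr, lamda):
    # ridge: step is proportional to the coefficient's magnitude
    return w - lr * lamda * w

# a large and a small coefficient get the same-size L1 nudge,
# but L2 nudges the larger coefficient more
print(0.5 - l1_nudge(0.5, 0.001), 0.05 - l1_nudge(0.05, 0.001))
print(0.5 - l2_nudge(0.5, 0.06, 0.001), 0.05 - l2_nudge(0.05, 0.06, 0.001))
```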

Weight decay is another closely related technique. See https://jamesmccaffrey.wordpress.com/2019/05/09/the-difference-between-neural-network-l2-regularization-and-weight-decay/.

I used stochastic gradient descent for my lasso regression because I’ve used L1 regularization for neural networks and understand the mechanism. With regular regression, you can find the values of the coefficients and bias using a closed-form solution technique. This avoids having to deal with finding good values for a learning rate and number of training iterations. I don’t know if finding coefficients for lasso regression has a closed-form solution — I doubt it but I’ll need to do some research to be sure.
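For reference, the closed-form solution for ordinary (unregularized) linear regression is the normal equations, w = (X^T X)^-1 X^T t. The absolute-value penalty in lasso is not differentiable at zero, which is why lasso is typically solved iteratively (for example, by coordinate descent) rather than in one closed-form step. A minimal sketch of the ordinary closed-form approach, using made-up data with the same shape as the demo data (least-squares is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(20, 3)  # 20 rows, 3 predictors (made-up data)
t = rng.rand(20)     # 20 target values (made-up data)

# append a column of 1s so the bias is learned as a 4th coefficient
Xb = np.hstack([X, np.ones((20, 1))])
w, *_ = np.linalg.lstsq(Xb, t, rcond=None)  # least-squares solution
print("coefficients:", w[:3], " bias:", w[3])
```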

Lasso regression is a classical statistics technique. Many of these old techniques are being forgotten in a sense because systems based on neural networks are so much more powerful. But classical statistics techniques can still be useful, especially in situations with small datasets.

*Three more or less random images from an Internet search for “lasso”. Left: I don’t like clowns. And that’s also true for clowns with lassos. Center: Asian woman looking suspiciously through lasso loop. Unexpected. Right: I have two dogs. I don’t think either of them would like a cowboy monkey riding on them, but this dog seems to be having fun.*

Complete demo code:

# lasso_regression.py
# data: https://people.sc.fsu.edu/~jburkardt/
#   datasets/regression/x08.txt

import numpy as np

def compute(x, a, b):
  # x inputs, a coefficients, b constant
  result = 0.0
  for i in range(3):
    result += x[i] * a[i]
  result += b
  return result

def error(data, a, b):
  n = len(data)
  sum = 0.0
  for i in range(n):
    y = compute(data[i], a, b)  # predicted
    t = data[i][3]              # target at [3]
    sum += (y - t) * (y - t)
  return sum / n  # mean squared error

def accuracy(data, a, b, pct):
  n = len(data)
  n_correct = 0; n_wrong = 0
  for i in range(n):
    y = compute(data[i], a, b)  # predicted
    t = data[i][3]              # target at [3]
    if np.abs(y - t) < np.abs(pct * t):
      n_correct += 1
    else:
      n_wrong += 1
  return n_correct / (n_correct + n_wrong)

def train(data, epochs, lr, lamda):
  rnd = np.random.RandomState(0)
  n = len(data)    # 20
  a = rnd.rand(3)  # predictors: pop., pov., unemp.
  b = 0.0
  indices = np.arange(n)  # [0, 1, 2, .. 19]
  for ep in range(epochs):
    rnd.shuffle(indices)
    for ii in range(n):
      i = indices[ii]
      x = data[i]                 # 3 inputs
      y = compute(data[i], a, b)  # predicted
      t = data[i][3]              # target
      for j in range(3):  # each coeff
        a[j] += lr * x[j] * (t - y)
      b += lr * (t - y)

      # apply Lasso to move coefficients to 0
      for j in range(3):  # each coeff
        if a[j] > 0.0:
          a[j] -= lamda
        elif a[j] < 0.0:
          a[j] += lamda
      # apply Lasso to move bias to 0
      if b > 0.0:
        b -= lamda
      elif b < 0.0:
        b += lamda

    err = error(data, a, b)
    print("epoch = %4d  |  error = %0.4f " % (ep, err))
  return (a, b)

def main():
  print("\nBegin Lasso regression demo ")
  print("Predict city murder rate ")

  # population, low_income, unemployed, murder_rate
  data = np.array(
    [[0.0587, 0.165, 0.62, 0.112],
     [0.0643, 0.205, 0.64, 0.134],
     [0.0635, 0.263, 0.93, 0.407],
     [0.0692, 0.165, 0.53, 0.053],
     [0.1248, 0.192, 0.73, 0.248],
     [0.0643, 0.165, 0.59, 0.127],
     [0.1964, 0.202, 0.64, 0.209],
     [0.1531, 0.213, 0.76, 0.357],
     [0.0713, 0.172, 0.49, 0.087],
     [0.0749, 0.143, 0.64, 0.096],
     [0.7895, 0.181, 0.60, 0.145],
     [0.0762, 0.231, 0.74, 0.269],
     [0.2793, 0.191, 0.58, 0.157],
     [0.0741, 0.247, 0.86, 0.362],
     [0.0625, 0.186, 0.65, 0.181],
     [0.0854, 0.249, 0.83, 0.289],
     [0.0716, 0.179, 0.67, 0.149],
     [0.0921, 0.224, 0.86, 0.258],
     [0.0595, 0.202, 0.84, 0.217],
     [0.3353, 0.169, 0.67, 0.257]])

  epochs = 8
  lr = 0.06
  lamda = 0.001  # lasso

  print("\nStarting training ")
  (a, b) = train(data, epochs, lr, lamda)
  print("Done ")

  print("\nCoefficients: ")
  print(a)
  print("Bias: ")
  print(b)

  acc = accuracy(data, a, b, 0.25)
  print("\nModel accuracy = %0.4f " % acc)

  print("\nEnd Lasso demo ")

if __name__ == "__main__":
  main()

James, I am curious what your thoughts are regarding TabPFN? It looks interesting.

https://www.automl.org/tabpfn-a-transformer-that-solves-small-tabular-classification-problems-in-a-second/

It’s an interesting idea, but I’m not convinced. It’s basically a huge transformer network that has been pretrained on gazillions of datasets. If you apply the system to a dataset that’s similar to any of the pretraining datasets you’ll probably get decent results. The emphasis is all on speed but in most scenarios you don’t need blazing speed. To summarize: interesting but not useful. JM