I was in a meeting recently and one of my colleagues briefly described some work he had done at a previous job. He had an enormous set of training data and wanted to train a logistic regression model.
Logistic regression is a binary classification technique and is one of the simplest forms of machine learning. Suppose you want to predict if a person is male (class 0) or female (class 1) based on age, income, and height. When you train a LR model you will get one weight for each predictor variable and one bias. You multiply each predictor value times its weight, add the bias, then apply the logistic sigmoid function. The result will be a pseudo-probability value between 0 and 1. If the p-value is less than 0.5 the prediction is class 0, and if the p-value is greater than 0.5 the prediction is class 1.
For example, suppose age = 0.29, income = 0.5400, height = 0.72 and the weights are w1 = 1.7, w2 = -1.4, w3 = -0.5, and the bias is 0.2 then:
z = (0.29 * 1.7) + (0.5400 * -1.4) + (0.72 * -0.5) + 0.2 = -0.4230 p = sigmoid(-0.4230) = 1.0 / (1.0 + exp(-z)) = 1.0 / (1.0 + exp(0.4230)) = 0.3958 = class 0 = male
There are many algorithms yo can use to find the weights and bias for logistic regression. Common techniques include: stochastic gradient descent, L-BFGS, Nelder-Mead, and iterated Newton-Raphson. The implementations of all these techniques generally assume that the training data is small enough to fit into memory (perhaps a million or so items or less).
If you have a huge set of training data that won’t fit into memory, then you have a problem. One approach is to write data-loading code that will stream training data into memory as needed. Another approach is to break the huge file down to several smaller files, train a logistic regression model on each smaller file, and then combine the separate prediction models in some way.
One way to combine separate models is to maintain separate models and use a voting scheme. Another approach is to maintain separate models and average the p-values from each model.
Yet another approach is to create a meta-model that accepts the p-values from the sub-models and then combines them, typically by using logisic regression again.
Anyway, and now I’m finally getting to the point, I think my colleague suggested that he trained separate models and then combined them into a single model by averaging the models’ weights and biases.
I was intrigued so the next day I searched the Internet looking for an research on combining logistic regression models by averaging weights and biases. I didn’t find any solid evidence (in my opinion anyway). I did find several opinions on sites like Stack Overflow, but I know from previous experience, that many machine learning opinions are completely wrong.
Left: Two logistic regression models trained on 100-item datsets and then a combined logistic regression model created by using the average weighrs and biases of the two small models. Right: A single logisitc regression model trained on all 200 data items seemed to work better.
So, I set out to do an experiement to try and gain some insight. Bottom line: the technique of combining logistic regression models by averaging weights and biases did not work well in my one experiemnt, but the results were not conclusive.
For my experiment, I used the PyTorch neural code library. I started with 200 synthetic patient data items for training and 40 for testing. Each item had a patient sex (male = 0, female = 1), age, county (one of three), monocyte count, and hospitalization history (minor, moderate, major). The goal is to predict sex from the other variables. The data looks like:
1 0.58 0 1 0 0.6540 0 0 1 0 0.39 0 0 1 0.5120 0 1 0 1 0.24 1 0 0 0.2950 0 0 1 0 0.31 0 1 0 0.4640 1 0 0 . . .
For the combined logisitc regression model, I divided the 200 training items into two 100-item sets. I trained a first logistic regression model on the first set of data, then trained a second model on the second set of data. After training, I created a third logistic regression model and set its weights and bias to the averages of the two separate models. I applied the combined model on the test data and got 62.5% accuracy.
To check this, I created a single logistic regression model and trained it on all 200 data items. When I applied this model to the test data, it achieved 75% accuracy — quite a bit better.
I think I understand why the averaged weights combined model doesn’t work well. When you train a logistic regression model, somewhat surprisingly, there are many different sets of weights and bias that will give you very similar answers. A large value in one weight can be balanced by moderate values in two other weights. So when you train separate models, you will get different sets of weights which work well on their own small dataset, but averaging the weights gives mushy weights that work OK but not well.
As always, there are dozens of factors that could be confounding my experiment. However, if I were forced to create a logistic regression model using a huge set of training data tomorrow, my first choice would be to write a streaming data loader, and my second choice would be to create separate models but average the models’ output p-values.
Two things that I’d like to explore are 1.) using a much larger dataset, 2.) using a more consistent training algorithm — L-BFGS instead of stochastic gradient descent.
Mixed media combines different techniques to produce a single work of art. When the technique succeeds, it can create art that’s more appealing than the separate techniques. Here are three mixed media portraits by artists I like. Left: By Graeme Stevenson. Center: By Hans Jochem Bakker. Right: By Andrea Matus Demeng.