Logistic regression is a relatively simple technique for binary classification. I put together a demo using raw Python (rather than a library like scikit-learn). Before I go any further, let me point out that there are dozens of ways to implement logistic regression from scratch.

For my demo, I used some synthetic data that looks like:

1 0.24 1 0 0 0.2950 0 0 1
0 0.39 0 0 1 0.5120 0 1 0
1 0.63 0 1 0 0.7580 1 0 0
0 0.36 1 0 0 0.4450 0 1 0
1 0.27 0 1 0 0.2860 0 0 1
. . .

Each line of data represents a person. The fields are sex (0 = male, 1 = female), age (divided by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), income (divided by $100,000), and political leaning (conservative = 100, moderate = 010, liberal = 001). The goal is to predict sex from age, state, income, and political leaning.
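To make the encoding concrete, here is a minimal sketch of how one raw record maps to the nine fields. The encode_person helper and its argument names are my own illustration, not part of the demo program:

# illustrative helper: encode one raw person record into the demo's format
def encode_person(sex, age, state, income, leaning):
  states = {"michigan": [1,0,0], "nebraska": [0,1,0], "oklahoma": [0,0,1]}
  leanings = {"conservative": [1,0,0], "moderate": [0,1,0], "liberal": [0,0,1]}
  return ([sex, age / 100.0] + states[state] +
    [income / 100000.0] + leanings[leaning])

print(encode_person(1, 24, "michigan", 29500, "liberal"))
# [1, 0.24, 1, 0, 0, 0.295, 0, 0, 1] -- the first data line above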

There are 200 training items and 40 test items. The complete data can be found at https://jamesmccaffrey.wordpress.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/ (a post where I used the same data with a PyTorch binary neural network).

In the image below, a logistic regression model is trained using stochastic gradient descent with a learning rate of 0.005, with training progress monitored using mean squared error. After 10,000 training epochs, the model accuracy is 86.00% on the training data (172 out of 200 correct) and 77.50% on the test data (31 out of 40 correct).
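For reference, the mean squared error being monitored is just the average squared difference between each 0-or-1 target and its computed p-value. A minimal sketch (the demo program below has an equivalent mse_loss function that works from the weights and inputs):

# mean squared error between 0/1 targets and computed p-values
def mse(targets, ps):
  return sum((y - p) * (y - p) for y, p in zip(targets, ps)) / len(targets)

print("%0.4f" % mse([1, 0], [0.80, 0.30]))  # (0.04 + 0.09) / 2 = 0.0650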

Logistic regression is best explained by example. Suppose the goal is to predict the sex of a person who is 35 years old, lives in Michigan, makes $75,000, and is a political liberal. Using the encoding above, the input vector is [0.35, 1, 0, 0, 0.7500, 0, 0, 1]. Each input variable has an associated numeric weight value, and there is a special weight called the bias. Suppose the weights are [0.3, 0.7, -0.2, 0.1, -0.4, 1.1, 0.6, -0.5] and the bias is 0.9.

The first step is to sum the products of each input variable and its weight, and then add the bias:

z = (0.35)(0.3) + (1)(0.7) + (0)(-0.2) + (0)(0.1) + (0.7500)(-0.4) + (0)(1.1) + (0)(0.6) + (1)(-0.5) + 0.9 = 0.905
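In code, this weighted sum is a dot product plus the bias. A minimal sketch that reproduces the computation (the variable names are mine):

# weighted sum of inputs plus bias for the worked example
x = [0.35, 1, 0, 0, 0.7500, 0, 0, 1]   # age, state, income, leaning
wts = [0.3, 0.7, -0.2, 0.1, -0.4, 1.1, 0.6, -0.5]
bias = 0.9
z = sum(w * xi for w, xi in zip(wts, x)) + bias
print("%0.3f" % z)  # 0.905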

The next step is to compute a p-value:

p = 1 / (1 + exp(-z)) = 1 / (1 + exp(-0.905)) = 0.7119

The last step is to interpret the p-value. The computed p-value is always between 0 and 1. If the p-value is less than 0.5 the prediction is class 0 (male); if the p-value is 0.5 or greater the prediction is class 1 (female).
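Continuing the example in code, the sigmoid and threshold steps look like this (again, a sketch with my own variable names):

import math

z = 0.905
p = 1.0 / (1.0 + math.exp(-z))   # logistic sigmoid
print("%0.5f" % p)               # 0.71198
label = 0 if p < 0.5 else 1      # 0 = male, 1 = female
print("class " + str(label))     # class 1 (female)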

OK, but where do the weights and the bias come from? As it turns out, there are several different underlying theoretical models, and each leads to a slightly different way to compute weights from training data. The version that I prefer, in pseudo-code, looks like:

loop through training data:
  get inputs x[i]
  get target (0 or 1) y
  compute output p using current wts
  for-each weight j:
    wts[j] += lrn_rate * x[j] * (y - p)
  bias += lrn_rate * (y - p)
end-loop
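To see a single update concretely, suppose the current computed output for the person in the worked example is p = 0.7119 but the true class is y = 0 (male). The following sketch applies one update with lrn_rate = 0.005; the scenario is hypothetical, chosen only to show the mechanics:

# one SGD update step, using the weights from the worked example
lrn_rate = 0.005
x = [0.35, 1, 0, 0, 0.7500, 0, 0, 1]
wts = [0.3, 0.7, -0.2, 0.1, -0.4, 1.1, 0.6, -0.5]
bias = 0.9
y = 0         # hypothetical target: male
p = 0.7119    # computed output from the steps above

for j in range(len(wts)):
  wts[j] += lrn_rate * x[j] * (y - p)  # inputs with value 0 leave wts unchanged
bias += lrn_rate * (y - p)             # note: no x[j] term in the bias update

print(["%0.4f" % w for w in wts])
print("%0.4f" % bias)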

One of my job tasks at the tech company I work for is to deliver machine learning classes to other employees. Most of the classes I teach focus on deep neural learning using PyTorch. I’ve found that employees really need to understand logistic regression before moving to the much more complicated neural networks.

*Female characters aren’t well-represented in old science fiction movies. But some female characters had a big, positive influence on some excellent movies. Left: British actress Janet Munro in “The Crawling Eye” (1958). Center: American actress Helena Carter in “Invaders from Mars” (1953). Right: Japanese actress Momoko Kochi in “Godzilla: King of the Monsters!” (1956).*

Complete demo code:

# people_gender_log_reg.py
# predict gender (0 = male, 1 = female)
# from age, state, income, political leaning
#
# data:
# 1 0.24 1 0 0 0.2950 0 0 1
# 0 0.39 0 0 1 0.5120 0 1 0
# 1 0.63 0 1 0 0.7580 1 0 0
# 0 0.36 1 0 0 0.4450 0 1 0
# 1 0.27 0 1 0 0.2860 0 0 1
# . . .
#
# Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

import numpy as np

# -----------------------------------------------------------

def compute_output(w, b, x):
  # input x using weights w and bias b
  z = 0.0
  for i in range(len(w)):
    z += w[i] * x[i]
  z += b
  p = 1.0 / (1.0 + np.exp(-z))  # logistic sigmoid
  return p

# -----------------------------------------------------------

def accuracy(w, b, data_x, data_y):
  n_correct = 0; n_wrong = 0
  for i in range(len(data_x)):
    x = data_x[i]        # inputs
    y = int(data_y[i])   # target 0 or 1
    p = compute_output(w, b, x)
    if (y == 0 and p < 0.5) or (y == 1 and p >= 0.5):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def mse_loss(w, b, data_x, data_y):
  sum = 0.0
  for i in range(len(data_x)):
    x = data_x[i]        # inputs
    y = int(data_y[i])   # target 0 or 1
    p = compute_output(w, b, x)
    sum += (y - p) * (y - p)
  mse = sum / len(data_x)
  return mse

# -----------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin logistic regression with raw Python demo ")
  np.random.seed(1)

  # 1. load data
  print("\nLoading People data ")
  train_file = ".\\DataLogReg\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,9),
    delimiter="\t", comments="#", dtype=np.float32)
  train_x = train_xy[:,1:9]
  train_y = train_xy[:,0]

  test_file = ".\\DataLogReg\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,9),
    delimiter="\t", comments="#", dtype=np.float32)
  test_x = test_xy[:,1:9]
  test_y = test_xy[:,0]

  # 2. create model
  print("\nCreating logistic regression model ")
  wts = np.zeros(8)   # one wt per predictor
  lo = -0.01; hi = 0.01
  for i in range(len(wts)):
    wts[i] = (hi - lo) * np.random.random() + lo
  bias = 0.00

  # 3. train model
  lrn_rate = 0.005
  max_epochs = 10000
  indices = np.arange(len(train_x))  # [0, 1, .. 199]
  print("\nTraining using SGD with lrn_rate = %0.4f " % lrn_rate)
  for epoch in range(max_epochs):
    np.random.shuffle(indices)
    for i in indices:
      x = train_x[i]  # inputs
      y = train_y[i]  # target 0.0 or 1.0
      p = compute_output(wts, bias, x)
      # update all wts and the bias
      for j in range(len(wts)):
        wts[j] += lrn_rate * x[j] * (y - p)  # target - output
      bias += lrn_rate * (y - p)
    if epoch % 1000 == 0:
      loss = mse_loss(wts, bias, train_x, train_y)
      print("epoch = %5d  |  loss = %9.4f " % (epoch, loss))
  print("Done ")

  # 4. evaluate model
  print("\nEvaluating trained model ")
  acc_train = accuracy(wts, bias, train_x, train_y)
  print("Accuracy on train data: %0.4f " % acc_train)
  acc_test = accuracy(wts, bias, test_x, test_y)
  print("Accuracy on test data: %0.4f " % acc_test)

  # 5. use model
  print("\nPrediction for [33, Nebraska, $50,000, moderate]: ")
  x = np.array([0.33, 0,1,0, 0.5000, 0,1,0], dtype=np.float32)
  p = compute_output(wts, bias, x)
  print("%0.4f " % p)
  if p < 0.5:
    print("class 0 (male) ")
  else:
    print("class 1 (female) ")

  # 6. TODO: save trained weights and bias to file

  print("\nEnd People logistic regression demo ")

if __name__ == "__main__":
  main()
