Logistic Regression Using Raw Python

Logistic regression is a relatively simple technique for binary classification. I put together a demo using raw Python (rather than using a library like scikit). Before I go any further, let me point out that there are dozens of ways to implement logistic regression from scratch.

For my demo, I used some synthetic data that looks like:

1   0.24   1   0   0   0.2950   0   0   1
0   0.39   0   0   1   0.5120   0   1   0
1   0.63   0   1   0   0.7580   1   0   0
0   0.36   1   0   0   0.4450   0   1   0
1   0.27   0   1   0   0.2860   0   0   1
. . .

Each line of data represents a person. The fields are sex (0 = male, 1 = female), age (divided by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), income (divided by $100,000), and political leaning (conservative = 100, moderate = 010, liberal = 001). The goal is to predict sex from age, state, income, and political leaning.
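
For example, here is how the first data line might be produced from raw values. This is just a minimal sketch of the encoding scheme; the encode_person() helper is mine and is not part of the demo program.

def encode_person(sex, age, state, income, politics):
  # hypothetical helper illustrating the encoding scheme
  states = {"michigan":[1,0,0], "nebraska":[0,1,0], "oklahoma":[0,0,1]}
  leanings = {"conservative":[1,0,0], "moderate":[0,1,0], "liberal":[0,0,1]}
  y = 0 if sex == "male" else 1  # target: 0 = male, 1 = female
  x = [age / 100.0] + states[state] + \
    [income / 100000.0] + leanings[politics]
  return y, x

# encode_person("female", 24, "michigan", 29500, "liberal")
# gives y = 1 and x = [0.24, 1, 0, 0, 0.295, 0, 0, 1]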

There are 200 training items and 40 test items. The complete data can be found at https://jamesmccaffrey.wordpress.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/ (a post where I used the same data with a PyTorch binary classification neural network).

In the demo, a logistic regression model is trained using stochastic gradient descent with a learning rate of 0.005, and training progress is monitored using mean squared error. After 10,000 training epochs, the model accuracy is 86.00% on the training data (172 out of 200 correct) and 77.50% on the test data (31 out of 40 correct).

How logistic regression works is best explained by example. Suppose the goal is to predict the sex of a person who is 35 years old, lives in Michigan, makes $75,000, and is a political liberal. The inputs are encoded as x = [0.35, 1, 0, 0, 0.7500, 0, 0, 1]. Each input variable has a numeric weight value, and there is a special weight called the bias. Suppose the weights are [0.3, 0.7, -0.2, 0.1, -0.4, 1.1, 0.6, -0.5] and the bias is 0.9.

The first step is to sum the products of each input variable and its weight, and then add the bias:

z = (0.35)(0.3) + (1)(0.7) + (0)(-0.2) + (0)(0.1) +
    (0.7500)(-0.4) + (0)(1.1) + (0)(0.6) + (1)(-0.5) + (0.9)

  = 0.905

The next step is to compute a p-value:

p = 1 / (1 + exp(-z))
  = 1 / (1 + exp(-0.905))
  = 0.7119

The last step is to interpret the p-value. The computed p-value will always be between 0 and 1. If the p-value is less than 0.5 the prediction is class 0 (male); if the p-value is 0.5 or greater, the prediction is class 1 (female).
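
The three steps can be verified with a few lines of Python. This is a quick sketch that uses the hypothetical weights and bias from the example above:

import numpy as np

w = np.array([0.3, 0.7, -0.2, 0.1, -0.4, 1.1, 0.6, -0.5])
b = 0.9
# 35 years old, Michigan, $75,000, liberal
x = np.array([0.35, 1, 0, 0, 0.7500, 0, 0, 1])

z = np.dot(w, x) + b          # 0.905
p = 1.0 / (1.0 + np.exp(-z))  # 0.7119
print("female" if p >= 0.5 else "male")  # female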

OK, but where do the weights and the bias come from? As it turns out, there are several different underlying theoretical models, and each leads to a slightly different way to compute the weights from training data. The version that I prefer looks like:

loop through training data:
  get inputs x[i]
  get target (0 or 1) y
  compute output p using current wts and bias
  for-each weight j:
    wts[j] += lrn_rate * x[j] * (y - p)
  bias += lrn_rate * (y - p)
end-loop
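
This update rule is what gradient descent on cross-entropy (log loss) error gives when the output is a logistic sigmoid. Here is a minimal sketch of a single update step, reusing the hypothetical weights and inputs from the example above and assuming, for illustration, that the true class is 1 (female):

import numpy as np

lrn_rate = 0.005
w = np.array([0.3, 0.7, -0.2, 0.1, -0.4, 1.1, 0.6, -0.5])
b = 0.9
x = np.array([0.35, 1, 0, 0, 0.7500, 0, 0, 1])
y = 1         # assumed target for illustration
p = 0.7119    # output computed earlier

for j in range(len(w)):
  w[j] += lrn_rate * x[j] * (y - p)  # w[0] changes by 0.005 * 0.35 * 0.2881
b += lrn_rate * (y - p)  # the bias acts like a weight whose input is always 1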

One of my job tasks at the tech company I work for is to deliver machine learning classes to other employees. Most of the classes I teach focus on deep neural learning using PyTorch. I’ve found that employees really need to understand logistic regression before moving to the much more complicated neural networks.



Female characters aren’t well-represented in old science fiction movies. But some female characters had a big, positive influence on some excellent movies. Left: British actress Janet Munro in “The Crawling Eye” (1958). Center: American actress Helena Carter in “Invaders from Mars” (1953). Right: Japanese actress Momoko Kochi in “Godzilla: King of the Monsters!” (1956).


Demo code.

# people_gender_log_reg.py

# predict gender (0 = male, 1 = female)
# from age, state, income, political leaning

# data:
# 1   0.24   1   0   0   0.2950   0   0   1
# 0   0.39   0   0   1   0.5120   0   1   0
# 1   0.63   0   1   0   0.7580   1   0   0
# 0   0.36   1   0   0   0.4450   0   1   0
# 1   0.27   0   1   0   0.2860   0   0   1
# . . . 

# Anaconda3-2020.02  Python 3.7.6
# Windows 10/11

import numpy as np

# -----------------------------------------------------------

def compute_output(w, b, x):
  # compute output p for input x using weights w and bias b
  z = 0.0
  for i in range(len(w)):
    z += w[i] * x[i]
  z += b
  p = 1.0 / (1.0 + np.exp(-z))  # logistic sigmoid
  return p

# -----------------------------------------------------------

def accuracy(w, b, data_x, data_y):
  n_correct = 0; n_wrong = 0
  for i in range(len(data_x)):
    x = data_x[i]  # inputs
    y = int(data_y[i])  # target 0 or 1
    p = compute_output(w, b, x)
    if (y == 0 and p < 0.5) or (y == 1 and p >= 0.5):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def mse_loss(w, b, data_x, data_y):
  sum_sq = 0.0  # accumulated squared error
  for i in range(len(data_x)):
    x = data_x[i]  # inputs
    y = int(data_y[i])  # target 0 or 1
    p = compute_output(w, b, x)
    sum_sq += (y - p) * (y - p)
  mse = sum_sq / len(data_x)
  return mse

# -----------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin logistic regression with raw Python demo ")
  np.random.seed(1)

  # 1. load data
  print("\nLoading People data ")
  
  train_file = ".\\DataLogReg\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,9),
    delimiter="\t", comments="#",  dtype=np.float32) 
  train_x = train_xy[:,1:9]
  train_y = train_xy[:,0]

  test_file = ".\\DataLogReg\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,9),
    delimiter="\t", comments="#", dtype=np.float32)
  test_x = test_xy[:,1:9]
  test_y = test_xy[:,0]

# -----------------------------------------------------------

  # 2. create model
  print("\nCreating logistic regression model ")
  wts = np.zeros(8)  # one wt per predictor
  lo = -0.01; hi = 0.01
  for i in range(len(wts)):
    wts[i] = (hi - lo) * np.random.random() + lo
  bias = 0.00

# -----------------------------------------------------------

  # 3. train model
  lrn_rate = 0.005
  max_epochs = 10000
  indices = np.arange(len(train_x))  # [0, 1, .. 199]
  print("\nTraining using SGD with lrn_rate = %0.4f " % lrn_rate)
  for epoch in range(max_epochs):
    np.random.shuffle(indices)
    for i in indices:
      x = train_x[i]  # inputs
      y = train_y[i]  # target 0.0 or 1.0
      p = compute_output(wts, bias, x)

      # update all wts and the bias
      for j in range(len(wts)):
        wts[j] += lrn_rate * x[j] * (y - p)  # target - output
      bias += lrn_rate * (y - p)
    if epoch % 1000 == 0:
      loss = mse_loss(wts, bias, train_x, train_y)
      print("epoch = %5d  |  loss = %9.4f " % (epoch, loss))
  print("Done ")

# -----------------------------------------------------------

  # 4. evaluate model
  print("\nEvaluating trained model ")
  acc_train = accuracy(wts, bias, train_x, train_y)
  print("Accuracy on train data: %0.4f " % acc_train)
  acc_test = accuracy(wts, bias, test_x, test_y)
  print("Accuracy on test data: %0.4f " % acc_test)

  # 5. use model
  print("\nPrediction for [33, Nebraska, $50,000, moderate]: ")
  x = np.array([0.33, 0,1,0, 0.5000, 0,1,0], dtype=np.float32)
  p = compute_output(wts, bias, x)
  print("%0.4f " % p)
  if p "lt" 0.5:
    print("class 0 (male) ")
  else:
    print("class 1 (female) ") 

  # 6. TODO: save trained weights and bias to file

  print("\nEnd People logistic regression demo ")

if __name__ == "__main__":
  main()
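
For step 6, one simple approach is to flatten the weights and bias into a single vector and use np.savetxt() and np.loadtxt(). This is my sketch, not part of the demo; the file name is hypothetical and I assume the demo's wts, bias, and numpy import are in scope:

# save: the 8 weights followed by the bias (hypothetical file path)
np.savetxt(".\\DataLogReg\\people_wts.txt", np.append(wts, bias),
  fmt="%0.6f")

# load:
wb = np.loadtxt(".\\DataLogReg\\people_wts.txt", dtype=np.float64)
wts = wb[0:8]; bias = wb[8]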