Revisiting Binary Classification Using scikit Logistic Regression

It had been a while since I looked at logistic regression using the scikit-learn (scikit or sklearn for short) machine learning library. As with any skill, it’s important to stay in practice.

I used one of my standard datasets for binary classification. The data is synthetic and looks like:

 1   0.24   1 0 0   0.2950   0 0 1
 0   0.39   0 0 1   0.5120   0 1 0
 1   0.63   0 1 0   0.7580   1 0 0
 0   0.36   1 0 0   0.4450   0 1 0
. . .

Each line of tab-delimited data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (conservative = 100, moderate = 010, liberal = 001). The state and politics type fields are one-hot encoded, so each takes up three columns. The goal is to predict the gender of a person from their age, state, income, and politics type.
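
To make the encoding concrete, here is a minimal sketch that turns raw values into the eight predictor values. The function name encode_person and the two lookup tables are hypothetical helpers I made up for illustration; they are not part of the demo program.

```python
import numpy as np

# Hypothetical lookup tables that mirror the encoding described above.
STATE_CODES = {
    "michigan": (1, 0, 0),
    "nebraska": (0, 1, 0),
    "oklahoma": (0, 0, 1),
}
POLITICS_CODES = {
    "conservative": (1, 0, 0),
    "moderate": (0, 1, 0),
    "liberal": (0, 0, 1),
}

def encode_person(age, state, income, politics):
    """Return the eight predictor values for one person."""
    values = [age / 100.0]                 # age divided by 100
    values.extend(STATE_CODES[state])      # three one-hot columns
    values.append(income / 100000.0)       # income divided by 100,000
    values.extend(POLITICS_CODES[politics])
    return np.array(values, dtype=np.float32)

# age 36, Oklahoma, $50,000, moderate
x = encode_person(36, "oklahoma", 50000, "moderate")
print(x)
```

The sex value is not included because it is the label to predict, not a predictor.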

There are 200 lines of training data and 40 lines of test data. The complete data can be found at:
jamesmccaffrey.wordpress.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/

I used the version of scikit that was installed with Anaconda Python version Anaconda3-2020.02 (with Python 3.7.6), which is scikit version 0.22.1.

Using scikit has pros and cons. The pros are that scikit is easy to use and has a lot of nice built-in modules. The cons are that scikit is difficult to customize and the code is essentially a black box (open source but impossible to decipher).

The key statements are:

model = LogisticRegression(random_state=0,
  solver='sag', max_iter=1000, penalty='none')
model.fit(train_x, train_y)

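The SAG (stochastic average gradient) algorithm is a variation of ordinary SGD (stochastic gradient descent). The demo uses no penalty. In general, the LogisticRegression penalty can be L1, or L2, or elastic net (a combination of L1 and L2), but not every solver supports every penalty: the SAG solver works only with L2 or no penalty.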
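
To show what an L2 penalty does when it is turned on, here is a minimal sketch using the SAG solver. The data is random and made up for illustration (it is not the people dataset), and the helper name coef_norm is hypothetical. The penalty argument is left at its default, which is L2, and the C parameter is the inverse of the penalty strength, so a small C means a strong penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: 200 rows, 8 predictors in [0, 1], noisy binary labels.
rng = np.random.RandomState(0)
X = rng.random_sample((200, 8))
true_w = np.array([3.0, -2.0, 1.0, 0.0, 2.5, -1.5, 0.5, -3.0])
logits = X @ true_w - true_w.sum() / 2.0
y = (logits + rng.normal(0.0, 0.5, 200) > 0).astype(int)

def coef_norm(C):
    # penalty is left at its default (L2); C is the inverse strength
    model = LogisticRegression(solver="sag", C=C,
                               max_iter=5000, random_state=0)
    model.fit(X, y)
    return float(np.linalg.norm(model.coef_))

strong = coef_norm(0.01)   # heavy L2 penalty
weak = coef_norm(100.0)    # light L2 penalty
print("coefficient norm, C = 0.01: %0.4f" % strong)
print("coefficient norm, C = 100:  %0.4f" % weak)
```

The heavily penalized model should end up with a smaller coefficient norm, which is the whole point of the penalty: it discourages large weights.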

My scikit logistic regression demo got 72.50% accuracy on the test data. A PyTorch binary classifier network got 85.00% accuracy. A from-scratch Python version of logistic regression got 77.50% accuracy.
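
For readers who haven't seen logistic regression without a library, here is a minimal sketch of the idea trained with plain SGD. This is not the from-scratch program mentioned above. The function names, learning rate, epoch count, and the random data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sgd(X, y, lr=0.1, epochs=200, seed=0):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    rng = np.random.RandomState(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # visit rows in random order
            p = sigmoid(X[i] @ w + b)      # predicted P(class 1)
            err = p - y[i]                 # gradient of log loss wrt z
            w -= lr * err * X[i]
            b -= lr * err
    return w, b

def accuracy(X, y, w, b):
    preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
    return float(np.mean(preds == y))

# Made-up, linearly separable data (not the people dataset).
rng = np.random.RandomState(1)
X = rng.random_sample((200, 8))
true_w = np.array([3.0, -2.0, 1.0, 0.0, 2.5, -1.5, 0.5, -3.0])
y = (X @ true_w - true_w.sum() / 2.0 > 0).astype(int)

w, b = train_sgd(X, y)
acc = accuracy(X, y, w, b)
print("Accuracy on made-up training data = %0.4f" % acc)
```

The update rule processes one row at a time, which is what makes it stochastic. SAG differs in that it keeps a running average of past per-row gradients.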

There are some interesting analogies between the evolution/development of aircraft design and the evolution/development of machine learning algorithms. Here are three aircraft designs that have a circular design theme but which weren’t successful. Left: The DFW T.28 “Floh” (“Flea” in German) was built in 1917 in Germany by Hermann Dorner. Center: The Vought V-173 “Flying Pancake” was built in 1942 to explore reduced-drag designs. Right: The Stipa was an experimental Italian aircraft designed in 1932. It had a hollow fuselage with the engine and propeller completely enclosed.

Demo code. If “lt” appears in quotes in the listing, replace it with the less-than operator symbol.

# people_gender_scikit.py

# predict gender (0 = male, 1 = female)
# from age, state, income, politics type

# data:
# 1   0.24   1   0   0   0.2950   0   0   1
# 0   0.39   0   0   1   0.5120   0   1   0
# 1   0.27   0   1   0   0.2860   0   0   1
# . . . 

# Anaconda3-2020.02  Python 3.7.6
# scikit 0.22.1  Windows 10/11

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
import pickle

def show_confusion(cm):
  # Confusion matrix whose i-th row and j-th column entry
  # indicates the number of samples with true label being
  # i-th class and predicted label being j-th class.

  ct_act0_pred0 = cm[0][0]  # TN
  ct_act0_pred1 = cm[0][1]  # FP wrongly predicted as pos
  ct_act1_pred0 = cm[1][0]  # FN wrongly predicted as neg 
  ct_act1_pred1 = cm[1][1]  # TP
  
  print("actual 0  | %4d %4d" % (ct_act0_pred0, ct_act0_pred1))
  print("actual 1  | %4d %4d" % (ct_act1_pred0, ct_act1_pred1))
  print("           ----------")
  print("predicted      0    1")
  
# -----------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin logistic regression with scikit ")
  np.random.seed(1)

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,9),
    delimiter="\t", comments="#",  dtype=np.float32) 
  train_x = train_xy[:,1:9]
  train_y = train_xy[:,0].astype(int)

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,9),
    delimiter="\t", comments="#",  dtype=np.float32) 
  test_x = test_xy[:,1:9]
  test_y = test_xy[:,0].astype(int)

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

  # 2. create model and train
  print("\nCreating logistic regression model")
  # penalty='none' works in scikit 0.22.1; recent versions
  # expect penalty=None instead
  model = LogisticRegression(random_state=0,
    solver='sag', max_iter=1000, penalty='none')
  model.fit(train_x, train_y)

  # 3. evaluate
  print("\nComputing model accuracy ")
  acc_train = model.score(train_x, train_y)
  print("Accuracy on training = %0.4f " % acc_train)

  acc_test = model.score(test_x, test_y)
  print("Accuracy on test = %0.4f " % acc_test)

  y_predicteds = model.predict(test_x)
  precision = precision_score(test_y, y_predicteds)
  print("Precision on test = %0.4f " % precision)

  # 4. make a prediction 
  print("\nPredict age 36, Oklahoma, $50K, moderate ")
  x = np.array([[0.36, 0,0,1, 0.5000, 0,1,0]],
    dtype=np.float32)
  
  p = model.predict_proba(x) 
  p = p[0][1]  # first (only) row, second value P(1)

  print("\nPrediction prob = %0.6f " % p)
  if p < 0.5:
    print("Prediction = male ")
  else:
    print("Prediction = female ")

  # 5. save model
  print("\nSaving trained logistic regression model ")
  path = ".\\Models\\people_scikit_model.sav"
  with open(path, "wb") as f:
    pickle.dump(model, f)

  # with open(path, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # pa = loaded_model.predict_proba(x)
  # print(pa)

  # 6. confusion matrix with labels
  from sklearn.metrics import confusion_matrix
  cm = confusion_matrix(test_y, y_predicteds)
  print("\nConfusion matrix raw: ")
  print(cm)

  print("\nConfusion matrix custom: ")
  show_confusion(cm)
 
  print("\nEnd People logistic regression demo ")

if __name__ == "__main__":
  main()
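
The demo gets accuracy from model.score() and precision from precision_score(), but both can also be computed directly from the confusion matrix. Here is a minimal sketch. The function name metrics_from_confusion is hypothetical, and the counts are made up for illustration; they are not results from the demo.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute accuracy, precision, recall from a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes,
    the same layout that sklearn's confusion_matrix() returns.
    """
    tn, fp = cm[0][0], cm[0][1]
    fn, tp = cm[1][0], cm[1][1]
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# Made-up counts for 25 items, NOT output from the demo.
cm = np.array([[12, 3],
               [2, 8]])
acc, prec, rec = metrics_from_confusion(cm)
print("accuracy  = %0.4f" % acc)
print("precision = %0.4f" % prec)
print("recall    = %0.4f" % rec)
```

Seeing the formulas in terms of TN, FP, FN, and TP makes it easier to sanity-check the library values against the custom confusion matrix display.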