## Revisiting Binary Classification Using scikit Logistic Regression

It had been a while since I looked at logistic regression using the scikit-learn (scikit or sklearn for short) machine learning library. As with any skill, it’s important to stay in practice.

I used one of my standard datasets for binary classification. The data is synthetic and looks like:

```
1   0.24   1 0 0   0.2950   0 0 1
0   0.39   0 0 1   0.5120   0 1 0
1   0.63   0 1 0   0.7580   1 0 0
0   0.36   1 0 0   0.4450   0 1 0
. . .
```

Each line of tab-delimited data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (conservative = 100, moderate = 010, liberal = 001). The goal is to predict the gender of a person from their age, state, income, and politics type.
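The encoding described above can be sketched in a few lines of Python. The helper name `encode_person` and the two lookup dictionaries are hypothetical, written only to illustrate the layout of the eight predictor values; they are not part of the demo program.

```python
# Hypothetical helper, not part of the demo: builds the eight
# predictor values in the order age, state, income, politics.

STATE_CODES = {
    "michigan": [1, 0, 0],
    "nebraska": [0, 1, 0],
    "oklahoma": [0, 0, 1],
}
POLITICS_CODES = {
    "conservative": [1, 0, 0],
    "moderate": [0, 1, 0],
    "liberal": [0, 0, 1],
}

def encode_person(age, state, income, politics):
    """Return [age/100, state one-hot (3), income/100000, politics one-hot (3)]."""
    return ([age / 100.0] + STATE_CODES[state]
            + [income / 100_000.0] + POLITICS_CODES[politics])

# A 36-year-old from Oklahoma who earns $50,000 and is a moderate.
x = encode_person(36, "oklahoma", 50_000, "moderate")
print(x)  # [0.36, 0, 0, 1, 0.5, 0, 1, 0]
```

The sex field is not included because it is the value to predict, which is stored in the first column of each data line.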

There are 200 lines of training data and 40 lines of test data. The complete data can be found at:
jamesmccaffrey.wordpress.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/

I used the version of scikit that was installed with Anaconda Python version Anaconda3-2020.02 (with Python 3.7.6), which is scikit version 0.22.1.

Using scikit has pros and cons. The pros are that scikit is easy to use and has a lot of nice built-in modules. The cons are that scikit is difficult to customize and the code is essentially a black box (open source but impossible to decipher).

The key statements are:

```
model = LogisticRegression(random_state=0,
  solver='sag', max_iter=1000, penalty='none')
model.fit(train_x, train_y)
```

The SAG (stochastic average gradient) algorithm is a variation of ordinary SGD (stochastic gradient descent). In general, the penalty can be L1, L2, elastic net (a combination of L1 and L2), or none. The SAG solver supports only L2 or none, and the demo uses none.
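To make the difference between SGD and SAG concrete, here is a minimal from-scratch sketch of the SAG idea for logistic regression with no penalty. It is an illustration only and is not scikit's implementation. The function names, the learning rate, the iteration count, and the four-item dataset are all made up for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(w, b, x):
    """P(class 1) for one item with weights w and bias b."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

def train_sag(xs, ys, lr=0.05, n_iter=2000, seed=0):
    """Toy SAG for unpenalized logistic regression (illustration only)."""
    rnd = random.Random(seed)
    n, dim = len(xs), len(xs[0])
    w, b = [0.0] * dim, 0.0
    # Memory of the last gradient seen for each training item.
    grad_w = [[0.0] * dim for _ in range(n)]
    grad_b = [0.0] * n
    # Running sums of the stored gradients.
    sum_w, sum_b = [0.0] * dim, 0.0

    for _ in range(n_iter):
        i = rnd.randrange(n)  # pick one item, as in SGD
        err = predict_prob(w, b, xs[i]) - ys[i]  # d(log loss)/dz
        new_w = [err * xj for xj in xs[i]]
        new_b = err
        # Swap the item's old gradient for its new one in the sums.
        for j in range(dim):
            sum_w[j] += new_w[j] - grad_w[i][j]
        sum_b += new_b - grad_b[i]
        grad_w[i], grad_b[i] = new_w, new_b
        # Step along the average of all stored gradients
        # (plain SGD would step along new_w and new_b only).
        for j in range(dim):
            w[j] -= lr * sum_w[j] / n
        b -= lr * sum_b / n
    return w, b

# Tiny made-up dataset: one predictor, class 1 when x is positive.
xs = [[-2.0], [-1.0], [1.0], [2.0]]
ys = [0, 0, 1, 1]
w, b = train_sag(xs, ys)
preds = [1 if predict_prob(w, b, x) >= 0.5 else 0 for x in xs]
print(preds)
```

The only difference from plain SGD is the table of remembered per-item gradients. Each update uses the average of that table, which tends to give smoother steps at the cost of extra memory.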

My scikit logistic regression demo got 72.50% accuracy on the test data. A PyTorch binary classifier network got 85.00% accuracy. A from-scratch Python version of logistic regression got 77.50% accuracy.

There are some interesting analogies between the evolution/development of aircraft design and the evolution/development of machine learning algorithms. Here are three aircraft designs that have a circular design theme but which weren’t successful. Left: The DFW T.28 “Floh” (“Flea” in German) was built in 1917 in Germany by Hermann Dorner. Center: The Vought V-173 “Flying Pancake” was built in 1942 to explore reduced-drag designs. Right: The Stipa was an experimental Italian aircraft designed in 1932. It had a hollow fuselage with the engine and propeller completely enclosed.

Demo code.

```
# people_gender_scikit.py

# predict gender (0 = male, 1 = female)
# from age, state, income, politics type

# data:
# 1   0.24   1   0   0   0.2950   0   0   1
# 0   0.39   0   0   1   0.5120   0   1   0
# 1   0.27   0   1   0   0.2860   0   0   1
# . . .

# Anaconda3-2020.02  Python 3.7.6
# scikit 0.22.1  Windows 10/11

import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score

def show_confusion(cm):
    # Confusion matrix whose i-th row and j-th column entry
    # indicates the number of samples with true label being
    # i-th class and predicted label being j-th class.

    ct_act0_pred0 = cm[0][0]  # TN
    ct_act0_pred1 = cm[0][1]  # FP wrongly predicted as pos
    ct_act1_pred0 = cm[1][0]  # FN wrongly predicted as neg
    ct_act1_pred1 = cm[1][1]  # TP

    print("actual 0  | %4d %4d" % (ct_act0_pred0, ct_act0_pred1))
    print("actual 1  | %4d %4d" % (ct_act1_pred0, ct_act1_pred1))
    print("           ----------")
    print("predicted      0    1")

# -----------------------------------------------------------

def main():
    print("\nBegin logistic regression with scikit ")
    np.random.seed(1)

    # 1. load data (np.loadtxt calls are assumed; the files are
    #    tab-delimited with the label in column 0)
    train_file = ".\\Data\\people_train.txt"
    train_xy = np.loadtxt(train_file, usecols=range(0, 9),
        delimiter="\t", comments="#", dtype=np.float32)
    train_x = train_xy[:, 1:9]
    train_y = train_xy[:, 0]

    test_file = ".\\Data\\people_test.txt"
    test_xy = np.loadtxt(test_file, usecols=range(0, 9),
        delimiter="\t", comments="#", dtype=np.float32)
    test_x = test_xy[:, 1:9]
    test_y = test_xy[:, 0]

    print("\nTraining data:")
    print(train_x[0:4])
    print(". . . \n")
    print(train_y[0:4])
    print(". . . ")

    # 2. create model and train
    # penalty='none' matches scikit 0.22.1; newer scikit
    # releases expect penalty=None instead
    print("\nCreating logistic regression model")
    model = LogisticRegression(random_state=0,
        solver='sag', max_iter=1000, penalty='none')
    model.fit(train_x, train_y)

    # 3. evaluate
    print("\nComputing model accuracy ")
    acc_train = model.score(train_x, train_y)
    print("Accuracy on training = %0.4f " % acc_train)

    acc_test = model.score(test_x, test_y)
    print("Accuracy on test = %0.4f " % acc_test)

    y_predicteds = model.predict(test_x)
    precision = precision_score(test_y, y_predicteds)
    print("Precision on test = %0.4f " % precision)

    # 4. make a prediction
    print("\nPredict age 36, Oklahoma, $50K, moderate ")
    x = np.array([[0.36, 0,0,1, 0.5000, 0,1,0]],
        dtype=np.float32)

    p = model.predict_proba(x)
    p = p[0][1]  # first (only) row, second value P(1)

    print("\nPrediction prob = %0.6f " % p)
    if p < 0.5:
        print("Prediction = male ")
    else:
        print("Prediction = female ")

    # 5. save model
    print("\nSaving trained logistic regression model ")
    path = ".\\Models\\people_scikit_model.sav"
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # to load and use the saved model:
    # with open(path, 'rb') as f:
    #     loaded_model = pickle.load(f)
    # pa = loaded_model.predict_proba(x)
    # print(pa)

    # 6. confusion matrix with labels
    cm = confusion_matrix(test_y, y_predicteds)
    print("\nConfusion matrix raw: ")
    print(cm)

    print("\nConfusion matrix custom: ")
    show_confusion(cm)

    print("\nEnd People logistic regression demo ")

if __name__ == "__main__":
    main()
```
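The accuracy and precision values that the demo prints can be related back to the four cells of the confusion matrix. The sketch below shows the arithmetic using made-up counts for 40 test items. The function name `metrics_from_confusion` is hypothetical, and the counts are not the demo's actual results.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = count with actual class i and predicted class j."""
    tn, fp = cm[0][0], cm[0][1]
    fn, tp = cm[1][0], cm[1][1]
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Precision is undefined when nothing is predicted positive;
    # this sketch returns 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return accuracy, precision

# Made-up counts for 40 test items; not the demo's actual results.
cm = [[16, 4],
      [6, 14]]
acc, prec = metrics_from_confusion(cm)
print("accuracy = %0.4f  precision = %0.4f" % (acc, prec))
# accuracy = 0.7500  precision = 0.7778
```

Accuracy counts both kinds of correct prediction, while precision looks only at the items predicted as class 1 (female in this data), so the two numbers can differ quite a bit on the same test set.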