It had been a while since I looked at logistic regression using the scikit-learn (scikit or sklearn for short) machine learning library. Like any kind of skill, it’s important to stay in practice.

I used one of my standard datasets for binary classification. The data is synthetic and looks like:

1 0.24 1 0 0 0.2950 0 0 1
0 0.39 0 0 1 0.5120 0 1 0
1 0.63 0 1 0 0.7580 1 0 0
0 0.36 1 0 0 0.4450 0 1 0
. . .

Each line of tab-delimited data represents a person. The fields are sex (male = 0, female = 1), age (normalized by dividing by 100), state (Michigan = 100, Nebraska = 010, Oklahoma = 001), annual income (divided by 100,000), and politics type (conservative = 100, moderate = 010, liberal = 001). The state and politics type values are one-hot encoded. The goal is to predict the gender of a person from their age, state, income, and politics type.
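To make the encoding concrete, here is a minimal sketch that turns one person's raw values into the eight predictor values. The function names encode_person and one_hot and the two lookup lists are hypothetical helpers for illustration; they are not part of the demo program.

```python
# Hypothetical helpers (not part of the demo program): encode raw
# values into the eight predictors described above.
STATES = ["michigan", "nebraska", "oklahoma"]       # one-hot order: 100, 010, 001
POLITICS = ["conservative", "moderate", "liberal"]  # one-hot order: 100, 010, 001

def one_hot(value, categories):
    # 1 in the position of the matching category, 0 elsewhere
    value = value.lower()
    if value not in categories:
        raise ValueError("unknown category: " + value)
    return [1 if c == value else 0 for c in categories]

def encode_person(age, state, income, politics):
    # age is divided by 100, income is divided by 100,000
    return ([age / 100.0] + one_hot(state, STATES) +
            [income / 100000.0] + one_hot(politics, POLITICS))

x = encode_person(36, "Oklahoma", 50000, "moderate")
print(x)
```

The encoded list for a 36-year-old moderate from Oklahoma who makes $50,000 has the same eight values as the input used in the prediction step of the demo.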

There are 200 lines of training data and 40 lines of test data. The complete data can be found at:

jamesmccaffrey.wordpress.com/2022/09/23/binary-classification-using-pytorch-1-12-1-on-windows-10-11/

I used the version of scikit that was installed with Anaconda Python version Anaconda3-2020.02 (with Python 3.7.6), which is scikit version 0.22.1.

Using scikit has pros and cons. The pros are that scikit is easy to use and has a lot of nice built-in modules. The cons are that scikit is difficult to customize and the code is essentially a black box (open source but impossible to decipher).

The key statements are:

model = LogisticRegression(random_state=0, solver='sag',
  max_iter=1000, penalty='none')
model.fit(train_x, train_y)

The SAG (stochastic average gradient) algorithm is a variation of ordinary SGD (stochastic gradient descent). In general, the LogisticRegression penalty can be L1, L2, or elastic net (a combination of L1 and L2), but the SAG solver supports only L2 or no penalty.
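The idea behind SAG can be sketched in plain Python. This is a minimal illustration for unregularized logistic regression, not scikit's implementation: the function names, the learning rate, the epoch count, and the toy data are all assumptions. SAG remembers the most recent gradient computed for each training item and steps along the average of the stored gradients, where ordinary SGD steps along a single item's gradient.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, rows, labels):
    # mean negative log-likelihood; the last weight is the bias
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x + [1.0])))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(rows)

def sag_train(rows, labels, lr=0.05, epochs=200, seed=0):
    rnd = random.Random(seed)
    n, d = len(rows), len(rows[0]) + 1          # +1 for the bias
    w = [0.0] * d
    grad_table = [[0.0] * d for _ in range(n)]  # last gradient seen per item
    grad_sum = [0.0] * d                        # sum of the stored gradients
    for _ in range(epochs * n):
        i = rnd.randrange(n)                    # pick one training item
        x = rows[i] + [1.0]
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - labels[i]
        for j in range(d):
            g = err * x[j]                      # fresh gradient for item i
            grad_sum[j] += g - grad_table[i][j] # swap old gradient for new
            grad_table[i][j] = g
        for j in range(d):
            w[j] -= lr * grad_sum[j] / n        # step along the average
    return w

# Toy data (made up): label is 1 when the first value exceeds 0.5
rnd = random.Random(1)
rows = [[rnd.random(), rnd.random()] for _ in range(100)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

w = sag_train(rows, labels)
print("loss before = %0.4f" % log_loss([0.0, 0.0, 0.0], rows, labels))
print("loss after  = %0.4f" % log_loss(w, rows, labels))
```

The table of stored gradients costs extra memory (one gradient per training item), which is the price SAG pays for a less noisy update direction.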

My scikit logistic regression demo got 72.50% accuracy on the test data. A PyTorch binary classifier network got 85.00% accuracy. A from-scratch Python version of logistic regression got 77.50% accuracy.
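Because there are only 40 test items, each item accounts for 2.5 percentage points of accuracy. The short sketch below converts the three reported accuracies back into counts of correct predictions; the helper name correct_count is hypothetical.

```python
def correct_count(accuracy, n_items):
    # number of correct predictions implied by an accuracy value
    return round(accuracy * n_items)

n_test = 40
results = [("scikit", 0.7250), ("PyTorch", 0.8500), ("from-scratch", 0.7750)]
for name, acc in results:
    print("%-12s %2d of %d correct" % (name, correct_count(acc, n_test), n_test))
```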

*There are some interesting analogies between the evolution/development of aircraft design and the evolution/development of machine learning algorithms. Here are three aircraft designs that have a circular design theme but which weren’t successful. Left: The DFW T.28 “Floh” (“Flea” in German) was built in 1917 in Germany by Hermann Dorner. Center: The Vought V-173 “Flying Pancake” was built in 1942 to explore reduced-drag designs. Right: The Stipa was an experimental Italian aircraft designed in 1932. It had a hollow fuselage with the engine and propeller completely enclosed.*

Demo code.

# people_gender_scikit.py
# predict gender (0 = male, 1 = female)
# from age, state, income, politics type

# data:
# 1 0.24 1 0 0 0.2950 0 0 1
# 0 0.39 0 0 1 0.5120 0 1 0
# 1 0.27 0 1 0 0.2860 0 0 1
# . . .

# Anaconda3-2020.02 Python 3.7.6
# scikit 0.22.1 Windows 10/11

import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score

def show_confusion(cm):
    # Confusion matrix whose i-th row and j-th column entry
    # indicates the number of samples with true label being
    # i-th class and predicted label being j-th class.
    ct_act0_pred0 = cm[0][0]  # TN
    ct_act0_pred1 = cm[0][1]  # FP wrongly predicted as pos
    ct_act1_pred0 = cm[1][0]  # FN wrongly predicted as neg
    ct_act1_pred1 = cm[1][1]  # TP

    print("actual 0 | %4d %4d" % (ct_act0_pred0, ct_act0_pred1))
    print("actual 1 | %4d %4d" % (ct_act1_pred0, ct_act1_pred1))
    print(" ----------")
    print("predicted 0 1")

# -----------------------------------------------------------

def main():
    # 0. get ready
    print("\nBegin logistic regression with scikit ")
    np.random.seed(1)

    # 1. load data
    print("\nLoading data into memory ")
    train_file = ".\\Data\\people_train.txt"
    train_xy = np.loadtxt(train_file, usecols=range(0,9),
        delimiter="\t", comments="#", dtype=np.float32)
    train_x = train_xy[:,1:9]  # columns 1-8 are the predictors
    train_y = train_xy[:,0]    # column 0 is the gender label

    test_file = ".\\Data\\people_test.txt"
    test_xy = np.loadtxt(test_file, usecols=range(0,9),
        delimiter="\t", comments="#", dtype=np.float32)
    test_x = test_xy[:,1:9]
    test_y = test_xy[:,0]

    print("\nTraining data:")
    print(train_x[0:4])
    print(". . . \n")
    print(train_y[0:4])
    print(". . . ")

    # 2. create model and train
    # penalty='none' is the scikit 0.22.1 spelling;
    # newer releases use penalty=None instead
    print("\nCreating logistic regression model")
    model = LogisticRegression(random_state=0, solver='sag',
        max_iter=1000, penalty='none')
    model.fit(train_x, train_y)

    # 3. evaluate
    print("\nComputing model accuracy ")
    acc_train = model.score(train_x, train_y)
    print("Accuracy on training = %0.4f " % acc_train)
    acc_test = model.score(test_x, test_y)
    print("Accuracy on test = %0.4f " % acc_test)

    y_predicteds = model.predict(test_x)
    precision = precision_score(test_y, y_predicteds)
    print("Precision on test = %0.4f " % precision)

    # 4. make a prediction
    print("\nPredict age 36, Oklahoma, $50K, moderate ")
    x = np.array([[0.36, 0,0,1, 0.5000, 0,1,0]],
        dtype=np.float32)
    p = model.predict_proba(x)
    p = p[0][1]  # first (only) row, second value P(1)
    print("\nPrediction prob = %0.6f " % p)
    if p < 0.5:
        print("Prediction = male ")
    else:
        print("Prediction = female ")

    # 5. save model
    print("\nSaving trained logistic regression model ")
    path = ".\\Models\\people_scikit_model.sav"
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # with open(path, 'rb') as f:
    #     loaded_model = pickle.load(f)
    # pa = loaded_model.predict_proba(x)
    # print(pa)

    # 6. confusion matrix with labels
    cm = confusion_matrix(test_y, y_predicteds)
    print("\nConfusion matrix raw: ")
    print(cm)
    print("\nConfusion matrix custom: ")
    show_confusion(cm)

    print("\nEnd People logistic regression demo ")

if __name__ == "__main__":
    main()
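The accuracy and precision values that the demo prints can also be derived from the four confusion matrix counts. Here is a minimal sketch using the same row and column layout as the demo (rows are actual classes, columns are predicted classes). The function name metrics_from_confusion is hypothetical, and the counts are made up for illustration; they are not the demo's results.

```python
def metrics_from_confusion(cm):
    # cm[i][j] = count of items with actual class i, predicted class j
    tn, fp = cm[0][0], cm[0][1]
    fn, tp = cm[1][0], cm[1][1]
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # guard against division by zero when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# made-up counts for illustration only (not the demo's output)
cm = [[50, 10],
      [ 5, 35]]
acc, prec, rec = metrics_from_confusion(cm)
print("accuracy  = %0.4f" % acc)
print("precision = %0.4f" % prec)
print("recall    = %0.4f" % rec)
```

Precision answers "of the items predicted as class 1, how many really are class 1", so it uses only the second column of the matrix, while accuracy uses all four counts.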
