The scikit-learn library was originally designed for classical machine learning techniques such as logistic regression and naive Bayes classification. The library eventually added the ability to do binary and multi-class classification via the MLPClassifier (multi-layer perceptron) class, and regression via the MLPRegressor class. As best I can determine by wading through the scikit change logs, these two classes were added in version 0.18 in late 2016.
I decided to take a look at regression using the scikit MLPRegressor class.
In my work environment, when I need to tackle a regression problem (i.e., predict a single numeric value such as a person's annual income), I use PyTorch. PyTorch is very complex, but it gives me the flexibility I need, and it can do much more sophisticated things than scikit, notably image classification, natural language processing, unsupervised anomaly detection, and Transformer architecture systems.
But scikit is easy to use and makes sense in some scenarios.
My data is synthetic and looks like:
 1  0.24  1 0 0  0.2950  0 0 1
-1  0.39  0 0 1  0.5120  0 1 0
 1  0.63  0 1 0  0.7580  1 0 0
-1  0.36  1 0 0  0.4450  0 1 0
 1  0.27  0 1 0  0.2860  0 0 1
. . .
There are 200 training items and 40 test items.
Column [0] is sex (M = -1, F = +1). Column [1] is age, normalized by dividing by 100. Columns [2,3,4] are the State, one-hot encoded (Michigan = 100, Nebraska = 010, Oklahoma = 001). Column [5] is annual income, divided by $100,000; this is the value to predict. Columns [6,7,8] are political leaning (conservative = 100, moderate = 010, liberal = 001).
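To make the encoding concrete, here is a minimal sketch of how one raw record could be converted to the normalized format. The encode_person() helper and the raw input values are hypothetical, just for illustration:

# hypothetical helper -- not part of the demo program
def encode_person(sex, age, state, income, politics):
  sex_enc = -1 if sex == "M" else 1  # M = -1, F = +1
  age_enc = age / 100.0              # normalize age
  states = { "michigan":[1,0,0], "nebraska":[0,1,0],
    "oklahoma":[0,0,1] }
  pols = { "conservative":[1,0,0], "moderate":[0,1,0],
    "liberal":[0,0,1] }
  inc_enc = income / 100_000.0       # normalize income
  return [sex_enc, age_enc] + states[state] + \
    [inc_enc] + pols[politics]

# encode_person("M", 34, "oklahoma", 47000, "moderate")
# returns [-1, 0.34, 0, 0, 1, 0.47, 0, 1, 0]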
Setting up a scikit MLP regressor is daunting because there are a lot of parameters:
params = { 'hidden_layer_sizes' : [10,10],
  'activation' : 'relu',
  'solver' : 'adam',
  'alpha' : 0.0,
  'batch_size' : 10,
  'random_state' : 0,
  'tol' : 0.0001,
  'nesterovs_momentum' : False,
  'learning_rate' : 'constant',
  'learning_rate_init' : 0.01,
  'max_iter' : 1000,
  'shuffle' : True,
  'n_iter_no_change' : 50,
  'verbose' : False }

print("Creating 8-(10-10)-1 relu neural network ")
net = MLPRegressor(**params)
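Any parameter you don't set explicitly gets the scikit default value. One way to see every parameter and its current value is the standard scikit get_params() method; a quick sketch:

net = MLPRegressor()   # all default parameter values
print(net.get_params())
# shows hidden_layer_sizes=(100,), activation='relu',
# solver='adam', alpha=0.0001, and so on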
My demo implements a program-defined accuracy() function. Most scikit classes have a score() function, but for MLPRegressor, score() returns the coefficient of determination (R-squared) rather than simple accuracy. With regression you must define what a correct prediction is, for example, a predicted value within 10% of the true target value.
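A quick sketch of the distinction, assuming a trained net object and the demo test data:

# built-in score() gives R-squared, which is hard to
# interpret as a percentage of correct predictions
r2 = net.score(test_x, test_y)

# program-defined accuracy: fraction of predictions
# within 10% of the true income
acc = accuracy(net, test_x, test_y, pct_close=0.10)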
Good fun.
Predicting income is a difficult task with real data. For jobs that rely mostly on tips, such as golf course beverage cart driver, predicting income is especially difficult. I suspect the cart driver on the left makes more money from tips than the cart driver on the right.
Demo code.
# people_income_nn_sckit.py
# predict income from sex, age, state, politics

# sex  age  state  income  politics
#  1  0.24  1 0 0  0.2950  0 0 1
# -1  0.39  0 0 1  0.5120  0 1 0
# state: michigan = 100, nebraska = 010, oklahoma = 001
# conservative = 100, moderate = 010, liberal = 001

# Anaconda3-2020.02  Python 3.7.6  scikit 0.22.1
# Windows 10/11

import numpy as np
from sklearn.neural_network import MLPRegressor
import warnings
warnings.filterwarnings('ignore')  # early-stop warnings

# ---------------------------------------------------------

def accuracy(model, data_x, data_y, pct_close=0.10):
  # accuracy predicted within pct_close of actual income
  # item-by-item allows inspection but is slow
  n_correct = 0; n_wrong = 0
  predicteds = model.predict(data_x)  # all predicteds
  for i in range(len(predicteds)):
    actual = data_y[i]
    pred = predicteds[i]
    if np.abs(pred - actual) < np.abs(pct_close * actual):
      n_correct += 1
    else:
      n_wrong += 1
  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# ---------------------------------------------------------

def accuracy_q(model, data_x, data_y, pct_close=0.10):
  # accuracy within pct_close of actual income
  # all-at-once is quick
  n_items = len(data_y)
  preds = model.predict(data_x)  # all predicteds
  n_correct = np.sum((np.abs(preds - data_y) < \
    np.abs(pct_close * data_y)))
  result = (n_correct / n_items)
  return result

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit neural network regression example ")
  print("Predict income from sex, age, State, politics ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # 1. load data
  print("\nLoading data into memory ")
  train_file = ".\\Data\\people_train.txt"
  train_xy = np.loadtxt(train_file, usecols=range(0,9),
    delimiter="\t", comments="#", dtype=np.float32)
  train_x = train_xy[:,[0,1,2,3,4,6,7,8]]
  train_y = train_xy[:,5]

  test_file = ".\\Data\\people_test.txt"
  test_xy = np.loadtxt(test_file, usecols=range(0,9),
    delimiter="\t", comments="#", dtype=np.float32)
  test_x = test_xy[:,[0,1,2,3,4,6,7,8]]
  test_y = test_xy[:,5]

  print("\nTraining data:")
  print(train_x[0:4])
  print(". . . \n")
  print(train_y[0:4])
  print(". . . ")

# ---------------------------------------------------------

  # 2. create network
  # MLPRegressor(hidden_layer_sizes=(100,),
  #  activation='relu', *, solver='adam', alpha=0.0001,
  #  batch_size='auto', learning_rate='constant',
  #  learning_rate_init=0.001, power_t=0.5, max_iter=200,
  #  shuffle=True, random_state=None, tol=0.0001,
  #  verbose=False, warm_start=False, momentum=0.9,
  #  nesterovs_momentum=True, early_stopping=False,
  #  validation_fraction=0.1, beta_1=0.9, beta_2=0.999,
  #  epsilon=1e-08, n_iter_no_change=10, max_fun=15000)

  params = { 'hidden_layer_sizes' : [10,10],
    'activation' : 'relu',
    'solver' : 'adam',
    'alpha' : 0.0,
    'batch_size' : 10,
    'random_state' : 0,
    'tol' : 0.0001,
    'nesterovs_momentum' : False,
    'learning_rate' : 'constant',
    'learning_rate_init' : 0.01,
    'max_iter' : 1000,
    'shuffle' : True,
    'n_iter_no_change' : 50,
    'verbose' : False }

  print("\nCreating 8-(10-10)-1 relu neural network ")
  net = MLPRegressor(**params)

# ---------------------------------------------------------

  # 3. train
  print("\nTraining with bat sz = " + \
    str(params['batch_size']) + " lrn rate = " + \
    str(params['learning_rate_init']) + " ")
  print("Stop if no change " + \
    str(params['n_iter_no_change']) + " iterations ")
  net.fit(train_x, train_y)
  print("Done ")

# ---------------------------------------------------------

  # 4. evaluate model
  # score() is coefficient of determination for MLPRegressor
  print("\nCompute model accuracy (within 0.10 of actual) ")
  acc_train = accuracy(net, train_x, train_y, 0.10)
  print("\nAccuracy on train = %0.4f " % acc_train)
  acc_test = accuracy(net, test_x, test_y, 0.10)
  print("Accuracy on test = %0.4f " % acc_test)

  # print("\nModel accuracy quick (within 0.10 of actual) ")
  # acc_train = accuracy_q(net, train_x, train_y, 0.10)
  # print("\nAccuracy on train = %0.4f " % acc_train)
  # acc_test = accuracy_q(net, test_x, test_y, 0.10)
  # print("Accuracy on test = %0.4f " % acc_test)

# ---------------------------------------------------------

  # 5. use model
  # no predict_proba() for MLPRegressor
  print("\nSetting X = M 34 Oklahoma moderate: ")
  X = np.array([[-1, 0.34, 0,0,1,  0,1,0]])
  income = net.predict(X)  # divided by 100,000
  income *= 100000  # denormalize
  print("Predicted income: %0.2f " % income[0])

# ---------------------------------------------------------

  # 6. TODO: save model using pickle
  # import pickle
  # print("Saving trained network ")
  # path = ".\\Models\\people_income_net.sav"
  # pickle.dump(net, open(path, "wb"))

  # use saved model
  # X = np.array([[-1, 0.34, 0,0,1,  0,1,0]],
  #   dtype=np.float32)
  # with open(path, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # inc = loaded_model.predict(X)
  # print(inc)

  print("\nEnd scikit neural network regression demo ")

if __name__ == "__main__":
  main()
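As an alternative to pickle, the scikit documentation suggests the joblib library for saving a trained model. A minimal sketch, assuming a trained net object (the file path is hypothetical):

from joblib import dump, load

dump(net, ".\\Models\\people_income_net.joblib")    # save
net2 = load(".\\Models\\people_income_net.joblib")  # re-load
# net2.predict(X) gives the same result as net.predict(X)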
Training data. Replace commas with tabs or modify program.
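A minimal sketch of the conversion, which reads a comma-delimited file and writes the tab-delimited version the demo expects (the source file name is hypothetical):

# convert comma-delimited data to tab-delimited
with open("people_train_comma.txt", "r") as fin, \
     open("people_train.txt", "w") as fout:
  for line in fin:
    fout.write(line.replace(",", "\t"))

Alternatively, change delimiter="\t" to delimiter="," in the two np.loadtxt() calls.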
# people_train.txt
#
# sex (-1 = male, 1 = female), age / 100,
# state (michigan = 100, nebraska = 010, oklahoma = 001)
# income / 100_000,
# conservative = 100, moderate = 010, liberal = 001
#
1,0.24,1,0,0,0.2950,0,0,1
-1,0.39,0,0,1,0.5120,0,1,0
1,0.63,0,1,0,0.7580,1,0,0
-1,0.36,1,0,0,0.4450,0,1,0
1,0.27,0,1,0,0.2860,0,0,1
1,0.50,0,1,0,0.5650,0,1,0
1,0.50,0,0,1,0.5500,0,1,0
-1,0.19,0,0,1,0.3270,1,0,0
1,0.22,0,1,0,0.2770,0,1,0
-1,0.39,0,0,1,0.4710,0,0,1
1,0.34,1,0,0,0.3940,0,1,0
-1,0.22,1,0,0,0.3350,1,0,0
1,0.35,0,0,1,0.3520,0,0,1
-1,0.33,0,1,0,0.4640,0,1,0
1,0.45,0,1,0,0.5410,0,1,0
1,0.42,0,1,0,0.5070,0,1,0
-1,0.33,0,1,0,0.4680,0,1,0
1,0.25,0,0,1,0.3000,0,1,0
-1,0.31,0,1,0,0.4640,1,0,0
1,0.27,1,0,0,0.3250,0,0,1
1,0.48,1,0,0,0.5400,0,1,0
-1,0.64,0,1,0,0.7130,0,0,1
1,0.61,0,1,0,0.7240,1,0,0
1,0.54,0,0,1,0.6100,1,0,0
1,0.29,1,0,0,0.3630,1,0,0
1,0.50,0,0,1,0.5500,0,1,0
1,0.55,0,0,1,0.6250,1,0,0
1,0.40,1,0,0,0.5240,1,0,0
1,0.22,1,0,0,0.2360,0,0,1
1,0.68,0,1,0,0.7840,1,0,0
-1,0.60,1,0,0,0.7170,0,0,1
-1,0.34,0,0,1,0.4650,0,1,0
-1,0.25,0,0,1,0.3710,1,0,0
-1,0.31,0,1,0,0.4890,0,1,0
1,0.43,0,0,1,0.4800,0,1,0
1,0.58,0,1,0,0.6540,0,0,1
-1,0.55,0,1,0,0.6070,0,0,1
-1,0.43,0,1,0,0.5110,0,1,0
-1,0.43,0,0,1,0.5320,0,1,0
-1,0.21,1,0,0,0.3720,1,0,0
1,0.55,0,0,1,0.6460,1,0,0
1,0.64,0,1,0,0.7480,1,0,0
-1,0.41,1,0,0,0.5880,0,1,0
1,0.64,0,0,1,0.7270,1,0,0
-1,0.56,0,0,1,0.6660,0,0,1
1,0.31,0,0,1,0.3600,0,1,0
-1,0.65,0,0,1,0.7010,0,0,1
1,0.55,0,0,1,0.6430,1,0,0
-1,0.25,1,0,0,0.4030,1,0,0
1,0.46,0,0,1,0.5100,0,1,0
-1,0.36,1,0,0,0.5350,1,0,0
1,0.52,0,1,0,0.5810,0,1,0
1,0.61,0,0,1,0.6790,1,0,0
1,0.57,0,0,1,0.6570,1,0,0
-1,0.46,0,1,0,0.5260,0,1,0
-1,0.62,1,0,0,0.6680,0,0,1
1,0.55,0,0,1,0.6270,1,0,0
-1,0.22,0,0,1,0.2770,0,1,0
-1,0.50,1,0,0,0.6290,1,0,0
-1,0.32,0,1,0,0.4180,0,1,0
-1,0.21,0,0,1,0.3560,1,0,0
1,0.44,0,1,0,0.5200,0,1,0
1,0.46,0,1,0,0.5170,0,1,0
1,0.62,0,1,0,0.6970,1,0,0
1,0.57,0,1,0,0.6640,1,0,0
-1,0.67,0,0,1,0.7580,0,0,1
1,0.29,1,0,0,0.3430,0,0,1
1,0.53,1,0,0,0.6010,1,0,0
-1,0.44,1,0,0,0.5480,0,1,0
1,0.46,0,1,0,0.5230,0,1,0
-1,0.20,0,1,0,0.3010,0,1,0
-1,0.38,1,0,0,0.5350,0,1,0
1,0.50,0,1,0,0.5860,0,1,0
1,0.33,0,1,0,0.4250,0,1,0
-1,0.33,0,1,0,0.3930,0,1,0
1,0.26,0,1,0,0.4040,1,0,0
1,0.58,1,0,0,0.7070,1,0,0
1,0.43,0,0,1,0.4800,0,1,0
-1,0.46,1,0,0,0.6440,1,0,0
1,0.60,1,0,0,0.7170,1,0,0
-1,0.42,1,0,0,0.4890,0,1,0
-1,0.56,0,0,1,0.5640,0,0,1
-1,0.62,0,1,0,0.6630,0,0,1
-1,0.50,1,0,0,0.6480,0,1,0
1,0.47,0,0,1,0.5200,0,1,0
-1,0.67,0,1,0,0.8040,0,0,1
-1,0.40,0,0,1,0.5040,0,1,0
1,0.42,0,1,0,0.4840,0,1,0
1,0.64,1,0,0,0.7200,1,0,0
-1,0.47,1,0,0,0.5870,0,0,1
1,0.45,0,1,0,0.5280,0,1,0
-1,0.25,0,0,1,0.4090,1,0,0
1,0.38,1,0,0,0.4840,1,0,0
1,0.55,0,0,1,0.6000,0,1,0
-1,0.44,1,0,0,0.6060,0,1,0
1,0.33,1,0,0,0.4100,0,1,0
1,0.34,0,0,1,0.3900,0,1,0
1,0.27,0,1,0,0.3370,0,0,1
1,0.32,0,1,0,0.4070,0,1,0
1,0.42,0,0,1,0.4700,0,1,0
-1,0.24,0,0,1,0.4030,1,0,0
1,0.42,0,1,0,0.5030,0,1,0
1,0.25,0,0,1,0.2800,0,0,1
1,0.51,0,1,0,0.5800,0,1,0
-1,0.55,0,1,0,0.6350,0,0,1
1,0.44,1,0,0,0.4780,0,0,1
-1,0.18,1,0,0,0.3980,1,0,0
-1,0.67,0,1,0,0.7160,0,0,1
1,0.45,0,0,1,0.5000,0,1,0
1,0.48,1,0,0,0.5580,0,1,0
-1,0.25,0,1,0,0.3900,0,1,0
-1,0.67,1,0,0,0.7830,0,1,0
1,0.37,0,0,1,0.4200,0,1,0
-1,0.32,1,0,0,0.4270,0,1,0
1,0.48,1,0,0,0.5700,0,1,0
-1,0.66,0,0,1,0.7500,0,0,1
1,0.61,1,0,0,0.7000,1,0,0
-1,0.58,0,0,1,0.6890,0,1,0
1,0.19,1,0,0,0.2400,0,0,1
1,0.38,0,0,1,0.4300,0,1,0
-1,0.27,1,0,0,0.3640,0,1,0
1,0.42,1,0,0,0.4800,0,1,0
1,0.60,1,0,0,0.7130,1,0,0
-1,0.27,0,0,1,0.3480,1,0,0
1,0.29,0,1,0,0.3710,1,0,0
-1,0.43,1,0,0,0.5670,0,1,0
1,0.48,1,0,0,0.5670,0,1,0
1,0.27,0,0,1,0.2940,0,0,1
-1,0.44,1,0,0,0.5520,1,0,0
1,0.23,0,1,0,0.2630,0,0,1
-1,0.36,0,1,0,0.5300,0,0,1
1,0.64,0,0,1,0.7250,1,0,0
1,0.29,0,0,1,0.3000,0,0,1
-1,0.33,1,0,0,0.4930,0,1,0
-1,0.66,0,1,0,0.7500,0,0,1
-1,0.21,0,0,1,0.3430,1,0,0
1,0.27,1,0,0,0.3270,0,0,1
1,0.29,1,0,0,0.3180,0,0,1
-1,0.31,1,0,0,0.4860,0,1,0
1,0.36,0,0,1,0.4100,0,1,0
1,0.49,0,1,0,0.5570,0,1,0
-1,0.28,1,0,0,0.3840,1,0,0
-1,0.43,0,0,1,0.5660,0,1,0
-1,0.46,0,1,0,0.5880,0,1,0
1,0.57,1,0,0,0.6980,1,0,0
-1,0.52,0,0,1,0.5940,0,1,0
-1,0.31,0,0,1,0.4350,0,1,0
-1,0.55,1,0,0,0.6200,0,0,1
1,0.50,1,0,0,0.5640,0,1,0
1,0.48,0,1,0,0.5590,0,1,0
-1,0.22,0,0,1,0.3450,1,0,0
1,0.59,0,0,1,0.6670,1,0,0
1,0.34,1,0,0,0.4280,0,0,1
-1,0.64,1,0,0,0.7720,0,0,1
1,0.29,0,0,1,0.3350,0,0,1
-1,0.34,0,1,0,0.4320,0,1,0
-1,0.61,1,0,0,0.7500,0,0,1
1,0.64,0,0,1,0.7110,1,0,0
-1,0.29,1,0,0,0.4130,1,0,0
1,0.63,0,1,0,0.7060,1,0,0
-1,0.29,0,1,0,0.4000,1,0,0
-1,0.51,1,0,0,0.6270,0,1,0
-1,0.24,0,0,1,0.3770,1,0,0
1,0.48,0,1,0,0.5750,0,1,0
1,0.18,1,0,0,0.2740,1,0,0
1,0.18,1,0,0,0.2030,0,0,1
1,0.33,0,1,0,0.3820,0,0,1
-1,0.20,0,0,1,0.3480,1,0,0
1,0.29,0,0,1,0.3300,0,0,1
-1,0.44,0,0,1,0.6300,1,0,0
-1,0.65,0,0,1,0.8180,1,0,0
-1,0.56,1,0,0,0.6370,0,0,1
-1,0.52,0,0,1,0.5840,0,1,0
-1,0.29,0,1,0,0.4860,1,0,0
-1,0.47,0,1,0,0.5890,0,1,0
1,0.68,1,0,0,0.7260,0,0,1
1,0.31,0,0,1,0.3600,0,1,0
1,0.61,0,1,0,0.6250,0,0,1
1,0.19,0,1,0,0.2150,0,0,1
1,0.38,0,0,1,0.4300,0,1,0
-1,0.26,1,0,0,0.4230,1,0,0
1,0.61,0,1,0,0.6740,1,0,0
1,0.40,1,0,0,0.4650,0,1,0
-1,0.49,1,0,0,0.6520,0,1,0
1,0.56,1,0,0,0.6750,1,0,0
-1,0.48,0,1,0,0.6600,0,1,0
1,0.52,1,0,0,0.5630,0,0,1
-1,0.18,1,0,0,0.2980,1,0,0
-1,0.56,0,0,1,0.5930,0,0,1
-1,0.52,0,1,0,0.6440,0,1,0
-1,0.18,0,1,0,0.2860,0,1,0
-1,0.58,1,0,0,0.6620,0,0,1
-1,0.39,0,1,0,0.5510,0,1,0
-1,0.46,1,0,0,0.6290,0,1,0
-1,0.40,0,1,0,0.4620,0,1,0
-1,0.60,1,0,0,0.7270,0,0,1
1,0.36,0,1,0,0.4070,0,0,1
1,0.44,1,0,0,0.5230,0,1,0
1,0.28,1,0,0,0.3130,0,0,1
1,0.54,0,0,1,0.6260,1,0,0
Test data.
# people_test.txt
#
-1,0.51,1,0,0,0.6120,0,1,0
-1,0.32,0,1,0,0.4610,0,1,0
1,0.55,1,0,0,0.6270,1,0,0
1,0.25,0,0,1,0.2620,0,0,1
1,0.33,0,0,1,0.3730,0,0,1
-1,0.29,0,1,0,0.4620,1,0,0
1,0.65,1,0,0,0.7270,1,0,0
-1,0.43,0,1,0,0.5140,0,1,0
-1,0.54,0,1,0,0.6480,0,0,1
1,0.61,0,1,0,0.7270,1,0,0
1,0.52,0,1,0,0.6360,1,0,0
1,0.3,0,1,0,0.3350,0,0,1
1,0.29,1,0,0,0.3140,0,0,1
-1,0.47,0,0,1,0.5940,0,1,0
1,0.39,0,1,0,0.4780,0,1,0
1,0.47,0,0,1,0.5200,0,1,0
-1,0.49,1,0,0,0.5860,0,1,0
-1,0.63,0,0,1,0.6740,0,0,1
-1,0.3,1,0,0,0.3920,1,0,0
-1,0.61,0,0,1,0.6960,0,0,1
-1,0.47,0,0,1,0.5870,0,1,0
1,0.3,0,0,1,0.3450,0,0,1
-1,0.51,0,0,1,0.5800,0,1,0
-1,0.24,1,0,0,0.3880,0,1,0
-1,0.49,1,0,0,0.6450,0,1,0
1,0.66,0,0,1,0.7450,1,0,0
-1,0.65,1,0,0,0.7690,1,0,0
-1,0.46,0,1,0,0.5800,1,0,0
-1,0.45,0,0,1,0.5180,0,1,0
-1,0.47,1,0,0,0.6360,1,0,0
-1,0.29,1,0,0,0.4480,1,0,0
-1,0.57,0,0,1,0.6930,0,0,1
-1,0.2,1,0,0,0.2870,0,0,1
-1,0.35,1,0,0,0.4340,0,1,0
-1,0.61,0,0,1,0.6700,0,0,1
-1,0.31,0,0,1,0.3730,0,1,0
1,0.18,1,0,0,0.2080,0,0,1
1,0.26,0,0,1,0.2920,0,0,1
-1,0.28,1,0,0,0.3640,0,0,1
-1,0.59,0,0,1,0.6940,0,0,1