Naive Bayes classification is a classical machine learning technique. It is best used when the predictor variables are all non-numeric. Naive Bayes works for both binary classification and multi-class classification. And naive Bayes works well when you don’t have very much training data.

I coded up a demo of naive Bayes using the scikit-learn library. The ideas are best explained by an example.

Consider the following data:

    actuary  green  korea  F
    barista  green  italy  M
    dentist  hazel  japan  M
    dentist  green  italy  F
    chemist  hazel  japan  M
    actuary  green  japan  F
    actuary  hazel  japan  M
    chemist  green  italy  F
    chemist  green  italy  F
    dentist  green  japan  F
    barista  hazel  japan  M
    dentist  green  japan  F
    dentist  green  japan  F
    chemist  green  italy  F
    dentist  green  japan  M
    dentist  hazel  japan  M
    chemist  green  korea  F
    barista  green  japan  F
    actuary  hazel  italy  F
    actuary  green  italy  M

The columns are job type, eye color, country, and sex. The overall goal is to predict sex from job, eye, and country.

Suppose you want to predict the sex of a person who is (dentist, hazel, italy). If you look just at the dentists in the job column, 3 of the 7 dentists are male, and 4 of the 7 are female. So you’d (weakly) guess female. If you look just at the hazel values in the eye color column, 5 of 6 people are male and just 1 of 6 is female. So you’d strongly guess male. If you look just at the italy values in the country column, 2 of 7 people are male and 5 of 7 are female. So you’d guess female.
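The counts above can be checked with a few lines of plain Python. This is just a sketch for verifying the arithmetic; the `ROWS` list and the `sex_counts()` helper are names I made up for illustration and are not part of the demo program.

```python
from collections import Counter

# The 20 training rows from the data above: job, eye, country, sex.
ROWS = [line.split() for line in """
actuary green korea F
barista green italy M
dentist hazel japan M
dentist green italy F
chemist hazel japan M
actuary green japan F
actuary hazel japan M
chemist green italy F
chemist green italy F
dentist green japan F
barista hazel japan M
dentist green japan F
dentist green japan F
chemist green italy F
dentist green japan M
dentist hazel japan M
chemist green korea F
barista green japan F
actuary hazel italy F
actuary green italy M
""".strip().splitlines()]

COLUMNS = {"job": 0, "eye": 1, "country": 2}

def sex_counts(column, value):
    """Count F and M among the rows whose given column equals value."""
    idx = COLUMNS[column]
    return Counter(row[3] for row in ROWS if row[idx] == value)

for column, value in [("job", "dentist"), ("eye", "hazel"), ("country", "italy")]:
    counts = sex_counts(column, value)
    total = sum(counts.values())
    print(f"{value}: {counts['M']} of {total} male, {counts['F']} of {total} female")
```

Each printed line looks at one column in isolation, which is exactly the information naive Bayes starts from.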

The naive Bayes algorithm combines these frequencies to produce probabilities of male and female. The technique is called naive because it assumes the predictor columns are independent of one another given the class, so it doesn’t take interactions between columns into account.
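To make the combining step concrete, here is a minimal sketch of the arithmetic in plain Python. It assumes add-one (Laplace) smoothing with alpha = 1 on the per-column frequencies and unsmoothed class priors, which is my understanding of how the demo's `CategoricalNB(alpha=1)` setting behaves. The `naive_bayes_probs()` function and `ROWS` list are hypothetical names for illustration, not the library's internals.

```python
from collections import Counter

# The 20 training rows from the data above: job, eye, country, sex.
ROWS = [line.split() for line in """
actuary green korea F
barista green italy M
dentist hazel japan M
dentist green italy F
chemist hazel japan M
actuary green japan F
actuary hazel japan M
chemist green italy F
chemist green italy F
dentist green japan F
barista hazel japan M
dentist green japan F
dentist green japan F
chemist green italy F
dentist green japan M
dentist hazel japan M
chemist green korea F
barista green japan F
actuary hazel italy F
actuary green italy M
""".strip().splitlines()]

def naive_bayes_probs(item, rows, alpha=1.0):
    """Return {class: probability} for item = (job, eye, country)."""
    class_counts = Counter(row[-1] for row in rows)
    # Number of distinct values seen in each predictor column.
    n_values = [len({row[j] for row in rows}) for j in range(len(item))]
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # class prior, e.g. 12/20 for F
        for j, value in enumerate(item):
            joint = sum(1 for row in rows
                        if row[j] == value and row[-1] == label)
            # Smoothed conditional frequency P(value | class).
            score *= (joint + alpha) / (n_label + alpha * n_values[j])
        scores[label] = score
    total = sum(scores.values())  # normalize so the results sum to 1
    return {label: s / total for label, s in scores.items()}

probs = naive_bayes_probs(("dentist", "hazel", "italy"), ROWS)
print({label: round(p, 2) for label, p in probs.items()})
# {'F': 0.33, 'M': 0.67}
```

For female the product is (12/20)(5/16)(2/14)(6/15) and for male it is (8/20)(4/12)(6/10)(3/11). After normalizing, male comes out at about 0.67.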

I coded up this problem using the scikit-learn library. The first (and most time-consuming) step is to convert the categorical values to integers: actuary = 0, barista = 1, chemist = 2, dentist = 3; green = 0, hazel = 1; italy = 0, japan = 1, korea = 2; female = 0, male = 1. In general it’s a good idea to encode using alphabetical order because scikit has an OrdinalEncoder class that encodes that way.
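The alphabetical mapping is easy to sketch by hand. The snippet below is an illustration only: `build_codes()` is a hypothetical helper, and it uses just four of the raw rows. Python's `sorted()` orders strings by code point, which matches alphabetical order for these values.

```python
def build_codes(values):
    """Map each distinct string to its index in sorted (alphabetical) order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Four raw rows that happen to cover every category value.
raw_rows = [
    ("actuary", "green", "korea", "F"),
    ("barista", "green", "italy", "M"),
    ("dentist", "hazel", "japan", "M"),
    ("chemist", "hazel", "japan", "M"),
]

# One lookup table per column, built from that column's distinct values.
codes = [build_codes(column) for column in zip(*raw_rows)]
encoded = [[codes[j][v] for j, v in enumerate(row)] for row in raw_rows]

print(codes[0])  # {'actuary': 0, 'barista': 1, 'chemist': 2, 'dentist': 3}
print(encoded)   # [[0, 0, 2, 0], [1, 0, 0, 1], [3, 1, 1, 1], [2, 1, 1, 1]]
```

Because F sorts before M, the sex column encodes as female = 0 and male = 1 without any special handling.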

The encoded data looks like:

    # job_eye_country_sex.txt
    # actuary=0, barista=1, chemist=2, dentist=3
    # green=0, hazel=1
    # italy=0, japan=1, korea=2
    # female=0, male=1
    #
    0  0  2  0
    1  0  0  1
    3  1  1  1
    3  0  0  0
    2  1  1  1
    0  0  1  0
    0  1  1  1
    2  0  0  0
    2  0  0  0
    3  0  1  0
    1  1  1  1
    3  0  1  0
    3  0  1  0
    2  0  0  0
    3  0  1  1
    3  1  1  1
    2  0  2  0
    1  0  1  0
    0  1  0  0
    0  0  0  1

I saved the encoded data as job_eye_country_sex.txt to be used as training data. I didn’t use any test data as I would in a non-demo scenario.

After reading the data into memory, the key statements to create a naive Bayes classifier are:

    from sklearn.naive_bayes import CategoricalNB

    print("Creating naive Bayes classifier ")
    model = CategoricalNB(alpha=1)
    model.fit(X, y)
    print("Done ")

In my demo, the predicted probabilities for (dentist, hazel, italy) are [0.33, 0.67], so the predicted sex is male because the value at index [1] (0.67) is larger than the value at index [0] (0.33).

Instead of preprocessing the raw data by converting strings to integers, it is possible to programmatically encode raw string data:

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    print("\nReading raw data using genfromtxt() ")
    train_file = ".\\Data\\job_eye_country_sex_raw.txt"
    XY = np.genfromtxt(train_file, usecols=range(0,4),
        delimiter="\t", dtype=str)

    print("\nEncoding the data: ")
    enc = OrdinalEncoder(dtype=np.int64)
    enc.fit(XY)  # scan data
    print("\nCategories: ")
    print(enc.categories_)
    XY = enc.transform(XY)  # encode data
    X = XY[:,0:3]  # first three columns are the predictors
    y = XY[:,3]    # last column is the class to predict
    # now good to go

Naive Bayes is best used for categorical data. If a column is numeric, the values can be bucketed and the buckets encoded as integers. There is a GaussianNB class that can handle numeric data, but that algorithm makes several assumptions, such as that the values in each column follow a Gaussian (Normal, bell-shaped) distribution. I don’t recommend using naive Bayes directly on numeric data.
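Bucketing a numeric column takes only a few lines. The sketch below uses the standard library's `bisect` module. The age column, the `AGE_EDGES` boundaries, and the `bucket()` helper are all hypothetical and not part of the demo data.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries for an age column (not from the demo data):
# under 30 -> 0, 30 through 49 -> 1, 50 and over -> 2.
AGE_EDGES = [30, 50]

def bucket(value, edges):
    """Return the index of the bucket that value falls into."""
    return bisect_right(edges, value)

ages = [22, 30, 47, 50, 68]
age_codes = [bucket(a, AGE_EDGES) for a in ages]
print(age_codes)  # [0, 1, 1, 2, 2]
```

The resulting integer codes can be used as just another categorical column. Where to put the boundaries is a judgment call that depends on the data.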

*The 1990s had some very strange but wonderful animated cartoon shows on TV. Here are three of my favorites. All three featured a naive character.*

*Left: “Aaahh!!! Real Monsters” (1994) features three young but nice monsters: Oblina (like a weird candy cane), Krumm (a hulking but naive monster who holds his eyes in his hands), and Ickis (sort of a demonic rabbit).*

*Center: “Rocko’s Modern Life” (1993) features the surreal life of an Australian wallaby named Rocko and his friends, including a naive steer named Heffer Wolfe and a neurotic turtle named Filburt.*

*Right: “CatDog” (1998) features the life of conjoined brothers of different species. The cat half is cynical; the dog half is naive.*

Demo code:

    # naive_bayes.py
    # Anaconda3-2020.02 Python 3.7.6
    # scikit 0.22.1
    # Windows 10/11

    import numpy as np
    from sklearn.naive_bayes import CategoricalNB

    # ---------------------------------------------------------

    def main():
        # 0. prepare
        print("\nBegin scikit naive Bayes demo ")
        print("Predict sex (F = 0, M = 1) from job, eye, country ")
        np.random.seed(1)

        # actuary green korea F
        # barista green italy M
        # dentist hazel japan M
        # . . .
        # actuary = 0, barista = 1, chemist = 2, dentist = 3
        # green = 0, hazel = 1
        # italy = 0, japan = 1, korea = 2

        # 1. load data
        print("\nLoading train data ")
        train_file = ".\\Data\\job_eye_country_sex.txt"
        X = np.loadtxt(train_file, usecols=range(0,3),
            delimiter="\t", comments="#", dtype=np.int64)
        y = np.loadtxt(train_file, usecols=3,
            delimiter="\t", comments="#", dtype=np.int64)
        # print(y.shape)  # 1D is required
        # y = y.flatten()
        # y = y.reshape(-1)
        # y = y.squeeze()
        print("Done ")

        print("\nDiscretized features: ")
        print(X)

        print("\nActual classes: ")
        print(y)

        # 2. create and train model
        print("\nCreating naive Bayes classifier ")
        model = CategoricalNB(alpha=1)
        model.fit(X, y)
        print("Done ")
        pred_classes = model.predict(X)

        # 3. evaluate model
        print("\nPredicted classes: ")
        print(pred_classes)
        acc_train = model.score(X, y)
        print("\nAccuracy on train data = %0.4f " % acc_train)

        # 3b. confusion matrix
        from sklearn.metrics import confusion_matrix
        y_predicteds = model.predict(X)
        cm = confusion_matrix(y, y_predicteds)  # actual, pred
        print("\nConfusion matrix raw: ")
        print(cm)

        # 3c. precision, recall, F1
        # from sklearn.metrics import classification_report
        # report = classification_report(y, pred_classes)
        # print(report)

        # 4. use model
        # dentist, hazel, italy = [3,1,0]
        print("\nPredicting class for dentist, hazel, italy ")
        probs = model.predict_proba([[3,1,0]])
        print("\nPrediction probs: ")
        print(probs)

        predicted = model.predict([[3,1,0]])
        print("\nPredicted class: ")
        print(predicted)

        # 5. TODO: save model using pickle
        # import pickle
        # print("Saving trained naive Bayes model ")
        # path = ".\\Models\\bayes_scikit_model.sav"
        # pickle.dump(model, open(path, "wb"))

        # predict (barista, green, Korea)
        # x = np.array([[1, 0, 2]], dtype=np.int64)
        # with open(path, 'rb') as f:
        #     loaded_model = pickle.load(f)
        # pa = loaded_model.predict_proba(x)
        # print(pa)

        print("\nEnd demo ")

    if __name__ == "__main__":
        main()
