Naive Bayes Classification Example Using the scikit Library

Naive Bayes classification is a classical machine learning technique. It is best used when the predictor variables are all non-numeric (categorical). Naive Bayes works for both binary classification and multi-class classification, and it works well even when you don’t have very much training data.

I coded up a demo of naive Bayes using the scikit-learn library. The ideas are best explained by an example.

Consider the following data:

actuary   green   korea   F
barista   green   italy   M
dentist   hazel   japan   M
dentist   green   italy   F
chemist   hazel   japan   M
actuary   green   japan   F
actuary   hazel   japan   M
chemist   green   italy   F
chemist   green   italy   F
dentist   green   japan   F
barista   hazel   japan   M
dentist   green   japan   F
dentist   green   japan   F
chemist   green   italy   F
dentist   green   japan   M
dentist   hazel   japan   M
chemist   green   korea   F
barista   green   japan   F
actuary   hazel   italy   F
actuary   green   italy   M

The columns are job type, eye color, country, and sex. The overall goal is to predict sex from job, eye, and country.

Suppose you want to predict the sex of a person who is (dentist, hazel, italy). If you look just at the dentists in the job column, 3 of the 7 dentists are male and 4 of the 7 are female. So you’d (weakly) guess female. If you look just at the hazel values in the eye color column, 5 of 6 people are male and just 1 of 6 is female. So you’d strongly guess male. If you look just at the italy values in the country column, 2 of 7 people are male and 5 of 7 are female. So you’d guess female.
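
The counts above can be checked with a few lines of plain Python. This is only a sketch: the rows list is the 20-line data typed inline, and sex_counts() is a hypothetical helper name, not part of the demo.

```python
from collections import Counter

# The 20 rows of the demo data: (job, eye, country, sex)
rows = [
    ("actuary", "green", "korea", "F"), ("barista", "green", "italy", "M"),
    ("dentist", "hazel", "japan", "M"), ("dentist", "green", "italy", "F"),
    ("chemist", "hazel", "japan", "M"), ("actuary", "green", "japan", "F"),
    ("actuary", "hazel", "japan", "M"), ("chemist", "green", "italy", "F"),
    ("chemist", "green", "italy", "F"), ("dentist", "green", "japan", "F"),
    ("barista", "hazel", "japan", "M"), ("dentist", "green", "japan", "F"),
    ("dentist", "green", "japan", "F"), ("chemist", "green", "italy", "F"),
    ("dentist", "green", "japan", "M"), ("dentist", "hazel", "japan", "M"),
    ("chemist", "green", "korea", "F"), ("barista", "green", "japan", "F"),
    ("actuary", "hazel", "italy", "F"), ("actuary", "green", "italy", "M"),
]

def sex_counts(rows, col, value):
    """Count M and F among the rows whose column `col` equals `value`."""
    return Counter(r[3] for r in rows if r[col] == value)

for col, value in [(0, "dentist"), (1, "hazel"), (2, "italy")]:
    c = sex_counts(rows, col, value)
    print(value, "M =", c["M"], "F =", c["F"])
# dentist M = 3 F = 4
# hazel M = 5 F = 1
# italy M = 2 F = 5
```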

The naive Bayes algorithm combines these frequencies to produce probabilities of male and female. The technique is called naive because it treats each predictor column independently, given the class, and so it doesn’t take interactions between columns into account.

To use scikit-learn on this problem, the first (and most time-consuming) step is to convert the categorical values to integers: actuary = 0, barista = 1, chemist = 2, dentist = 3; green = 0, hazel = 1; italy = 0, japan = 1, korea = 2; female = 0, male = 1. In general it’s a good idea to encode using alphabetical order because scikit has an OrdinalEncoder class that encodes that way.
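
To make the alphabetical-order idea concrete, here is a minimal sketch of the mapping in plain Python. The make_codes() helper and the short jobs list are hypothetical, made up for illustration.

```python
def make_codes(values):
    """Map each distinct string to an integer, in alphabetical order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

jobs = ["actuary", "barista", "dentist", "dentist", "chemist"]
job_codes = make_codes(jobs)
print(job_codes)                     # {'actuary': 0, 'barista': 1, 'chemist': 2, 'dentist': 3}
print([job_codes[j] for j in jobs])  # [0, 1, 3, 3, 2]
```

The same helper applied to the eye, country, and sex columns gives the other three mappings listed above.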

The encoded data looks like:

# job_eye_country_sex.txt
# actuary=0, barista=1, chemist=2, dentist=3
# green=0, hazel=1
# italy=0, japan=1, korea=2
# female=0, male=1
#
0   0   2   0
1   0   0   1
3   1   1   1
3   0   0   0
2   1   1   1
0   0   1   0
0   1   1   1
2   0   0   0
2   0   0   0
3   0   1   0
1   1   1   1
3   0   1   0
3   0   1   0
2   0   0   0
3   0   1   1
3   1   1   1
2   0   2   0
1   0   1   0
0   1   0   0
0   0   0   1

I saved the encoded data as job_eye_country_sex.txt to be used as training data. I didn’t use any test data as I would in a non-demo scenario.

After reading the data into memory, the key statements to create a naive Bayes classifier are:

from sklearn.naive_bayes import CategoricalNB

print("Creating naive Bayes classifier ")
model = CategoricalNB(alpha=1)
model.fit(X, y)
print("Done ")

In my demo, the predicted probabilities for (dentist, hazel, italy) are [0.33, 0.67], and so the predicted sex is male because the value at index [1] (0.67) is larger than the value at index [0] (0.33).
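
The [0.33, 0.67] result can be reproduced by hand. The sketch below assumes the usual additive (Laplace) smoothing formula, (count + alpha) / (class count + alpha * number of categories), with unsmoothed class priors; that is an assumption about how alpha=1 is applied, and nb_probs() is a hypothetical helper rather than scikit code.

```python
import numpy as np

# Encoded training data (job, eye, country, sex), typed inline from the listing above
data = np.array([
    [0,0,2,0],[1,0,0,1],[3,1,1,1],[3,0,0,0],[2,1,1,1],
    [0,0,1,0],[0,1,1,1],[2,0,0,0],[2,0,0,0],[3,0,1,0],
    [1,1,1,1],[3,0,1,0],[3,0,1,0],[2,0,0,0],[3,0,1,1],
    [3,1,1,1],[2,0,2,0],[1,0,1,0],[0,1,0,0],[0,0,0,1]])
X, y = data[:, :3], data[:, 3]

def nb_probs(X, y, item, alpha=1.0):
    """Naive Bayes class probabilities for one item, with additive smoothing."""
    scores = []
    for c in np.unique(y):
        Xc = X[y == c]                 # rows belonging to class c
        score = len(Xc) / len(X)       # class prior, e.g. 12/20 for female
        for j, v in enumerate(item):
            n_cats = len(np.unique(X[:, j]))   # distinct values in column j
            count = np.sum(Xc[:, j] == v)      # rows in class c with value v
            score *= (count + alpha) / (len(Xc) + alpha * n_cats)
        scores.append(score)
    scores = np.array(scores)
    return scores / scores.sum()       # normalize so the values sum to 1

probs = nb_probs(X, y, [3, 1, 0])      # dentist, hazel, italy
print(np.round(probs, 2))              # [0.33 0.67]
```

For female the product is (12/20) * (5/16) * (2/14) * (6/15), and for male it is (8/20) * (4/12) * (6/10) * (3/11). Normalizing the two products gives roughly 0.33 and 0.67.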

Instead of preprocessing the raw data by converting strings to integers, it is possible to programmatically encode raw string data:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
print("\nReading raw data using genfromtxt() ")
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
XY = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)

print("\nEncoding the data: ")
enc = OrdinalEncoder(dtype=np.int64)
enc.fit(XY)  # scan data
print("\nCategories: ")
print(enc.categories_)
XY = enc.transform(XY)  # encode data
X = XY[:,0:3]
y = XY[:,3]
# now good to go
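
One possible variation, sketched under assumptions: if the predictors and the labels get separate encoders, a new raw item can be encoded and the prediction decoded back to a string. The names x_enc and y_enc are hypothetical, the raw data is typed inline rather than read from file, and using LabelEncoder for the labels is a swapped-in choice, not what the demo does.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Raw data typed inline (assumption, to keep the sketch self-contained)
raw = """actuary green korea F
barista green italy M
dentist hazel japan M
dentist green italy F
chemist hazel japan M
actuary green japan F
actuary hazel japan M
chemist green italy F
chemist green italy F
dentist green japan F
barista hazel japan M
dentist green japan F
dentist green japan F
chemist green italy F
dentist green japan M
dentist hazel japan M
chemist green korea F
barista green japan F
actuary hazel italy F
actuary green italy M"""
XY = np.array([line.split() for line in raw.splitlines()])
X_raw, y_raw = XY[:, 0:3], XY[:, 3]

x_enc = OrdinalEncoder(dtype=np.int64)  # encodes the three predictor columns
y_enc = LabelEncoder()                  # encodes the sex labels: F = 0, M = 1
X = x_enc.fit_transform(X_raw)
y = y_enc.fit_transform(y_raw)

model = CategoricalNB(alpha=1)
model.fit(X, y)

item = x_enc.transform([["dentist", "hazel", "italy"]])  # becomes [[3, 1, 0]]
pred = model.predict(item)
label = y_enc.inverse_transform(pred)   # back to "F" or "M"
print(label)
```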

Naive Bayes is best used for categorical data. If a column is numeric, the values can be bucketed and the buckets encoded as integers. There is a GaussianNB class that can handle numeric data, but that algorithm makes several assumptions, such as that the values in each column are Gaussian (Normal / bell-shaped) within each class. I don’t recommend using naive Bayes directly on numeric data.
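
As a rough illustration of bucketing, a numeric column can be binned with NumPy. The ages and the bucket boundaries below are made-up values, not part of the demo data.

```python
import numpy as np

ages = np.array([19, 25, 34, 41, 58, 67])  # hypothetical numeric column
edges = [30, 50]                           # hypothetical bucket boundaries
# 0 means age < 30, 1 means 30 <= age < 50, 2 means age >= 50
age_codes = np.digitize(ages, edges)
print(age_codes)                           # [0 0 1 1 2 2]
```

The resulting integer codes can then be used like any other encoded categorical column.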

The 1990s had some very strange but wonderful animated cartoon shows on TV. Here are three of my favorites. All three featured a naive character.

Left: “Aaahh!!! Real Monsters” (1994) features three young but nice monsters: Oblina (like a weird candy cane), Krumm (a hulking but naive monster who holds his eyes in his hands), and Ickis (sort of a demonic rabbit).

Center: “Rocko’s Modern Life” (1993) features the surreal life of an Australian wallaby named Rocko and his friends including a naive steer named Heffer Wolfe, and a neurotic turtle named Filburt.

Right: “CatDog” (1998) features the life of conjoined brothers of different species. The cat half is cynical; the dog half is naive.

Demo code:

# naive_bayes.py

# Anaconda3-2020.02  Python 3.7.6
# scikit 0.22.1
# Windows 10/11 

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# ---------------------------------------------------------

def main():
  # 0. prepare
  print("\nBegin scikit naive Bayes demo ")
  print("Predict sex (F = 0, M = 1) from job, eye, country ")
  np.random.seed(1)

  # actuary   green   korea   F
  # barista   green   italy   M
  # dentist   hazel   japan   M
  # . . . 
  # actuary = 0, barista = 1, chemist = 2, dentist = 3
  # green = 0, hazel = 1
  # italy = 0, japan = 1, korea = 2

  # 1. load data
  print("\nLoading train data ")
  train_file = ".\\Data\\job_eye_country_sex.txt"
  X = np.loadtxt(train_file, usecols=range(0,3),
    delimiter="\t", comments="#", dtype=np.int64)
  y = np.loadtxt(train_file, usecols=3,
    delimiter="\t", comments="#", dtype=np.int64) 
  # print(y.shape)  # 1D is required
  # y = y.flatten()
  # y = y.reshape(-1)
  # y = y.squeeze()
  print("Done ")

  print("\nDiscretized features: ")
  print(X)

  print("\nActual classes: ")
  print(y)

  # 2. create and train model
  print("\nCreating naive Bayes classifier ")
  model = CategoricalNB(alpha=1)
  model.fit(X, y)
  print("Done ")
  pred_classes = model.predict(X)

  # 3. evaluate model
  print("\nPredicted classes: ")
  print(pred_classes)
  acc_train = model.score(X, y)
  print("\nAccuracy on train data = %0.4f " % acc_train)

  # 3b. confusion matrix
  from sklearn.metrics import confusion_matrix
  cm = confusion_matrix(y, pred_classes)  # actual, pred
  print("\nConfusion matrix raw: ")
  print(cm)

  # 3c. precision, recall, F1
  # from sklearn.metrics import classification_report
  # report = classification_report(y, pred_classes) 
  # print(report) 

  # 4. use model
  # dentist, hazel, italy = [3,1,0]
  print("\nPredicting class for dentist, hazel, italy ")
  probs = model.predict_proba([[3,1,0]])
  print("\nPrediction probs: ")
  print(probs)

  predicted = model.predict([[3,1,0]])
  print("\nPredicted class: ")
  print(predicted)

  # 5. TODO: save model using pickle
  # import pickle
  # print("Saving trained naive Bayes model ")
  # path = ".\\Models\\bayes_scikit_model.sav"
  # with open(path, "wb") as f:
  #   pickle.dump(model, f)

  # predict (barista, green, korea)
  # x = np.array([[1, 0, 2]], dtype=np.int64)
  # with open(path, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # pa = loaded_model.predict_proba(x)
  # print(pa)
  
  print("\nEnd demo ")

if __name__ == "__main__":
  main()
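
The save-model step is left as a TODO in the demo. A minimal sketch of a pickle round trip might look like the following. It writes to a temporary directory, which is an assumption made so the sketch runs anywhere, instead of the .\Models path in the comments, and it retrains on the encoded data typed inline.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Encoded training data (job, eye, country, sex), typed inline
data = np.array([
    [0,0,2,0],[1,0,0,1],[3,1,1,1],[3,0,0,0],[2,1,1,1],
    [0,0,1,0],[0,1,1,1],[2,0,0,0],[2,0,0,0],[3,0,1,0],
    [1,1,1,1],[3,0,1,0],[3,0,1,0],[2,0,0,0],[3,0,1,1],
    [3,1,1,1],[2,0,2,0],[1,0,1,0],[0,1,0,0],[0,0,0,1]])
X, y = data[:, :3], data[:, 3]
model = CategoricalNB(alpha=1).fit(X, y)

# Hypothetical location: a temporary directory rather than .\Models
path = os.path.join(tempfile.mkdtemp(), "bayes_scikit_model.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)              # save the trained model

with open(path, "rb") as f:
    loaded_model = pickle.load(f)      # load it back

x = np.array([[1, 0, 2]], dtype=np.int64)  # barista, green, korea
pa = loaded_model.predict_proba(x)
print(pa)
```

A pickled model should only be loaded with the same scikit version that saved it, and only from a source you trust.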