Example of Multinomial Naive Bayes Classification Using the scikit Library

The scikit-learn code library has a MultinomialNB class that can be used to create prediction models for multinomial data. The most common form of multinomial data has predictor variables where the values are counts. For example, suppose you want to predict the college course type (history = 0, math = 1, psychology = 2) from the counts of each letter grade students received.
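
Briefly, a multinomial naive Bayes classifier scores each possible class as the class prior times the product of the per-class feature probabilities, each raised to the observed count, and then predicts the class with the largest score. In rough form (scikit actually works with smoothed log-probabilities, as shown later):

  P(course | counts) is proportional to
    P(course) * product-over-grades of P(grade | course)^count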

I coded up a demo. My raw demo data is:

# college_grades_train_raw.txt
# As Bs Cs Ds Fs Course
# 
5,7,12,6,4,math
1,6,10,3,0,math
0,9,12,2,1,math
8,8,10,3,2,psychology
7,14,8,0,0,psychology
5,12,9,1,3,psychology
2,16,7,0,2,psychology
3,11,5,4,4,history
5,9,7,4,2,history
8,6,8,0,1,history

The first line of data means that in a particular math course, 5 students received As, and there were 7 Bs, 12 Cs, 6 Ds, and 4 Fs.

The scikit MultinomialNB class can accept string labels directly, but integer-encoding the labels keeps the data file entirely numeric so it can be read with a single loadtxt() call. The encoded data used by the demo is:

# college_grades_train.txt
# As Bs Cs Ds Fs Course
# history = 0, math = 1, psych = 2
#
5,7,12,6,4,1
1,6,10,3,0,1
0,9,12,2,1,1
8,8,10,3,2,2
7,14,8,0,0,2
5,12,9,1,3,2
2,16,7,0,2,2
3,11,5,4,4,0
5,9,7,4,2,0
8,6,8,0,1,0
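
One way to produce the integer encoding programmatically is a lookup-table pass over the raw file. This is a minimal sketch (the dictionary and file handling are illustrative; the demo file was prepared by hand):

  course_to_int = {"history": 0, "math": 1, "psychology": 2}
  fin = open(".\\Data\\college_grades_train_raw.txt", "r")
  fout = open(".\\Data\\college_grades_train.txt", "w")
  for line in fin:
    line = line.strip()
    if line == "" or line.startswith("#"):
      continue  # skip blank lines and comment lines
    tokens = line.split(",")
    tokens[5] = str(course_to_int[tokens[5]])  # encode the label
    fout.write(",".join(tokens) + "\n")
  fout.close()
  fin.close()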

The data is loaded into memory like so:

  train_file = ".\\Data\\college_grades_train.txt"
  XY = np.loadtxt(train_file, usecols=[0,1,2,3,4,5],
    delimiter=",", comments="#", dtype=np.int64)
  X = XY[:,0:5]
  y = XY[:,5]

The model is created and trained like so:

  from sklearn.naive_bayes import MultinomialNB

  model = MultinomialNB(alpha=1)  # alpha is the Laplace smoothing constant
  model.fit(X, y)
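
After training, the learned parameters can be inspected. The classes_, class_log_prior_ and feature_log_prob_ attributes hold the class labels, the log of the class frequencies, and the smoothed log-probabilities of each grade given each course type (a quick look, not required for making predictions):

  print(model.classes_)           # [0 1 2]
  print(model.class_log_prior_)   # 3 log class priors
  print(model.feature_log_prob_)  # 3x5 matrix of log P(grade | course)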

The trained model can be evaluated:

  y_predicteds = model.predict(X)
  acc_train = model.score(X, y)
  print("\nAccuracy on train data = %0.4f " % acc_train)

And the trained model can be used to make a prediction:

  X = [[7,8,7,3,1]]  # 7 As, 8 Bs, etc.
  probs = model.predict_proba(X)
  print("Prediction probs: ")
  print(probs)

The result probs matrix has just one row: [[0.7224 0.0186 0.2590]]. The values are pseudo-probabilities of each of the three possible course types. Because the value at position [0] is the largest, the prediction is class 0 = history.
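
The pseudo-probabilities can be reproduced by hand from the learned parameters: the joint log-likelihood of each course type is its log prior plus the dot product of the grade counts with that course's feature log-probabilities, and the rows are then normalized in log space. A minimal sketch (logsumexp comes from SciPy, which scikit-learn already requires):

  from scipy.special import logsumexp

  x = np.array([[7,8,7,3,1]], dtype=np.int64)
  jll = x @ model.feature_log_prob_.T + model.class_log_prior_
  manual_probs = np.exp(jll - logsumexp(jll, axis=1, keepdims=True))
  print(manual_probs)  # matches model.predict_proba(x)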



I wasn’t a very good college student in my undergraduate days at U.C. Irvine. It’s something of a miracle I graduated at all because I spent more time partying than studying (except for my math and computer classes, which I loved). Toga parties can trace their origins back to the 1950s. Left: The movie “Animal House” (1978) featured a toga party and popularized the idea. Center and Right: Gender differences when it comes to preparing for a toga party. College girls will spend hours creating their togas. College guys will spend approximately 45 seconds creating their togas.


Demo code:

# multinomial_bayes.py
# predict college course type from grade counts

# Anaconda3-2020.02  Python 3.7.6
# scikit 0.22.1
# Windows 10/11 

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# ---------------------------------------------------------

# Data:
# numAs numBs numCs numDs numFs Course
# history = 0, math = 1, psych = 2
# 5,7,12,6,4,1
# 1,6,10,3,0,1
# . . .

# ---------------------------------------------------------

def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual   ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted    ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. get ready
  print("\nBegin scikit multinomial Bayes demo ")
  print("Predict (hist = 0, math = 1, psych = 2) from grades ")
  np.random.seed(1)
  np.set_printoptions(precision=4)

  # 1. load data
  train_file = ".\\Data\\college_grades_train.txt"
  XY = np.loadtxt(train_file, usecols=[0,1,2,3,4,5],
    delimiter=",", comments="#", dtype=np.int64)
  X = XY[:,0:5]
  y = XY[:,5]
 
  print("\nPredictor counts: ")
  print(X)

  print("\nCourse types: ")
  print(y)

  # 2. create and train model
  print("\nCreating multinomial Bayes classifier ")
  model = MultinomialNB(alpha=1)
  model.fit(X, y)
  print("Done ")
  
  # 3. evaluate model
  y_predicteds = model.predict(X)
  print("\nPredicted classes: ")
  print(y_predicteds)

  acc_train = model.score(X, y)
  print("\nAccuracy on train data = %0.4f " % acc_train)

  # 3b. confusion matrix
  # from sklearn.metrics import confusion_matrix
  # cm = confusion_matrix(y, y_predicteds)  # actual, pred
  # print("\nConfusion matrix raw: ")
  # print(cm)
  # print("\nConfusion matrix formatted: ")
  # show_confusion(cm)  

  # 3c. precision, recall, F1
  # for binary classification
  # from sklearn.metrics import classification_report
  # report = classification_report(y, y_predicteds)
  # print(report)

  # 4. use model
  X = [[7,8,7,3,1]]  # 7 As, 8 Bs, etc.
  print("\nPredicting course for grade counts: "
    + str(X))
  probs = model.predict_proba(X)
  print("\nPrediction probs: ")
  print(probs)

  pred_course = model.predict(X)  # 0,1,2
  courses = ["history", "math", "psychology"]
  print("\nPredicted course: ")
  print(courses[pred_course[0]])

  # 5. TODO: save model using pickle
  # import pickle
  # print("Saving trained naive Bayes model ")
  # path = ".\\Models\\multinomial_scikit_model.sav"
  # pickle.dump(model, open(path, "wb"))

  # use saved model
  # x = np.array([[6, 7, 8, 2, 1]], dtype=np.int64)
  # with open(path, 'rb') as f:
  #   loaded_model = pickle.load(f)
  # pa = loaded_model.predict_proba(x)
  # print(pa)
  
  print("\nEnd multinomial Bayes demo ")

if __name__ == "__main__":
  main()