The Wheat Seeds Dataset Problem Using the scikit Gaussian Naive Bayes Module

I was looking at the Wheat Seeds dataset problem recently. The goal is to predict the species of a wheat seed (Kama, Rosa, Canadian) from seven numeric predictors (seed length, width, and so on). I had previously created prediction models using scikit k-nearest neighbors, scikit radius neighbors, and a PyTorch neural network. With all three of those approaches, it was moderately difficult to generate a good model.

Just for fun, one rainy Pacific Northwest weekend morning, I decided I’d try the scikit Gaussian naive Bayes (NB) classifier on the Wheat Seeds dataset. Gaussian NB can be used to predict a target class label when the predictor values are all strictly numeric. I knew, both from theory and from previous practical experience, that Gaussian NB is often not a powerful approach. My experiment confirmed this.

As is often the case, data preparation was time-consuming. The raw data is at archive.ics.uci.edu/ml/datasets/seeds. The raw data looks like:

15.26  14.84  0.871   5.763  3.312  2.221  5.22   1
14.88  14.57  0.8811  5.554  3.333  1.018  4.956  1
. . .
17.63  15.98  0.8673  6.191  3.561  4.076  6.06   2
16.84  15.67  0.8623  5.998  3.484  4.675  5.877  2
. . .
11.84  13.21  0.8521  5.175  2.836  3.598  5.044  3
12.3   13.34  0.8684  5.243  2.974  5.637  5.063  3
---------------------------------------------------
10.59  12.41  0.8081  4.899  2.63   0.765  4.519 (min values)
21.18  17.25  0.9183  6.675  4.033  8.456  6.55  (max values)

There are 210 data items. Each represents one of three species of wheat seeds: Kama=1, Rosa=2, Canadian=3. There are 70 of each species. The first seven values on each line are the predictors: area, perimeter, compactness, length, width, asymmetry, groove. The eighth value is the 1-based encoded species.

The magnitudes and ranges of the raw predictor values vary significantly. Although it’s not necessary to normalize predictor values when using Gaussian NB, I had already normalized the data for my PyTorch neural network, so I used the normalized data.

I divided each predictor column by a constant: (25, 20, 1, 10, 10, 10, 10) respectively. The resulting predictors are all between 0.0 and 1.0. I also recoded the target class labels from 1-based to 0-based. The resulting 210-item normalized and recoded data looks like:

0.6104  0.7420  0.8710  0.5763  0.3312  0.2221  0.5220  0
0.5952  0.7285  0.8811  0.5554  0.3333  0.1018  0.4956  0
. . .
0.7052  0.7990  0.8673  0.6191  0.3561  0.4076  0.6060  1
0.6736  0.7835  0.8623  0.5998  0.3484  0.4675  0.5877  1
. . .
0.5048  0.6835  0.8481  0.5410  0.2911  0.3306  0.5231  2
0.5104  0.6690  0.8964  0.5073  0.3155  0.2828  0.4830  2
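
For reference, here’s a minimal sketch of the normalization and recoding, assuming the raw data is in a hypothetical whitespace-delimited file named seeds_raw.txt:

import numpy as np

raw = np.loadtxt("seeds_raw.txt")   # 210 rows, 8 columns
divisors = np.array([25, 20, 1, 10, 10, 10, 10],
  dtype=np.float32)
x = raw[:,0:7] / divisors           # predictors now in [0.0, 1.0]
y = raw[:,7].astype(np.int64) - 1   # species 1,2,3 -> 0,1,2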

I split the 210-item normalized data into a 180-item training set and a 30-item test set. I used the first 60 of each target class for training and the last 10 of each target class for testing.
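
Continuing the sketch above, the split can be done with simple index lists, assuming the 210 normalized rows are ordered by class in blocks of 70:

train_rows, test_rows = [], []
for c in range(3):  # three species, 70 rows each
  start = c * 70
  train_rows.extend(range(start, start+60))    # first 60 per class
  test_rows.extend(range(start+60, start+70))  # last 10 per class
x_train, y_train = x[train_rows], y[train_rows]
x_test, y_test = x[test_rows], y[test_rows]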

Creating a scikit Gaussian naive Bayes model is easy:

import numpy as np
from sklearn.naive_bayes import GaussianNB
. . .
# load data
. . .
# GaussianNB(*, priors=None, var_smoothing=1e-09)
print("Creating Gaussian naive Bayes classifier ")
model = GaussianNB()
model.fit(x_train, y_train)
print("Done ")

As expected, the model didn’t do very well. The accuracy on the training data was good, at 93.33% (168 out of 180 correct), but the accuracy on the test data was poor, at just 66.67% (20 out of 30 correct).

The seven predictors are seed area, perimeter, compactness, length, width, asymmetry, and groove. Gaussian naive Bayes looks at each predictor column independently. For example, suppose you are predicting the species for a dummy input of X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]. For the seed area column, the mean and standard deviation of the values for each of the three species are computed. Suppose those means are [0.75, 0.25, 0.65]. Because the mean at [1] is closest to the 0.1 value of the first predictor, based just on that first predictor, you’d guess the species of the X input is 1 = Rosa.

You’d repeat this process and get “evidence” from each of the seven predictors. At this point you could do a majority-rule vote over the seven evidence values to produce a predicted species, but majority rule doesn’t take into account how mathematically strong each of the seven evidence values is. Therefore, the evidence values are combined using probability techniques. (Note: I’ve greatly simplified how Gaussian NB works, at the expense of accuracy.)
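
Here’s a minimal sketch of that combination, assuming a model fitted as in the demo code below. The class_prior_, theta_ and var_ attributes hold the class priors, per-class predictor means, and per-class predictor variances that fit() computes (var_ is the attribute name in scikit 1.0 and later; older versions use sigma_):

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
log_prior = np.log(model.class_prior_)  # log P(class), shape (3,)
# per-class sum of per-predictor Gaussian log-likelihoods
log_like = -0.5 * np.sum(np.log(2.0 * np.pi * model.var_) +
  (x - model.theta_)**2 / model.var_, axis=1)
scores = log_prior + log_like  # one joint log score per class
pred = np.argmax(scores)       # agrees with model.predict([x])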

Gaussian naive Bayes has two main weaknesses. First, the technique looks at each predictor independently and doesn’t take interactions between predictors into account (hence the “naive” in the name). Second, the technique assumes the values in each predictor column are Gaussian (normal) distributed, which may not be true.
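
One way to sanity-check the second assumption is to run a Shapiro-Wilk normality test on each predictor column, per class. A minimal sketch, assuming scipy is installed and x_train / y_train as in the demo code below:

from scipy.stats import shapiro
for c in range(3):    # each species
  rows = x_train[y_train == c]
  for j in range(7):  # each predictor column
    stat, p = shapiro(rows[:,j])
    if p < 0.05:      # reject normality at the 5% level
      print("class %d col %d looks non-Gaussian (p = %0.4f)" % (c, j, p))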

It was an interesting exploration on a rainy morning.



There are several analogies between the development and evolution of machine learning algorithms and aviation technology. The Gaussian naive Bayes algorithm is what I think of as an early, classical ML technique. Classical ML techniques have been supplemented (but not entirely replaced) by neural network techniques. Classical ML techniques are analogous to the first airplanes in the early 1900s.

Left: The British Sopwith Triplane (circa 1918) of World War I had a top speed of about 115 mph.

Center: Twenty years later, the British Hawker Hurricane (circa 1938) of World War II had a top speed of about 340 mph.

Right: Twenty years later, the English Electric Lightning (circa 1958) had a top speed of about 1,400 mph.


Demo code below. The training and test data can be found at https://jamesmccaffrey.wordpress.com/2023/04/04/the-wheat-seeds-dataset-problem-using-pytorch/.

# wheat_gnb.py
# Gaussian NB on the Wheat Seeds dataset

# Anaconda3-2022.10  Python 3.9.13
# scikit 1.0.2  Windows 10/11 

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# ---------------------------------------------------------

def show_confusion(cm):
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual   ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted    ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# ---------------------------------------------------------

def main():
  # 0. prepare
  print("\nBegin scikit Gaussian naive Bayes demo ")
  print("Predict wheat species (0,1,2) from seven numerics ")
  np.random.seed(1)
  np.set_printoptions(precision=4, suppress=True)

  # 1. load data
  print("\nLoading train and test data ")
  train_file = ".\\Data\\wheat_train_k.txt"
  x_train = np.loadtxt(train_file, usecols=[0,1,2,3,4,5,6],
    delimiter="\t", comments="#", dtype=np.float32)
  y_train = np.loadtxt(train_file, usecols=7,
    delimiter="\t", comments="#", dtype=np.int64) 

  test_file = ".\\Data\\wheat_test_k.txt"
  x_test = np.loadtxt(test_file, usecols=[0,1,2,3,4,5,6],
    delimiter="\t", comments="#", dtype=np.float32)
  y_test = np.loadtxt(test_file, usecols=7,
    delimiter="\t", comments="#", dtype=np.int64) 
  print("Done ")

  print("\nData: ")
  print(x_train[0:4,:])
  print(". . .")
  print("\nActual species: ")
  print(y_train[0:4])
  print(". . .")

  # 2. create and train model
  # GaussianNB(*, priors=None, var_smoothing=1e-09)
  print("\nCreating Gaussian naive Bayes classifier ")
  model = GaussianNB()
  model.fit(x_train, y_train)
  print("Done ")

  # 3. evaluate model
  acc_train = model.score(x_train, y_train)
  print("\nAccuracy on train data = %0.4f " % acc_train)
  acc_test = model.score(x_test, y_test)
  print("Accuracy on test data =  %0.4f " % acc_test)

  # 3b. confusion matrix
  y_predicteds = model.predict(x_test)
  cm = confusion_matrix(y_test, y_predicteds) 
  print("\nConfusion matrix for test data: ")
  show_confusion(cm)

  # 4. use model
  print("\nPredicting species all 0.2 predictors: ")
  X = np.array([[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]],
    dtype=np.float32)
  print(X)
  probs = model.predict_proba(X)
  print("\nPrediction probs: ")
  print(probs)

  predicted = model.predict(X)
  print("\nPredicted class: ")
  print(predicted)

  # 5. TODO: save model using pickle
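  # a possible sketch of the TODO (file path is hypothetical):
  # import pickle
  # with open(".\\Models\\wheat_gnb_model.pkl", "wb") as f:
  #   pickle.dump(model, f)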
  
  print("\nEnd demo ")

if __name__ == "__main__":
  main()