I was working with the scikit naive Bayes classifier. Naive Bayes is best suited for categorical (string) data, such as a data file that looks like:
actuary  green  korea  F
barista  green  italy  M
dentist  hazel  japan  M
dentist  green  italy  F
chemist  hazel  japan  M
. . .
The goal is to predict sex (F or M) from job, eye color, and country. In most cases the raw string data should be converted/encoded to integers like:
0  0  2  0
1  0  0  1
3  1  1  1
3  0  0  0
2  1  1  1
. . .
where (actuary=0, barista=1, chemist=2, dentist=3); (green=0, hazel=1); (italy=0, japan=1, korea=2); (female=0, male=1). In many scenarios you can convert the string data to integer data manually, for example by dropping the string data into an Excel spreadsheet and then doing find-replace operations.
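The find-replace idea can be sketched in a few lines of Python, using one dictionary per column. The dictionaries below just restate the mappings listed above:

```python
# one lookup table per column, restating the mappings above
job_map = {'actuary':0, 'barista':1, 'chemist':2, 'dentist':3}
eye_map = {'green':0, 'hazel':1}
country_map = {'italy':0, 'japan':1, 'korea':2}
sex_map = {'F':0, 'M':1}

row = ['actuary', 'green', 'korea', 'F']  # first line of the file
encoded_row = [job_map[row[0]], eye_map[row[1]],
  country_map[row[2]], sex_map[row[3]]]
print(encoded_row)  # [0, 0, 2, 0]
```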
Instead of manually converting strings to integers, it is possible to do the conversion/encoding programmatically. The scikit library has an OrdinalEncoder class that can do this. For example:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)

enc = OrdinalEncoder(dtype=np.int64)
enc.fit(raw)  # scan data
print("\nCategories: ")
print(enc.categories_)  # show what encoding will do

encoded = enc.transform(raw)  # encode the data
X = encoded[:,0:3]
y = encoded[:,3]
# etc.
I thought that OrdinalEncoder was 1.) somewhat overkill for such a simple problem, 2.) an extra dependency, and, most importantly, 3.) not easy to customize with a specific string-to-integer mapping. So I implemented a lightweight ordinal_encode() function from scratch that has only about a dozen lines of code:
def ordinal_encode(data, col_values):
  # data is an np string matrix (from genfromtxt)
  # col_values is a list of lists, per column
  (nr, nc) = data.shape
  result = np.zeros((nr,nc), dtype=np.int64)
  for j in range(nc):  # each col
    vals = col_values[j]  # the strings in this col
    for i in range(nr):
      s = data[i][j]
      for k in range(len(vals)):
        if s == vals[k]:
          result[i][j] = k
          break
  return result

print("\nOrdinal encoding from scratch demo ")
print("\nReading data to memory with genfromtxt() ")
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)
print("\nRaw data: ")
print(raw)

encoded = ordinal_encode(raw,
  [['actuary','barista','chemist','dentist'],
   ['green','hazel'],
   ['italy','japan','korea'],
   ['F','M']])
print("\nEncoded data: ")
print(encoded)
# now use for naive Bayes
My lightweight ordinal_encode() function accepts a list of lists of unique values in each column. The order in which the string values are listed determines their integer encoding. So in the example, “green” = 0, “hazel” = 1.
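To illustrate that listing order alone drives the encoding, here is a small self-contained check. It re-declares ordinal_encode() so the snippet runs on its own, and encodes the same one-column matrix twice with the value list in two different orders:

```python
import numpy as np

def ordinal_encode(data, col_values):
  # map each string to its index in the per-column value list
  (nr, nc) = data.shape
  result = np.zeros((nr, nc), dtype=np.int64)
  for j in range(nc):
    vals = col_values[j]
    for i in range(nr):
      for k in range(len(vals)):
        if data[i][j] == vals[k]:
          result[i][j] = k
          break
  return result

raw = np.array([['green'], ['hazel']], dtype=str)
a = ordinal_encode(raw, [['green','hazel']])  # green=0, hazel=1
b = ordinal_encode(raw, [['hazel','green']])  # green=1, hazel=0
print(a.ravel())  # [0 1]
print(b.ravel())  # [1 0]
```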
Instead of manually specifying the values, it’s possible to programmatically find the values:
def get_col_values(data):
  # data is an np string matrix (from genfromtxt)
  # return is a list of lists, per column
  (nr, nc) = data.shape
  result = []  # list of lists
  for j in range(nc):
    vals = []  # list of unique values in col
    for i in range(nr):
      s = data[i][j]
      if s not in vals:
        vals.append(s)
    vals.sort()
    result.append(vals)
  return result

train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)
col_vals = get_col_values(raw)  # adjust order if needed
print("\nColumn values: ")
print(col_vals)
encoded = ordinal_encode(raw, col_vals)
print("\nEncoded data: ")
print(encoded)
# etc., etc.
The get_col_values() function encodes each column using alphabetical order, but it’s easy to customize that behavior.
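One possible customization, as a sketch: order each column's values by descending frequency instead of alphabetically, so that the most common string in a column gets code 0. The get_col_values_by_freq() helper below is hypothetical, not from the post:

```python
import numpy as np
from collections import Counter

def get_col_values_by_freq(data):
  # hypothetical variant: most frequent value first,
  # ties broken alphabetically
  (nr, nc) = data.shape
  result = []
  for j in range(nc):
    counts = Counter(data[:, j].tolist())
    vals = sorted(counts, key=lambda s: (-counts[s], s))
    result.append(vals)
  return result

raw = np.array([['japan'], ['japan'], ['italy']], dtype=str)
print(get_col_values_by_freq(raw))  # [['japan', 'italy']]
```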
The moral of this post is that sometimes using built-in library code like the scikit OrdinalEncoder class is a good thing, but sometimes writing custom code like the ordinal_encode() function is better.
Transforming string values to integers isn’t very difficult. Transforming a submarine into an airplane isn’t so easy. Here are three ideas that never became reality.
Left: The Convair submersible seaplane was a U.S. Navy design project from the early 1960s. Right: In the early 1930s, Soviet engineering student Boris Ushakov proposed a flying submarine design.
Demo code:
# experiments.py

import numpy as np
from sklearn.naive_bayes import CategoricalNB
# from sklearn.preprocessing import OrdinalEncoder

def get_col_values(data):
  # data is an np string matrix (from genfromtxt)
  # return is a list of lists, per column
  (nr, nc) = data.shape
  result = []  # list of lists
  for j in range(nc):
    vals = []  # list of unique values in col
    for i in range(nr):
      s = data[i][j]
      if s not in vals:
        vals.append(s)
    vals.sort()
    result.append(vals)
  return result

def ordinal_encode(data, col_values):
  # data is an np string matrix (from genfromtxt)
  # col_values is a list of lists, per column
  (nr, nc) = data.shape
  result = np.zeros((nr,nc), dtype=np.int64)
  for j in range(nc):  # each col
    vals = col_values[j]  # the strings in this col
    for i in range(nr):
      s = data[i][j]
      for k in range(len(vals)):
        if s == vals[k]:
          result[i][j] = k
          break
  return result

print("\nOrdinal encoding from scratch demo ")

print("\nReading data to memory with genfromtxt() ")
train_file = ".\\Data\\job_eye_country_sex_raw.txt"
raw = np.genfromtxt(train_file, usecols=range(0,4),
  delimiter="\t", dtype=str)
print("\nRaw data: ")
print(raw)

col_vals = get_col_values(raw)
print("\nColumn values: ")
print(col_vals)

# encoded = ordinal_encode(raw,
#   [['actuary','barista','chemist','dentist'],
#    ['green','hazel'],
#    ['italy','japan','korea'],
#    ['F','M']])

encoded = ordinal_encode(raw, col_vals)
print("\nEncoded data: ")
print(encoded)

X = encoded[:,0:3]
y = encoded[:,3]

print("\nBegin naive Bayes ")

# # using built-in OrdinalEncoder
# print("\nEncoding data: ")
# enc = OrdinalEncoder(dtype=np.int64)
# enc.fit(raw)  # scan data
# print("\nCategories: ")
# print(enc.categories_)
# encoded = enc.transform(raw)
# X = encoded[:,0:3]
# y = encoded[:,3]

print("\nCreating naive Bayes classifier ")
model = CategoricalNB(alpha=1)
model.fit(X, y)
print("Done ")

pred_classes = model.predict(X)
print("\nPredicted classes: ")
print(pred_classes)

acc_train = model.score(X, y)
print("\nAccuracy on train data = %0.4f " % acc_train)

# use model
# dentist, hazel, italy = [3,1,0]
print("\nPredicting class for dentist, hazel, italy ")
probs = model.predict_proba([[3,1,0]])
print("\nPrediction probs: ")
print(probs)
predicted = model.predict([[3,1,0]])
print("\nPredicted class: ")
print(predicted)

print("\nEnd demo ")
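Because the per-column value lists double as lookup tables, decoding an integer prediction back to its string label is just an index into the target column's list. A minimal sketch, assuming the alphabetical F/M ordering used in the demo:

```python
# value lists in the same order used for encoding
col_vals = [['actuary','barista','chemist','dentist'],
  ['green','hazel'], ['italy','japan','korea'], ['F','M']]

predicted = 0  # e.g., the integer class returned by model.predict()
label = col_vals[3][predicted]  # index into the target column's list
print(label)  # F
```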