Multi-Class Classification Example Using LightGBM (Light Gradient Boosting Machine)

Early one Sunday morning, while I was waiting for the dog path to dry off from the evening rain so that I could walk my mutts, I figured I’d take a look at multi-class classification using the LightGBM (light gradient boosting machine) system. LightGBM is a sophisticated tree-based system that can perform classification, regression, and ranking.

There are several interfaces to LightGBM. I like the easy-to-use Python scikit-learn API. LightGBM isn’t installed by default with the Anaconda Python distribution I use, so I installed it with the command “pip install lightgbm”.

For my demo, I used one of my standard synthetic datasets. The goal is to predict political leaning from sex, age, State, and income. The 240-item tab-delimited raw data looks like:

F   24   michigan   29500.00   liberal
M   39   oklahoma   51200.00   moderate
F   63   nebraska   75800.00   conservative
M   36   michigan   44500.00   moderate
F   27   nebraska   28600.00   liberal
. . .

For LightGBM, it’s best to use ordinal encoding for categorical predictor variables. I encoded the sex variable as M = 0 and F = 1. I encoded State as Michigan = 0, Nebraska = 1, Oklahoma = 2. I encoded politics as conservative = 0, moderate = 1, liberal = 2.
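
The encoding itself is just a handful of lookups. Here is a minimal sketch (the map and function names are mine, not part of the demo), assuming the raw file is tab-delimited:

sex_map = {"M": 0, "F": 1}
state_map = {"michigan": 0, "nebraska": 1, "oklahoma": 2}
politics_map = {"conservative": 0, "moderate": 1, "liberal": 2}

def encode_line(line):
  # "F\t24\tmichigan\t29500.00\tliberal" -> [1, 24, 0, 29500.0, 2]
  sex, age, state, income, politics = line.strip().split("\t")
  return [sex_map[sex], int(age), state_map[state],
    float(income), politics_map[politics]]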

Because LightGBM is tree-based, it’s not necessary to normalize numeric data. If you do normalize numeric data, the LGBM classification results will almost always be the same as those for the non-normalized data.

I split the encoded data into a 200-item set of training data and a 40-item set of test data. The resulting comma-delimited encoded data looks like:

1, 24, 0, 29500.00, 2
0, 39, 2, 51200.00, 1
1, 63, 1, 75800.00, 0
0, 36, 0, 44500.00, 1
1, 27, 1, 28600.00, 2
. . .

The key statements of my demo program are:

import numpy as np
import lightgbm as lgbm  # scikit API

train_x = np.loadtxt(train_file, usecols=[0,1,2,3],
  delimiter=",", comments="#", dtype=np.float64)
train_y = np.loadtxt(train_file, usecols=4,
  delimiter=",", comments="#", dtype=np.int64)

params = {
  # 'objective': 'multiclass',  # not needed
  'boosting_type': 'gbdt',  # default
  'num_leaves': 31,  # default
  'max_depth': -1,  # default (unlimited)
  'n_estimators': 50,  # default = 100
  'learning_rate': 0.05,  # default = 0.10
  'min_data_in_leaf': 5,  # default = 20
  'random_state': 0,
  'verbosity': -1  # -1 = fatal only; default = 1 (info)
}
model = lgbm.LGBMClassifier(**params) 
model.fit(train_x, train_y)
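
After training, the model can be used immediately via the scikit-style API. For example (a sketch; the probability values shown in the comments are just illustrative):

x = np.array([[0, 35, 2, 55000.00]], dtype=np.float64)
pred_class = model.predict(x)        # array like [1]
pred_probs = model.predict_proba(x)  # array like [[0.05 0.85 0.10]]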

The main challenge when using LightGBM is wading through the dozens of parameters. The LGBMClassifier class/object has 19 parameters (num_leaves, max_depth, etc.) and there are 57 Learning Control Parameters (min_data_in_leaf, bagging_fraction, etc.), for a total of 76 parameters to deal with. Here are the 19 model parameters:

boosting_type='gbdt', 
num_leaves=31,
max_depth=-1,
learning_rate=0.1,
n_estimators=100,
subsample_for_bin=200000,
objective=None,
class_weight=None,
min_split_gain=0.0,
min_child_weight=0.001,
min_child_samples=20,
subsample=1.0,
subsample_freq=0,
colsample_bytree=1.0,
reg_alpha=0.0,
reg_lambda=0.0,
random_state=None,
n_jobs=None,
importance_type='split',
**kwargs

Because the number of parameters is so large, in practice you must rely mostly on the default values and then try to find the handful of parameters that will create a good model. For my demo, I changed the n_estimators (number of trees) from the default 100 to 50, the learning rate from the default 0.10 to 0.05, the random_state from the default None to an arbitrary value of 0 (to get reproducible results), and the min_data_in_leaf from the default 20 to 5, which had a big effect. I also set verbosity to -1 to suppress all but fatal error messages, but in a non-demo scenario you really want to see all system warning and error messages too. The near-impossibility of fully understanding all the LightGBM parameters and their interactions is the biggest disadvantage of using LightGBM.
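
One systematic way to find that handful of parameters is a small grid search. Because the demo uses the scikit-learn API, scikit's GridSearchCV works directly with LGBMClassifier. Here is a sketch, using an arbitrary small grid (note that min_child_samples is the scikit-API name for min_data_in_leaf):

from sklearn.model_selection import GridSearchCV

param_grid = {
  'n_estimators': [50, 100, 200],
  'learning_rate': [0.01, 0.05, 0.10],
  'min_child_samples': [5, 10, 20]
}
search = GridSearchCV(lgbm.LGBMClassifier(random_state=0,
  verbosity=-1), param_grid, cv=5)
search.fit(train_x, train_y)
print(search.best_params_)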

The LightGBM model predicted political leaning for the 40-item test data with 82.5% accuracy (33 out of 40 correct). This is roughly comparable accuracy to that achieved by a neural network multi-class classifier. When LightGBM works, it often works very well. Tree-based systems are highly susceptible to overfitting, but the LightGBM system does a lot to mitigate overfitting.



My synthetic demo data has a political leaning column, but I have very little interest in politics. The kind of people who are attracted to politics generally have none of the personality characteristics I admire, and many of the characteristics I dislike, notably dishonesty. A Google search for “state senator arrested” returned dozens of results, which didn’t really surprise me. Here are three samples. From left to right: New Jersey, New York, Missouri.


Demo program:

# people_politics_lgbm.py
# predict politics from sex, age, State, income
# Anaconda3-2023.09-0  Python 3.11.5  LightGBM 4.3.0

import numpy as np
import lightgbm as lgbm

# -----------------------------------------------------------

def accuracy(model, data_x, data_y):
  # simple
  preds = model.predict(data_x)  # all predicted values
  n_correct = np.sum(preds == data_y)
  result = n_correct / len(data_x)
  return result
  
# -----------------------------------------------------------

def show_accuracy(model, data_x, data_y, n_classes):
  # more details
  n_corrects = np.zeros(n_classes, dtype=np.int64)
  n_wrongs = np.zeros(n_classes, dtype=np.int64)
  for i in range(len(data_x)):
    x = data_x[i].reshape(1, -1)  # batch it
    trgt = data_y[i]  # scalar like 2
    pred = model.predict(x)  # array like [2]
    pred = pred[0]  # like 2
    if pred == trgt:
      n_corrects[trgt] += 1
    else:
      n_wrongs[trgt] += 1

  accs = n_corrects / (n_corrects + n_wrongs)
  counts = n_corrects + n_wrongs

  overall_acc = np.sum(n_corrects) / len(data_x)
  print("Overall accuracy = %8.4f" % overall_acc)

  for c in range(n_classes):
    print("class %d : " % c, end ="")
    print(" ct = %3d " % counts[c], end="")
    print(" correct = %3d " % n_corrects[c], end ="")
    print(" wrong = %3d " % n_wrongs[c], end ="")
    print(" acc = %7.4f " % accs[c])

# -----------------------------------------------------------

def confusion_matrix_multi(model, data_x, data_y, n_classes):
  # assumes n_classes is 3 or greater
  cm = np.zeros((n_classes,n_classes), dtype=np.int64)
  for i in range(len(data_x)):
    x = data_x[i].reshape(1, -1)  # batch it
    trgt_y = data_y[i]  # scalar like 2
    pred_y = model.predict(x)  # array like [2]
    pred_y = pred_y[0]  # like 2
    cm[trgt_y][pred_y] += 1
  return cm

# -----------------------------------------------------------

def show_confusion(cm):
  # cm created using confusion_matrix_multi()
  dim = len(cm)
  mx = np.max(cm)             # largest count in cm
  wid = len(str(mx)) + 1      # width to print
  fmt = "%" + str(wid) + "d"  # like "%3d"
  for i in range(dim):
    print("actual   ", end="")
    print("%3d:" % i, end="")
    for j in range(dim):
      print(fmt % cm[i][j], end="")
    print("")
  print("------------")
  print("predicted    ", end="")
  for j in range(dim):
    print(fmt % j, end="")
  print("")

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin People predict politics using LightGBM ")
  print("Predict politics from sex, age, State, income ")
  np.random.seed(1)

  # 1. load data that looks like:
  # sex, age, State, income, politics
  # 1, 24, 0, 29500.00, 2
  # 0, 39, 2, 51200.00, 1
  # . . .
  print("\nLoading train and test data ")
  train_file = ".\\Data\\people_train.txt"
  train_x = np.loadtxt(train_file, usecols=[0,1,2,3],
    delimiter=",", comments="#", dtype=np.float64)
  train_y = np.loadtxt(train_file, usecols=4,
    delimiter=",", comments="#", dtype=np.int64)

  test_file = ".\\Data\\people_test.txt"
  test_x = np.loadtxt(test_file, usecols=[0,1,2,3],
    delimiter=",", comments="#", dtype=np.float64)
  test_y = np.loadtxt(test_file, usecols=4,
    delimiter=",", comments="#", dtype=np.int64)

  np.set_printoptions(precision=0, suppress=True,
    floatmode='fixed')
  print("\nFirst few train data: ")
  for i in range(3):
    print(train_x[i], end="")
    print("  | " + str(train_y[i]))
  print(". . . ")

  # 2. create and train model
  print("\nCreating and training LGBM multi-class model ")
  # model params:
  # https://lightgbm.readthedocs.io/en/latest/pythonapi/
  #   lightgbm.LGBMClassifier.html
  # core params: 
  # https://lightgbm.readthedocs.io/en/latest/Parameters.html
  params = {
    # 'objective': 'multiclass',  # not needed
    'boosting_type': 'gbdt',  # default
    'num_leaves': 31,  # default
    'max_depth': -1,  # default (unlimited)
    'n_estimators': 50,  # default = 100
    'learning_rate': 0.05,  # default = 0.10
    'min_data_in_leaf': 5,  # default = 20
    'random_state': 0,
    'verbosity': -1  # -1 = fatal only; default = 1 (info)
  }
  model = lgbm.LGBMClassifier(**params)  # scikit API
  model.fit(train_x, train_y)
  print("Done ")

  # 3. evaluate model
  print("\nEvaluating model ")

  # 3a. using a coarse function
  train_acc = accuracy(model, train_x, train_y)
  print("\nAccuracy on training data = %0.4f " % train_acc)
  test_acc = accuracy(model, test_x, test_y)
  print("Accuracy on test data = %0.4f " % test_acc)

  # 3b. using a detailed function
  print("\nAccuracy on test data: ")
  show_accuracy(model, test_x, test_y, n_classes=3)

  # 3c. using a confusion matrix
  print("\nConfusion matrix for test data: ")
  cm = confusion_matrix_multi(model, test_x,
    test_y, n_classes=3)
  show_confusion(cm)

  # # confusion matrix using scikit module
  # from sklearn.metrics import confusion_matrix
  # pred_y = model.predict(test_x)  # all predicteds
  # cm = confusion_matrix(test_y, pred_y)
  # print(cm)

  # # detailed report using scikit
  # from sklearn.metrics import classification_report
  # pred_y = model.predict(test_x)  # all predicteds
  # report = classification_report(test_y, pred_y,
  #  labels=[0, 1, 2])
  # print(report)

  # 4. use model
  print("\nPredicting politics for M 35 Oklahoma $55,000 ")
  print("(0 = conservative, 1 = moderate, 2 = liberal) ")
  x = np.array([[0, 35, 2, 55000.00]], dtype=np.float64)
  pred = model.predict(x)
  print("\nPredicted politics = " + str(pred[0]))

  # 5. save model
  import pickle
  print("\nSaving model ")
  pth = ".\\Models\\politics_model.pkl"
  with open(pth, "wb") as f:
    pickle.dump(model, f)

  # with open(pth, "rb") as f:
  #   model2 = pickle.load(f)
  #
  # x = np.array([[0, 35, 2, 55000.00]], dtype=np.float64)
  # pred = model2.predict(x)
  # print("\nPredicted politics = " + str(pred[0]))

  print("\nEnd demo ")

if __name__ == "__main__":
  main()

Training data:

# people_train.txt
# sex (M = 0, F = 1)
# age
# State (Michigan = 0, Nebraska = 1, Oklahoma = 2)
# income
# politics (conservative = 0, moderate = 1, liberal = 2)
#
1, 24, 0, 29500.00, 2
0, 39, 2, 51200.00, 1
1, 63, 1, 75800.00, 0
0, 36, 0, 44500.00, 1
1, 27, 1, 28600.00, 2
1, 50, 1, 56500.00, 1
1, 50, 2, 55000.00, 1
0, 19, 2, 32700.00, 0
1, 22, 1, 27700.00, 1
0, 39, 2, 47100.00, 2
1, 34, 0, 39400.00, 1
0, 22, 0, 33500.00, 0
1, 35, 2, 35200.00, 2
0, 33, 1, 46400.00, 1
1, 45, 1, 54100.00, 1
1, 42, 1, 50700.00, 1
0, 33, 1, 46800.00, 1
1, 25, 2, 30000.00, 1
0, 31, 1, 46400.00, 0
1, 27, 0, 32500.00, 2
1, 48, 0, 54000.00, 1
0, 64, 1, 71300.00, 2
1, 61, 1, 72400.00, 0
1, 54, 2, 61000.00, 0
1, 29, 0, 36300.00, 0
1, 50, 2, 55000.00, 1
1, 55, 2, 62500.00, 0
1, 40, 0, 52400.00, 0
1, 22, 0, 23600.00, 2
1, 68, 1, 78400.00, 0
0, 60, 0, 71700.00, 2
0, 34, 2, 46500.00, 1
0, 25, 2, 37100.00, 0
0, 31, 1, 48900.00, 1
1, 43, 2, 48000.00, 1
1, 58, 1, 65400.00, 2
0, 55, 1, 60700.00, 2
0, 43, 1, 51100.00, 1
0, 43, 2, 53200.00, 1
0, 21, 0, 37200.00, 0
1, 55, 2, 64600.00, 0
1, 64, 1, 74800.00, 0
0, 41, 0, 58800.00, 1
1, 64, 2, 72700.00, 0
0, 56, 2, 66600.00, 2
1, 31, 2, 36000.00, 1
0, 65, 2, 70100.00, 2
1, 55, 2, 64300.00, 0
0, 25, 0, 40300.00, 0
1, 46, 2, 51000.00, 1
0, 36, 0, 53500.00, 0
1, 52, 1, 58100.00, 1
1, 61, 2, 67900.00, 0
1, 57, 2, 65700.00, 0
0, 46, 1, 52600.00, 1
0, 62, 0, 66800.00, 2
1, 55, 2, 62700.00, 0
0, 22, 2, 27700.00, 1
0, 50, 0, 62900.00, 0
0, 32, 1, 41800.00, 1
0, 21, 2, 35600.00, 0
1, 44, 1, 52000.00, 1
1, 46, 1, 51700.00, 1
1, 62, 1, 69700.00, 0
1, 57, 1, 66400.00, 0
0, 67, 2, 75800.00, 2
1, 29, 0, 34300.00, 2
1, 53, 0, 60100.00, 0
0, 44, 0, 54800.00, 1
1, 46, 1, 52300.00, 1
0, 20, 1, 30100.00, 1
0, 38, 0, 53500.00, 1
1, 50, 1, 58600.00, 1
1, 33, 1, 42500.00, 1
0, 33, 1, 39300.00, 1
1, 26, 1, 40400.00, 0
1, 58, 0, 70700.00, 0
1, 43, 2, 48000.00, 1
0, 46, 0, 64400.00, 0
1, 60, 0, 71700.00, 0
0, 42, 0, 48900.00, 1
0, 56, 2, 56400.00, 2
0, 62, 1, 66300.00, 2
0, 50, 0, 64800.00, 1
1, 47, 2, 52000.00, 1
0, 67, 1, 80400.00, 2
0, 40, 2, 50400.00, 1
1, 42, 1, 48400.00, 1
1, 64, 0, 72000.00, 0
0, 47, 0, 58700.00, 2
1, 45, 1, 52800.00, 1
0, 25, 2, 40900.00, 0
1, 38, 0, 48400.00, 0
1, 55, 2, 60000.00, 1
0, 44, 0, 60600.00, 1
1, 33, 0, 41000.00, 1
1, 34, 2, 39000.00, 1
1, 27, 1, 33700.00, 2
1, 32, 1, 40700.00, 1
1, 42, 2, 47000.00, 1
0, 24, 2, 40300.00, 0
1, 42, 1, 50300.00, 1
1, 25, 2, 28000.00, 2
1, 51, 1, 58000.00, 1
0, 55, 1, 63500.00, 2
1, 44, 0, 47800.00, 2
0, 18, 0, 39800.00, 0
0, 67, 1, 71600.00, 2
1, 45, 2, 50000.00, 1
1, 48, 0, 55800.00, 1
0, 25, 1, 39000.00, 1
0, 67, 0, 78300.00, 1
1, 37, 2, 42000.00, 1
0, 32, 0, 42700.00, 1
1, 48, 0, 57000.00, 1
0, 66, 2, 75000.00, 2
1, 61, 0, 70000.00, 0
0, 58, 2, 68900.00, 1
1, 19, 0, 24000.00, 2
1, 38, 2, 43000.00, 1
0, 27, 0, 36400.00, 1
1, 42, 0, 48000.00, 1
1, 60, 0, 71300.00, 0
0, 27, 2, 34800.00, 0
1, 29, 1, 37100.00, 0
0, 43, 0, 56700.00, 1
1, 48, 0, 56700.00, 1
1, 27, 2, 29400.00, 2
0, 44, 0, 55200.00, 0
1, 23, 1, 26300.00, 2
0, 36, 1, 53000.00, 2
1, 64, 2, 72500.00, 0
1, 29, 2, 30000.00, 2
0, 33, 0, 49300.00, 1
0, 66, 1, 75000.00, 2
0, 21, 2, 34300.00, 0
1, 27, 0, 32700.00, 2
1, 29, 0, 31800.00, 2
0, 31, 0, 48600.00, 1
1, 36, 2, 41000.00, 1
1, 49, 1, 55700.00, 1
0, 28, 0, 38400.00, 0
0, 43, 2, 56600.00, 1
0, 46, 1, 58800.00, 1
1, 57, 0, 69800.00, 0
0, 52, 2, 59400.00, 1
0, 31, 2, 43500.00, 1
0, 55, 0, 62000.00, 2
1, 50, 0, 56400.00, 1
1, 48, 1, 55900.00, 1
0, 22, 2, 34500.00, 0
1, 59, 2, 66700.00, 0
1, 34, 0, 42800.00, 2
0, 64, 0, 77200.00, 2
1, 29, 2, 33500.00, 2
0, 34, 1, 43200.00, 1
0, 61, 0, 75000.00, 2
1, 64, 2, 71100.00, 0
0, 29, 0, 41300.00, 0
1, 63, 1, 70600.00, 0
0, 29, 1, 40000.00, 0
0, 51, 0, 62700.00, 1
0, 24, 2, 37700.00, 0
1, 48, 1, 57500.00, 1
1, 18, 0, 27400.00, 0
1, 18, 0, 20300.00, 2
1, 33, 1, 38200.00, 2
0, 20, 2, 34800.00, 0
1, 29, 2, 33000.00, 2
0, 44, 2, 63000.00, 0
0, 65, 2, 81800.00, 0
0, 56, 0, 63700.00, 2
0, 52, 2, 58400.00, 1
0, 29, 1, 48600.00, 0
0, 47, 1, 58900.00, 1
1, 68, 0, 72600.00, 2
1, 31, 2, 36000.00, 1
1, 61, 1, 62500.00, 2
1, 19, 1, 21500.00, 2
1, 38, 2, 43000.00, 1
0, 26, 0, 42300.00, 0
1, 61, 1, 67400.00, 0
1, 40, 0, 46500.00, 1
0, 49, 0, 65200.00, 1
1, 56, 0, 67500.00, 0
0, 48, 1, 66000.00, 1
1, 52, 0, 56300.00, 2
0, 18, 0, 29800.00, 0
0, 56, 2, 59300.00, 2
0, 52, 1, 64400.00, 1
0, 18, 1, 28600.00, 1
0, 58, 0, 66200.00, 2
0, 39, 1, 55100.00, 1
0, 46, 0, 62900.00, 1
0, 40, 1, 46200.00, 1
0, 60, 0, 72700.00, 2
1, 36, 1, 40700.00, 2
1, 44, 0, 52300.00, 1
1, 28, 0, 31300.00, 2
1, 54, 2, 62600.00, 0

Test data:

# people_test.txt
#
0, 51, 0, 61200.00, 1
0, 32, 1, 46100.00, 1
1, 55, 0, 62700.00, 0
1, 25, 2, 26200.00, 2
1, 33, 2, 37300.00, 2
0, 29, 1, 46200.00, 0
1, 65, 0, 72700.00, 0
0, 43, 1, 51400.00, 1
0, 54, 1, 64800.00, 2
1, 61, 1, 72700.00, 0
1, 52, 1, 63600.00, 0
1, 30, 1, 33500.00, 2
1, 29, 0, 31400.00, 2
0, 47, 2, 59400.00, 1
1, 39, 1, 47800.00, 1
1, 47, 2, 52000.00, 1
0, 49, 0, 58600.00, 1
0, 63, 2, 67400.00, 2
0, 30, 0, 39200.00, 0
0, 61, 2, 69600.00, 2
0, 47, 2, 58700.00, 1
1, 30, 2, 34500.00, 2
0, 51, 2, 58000.00, 1
0, 24, 0, 38800.00, 1
0, 49, 0, 64500.00, 1
1, 66, 2, 74500.00, 0
0, 65, 0, 76900.00, 0
0, 46, 1, 58000.00, 0
0, 45, 2, 51800.00, 1
0, 47, 0, 63600.00, 0
0, 29, 0, 44800.00, 0
0, 57, 2, 69300.00, 2
0, 20, 0, 28700.00, 2
0, 35, 0, 43400.00, 1
0, 61, 2, 67000.00, 2
0, 31, 2, 37300.00, 1
1, 18, 0, 20800.00, 2
1, 26, 2, 29200.00, 2
0, 28, 0, 36400.00, 2
0, 59, 2, 69400.00, 2
Posted in Machine Learning

“Data Anomaly Detection Using a Neural Autoencoder with C#” in Visual Studio Magazine

I wrote an article titled “Data Anomaly Detection Using a Neural Autoencoder with C#” in the April 2024 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/Articles/2024/04/15/data-anomaly-detection.aspx.

Data anomaly detection is the process of examining a set of source data to find data items that are different in some way from the majority of the source items. My article explains how to use a neural autoencoder implemented using raw C# to find anomalous data items.

My demo program uses a synthetic dataset that has 240 items. The raw data looks like:

F  24  michigan  29500.00  liberal
M  39  oklahoma  51200.00  moderate
F  63  nebraska  75800.00  conservative
M  36  michigan  44500.00  moderate
F  27  nebraska  28600.00  liberal
. . .

Each line of data represents a person. The fields are sex (male, female), age, State (Michigan, Nebraska, Oklahoma), income, and political leaning (conservative, moderate, liberal).

The result is that the data item with the largest reconstruction error is (M, 36, nebraska, $53,000.00, liberal), which has encoded and normalized form (0.00000, 0.36000, 0.00000, 1.00000, 0.00000, 0.53000, 0.00000, 0.00000, 1.00000).

The predicted output is (-0.00122, 0.40366, -0.00134, 0.99657, 0.00477, 0.49658, 0.01607, -0.01048, 0.99440). This indicates that the anomalous data item has an age value that’s a bit too small (actual 36 versus a predicted value of about 40) and an income value that’s a bit too large (actual $53,000 versus a predicted value of about $49,658).
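
The reconstruction error for an item is computed from the difference between the encoded input and the autoencoder's output. A quick NumPy check using the two vectors above (sum of squared differences is one common definition; the article's exact error function may differ) shows that the age and income components dominate the error:

import numpy as np
actual = np.array([0.00000, 0.36000, 0.00000, 1.00000, 0.00000,
  0.53000, 0.00000, 0.00000, 1.00000])
predicted = np.array([-0.00122, 0.40366, -0.00134, 0.99657, 0.00477,
  0.49658, 0.01607, -0.01048, 0.99440])
sq_diffs = (actual - predicted) ** 2
print(sq_diffs)          # idx 1 (age) and idx 5 (income) are largest
print(np.sum(sq_diffs))  # sum of squared differences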

The neural autoencoder anomaly detection technique presented in the article is just one of many ways to look for data anomalies. The technique assumes you are working with tabular data, such as log files. Working with image data, working with time series data, and working with natural language data, all require more specialized techniques.



In many science fiction movies, acting intelligent is anomalous behavior.

Left: In “Deep Blue Sea” (1999), scientists sedate a super-intelligent, genetically enhanced shark. Choice A = Leave it alone. Choice B = Go poke it to see if it’s really sedated or just pretending.

Center: In “Alien” (1979), a space crew finds an abandoned alien ship with a cargo full of creepy, menacing egg-like pods. Choice A = Get away quickly. Choice B = Go poke one, and when it slowly opens, stick your helmet with an incredibly fragile glass faceplate directly in front of the pod.

Right: In “Life” (2017), a space station crew retrieves a probe to Mars that has an unknown life form. Choice A = Assume it might be dangerous, keep it isolated, and leave it alone until it can be transferred to a secure facility. Choice B = Assume it’s friendly, give it a cute name, and poke it with your hand covered only by a cheap plastic glove.


Posted in Machine Learning | Leave a comment

PyTorch TransformerEncoder Reconstruction Error Anomaly Detection for Ordered Data

A fairly well known anomaly detection technique uses a neural encoder-decoder (aka autoencoder) combined with reconstruction error. A few weeks ago, I experimented by inserting a TransformerEncoder module into such a system and the results seem promising.

However, transformer architecture is really designed for input vectors that have an inherent ordering — typically sentences. So, I created some synthetic medical data that has order. I made synthetic patient data that looks like:

0.1668, 0.2881, 0.1000, 0.4209, 0.2587, 0.6369, 0.5745, 0.6382, 0.4587, 0.3155, 0.1677, 0.3741
0.0818, 0.3512, 0.1110, 0.5682, 0.3669, 0.8235, 0.5562, 0.5792, 0.6203, 0.4873, 0.1254, 0.3769
0.3506, 0.3578, 0.1340, 0.3156, 0.2679, 0.9513, 0.5393, 0.6684, 0.6832, 0.3133, 0.2768, 0.2262
. . .

Each line of the 200-item dataset represents a patient. The 12 values on each line are some sort of hypothetical reading taken every hour for 12 hours (or every 2 hours for 24 hours, etc.). The idea of using synthetic medical data came from my colleague Paige R.
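
The post doesn't depend on exactly how the synthetic data was generated, but one plausible way to get ordered readings like these is a fixed base profile plus per-patient random noise. A hypothetical sketch (the base profile and noise range are my guesses, not the actual generation method):

import numpy as np
rng = np.random.default_rng(1)
base = np.array([0.15, 0.20, 0.28, 0.50, 0.40, 0.80,
  0.70, 0.70, 0.55, 0.30, 0.20, 0.30])  # hypothetical hourly profile
patients = base + rng.uniform(-0.12, 0.12, size=(200, 12))
np.savetxt("medical_data_200.txt", patients, fmt="%0.4f",
  delimiter=", ")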

Next, I put together a PyTorch program to create an encoder-decoder network that predicts its input. Data items that aren’t reconstructed closely are anomalies, at least according to the model.

The heart of the program is:

class Transformer_Net(T.nn.Module):
  def __init__(self):
    # 12 numeric inputs: no exact word embedding equivalent
    # pseudo embed_dim = 4
    # seq_len = 12
    super(Transformer_Net, self).__init__()

    self.fc1 = T.nn.Linear(12, 12*4)  # pseudo-embedding

    self.pos_enc = \
      PositionalEncoding(4, dropout=0.00)  # positional

    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=4,
      nhead=2, dim_feedforward=100, 
      batch_first=True)  # d_model divisible by nhead

    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=6)

    self.dec1 = T.nn.Linear(48, 18)
    self.dec2 = T.nn.Linear(18, 12)

    # use default weight initialization

  def forward(self, x):
    # x is Size([bs, 12])
    z = T.tanh(self.fc1(x))   # [bs, 48]
    z = z.reshape(-1, 12, 4)  # [bs, 12, 4] 
    z = self.pos_enc(z)       # [bs, 12, 4]
    z = self.trans_enc(z)     # [bs, 12, 4]

    z = z.reshape(-1, 48)              # [bs, 48]
    z = T.tanh(self.dec1(z))           # [bs, 18]
    z = self.dec2(z)  # no activation  # [bs, 12]
  
    return z

The architecture is very complicated. Briefly, each numeric input is mapped to a pseudo-embedding vector with 4 values. Then positional encoding is added so the transformer knows the order of the inputs. The data is converted to 3D to accommodate the TransformerEncoder requirement. The output of the TransformerEncoder is reshaped back to 2D and then fed to two Linear fully connected layers, designed so that the final output shape matches the input shape. Whew!
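
A quick shape sanity check, assuming the Transformer_Net and PositionalEncoding classes from the full program below are defined:

net = Transformer_Net()
dummy = T.randn(5, 12)  # batch of 5 patients, 12 readings each
out = net(dummy)
print(out.shape)        # torch.Size([5, 12]) -- matches the input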

One architecture alternative I want to explore concerns the numeric embedding where each input reading maps to four values. My implementation really isn’t an embedding because I use a Linear layer, which is fully connected. I want to try a true embedding layer. See https://jamesmccaffrey.wordpress.com/2023/04/20/anomaly-detection-for-tabular-data-using-a-pytorch-transformer-with-numeric-embedding/.

After the model has been trained, I invoke an analyze() function that feeds each of the 200 data items to the model, fetches the output, and measures the difference between input and output. I used a custom error function that is the normalized sum of squared differences — close to but not quite Euclidean distance.

The result looks like:

Analyzing data for largest reconstruction error

Largest reconstruction idx: [140]

Largest reconstruction item:
[ 0.0362  0.0516  0.1421  0.3691  0.2506  0.9113
  0.5158  0.5966  0.6516  0.4894  0.2422  0.4905]

Largest reconstruction error: 0.0248

Its reconstruction =
[ 0.1870  0.2014  0.3200  0.5255  0.4023  0.7735
  0.6971  0.7262  0.4979  0.2906  0.2078  0.2887]

This technique seems very promising, but there are a lot of questions that need to be explored.



Putting together a PyTorch program is like putting together a jigsaw puzzle — it’s difficult to make all the pieces fit together. Jigsaw puzzle manufacturers use the same cutting template for different puzzle images. This means you can combine jigsaw puzzles if you have a lot of patience.


Demo program:

# medical_trans_anomaly.py
# Transformer based reconstruction error anomaly detection
# PyTorch 2.2.1-CPU Anaconda3-2023.09-0  Python 3.11.5
# Windows 10/11

import numpy as np
import torch as T

device = T.device('cpu') 
T.set_num_threads(1)

# -----------------------------------------------------------

class PatientDataset(T.utils.data.Dataset):
  # 12 columns
  def __init__(self, src_file):
    tmp_x = np.loadtxt(src_file, usecols=range(0,12),
      delimiter=",", comments="#", dtype=np.float32)
    self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx, :]  # row idx, all cols
    sample = { 'predictors' : preds }  # as Dictionary
    return sample  

# -----------------------------------------------------------

class PositionalEncoding(T.nn.Module):  # documentation code
  def __init__(self, d_model: int, dropout: float=0.1,
   max_len: int=5000):
    super(PositionalEncoding, self).__init__()  # old syntax
    self.dropout = T.nn.Dropout(p=dropout)
    pe = T.zeros(max_len, d_model)  # like 10x4
    position = \
      T.arange(0, max_len, dtype=T.float).unsqueeze(1)
    div_term = T.exp(T.arange(0, d_model, 2).float() * \
      (-np.log(10_000.0) / d_model))
    pe[:, 0::2] = T.sin(position * div_term)
    pe[:, 1::2] = T.cos(position * div_term)
    pe = pe.unsqueeze(0).transpose(0, 1)
    self.register_buffer('pe', pe)  # allows state-save

  def forward(self, x):
    x = x + self.pe[:x.size(0), :]
    return self.dropout(x)

# -----------------------------------------------------------

class Transformer_Net(T.nn.Module):
  def __init__(self):
    # 12 numeric inputs: no exact word embedding equivalent
    # pseudo embed_dim = 4
    # seq_len = 12
    super(Transformer_Net, self).__init__()

    self.fc1 = T.nn.Linear(12, 12*4)  # pseudo-embedding

    self.pos_enc = \
      PositionalEncoding(4, dropout=0.00)  # positional

    self.enc_layer = T.nn.TransformerEncoderLayer(d_model=4,
      nhead=2, dim_feedforward=100, 
      batch_first=True)  # d_model divisible by nhead

    self.trans_enc = T.nn.TransformerEncoder(self.enc_layer,
      num_layers=6)

    self.dec1 = T.nn.Linear(48, 18)
    self.dec2 = T.nn.Linear(18, 12)

    # use default weight initialization

  def forward(self, x):
    # x is Size([bs, 12])
    z = T.tanh(self.fc1(x))   # [bs, 48]
    z = z.reshape(-1, 12, 4)  # [bs, 12, 4] 
    z = self.pos_enc(z)       # [bs, 12, 4]
    z = self.trans_enc(z)     # [bs, 12, 4]

    z = z.reshape(-1, 48)              # [bs, 48]
    z = T.tanh(self.dec1(z))           # [bs, 18]
    z = self.dec2(z)  # no activation  # [bs, 12]
  
    return z

# -----------------------------------------------------------

def analyze_error(model, ds):
  largest_err = 0.0
  worst_x = None
  worst_y = None
  worst_idx = 0
  n_features = len(ds[0]['predictors'])

  for i in range(len(ds)):
    X = ds[i]['predictors']
    with T.no_grad():
      Y = model(X)  # should be same as X
    err = T.sum((X-Y)*(X-Y)).item()  # SSE all features
    err = err / n_features           # sort of norm'ed SSE 

    if err > largest_err:
      largest_err = err
      worst_x = X
      worst_y = Y
      worst_idx = i

  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print("\nLargest reconstruction idx: " + str(worst_idx))
  print("\nLargest reconstruction item: ")
  print(worst_x.numpy())
  print("\nLargest reconstruction error: %0.4f" % largest_err)
  print("\nIts reconstruction = " )
  print(worst_y.numpy())

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin patient transformer-based anomaly detect ")
  T.manual_seed(0)
  np.random.seed(0)
  
  # 1. create DataLoader objects
  print("\nCreating Patient Dataset ")

  data_file = ".\\Data\\medical_data_200.txt"
  data_ds = PatientDataset(data_file)  # 200 rows

  bat_size = 10
  data_ldr = T.utils.data.DataLoader(data_ds,
    batch_size=bat_size, shuffle=True)

  # 2. create network
  print("\nCreating Transformer encoder-decoder network ")
  net = Transformer_Net().to(device)

# -----------------------------------------------------------

  # 3. train autoencoder model
  max_epochs = 100
  ep_log_interval = 10
  # lrn_rate = 0.005
  lrn_rate = 0.010

  loss_func = T.nn.MSELoss()
  optimizer = T.optim.Adam(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = Adam")
  print("lrn_rate = %0.3f " % lrn_rate)
  print("max_epochs = %3d " % max_epochs)
  
  print("\nStarting training")
  net.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch

    for (batch_idx, batch) in enumerate(data_ldr):
      X = batch['predictors'] 
      Y = batch['predictors'] 

      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %4d  |  loss = %0.4f" % \
       (epoch, epoch_loss))
  print("Done ")

# -----------------------------------------------------------

  # 4. find item with largest reconstruction error
  print("\nAnalyzing data for largest reconstruction error ")
  net.eval()
  analyze_error(net, data_ds)

  print("\nEnd transformer autoencoder anomaly demo ")

if __name__ == "__main__":
  main()

Synthetic data:

# medical_data_200.txt
#
0.1668, 0.2881, 0.1000, 0.4209, 0.2587, 0.6369, 0.5745, 0.6382, 0.4587, 0.3155, 0.1677, 0.3741
0.0818, 0.3512, 0.1110, 0.5682, 0.3669, 0.8235, 0.5562, 0.5792, 0.6203, 0.4873, 0.1254, 0.3769
0.3506, 0.3578, 0.1340, 0.3156, 0.2679, 0.9513, 0.5393, 0.6684, 0.6832, 0.3133, 0.2768, 0.2262
0.2746, 0.3339, 0.1073, 0.6001, 0.5955, 0.8993, 0.6122, 0.8157, 0.3413, 0.2792, 0.3634, 0.2174
0.1151, 0.0520, 0.1077, 0.5715, 0.2847, 0.7062, 0.6966, 0.5213, 0.5296, 0.1587, 0.2357, 0.3799
0.0409, 0.1656, 0.3778, 0.4657, 0.2200, 0.8144, 0.7655, 0.7060, 0.6778, 0.3346, 0.3614, 0.1550
0.0557, 0.3230, 0.2591, 0.3661, 0.5710, 0.7391, 0.8003, 0.7904, 0.6533, 0.3495, 0.3004, 0.2396
0.1080, 0.3584, 0.2712, 0.6859, 0.4654, 0.8487, 0.5459, 0.8798, 0.4800, 0.3314, 0.1633, 0.1948
0.3614, 0.2295, 0.1011, 0.5469, 0.3307, 0.8108, 0.8544, 0.6429, 0.6634, 0.3493, 0.0063, 0.4718
0.2764, 0.3989, 0.1689, 0.3549, 0.5730, 0.8787, 0.5264, 0.8022, 0.6016, 0.4692, 0.2846, 0.1497
0.0080, 0.0105, 0.1113, 0.3985, 0.5440, 0.8155, 0.7211, 0.8368, 0.3497, 0.2117, 0.2343, 0.4878
0.2244, 0.0075, 0.4203, 0.3932, 0.5228, 0.7551, 0.8454, 0.7988, 0.5225, 0.1546, 0.0240, 0.1485
0.0178, 0.0430, 0.1903, 0.5852, 0.4239, 0.6050, 0.5288, 0.8869, 0.5272, 0.1813, 0.1009, 0.3975
0.0782, 0.2325, 0.4880, 0.6387, 0.2959, 0.7975, 0.7480, 0.8316, 0.3627, 0.1074, 0.0280, 0.2945
0.2425, 0.2275, 0.2269, 0.6954, 0.4319, 0.7521, 0.7204, 0.7981, 0.5677, 0.2060, 0.0265, 0.2480
0.2519, 0.0841, 0.4011, 0.3266, 0.3041, 0.9219, 0.5774, 0.7558, 0.5099, 0.4699, 0.1053, 0.1264
0.2940, 0.3089, 0.4631, 0.6728, 0.2056, 0.6937, 0.7467, 0.8796, 0.6801, 0.3227, 0.3662, 0.3566
0.1560, 0.1944, 0.3417, 0.5198, 0.5705, 0.9675, 0.6580, 0.8853, 0.3696, 0.1505, 0.0540, 0.3023
0.0086, 0.3792, 0.4308, 0.3060, 0.2705, 0.7328, 0.5524, 0.8238, 0.4379, 0.4760, 0.2328, 0.4515
0.3379, 0.3622, 0.2840, 0.5185, 0.5194, 0.7143, 0.6961, 0.7396, 0.3062, 0.3374, 0.1735, 0.4229
0.1261, 0.3572, 0.3311, 0.3736, 0.5152, 0.8448, 0.5216, 0.6681, 0.5716, 0.4674, 0.0002, 0.4907
0.1506, 0.3895, 0.3419, 0.6315, 0.4299, 0.8512, 0.6142, 0.7347, 0.6000, 0.4433, 0.3020, 0.3792
0.3458, 0.1291, 0.3683, 0.4803, 0.3528, 0.7643, 0.6606, 0.6270, 0.5488, 0.2721, 0.3895, 0.3711
0.0794, 0.1707, 0.2373, 0.6191, 0.5520, 0.9615, 0.7651, 0.6081, 0.4009, 0.4420, 0.2111, 0.4209
0.2290, 0.2933, 0.3076, 0.6084, 0.4275, 0.7863, 0.6371, 0.5273, 0.4512, 0.1319, 0.3931, 0.1726
0.3247, 0.3500, 0.3754, 0.5278, 0.2644, 0.7868, 0.6381, 0.5900, 0.5370, 0.2249, 0.3665, 0.4639
0.1028, 0.0444, 0.1772, 0.4998, 0.4914, 0.6833, 0.5992, 0.8407, 0.4663, 0.3467, 0.0935, 0.1408
0.2063, 0.1909, 0.1611, 0.5487, 0.4176, 0.8617, 0.5578, 0.8006, 0.3888, 0.3077, 0.3141, 0.1089
0.1297, 0.3492, 0.4379, 0.5154, 0.5466, 0.9799, 0.8306, 0.8416, 0.3395, 0.3605, 0.2814, 0.3441
0.3198, 0.0138, 0.4081, 0.5927, 0.3039, 0.7028, 0.7529, 0.6381, 0.6186, 0.2785, 0.3131, 0.4962
0.1201, 0.0572, 0.4605, 0.5166, 0.5899, 0.8546, 0.8976, 0.7184, 0.5106, 0.1542, 0.1423, 0.1105
0.0642, 0.2983, 0.1122, 0.4466, 0.5449, 0.8771, 0.7764, 0.5755, 0.4768, 0.3326, 0.3959, 0.1816
0.0991, 0.1049, 0.4001, 0.4828, 0.2228, 0.8034, 0.5848, 0.8194, 0.4189, 0.1110, 0.2374, 0.4375
0.1524, 0.2999, 0.3045, 0.5164, 0.5838, 0.9216, 0.5129, 0.7838, 0.4860, 0.4790, 0.0886, 0.2068
0.0326, 0.1714, 0.1436, 0.5535, 0.5212, 0.8787, 0.8065, 0.6370, 0.6383, 0.2715, 0.3296, 0.3506
0.0574, 0.0314, 0.1073, 0.3267, 0.3834, 0.6453, 0.5111, 0.8019, 0.4579, 0.3988, 0.1810, 0.2800
0.1912, 0.1896, 0.4213, 0.4610, 0.5619, 0.6148, 0.8095, 0.5503, 0.5474, 0.1041, 0.2155, 0.1012
0.3805, 0.3622, 0.4184, 0.6661, 0.2582, 0.6631, 0.5751, 0.7490, 0.6623, 0.4960, 0.2844, 0.3927
0.3637, 0.1603, 0.1999, 0.3694, 0.2478, 0.9250, 0.5587, 0.6057, 0.6276, 0.2242, 0.3930, 0.2067
0.2135, 0.1258, 0.4643, 0.4466, 0.3734, 0.8049, 0.8756, 0.5124, 0.5868, 0.4564, 0.0109, 0.3088
0.1304, 0.3438, 0.3234, 0.5761, 0.3811, 0.8513, 0.6160, 0.5037, 0.5307, 0.2246, 0.2069, 0.4666
0.1706, 0.0990, 0.2485, 0.6727, 0.5747, 0.9377, 0.8681, 0.5912, 0.3350, 0.1909, 0.1258, 0.1699
0.2428, 0.1654, 0.4265, 0.3741, 0.4808, 0.6961, 0.7297, 0.6396, 0.3228, 0.1915, 0.2656, 0.2989
0.2076, 0.0699, 0.3283, 0.6987, 0.5267, 0.8377, 0.8904, 0.8606, 0.5382, 0.1130, 0.0374, 0.1261
0.1807, 0.1502, 0.4901, 0.3672, 0.5891, 0.9070, 0.8297, 0.7530, 0.5675, 0.2908, 0.0053, 0.2412
0.1968, 0.2920, 0.2875, 0.4830, 0.2551, 0.6044, 0.8033, 0.6280, 0.6938, 0.1881, 0.1355, 0.3096
0.3020, 0.1855, 0.1499, 0.4250, 0.4018, 0.8695, 0.8081, 0.5521, 0.3092, 0.3076, 0.3240, 0.1050
0.2690, 0.2747, 0.2797, 0.6659, 0.4577, 0.6021, 0.6938, 0.8437, 0.6322, 0.3597, 0.2695, 0.3314
0.1096, 0.2242, 0.3687, 0.4410, 0.5423, 0.6780, 0.7989, 0.6158, 0.6095, 0.2711, 0.3231, 0.2414
0.0855, 0.3069, 0.2235, 0.5933, 0.4978, 0.6886, 0.5856, 0.5796, 0.3570, 0.2508, 0.0107, 0.1444
0.2698, 0.3199, 0.1322, 0.3927, 0.2831, 0.9669, 0.7845, 0.7216, 0.4218, 0.4339, 0.1741, 0.4694
0.2824, 0.1912, 0.1505, 0.6904, 0.2639, 0.6810, 0.6725, 0.6617, 0.3587, 0.3917, 0.0755, 0.3576
0.3017, 0.0843, 0.3404, 0.5996, 0.4553, 0.8389, 0.6182, 0.7926, 0.6781, 0.2702, 0.3129, 0.1225
0.3341, 0.0769, 0.2580, 0.4200, 0.2320, 0.9619, 0.6481, 0.7123, 0.4976, 0.1529, 0.0826, 0.1305
0.2032, 0.1046, 0.2428, 0.3432, 0.5150, 0.6426, 0.8943, 0.5709, 0.5290, 0.1179, 0.3148, 0.1758
0.2112, 0.2960, 0.1600, 0.5204, 0.2866, 0.9037, 0.7892, 0.5706, 0.6448, 0.1079, 0.3441, 0.3236
0.1613, 0.3035, 0.3868, 0.6949, 0.3112, 0.6015, 0.8736, 0.8432, 0.5915, 0.3067, 0.2828, 0.4122
0.1500, 0.3081, 0.4002, 0.5453, 0.3607, 0.8789, 0.5012, 0.8100, 0.6586, 0.1957, 0.0483, 0.1881
0.1208, 0.3532, 0.3173, 0.4147, 0.2553, 0.7161, 0.7455, 0.6297, 0.4829, 0.2776, 0.3313, 0.2705
0.1383, 0.2700, 0.1886, 0.4869, 0.3259, 0.8507, 0.8509, 0.6791, 0.6138, 0.2828, 0.2625, 0.1527
0.1732, 0.3637, 0.3422, 0.6067, 0.4019, 0.7992, 0.8372, 0.5271, 0.5293, 0.4771, 0.2071, 0.1778
0.3392, 0.1007, 0.3803, 0.5161, 0.5795, 0.8497, 0.8352, 0.5032, 0.6957, 0.1311, 0.1289, 0.4785
0.0036, 0.3291, 0.4445, 0.4759, 0.3023, 0.9211, 0.6911, 0.5537, 0.6711, 0.4584, 0.1966, 0.4427
0.1674, 0.2734, 0.2592, 0.5023, 0.2758, 0.9860, 0.6177, 0.5414, 0.3577, 0.1056, 0.2864, 0.3258
0.3178, 0.2028, 0.4167, 0.5783, 0.5111, 0.7626, 0.7591, 0.5719, 0.4287, 0.1690, 0.1635, 0.1966
0.1628, 0.3901, 0.2281, 0.6930, 0.4545, 0.7500, 0.8430, 0.7478, 0.4008, 0.4171, 0.1732, 0.2430
0.1321, 0.2789, 0.2075, 0.6233, 0.3181, 0.8176, 0.6952, 0.8421, 0.6554, 0.1738, 0.2341, 0.4593
0.1784, 0.3687, 0.2116, 0.5435, 0.4730, 0.6913, 0.5055, 0.6667, 0.6754, 0.2372, 0.3119, 0.1699
0.1368, 0.0578, 0.3867, 0.5797, 0.4754, 0.7014, 0.7769, 0.5909, 0.4699, 0.2488, 0.1421, 0.1231
0.2527, 0.2829, 0.3454, 0.5593, 0.2680, 0.6598, 0.7057, 0.8501, 0.3736, 0.2851, 0.1716, 0.2989
0.0646, 0.1370, 0.2048, 0.6378, 0.5201, 0.7707, 0.7428, 0.5582, 0.5038, 0.2188, 0.3439, 0.3686
0.2534, 0.0499, 0.2882, 0.6946, 0.5793, 0.8580, 0.5607, 0.7557, 0.5263, 0.2875, 0.1712, 0.3397
0.3400, 0.3004, 0.3317, 0.6699, 0.2259, 0.9965, 0.5212, 0.5798, 0.4691, 0.1430, 0.2495, 0.1192
0.1138, 0.0244, 0.3814, 0.5674, 0.3514, 0.6753, 0.7988, 0.6362, 0.6181, 0.2952, 0.2103, 0.1114
0.2577, 0.1403, 0.1917, 0.4736, 0.3530, 0.7879, 0.8918, 0.6458, 0.6098, 0.3211, 0.3557, 0.2420
0.0982, 0.3644, 0.1174, 0.6803, 0.4226, 0.7505, 0.8980, 0.5233, 0.5067, 0.1124, 0.2285, 0.1722
0.2524, 0.3924, 0.4500, 0.4807, 0.4834, 0.9110, 0.6979, 0.7114, 0.3603, 0.2478, 0.0569, 0.3908
0.1908, 0.1796, 0.4544, 0.5110, 0.3636, 0.7076, 0.5288, 0.6673, 0.3103, 0.2165, 0.2014, 0.4864
0.0438, 0.2692, 0.3000, 0.6108, 0.2574, 0.6333, 0.6597, 0.8188, 0.3767, 0.4071, 0.1161, 0.1868
0.0067, 0.1595, 0.2524, 0.5637, 0.2284, 0.6610, 0.5066, 0.5455, 0.5607, 0.2611, 0.1284, 0.3232
0.3974, 0.3338, 0.3798, 0.6673, 0.2159, 0.6281, 0.6896, 0.6397, 0.6749, 0.2958, 0.2159, 0.4581
0.1787, 0.3508, 0.2014, 0.4095, 0.3313, 0.8190, 0.5881, 0.7686, 0.3571, 0.1376, 0.3481, 0.1947
0.1544, 0.2286, 0.3103, 0.3304, 0.5497, 0.9805, 0.8250, 0.6135, 0.5111, 0.2358, 0.2219, 0.4898
0.1247, 0.2675, 0.2304, 0.6098, 0.3303, 0.9559, 0.8007, 0.8051, 0.4878, 0.1843, 0.0166, 0.2287
0.0148, 0.2775, 0.3681, 0.4722, 0.5071, 0.8144, 0.5159, 0.5539, 0.3774, 0.2343, 0.0209, 0.3420
0.2048, 0.2470, 0.2729, 0.6391, 0.3816, 0.6062, 0.8492, 0.7625, 0.6292, 0.4807, 0.0204, 0.1940
0.0253, 0.1687, 0.4455, 0.3326, 0.3892, 0.6502, 0.8092, 0.8366, 0.3173, 0.2946, 0.0958, 0.4810
0.3776, 0.2456, 0.4894, 0.4379, 0.5591, 0.7738, 0.5943, 0.8763, 0.5737, 0.1260, 0.3482, 0.3806
0.2420, 0.2929, 0.2014, 0.5402, 0.5258, 0.6216, 0.5522, 0.8370, 0.5473, 0.3125, 0.0993, 0.2180
0.3491, 0.1687, 0.1258, 0.6588, 0.2814, 0.9305, 0.8527, 0.6947, 0.5394, 0.3109, 0.2499, 0.4420
0.1129, 0.3535, 0.3271, 0.3460, 0.2908, 0.8384, 0.5958, 0.5526, 0.3647, 0.4379, 0.2409, 0.4854
0.1383, 0.2383, 0.3396, 0.5463, 0.2237, 0.9001, 0.8793, 0.7139, 0.3770, 0.4012, 0.0029, 0.2313
0.3670, 0.2353, 0.4421, 0.5419, 0.5290, 0.9518, 0.6284, 0.5492, 0.5885, 0.2761, 0.0507, 0.3359
0.0144, 0.0801, 0.4153, 0.3048, 0.3213, 0.6086, 0.8990, 0.7328, 0.4174, 0.4716, 0.2028, 0.2819
0.2351, 0.1057, 0.2221, 0.4487, 0.2978, 0.8338, 0.7783, 0.5288, 0.6884, 0.4012, 0.3225, 0.4007
0.0320, 0.1927, 0.2783, 0.5690, 0.3795, 0.8817, 0.7727, 0.7789, 0.5474, 0.1604, 0.3043, 0.4124
0.3616, 0.0935, 0.1707, 0.4564, 0.3282, 0.9262, 0.7454, 0.8040, 0.4711, 0.1398, 0.0460, 0.2494
0.0775, 0.3283, 0.3399, 0.5755, 0.3964, 0.6353, 0.5940, 0.6846, 0.3794, 0.1102, 0.2918, 0.3900
0.1322, 0.3374, 0.2714, 0.6459, 0.4628, 0.8324, 0.5803, 0.7118, 0.6578, 0.2220, 0.3484, 0.4635
0.1319, 0.2732, 0.4597, 0.3303, 0.5514, 0.6763, 0.8399, 0.7668, 0.4377, 0.1606, 0.2541, 0.4391
0.3287, 0.2513, 0.4825, 0.5360, 0.2791, 0.7717, 0.6347, 0.8968, 0.4521, 0.4971, 0.2075, 0.1689
0.0298, 0.1481, 0.1494, 0.5538, 0.3656, 0.9964, 0.8720, 0.5597, 0.4580, 0.2846, 0.2244, 0.4121
0.1949, 0.1680, 0.2048, 0.6643, 0.2089, 0.9284, 0.5754, 0.7743, 0.4421, 0.4897, 0.0491, 0.1750
0.3558, 0.2334, 0.2237, 0.3003, 0.2910, 0.6582, 0.5814, 0.8585, 0.6492, 0.3801, 0.1882, 0.4309
0.1983, 0.1454, 0.2102, 0.6699, 0.3548, 0.7972, 0.6018, 0.8540, 0.4533, 0.2190, 0.2870, 0.1763
0.0473, 0.3348, 0.3977, 0.5362, 0.2972, 0.8493, 0.7553, 0.6310, 0.3270, 0.4522, 0.1840, 0.4055
0.1016, 0.2366, 0.2715, 0.4528, 0.2507, 0.6977, 0.5317, 0.6211, 0.5967, 0.3460, 0.2690, 0.1034
0.2714, 0.2013, 0.1924, 0.3700, 0.2740, 0.9377, 0.8930, 0.8655, 0.4389, 0.4121, 0.2186, 0.4266
0.1935, 0.2360, 0.4149, 0.3401, 0.4148, 0.7464, 0.7417, 0.8835, 0.4571, 0.2572, 0.3163, 0.3580
0.1576, 0.2756, 0.2616, 0.3544, 0.3803, 0.7338, 0.5872, 0.8703, 0.5759, 0.3395, 0.2987, 0.3168
0.2802, 0.3722, 0.4450, 0.3670, 0.3053, 0.6286, 0.8915, 0.5946, 0.5642, 0.1359, 0.0843, 0.3011
0.0420, 0.1555, 0.3152, 0.4357, 0.4224, 0.8147, 0.6562, 0.7785, 0.5714, 0.3749, 0.2246, 0.2432
0.2452, 0.3743, 0.3388, 0.6918, 0.3764, 0.8958, 0.5150, 0.8059, 0.5073, 0.1021, 0.1109, 0.3139
0.3072, 0.0212, 0.3196, 0.6204, 0.4598, 0.9726, 0.5299, 0.6107, 0.6677, 0.4060, 0.2399, 0.4332
0.3584, 0.3891, 0.4994, 0.3559, 0.2282, 0.6294, 0.5059, 0.8887, 0.3379, 0.4367, 0.2741, 0.2950
0.1387, 0.1415, 0.2015, 0.6644, 0.4903, 0.6104, 0.6846, 0.6125, 0.3116, 0.4539, 0.3084, 0.2319
0.3186, 0.1299, 0.2232, 0.6712, 0.5908, 0.8094, 0.8808, 0.8552, 0.5072, 0.2491, 0.2841, 0.2823
0.2421, 0.3962, 0.4096, 0.4337, 0.2356, 0.6740, 0.7107, 0.6668, 0.6203, 0.4733, 0.0711, 0.4440
0.3830, 0.3838, 0.1151, 0.3230, 0.2023, 0.7251, 0.5223, 0.6175, 0.4424, 0.4815, 0.1908, 0.2113
0.2012, 0.2573, 0.1445, 0.6032, 0.5408, 0.9377, 0.8562, 0.6527, 0.4775, 0.1406, 0.0903, 0.4885
0.1138, 0.3630, 0.4562, 0.6690, 0.2625, 0.9519, 0.7528, 0.5843, 0.4380, 0.1640, 0.1777, 0.1335
0.0661, 0.0789, 0.3764, 0.5295, 0.5493, 0.6979, 0.7518, 0.5138, 0.5065, 0.4453, 0.3937, 0.1039
0.1011, 0.0895, 0.1190, 0.3041, 0.3961, 0.6182, 0.6113, 0.7560, 0.4179, 0.2079, 0.2362, 0.2522
0.2810, 0.1984, 0.3533, 0.4414, 0.3148, 0.6533, 0.8748, 0.8219, 0.6752, 0.1742, 0.3733, 0.4743
0.1134, 0.1099, 0.3206, 0.3741, 0.3768, 0.6741, 0.8790, 0.6975, 0.6856, 0.3580, 0.1937, 0.4871
0.0573, 0.2529, 0.3648, 0.4735, 0.2237, 0.7971, 0.6861, 0.8227, 0.4027, 0.2566, 0.0961, 0.3746
0.3960, 0.0710, 0.4286, 0.3924, 0.2232, 0.6555, 0.8741, 0.8554, 0.4157, 0.4791, 0.3400, 0.2739
0.1872, 0.2519, 0.1632, 0.3059, 0.3062, 0.6062, 0.7698, 0.7206, 0.4287, 0.4121, 0.0583, 0.1980
0.1169, 0.0784, 0.1352, 0.6480, 0.2353, 0.8735, 0.5482, 0.5043, 0.5229, 0.4628, 0.3442, 0.2354
0.0109, 0.3203, 0.4224, 0.6474, 0.4679, 0.9231, 0.8590, 0.6815, 0.5231, 0.3025, 0.2768, 0.3732
0.2081, 0.3314, 0.3023, 0.6299, 0.3127, 0.6714, 0.8879, 0.7968, 0.4039, 0.3325, 0.3821, 0.1323
0.0334, 0.2477, 0.1898, 0.6061, 0.4273, 0.8665, 0.5431, 0.5337, 0.5500, 0.2639, 0.0349, 0.2484
0.2689, 0.0758, 0.4583, 0.6799, 0.5846, 0.8920, 0.6625, 0.7975, 0.4152, 0.2258, 0.2424, 0.3379
0.3515, 0.1018, 0.4065, 0.6764, 0.2004, 0.7904, 0.7628, 0.8373, 0.3735, 0.4425, 0.1457, 0.4569
0.0111, 0.0342, 0.4926, 0.5438, 0.3671, 0.6677, 0.7600, 0.5148, 0.4246, 0.2294, 0.2430, 0.3603
0.3384, 0.3710, 0.3642, 0.5313, 0.3595, 0.9866, 0.5616, 0.8580, 0.4244, 0.3194, 0.2728, 0.1946
0.0671, 0.2034, 0.4167, 0.5770, 0.2476, 0.9603, 0.6919, 0.8787, 0.5214, 0.1339, 0.0810, 0.4417
0.2824, 0.3580, 0.2317, 0.5112, 0.4602, 0.8377, 0.5926, 0.6707, 0.3992, 0.4382, 0.3947, 0.1266
0.2856, 0.1320, 0.3502, 0.4033, 0.4604, 0.7485, 0.6179, 0.8647, 0.6770, 0.3454, 0.0858, 0.4758
0.3002, 0.3004, 0.1847, 0.6321, 0.2960, 0.8526, 0.7843, 0.7970, 0.4561, 0.4039, 0.3612, 0.3350
0.0362, 0.0516, 0.1421, 0.3691, 0.2506, 0.9113, 0.5158, 0.5966, 0.6516, 0.4894, 0.2422, 0.4905
0.0174, 0.3794, 0.2245, 0.6196, 0.5243, 0.9440, 0.6834, 0.8723, 0.4032, 0.4738, 0.2476, 0.4942
0.0131, 0.2901, 0.3262, 0.4970, 0.3037, 0.7307, 0.5998, 0.5877, 0.6199, 0.3010, 0.0333, 0.4108
0.2135, 0.2958, 0.3062, 0.4600, 0.5945, 0.6113, 0.8731, 0.8723, 0.4564, 0.1858, 0.2477, 0.1712
0.3213, 0.1016, 0.2163, 0.6664, 0.5612, 0.8142, 0.8451, 0.6410, 0.6992, 0.2735, 0.1179, 0.1166
0.3959, 0.1860, 0.3938, 0.5563, 0.4892, 0.6204, 0.8680, 0.8707, 0.5208, 0.4856, 0.1124, 0.3409
0.1586, 0.2512, 0.2203, 0.6277, 0.2279, 0.6168, 0.5198, 0.5602, 0.4581, 0.4822, 0.0443, 0.3590
0.2110, 0.1413, 0.1793, 0.3882, 0.2175, 0.8853, 0.7615, 0.6775, 0.5876, 0.1440, 0.3755, 0.4391
0.2636, 0.1515, 0.2666, 0.4929, 0.5741, 0.9454, 0.6912, 0.7218, 0.6502, 0.4797, 0.2557, 0.3994
0.1406, 0.3672, 0.4347, 0.5208, 0.5471, 0.9399, 0.8234, 0.5523, 0.5144, 0.4603, 0.3083, 0.2683
0.3912, 0.1003, 0.2334, 0.6817, 0.5235, 0.9601, 0.5046, 0.6519, 0.3942, 0.2184, 0.2952, 0.3896
0.1847, 0.1461, 0.3339, 0.5135, 0.4202, 0.8462, 0.6583, 0.8087, 0.4005, 0.3623, 0.3842, 0.1014
0.2893, 0.0436, 0.3175, 0.5508, 0.2972, 0.9655, 0.7489, 0.5927, 0.6081, 0.1422, 0.2221, 0.1380
0.2324, 0.1012, 0.3598, 0.5863, 0.4097, 0.8630, 0.8253, 0.8230, 0.5479, 0.2804, 0.1632, 0.4499
0.2812, 0.0740, 0.3263, 0.6635, 0.2603, 0.9382, 0.8281, 0.6997, 0.3150, 0.1590, 0.3691, 0.1177
0.1490, 0.2476, 0.1788, 0.6924, 0.2624, 0.8159, 0.7472, 0.5924, 0.4865, 0.2964, 0.3067, 0.3537
0.1870, 0.2668, 0.2814, 0.5103, 0.4634, 0.6725, 0.8152, 0.7411, 0.6921, 0.1890, 0.1865, 0.4583
0.2406, 0.2904, 0.2655, 0.5737, 0.5784, 0.6801, 0.8229, 0.8878, 0.3058, 0.1861, 0.0470, 0.4070
0.0325, 0.3994, 0.2558, 0.6621, 0.2754, 0.7763, 0.8379, 0.7381, 0.4849, 0.3051, 0.2296, 0.2761
0.0774, 0.0456, 0.1605, 0.3210, 0.4235, 0.9391, 0.6436, 0.5301, 0.4703, 0.3408, 0.1702, 0.4243
0.3768, 0.3480, 0.3816, 0.4881, 0.4731, 0.6054, 0.8930, 0.5688, 0.6625, 0.4228, 0.3201, 0.2076
0.0877, 0.1715, 0.4397, 0.4802, 0.5742, 0.7411, 0.7478, 0.8923, 0.5574, 0.1182, 0.1854, 0.2339
0.2129, 0.1139, 0.1489, 0.5673, 0.4725, 0.9469, 0.5530, 0.8194, 0.6307, 0.3586, 0.0820, 0.2046
0.1674, 0.2167, 0.2910, 0.4870, 0.5359, 0.9480, 0.8267, 0.8511, 0.5284, 0.4856, 0.2426, 0.3416
0.1277, 0.2725, 0.1208, 0.3538, 0.2521, 0.8621, 0.5701, 0.6365, 0.3177, 0.1935, 0.3857, 0.3037
0.0604, 0.2092, 0.4774, 0.6463, 0.3568, 0.7135, 0.8047, 0.5962, 0.4017, 0.1336, 0.3457, 0.2792
0.2247, 0.2947, 0.4186, 0.4790, 0.2737, 0.9315, 0.5124, 0.8787, 0.5308, 0.4502, 0.2434, 0.2007
0.1185, 0.2132, 0.4848, 0.3738, 0.4040, 0.7375, 0.8079, 0.8211, 0.4698, 0.1816, 0.0268, 0.1795
0.1090, 0.2395, 0.4492, 0.3513, 0.5833, 0.8733, 0.7968, 0.8932, 0.4664, 0.3126, 0.2717, 0.3052
0.1196, 0.0422, 0.2140, 0.6075, 0.4563, 0.9205, 0.7068, 0.5928, 0.5512, 0.2216, 0.0118, 0.4613
0.1681, 0.1705, 0.3965, 0.6794, 0.2290, 0.6691, 0.6289, 0.5994, 0.4982, 0.4298, 0.3416, 0.3383
0.0961, 0.0746, 0.3974, 0.3949, 0.4158, 0.8997, 0.5913, 0.5698, 0.3159, 0.1590, 0.3406, 0.1419
0.0895, 0.0131, 0.3705, 0.3927, 0.3654, 0.6557, 0.6509, 0.6698, 0.4892, 0.2691, 0.0411, 0.4090
0.0549, 0.1697, 0.2088, 0.4204, 0.4690, 0.8071, 0.5760, 0.6877, 0.4358, 0.3818, 0.0904, 0.4380
0.2105, 0.2002, 0.2015, 0.3745, 0.3887, 0.9927, 0.5532, 0.6134, 0.6203, 0.3659, 0.1115, 0.2259
0.1680, 0.2411, 0.3877, 0.6429, 0.4724, 0.6948, 0.8703, 0.8125, 0.4230, 0.2220, 0.3525, 0.1504
0.2534, 0.1111, 0.4309, 0.4458, 0.4933, 0.6770, 0.6926, 0.8214, 0.4588, 0.1646, 0.2596, 0.4013
0.1121, 0.1805, 0.4602, 0.6488, 0.4829, 0.8364, 0.8270, 0.8631, 0.3010, 0.2589, 0.2246, 0.3936
0.0350, 0.1599, 0.2118, 0.4694, 0.3992, 0.8640, 0.6985, 0.7482, 0.5330, 0.2713, 0.0020, 0.1778
0.1281, 0.0588, 0.3395, 0.5446, 0.4000, 0.7283, 0.7613, 0.5761, 0.3024, 0.3940, 0.1774, 0.3791
0.0740, 0.0250, 0.2512, 0.5784, 0.2411, 0.6783, 0.6816, 0.7485, 0.6000, 0.1439, 0.2498, 0.2549
0.2680, 0.0814, 0.3112, 0.3689, 0.2075, 0.7948, 0.5737, 0.7553, 0.5146, 0.4100, 0.1572, 0.4958
0.2061, 0.1915, 0.3998, 0.5291, 0.3450, 0.7957, 0.5757, 0.6574, 0.3120, 0.2850, 0.1098, 0.3107
0.0117, 0.2220, 0.2172, 0.5310, 0.4931, 0.7761, 0.7653, 0.5956, 0.6994, 0.1972, 0.3763, 0.1869
0.1990, 0.3285, 0.3866, 0.5822, 0.3762, 0.9017, 0.8680, 0.6765, 0.5112, 0.1264, 0.1563, 0.2869
0.3581, 0.0442, 0.3925, 0.5182, 0.4426, 0.6119, 0.5587, 0.6136, 0.3019, 0.3677, 0.3481, 0.3188
0.2173, 0.2463, 0.2209, 0.4467, 0.4300, 0.9237, 0.5806, 0.6310, 0.5972, 0.2364, 0.0190, 0.1625
0.0775, 0.1980, 0.3540, 0.6521, 0.5610, 0.7229, 0.8014, 0.6130, 0.4474, 0.2171, 0.1655, 0.1859
0.1276, 0.0488, 0.4852, 0.5016, 0.5692, 0.8985, 0.6831, 0.8018, 0.5512, 0.2215, 0.2087, 0.3849
0.1221, 0.0050, 0.2073, 0.6187, 0.5720, 0.8501, 0.5531, 0.8030, 0.5108, 0.4015, 0.3434, 0.4790
0.2618, 0.3417, 0.3970, 0.5908, 0.5435, 0.9692, 0.8608, 0.6583, 0.3336, 0.4318, 0.2156, 0.3168
0.3301, 0.0128, 0.4512, 0.3139, 0.4773, 0.8350, 0.7567, 0.6496, 0.4102, 0.3038, 0.3543, 0.3261
0.2653, 0.1766, 0.4889, 0.5970, 0.3420, 0.8614, 0.7170, 0.8536, 0.4100, 0.1432, 0.0765, 0.2548
0.3416, 0.1083, 0.3505, 0.5494, 0.3632, 0.6201, 0.7979, 0.6183, 0.4594, 0.2509, 0.2654, 0.1345
0.2200, 0.0062, 0.4932, 0.5394, 0.3536, 0.6587, 0.7788, 0.8623, 0.4272, 0.4066, 0.1150, 0.4829
0.2809, 0.2500, 0.4723, 0.4076, 0.5694, 0.8712, 0.8085, 0.7287, 0.6336, 0.3793, 0.0586, 0.3450
0.1117, 0.3664, 0.1793, 0.4143, 0.2191, 0.7790, 0.7230, 0.7294, 0.6622, 0.2390, 0.1790, 0.4405
0.2307, 0.2616, 0.2113, 0.4950, 0.4484, 0.6534, 0.5132, 0.5454, 0.4910, 0.1096, 0.2505, 0.1390
0.1004, 0.1706, 0.1463, 0.4082, 0.2084, 0.9940, 0.7446, 0.6513, 0.3106, 0.2559, 0.1810, 0.4724
0.1114, 0.2459, 0.3661, 0.3744, 0.4023, 0.9146, 0.5386, 0.7424, 0.3104, 0.1028, 0.0238, 0.2926
Posted in PyTorch, Transformers

Anomaly Detection for Mixed Numeric and Categorical Data Using DBSCAN Clustering with C#

Data clustering with the DBSCAN (density-based spatial clustering of applications with noise) algorithm can be easily used to identify anomalous data items. DBSCAN clustering assigns each data item of the source data to a cluster ID, except for data items that are not near other items. Those far-away items are labeled with -1, indicating “noise” — these are anomalous items.

DBSCAN clustering uses Euclidean distance between data items and so the implication is that DBSCAN applies only to strictly numeric data. But I’ve been experimenting with an encoding technique for categorical data that I call one-over-n-hot encoding. For example, if a data column Color has three possible values, then one-over-n-hot encoding is red = (0.3333, 0, 0), blue = (0, 0.3333, 0), green = (0, 0, 0.3333).

For categorical items that have an inherent ordering, I use equal-interval encoding. For example, for Height, short = 0.25, medium = 0.50, tall = 0.75.
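
Both encodings are just small lookup computations. A minimal sketch of the idea in Python (the demo itself is C#):

def one_over_n_hot(value, all_values):
  # one_over_n_hot("blue", ["red", "blue", "green"])
  # returns [0.0, 0.3333..., 0.0]
  vec = [0.0] * len(all_values)
  vec[all_values.index(value)] = 1.0 / len(all_values)
  return vec

def equal_interval(value, ordered_values):
  # equal_interval("medium", ["short", "medium", "tall"]) returns 0.50
  n = len(ordered_values)
  return (ordered_values.index(value) + 1) / (n + 1)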

I put together a demo using the C# language. I made a 240-item set of synthetic data that looks like:

F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
. . .

Each line represents a person. The fields are sex, height, age, State, income, political leaning.

I used min-max normalization on the age (min = 18, max = 68) and income (min = $20,300, max = $81,800) columns. I used one-over-n-hot encoding on the sex, State, and political leaning columns. I used equal-interval encoding for the height column.
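
Min-max normalization is x' = (x - min) / (max - min). As a quick check against the first row of the encoded data below:

age = (24 - 18) / (68 - 18)                 # 0.1200
income = (29500 - 20300) / (81800 - 20300)  # 0.1496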

The resulting normalized and encoded data looks like:

0.5, 0.25, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
. . .

When using DBSCAN clustering, you don’t explicitly specify the number of clusters. Instead, you specify an epsilon value and a min_points value. These implicitly determine the resulting number of clusters. DBSCAN clustering is extremely sensitive to the values of epsilon and min_points. After a lot of trial and error, I used epsilon = 0.4790 and min_points = 24.
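
As a cross-check, the scikit-learn DBSCAN implementation uses the same epsilon / min-points scheme and also labels noise items as -1. A sketch (minor differences in neighborhood-counting conventions could shift the results slightly):

import numpy as np
from sklearn.cluster import DBSCAN
X = np.loadtxt("people_240.txt", delimiter=",", comments="#")
labels = DBSCAN(eps=0.4790, min_samples=24).fit_predict(X)
print("number noise items =", np.sum(labels == -1))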

The result was three clusters, plus 12 anomalous items in the noise cluster. Each noise item is examined by counting the number of other data items that are within epsilon distance of it (near neighbors):

number clusters =  3
cluster counts
0 : 116
1 : 89
2 : 23

number noise items = 12

[  17] : F  tall    25  delaware  30000  moderate       :  near neighbors = 1
[  50] : M  tall    36  illinois  53500  conservative   :  near neighbors = 8
[  58] : M  tall    50  illinois  62900  conservative   :  near neighbors = 3
[  75] : F  short   26  colorado  40400  conservative   :  near neighbors = 3
[ 124] : F  tall    29  colorado  37100  conservative   :  near neighbors = 0
[ 169] : M  short   44  delaware  63000  conservative   :  near neighbors = 3
[ 170] : M  tall    65  delaware  81800  conservative   :  near neighbors = 1
[ 175] : F  medium  68  arkansas  72600  liberal        :  near neighbors = 0
[ 226] : M  tall    65  arkansas  76900  conservative   :  near neighbors = 3
[ 227] : M  short   46  colorado  58000  conservative   :  near neighbors = 6
[ 229] : M  short   47  arkansas  63600  conservative   :  near neighbors = 5
[ 232] : M  medium  20  arkansas  28700  liberal        :  near neighbors = 1

In this example, the most anomalous data items are [124] and [175] because they have zero near neighbors. The next most anomalous data items are [17], [170], and [232] because they have only one near neighbor. And so on. In a non-demo scenario, the anomalous data items would be examined closely to try to determine why they’re anomalous.

Two other clustering-based anomaly detection techniques are k-means clustering anomaly detection and self-organizing maps clustering anomaly detection. I suspect that the three clustering anomaly techniques give different results, but I haven’t explored this question thoroughly.



I loved the “Freddy the Pig” series of books when I was a young man. Freddy is the lead character in 26 books written between 1927 and 1958 by Walter R. Brooks with illustrations by Kurt Wiese. The books focus on the adventures of a group of animals living on a rural farm. The animals can talk to each other and humans — an anomaly that is remarked upon by humans but never really questioned other than a comment like, “The animals can talk — that’s odd.”

#26 Freddy and the Dragon (1958) – Freddy and his sidekick, Jinx the cat, defeat a gang of criminals, and help a traveling circus.

#3 Freddy the Detective (1932) – Freddy and his friends solve a series of mysterious crimes on the Bean family farm — Simon the rat and his gang are the culprits. The first one of the series I read and so it has a special place in my memory.

#14 Freddy the Magician (1947) – Freddy and his farmyard friends deal with Zingo, a criminal magician.


Demo code:

using System;
using System.IO;
using System.Collections.Generic;

namespace AnomalyDBSCAN
{
  internal class AnomalyDBSCANProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin anomaly detection" +
        " using DBSCAN clustering ");

      // 1. load data
      Console.WriteLine("\nLoading 240-item" +
        " synthetic People subset ");

      string rf = 
        "..\\..\\..\\Data\\people_raw.txt";
      string[] rawFileArray =
        AnomalyDBSCAN.FileLoad(rf, "#");
      Console.WriteLine("\nFirst three rows" +
        " of raw data: ");
      for (int i = 0; i < 3; ++i)
        Console.WriteLine("[" + i.ToString().
          PadLeft(3) + "]  " + rawFileArray[i]);

      string fn = "..\\..\\..\\Data\\people_240.txt";
      double[][] X = AnomalyDBSCAN.MatLoad(fn,
        new int[] { 0, 1, 2, 3, 4, 5, 6,
          7, 8, 9, 10 }, ',', "#");
      Console.WriteLine("\nFirst three rows" +
        " of normalized and encoded data: ");
      AnomalyDBSCAN.MatShow(X, 4, 8, 3, true);

      // 2. create AnomalyDBSCAN object and cluster
      double epsilon = 0.479;
      int minPoints = 24;

      Console.WriteLine("\nSetting epsilon = " +
          epsilon.ToString("F4"));
      Console.WriteLine("Setting minPoints = " +
        minPoints);
      Console.WriteLine("\nClustering with DBSCAN ");
      AnomalyDBSCAN dbscan =
        new AnomalyDBSCAN(epsilon, minPoints);
      int[] clustering = dbscan.Cluster(X);
      Console.WriteLine("Done ");

      // Console.WriteLine("\nClustering results: ");
      // AnomalyDBSCAN.VecShow(clustering, 4);

      Console.WriteLine("\nAnalyzing");
      dbscan.Analyze(rawFileArray);

      Console.WriteLine("\nEnd demo ");
      Console.ReadLine();
    } // Main

  } // Program

  public class AnomalyDBSCAN
  {
    public double eps;
    public int minPts;
    public double[][] data;  // supplied in cluster()
    public int[] labels;  // supplied in cluster()

    public AnomalyDBSCAN(double eps, int minPts)
    {
      this.eps = eps;
      this.minPts = minPts;
    }

    public void Analyze(string[] rawFileArray)
    {
      // assumes Cluster() has been called so that
      // this.labels[] is computed

      int maxClusterID = -1;
      int numNoise = 0;
      for (int i = 0; i < this.labels.Length; ++i)
      {
        if (this.labels[i] == -1)
        {
          ++numNoise;
        }
        if (this.labels[i] > maxClusterID)
        {
          maxClusterID = this.labels[i];
        }
      }

      int numClusters = maxClusterID + 1;
      Console.WriteLine("\nnumber clusters =  " +
        numClusters);

      int[] clusterCounts = new int[numClusters];
      for (int i = 0; i < this.labels.Length; ++i)
      {
        int clusterID = this.labels[i];
        if (clusterID != -1)
          ++clusterCounts[clusterID];
      }
      Console.WriteLine("\ncluster counts ");
      for (int cid = 0; cid < clusterCounts.Length;
        ++cid)
      {
        Console.WriteLine(cid + " : " +
          clusterCounts[cid]);
      }

      Console.WriteLine("\nnumber noise items = " +
        numNoise + "\n");
      for (int i = 0; i < this.labels.Length; ++i)
      {
        if (this.labels[i] == -1) // noise
        {
          Console.Write("[" + i.ToString().
            PadLeft(4) + "] : " +
            rawFileArray[i].ToString().
            PadRight(46)); // associated raw data

          double[] distances = 
            new double[this.data.Length];
          int countLessThanEpsilon = 0;
          for (int j = 0; j < this.data.Length; ++j)
          {
            distances[j] = 
              AnomalyDBSCAN.EucDistance(this.data[i],
              this.data[j]);
            if (j != i && distances[j] < this.eps)
            {
              ++countLessThanEpsilon;
            }
          }
          Console.WriteLine(" :  near neighbors = " +
            countLessThanEpsilon);
        } // noise item
      } // i
    } // Analyze()

    public int[] Cluster(double[][] data)
    {
      this.data = data;  // by reference
      this.labels = new int[this.data.Length];
      for (int i = 0; i < labels.Length; ++i)
        this.labels[i] = -2;  // unprocessed

      int cid = -1;  // so the first cluster gets ID 0
      for (int i = 0; i < this.data.Length; ++i)
      {
        if (this.labels[i] != -2)  
          continue;  // item has been processed

        List<int> neighbors = this.RegionQuery(i);
        if (neighbors.Count < this.minPts)
        {
          this.labels[i] = -1;  // noise
        }
        else
        {
          ++cid;
          this.Expand(i, neighbors, cid);
        }
      }

      return this.labels;
    }

    private List<int> RegionQuery(int p)
    {
      // List of idxs close to data[p]
      List<int> result = new List<int>();
      for (int i = 0; i < this.data.Length; ++i)
      {
        double dist = EucDistance(this.data[p],
          this.data[i]);
        if (dist < this.eps)
          result.Add(i);
      }
      return result;
    }

    private void Expand(int p, List<int> neighbors,
      int cid)
    {
      this.labels[p] = cid;
      //int i = 0;
      //while (i < neighbors.Count)
      for (int i = 0; i < neighbors.Count; ++i)
      {
        int pn = neighbors[i];
        if (this.labels[pn] == -1)  // noise
          this.labels[pn] = cid;
        else if (this.labels[pn] == -2)  // unprocessed
        {
          this.labels[pn] = cid;
          List<int> newNeighbors = 
            this.RegionQuery(pn);
          // note: the neighbors list is modified here,
          // so the loop iterates over a growing list
          if (newNeighbors.Count >= this.minPts)
            neighbors.AddRange(newNeighbors); 
        }
        //++i;
      }
    }

    private static double EucDistance(double[] x1,
      double[] x2)
    {
      int dim = x1.Length;
      double sum = 0.0;
      for (int i = 0; i < dim; ++i)
        sum += (x1[i] - x2[i]) * (x1[i] - x2[i]);
      return Math.Sqrt(sum);
    }

    // ------------------------------------------------------

    // misc. public utility functions for convenience
    // MatLoad(), FileLoad, VecLoad(), MatShow(),
    // VecShow(), ListShow()

    // ------------------------------------------------------

    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment)
    {
      // count number of non-comment lines
      int nRows = 0;
      string line = "";
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      while ((line = sr.ReadLine()) != null)
        if (line.StartsWith(comment) == false)
          ++nRows;
      sr.Close(); ifs.Close();

      // make result matrix
      int nCols = usecols.Length;
      double[][] result = new double[nRows][];
      for (int r = 0; r < nRows; ++r)
        result[r] = new double[nCols];

      line = "";
      string[] tokens = null;
      ifs = new FileStream(fn, FileMode.Open);
      sr = new StreamReader(ifs);

      int i = 0;
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true)
          continue;
        tokens = line.Split(sep);
        for (int j = 0; j < nCols; ++j)
        {
          int k = usecols[j];  // into tokens
          result[i][j] = double.Parse(tokens[k]);
        }
        ++i;
      }
      sr.Close(); ifs.Close();
      return result;
    }

    // ------------------------------------------------------

    public static string[] FileLoad(string fn,
      string comment)
    {
      List<string> lst = new List<string>();
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      string line = "";
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment)) continue;
        line = line.Trim();
        lst.Add(line);
      }
      sr.Close(); ifs.Close();
      string[] result = lst.ToArray();
      return result;
    }

    // ------------------------------------------------------

    public static int[] VecLoad(string fn, int usecol,
      string comment)
    {
      char dummySep = ',';
      double[][] tmp = MatLoad(fn, new int[] { usecol },
        dummySep, comment);
      int n = tmp.Length;
      int[] result = new int[n];
      for (int i = 0; i < n; ++i)
        result[i] = (int)tmp[i][0];
      return result;
    }

    // ------------------------------------------------------

    public static void MatShow(double[][] M, int dec,
      int wid, int numRows, bool showIndices)
    {
      double small = 1.0 / Math.Pow(10, dec);
      for (int i = 0; i < numRows; ++i)
      {
        if (showIndices == true)
        {
          int pad = M.Length.ToString().Length;
          Console.Write("[" + i.ToString().
            PadLeft(pad) + "]");
        }
        for (int j = 0; j < M[0].Length; ++j)
        {
          double v = M[i][j];
          if (Math.Abs(v) < small) v = 0.0;
          Console.Write(v.ToString("F" + dec).
            PadLeft(wid));
        }
        Console.WriteLine("");
      }
      if (numRows < M.Length)
        Console.WriteLine(". . . ");
    }

    // ------------------------------------------------------

    public static void VecShow(int[] vec, int wid)
    {
      int n = vec.Length;
      for (int i = 0; i < n; ++i)
      {
        if (i > 0 && i % 20 == 0) Console.WriteLine("");
        Console.Write(vec[i].ToString().PadLeft(wid));
      }
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static void VecShow(double[] vec, int decimals,
      int wid)
    {
      int n = vec.Length;
      for (int i = 0; i < n; ++i)
        Console.Write(vec[i].ToString("F" + decimals).
          PadLeft(wid));
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static void ListShow(List<int> lst)
    {
      int n = lst.Count;
      for (int i = 0; i < n; ++i)
      {
        Console.Write(lst[i] + " ");
      }
      Console.WriteLine("");
    }

  } // AnomalyDBSCAN

} // ns

Raw data:

# people_raw.txt
#
F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
F  short   50  colorado  56500  moderate
F  medium  50  illinois  55000  moderate
M  tall    19  delaware  32700  conservative
F  short   22  illinois  27700  moderate
M  tall    39  delaware  47100  liberal
F  short   34  arkansas  39400  moderate
M  medium  22  illinois  33500  conservative
F  medium  35  delaware  35200  liberal
M  tall    33  colorado  46400  moderate
F  short   45  colorado  54100  moderate
F  short   42  illinois  50700  moderate
M  tall    33  colorado  46800  moderate
F  tall    25  delaware  30000  moderate
M  medium  31  colorado  46400  conservative
F  short   27  arkansas  32500  liberal
F  short   48  illinois  54000  moderate
M  tall    64  illinois  71300  liberal
F  medium  61  colorado  72400  conservative
F  short   54  illinois  61000  conservative
F  short   29  arkansas  36300  conservative
F  short   50  delaware  55000  moderate
F  medium  55  illinois  62500  conservative
F  medium  40  illinois  52400  conservative
F  short   22  arkansas  23600  liberal
F  short   68  colorado  78400  conservative
M  tall    60  illinois  71700  liberal
M  tall    34  delaware  46500  moderate
M  medium  25  delaware  37100  conservative
M  short   31  illinois  48900  moderate
F  short   43  delaware  48000  moderate
F  short   58  colorado  65400  liberal
M  tall    55  illinois  60700  liberal
M  tall    43  colorado  51100  moderate
M  tall    43  delaware  53200  moderate
M  medium  21  arkansas  37200  conservative
F  short   55  delaware  64600  conservative
F  short   64  colorado  74800  conservative
M  tall    41  illinois  58800  moderate
F  medium  64  delaware  72700  conservative
M  medium  56  illinois  66600  liberal
F  short   31  delaware  36000  moderate
M  tall    65  delaware  70100  liberal
F  tall    55  illinois  64300  conservative
M  short   25  arkansas  40300  conservative
F  short   46  delaware  51000  moderate
M  tall    36  illinois  53500  conservative
F  short   52  illinois  58100  moderate
F  short   61  delaware  67900  conservative
F  short   57  delaware  65700  conservative
M  tall    46  colorado  52600  moderate
M  tall    62  arkansas  66800  liberal
F  short   55  illinois  62700  conservative
M  medium  22  delaware  27700  moderate
M  tall    50  illinois  62900  conservative
M  tall    32  illinois  41800  moderate
M  short   21  delaware  35600  conservative
F  medium  44  colorado  52000  moderate
F  short   46  illinois  51700  moderate
F  short   62  colorado  69700  conservative
F  short   57  illinois  66400  conservative
M  medium  67  illinois  75800  liberal
F  short   29  arkansas  34300  liberal
F  short   53  illinois  60100  conservative
M  tall    44  arkansas  54800  moderate
F  medium  46  colorado  52300  moderate
M  tall    20  illinois  30100  moderate
M  medium  38  illinois  53500  moderate
F  short   50  colorado  58600  moderate
F  short   33  colorado  42500  moderate
M  tall    33  colorado  39300  moderate
F  short   26  colorado  40400  conservative
F  short   58  arkansas  70700  conservative
F  tall    43  illinois  48000  moderate
M  medium  46  arkansas  64400  conservative
F  short   60  arkansas  71700  conservative
M  tall    42  arkansas  48900  moderate
M  tall    56  delaware  56400  liberal
M  short   62  colorado  66300  liberal
M  short   50  arkansas  64800  moderate
F  short   47  illinois  52000  moderate
M  tall    67  colorado  80400  liberal
M  tall    40  delaware  50400  moderate
F  short   42  colorado  48400  moderate
F  short   64  arkansas  72000  conservative
M  medium  47  arkansas  58700  liberal
F  medium  45  colorado  52800  moderate
M  tall    25  delaware  40900  conservative
F  short   38  arkansas  48400  conservative
F  short   55  delaware  60000  moderate
M  tall    44  arkansas  60600  moderate
F  medium  33  arkansas  41000  moderate
F  short   34  delaware  39000  moderate
F  short   27  colorado  33700  liberal
F  short   32  colorado  40700  moderate
F  tall    42  illinois  47000  moderate
M  short   24  delaware  40300  conservative
F  short   42  colorado  50300  moderate
F  short   25  delaware  28000  liberal
F  short   51  colorado  58000  moderate
M  medium  55  colorado  63500  liberal
F  short   44  arkansas  47800  liberal
M  short   18  arkansas  39800  conservative
M  tall    67  colorado  71600  liberal
F  short   45  delaware  50000  moderate
F  short   48  arkansas  55800  moderate
M  short   25  colorado  39000  moderate
M  tall    67  arkansas  78300  moderate
F  short   37  delaware  42000  moderate
M  short   32  arkansas  42700  moderate
F  short   48  arkansas  57000  moderate
M  tall    66  delaware  75000  liberal
F  tall    61  arkansas  70000  conservative
M  medium  58  delaware  68900  moderate
F  short   19  arkansas  24000  liberal
F  short   38  delaware  43000  moderate
M  medium  27  arkansas  36400  moderate
F  short   42  arkansas  48000  moderate
F  short   60  arkansas  71300  conservative
M  tall    27  delaware  34800  conservative
F  tall    29  colorado  37100  conservative
M  medium  43  arkansas  56700  moderate
F  medium  48  arkansas  56700  moderate
F  medium  27  delaware  29400  liberal
M  tall    44  arkansas  55200  conservative
F  short   23  colorado  26300  liberal
M  tall    36  colorado  53000  liberal
F  short   64  delaware  72500  conservative
F  short   29  delaware  30000  liberal
M  short   33  arkansas  49300  moderate
M  tall    66  colorado  75000  liberal
M  medium  21  delaware  34300  conservative
F  short   27  arkansas  32700  liberal
F  short   29  arkansas  31800  liberal
M  tall    31  arkansas  48600  moderate
F  short   36  delaware  41000  moderate
F  short   49  colorado  55700  moderate
M  short   28  arkansas  38400  conservative
M  medium  43  delaware  56600  moderate
M  medium  46  colorado  58800  moderate
F  short   57  arkansas  69800  conservative
M  short   52  delaware  59400  moderate
M  tall    31  delaware  43500  moderate
M  tall    55  arkansas  62000  liberal
F  short   50  arkansas  56400  moderate
F  short   48  colorado  55900  moderate
M  medium  22  delaware  34500  conservative
F  short   59  delaware  66700  conservative
F  short   34  arkansas  42800  liberal
M  tall    64  arkansas  77200  liberal
F  short   29  delaware  33500  liberal
M  medium  34  colorado  43200  moderate
M  medium  61  arkansas  75000  liberal
F  short   64  delaware  71100  conservative
M  short   29  arkansas  41300  conservative
F  short   63  colorado  70600  conservative
M  medium  29  colorado  40000  conservative
M  tall    51  arkansas  62700  moderate
M  tall    24  delaware  37700  conservative
F  medium  48  colorado  57500  moderate
F  short   18  arkansas  27400  conservative
F  short   18  arkansas  20300  liberal
F  short   33  colorado  38200  liberal
M  medium  20  delaware  34800  conservative
F  short   29  delaware  33000  liberal
M  short   44  delaware  63000  conservative
M  tall    65  delaware  81800  conservative
M  tall    56  arkansas  63700  liberal
M  medium  52  delaware  58400  moderate
M  medium  29  colorado  48600  conservative
M  tall    47  colorado  58900  moderate
F  medium  68  arkansas  72600  liberal
F  short   31  delaware  36000  moderate
F  short   61  colorado  62500  liberal
F  short   19  colorado  21500  liberal
F  tall    38  delaware  43000  moderate
M  tall    26  arkansas  42300  conservative
F  short   61  colorado  67400  conservative
F  short   40  arkansas  46500  moderate
M  medium  49  arkansas  65200  moderate
F  medium  56  arkansas  67500  conservative
M  short   48  colorado  66000  moderate
F  short   52  arkansas  56300  liberal
M  tall    18  arkansas  29800  conservative
M  tall    56  delaware  59300  liberal
M  medium  52  colorado  64400  moderate
M  medium  18  colorado  28600  moderate
M  tall    58  arkansas  66200  liberal
M  tall    39  colorado  55100  moderate
M  tall    46  arkansas  62900  moderate
M  medium  40  colorado  46200  moderate
M  medium  60  arkansas  72700  liberal
F  short   36  colorado  40700  liberal
F  short   44  arkansas  52300  moderate
F  short   28  arkansas  31300  liberal
F  short   54  delaware  62600  conservative
M  medium  51  arkansas  61200  moderate
M  short   32  colorado  46100  moderate
F  short   55  arkansas  62700  conservative
F  short   25  delaware  26200  liberal
F  medium  33  delaware  37300  liberal
M  medium  29  colorado  46200  conservative
F  short   65  arkansas  72700  conservative
M  tall    43  colorado  51400  moderate
M  short   54  colorado  64800  liberal
F  short   61  colorado  72700  conservative
F  short   52  colorado  63600  conservative
F  short   30  colorado  33500  liberal
F  short   29  arkansas  31400  liberal
M  tall    47  delaware  59400  moderate
F  short   39  colorado  47800  moderate
F  short   47  delaware  52000  moderate
M  medium  49  arkansas  58600  moderate
M  tall    63  delaware  67400  liberal
M  medium  30  arkansas  39200  conservative
M  tall    61  delaware  69600  liberal
M  medium  47  delaware  58700  moderate
F  short   30  delaware  34500  liberal
M  medium  51  delaware  58000  moderate
M  medium  24  arkansas  38800  moderate
M  short   49  arkansas  64500  moderate
F  medium  66  delaware  74500  conservative
M  tall    65  arkansas  76900  conservative
M  short   46  colorado  58000  conservative
M  tall    45  delaware  51800  moderate
M  short   47  arkansas  63600  conservative
M  tall    29  arkansas  44800  conservative
M  tall    57  delaware  69300  liberal
M  medium  20  arkansas  28700  liberal
M  medium  35  arkansas  43400  moderate
M  tall    61  delaware  67000  liberal
M  short   31  delaware  37300  moderate
F  short   18  arkansas  20800  liberal
F  medium  26  delaware  29200  liberal
M  medium  28  arkansas  36400  liberal
M  tall    59  delaware  69400  liberal

Normalized and encoded data:
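
The numeric columns are min-max normalized. For example, the first person’s age of 24 maps to (24 - 18) / (68 - 18) = 0.1200, and the income of $29,500 maps to (29500 - 20300) / (81800 - 20300) = 0.1496. Sex is encoded as M = 0.0, F = 0.5, and height as short = 0.25, medium = 0.50, tall = 0.75. The State and political leaning columns are one-hot encoded with scaled values: a State places 0.25 (1/4) in one of four columns, and a political leaning places 0.3333 (1/3) in one of three columns.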

# people_240.txt
#
# sex (M = 0.0, F = 0.5)
# height (short, medium, tall)
# age (min = 18, max = 68)
# State (Arkansas, Colorado, Delaware, Illinois)
# income (min = $20,300, max = $81,800)
# political leaning (conservative, moderate, liberal)
#
0.5, 0.25, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6400, 0.00, 0.25, 0.00, 0.00, 0.5886, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.6400, 0.00, 0.00, 0.00, 0.25, 0.5642, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.0200, 0.00, 0.00, 0.25, 0.00, 0.2016, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0800, 0.00, 0.00, 0.00, 0.25, 0.1203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.4358, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3200, 0.25, 0.00, 0.00, 0.00, 0.3106, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.00, 0.25, 0.2146, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.3400, 0.00, 0.00, 0.25, 0.00, 0.2423, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.4244, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5400, 0.00, 0.25, 0.00, 0.00, 0.5496, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.00, 0.00, 0.25, 0.4943, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.4309, 0.0000, 0.3333, 0.0000
0.5, 0.75, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.1577, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.2600, 0.00, 0.25, 0.00, 0.00, 0.4244, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.1984, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6000, 0.00, 0.00, 0.00, 0.25, 0.5480, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9200, 0.00, 0.00, 0.00, 0.25, 0.8293, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.8472, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7200, 0.00, 0.00, 0.00, 0.25, 0.6618, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.2602, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6400, 0.00, 0.00, 0.25, 0.00, 0.5642, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6862, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.4400, 0.00, 0.00, 0.00, 0.25, 0.5220, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0800, 0.25, 0.00, 0.00, 0.00, 0.0537, 0.0000, 0.0000, 0.3333
0.5, 0.25, 1.0000, 0.00, 0.25, 0.00, 0.00, 0.9447, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.8400, 0.00, 0.00, 0.00, 0.25, 0.8358, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3200, 0.00, 0.00, 0.25, 0.00, 0.4260, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.2732, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.2600, 0.00, 0.00, 0.00, 0.25, 0.4650, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.4504, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8000, 0.00, 0.25, 0.00, 0.00, 0.7333, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6569, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5000, 0.00, 0.25, 0.00, 0.00, 0.5008, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.5350, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0600, 0.25, 0.00, 0.00, 0.00, 0.2748, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7400, 0.00, 0.00, 0.25, 0.00, 0.7203, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9200, 0.00, 0.25, 0.00, 0.00, 0.8862, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.4600, 0.00, 0.00, 0.00, 0.25, 0.6260, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.7600, 0.00, 0.00, 0.00, 0.25, 0.7528, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2553, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9400, 0.00, 0.00, 0.25, 0.00, 0.8098, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.7154, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.1400, 0.25, 0.00, 0.00, 0.00, 0.3252, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.5600, 0.00, 0.00, 0.25, 0.00, 0.4992, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.5398, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6800, 0.00, 0.00, 0.00, 0.25, 0.6146, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.7740, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7800, 0.00, 0.00, 0.25, 0.00, 0.7382, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.5252, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8800, 0.25, 0.00, 0.00, 0.00, 0.7561, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6894, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.25, 0.00, 0.1203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.6400, 0.00, 0.00, 0.00, 0.25, 0.6927, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.2800, 0.00, 0.00, 0.00, 0.25, 0.3496, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.0600, 0.00, 0.00, 0.25, 0.00, 0.2488, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.5200, 0.00, 0.25, 0.00, 0.00, 0.5154, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5600, 0.00, 0.00, 0.00, 0.25, 0.5106, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8800, 0.00, 0.25, 0.00, 0.00, 0.8033, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7800, 0.00, 0.00, 0.00, 0.25, 0.7496, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.9800, 0.00, 0.00, 0.00, 0.25, 0.9024, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.2276, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7000, 0.00, 0.00, 0.00, 0.25, 0.6472, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5610, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.5203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.0400, 0.00, 0.00, 0.00, 0.25, 0.1593, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.4000, 0.00, 0.00, 0.00, 0.25, 0.5398, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6400, 0.00, 0.25, 0.00, 0.00, 0.6228, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.3610, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.3089, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1600, 0.00, 0.25, 0.00, 0.00, 0.3268, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8000, 0.25, 0.00, 0.00, 0.00, 0.8195, 0.3333, 0.0000, 0.0000
0.5, 0.75, 0.5000, 0.00, 0.00, 0.00, 0.25, 0.4504, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.5600, 0.25, 0.00, 0.00, 0.00, 0.7171, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8358, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.4800, 0.25, 0.00, 0.00, 0.00, 0.4650, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.7600, 0.00, 0.00, 0.25, 0.00, 0.5870, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.8800, 0.00, 0.25, 0.00, 0.00, 0.7480, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.6400, 0.25, 0.00, 0.00, 0.00, 0.7236, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5800, 0.00, 0.00, 0.00, 0.25, 0.5154, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9800, 0.00, 0.25, 0.00, 0.00, 0.9772, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4400, 0.00, 0.00, 0.25, 0.00, 0.4894, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.25, 0.00, 0.00, 0.4569, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9200, 0.25, 0.00, 0.00, 0.00, 0.8407, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5800, 0.25, 0.00, 0.00, 0.00, 0.6244, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.5400, 0.00, 0.25, 0.00, 0.00, 0.5285, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.3350, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4000, 0.25, 0.00, 0.00, 0.00, 0.4569, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7400, 0.00, 0.00, 0.25, 0.00, 0.6455, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.6553, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.3000, 0.25, 0.00, 0.00, 0.00, 0.3366, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3200, 0.00, 0.00, 0.25, 0.00, 0.3041, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.2179, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2800, 0.00, 0.25, 0.00, 0.00, 0.3317, 0.0000, 0.3333, 0.0000
0.5, 0.75, 0.4800, 0.00, 0.00, 0.00, 0.25, 0.4341, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.1200, 0.00, 0.00, 0.25, 0.00, 0.3252, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.25, 0.00, 0.00, 0.4878, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.1252, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6600, 0.00, 0.25, 0.00, 0.00, 0.6130, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.7400, 0.00, 0.25, 0.00, 0.00, 0.7024, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.4472, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.3171, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9800, 0.00, 0.25, 0.00, 0.00, 0.8341, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5400, 0.00, 0.00, 0.25, 0.00, 0.4829, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5772, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.1400, 0.00, 0.25, 0.00, 0.00, 0.3041, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9800, 0.25, 0.00, 0.00, 0.00, 0.9431, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3800, 0.00, 0.00, 0.25, 0.00, 0.3528, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2800, 0.25, 0.00, 0.00, 0.00, 0.3642, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5967, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9600, 0.00, 0.00, 0.25, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.8600, 0.25, 0.00, 0.00, 0.00, 0.8081, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.8000, 0.00, 0.00, 0.25, 0.00, 0.7902, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0200, 0.25, 0.00, 0.00, 0.00, 0.0602, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.4000, 0.00, 0.00, 0.25, 0.00, 0.3691, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.2618, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.25, 0.00, 0.00, 0.00, 0.4504, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8293, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.1800, 0.00, 0.00, 0.25, 0.00, 0.2358, 0.3333, 0.0000, 0.0000
0.5, 0.75, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.2732, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5000, 0.25, 0.00, 0.00, 0.00, 0.5919, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5919, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.1800, 0.00, 0.00, 0.25, 0.00, 0.1480, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5675, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1000, 0.00, 0.25, 0.00, 0.00, 0.0976, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3600, 0.00, 0.25, 0.00, 0.00, 0.5317, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8488, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.1577, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.3000, 0.25, 0.00, 0.00, 0.00, 0.4715, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9600, 0.00, 0.25, 0.00, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0600, 0.00, 0.00, 0.25, 0.00, 0.2276, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.2016, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.1870, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.2600, 0.25, 0.00, 0.00, 0.00, 0.4602, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3600, 0.00, 0.00, 0.25, 0.00, 0.3366, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6200, 0.00, 0.25, 0.00, 0.00, 0.5756, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.2943, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.5902, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.6260, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.7800, 0.25, 0.00, 0.00, 0.00, 0.8049, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.6800, 0.00, 0.00, 0.25, 0.00, 0.6358, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.3772, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.7400, 0.25, 0.00, 0.00, 0.00, 0.6780, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6400, 0.25, 0.00, 0.00, 0.00, 0.5870, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.5789, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.25, 0.00, 0.2309, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8200, 0.00, 0.00, 0.25, 0.00, 0.7545, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.3200, 0.25, 0.00, 0.00, 0.00, 0.3659, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.9200, 0.25, 0.00, 0.00, 0.00, 0.9252, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.2146, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.3200, 0.00, 0.25, 0.00, 0.00, 0.3724, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.8600, 0.25, 0.00, 0.00, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8260, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.3415, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.8179, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.3203, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.6600, 0.25, 0.00, 0.00, 0.00, 0.6894, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1200, 0.00, 0.00, 0.25, 0.00, 0.2829, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.6049, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.1154, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.0000, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.2911, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0400, 0.00, 0.00, 0.25, 0.00, 0.2358, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.2065, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.5200, 0.00, 0.00, 0.25, 0.00, 0.6943, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9400, 0.00, 0.00, 0.25, 0.00, 1.0000, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7600, 0.25, 0.00, 0.00, 0.00, 0.7057, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6800, 0.00, 0.00, 0.25, 0.00, 0.6195, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.4602, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5800, 0.00, 0.25, 0.00, 0.00, 0.6276, 0.0000, 0.3333, 0.0000
0.5, 0.50, 1.0000, 0.25, 0.00, 0.00, 0.00, 0.8504, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2553, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.6862, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.0200, 0.00, 0.25, 0.00, 0.00, 0.0195, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.4000, 0.00, 0.00, 0.25, 0.00, 0.3691, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1600, 0.25, 0.00, 0.00, 0.00, 0.3577, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.7659, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4400, 0.25, 0.00, 0.00, 0.00, 0.4260, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.7301, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.7600, 0.25, 0.00, 0.00, 0.00, 0.7675, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.7431, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6800, 0.25, 0.00, 0.00, 0.00, 0.5854, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.1545, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7600, 0.00, 0.00, 0.25, 0.00, 0.6341, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6800, 0.00, 0.25, 0.00, 0.00, 0.7171, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0000, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8000, 0.25, 0.00, 0.00, 0.00, 0.7463, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.25, 0.00, 0.00, 0.5659, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5600, 0.25, 0.00, 0.00, 0.00, 0.6927, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.4400, 0.00, 0.25, 0.00, 0.00, 0.4211, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8520, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3600, 0.00, 0.25, 0.00, 0.00, 0.3317, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5203, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.1789, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7200, 0.00, 0.00, 0.25, 0.00, 0.6878, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.6600, 0.25, 0.00, 0.00, 0.00, 0.6650, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2800, 0.00, 0.25, 0.00, 0.00, 0.4195, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.7400, 0.25, 0.00, 0.00, 0.00, 0.6894, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.0959, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.3000, 0.00, 0.00, 0.25, 0.00, 0.2764, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.4211, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9400, 0.25, 0.00, 0.00, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5000, 0.00, 0.25, 0.00, 0.00, 0.5057, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.7200, 0.00, 0.25, 0.00, 0.00, 0.7236, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6800, 0.00, 0.25, 0.00, 0.00, 0.7041, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2400, 0.00, 0.25, 0.00, 0.00, 0.2146, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.1805, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.6358, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4200, 0.00, 0.25, 0.00, 0.00, 0.4472, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.5154, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.6228, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9000, 0.00, 0.00, 0.25, 0.00, 0.7659, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2400, 0.25, 0.00, 0.00, 0.00, 0.3073, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.8016, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.6244, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.2400, 0.00, 0.00, 0.25, 0.00, 0.2309, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6600, 0.00, 0.00, 0.25, 0.00, 0.6130, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.3008, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.7187, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.9600, 0.00, 0.00, 0.25, 0.00, 0.8813, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9400, 0.25, 0.00, 0.00, 0.00, 0.9203, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.6130, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5400, 0.00, 0.00, 0.25, 0.00, 0.5122, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.5800, 0.25, 0.00, 0.00, 0.00, 0.7041, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.3984, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7800, 0.00, 0.00, 0.25, 0.00, 0.7967, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0400, 0.25, 0.00, 0.00, 0.00, 0.1366, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.3400, 0.25, 0.00, 0.00, 0.00, 0.3756, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.7593, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2764, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.0081, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.1600, 0.00, 0.00, 0.25, 0.00, 0.1447, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.2618, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.8200, 0.00, 0.00, 0.25, 0.00, 0.7984, 0.0000, 0.0000, 0.3333

A Lightweight Five-Card Poker Library Using JavaScript

One evening, I just couldn’t fall asleep. So I decided to implement a lightweight five-card poker library using JavaScript. My library has a Card class, a Hand class, and a SingleDeck class. The three main functions are: 1.) classify a hand (like “FullHouse”), 2.) compare two hands to determine which hand is better, 3.) deal a hand from a deck of 52 cards.

I didn’t implement my poker library starting from nothing — I refactored my existing C# poker library. The C# poker library took many hours to create, but my JavaScript version only took about four hours of work.

There are two ways to create a Card object:

  let c1 = Card.fromInts(14,3); // Ace of spades
  console.log(c1.toString());

  let c2 = Card.fromStr("Td");  // Ten of diamonds
  console.log(c2.toString());

The first pseudo-constructor accepts a rank and a suit as integers/numbers. The rank values are 2 = Two, 3 = Three, . . 10 = Ten, 11 = Jack, 12 = Queen, 13 = King, 14 = Ace. Rank values of 0 and 1 are not used. The suit values are 0 = clubs, 1 = diamonds, 2 = hearts, 3 = spades. The second pseudo-constructor accepts a string like “Td”. Because JavaScript doesn’t allow function/method overloading, I simulated overloading by defining two static factory methods.

There are three main ways to create a five-card Hand object:

  let h1 = Hand.fromStr("7cTsJc8d9h");
  console.log(h1.toString());  // 7c8d9hTsJc
  
  let h2 = Hand.fromCards(Card.fromStr("6s"),
    Card.fromStr("Ah"), Card.fromStr("6h"),
    Card.fromStr("Ac"), Card.fromStr("6d"));
  console.log(h2.toString());  // 6d6h6sAcAh
  
  let lst = [];
  lst.push(Card.fromStr("5c")); lst.push(Card.fromStr("5d"));
  lst.push(Card.fromStr("9c")); lst.push(Card.fromStr("9d"));
  lst.push(Card.fromStr("Qh"));
  let h3 = Hand.fromList(lst);
  console.log(h3.toString());  // 5c5d9c9dQh

The first pseudo-constructor accepts an easy-to-interpret string such as “7cTsJc8d9h”. The second pseudo-constructor accepts five individual Card objects. The third pseudo-constructor accepts a List of five Card objects.

Hand objects are sorted from low card (“2c”) to high card (“As”). The sorting makes a hand easier to interpret, and much easier to classify and compare.

There are two methods to classify a Hand object. The getHandTypeStr() method returns one of ten strings: “HighCard”, “OnePair”, “TwoPair”, “ThreeKind”, “Straight”, “Flush”, “FullHouse”, “FourKind”, “StraightFlush”, “RoyalFlush”. The getHandTypeInt() method returns integer 0 (high card) through 9 (royal flush).

  console.log(h1.getHandTypeStr())  // Straight
  console.log(h1.getHandTypeInt().toString())  // 4

  console.log(h2.getHandTypeStr())  // FullHouse
  console.log(h2.getHandTypeInt().toString())  // 6

  console.log(h3.getHandTypeStr())  // TwoPair
  console.log(h3.getHandTypeInt().toString())  // 2

There is a static Hand.compare(h1, h2) function. It returns -1 if h1 is less than h2, returns +1 if h1 is greater than h2, returns 0 if h1 equals h2.

  let cmp1 = Hand.compare(h1, h2);  // -1: Straight < FullHouse
  let cmp2 = Hand.compare(h2, h3);  // +1: FullHouse > TwoPair
  console.log("\nHand.compare(h2, h3) = ");
  console.log(cmp2.toString());

The SingleDeck class has a dealHand() method and a dealListCards() method. The dealHand() method returns a Hand object containing five Card objects. The dealListCards(n) method returns a List/Array of n Card objects.

  let d1 = new SingleDeck(1);
  d1.shuffle();
  d1.show();

  let h4 = d1.dealHand();
  console.log(h4.toString());

  let listCards = d1.dealListCards(38);
  console.log("Deck is now: ");  // 9 cards left
  d1.show();

To shuffle the deck, I implemented a poor man’s random number generator using the decimal part of the sin() function.
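
The idea is to take the sine of an incrementing seed value, multiply by a large constant, and keep only the decimal part. Here is a minimal sketch of the technique; LiteRNG and shuffleInPlace are hypothetical names, not the library’s actual internals:

  // sketch of a sin()-based random number generator;
  // LiteRNG and shuffleInPlace are hypothetical names
  class LiteRNG {
    constructor(seed) {
      this.seed = seed;
    }
    next() {  // crude uniform value in [0, 1)
      let x = Math.sin(this.seed++) * 10000.0;
      return x - Math.floor(x);  // keep the decimal part
    }
  }

  // Fisher-Yates shuffle driven by the generator
  function shuffleInPlace(cards, rng) {
    for (let i = cards.length - 1; i > 0; --i) {
      let j = Math.floor(rng.next() * (i + 1));  // 0 to i
      [cards[i], cards[j]] = [cards[j], cards[i]];
    }
  }

A generator like this is nowhere near cryptographic quality, but it is reproducible and good enough for shuffling cards in a demo.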

The JavaScript poker library can be used in several ways. You can compute the probabilities of different hands using a simulation, along the lines of the sketch below. You can find the best five-card hand from seven cards. And so on. I’ll post more complete examples at some point if I run into another sleepless night.
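
For example, a quick Monte Carlo estimate of the hand type probabilities could look like this. I’m assuming here that the SingleDeck constructor argument acts as a seed, so that each trial gets a different shuffle; adjust if the argument means something else.

  // sketch: estimate hand type probabilities by simulation
  let counts = {};
  const nTrials = 100000;
  for (let t = 0; t < nTrials; ++t) {
    let d = new SingleDeck(t);  // assumes t seeds the shuffle
    d.shuffle();
    let h = d.dealHand();
    let s = h.getHandTypeStr();  // like "OnePair"
    counts[s] = (counts[s] || 0) + 1;
  }
  for (let k in counts)
    console.log(k + " : " + (counts[k] / nTrials).toFixed(5));

With enough trials, the estimates should land close to the known five-card values, for example, about 0.4226 for “OnePair” and about 0.5012 for “HighCard”.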



I have loved cards and card games for as long as I can remember. Here are three examples I found on the Internet that I remember using when I was a young man in the 1960s. Left: The Lane company was a leading maker of plastic coated cards in the 1950s and 60s. The oriental theme seemed exotic and mysterious to me. I don’t think Lane is still around. Center: The KEM company was another leading maker of high-quality plastic coated cards. KEM is still in existence. I liked the geometry and colors of this set of two decks. Right: The Fournier company wasn’t as popular as Lane and KEM, but Fournier made some interesting and offbeat cards. Fournier is still in existence too. I think my family’s Fournier deck came from my grandfather on my mother’s side. Fournier is a Spanish company. My grandfather was French and always brought us interesting gifts from Europe.


Demo code.

// poker.js
// ES6  node.js

// ----------------------------------------------------------

class Card
{
  constructor()
  {
    // dummy Card; real values set by fromInts() or fromStr()
    this.rank = -1;  // 2 = Two, . . 14 = Ace
    this.suit = -1;  // 0=clubs, diamonds, hearts, 3=spades
  }

  static fromInts(rnk, sut) {
    let result = new Card();
    result.rank = rnk;
    result.suit = sut;
    return result;
  }

  static fromStr(str) {
    let result = new Card();
    let rnk = str.charAt(0);
    let sut = str.charAt(1);

    if (rnk == 'A') result.rank = 14;
    else if (rnk == 'K') result.rank = 13;
    else if (rnk == 'Q') result.rank = 12;
    else if (rnk == 'J') result.rank = 11;
    else if (rnk == 'T') result.rank = 10;
    else result.rank = parseInt(rnk);
   
    if (sut == 'c') result.suit = 0;
    else if (sut == 'd') result.suit = 1;
    else if (sut == 'h') result.suit = 2;
    else if (sut == 's') result.suit = 3;

    return result;
  }

  toString() {
    let rnk = ""; let sut = "";
    if (this.rank == 10) rnk = "T";
    else if (this.rank == 11) rnk = "J";
    else if (this.rank == 12) rnk = "Q";
    else if (this.rank == 13) rnk = "K";
    else if (this.rank == 14) rnk = "A";
    else rnk = this.rank.toString();

    if (this.suit == 0) sut = "c";
    else if (this.suit == 1) sut = "d";
    else if (this.suit == 2) sut = "h";
    else if (this.suit == 3) sut = "s";

    return rnk + sut;
  }

} // class Card

// ----------------------------------------------------------

class Hand
{
  constructor()
  {
    this.cards = [];  // make dummy 2c, 3c, 4c, 5c, 6c
    for (let i = 0; i < 5; ++i)
      this.cards[i] = Card.fromInts(i+2, 0);
  }

  static fromStr(str) {  // like "Js3h7d7cAd"
    let result = new Hand();  // dummy hand
    result.cards[0] = Card.fromStr(str.substring(0,2));
    result.cards[1] = Card.fromStr(str.substring(2,4));
    result.cards[2] = Card.fromStr(str.substring(4,6));
    result.cards[3] = Card.fromStr(str.substring(6,8));
    result.cards[4] = Card.fromStr(str.substring(8,10));

    // sort the Hand low to high by rank then by suit
    result.cards.sort((a,b) => a.rank - b.rank ||
      a.suit - b.suit);
    return result;
  }

  static fromCards(c0, c1, c2, c3, c4) {
    let result = new Hand();  // dummy hand
    result.cards[0] = c0;
    result.cards[1] = c1;
    result.cards[2] = c2;
    result.cards[3] = c3;
    result.cards[4] = c4;
    result.cards.sort((a,b) => a.rank - b.rank ||
      a.suit - b.suit);
    return result;
  }

  static fromList(lst) {
    let result = new Hand();  // dummy hand
    result.cards[0] = lst[0];
    result.cards[1] = lst[1];
    result.cards[2] = lst[2];
    result.cards[3] = lst[3];
    result.cards[4] = lst[4];
    result.cards.sort((a,b) => a.rank - b.rank ||
      a.suit - b.suit);
    return result;
  }

  toString() {
    let result = "";
    for (let i = 0; i < 5; ++i)
      result += this.cards[i].toString();
    return result;
  }

  // Hand type functions
  // getHandTypeStr(), getHandTypeInt(),
  //
  // isRoyalFlush(), isStraightFlush(), 
  // isFourKind(), isFullHouse(), isFlush(),
  // isStraight(), isThreeKind(), isTwoPair(),
  // isOnePair(), isHighCard()
  //
  // helpers: hasFlush(), hasStraight()

  // --------------------------------------------------------

  getHandTypeStr() {
    if (Hand.isRoyalFlush(this) == true)
      return "RoyalFlush";
    else if (Hand.isStraightFlush(this) == true)
      return "StraightFlush";
    else if (Hand.isFourKind(this) == true)
      return "FourKind";
    else if (Hand.isFullHouse(this) == true)
      return "FullHouse";
    else if (Hand.isFlush(this) == true)
      return "Flush";
    else if (Hand.isStraight(this) == true)
      return "Straight";
    else if (Hand.isThreeKind(this) == true)
      return "ThreeKind";
    else if (Hand.isTwoPair(this) == true)
      return "TwoPair";
    else if (Hand.isOnePair(this) == true)
      return "OnePair";
    else if (Hand.isHighCard(this) == true)
      return "HighCard";
    else
      return "Unknown";
  }

  // --------------------------------------------------------

  getHandTypeInt() {
    if (Hand.isRoyalFlush(this) == true)
      return 9;
    else if (Hand.isStraightFlush(this) == true)
      return 8;
    else if (Hand.isFourKind(this) == true)
      return 7;
    else if (Hand.isFullHouse(this) == true)
      return 6;
    else if (Hand.isFlush(this) == true)
      return 5;
    else if (Hand.isStraight(this) == true)
      return 4;
    else if (Hand.isThreeKind(this) == true)
      return 3;
    else if (Hand.isTwoPair(this) == true)
      return 2;
    else if (Hand.isOnePair(this) == true)
      return 1;
    else if (Hand.isHighCard(this) == true)
      return 0;
    else
      return -1;
  }

  // --------------------------------------------------------

  static hasFlush(h) {
    if ((h.cards[0].suit == h.cards[1].suit) &&
      (h.cards[1].suit == h.cards[2].suit) &&
      (h.cards[2].suit == h.cards[3].suit) &&
      (h.cards[3].suit == h.cards[4].suit))
    return true;

    return false;
  }

  // --------------------------------------------------------

  static hasStraight(h) {
    // check special case of Ace-low straight
    // 2, 3, 4, 5, A when sorted
    if (h.cards[0].rank == 2 &&
      h.cards[1].rank == 3 &&
      h.cards[2].rank == 4 &&
      h.cards[3].rank == 5 &&
      h.cards[4].rank == 14)
      return true;

    // otherwise, check for 5 consecutive
    if ((h.cards[0].rank == h.cards[1].rank - 1) &&
      (h.cards[1].rank == h.cards[2].rank - 1) &&
      (h.cards[2].rank == h.cards[3].rank - 1) &&
      (h.cards[3].rank == h.cards[4].rank - 1))
      return true;

    return false;
  }

  // --------------------------------------------------------

  static isRoyalFlush(h) {
    if (Hand.hasStraight(h) == true &&
      Hand.hasFlush(h) == true &&
      h.cards[0].rank == 10)
      return true;
    else
      return false;
  }

  // --------------------------------------------------------

  static isStraightFlush(h) {
    if (Hand.hasStraight(h) == true &&
     Hand.hasFlush(h) == true &&
     h.cards[0].rank != 10)
     return true;
    else
      return false;
  }

  // --------------------------------------------------------

  static isFourKind(h) {
    // AAAA B or B AAAA if sorted
    if ((h.cards[0].rank == h.cards[1].rank) &&
      (h.cards[1].rank == h.cards[2].rank) &&
      (h.cards[2].rank == h.cards[3].rank) &&
      (h.cards[3].rank != h.cards[4].rank))
      return true;

    if ((h.cards[1].rank == h.cards[2].rank) &&
      (h.cards[2].rank == h.cards[3].rank) &&
      (h.cards[3].rank == h.cards[4].rank) &&
      (h.cards[0].rank != h.cards[1].rank))
      return true;

    return false;
  }

  // --------------------------------------------------------

  static isFullHouse(h) {
    // AAA BB or BB AAA if sorted
    if ((h.cards[0].rank == h.cards[1].rank) &&
      (h.cards[1].rank == h.cards[2].rank) &&
      (h.cards[3].rank == h.cards[4].rank) &&
      (h.cards[2].rank != h.cards[3].rank))
      return true;

    // BB AAA
    if ((h.cards[0].rank == h.cards[1].rank) &&
      (h.cards[2].rank == h.cards[3].rank) &&
      (h.cards[3].rank == h.cards[4].rank) &&
      (h.cards[1].rank != h.cards[2].rank))
      return true;

    return false;
  }

  // --------------------------------------------------------

  static isFlush(h) {
    if (Hand.hasFlush(h) == true &&
      Hand.hasStraight(h) == false)
      return true; // no StraightFlush or RoyalFlush
    else
      return false;
  }

  // --------------------------------------------------------

  static isStraight(h) {
    if (Hand.hasStraight(h) == true &&
      Hand.hasFlush(h) == false) // no SF or RF
      return true;
    else
      return false;
  }

  // --------------------------------------------------------

  static isThreeKind(h) {
    // AAA B C or B AAA C or B C AAA if sorted
    if ((h.cards[0].rank == h.cards[1].rank) &&
      (h.cards[1].rank == h.cards[2].rank) &&
      (h.cards[2].rank != h.cards[3].rank) &&
      (h.cards[3].rank != h.cards[4].rank))
      return true;

    if ((h.cards[1].rank == h.cards[2].rank) &&
      (h.cards[2].rank == h.cards[3].rank) &&
      (h.cards[0].rank != h.cards[1].rank) &&
      (h.cards[3].rank != h.cards[4].rank))
      return true;

    if ((h.cards[2].rank == h.cards[3].rank) &&
      (h.cards[3].rank == h.cards[4].rank) &&
      (h.cards[0].rank != h.cards[1].rank) &&
      (h.cards[1].rank != h.cards[2].rank))
      return true;

    return false;
  }

  // --------------------------------------------------------

  static isTwoPair(h) {
    // AA BB C or AA C BB or C AA BB if sorted
    if ((h.cards[0].rank == h.cards[1].rank) &&
      (h.cards[2].rank == h.cards[3].rank) &&
      (h.cards[1].rank != h.cards[2].rank) &&
      (h.cards[3].rank != h.cards[4].rank))
      return true;  // AA BB C

    if ((h.cards[0].rank == h.cards[1].rank) &&
      (h.cards[3].rank == h.cards[4].rank) &&
      (h.cards[1].rank != h.cards[2].rank) &&
      (h.cards[2].rank != h.cards[3].rank))
      return true;  // AA C BB

    if ((h.cards[1].rank == h.cards[2].rank) &&
      (h.cards[3].rank == h.cards[4].rank) &&
      (h.cards[0].rank != h.cards[1].rank) &&
      (h.cards[2].rank != h.cards[3].rank))
      return true;  // C AA BB

    return false;
  }

  // --------------------------------------------------------

  static isOnePair(h) {
    // AA B C D or B AA C D or B C AA D or B C D AA
    if ((h.cards[0].rank == h.cards[1].rank) &&
      (h.cards[1].rank != h.cards[2].rank) &&
      (h.cards[2].rank != h.cards[3].rank) &&
      (h.cards[3].rank != h.cards[4].rank))
      return true;  // AA B C D

    if ((h.cards[1].rank == h.cards[2].rank) &&
      (h.cards[0].rank != h.cards[1].rank) &&
      (h.cards[2].rank != h.cards[3].rank) &&
      (h.cards[3].rank != h.cards[4].rank))
      return true;  // B AA C D

    if ((h.cards[2].rank == h.cards[3].rank) &&
      (h.cards[0].rank != h.cards[1].rank) &&
      (h.cards[1].rank != h.cards[2].rank) &&
      (h.cards[3].rank != h.cards[4].rank))
      return true;  // B C AA D

    if ((h.cards[3].rank == h.cards[4].rank) &&
      (h.cards[0].rank != h.cards[1].rank) &&
      (h.cards[1].rank != h.cards[2].rank) &&
      (h.cards[2].rank != h.cards[3].rank))
      return true;  // B C D AA

    return false;
  }

  // --------------------------------------------------------

  static isHighCard(h) {
    if (Hand.hasFlush(h) == true)
      return false;
    else if (Hand.hasStraight(h) == true)
      return false;
    else  {
      // all remaining have at least one pair
      if ((h.cards[0].rank == h.cards[1].rank) ||
        (h.cards[1].rank == h.cards[2].rank) ||
        (h.cards[2].rank == h.cards[3].rank) ||
        (h.cards[3].rank == h.cards[4].rank))
        return false;
    }

    return true;
  }

  // --------------------------------------------------------

  // Hand comparison methods
  // Hand.compare() calls:
  // breakTieStraightFlush(), breakTieFourKind(),
  // breakTieFullHouse(), breakTieFlush(),
  // breakTieStraight(), breakTieThreeKind(),
  // breakTieTwoPair(), breakTieOnePair(),
  // breakTieHighCard()

  // --------------------------------------------------------

  static compare(h1, h2) {
    // -1 if h1 < h2, +1 if h1 > h2, 0 if h1 == h2

    let h1Idx = h1.getHandTypeInt();  // like 6
    let h2Idx = h2.getHandTypeInt();

    // different hand types - easy
    if (h1Idx < h2Idx)
      return -1;
    else if (h1Idx > h2Idx)
      return +1;
    else // same hand types so break tie
    {
      let h1HandType = h1.getHandTypeStr();
      let h2HandType = h2.getHandTypeStr();

      if (h1HandType != h2HandType)
        console.log("Logic error in Hand.compare() ");

      if (h1HandType == "RoyalFlush")
        return 0; // two Royal Flush always tie
      else if (h1HandType == "StraightFlush")
        return Hand.breakTieStraightFlush(h1, h2);
      else if (h1HandType == "FourKind")
        return Hand.breakTieFourKind(h1, h2);
      else if (h1HandType == "FullHouse")
        return Hand.breakTieFullHouse(h1, h2);
      else if (h1HandType == "Flush")
        return Hand.breakTieFlush(h1, h2);
      else if (h1HandType == "Straight")
        return Hand.breakTieStraight(h1, h2);
      else if (h1HandType == "ThreeKind")
        return Hand.breakTieThreeKind(h1, h2);
      else if (h1HandType == "TwoPair")
        return Hand.breakTieTwoPair(h1, h2);
      else if (h1HandType == "OnePair")
        return Hand.breakTieOnePair(h1, h2);
      else if (h1HandType == "HighCard")
        return Hand.breakTieHighCard(h1, h2);
    }
    return -2;  // error
  }

  // --------------------------------------------------------

  static breakTieStraightFlush(h1, h2) {
    // check special case of Ace-low straight flush
    // check one or two Ace-low hands
    // h1 is Ace-low, h2 not Ace-low: h1 is less
    if ((h1.cards[0].rank == 2 &&
      h1.cards[4].rank == 14) &&  // because sorted!
      !(h2.cards[0].rank == 2 &&
      h2.cards[4].rank == 14))
      return -1;

    // h1 not Ace-low, h2 is Ace-low: h1 is better
    else if (!(h1.cards[0].rank == 2 &&
      h1.cards[4].rank == 14) &&
      (h2.cards[0].rank == 2 &&
      h2.cards[4].rank == 14))
      return +1;
    // two Ace-low hands
    else if ((h1.cards[0].rank == 2 &&
      h1.cards[4].rank == 14) &&  // Ace-low
      (h2.cards[0].rank == 2 &&
      h2.cards[4].rank == 14))  // Ace-low
      return 0;

    //  no Ace-low straight flush so check high cards
    if (h1.cards[4].rank < h2.cards[4].rank)
      return -1;
    else if (h1.cards[4].rank > h2.cards[4].rank)
      return 1;
    else
      return 0;
  }

  // --------------------------------------------------------

  static breakTieFourKind(h1, h2) {
    // AAAA-B or B-AAAA
    // the off-card is at [0] or at [4] (hand is sorted)
    // find h1 four-card and off-card ranks
    let h1FourRank; let h1OffRank;
    if (h1.cards[0].rank == h1.cards[1].rank) {
      // 1st two cards same so off-rank at [4]
      h1FourRank = h1.cards[0].rank;
      h1OffRank = h1.cards[4].rank;
    }
    else {
      // 1st two cards diff so off-rank at [0]
      h1FourRank = h1.cards[4].rank;
      h1OffRank = h1.cards[0].rank;
    }

    let h2FourRank; let h2OffRank;
    if (h2.cards[0].rank == h2.cards[1].rank) {
      h2FourRank = h2.cards[0].rank;
      h2OffRank = h2.cards[4].rank;
    }
    else {
      h2FourRank = h2.cards[4].rank;
      h2OffRank = h2.cards[0].rank;
    }

    if (h1FourRank < h2FourRank) // like 4K, 4A
      return -1;
    else if (h1FourRank > h2FourRank)
      return +1;
    else { // both hands have same four-kind (mult. decks)
      if (h1OffRank < h2OffRank)
        return -1;  // like 3c 9c9d9h9s < Qd 9c9d9h9s
      else if (h1OffRank > h2OffRank)
        return +1;  // like Jc 4c4d4h4s > 9s 4c4d4h4s
      else if (h1OffRank == h2OffRank)
        return 0;
    }
    console.log("Fatal logic error in breakTieFourKind");
  }

  // --------------------------------------------------------

  static breakTieFullHouse(h1, h2) {
    // determine high rank (3 kind) and low rank (2 kind)
    // AAA BB or AA BBB
    // if [1] == [2] 3 kind at [0][1][2]
    // if [1] != [2] 3 kind at [2][3][4]
    let h1ThreeRank; let h1TwoRank;
    if (h1.cards[1].rank == h1.cards[2].rank) {
      // if [1] == [2] 3 kind at [0][1][2]
      h1ThreeRank = h1.cards[0].rank;
      h1TwoRank = h1.cards[4].rank;
    }
    else  {
      // if [1] != [2] 3 kind at [2][3][4]
      h1ThreeRank = h1.cards[4].rank;
      h1TwoRank = h1.cards[0].rank;
    }

    let h2ThreeRank; let h2TwoRank;
    if (h2.cards[1].rank == h2.cards[2].rank) {
      // if [1] == [2] 3 kind at [0][1][2]
      h2ThreeRank = h2.cards[0].rank;
      h2TwoRank = h2.cards[4].rank;
    }
    else {
      // if [1] != [2] 3 kind at [2][3][4]
      h2ThreeRank = h2.cards[4].rank;
      h2TwoRank = h2.cards[0].rank;
    }

    if (h1ThreeRank "lt" h2ThreeRank)
      return -1;
    else if (h1ThreeRank "gt" h2ThreeRank)
      return +1;
    else { // both hands same three-kind (mult. decks)
      if (h1TwoRank "lt" h2TwoRank)
        return -1;  // like 3c3d 9c9d9h "lt" QdQs 9c9d9h
      else if (h1TwoRank "gt" h2TwoRank)
        return +1;  // like 3c3d 9c9d9h "gt" 2d2s 9c9d9h
      else if (h1TwoRank == h2TwoRank)
        return 0;
    }
    console.log("Fatal logic in breakTieFullHouse");
  }

  // --------------------------------------------------------

  static breakTieFlush(h1, h2) {
    // compare rank of high cards
    if (h1.cards[4].rank "lt" h2.cards[4].rank)
      return -1;
    else if (h1.cards[4].rank "gt" h2.cards[4].rank)
      return +1;
    // high cards equal so check at [3]
    else if (h1.cards[3].rank "lt" h2.cards[3].rank)
      return -1;
    else if (h1.cards[3].rank "gt" h2.cards[3].rank)
      return +1;
    // and so on
    else if (h1.cards[2].rank "lt" h2.cards[2].rank)
      return -1;
    else if (h1.cards[2].rank "gt" h2.cards[2].rank)
      return +1;
    //
    else if (h1.cards[1].rank "lt" h2.cards[1].rank)
      return -1;
    else if (h1.cards[1].rank "gt" h2.cards[1].rank)
      return +1;
    //
    else if (h1.cards[0].rank "lt" h2.cards[0].rank)
      return -1;
    else if (h1.cards[0].rank "gt" h2.cards[0].rank)
      return +1;
    //
    else
      return 0; // all ranks the same!
  }

  // --------------------------------------------------------

  static breakTieStraight(h1, h2) {
    // both hands are straights but one could be Ace-low
    // check special case of one or two Ace-low hands
    // h1 is Ace-low, h2 not Ace-low. h1 is less
    if ((h1.cards[0].rank == 2 "and"  // Ace-low (sorted!)
      h1.cards[4].rank == 14) "and"
      !(h2.cards[0].rank == 2 "and"
      h2.cards[4].rank == 14))
      return -1;
    // h1 not Ace-low, h2 is Ace-low, h1 is better
    else if (!(h1.cards[0].rank == 2 "and"
      h1.cards[4].rank == 14) "and"
      (h2.cards[0].rank == 2 "and"
      h2.cards[4].rank == 14))
      return +1;
    // two Ace-low hands
    else if ((h1.cards[0].rank == 2 "and"
      h1.cards[4].rank == 14) "and"
      (h2.cards[0].rank == 2 "and"
      h2.cards[4].rank == 14))
      return 0;

    // no Ace-low hands so just check high card
    if (h1.cards[4].rank "lt" h2.cards[4].rank)
      return -1;
    else if (h1.cards[4].rank "gt" h2.cards[4].rank)
      return +1;
    else if (h1.cards[4].rank == h2.cards[4].rank)
      return 0;
    else
      console.log("Fatal logic in breakTieStraight");
  }

  // --------------------------------------------------------

  static breakTieThreeKind(h1, h2) {
    // assumes multiple decks possible
    // (TTT L H) or (L TTT H) or (L H TTT)
    let h1ThreeRank = 0; let h1LowRank = 0;
    let h1HighRank = 0;
    if (h1.cards[0].rank == h1.cards[1].rank "and"
      h1.cards[1].rank == h1.cards[2].rank) {
      h1ThreeRank = h1.cards[0].rank;
      h1LowRank = h1.cards[3].rank;
      h1HighRank = h1.cards[4].rank;
    }
    else if (h1.cards[1].rank == h1.cards[2].rank "and"
      h1.cards[2].rank == h1.cards[3].rank) {
      h1LowRank = h1.cards[0].rank;
      h1ThreeRank = h1.cards[1].rank;
      h1HighRank = h1.cards[4].rank;
    }
    else if (h1.cards[2].rank == h1.cards[3].rank "and"
      h1.cards[3].rank == h1.cards[4].rank) {
      h1LowRank = h1.cards[0].rank;
      h1HighRank = h1.cards[1].rank;
      h1ThreeRank = h1.cards[4].rank;
    }

    let h2ThreeRank = 0; let h2LowRank = 0;
    let h2HighRank = 0;
    if (h2.cards[0].rank == h2.cards[1].rank "and"
      h2.cards[1].rank == h2.cards[2].rank) {
      h2ThreeRank = h2.cards[0].rank;
      h2LowRank = h2.cards[3].rank;
      h2HighRank = h2.cards[4].rank;
    }
    else if (h2.cards[1].rank == h2.cards[2].rank "and"
      h2.cards[2].rank == h2.cards[3].rank) {
      h2LowRank = h2.cards[0].rank;
      h2ThreeRank = h2.cards[1].rank;
      h2HighRank = h2.cards[4].rank;
    }
    else if (h2.cards[2].rank == h2.cards[3].rank "and"
      h2.cards[3].rank == h2.cards[4].rank) {
      h2LowRank = h2.cards[0].rank;
      h2HighRank = h2.cards[1].rank;
      h2ThreeRank = h2.cards[4].rank;
    }

    if (h1ThreeRank "lt" h2ThreeRank)
      return -1;
    else if (h1ThreeRank "gt" h2ThreeRank)
      return +1;
    // both hands three-kind same (mult. decks)
    else if (h1HighRank "lt" h2HighRank)
      return -1;
    else if (h1HighRank "gt" h2HighRank)
      return +1;
    //
    else if (h1LowRank "lt" h2LowRank)
      return -1;
    else if (h1LowRank "gt" h2LowRank)
      return +1;
    //
    else // wow!
      return 0;
  }

  // --------------------------------------------------------

  static breakTieTwoPair(h1, h2) {
    // (LL X HH) or (LL HH X) or (X LL HH)
    let h1LowRank = 0; let h1HighRank = 0;
    let h1OffRank = 0;
    if (h1.cards[0].rank == h1.cards[1].rank "and"
      h1.cards[3].rank == h1.cards[4].rank) {
      // (LL X HH)
      h1LowRank = h1.cards[0].rank;
      h1HighRank = h1.cards[4].rank;
      h1OffRank = h1.cards[2].rank;
    }
    else if (h1.cards[0].rank == h1.cards[1].rank "and"
      h1.cards[2].rank == h1.cards[3].rank) {
      // (LL HH X)
      h1LowRank = h1.cards[0].rank;
      h1HighRank = h1.cards[2].rank;
      h1OffRank = h1.cards[4].rank;
    }
    else if (h1.cards[1].rank == h1.cards[2].rank "and"
      h1.cards[3].rank == h1.cards[4].rank) {
      // (X LL HH)
      h1LowRank = h1.cards[1].rank;
      h1HighRank = h1.cards[3].rank;
      h1OffRank = h1.cards[0].rank;
    }

    let h2LowRank = 0; let h2HighRank = 0;
    let h2OffRank = 0;
    if (h2.cards[0].rank == h2.cards[1].rank "and"
      h2.cards[3].rank == h2.cards[4].rank) {
      // (LL X HH)
      h2LowRank = h2.cards[0].rank;
      h2HighRank = h2.cards[4].rank;
      h2OffRank = h2.cards[2].rank;
    }
    else if (h2.cards[0].rank == h2.cards[1].rank "and"
      h2.cards[2].rank == h2.cards[3].rank) {
      // (LL HH X)
      h2LowRank = h2.cards[0].rank;
      h2HighRank = h2.cards[2].rank;
      h2OffRank = h2.cards[4].rank;
    }
    else if (h2.cards[1].rank == h2.cards[2].rank "and"
      h2.cards[3].rank == h2.cards[4].rank) {
      // (X LL HH)
      h2LowRank = h2.cards[1].rank;
      h2HighRank = h2.cards[3].rank;
      h2OffRank = h2.cards[0].rank;
    }

    if (h1HighRank "lt" h2HighRank)
      return -1;
    else if (h1HighRank "gt" h2HighRank)
      return +1;
    else if (h1LowRank "lt" h2LowRank)
      return -1;
    else if (h1LowRank "gt" h2LowRank)
      return +1;
    else if (h1OffRank "lt" h2OffRank)
      return -1;
    else if (h1OffRank "gt" h2OffRank)
      return +1;
    else
      return 0;
  }

  // --------------------------------------------------------

  static breakTieOnePair(h1, h2) {
    // (PP L M H) or (L PP M H)
    // or (L M PP H) or (L M H PP)
    let h1PairRank = 0; let h1LowRank = 0;
    let h1MediumRank = 0; let h1HighRank = 0;
    if (h1.cards[0].rank == h1.cards[1].rank) {
      // (PP L M H)
      h1PairRank = h1.cards[0].rank;
      h1LowRank = h1.cards[2].rank;
      h1MediumRank = h1.cards[3].rank;
      h1HighRank = h1.cards[4].rank;
    }
    else if (h1.cards[1].rank == h1.cards[2].rank) {
      // (L PP M H)
      h1PairRank = h1.cards[1].rank;
      h1LowRank = h1.cards[0].rank;
      h1MediumRank = h1.cards[3].rank;
      h1HighRank = h1.cards[4].rank;
    }
    else if (h1.cards[2].rank == h1.cards[3].rank) {
      // (L M PP H)
      h1PairRank = h1.cards[2].rank;
      h1LowRank = h1.cards[0].rank;
      h1MediumRank = h1.cards[1].rank;
      h1HighRank = h1.cards[4].rank;
    }
    else if (h1.cards[3].rank == h1.cards[4].rank) {
      // (L M H PP)
      h1PairRank = h1.cards[4].rank;
      h1LowRank = h1.cards[0].rank;
      h1MediumRank = h1.cards[1].rank;
      h1HighRank = h1.cards[2].rank;
    }

    let h2PairRank = 0; let h2LowRank = 0;
    let h2MediumRank = 0; let h2HighRank = 0;
    if (h2.cards[0].rank == h2.cards[1].rank) {
      // (PP L M H)
      h2PairRank = h2.cards[0].rank;
      h2LowRank = h2.cards[2].rank;
      h2MediumRank = h2.cards[3].rank;
      h2HighRank = h2.cards[4].rank;
    }
    else if (h2.cards[1].rank == h2.cards[2].rank) {
      // (L PP M H)
      h2PairRank = h2.cards[1].rank;
      h2LowRank = h2.cards[0].rank;
      h2MediumRank = h2.cards[3].rank;
      h2HighRank = h2.cards[4].rank;
    }
    else if (h2.cards[2].rank == h2.cards[3].rank) {
      // (L M PP H)
      h2PairRank = h2.cards[2].rank;
      h2LowRank = h2.cards[0].rank;
      h2MediumRank = h2.cards[1].rank;
      h2HighRank = h2.cards[4].rank;
    }
    else if (h2.cards[3].rank == h2.cards[4].rank) {
      // (L M H PP)
      h2PairRank = h2.cards[4].rank;
      h2LowRank = h2.cards[0].rank;
      h2MediumRank = h2.cards[1].rank;
      h2HighRank = h2.cards[2].rank;
    }

    if (h1PairRank "lt" h2PairRank)
      return -1;
    else if (h1PairRank "gt" h2PairRank)
      return +1;
    //
    else if (h1HighRank "lt" h2HighRank)
      return -1;
    else if (h1HighRank "gt" h2HighRank)
      return +1;
    //
    else if (h1MediumRank "lt" h2MediumRank)
      return -1;
    else if (h1MediumRank "gt" h2MediumRank)
      return +1;
    //
    else if (h1LowRank "lt" h2LowRank)
      return -1;
    else if (h1LowRank "gt" h2LowRank)
      return +1;
    //
    else
      return 0;
  }

  // --------------------------------------------------------

  static breakTieHighCard(h1, h2) {
    if (h1.cards[4].rank "lt" h2.cards[4].rank)
      return -1;
    else if (h1.cards[4].rank "gt" h2.cards[4].rank)
      return +1;
    //
    else if (h1.cards[3].rank "lt" h2.cards[3].rank)
      return -1;
    else if (h1.cards[3].rank "gt" h2.cards[3].rank)
      return +1;
    //
    else if (h1.cards[2].rank "lt" h2.cards[2].rank)
      return -1;
    else if (h1.cards[2].rank "gt" h2.cards[2].rank)
      return +1;
    //
    else if (h1.cards[1].rank "lt" h2.cards[1].rank)
      return -1;
    else if (h1.cards[1].rank "gt" h2.cards[1].rank)
      return +1;
    //
    else if (h1.cards[0].rank "lt" h2.cards[0].rank)
      return -1;
    else if (h1.cards[0].rank "gt" h2.cards[0].rank)
      return +1;
    //
    else
      return 0;
  }

  // --------------------------------------------------------

} // class Hand

// ----------------------------------------------------------

class SingleDeck
{
  constructor(seed)
  {
    this.deck = [];
    this.seed = seed + 0.5;  // avoid 0
    this.currCardIdx = 0;
    for (let rnk = 2; rnk "lt" 15; ++rnk) {
      for (let sut = 0; sut "lt" 4; ++sut) {
        let c = Card.fromInts(rnk, sut);
        this.deck.push(c);
      }
    }
  }

  shuffle() {
    for (let i = 0; i "lt" 52; ++i) {
      let rix = this.nextInt(i, 52);
      let tmp = this.deck[i];  // Card object
      this.deck[i] = this.deck[rix];
      this.deck[rix] = tmp;
    }
    this.currCardIdx = 0;
  }

  nextInt(lo, hi) {  // poor man's Random
    let x = Math.sin(this.seed) * 1000;
    let z = x - Math.floor(x);  // [0.0,1.0)
    this.seed = z;  // for next call
    return Math.trunc((hi - lo) * z + lo);
  }

  dealHand() {
    // TODO: check if at least 5 cards left in deck
    let lst = [];
    for (let i = 0; i "lt" 5; ++i) {
      let c = this.deck[this.currCardIdx++];
      lst.push(c);
    }
    let h = Hand.fromList(lst);
    return h;
  }

  dealListCards(n) {
   // TODO: check if at least n cards left in deck
    let lst = [];
    for (let i = 0; i "lt" n; ++i) {
      let c = this.deck[this.currCardIdx++];
      lst.push(c);
    }   
    return lst;
  }

  show() {
    let ct = 0;
    for (let i = this.currCardIdx; i "lt" 52; ++i) {
      if (ct "gt" 0 "and" ct % 10 == 0) console.log("");
      process.stdout.write(this.deck[i].toString() + " ");
      ++ct;
    }
    console.log("");
  }

} // class SingleDeck

// ----------------------------------------------------------

function main()
{
  console.log("\nBegin JavaScript poker lib demo ");

  // ----- Card ---------------------------------------------

  let c1 = Card.fromInts(14,3); // Ace of spades
  console.log("\nCard c1 = ");
  console.log(c1.toString());

  let c2 = Card.fromStr("Td");  // Ten of diamonds
  console.log("\nCard c2 = ");
  console.log(c2.toString());

  // ----- Hand ---------------------------------------------

  let h1 = Hand.fromStr("7cTsJc8d9h");
  console.log("\nHand h1 = ");
  console.log(h1.toString());  // 7c8d9hTsJc
  console.log(h1.getHandTypeStr());  // Straight
  console.log(h1.getHandTypeInt().toString());  // 4

  let h2 = Hand.fromCards(Card.fromStr("6s"),
    Card.fromStr("Ah"), Card.fromStr("6h"),
    Card.fromStr("Ac"), Card.fromStr("6d"));
  console.log("\nHand h2 = ");
  console.log(h2.toString());  // 6d6h6sAcAh
  console.log(h2.getHandTypeStr());  // FullHouse
  console.log(h2.getHandTypeInt().toString());  // 6

  let lst = [];
  lst.push(Card.fromStr("5c")); lst.push(Card.fromStr("5d"));
  lst.push(Card.fromStr("9c")); lst.push(Card.fromStr("9d"));
  lst.push(Card.fromStr("Qh"));
  let h3 = Hand.fromList(lst);
  console.log("\nHand h3 = ");
  console.log(h3.toString());  // 5c5d9c9dQh
  console.log(h3.getHandTypeStr());  // TwoPair
  console.log(h3.getHandTypeInt().toString());  // 2

  // ----- Compare Hands

  let cmp1 = Hand.compare(h1, h2);  // -1: Straight "lt" FH
  console.log("\nHand.compare(h1, h2) = ");
  console.log(cmp1.toString());

  let cmp2 = Hand.compare(h2, h3);  // +1 FH "gt" 2P
  console.log("\nHand.compare(h2, h3) = ");
  console.log(cmp2.toString());

  // ----- Deck ---------------------------------------------

  console.log("\nCreating and shuffling deck ");
  let d1 = new SingleDeck(1);
  d1.shuffle();
  d1.show();

  let h4 = d1.dealHand();
  console.log("\nDealing Hand from deck: ");
  console.log(h4.toString());

  console.log("\nDealing 38 cards from deck");
  let listCards = d1.dealListCards(38);
  console.log("Deck is now: ");
  d1.show();

  console.log("\nEnd demo ");
}

main()
Posted in Poker

Updating My JavaScript Multi-Class Classification Neural Network

Once or twice a year, I revisit my JavaScript implementations of a neural network. The system has enough complexity that there are dozens of ideas that can be explored.

My latest multi-class classification version makes many small changes from previous versions. The primary change was that I refactored the train() method from a very large single function to one that calls three helper functions: zeroOutGrads(), accumGrads(y), and updateWeights(lrnRate). This change required storing the hidden node and output node gradients as class matrices and vectors rather than as objects local to the train() method.
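The overall pattern, expressed as a minimal Python sketch with hypothetical method names (the actual JavaScript implementation appears in full below), is:

# sketch of the zero / accumulate / update training pattern; Python
# pseudocode with hypothetical method names, not the demo code below
import numpy as np

def train(net, train_x, train_y, lrn_rate, bat_size, max_epochs, rng):
  n = len(train_x)
  batches_per_epoch = n // bat_size
  indices = np.arange(n)
  for epoch in range(max_epochs):
    rng.shuffle(indices)
    ptr = 0
    for b in range(batches_per_epoch):
      net.zero_out_grads()                # reset accumulated gradients
      for i in range(bat_size):
        ii = indices[ptr]; ptr += 1
        net.compute_outputs(train_x[ii])  # forward pass, caches nodes
        net.accum_grads(train_y[ii])      # add this item's gradients
      net.update_weights(lrn_rate)        # one weight update per batch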

For my demo program, I used one of my standard synthetic datasets. The goal is to predict a person’s political leaning from sex, age, State, and income. The 240-item tab-delimited raw data looks like:

F   24   michigan   29500.00   liberal
M   39   oklahoma   51200.00   moderate
F   63   nebraska   75800.00   conservative
M   36   michigan   44500.00   moderate
F   27   nebraska   28600.00   liberal
. . .

I encoded sex as M = -1, F = 1, and State as Michigan = 100, Nebraska = 010, Oklahoma = 001. I used ordinal encoding for politics in the data file (conservative = 0, moderate = 1, liberal = 2, to sync with my PyTorch implementation), and then programmatically converted those values to one-hot form: conservative = 100, moderate = 010, liberal = 001. I normalized the numeric data by dividing age values by 100 and income values by 100,000. The resulting encoded and normalized comma-delimited data looks like the sample below (a short encoding sketch follows the sample):

 1, 0.24, 1, 0, 0, 0.2950, 2
-1, 0.39, 0, 0, 1, 0.5120, 1
 1, 0.63, 0, 1, 0, 0.7580, 0
-1, 0.36, 1, 0, 0, 0.4450, 1
 1, 0.27, 0, 1, 0, 0.2860, 2
. . .
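
A minimal sketch of this encoding, assuming the tab-delimited raw format shown above (the helper function is illustrative, not part of the demo):

# illustrative encoder for one raw tab-delimited line; the function
# name and file handling are my assumptions, not the demo code
def encode_line(line):
  sex, age, state, income, politics = line.split()
  sex_e = -1 if sex == "M" else 1
  state_e = {"michigan": [1, 0, 0], "nebraska": [0, 1, 0],
             "oklahoma": [0, 0, 1]}[state]
  politics_e = {"conservative": 0, "moderate": 1, "liberal": 2}[politics]
  return [sex_e, float(age) / 100] + state_e + \
         [float(income) / 100_000, politics_e]

print(encode_line("F\t24\tmichigan\t29500.00\tliberal"))
# [1, 0.24, 1, 0, 0, 0.295, 2]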

I split the data into a 200-item set of training data and a 40-item set of test data.

My neural architecture was 6-25-3 with tanh() hidden node activation and softmax() output node activation. For training, I used a batch size of 10, a learning rate of 0.01, and 10,000 epochs.
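
For comparison, a rough PyTorch equivalent of the 6-25-3 architecture might look like the sketch below. This is my approximation, not the actual PyTorch implementation mentioned; with CrossEntropyLoss the softmax is folded into the loss, so forward() returns raw logits:

import torch as T

class Net(T.nn.Module):
  # approximate 6-25-3 equivalent; a sketch, not the actual
  # PyTorch implementation referenced above
  def __init__(self):
    super().__init__()
    self.hid = T.nn.Linear(6, 25)
    self.oupt = T.nn.Linear(25, 3)

  def forward(self, x):
    z = T.tanh(self.hid(x))
    return self.oupt(z)  # logits; CrossEntropyLoss applies log-softmax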

The resulting model scored 0.9500 accuracy on the training data (190 out of 200 correct) and 0.7500 accuracy on the test data (30 out of 40 correct). These results are similar to those achieved by a PyTorch neural network and a LightGBM tree-based system.

Accuracy on training data = 0.9500
Accuracy on test data     = 0.7500

Computing confusion matrix
actual 0:    6   4   1
actual 1:    1  12   1
actual 2:    0   3  12

Good fun!



Whenever computer code is refactored, the feel and appearance of the code changes a bit. When the cover art for a novel is refactored, the look and feel of the novel is changed quite a bit. One of my favorite science fiction novels is “Starship Troopers” (1959) by Robert Heinlein. Left: The hardcover 1959 first edition with art by Jerry Robinson. Center: A 1968 softcover edition with art by Paul Lehr. Right: A 2006 e-book edition with cover art by Steve Stone.


Demo code. Very long! Replace “lt” (less than), “gt”, “lte”, “gte”, “and” with Boolean operator symbols. (My lame blog editor often chokes on symbols.)

// people_politics.js
// node.js  ES6

// multi-class one-hot predictors, ordinal targets
// softmax activation, MCEE loss

let U = require("..\\Utils\\utilities_lib.js")
let FS = require("fs")

// ----------------------------------------------------------

class NeuralNet
{
  constructor(numInput, numHidden, numOutput, seed)
  {
    this.rnd = new U.Erratic(seed);  // pseudo-random

    this.ni = numInput; 
    this.nh = numHidden;
    this.no = numOutput;

    this.iNodes = U.vecMake(this.ni, 0.0);
    this.hNodes = U.vecMake(this.nh, 0.0);
    this.oNodes = U.vecMake(this.no, 0.0);

    this.ihWeights = U.matMake(this.ni, this.nh, 0.0);
    this.hoWeights = U.matMake(this.nh, this.no, 0.0);

    this.hBiases = U.vecMake(this.nh, 0.0);
    this.oBiases = U.vecMake(this.no, 0.0);

    this.ihGrads = U.matMake(this.ni, this.nh, 0.0);
    this.hbGrads = U.vecMake(this.nh, 0.0);
    this.hoGrads = U.matMake(this.nh, this.no, 0.0);
    this.obGrads = U.vecMake(this.no, 0.0);

    this.initWeights();
  }

  initWeights()
  {
    let lo = -0.10;
    let hi = 0.10;
    for (let i = 0; i "lt" this.ni; ++i) {
      for (let j = 0; j "lt" this.nh; ++j) {
        this.ihWeights[i][j] = (hi - lo) * this.rnd.next() + lo;
      }
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      for (let k = 0; k "lt" this.no; ++k) {
        this.hoWeights[j][k] = (hi - lo) * this.rnd.next() + lo;
      }
    }
  } 

  // --------------------------------------------------------

  computeOutputs(X)
  {
    let hSums = U.vecMake(this.nh, 0.0);
    let oSums = U.vecMake(this.no, 0.0);
    
    this.iNodes = X;

    for (let j = 0; j "lt" this.nh; ++j) {
      for (let i = 0; i "lt" this.ni; ++i) {
        hSums[j] += this.iNodes[i] * this.ihWeights[i][j];
      }
      hSums[j] += this.hBiases[j];
      this.hNodes[j] = U.hyperTan(hSums[j]);
    }

    for (let k = 0; k "lt" this.no; ++k) {
      for (let j = 0; j "lt" this.nh; ++j) {
        oSums[k] += this.hNodes[j] * this.hoWeights[j][k];
      }
      oSums[k] += this.oBiases[k];
    }

    this.oNodes = U.softmax(oSums);

    let result = [];
    for (let k = 0; k "lt" this.no; ++k) {
      result[k] = this.oNodes[k];
    }
    return result;
  } // computeOutputs()

  // --------------------------------------------------------

  setWeights(wts)
  {
    // order: ihWts, hBiases, hoWts, oBiases
    let p = 0;

    for (let i = 0; i "lt" this.ni; ++i) {
      for (let j = 0; j "lt" this.nh; ++j) {
        this.ihWeights[i][j] = wts[p++];
      }
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      this.hBiases[j] = wts[p++];
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      for (let k = 0; k "lt" this.no; ++k) {
        this.hoWeights[j][k] = wts[p++];
      }
    }

    for (let k = 0; k "lt" this.no; ++k) {
      this.oBiases[k] = wts[p++];
    }
  } // setWeights()

  getWeights()
  {
    // order: ihWts, hBiases, hoWts, oBiases
    let numWts = (this.ni * this.nh) + this.nh +
      (this.nh * this.no) + this.no;
    let result = U.vecMake(numWts, 0.0);
    let p = 0;
    for (let i = 0; i "lt" this.ni; ++i) {
      for (let j = 0; j "lt" this.nh; ++j) {
        result[p++] = this.ihWeights[i][j];
      }
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      result[p++] = this.hBiases[j];
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      for (let k = 0; k "lt" this.no; ++k) {
        result[p++] = this.hoWeights[j][k];
      }
    }

    for (let k = 0; k "lt" this.no; ++k) {
      result[p++] = this.oBiases[k];
    }
    return result;
  } // getWeights()

  shuffle(v)
  {
    // Fisher-Yates
    let n = v.length;
    for (let i = 0; i "lt" n; ++i) {
      let r = this.rnd.nextInt(i, n);
      let tmp = v[r];
      v[r] = v[i];
      v[i] = tmp;
    }
  }

  // --------------------------------------------------------
  // helpers for train(): zeroOutGrads(), accumGrads(y),
  //   updateWeights(lrnRate)
  // --------------------------------------------------------

  zeroOutGrads()
  {
    for (let i = 0; i "lt" this.ni; ++i)
      for (let j = 0; j "lt" this.nh; ++j)
        this.ihGrads[i][j] = 0.0;

    for (let j = 0; j "lt" this.nh; ++j)
      this.hbGrads[j] = 0.0;

    for (let j = 0; j "lt" this.nh; ++j)
      for (let k = 0; k "lt" this.no; ++k)
        this.hoGrads[j][k] = 0.0;

    for (let k = 0; k "lt" this.no; ++k)
      this.obGrads[k] = 0.0;
  }

  accumGrads(y)
  {
    // y is target vector
    let oSignals = U.vecMake(this.no, 0.0);
    let hSignals = U.vecMake(this.nh, 0.0);

    // 1. compute output node scratch signals 
    for (let k = 0; k "lt" this.no; ++k) {
      let derivative = 1.0;  // CEE
      // let derivative =
      //  this.oNodes[k] * (1 - this.oNodes[k]); // MSE
      oSignals[k] = derivative *
        (this.oNodes[k] - y[k]);  // CEE
    }

    // 2. accum hidden-to-output gradients 
    for (let j = 0; j "lt" this.nh; ++j)
      for (let k = 0; k "lt" this.no; ++k)
        this.hoGrads[j][k] += oSignals[k] * this.hNodes[j];

    // 3. accum output node bias gradients
    for (let k = 0; k "lt" this.no; ++k)
      this.obGrads[k] += oSignals[k] * 1.0;  // 1.0 dummy 

    // 4. compute hidden node signals
    for (let j = 0; j "lt" this.nh; ++j) {
      let sum = 0.0;
      for (let k = 0; k "lt" this.no; ++k)
        sum += oSignals[k] * this.hoWeights[j][k];

      let derivative =
        (1 - this.hNodes[j]) *
        (1 + this.hNodes[j]);  // assumes tanh
      hSignals[j] = derivative * sum;
    }

    // 5. accum input-to-hidden gradients
    for (let i = 0; i "lt" this.ni; ++i)
      for (let j = 0; j "lt" this.nh; ++j)
        this.ihGrads[i][j] += hSignals[j] * this.iNodes[i];

    // 6. accum hidden node bias gradients
    for (let j = 0; j "lt" this.nh; ++j)
      this.hbGrads[j] += hSignals[j] * 1.0;  // 1.0 dummy
  } // accumGrads
  
  updateWeights(lrnRate)
  {
    // assumes all gradients computed
    // 1. update input-to-hidden weights
    for (let i = 0; i "lt" this.ni; ++i) {
      for (let j = 0; j "lt" this.nh; ++j) {
        let delta = -1.0 * lrnRate * this.ihGrads[i][j];
        this.ihWeights[i][j] += delta;
      }
    }

    // 2. update hidden node biases
    for (let j = 0; j "lt" this.nh; ++j) {
      let delta = -1.0 * lrnRate * this.hbGrads[j];
      this.hBiases[j] += delta;
    }

    // 3. update hidden-to-output weights
    for (let j = 0; j "lt" this.nh; ++j) {
      for (let k = 0; k "lt" this.no; ++k) {
        let delta = -1.0 * lrnRate * this.hoGrads[j][k];
        this.hoWeights[j][k] += delta;
      }
    }

    // 4. update output node biases
    for (let k = 0; k "lt" this.no; ++k) {
      let delta = -1.0 * lrnRate * this.obGrads[k];
      this.oBiases[k] += delta;
    }
  } // updateWeights()

  // --------------------------------------------------------

  train(trainX, trainY, lrnRate, batSize, maxEpochs)
  {
    let n = trainX.length;  // 200
    let batchesPerEpoch = Math.trunc(n / batSize);  // 20
    let freq = Math.trunc(maxEpochs / 10);  // progress
    let indices = U.arange(n);

    // ----------------------------------------------------
    //
    // n = 200; bs = 10
    // batches per epoch = 200 / 10 = 20

    // for epoch = 0; epoch "lt" maxEpochs; ++epoch
    //   for batch = 0; batch "lt" bpe; ++batch
    //     for item = 0; item "lt" bs; ++item
    //       compute output
    //       accum grads
    //     end-item
    //     update weights
    //     zero-out grads
    //   end-batches
    //   shuffle indices
    // end-epochs
    //
    // ----------------------------------------------------

    for (let epoch = 0; epoch "lt" maxEpochs; ++epoch) {
      this.shuffle(indices);
      let ptr = 0;  // points into indices
      for (let batIdx = 0; batIdx "lt" batchesPerEpoch;
        ++batIdx) // 0, 1, . . 19
      {
        for (let i = 0; i "lt" batSize; ++i) { // 0 . . 9
          let ii = indices[ptr++];  // compute output
          let x = trainX[ii];
          let y = trainY[ii];
          this.computeOutputs(x);  // into this.oNodes
          this.accumGrads(y);
        }
        this.updateWeights(lrnRate);
        this.zeroOutGrads(); // prep for next batch
      } // batches

      if (epoch % freq == 0) {
        // let mse = 
        // this.meanSqErr(trainX, trainY).toFixed(4);
        let mcee = 
          this.meanCrossEntErr(trainX, trainY).toFixed(4);
        let acc = this.accuracy(trainX, trainY).toFixed(4);

        let s1 = "epoch: " +
          epoch.toString().padStart(6, ' ');
        let s2 = "   MCEE = " + 
          mcee.toString().padStart(8, ' ');
        let s3 = "   acc = " + acc.toString();

        console.log(s1 + s2 + s3);
      }
    } // epoch
  } // train

  // -------------------------------------------------------- 

  meanCrossEntErr(dataX, dataY)
  {
    let sumCEE = 0.0;  // cross entropy errors
    for (let i = 0; i "lt" dataX.length; ++i) { 
      let X = dataX[i];
      let Y = dataY[i];  // target like (0, 1, 0)
      let oupt = this.computeOutputs(X); 
      let idx = U.argmax(Y);  // find loc of 1 in target
      sumCEE += Math.log(oupt[idx]);
    }
    sumCEE *= -1;
    return sumCEE / dataX.length;
  }

  meanSqErr(dataX, dataY)
  {
    let sumSE = 0.0;
    for (let i = 0; i "lt" dataX.length; ++i) {
      let X = dataX[i];
      let Y = dataY[i];  // target output like (0, 1, 0)
      let oupt = this.computeOutputs(X);  // (0.23, 0.66, 0.11)
      for (let k = 0; k "lt" this.no; ++k) {
        let err = Y[k] - oupt[k];  // target - computed
        sumSE += err * err;
      }
    }
    return sumSE / dataX.length;  // consider Root MSE
  } 

  accuracy(dataX, dataY)
  {
    let nc = 0; let nw = 0;
    for (let i = 0; i "lt" dataX.length; ++i) { 
      let X = dataX[i];
      let Y = dataY[i];  // target like (0, 1, 0)
      let oupt = this.computeOutputs(X); 
      let computedIdx = U.argmax(oupt);
      let targetIdx = U.argmax(Y);
      if (computedIdx == targetIdx) {
        ++nc;
      }
      else {
        ++nw;
      }
    }
    return nc / (nc + nw);
  }

  // --------------------------------------------------------

  confusionMatrix(dataX, dataY)
  {
    let n = this.no;
    let result = U.matMake(n, n, 0.0);  // 3x3
    
    for (let i = 0; i "lt" dataX.length; ++i) {
      let X = dataX[i];
      let Y = dataY[i];  // target like (0, 1, 0)
      let oupt = this.computeOutputs(X);  // probs
      let targetK = U.argmax(Y);
      let predK = U.argmax(oupt);
      ++result[targetK][predK];
    }
    return result;
  }

  showConfusion(cm)
  {
    let n = cm.length;
    for (let i = 0; i "lt" n; ++i) {
      process.stdout.write("actual " + 
        i.toString() + ": ");
      for (let j = 0; j "lt" n; ++j) {
        process.stdout.write(cm[i][j].toString().
          padStart(4, " "));
      }
      console.log("");
    }
  }

  // --------------------------------------------------------

  saveWeights(fn)
  {
    let wts = this.getWeights();
    let n = wts.length;
    let s = "";
    for (let i = 0; i "lt" n-1; ++i) {
      s += wts[i].toString() + ",";
    }
    s += wts[n-1];

    FS.writeFileSync(fn, s);
  }

  loadWeights(fn)
  {
    let n = (this.ni * this.nh) + this.nh +
      (this.nh * this.no) + this.no;
    let wts = U.vecMake(n, 0.0);
    let all = FS.readFileSync(fn, "utf8");
    let strVals = all.split(",");
    let nn = strVals.length;
    if (n != nn) {
      throw("Size error in NeuralNet.loadWeights()");
    }
    for (let i = 0; i "lt" n; ++i) {
      wts[i] = parseFloat(strVals[i]);
    }
    this.setWeights(wts);
  }

} // NeuralNet

// ----------------------------------------------------------

function main()
{
  // process.stdout.write("\033[0m");  // reset
  // process.stdout.write("\x1b[1m" + "\x1b[37m");  // white
  console.log("\nBegin JavaScript NN demo ");
  console.log("Politics from sex, age, State, income ");
  console.log("con = 0, mod = 1, lib = 2 ");

  // 1. load data
  // -1  0.29  1 0 0  0.65400  2
  //  1  0.36  0 0 1  0.58300  0
  console.log("\nLoading data into memory ");
  let trainX = U.loadTxt(".\\Data\\people_train.txt", ",",
    [0,1,2,3,4,5], "#");
  let trainY = U.loadTxt(".\\Data\\people_train.txt", ",",
    [6], "#");
  trainY = U.matToOneHot(trainY, 3);
  let testX = U.loadTxt(".\\Data\\people_test.txt", ",",
    [0,1,2,3,4,5], "#");
  let testY = U.loadTxt(".\\Data\\people_test.txt", ",",
    [6], "#");
  testY = U.matToOneHot(testY, 3);

  // 2. create network
  console.log("\nCreating 6-25-3 tanh, softmax CEE NN ");
  let seed = 0;
  let nn = new NeuralNet(6, 25, 3, seed);

  // 3. train network
  let lrnRate = 0.01;
  let maxEpochs = 10000;
  console.log("\nSetting learn rate = 0.01 ");
  console.log("Setting bat size = 10 ");
  // nn.train(trainX, trainY, lrnRate, maxEpochs);
  nn.train(trainX, trainY, lrnRate, 10, maxEpochs);
  console.log("Training complete ");

  // 4. evaluate model
  let trainAcc = nn.accuracy(trainX, trainY);
  let testAcc = nn.accuracy(testX, testY);
  console.log("\nAccuracy on training data = " +
    trainAcc.toFixed(4).toString()); 
  console.log("Accuracy on test data     = " +
    testAcc.toFixed(4).toString());

  // 4b. confusion
  console.log("\nComputing confusion matrix ");
  let cm = nn.confusionMatrix(testX, testY);
  //U.matShow(cm, 0);
  nn.showConfusion(cm);

  // 5. save trained model
  fn = ".\\Models\\people_wts.txt";
  console.log("\nSaving model weights and biases to: ");
  console.log(fn);
  nn.saveWeights(fn);

  // 6. use trained model
  console.log("\nPredict for M 46 Oklahoma $66,400 ");
  let x = [-1, 0.46, 0, 0, 1, 0.6640];
  let predicted = nn.computeOutputs(x);
  // console.log("\nPredicting politics for: ");
  // U.vecShow(x, 4, 12);
  console.log("\nPredicted pseudo-probabilities: ");
  U.vecShow(predicted, 4, 10); 

  //process.stdout.write("\033[0m");  // reset
  console.log("\n\nEnd demo");
}

main()

Code for utility functions:

// utilities_lib.js
// ES6

let FS = require('fs');

// ----------------------------------------------------------

function loadTxt(fn, delimit, usecols, comment) {
  // efficient but mildly complicated
  let all = FS.readFileSync(fn, "utf8");  // giant string
  all = all.trim();  // strip final crlf in file
  let lines = all.split("\n");  // array of lines

  // count number non-comment lines
  let nRows = 0;
  for (let i = 0; i "lt" lines.length; ++i) {
    if (!lines[i].startsWith(comment))
      ++nRows;
  }
  let nCols = usecols.length;
  let result = matMake(nRows, nCols, 0.0); 
 
  let r = 0;  // into lines
  let i = 0;  // into result[][]
  while (r "lt" lines.length) {
    if (lines[r].startsWith(comment)) {
      ++r;  // next row
    }
    else {
      let tokens = lines[r].split(delimit);
      for (let j = 0; j "lt" nCols; ++j) {
        result[i][j] = parseFloat(tokens[usecols[j]]);
      }
      ++r;
      ++i;
    }
  }

  return result;
}

// ----------------------------------------------------------

function arange(n)
{
  let result = [];
  for (let i = 0; i "lt" n; ++i) {
    result[i] = Math.trunc(i);
  }
  return result;
}

// ----------------------------------------------------------

class Erratic
{
  constructor(seed)
  {
    this.seed = seed + 0.5;  // avoid 0
  }

  next()
  {
    let x = Math.sin(this.seed) * 1000;
    let result = x - Math.floor(x);  // [0.0,1.0)
    this.seed = result;  // for next call
    return result;
  }

  nextInt(lo, hi)
  {
    let x = this.next();
    return Math.trunc((hi - lo) * x + lo);
  }
}

// ----------------------------------------------------------

function vecMake(n, val)
{
  let result = [];
  for (let i = 0; i "lt" n; ++i) {
    result[i] = val;
  }
  return result;
}

function matMake(rows, cols, val)
{
  let result = [];
  for (let i = 0; i "lt" rows; ++i) {
    result[i] = [];
    for (let j = 0; j "lt" cols; ++j) {
      result[i][j] = val;
    }
  }
  return result;
}

function matToOneHot(m, n)
{
  // convert ordinal (0,1,2 . .) to one-hot
  let rows = m.length;
  let cols = m[0].length;
  let result = matMake(rows, n, 0.0);
  for (let i = 0; i "lt" rows; ++i) {
    let k = Math.trunc(m[i][0]);  // 0,1,2 . .
    result[i] = vecMake(n, 0.0);  // [0.0  0.0  0.0]
    result[i][k] = 1.0;  // [ 0.0  1.0  0.0]
  }

  return result;
}

function matToVec(m)
{
  let r = m.length;
  let c = m[0].length;
  let result = vecMake(r*c, 0.0);
  let k = 0;
  for (let i = 0; i "lt" r; ++i) {
    for (let j = 0; j "lt" c; ++j) {
      result[k++] = m[i][j];
    }
  }
  return result;
}

function vecShowWrap(v, dec, len)  // renamed; a second vecShow() below would silently replace this definition
{
  for (let i = 0; i "lt" v.length; ++i) {
    if (i != 0 "and" i % len == 0) {
      process.stdout.write("\n");
    }
    if (v[i] "gte" 0.0) {
      process.stdout.write(" ");  // + or - space
    }
    process.stdout.write(v[i].toFixed(dec));
    process.stdout.write("  ");
  }
  process.stdout.write("\n");
}

function vecShow(vec, dec, wid, nl)
{
  for (let i = 0; i "lt" vec.length; ++i) {
    let x = vec[i];
    if (Math.abs(x) "lt" 0.000001) x = 0.0  // avoid -0.00
    let xx = x.toFixed(dec);
    let s = xx.toString().padStart(wid, ' ');
    process.stdout.write(s);
    process.stdout.write(" ");
  }

  if (nl == true)
    process.stdout.write("\n");
}


function matShow(m, dec, wid)
{
  let rows = m.length;
  let cols = m[0].length;
  for (let i = 0; i "lt" rows; ++i) {
    for (let j = 0; j "lt" cols; ++j) {
      if (m[i][j] "gte" 0.0) {
        process.stdout.write(" ");  // + or - space
      }
      process.stdout.write(m[i][j].toFixed(dec));
      process.stdout.write("  ");
    }
    process.stdout.write("\n");
  }
}

function argmax(v)
{
  let result = 0;
  let m = v[0];
  for (let i = 0; i "lt" v.length; ++i) {
    if (v[i] "gt" m) {
      m = v[i];
      result = i;
    }
  }
  return result;
}

function hyperTan(x)
{
  if (x "lt" -10.0) {
    return -1.0;
  }
  else if (x "gt" 10.0) {
    return 1.0;
  }
  else {
    return Math.tanh(x);
  }
}

function logSig(x)
{
  if (x "lt" -10.0) {
    return 0.0;
  }
  else if (x "gt" 10.0) {
    return 1.0;
  }
  else {
    return 1.0 / (1.0 + Math.exp(-x));
  }
}

function vecMax(vec)
{
  let mx = vec[0];
  for (let i = 0; i "lt" vec.length; ++i) {
    if (vec[i] "gt" mx) {
      mx = vec[i];
    }
  }
  return mx;
}

function softmax(vec)
{
  //let m = Math.max(...vec);  // or 'spread' operator
  let m = vecMax(vec);
  let result = [];
  let sum = 0.0;
  for (let i = 0; i "lt" vec.length; ++i) {
    result[i] = Math.exp(vec[i] - m);
    sum += result[i];
  }
  for (let i = 0; i "lt" result.length; ++i) {
    result[i] = result[i] / sum;
  }
  return result;
}

module.exports = {
  vecMake,
  matMake,
  matToOneHot,
  matToVec,
  vecShow,
  matShow,
  argmax,
  loadTxt,
  arange,
  Erratic,
  hyperTan,
  logSig,
  vecMax,
  softmax
};

Training data:

# people_train.txt
# sex (M=-1, F=1)  age  state (michigan, 
# nebraska, oklahoma) income
# politics (conservative, moderate, liberal)
#
1, 0.24, 1, 0, 0, 0.2950, 2
-1, 0.39, 0, 0, 1, 0.5120, 1
1, 0.63, 0, 1, 0, 0.7580, 0
-1, 0.36, 1, 0, 0, 0.4450, 1
1, 0.27, 0, 1, 0, 0.2860, 2
1, 0.50, 0, 1, 0, 0.5650, 1
1, 0.50, 0, 0, 1, 0.5500, 1
-1, 0.19, 0, 0, 1, 0.3270, 0
1, 0.22, 0, 1, 0, 0.2770, 1
-1, 0.39, 0, 0, 1, 0.4710, 2
1, 0.34, 1, 0, 0, 0.3940, 1
-1, 0.22, 1, 0, 0, 0.3350, 0
1, 0.35, 0, 0, 1, 0.3520, 2
-1, 0.33, 0, 1, 0, 0.4640, 1
1, 0.45, 0, 1, 0, 0.5410, 1
1, 0.42, 0, 1, 0, 0.5070, 1
-1, 0.33, 0, 1, 0, 0.4680, 1
1, 0.25, 0, 0, 1, 0.3000, 1
-1, 0.31, 0, 1, 0, 0.4640, 0
1, 0.27, 1, 0, 0, 0.3250, 2
1, 0.48, 1, 0, 0, 0.5400, 1
-1, 0.64, 0, 1, 0, 0.7130, 2
1, 0.61, 0, 1, 0, 0.7240, 0
1, 0.54, 0, 0, 1, 0.6100, 0
1, 0.29, 1, 0, 0, 0.3630, 0
1, 0.50, 0, 0, 1, 0.5500, 1
1, 0.55, 0, 0, 1, 0.6250, 0
1, 0.40, 1, 0, 0, 0.5240, 0
1, 0.22, 1, 0, 0, 0.2360, 2
1, 0.68, 0, 1, 0, 0.7840, 0
-1, 0.60, 1, 0, 0, 0.7170, 2
-1, 0.34, 0, 0, 1, 0.4650, 1
-1, 0.25, 0, 0, 1, 0.3710, 0
-1, 0.31, 0, 1, 0, 0.4890, 1
1, 0.43, 0, 0, 1, 0.4800, 1
1, 0.58, 0, 1, 0, 0.6540, 2
-1, 0.55, 0, 1, 0, 0.6070, 2
-1, 0.43, 0, 1, 0, 0.5110, 1
-1, 0.43, 0, 0, 1, 0.5320, 1
-1, 0.21, 1, 0, 0, 0.3720, 0
1, 0.55, 0, 0, 1, 0.6460, 0
1, 0.64, 0, 1, 0, 0.7480, 0
-1, 0.41, 1, 0, 0, 0.5880, 1
1, 0.64, 0, 0, 1, 0.7270, 0
-1, 0.56, 0, 0, 1, 0.6660, 2
1, 0.31, 0, 0, 1, 0.3600, 1
-1, 0.65, 0, 0, 1, 0.7010, 2
1, 0.55, 0, 0, 1, 0.6430, 0
-1, 0.25, 1, 0, 0, 0.4030, 0
1, 0.46, 0, 0, 1, 0.5100, 1
-1, 0.36, 1, 0, 0, 0.5350, 0
1, 0.52, 0, 1, 0, 0.5810, 1
1, 0.61, 0, 0, 1, 0.6790, 0
1, 0.57, 0, 0, 1, 0.6570, 0
-1, 0.46, 0, 1, 0, 0.5260, 1
-1, 0.62, 1, 0, 0, 0.6680, 2
1, 0.55, 0, 0, 1, 0.6270, 0
-1, 0.22, 0, 0, 1, 0.2770, 1
-1, 0.50, 1, 0, 0, 0.6290, 0
-1, 0.32, 0, 1, 0, 0.4180, 1
-1, 0.21, 0, 0, 1, 0.3560, 0
1, 0.44, 0, 1, 0, 0.5200, 1
1, 0.46, 0, 1, 0, 0.5170, 1
1, 0.62, 0, 1, 0, 0.6970, 0
1, 0.57, 0, 1, 0, 0.6640, 0
-1, 0.67, 0, 0, 1, 0.7580, 2
1, 0.29, 1, 0, 0, 0.3430, 2
1, 0.53, 1, 0, 0, 0.6010, 0
-1, 0.44, 1, 0, 0, 0.5480, 1
1, 0.46, 0, 1, 0, 0.5230, 1
-1, 0.20, 0, 1, 0, 0.3010, 1
-1, 0.38, 1, 0, 0, 0.5350, 1
1, 0.50, 0, 1, 0, 0.5860, 1
1, 0.33, 0, 1, 0, 0.4250, 1
-1, 0.33, 0, 1, 0, 0.3930, 1
1, 0.26, 0, 1, 0, 0.4040, 0
1, 0.58, 1, 0, 0, 0.7070, 0
1, 0.43, 0, 0, 1, 0.4800, 1
-1, 0.46, 1, 0, 0, 0.6440, 0
1, 0.60, 1, 0, 0, 0.7170, 0
-1, 0.42, 1, 0, 0, 0.4890, 1
-1, 0.56, 0, 0, 1, 0.5640, 2
-1, 0.62, 0, 1, 0, 0.6630, 2
-1, 0.50, 1, 0, 0, 0.6480, 1
1, 0.47, 0, 0, 1, 0.5200, 1
-1, 0.67, 0, 1, 0, 0.8040, 2
-1, 0.40, 0, 0, 1, 0.5040, 1
1, 0.42, 0, 1, 0, 0.4840, 1
1, 0.64, 1, 0, 0, 0.7200, 0
-1, 0.47, 1, 0, 0, 0.5870, 2
1, 0.45, 0, 1, 0, 0.5280, 1
-1, 0.25, 0, 0, 1, 0.4090, 0
1, 0.38, 1, 0, 0, 0.4840, 0
1, 0.55, 0, 0, 1, 0.6000, 1
-1, 0.44, 1, 0, 0, 0.6060, 1
1, 0.33, 1, 0, 0, 0.4100, 1
1, 0.34, 0, 0, 1, 0.3900, 1
1, 0.27, 0, 1, 0, 0.3370, 2
1, 0.32, 0, 1, 0, 0.4070, 1
1, 0.42, 0, 0, 1, 0.4700, 1
-1, 0.24, 0, 0, 1, 0.4030, 0
1, 0.42, 0, 1, 0, 0.5030, 1
1, 0.25, 0, 0, 1, 0.2800, 2
1, 0.51, 0, 1, 0, 0.5800, 1
-1, 0.55, 0, 1, 0, 0.6350, 2
1, 0.44, 1, 0, 0, 0.4780, 2
-1, 0.18, 1, 0, 0, 0.3980, 0
-1, 0.67, 0, 1, 0, 0.7160, 2
1, 0.45, 0, 0, 1, 0.5000, 1
1, 0.48, 1, 0, 0, 0.5580, 1
-1, 0.25, 0, 1, 0, 0.3900, 1
-1, 0.67, 1, 0, 0, 0.7830, 1
1, 0.37, 0, 0, 1, 0.4200, 1
-1, 0.32, 1, 0, 0, 0.4270, 1
1, 0.48, 1, 0, 0, 0.5700, 1
-1, 0.66, 0, 0, 1, 0.7500, 2
1, 0.61, 1, 0, 0, 0.7000, 0
-1, 0.58, 0, 0, 1, 0.6890, 1
1, 0.19, 1, 0, 0, 0.2400, 2
1, 0.38, 0, 0, 1, 0.4300, 1
-1, 0.27, 1, 0, 0, 0.3640, 1
1, 0.42, 1, 0, 0, 0.4800, 1
1, 0.60, 1, 0, 0, 0.7130, 0
-1, 0.27, 0, 0, 1, 0.3480, 0
1, 0.29, 0, 1, 0, 0.3710, 0
-1, 0.43, 1, 0, 0, 0.5670, 1
1, 0.48, 1, 0, 0, 0.5670, 1
1, 0.27, 0, 0, 1, 0.2940, 2
-1, 0.44, 1, 0, 0, 0.5520, 0
1, 0.23, 0, 1, 0, 0.2630, 2
-1, 0.36, 0, 1, 0, 0.5300, 2
1, 0.64, 0, 0, 1, 0.7250, 0
1, 0.29, 0, 0, 1, 0.3000, 2
-1, 0.33, 1, 0, 0, 0.4930, 1
-1, 0.66, 0, 1, 0, 0.7500, 2
-1, 0.21, 0, 0, 1, 0.3430, 0
1, 0.27, 1, 0, 0, 0.3270, 2
1, 0.29, 1, 0, 0, 0.3180, 2
-1, 0.31, 1, 0, 0, 0.4860, 1
1, 0.36, 0, 0, 1, 0.4100, 1
1, 0.49, 0, 1, 0, 0.5570, 1
-1, 0.28, 1, 0, 0, 0.3840, 0
-1, 0.43, 0, 0, 1, 0.5660, 1
-1, 0.46, 0, 1, 0, 0.5880, 1
1, 0.57, 1, 0, 0, 0.6980, 0
-1, 0.52, 0, 0, 1, 0.5940, 1
-1, 0.31, 0, 0, 1, 0.4350, 1
-1, 0.55, 1, 0, 0, 0.6200, 2
1, 0.50, 1, 0, 0, 0.5640, 1
1, 0.48, 0, 1, 0, 0.5590, 1
-1, 0.22, 0, 0, 1, 0.3450, 0
1, 0.59, 0, 0, 1, 0.6670, 0
1, 0.34, 1, 0, 0, 0.4280, 2
-1, 0.64, 1, 0, 0, 0.7720, 2
1, 0.29, 0, 0, 1, 0.3350, 2
-1, 0.34, 0, 1, 0, 0.4320, 1
-1, 0.61, 1, 0, 0, 0.7500, 2
1, 0.64, 0, 0, 1, 0.7110, 0
-1, 0.29, 1, 0, 0, 0.4130, 0
1, 0.63, 0, 1, 0, 0.7060, 0
-1, 0.29, 0, 1, 0, 0.4000, 0
-1, 0.51, 1, 0, 0, 0.6270, 1
-1, 0.24, 0, 0, 1, 0.3770, 0
1, 0.48, 0, 1, 0, 0.5750, 1
1, 0.18, 1, 0, 0, 0.2740, 0
1, 0.18, 1, 0, 0, 0.2030, 2
1, 0.33, 0, 1, 0, 0.3820, 2
-1, 0.20, 0, 0, 1, 0.3480, 0
1, 0.29, 0, 0, 1, 0.3300, 2
-1, 0.44, 0, 0, 1, 0.6300, 0
-1, 0.65, 0, 0, 1, 0.8180, 0
-1, 0.56, 1, 0, 0, 0.6370, 2
-1, 0.52, 0, 0, 1, 0.5840, 1
-1, 0.29, 0, 1, 0, 0.4860, 0
-1, 0.47, 0, 1, 0, 0.5890, 1
1, 0.68, 1, 0, 0, 0.7260, 2
1, 0.31, 0, 0, 1, 0.3600, 1
1, 0.61, 0, 1, 0, 0.6250, 2
1, 0.19, 0, 1, 0, 0.2150, 2
1, 0.38, 0, 0, 1, 0.4300, 1
-1, 0.26, 1, 0, 0, 0.4230, 0
1, 0.61, 0, 1, 0, 0.6740, 0
1, 0.40, 1, 0, 0, 0.4650, 1
-1, 0.49, 1, 0, 0, 0.6520, 1
1, 0.56, 1, 0, 0, 0.6750, 0
-1, 0.48, 0, 1, 0, 0.6600, 1
1, 0.52, 1, 0, 0, 0.5630, 2
-1, 0.18, 1, 0, 0, 0.2980, 0
-1, 0.56, 0, 0, 1, 0.5930, 2
-1, 0.52, 0, 1, 0, 0.6440, 1
-1, 0.18, 0, 1, 0, 0.2860, 1
-1, 0.58, 1, 0, 0, 0.6620, 2
-1, 0.39, 0, 1, 0, 0.5510, 1
-1, 0.46, 1, 0, 0, 0.6290, 1
-1, 0.40, 0, 1, 0, 0.4620, 1
-1, 0.60, 1, 0, 0, 0.7270, 2
1, 0.36, 0, 1, 0, 0.4070, 2
1, 0.44, 1, 0, 0, 0.5230, 1
1, 0.28, 1, 0, 0, 0.3130, 2
1, 0.54, 0, 0, 1, 0.6260, 0

Test data:

# people_test.txt
#
-1, 0.51, 1, 0, 0, 0.6120, 1
-1, 0.32, 0, 1, 0, 0.4610, 1
1, 0.55, 1, 0, 0, 0.6270, 0
1, 0.25, 0, 0, 1, 0.2620, 2
1, 0.33, 0, 0, 1, 0.3730, 2
-1, 0.29, 0, 1, 0, 0.4620, 0
1, 0.65, 1, 0, 0, 0.7270, 0
-1, 0.43, 0, 1, 0, 0.5140, 1
-1, 0.54, 0, 1, 0, 0.6480, 2
1, 0.61, 0, 1, 0, 0.7270, 0
1, 0.52, 0, 1, 0, 0.6360, 0
1, 0.30, 0, 1, 0, 0.3350, 2
1, 0.29, 1, 0, 0, 0.3140, 2
-1, 0.47, 0, 0, 1, 0.5940, 1
1, 0.39, 0, 1, 0, 0.4780, 1
1, 0.47, 0, 0, 1, 0.5200, 1
-1, 0.49, 1, 0, 0, 0.5860, 1
-1, 0.63, 0, 0, 1, 0.6740, 2
-1, 0.30, 1, 0, 0, 0.3920, 0
-1, 0.61, 0, 0, 1, 0.6960, 2
-1, 0.47, 0, 0, 1, 0.5870, 1
1, 0.30, 0, 0, 1, 0.3450, 2
-1, 0.51, 0, 0, 1, 0.5800, 1
-1, 0.24, 1, 0, 0, 0.3880, 1
-1, 0.49, 1, 0, 0, 0.6450, 1
1, 0.66, 0, 0, 1, 0.7450, 0
-1, 0.65, 1, 0, 0, 0.7690, 0
-1, 0.46, 0, 1, 0, 0.5800, 0
-1, 0.45, 0, 0, 1, 0.5180, 1
-1, 0.47, 1, 0, 0, 0.6360, 0
-1, 0.29, 1, 0, 0, 0.4480, 0
-1, 0.57, 0, 0, 1, 0.6930, 2
-1, 0.20, 1, 0, 0, 0.2870, 2
-1, 0.35, 1, 0, 0, 0.4340, 1
-1, 0.61, 0, 0, 1, 0.6700, 2
-1, 0.31, 0, 0, 1, 0.3730, 1
1, 0.18, 1, 0, 0, 0.2080, 2
1, 0.26, 0, 0, 1, 0.2920, 2
-1, 0.28, 1, 0, 0, 0.3640, 2
-1, 0.59, 0, 0, 1, 0.6940, 2
Posted in JavaScript

“Foundational Models Might Revolutionize Time Series Regression Problems” on the Pure AI Web Site

I contributed to an article titled “Foundational Models Might Revolutionize Time Series Regression Problems” on the Pure AI web site. See https://pureai.com/Articles/2024/04/08/llms-time-series-regression.aspx.

The goal of a time series regression problem is to predict a single numeric value that will occur in the future. For example, an airline company must be able to predict the number of passengers that will want to fly a particular route over the next few months. A bad prediction can cost a company millions of dollars, either in lost revenue, or in wasted resources.

Over the past 10 years or so, data scientists have studied time series regression problems using machine learning techniques, such as LSTM (long short-term memory) systems and standard neural network systems. But it’s probably fair to say that there haven’t been any major breakthroughs in decades.

Researchers are now looking at adapting the large language model (LLM) techniques that have revolutionized natural language to see if they can be applied to time series regression problems and provide a gigantic leap in capability. Such modern techniques are vastly different from traditional time series regression techniques.

I’m quoted in the article: McCaffrey commented, “There have been a few previous attempts at creating foundational models for time series data, but none were entirely successful, probably because they didn’t have enough training data.”

He added, “Foundational models for time series regression have the potential to completely revolutionize the field. I am cautiously optimistic that time series regression foundational models such as TimesFM could provide a powerful new tool for time series forecasts.”



In the early days of commercial aviation, from the 1920s to the early 1940s, trimotor designs were very popular. Left: The American Ford Trimotor first flew in 1926. About 200 were produced. It carried a crew of three and eight passengers. Right: The German Junkers Ju 52 first flew in 1930. Over 4,800 were produced, mostly during World War II for the German Luftwaffe. It could carry 17 passengers.


Posted in Miscellaneous

Time Series Regression for the Airline Passengers Dataset Using LightGBM

I was looking at the results of a machine learning competition where the goal was to make predictions for a wide range of time series data collected from the Walmart company. Most of the top-scoring entries used the LightGBM tool. LightGBM (light gradient boosting machine) is a sophisticated tree-based system that is similar to, and inspired by, the XGBoost (extreme gradient boosting) system.

It had been many months since I last looked at LightGBM so I figured I’d experiment a bit. I used the airline passengers dataset. The source data looks like:

"1949-01";112
"1949-02";118
"1949-03";132
"1949-04";129
"1949-05";121
"1949-06";135
"1949-07";148
. . . 
"1960-12";432

There are 144 lines. Each is a month, with dates from January 1949 to December 1960. The values are the number of airline passengers, in thousands. When using tree-based systems such as LightGBM, it's not necessary to normalize numeric predictor variables, but I divided all raw counts by 100 anyway (it doesn't hurt, but it doesn't help either). When graphed (with all raw passenger counts divided by 100), the data shows a strong upward trend with a pronounced yearly seasonal pattern.



Note: The Airline Passenger dataset originally appeared on page 531 of the first edition of the famous book “Time Series Analysis: Forecasting and Control” (1970) by G. Box and G. Jenkins.


I preprocessed the raw data to create a text file of sliding window values that looks like:

1.12, 1.18, 1.32, 1.29, 1.21
1.18, 1.32, 1.29, 1.21, 1.35
1.32, 1.29, 1.21, 1.35, 1.48
1.29, 1.21, 1.35, 1.48, 1.48
. . .
6.06, 5.08, 4.61, 3.90, 4.32

Each consecutive set of four values will be used to predict the next value. So the first input is (1.12, 1.18, 1.32, 1.29) and the value to predict is 1.21. Because of the offset, there are 140 training items.
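
The preprocessing can be done with a short script. Here is a minimal sketch, using just the first few of the 144 raw counts for illustration:

# build sliding-window rows from the raw monthly counts; the raw
# list here holds only the first few of the 144 values
raw = [112, 118, 132, 129, 121, 135, 148]

vals = [c / 100.0 for c in raw]  # scale counts as in the demo
window = 4
for i in range(len(vals) - window):
  row = vals[i : i + window + 1]  # 4 predictors plus 1 target
  print(", ".join("%0.2f" % v for v in row))
# 1.12, 1.18, 1.32, 1.29, 1.21
# 1.18, 1.32, 1.29, 1.21, 1.35
# 1.32, 1.29, 1.21, 1.35, 1.48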

For simplicity, I used all the data directly, rather than split the data into a training set and a test set.

LightGBM has several interfaces. I used the very convenient scikit-learn Python interface. Because the Python lightgbm module wasn’t on my machine, I installed it using the command “pip install lightgbm” — installation worked without any problems.

The key statements of my demo program are:

import numpy as np
from lightgbm import LGBMRegressor  # scikit API

X = np.loadtxt(src_file, usecols=[0,1,2,3], \
      delimiter=",", comments="#", dtype=np.float64)
y = np.loadtxt(src_file, usecols=4, \
      delimiter=",", comments="#", dtype=np.float64)

model = LGBMRegressor(n_estimators=100, num_leaves=31,
  max_depth=-1, random_state=0, min_data_in_leaf=2,
  verbosity=-1)
# all defaults except random_state (default = None)
# and min_data_in_leaf (default = 20)

model.fit(X, y)
pred = model.predict(X)

The main challenge when using LightGBM is wading through the dozens of parameters. The LGBMRegressor class has 19 parameters (num_leaves, max_depth, etc.) and there are 57 Learning Control Parameters (min_data_in_leaf, bagging_fraction, etc.), for a total of 76 parameters to deal with. Here are the 19 LGBMRegressor parameters:

boosting_type='gbdt', 
num_leaves=31,
max_depth=-1,
learning_rate=0.1,
n_estimators=100,
subsample_for_bin=200000,
objective=None,
class_weight=None,
min_split_gain=0.0,
min_child_weight=0.001,
min_child_samples=20,
subsample=1.0,
subsample_freq=0,
colsample_bytree=1.0,
reg_alpha=0.0,
reg_lambda=0.0,
random_state=None,
n_jobs=None,
importance_type='split',
**kwargs

Because the number of parameters is not manageable, you must rely on the default values and then try to find the handful of parameters that will create a good model. For my demo, I changed only the random_state (set to an arbitrary value to get reproducible results), and the min_data_in_leaf from the default of 20 to 2 — it had a huge effect. The near-impossibility of understanding all the LightGBM parameters is the main reason why I rarely use LightGBM.
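
One practical way to find the handful of influential parameters is a simple manual sweep. The sketch below is my addition, not part of the demo; it varies only min_data_in_leaf and examines training error, using the same sliding-window data file that appears at the end of this post:

# hedged sketch: sweep min_data_in_leaf to see its effect on the fit;
# assumes the airline_all.txt sliding-window file shown below
import numpy as np
from lightgbm import LGBMRegressor

X = np.loadtxt(".\\Data\\airline_all.txt", usecols=[0,1,2,3],
  delimiter=",", comments="#", dtype=np.float64)
y = np.loadtxt(".\\Data\\airline_all.txt", usecols=4,
  delimiter=",", comments="#", dtype=np.float64)

for mdil in [2, 5, 10, 20]:
  m = LGBMRegressor(random_state=0, min_data_in_leaf=mdil,
    verbosity=-1)
  m.fit(X, y)
  mse = np.mean((m.predict(X) - y) ** 2)
  print("min_data_in_leaf = %3d  |  train MSE = %0.4f" % (mdil, mse))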

Anyway, the model predicted the source data almost perfectly, which really isn’t a good thing. Tree-based systems are highly susceptible to overfitting.

I wrote a helper function to predict for the 24 time steps that follow the source data. The graph suggests that the forecast is reasonable for roughly the first 8 months following the raw data, but after that the forecast just repeats. This is expected behavior: a tree-based regressor can only emit values stored in its leaves, so recursive forecasts can't extrapolate the upward trend and eventually settle into a repeating cycle.

Anyway, good fun.



The introduction of jet passenger planes in 1958 completely revolutionized air travel. I lived through the changeover — incredibly exciting times.

Left: The Lockheed Constellation was (arguably) the last great passenger plane powered by propellers. It could fly at 340 mph and carry about 90 passengers. It was last produced in 1958. Right: The Boeing 707 was (arguably) the first great jet-powered passenger plane. It could fly at 600 mph and carry about 190 passengers. It was first flown by Pan Am airlines in 1958.


Demo program:

# airline_lightgbm.py
# airline passengers time series regression

import numpy as np
from lightgbm import LGBMRegressor  # scikit API

print("\nBegin airline TSR using lightgbm demo ")
np.set_printoptions(precision=2, suppress=True,
  floatmode='fixed', sign=' ')

# load data
src_file = ".\\Data\\airline_all.txt"
X = np.loadtxt(src_file, usecols=[0,1,2,3], \
      delimiter=",", comments="#", dtype=np.float64)
y = np.loadtxt(src_file, usecols=4, \
      delimiter=",", comments="#", dtype=np.float64)

print("\nFirst 6 data: ")
for i in range(6):
  print(X[i], end="")
  print(" | ", end="")
  print("%6.2f" % y[i])

print("\nCreating and training LGBMRegressor model")
model = LGBMRegressor(n_estimators=100, num_leaves=31,
  max_depth=-1, random_state=0, min_data_in_leaf=2,
  verbosity=-1)
# all defaults except random_state (default = None)
# and min_data_in_leaf (default = 20)
model.fit(X, y)

print("\nFirst 6 predictions: ")
pred = model.predict(X)
for i in range(6):
  print("%8.2f" % pred[i])

# all predicteds for the graph
# print("\npredicted y = ")
# print(pred)
# for i in range(len(y)):
#   print("%0.2f" % pred[i])
# print("")

# -----------------------------------------------------------

def forecast(model, start, n_steps):
  n = len(start)  # 4
  curr_inpt = np.copy(start)
  result = np.zeros(n_steps, dtype=np.float64)
  for step in range(n_steps):
    pred = model.predict([curr_inpt])[0]  # predict() returns an array
    result[step] = pred

    for j in range(n-1):  # slide the window one step left
      curr_inpt[j] = curr_inpt[j+1]
    curr_inpt[n-1] = pred  # append newest prediction
  return result

# -----------------------------------------------------------

print("\nForecast next 24 months: ")
f = forecast(model, [5.08, 4.61, 3.90, 4.32], 24)
print(f)

# one per line for the graph
# for i in range(len(f)):
#   print("%0.2f" % f[i])

print("\nEnd demo ")

Data:

# airline_all.txt
# 4 item window
#
1.12, 1.18, 1.32, 1.29, 1.21
1.18, 1.32, 1.29, 1.21, 1.35
1.32, 1.29, 1.21, 1.35, 1.48
1.29, 1.21, 1.35, 1.48, 1.48
1.21, 1.35, 1.48, 1.48, 1.36
1.35, 1.48, 1.48, 1.36, 1.19
1.48, 1.48, 1.36, 1.19, 1.04
1.48, 1.36, 1.19, 1.04, 1.18
1.36, 1.19, 1.04, 1.18, 1.15
1.19, 1.04, 1.18, 1.15, 1.26
1.04, 1.18, 1.15, 1.26, 1.41
1.18, 1.15, 1.26, 1.41, 1.35
1.15, 1.26, 1.41, 1.35, 1.25
1.26, 1.41, 1.35, 1.25, 1.49
1.41, 1.35, 1.25, 1.49, 1.70
1.35, 1.25, 1.49, 1.70, 1.70
1.25, 1.49, 1.70, 1.70, 1.58
1.49, 1.70, 1.70, 1.58, 1.33
1.70, 1.70, 1.58, 1.33, 1.14
1.70, 1.58, 1.33, 1.14, 1.40
1.58, 1.33, 1.14, 1.40, 1.45
1.33, 1.14, 1.40, 1.45, 1.50
1.14, 1.40, 1.45, 1.50, 1.78
1.40, 1.45, 1.50, 1.78, 1.63
1.45, 1.50, 1.78, 1.63, 1.72
1.50, 1.78, 1.63, 1.72, 1.78
1.78, 1.63, 1.72, 1.78, 1.99
1.63, 1.72, 1.78, 1.99, 1.99
1.72, 1.78, 1.99, 1.99, 1.84
1.78, 1.99, 1.99, 1.84, 1.62
1.99, 1.99, 1.84, 1.62, 1.46
1.99, 1.84, 1.62, 1.46, 1.66
1.84, 1.62, 1.46, 1.66, 1.71
1.62, 1.46, 1.66, 1.71, 1.80
1.46, 1.66, 1.71, 1.80, 1.93
1.66, 1.71, 1.80, 1.93, 1.81
1.71, 1.80, 1.93, 1.81, 1.83
1.80, 1.93, 1.81, 1.83, 2.18
1.93, 1.81, 1.83, 2.18, 2.30
1.81, 1.83, 2.18, 2.30, 2.42
1.83, 2.18, 2.30, 2.42, 2.09
2.18, 2.30, 2.42, 2.09, 1.91
2.30, 2.42, 2.09, 1.91, 1.72
2.42, 2.09, 1.91, 1.72, 1.94
2.09, 1.91, 1.72, 1.94, 1.96
1.91, 1.72, 1.94, 1.96, 1.96
1.72, 1.94, 1.96, 1.96, 2.36
1.94, 1.96, 1.96, 2.36, 2.35
1.96, 1.96, 2.36, 2.35, 2.29
1.96, 2.36, 2.35, 2.29, 2.43
2.36, 2.35, 2.29, 2.43, 2.64
2.35, 2.29, 2.43, 2.64, 2.72
2.29, 2.43, 2.64, 2.72, 2.37
2.43, 2.64, 2.72, 2.37, 2.11
2.64, 2.72, 2.37, 2.11, 1.80
2.72, 2.37, 2.11, 1.80, 2.01
2.37, 2.11, 1.80, 2.01, 2.04
2.11, 1.80, 2.01, 2.04, 1.88
1.80, 2.01, 2.04, 1.88, 2.35
2.01, 2.04, 1.88, 2.35, 2.27
2.04, 1.88, 2.35, 2.27, 2.34
1.88, 2.35, 2.27, 2.34, 2.64
2.35, 2.27, 2.34, 2.64, 3.02
2.27, 2.34, 2.64, 3.02, 2.93
2.34, 2.64, 3.02, 2.93, 2.59
2.64, 3.02, 2.93, 2.59, 2.29
3.02, 2.93, 2.59, 2.29, 2.03
2.93, 2.59, 2.29, 2.03, 2.29
2.59, 2.29, 2.03, 2.29, 2.42
2.29, 2.03, 2.29, 2.42, 2.33
2.03, 2.29, 2.42, 2.33, 2.67
2.29, 2.42, 2.33, 2.67, 2.69
2.42, 2.33, 2.67, 2.69, 2.70
2.33, 2.67, 2.69, 2.70, 3.15
2.67, 2.69, 2.70, 3.15, 3.64
2.69, 2.70, 3.15, 3.64, 3.47
2.70, 3.15, 3.64, 3.47, 3.12
3.15, 3.64, 3.47, 3.12, 2.74
3.64, 3.47, 3.12, 2.74, 2.37
3.47, 3.12, 2.74, 2.37, 2.78
3.12, 2.74, 2.37, 2.78, 2.84
2.74, 2.37, 2.78, 2.84, 2.77
2.37, 2.78, 2.84, 2.77, 3.17
2.78, 2.84, 2.77, 3.17, 3.13
2.84, 2.77, 3.17, 3.13, 3.18
2.77, 3.17, 3.13, 3.18, 3.74
3.17, 3.13, 3.18, 3.74, 4.13
3.13, 3.18, 3.74, 4.13, 4.05
3.18, 3.74, 4.13, 4.05, 3.55
3.74, 4.13, 4.05, 3.55, 3.06
4.13, 4.05, 3.55, 3.06, 2.71
4.05, 3.55, 3.06, 2.71, 3.06
3.55, 3.06, 2.71, 3.06, 3.15
3.06, 2.71, 3.06, 3.15, 3.01
2.71, 3.06, 3.15, 3.01, 3.56
3.06, 3.15, 3.01, 3.56, 3.48
3.15, 3.01, 3.56, 3.48, 3.55
3.01, 3.56, 3.48, 3.55, 4.22
3.56, 3.48, 3.55, 4.22, 4.65
3.48, 3.55, 4.22, 4.65, 4.67
3.55, 4.22, 4.65, 4.67, 4.04
4.22, 4.65, 4.67, 4.04, 3.47
4.65, 4.67, 4.04, 3.47, 3.05
4.67, 4.04, 3.47, 3.05, 3.36
4.04, 3.47, 3.05, 3.36, 3.40
3.47, 3.05, 3.36, 3.40, 3.18
3.05, 3.36, 3.40, 3.18, 3.62
3.36, 3.40, 3.18, 3.62, 3.48
3.40, 3.18, 3.62, 3.48, 3.63
3.18, 3.62, 3.48, 3.63, 4.35
3.62, 3.48, 3.63, 4.35, 4.91
3.48, 3.63, 4.35, 4.91, 5.05
3.63, 4.35, 4.91, 5.05, 4.04
4.35, 4.91, 5.05, 4.04, 3.59
4.91, 5.05, 4.04, 3.59, 3.10
5.05, 4.04, 3.59, 3.10, 3.37
4.04, 3.59, 3.10, 3.37, 3.60
3.59, 3.10, 3.37, 3.60, 3.42
3.10, 3.37, 3.60, 3.42, 4.06
3.37, 3.60, 3.42, 4.06, 3.96
3.60, 3.42, 4.06, 3.96, 4.20
3.42, 4.06, 3.96, 4.20, 4.72
4.06, 3.96, 4.20, 4.72, 5.48
3.96, 4.20, 4.72, 5.48, 5.59
4.20, 4.72, 5.48, 5.59, 4.63
4.72, 5.48, 5.59, 4.63, 4.07
5.48, 5.59, 4.63, 4.07, 3.62
5.59, 4.63, 4.07, 3.62, 4.05
4.63, 4.07, 3.62, 4.05, 4.17
4.07, 3.62, 4.05, 4.17, 3.91
3.62, 4.05, 4.17, 3.91, 4.19
4.05, 4.17, 3.91, 4.19, 4.61
4.17, 3.91, 4.19, 4.61, 4.72
3.91, 4.19, 4.61, 4.72, 5.35
4.19, 4.61, 4.72, 5.35, 6.22
4.61, 4.72, 5.35, 6.22, 6.06
4.72, 5.35, 6.22, 6.06, 5.08
5.35, 6.22, 6.06, 5.08, 4.61
6.22, 6.06, 5.08, 4.61, 3.90
6.06, 5.08, 4.61, 3.90, 4.32
Posted in Machine Learning

Data Anomaly Detection For Mixed Data Using a Self-Organizing Map (SOM) From Scratch Python

A few days ago, I put together a demo of data anomaly detection for mixed numeric and categorical data using a self-organizing map (SOM), from scratch, using the C# language. I figured I’d refactor the C# version to Python. Refactoring a non-trivial system from one language to another always gives me new insights into the algorithm being used and the programming languages involved.

A self-organizing map (SOM) is a data structure, plus associated algorithms, that can be used to cluster data. Each cluster has a representative vector. Data items that are assigned to a SOM cluster but are far (usually in terms of Euclidean distance) from the cluster's representative vector are possibly anomalous.

I made a 240-item set of synthetic data that looks like:

F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
. . .

The fields are sex, height, age, State, income, political leaning.

Because SOM clustering uses Euclidean distance, the data must be normalized and encoded. I used min-max normalization on the age (min = 18, max = 68) and income (min = $20,300, max = $81,800) columns. I used one-over-n-hot encoding on the sex, State, and political leaning columns. I used equal-interval encoding for the height column, because it has a natural order.
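
Here is a minimal sketch of these encoding schemes. The exact formulas are my reconstruction (the helper names are mine), but the output values match the encoded data shown below:

# sketch of the normalization and encoding described above; the
# exact formulas are my reconstruction from the encoded data
def min_max(x, mn, mx):
  return (x - mn) / (mx - mn)

def one_over_n_hot(idx, n):
  # like one-hot, but uses 1/n instead of 1 so that a categorical
  # column doesn't dominate the Euclidean distance
  v = [0.0] * n
  v[idx] = 1.0 / n
  return v

sex_map = {"M": 0.0, "F": 0.5}  # binary special case
height_map = {"short": 0.25, "medium": 0.50, "tall": 0.75}  # equal-interval

age = min_max(24, 18, 68)              # 0.1200
income = min_max(29500, 20300, 81800)  # 0.1496
state = one_over_n_hot(0, 4)           # arkansas -> [0.25, 0.0, 0.0, 0.0]
politics = one_over_n_hot(2, 3)        # liberal  -> [0.0, 0.0, 0.3333]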

The resulting normalized and encoded data looks like:

0.5, 0.25, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
. . .
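
To check the encoding, here is a minimal sketch (mine, not part of the demo program) that reproduces the first encoded row from its raw values F, short, 24, arkansas, 29500, liberal:

# reproduce the first encoded data row
sex = 0.5                            # M = 0.0, F = 0.5
height = 0.25                        # short/medium/tall =
                                     # 0.25/0.50/0.75
age = (24 - 18) / (68 - 18)          # min-max = 0.1200
state = [0.25, 0.00, 0.00, 0.00]     # arkansas, 1/n-hot, n = 4
income = (29500 - 20300) / \
  (81800 - 20300)                    # min-max = 0.1496
politics = [0.0000, 0.0000, 0.3333]  # liberal, 1/n-hot, n = 3

row = [sex, height, age] + state + [income] + politics
print(row)  # matches the first encoded line above, to rounding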

I set up the demo SOM map as 2-by-2 for a total of 4 map nodes. Creating a SOM map is an iterative process that requires a steps_max value (I used 1,000) and a lrn_rate_max value (I used 2.00). SOM maps are very sensitive to these values, and they must be determined by trial and error. I monitored the SOM map building every 200 iterations by computing the sum of Euclidean distances (SED) between map node vectors and data items assigned to the map node / cluster:

Computing SOM clustering
map build step 0     |  SED = 311.4767
map build step 200   |  SED = 229.7895
map build step 400   |  SED = 160.0903
map build step 600   |  SED = 122.9567
map build step 800   |  SED = 105.7636
Done

Each of the 4 map nodes is identified by a [row][col] pair of indices. The four resulting map node vectors are:

SOM map nodes:
[0][0] : [0.00 0.67 0.81 0.10 0.02 0.11 0.02 0.78 0.04 0.09 0.20]
[0][1] : [0.50 0.31 0.23 0.08 0.06 0.09 0.01 0.23 0.04 0.07 0.23]
[1][0] : [0.00 0.45 0.31 0.09 0.07 0.08 0.02 0.43 0.16 0.18 0.00]
[1][1] : [0.50 0.34 0.72 0.06 0.06 0.04 0.09 0.68 0.19 0.12 0.02]

It’s important to look at the SOM mapping to determine if the steps_max and lrn_rate_max parameter values are good. The 240 data items were assigned to map nodes according to this distribution:

SOM mapping:
[0][0] : 43 items
[0][1] : 49 items
[1][0] : 77 items
[1][1] : 71 items

My demo has a function to display the [r][c] cluster ID for each data item. The first four cluster assignments are:

Clustering:
X[0] : [0 1]
X[1] : [1 0]
X[2] : [1 1]
X[3] : [1 0]
. . .

After the SOM map was constructed, I analyzed the data, looking for the data item assigned to each cluster/node that is farthest from the map node vector:

node [0][0] :
  most anomalous data idx = 208
  [0.00 0.25 0.72 0.00 0.25 0.00 0.00 0.72 0.00 0.00 0.33]
  M  short   54  colorado  64800  liberal
  distance = 0.5381

node [0][1] :
  most anomalous data idx = 179
  [0.50 0.75 0.40 0.00 0.00 0.25 0.00 0.37 0.00 0.33 0.00]
  F  tall    38  delaware  43000  moderate
  distance = 0.6320

node [1][0] :
  most anomalous data idx = 232
  [0.00 0.50 0.04 0.25 0.00 0.00 0.00 0.14 0.00 0.00 0.33]
  M  medium  20  arkansas  28700  liberal
  distance = 0.6067

node [1][1] :
  most anomalous data idx = 99
  [0.50 0.75 0.48 0.00 0.00 0.00 0.25 0.43 0.00 0.33 0.00]
  F  tall    42  illinois  47000  moderate
  distance = 0.6335

I displayed the index of the anomalous data item, its normalized and encoded form, its raw form, and the distance from the item to its map node vector. In a non-demo scenario, these data items would be examined to determine if they are in fact anomalies, and if so, what might be the cause.

Good fun!


Eight of the 12 months are celebrated in the U.S. as Heritage Months, where the idea is to acknowledge the contributions of immigrants from a particular country. The months of January, February, August, and December are anomalous in the sense that they have no generally recognized country-specific heritage designation.

March: Irish-American Heritage Month, Greek-American Heritage. April: Arab-American Heritage, Scottish-American Heritage. May: South Asian Heritage, Asian Pacific American Heritage, Jewish American Heritage. June: Caribbean-American Heritage, Russian Heritage. July: French-American Heritage. September: Hispanic Heritage, German-American Heritage. October: Filipino-American Heritage, Italian-American Heritage, Polish-American Heritage. November: American Indian Heritage.

I’m half French (my mother) and half Irish (my father). Left: The movie “Leprechaun 3” (1995) features its evil title character in Las Vegas. Very funny, but not a realistic depiction of Irish culture. Right: “The Pink Panther” (2006) features bumbling Inspector Jacques Clouseau in Paris. Moderately funny, but not a completely realistic depiction of French culture.


Demo code:

# som_anomaly.py
# self-organizing map anomaly detection
# from-scratch Python

import numpy as np

class ClusterSOM:
  def __init__(self, data, map_rows,
    map_cols, seed):
    self.map_rows = map_rows
    self.map_cols = map_cols
    self.data = data  # by ref
    self.rnd = np.random.RandomState(seed)

    dim = len(data[0])
    self.map = np.zeros((map_rows, map_cols,dim),
      dtype=np.float64)
    for i in range(map_rows):
      for j in range(map_cols): 
        for d in range(dim): # could do random vector
          self.map[i][j][d] = self.rnd.rand()

    self.mapping = np.zeros((map_rows, map_cols),
      dtype=object)
    for i in range(map_rows):
      for j in range(map_cols):
        self.mapping[i][j] = []  # empty list

  # ---------------------------------------------------------

  def cluster(self, lrn_rate_max, steps_max):
    n = len(self.data)
    dim = len(self.data[0])
    range_max = self.map_rows + self.map_cols

    # compute map
    for step in range(steps_max):

      if step % (steps_max // 5) == 0:  # show progress
        sed = 0.0  # sum of Euclidean distances
        for ix in range(n):
          (r,c) = self.closest_node(ix)
          data_item = self.data[ix]
          node_vec = self.map[r][c]
          dist = np.linalg.norm(data_item - \
            node_vec)
          sed += dist
        s1 = "map build step " + str(step).ljust(4, " ")
        s2 = "  |  SED = %0.4f " % sed
        print(s1 + s2)


      pct_left = 1.0 - (step / steps_max)
      curr_range = pct_left * range_max  # neighborhood radius
      curr_lrn_rate = pct_left * lrn_rate_max
      idx = self.rnd.randint(0,n)
      (r,c) = self.closest_node(idx)
      for i in range(self.map_rows):
        for j in range(self.map_cols):
          if ClusterSOM.manhatt_dist(r, c, i, j) <= \
            curr_range:
            for d in range(dim):
              self.map[i][j][d] = \
                self.map[i][j][d] + curr_lrn_rate * \
                (self.data[idx][d] - self.map[i][j][d])

    # compute mapping from map
    for idx in range(n):
      (r,c) = self.closest_node(idx)
      self.mapping[r][c].append(idx)
          
  # ---------------------------------------------------------

  def closest_node(self, idx):  # helper
    # return (row, col) of the map node vector that is
    # closest to data item [idx]
    r = -1; c = -1
    small_dist = 1000000.0
    for i in range(self.map_rows):
      for j in range(self.map_cols):
        dist = np.linalg.norm(self.data[idx] - \
          self.map[i][j])
        if dist < small_dist:
          small_dist = dist
          r = i; c = j
    return (r, c)

  # ---------------------------------------------------------

  @staticmethod
  def manhatt_dist(r1, c1, r2, c2):  # helper
    # Manhattan distance between two map node locations
    return abs(r1 - r2) + abs(c1 - c2)

  # ---------------------------------------------------------

  def get_clustering(self):
    # (row, col) cluster ID for each data item
    n = len(self.data)
    result = np.zeros((n, 2), dtype=np.int64)
    for idx in range(n):
      (r,c) = self.closest_node(idx)
      result[idx][0] = r; result[idx][1] = c
    return result

  # ---------------------------------------------------------

  def analyze(self, raw_data):
    # for each map node, show the assigned data item
    # that is farthest from the node vector
    for i in range(self.map_rows):
      for j in range(self.map_cols):
        if len(self.mapping[i][j]) == 0: continue
        large_dist = 0.0; anom_idx = -1
        for idx in self.mapping[i][j]:
          dist = np.linalg.norm(self.data[idx] - \
            self.map[i][j])
          if dist > large_dist:
            large_dist = dist; anom_idx = idx
        print("\nnode [" + str(i) + "][" + str(j) + "] : ")
        print("  most anomalous data idx = " + str(anom_idx))
        print("  " + str(self.data[anom_idx]))
        print("  " + raw_data[anom_idx])
        print("  distance = %0.4f " % large_dist)

  # ---- end class ------------------------------------------

def file_load(fn, comment):
  result = []
  fi = open(fn)
  for line in fi:
    line = line.strip()
    if line.startswith(comment): continue
    result.append(line)
  fi.close()
  return result    

def main():
  print("\nBegin self-organizing" +
        " map (SOM) anomaly analysis for mixed data" +
        " from scratch Python")

  print("\nLoading 240-item synthetic People dataset  ")
  rf = ".\\Data\\people_raw.txt"
  raw_file_array = file_load(rf, "#")

  fn = ".\\Data\\people_240.txt"
  X = np.loadtxt(fn, usecols=[0,1,2,3,4,5,6,7,8,9,10],
    delimiter=",", comments="#", dtype=np.float64)
  print("\nFirst three rows normalized data: ")
  np.set_printoptions(precision=4, suppress=True,
    floatmode='fixed', linewidth=120)
  for i in range(3):
    print(X[i])

  map_rows = 2
  map_cols = 2
  lrn_rate_max = 2.00
  steps_max = 1000
  print("\nsetting map_rows = " + str(map_rows))
  print("setting map_cols = " + str(map_cols))
  print("Setting lrn_max_rate = %0.2f " % lrn_rate_max)
  print("Setting steps_max = " + str(steps_max))

  print("\nComputing SOM clustering ")
  som = ClusterSOM(X, map_rows, map_cols, seed=3)
  som.cluster(lrn_rate_max, steps_max)
  print("Done ")

  np.set_printoptions(precision=2, suppress=True,
    floatmode='fixed', linewidth=120)

  print("\nSOM map nodes: ")
  for i in range(map_rows):
    for j in range(map_cols):
      print("[" + str(i) + "][" + str(j) + "] : ", end="")
      print(som.map[i][j])  # a vector

  print("\nSOM mapping: ")
  for i in range(map_rows):
    for j in range(map_cols):
      # show count items assigned to each map node
      print("[" + str(i) + "][" + str(j) + "] : ", end="")
      print(str(len(som.mapping[i][j])) + " items ")

      # show idx assigned to each node
      # print("\nmap node: " + str(i) + " " + str(j))
      # for k in range(len(som.mapping[i][j])):
      #   print(str(som.mapping[i][j][k]) + " ", end="")
      # print("")
        

  # show (r,c) cluster ID for each data item
  clustering = som.get_clustering()
  print("\nClustering: ")
  # for i in range(len(X)):  # all 240 items
  for i in range(4):  # first 4
    print("X" + "[" + str(i).ljust(2, " ") + "] : ",\
      end="")
    print(clustering[i])
  print(". . .")

  print("\nAnalyzing for anomalies ")
  som.analyze(raw_file_array)

  print("\nEnd SOM anomaly ")

if __name__ == "__main__":
  main()

Raw data:

# people_raw.txt
#
F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
F  short   50  colorado  56500  moderate
F  medium  50  illinois  55000  moderate
M  tall    19  delaware  32700  conservative
F  short   22  illinois  27700  moderate
M  tall    39  delaware  47100  liberal
F  short   34  arkansas  39400  moderate
M  medium  22  illinois  33500  conservative
F  medium  35  delaware  35200  liberal
M  tall    33  colorado  46400  moderate
F  short   45  colorado  54100  moderate
F  short   42  illinois  50700  moderate
M  tall    33  colorado  46800  moderate
F  tall    25  delaware  30000  moderate
M  medium  31  colorado  46400  conservative
F  short   27  arkansas  32500  liberal
F  short   48  illinois  54000  moderate
M  tall    64  illinois  71300  liberal
F  medium  61  colorado  72400  conservative
F  short   54  illinois  61000  conservative
F  short   29  arkansas  36300  conservative
F  short   50  delaware  55000  moderate
F  medium  55  illinois  62500  conservative
F  medium  40  illinois  52400  conservative
F  short   22  arkansas  23600  liberal
F  short   68  colorado  78400  conservative
M  tall    60  illinois  71700  liberal
M  tall    34  delaware  46500  moderate
M  medium  25  delaware  37100  conservative
M  short   31  illinois  48900  moderate
F  short   43  delaware  48000  moderate
F  short   58  colorado  65400  liberal
M  tall    55  illinois  60700  liberal
M  tall    43  colorado  51100  moderate
M  tall    43  delaware  53200  moderate
M  medium  21  arkansas  37200  conservative
F  short   55  delaware  64600  conservative
F  short   64  colorado  74800  conservative
M  tall    41  illinois  58800  moderate
F  medium  64  delaware  72700  conservative
M  medium  56  illinois  66600  liberal
F  short   31  delaware  36000  moderate
M  tall    65  delaware  70100  liberal
F  tall    55  illinois  64300  conservative
M  short   25  arkansas  40300  conservative
F  short   46  delaware  51000  moderate
M  tall    36  illinois  53500  conservative
F  short   52  illinois  58100  moderate
F  short   61  delaware  67900  conservative
F  short   57  delaware  65700  conservative
M  tall    46  colorado  52600  moderate
M  tall    62  arkansas  66800  liberal
F  short   55  illinois  62700  conservative
M  medium  22  delaware  27700  moderate
M  tall    50  illinois  62900  conservative
M  tall    32  illinois  41800  moderate
M  short   21  delaware  35600  conservative
F  medium  44  colorado  52000  moderate
F  short   46  illinois  51700  moderate
F  short   62  colorado  69700  conservative
F  short   57  illinois  66400  conservative
M  medium  67  illinois  75800  liberal
F  short   29  arkansas  34300  liberal
F  short   53  illinois  60100  conservative
M  tall    44  arkansas  54800  moderate
F  medium  46  colorado  52300  moderate
M  tall    20  illinois  30100  moderate
M  medium  38  illinois  53500  moderate
F  short   50  colorado  58600  moderate
F  short   33  colorado  42500  moderate
M  tall    33  colorado  39300  moderate
F  short   26  colorado  40400  conservative
F  short   58  arkansas  70700  conservative
F  tall    43  illinois  48000  moderate
M  medium  46  arkansas  64400  conservative
F  short   60  arkansas  71700  conservative
M  tall    42  arkansas  48900  moderate
M  tall    56  delaware  56400  liberal
M  short   62  colorado  66300  liberal
M  short   50  arkansas  64800  moderate
F  short   47  illinois  52000  moderate
M  tall    67  colorado  80400  liberal
M  tall    40  delaware  50400  moderate
F  short   42  colorado  48400  moderate
F  short   64  arkansas  72000  conservative
M  medium  47  arkansas  58700  liberal
F  medium  45  colorado  52800  moderate
M  tall    25  delaware  40900  conservative
F  short   38  arkansas  48400  conservative
F  short   55  delaware  60000  moderate
M  tall    44  arkansas  60600  moderate
F  medium  33  arkansas  41000  moderate
F  short   34  delaware  39000  moderate
F  short   27  colorado  33700  liberal
F  short   32  colorado  40700  moderate
F  tall    42  illinois  47000  moderate
M  short   24  delaware  40300  conservative
F  short   42  colorado  50300  moderate
F  short   25  delaware  28000  liberal
F  short   51  colorado  58000  moderate
M  medium  55  colorado  63500  liberal
F  short   44  arkansas  47800  liberal
M  short   18  arkansas  39800  conservative
M  tall    67  colorado  71600  liberal
F  short   45  delaware  50000  moderate
F  short   48  arkansas  55800  moderate
M  short   25  colorado  39000  moderate
M  tall    67  arkansas  78300  moderate
F  short   37  delaware  42000  moderate
M  short   32  arkansas  42700  moderate
F  short   48  arkansas  57000  moderate
M  tall    66  delaware  75000  liberal
F  tall    61  arkansas  70000  conservative
M  medium  58  delaware  68900  moderate
F  short   19  arkansas  24000  liberal
F  short   38  delaware  43000  moderate
M  medium  27  arkansas  36400  moderate
F  short   42  arkansas  48000  moderate
F  short   60  arkansas  71300  conservative
M  tall    27  delaware  34800  conservative
F  tall    29  colorado  37100  conservative
M  medium  43  arkansas  56700  moderate
F  medium  48  arkansas  56700  moderate
F  medium  27  delaware  29400  liberal
M  tall    44  arkansas  55200  conservative
F  short   23  colorado  26300  liberal
M  tall    36  colorado  53000  liberal
F  short   64  delaware  72500  conservative
F  short   29  delaware  30000  liberal
M  short   33  arkansas  49300  moderate
M  tall    66  colorado  75000  liberal
M  medium  21  delaware  34300  conservative
F  short   27  arkansas  32700  liberal
F  short   29  arkansas  31800  liberal
M  tall    31  arkansas  48600  moderate
F  short   36  delaware  41000  moderate
F  short   49  colorado  55700  moderate
M  short   28  arkansas  38400  conservative
M  medium  43  delaware  56600  moderate
M  medium  46  colorado  58800  moderate
F  short   57  arkansas  69800  conservative
M  short   52  delaware  59400  moderate
M  tall    31  delaware  43500  moderate
M  tall    55  arkansas  62000  liberal
F  short   50  arkansas  56400  moderate
F  short   48  colorado  55900  moderate
M  medium  22  delaware  34500  conservative
F  short   59  delaware  66700  conservative
F  short   34  arkansas  42800  liberal
M  tall    64  arkansas  77200  liberal
F  short   29  delaware  33500  liberal
M  medium  34  colorado  43200  moderate
M  medium  61  arkansas  75000  liberal
F  short   64  delaware  71100  conservative
M  short   29  arkansas  41300  conservative
F  short   63  colorado  70600  conservative
M  medium  29  colorado  40000  conservative
M  tall    51  arkansas  62700  moderate
M  tall    24  delaware  37700  conservative
F  medium  48  colorado  57500  moderate
F  short   18  arkansas  27400  conservative
F  short   18  arkansas  20300  liberal
F  short   33  colorado  38200  liberal
M  medium  20  delaware  34800  conservative
F  short   29  delaware  33000  liberal
M  short   44  delaware  63000  conservative
M  tall    65  delaware  81800  conservative
M  tall    56  arkansas  63700  liberal
M  medium  52  delaware  58400  moderate
M  medium  29  colorado  48600  conservative
M  tall    47  colorado  58900  moderate
F  medium  68  arkansas  72600  liberal
F  short   31  delaware  36000  moderate
F  short   61  colorado  62500  liberal
F  short   19  colorado  21500  liberal
F  tall    38  delaware  43000  moderate
M  tall    26  arkansas  42300  conservative
F  short   61  colorado  67400  conservative
F  short   40  arkansas  46500  moderate
M  medium  49  arkansas  65200  moderate
F  medium  56  arkansas  67500  conservative
M  short   48  colorado  66000  moderate
F  short   52  arkansas  56300  liberal
M  tall    18  arkansas  29800  conservative
M  tall    56  delaware  59300  liberal
M  medium  52  colorado  64400  moderate
M  medium  18  colorado  28600  moderate
M  tall    58  arkansas  66200  liberal
M  tall    39  colorado  55100  moderate
M  tall    46  arkansas  62900  moderate
M  medium  40  colorado  46200  moderate
M  medium  60  arkansas  72700  liberal
F  short   36  colorado  40700  liberal
F  short   44  arkansas  52300  moderate
F  short   28  arkansas  31300  liberal
F  short   54  delaware  62600  conservative
M  medium  51  arkansas  61200  moderate
M  short   32  colorado  46100  moderate
F  short   55  arkansas  62700  conservative
F  short   25  delaware  26200  liberal
F  medium  33  delaware  37300  liberal
M  medium  29  colorado  46200  conservative
F  short   65  arkansas  72700  conservative
M  tall    43  colorado  51400  moderate
M  short   54  colorado  64800  liberal
F  short   61  colorado  72700  conservative
F  short   52  colorado  63600  conservative
F  short   30  colorado  33500  liberal
F  short   29  arkansas  31400  liberal
M  tall    47  delaware  59400  moderate
F  short   39  colorado  47800  moderate
F  short   47  delaware  52000  moderate
M  medium  49  arkansas  58600  moderate
M  tall    63  delaware  67400  liberal
M  medium  30  arkansas  39200  conservative
M  tall    61  delaware  69600  liberal
M  medium  47  delaware  58700  moderate
F  short   30  delaware  34500  liberal
M  medium  51  delaware  58000  moderate
M  medium  24  arkansas  38800  moderate
M  short   49  arkansas  64500  moderate
F  medium  66  delaware  74500  conservative
M  tall    65  arkansas  76900  conservative
M  short   46  colorado  58000  conservative
M  tall    45  delaware  51800  moderate
M  short   47  arkansas  63600  conservative
M  tall    29  arkansas  44800  conservative
M  tall    57  delaware  69300  liberal
M  medium  20  arkansas  28700  liberal
M  medium  35  arkansas  43400  moderate
M  tall    61  delaware  67000  liberal
M  short   31  delaware  37300  moderate
F  short   18  arkansas  20800  liberal
F  medium  26  delaware  29200  liberal
M  medium  28  arkansas  36400  liberal
M  tall    59  delaware  69400  liberal

Normalized and encoded data:

# people_240.txt
#
# sex (M = 0.0, F = 0.5)
# height (short, medium, tall)
# age (min = 18, max = 68)
# State (Arkansas, Colorado, Delaware, Illinois)
# income (min = $20,300, max = $81,800)
# political leaning (conservative, moderate, liberal)
#
0.5, 0.25, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6400, 0.00, 0.25, 0.00, 0.00, 0.5886, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.6400, 0.00, 0.00, 0.00, 0.25, 0.5642, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.0200, 0.00, 0.00, 0.25, 0.00, 0.2016, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0800, 0.00, 0.00, 0.00, 0.25, 0.1203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.4358, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3200, 0.25, 0.00, 0.00, 0.00, 0.3106, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.00, 0.25, 0.2146, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.3400, 0.00, 0.00, 0.25, 0.00, 0.2423, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.4244, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5400, 0.00, 0.25, 0.00, 0.00, 0.5496, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.00, 0.00, 0.25, 0.4943, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.4309, 0.0000, 0.3333, 0.0000
0.5, 0.75, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.1577, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.2600, 0.00, 0.25, 0.00, 0.00, 0.4244, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.1984, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6000, 0.00, 0.00, 0.00, 0.25, 0.5480, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9200, 0.00, 0.00, 0.00, 0.25, 0.8293, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.8472, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7200, 0.00, 0.00, 0.00, 0.25, 0.6618, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.2602, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6400, 0.00, 0.00, 0.25, 0.00, 0.5642, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6862, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.4400, 0.00, 0.00, 0.00, 0.25, 0.5220, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0800, 0.25, 0.00, 0.00, 0.00, 0.0537, 0.0000, 0.0000, 0.3333
0.5, 0.25, 1.0000, 0.00, 0.25, 0.00, 0.00, 0.9447, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.8400, 0.00, 0.00, 0.00, 0.25, 0.8358, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3200, 0.00, 0.00, 0.25, 0.00, 0.4260, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.2732, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.2600, 0.00, 0.00, 0.00, 0.25, 0.4650, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.4504, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8000, 0.00, 0.25, 0.00, 0.00, 0.7333, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6569, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5000, 0.00, 0.25, 0.00, 0.00, 0.5008, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.5350, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0600, 0.25, 0.00, 0.00, 0.00, 0.2748, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7400, 0.00, 0.00, 0.25, 0.00, 0.7203, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9200, 0.00, 0.25, 0.00, 0.00, 0.8862, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.4600, 0.00, 0.00, 0.00, 0.25, 0.6260, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.7600, 0.00, 0.00, 0.00, 0.25, 0.7528, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2553, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9400, 0.00, 0.00, 0.25, 0.00, 0.8098, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.7154, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.1400, 0.25, 0.00, 0.00, 0.00, 0.3252, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.5600, 0.00, 0.00, 0.25, 0.00, 0.4992, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.5398, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6800, 0.00, 0.00, 0.00, 0.25, 0.6146, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.7740, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7800, 0.00, 0.00, 0.25, 0.00, 0.7382, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.5252, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8800, 0.25, 0.00, 0.00, 0.00, 0.7561, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6894, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.25, 0.00, 0.1203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.6400, 0.00, 0.00, 0.00, 0.25, 0.6927, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.2800, 0.00, 0.00, 0.00, 0.25, 0.3496, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.0600, 0.00, 0.00, 0.25, 0.00, 0.2488, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.5200, 0.00, 0.25, 0.00, 0.00, 0.5154, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5600, 0.00, 0.00, 0.00, 0.25, 0.5106, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8800, 0.00, 0.25, 0.00, 0.00, 0.8033, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7800, 0.00, 0.00, 0.00, 0.25, 0.7496, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.9800, 0.00, 0.00, 0.00, 0.25, 0.9024, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.2276, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7000, 0.00, 0.00, 0.00, 0.25, 0.6472, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5610, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.5203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.0400, 0.00, 0.00, 0.00, 0.25, 0.1593, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.4000, 0.00, 0.00, 0.00, 0.25, 0.5398, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6400, 0.00, 0.25, 0.00, 0.00, 0.6228, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.3610, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.3089, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1600, 0.00, 0.25, 0.00, 0.00, 0.3268, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8000, 0.25, 0.00, 0.00, 0.00, 0.8195, 0.3333, 0.0000, 0.0000
0.5, 0.75, 0.5000, 0.00, 0.00, 0.00, 0.25, 0.4504, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.5600, 0.25, 0.00, 0.00, 0.00, 0.7171, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8358, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.4800, 0.25, 0.00, 0.00, 0.00, 0.4650, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.7600, 0.00, 0.00, 0.25, 0.00, 0.5870, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.8800, 0.00, 0.25, 0.00, 0.00, 0.7480, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.6400, 0.25, 0.00, 0.00, 0.00, 0.7236, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5800, 0.00, 0.00, 0.00, 0.25, 0.5154, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9800, 0.00, 0.25, 0.00, 0.00, 0.9772, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4400, 0.00, 0.00, 0.25, 0.00, 0.4894, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.25, 0.00, 0.00, 0.4569, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9200, 0.25, 0.00, 0.00, 0.00, 0.8407, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5800, 0.25, 0.00, 0.00, 0.00, 0.6244, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.5400, 0.00, 0.25, 0.00, 0.00, 0.5285, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.3350, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4000, 0.25, 0.00, 0.00, 0.00, 0.4569, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7400, 0.00, 0.00, 0.25, 0.00, 0.6455, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.6553, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.3000, 0.25, 0.00, 0.00, 0.00, 0.3366, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3200, 0.00, 0.00, 0.25, 0.00, 0.3041, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.2179, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2800, 0.00, 0.25, 0.00, 0.00, 0.3317, 0.0000, 0.3333, 0.0000
0.5, 0.75, 0.4800, 0.00, 0.00, 0.00, 0.25, 0.4341, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.1200, 0.00, 0.00, 0.25, 0.00, 0.3252, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.25, 0.00, 0.00, 0.4878, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.1252, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6600, 0.00, 0.25, 0.00, 0.00, 0.6130, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.7400, 0.00, 0.25, 0.00, 0.00, 0.7024, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.4472, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.3171, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9800, 0.00, 0.25, 0.00, 0.00, 0.8341, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5400, 0.00, 0.00, 0.25, 0.00, 0.4829, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5772, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.1400, 0.00, 0.25, 0.00, 0.00, 0.3041, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9800, 0.25, 0.00, 0.00, 0.00, 0.9431, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3800, 0.00, 0.00, 0.25, 0.00, 0.3528, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2800, 0.25, 0.00, 0.00, 0.00, 0.3642, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5967, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9600, 0.00, 0.00, 0.25, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.8600, 0.25, 0.00, 0.00, 0.00, 0.8081, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.8000, 0.00, 0.00, 0.25, 0.00, 0.7902, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0200, 0.25, 0.00, 0.00, 0.00, 0.0602, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.4000, 0.00, 0.00, 0.25, 0.00, 0.3691, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.2618, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.25, 0.00, 0.00, 0.00, 0.4504, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8293, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.1800, 0.00, 0.00, 0.25, 0.00, 0.2358, 0.3333, 0.0000, 0.0000
0.5, 0.75, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.2732, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5000, 0.25, 0.00, 0.00, 0.00, 0.5919, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5919, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.1800, 0.00, 0.00, 0.25, 0.00, 0.1480, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5675, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1000, 0.00, 0.25, 0.00, 0.00, 0.0976, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3600, 0.00, 0.25, 0.00, 0.00, 0.5317, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8488, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.1577, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.3000, 0.25, 0.00, 0.00, 0.00, 0.4715, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9600, 0.00, 0.25, 0.00, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0600, 0.00, 0.00, 0.25, 0.00, 0.2276, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.2016, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.1870, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.2600, 0.25, 0.00, 0.00, 0.00, 0.4602, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3600, 0.00, 0.00, 0.25, 0.00, 0.3366, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6200, 0.00, 0.25, 0.00, 0.00, 0.5756, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.2943, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.5902, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.6260, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.7800, 0.25, 0.00, 0.00, 0.00, 0.8049, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.6800, 0.00, 0.00, 0.25, 0.00, 0.6358, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.3772, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.7400, 0.25, 0.00, 0.00, 0.00, 0.6780, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6400, 0.25, 0.00, 0.00, 0.00, 0.5870, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.5789, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.25, 0.00, 0.2309, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8200, 0.00, 0.00, 0.25, 0.00, 0.7545, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.3200, 0.25, 0.00, 0.00, 0.00, 0.3659, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.9200, 0.25, 0.00, 0.00, 0.00, 0.9252, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.2146, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.3200, 0.00, 0.25, 0.00, 0.00, 0.3724, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.8600, 0.25, 0.00, 0.00, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8260, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.3415, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.8179, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.3203, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.6600, 0.25, 0.00, 0.00, 0.00, 0.6894, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1200, 0.00, 0.00, 0.25, 0.00, 0.2829, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.6049, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.1154, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.0000, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.2911, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0400, 0.00, 0.00, 0.25, 0.00, 0.2358, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.2065, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.5200, 0.00, 0.00, 0.25, 0.00, 0.6943, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9400, 0.00, 0.00, 0.25, 0.00, 1.0000, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7600, 0.25, 0.00, 0.00, 0.00, 0.7057, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6800, 0.00, 0.00, 0.25, 0.00, 0.6195, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.4602, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5800, 0.00, 0.25, 0.00, 0.00, 0.6276, 0.0000, 0.3333, 0.0000
0.5, 0.50, 1.0000, 0.25, 0.00, 0.00, 0.00, 0.8504, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2553, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.6862, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.0200, 0.00, 0.25, 0.00, 0.00, 0.0195, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.4000, 0.00, 0.00, 0.25, 0.00, 0.3691, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1600, 0.25, 0.00, 0.00, 0.00, 0.3577, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.7659, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4400, 0.25, 0.00, 0.00, 0.00, 0.4260, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.7301, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.7600, 0.25, 0.00, 0.00, 0.00, 0.7675, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.7431, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6800, 0.25, 0.00, 0.00, 0.00, 0.5854, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.1545, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7600, 0.00, 0.00, 0.25, 0.00, 0.6341, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6800, 0.00, 0.25, 0.00, 0.00, 0.7171, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0000, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8000, 0.25, 0.00, 0.00, 0.00, 0.7463, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.25, 0.00, 0.00, 0.5659, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5600, 0.25, 0.00, 0.00, 0.00, 0.6927, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.4400, 0.00, 0.25, 0.00, 0.00, 0.4211, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8520, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3600, 0.00, 0.25, 0.00, 0.00, 0.3317, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5203, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.1789, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7200, 0.00, 0.00, 0.25, 0.00, 0.6878, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.6600, 0.25, 0.00, 0.00, 0.00, 0.6650, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2800, 0.00, 0.25, 0.00, 0.00, 0.4195, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.7400, 0.25, 0.00, 0.00, 0.00, 0.6894, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.0959, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.3000, 0.00, 0.00, 0.25, 0.00, 0.2764, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.4211, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9400, 0.25, 0.00, 0.00, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5000, 0.00, 0.25, 0.00, 0.00, 0.5057, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.7200, 0.00, 0.25, 0.00, 0.00, 0.7236, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6800, 0.00, 0.25, 0.00, 0.00, 0.7041, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2400, 0.00, 0.25, 0.00, 0.00, 0.2146, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.1805, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.6358, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4200, 0.00, 0.25, 0.00, 0.00, 0.4472, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.5154, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.6228, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9000, 0.00, 0.00, 0.25, 0.00, 0.7659, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2400, 0.25, 0.00, 0.00, 0.00, 0.3073, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.8016, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.6244, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.2400, 0.00, 0.00, 0.25, 0.00, 0.2309, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6600, 0.00, 0.00, 0.25, 0.00, 0.6130, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.3008, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.7187, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.9600, 0.00, 0.00, 0.25, 0.00, 0.8813, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9400, 0.25, 0.00, 0.00, 0.00, 0.9203, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.6130, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5400, 0.00, 0.00, 0.25, 0.00, 0.5122, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.5800, 0.25, 0.00, 0.00, 0.00, 0.7041, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.3984, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7800, 0.00, 0.00, 0.25, 0.00, 0.7967, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0400, 0.25, 0.00, 0.00, 0.00, 0.1366, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.3400, 0.25, 0.00, 0.00, 0.00, 0.3756, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.7593, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2764, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.0081, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.1600, 0.00, 0.00, 0.25, 0.00, 0.1447, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.2618, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.8200, 0.00, 0.00, 0.25, 0.00, 0.7984, 0.0000, 0.0000, 0.3333

Clustering Categorical Data Using K-Means With One-Over-N-Hot Encoding And Equal-Interval Encoding

At one point in my research career, I spent a lot of time looking at machine learning clustering for categorical data. I devised algorithms based on entropy minimization, category utility, and naive Bayes. All of the algorithms worked quite well, but they required significant effort to implement.

A naive approach to encoding categorical data for k-means doesn’t work very well. For example, one-hot encoding makes the difference between any two values the same, regardless of how many possible values there are. Ordinal encoding, such as red = 1, blue = 2, green = 3, is even worse because the distance between red and green is greater than the distance between red and blue.

I recently discovered a pair of encoding techniques that allow k-means clustering to be effectively applied to categorical data. For ordinary categorical data I use a technique I call one-over-n-hot encoding. For example, if a data column named Color has n = 3 possible values red, blue, green, the encoding is (0.33, 0, 0), (0, 0.33, 0), (0, 0, 0.33). The encoding is a modified one-hot encoding where the 1 values are replaced by 1/n.

For categorical data that has inherent ordering where “greater than” makes sense, I use equal-interval encoding. For example, if a column Height has possible values short, medium, tall, the encoding is short = 0.25, medium = 0.50, tall = 0.75.
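
To see the effect of the encodings on Euclidean distance, here is a minimal sketch (my check, not part of the demo below) comparing plain one-hot against one-over-n-hot, plus equal-interval distances for an ordered column:

import numpy as np

# plain one-hot: the distance between any two distinct
# values is sqrt(2), no matter how many values exist
red_oh = np.array([1.0, 0.0, 0.0])
blue_oh = np.array([0.0, 1.0, 0.0])
print(np.linalg.norm(red_oh - blue_oh))    # 1.4142

# one-over-n-hot with n = 3: the 1s become 1/3, so the
# distance shrinks as the number of possible values grows
red_onh = np.array([0.3333, 0.0, 0.0])
blue_onh = np.array([0.0, 0.3333, 0.0])
print(np.linalg.norm(red_onh - blue_onh))  # 0.4714

# equal-interval encoding preserves the natural ordering
print(abs(0.25 - 0.50))  # short vs. medium = 0.25
print(abs(0.25 - 0.75))  # short vs. tall = 0.50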

I put together a demo. The 8-item raw data is:

red,    medium, hard
red,    small,  soft
blue,   large,  hard
blue,   medium, hard
orange, medium, hard
green,  small,  soft
green,  medium, hard
green,  large,  soft

The one-over-n-hot (for Color and Hardness) and equal-interval (for Size) encoded data is:

0.25, 0, 0, 0,   0.50,   0.5
0.25, 0, 0, 0,   0.25,   0.0
0, 0.25, 0, 0,   0.75,   0.5
0, 0.25, 0, 0,   0.50,   0.5
0, 0, 0.25, 0,   0.50,   0.5
0, 0, 0, 0.25,   0.25,   0.0
0, 0, 0, 0.25,   0.50,   0.5
0, 0, 0, 0.25,   0.75,   0.0

I used the scikit-learn KMeans module with k = 3 to cluster the encoded data, which gave a clustering of [0 1 2 2 0 1 0 1]. The raw data organized by cluster ID is:

k = 0
red,    medium, hard
orange, medium, hard
green,  medium, hard

k = 1
red,    small,  soft
green,  small,  soft
green,  large,  soft

k = 2
blue,   large,  hard
blue,   medium, hard

This clustering seems good in some sort of intuitive sense, and in fact has minimum entropy (randomness).
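
The entropy computation isn't shown here, so this is a minimal sketch of one common definition, a weighted sum of within-cluster Shannon entropies per column, where lower means less randomness (this is just an illustration, not the exact code from my research):

import math
from collections import Counter

def clustering_entropy(rows, labels, k):
  # weighted sum of within-cluster column entropies
  n = len(rows)
  total = 0.0
  for kk in range(k):
    members = [rows[i] for i in range(n) if labels[i] == kk]
    if len(members) == 0: continue
    m = len(members)
    for col in range(len(rows[0])):
      counts = Counter(row[col] for row in members)
      ent = -sum((c/m) * math.log2(c/m) \
        for c in counts.values())
      total += (m / n) * ent
  return total

rows = [("red","medium","hard"), ("red","small","soft"),
  ("blue","large","hard"), ("blue","medium","hard"),
  ("orange","medium","hard"), ("green","small","soft"),
  ("green","medium","hard"), ("green","large","soft")]
labels = [0, 1, 2, 2, 0, 1, 0, 1]
print("%0.4f" % clustering_entropy(rows, labels, 3))  # 1.5331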

The one-over-n-hot plus equal-interval encoding scheme can be easily used for clustering mixed numeric and categorical data by applying min-max normalization to numeric columns.
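
For example, here is a minimal sketch of min-max normalizing a numeric income column (using min = 20,300 and max = 81,800, as in the People dataset earlier) so that it falls in the same [0, 1] range as the encoded categorical columns:

incomes = [29500.0, 51200.0, 75800.0]
mn = 20300.0; mx = 81800.0  # observed min and max
normed = [(x - mn) / (mx - mn) for x in incomes]
print(normed)  # [0.1496, 0.5024, 0.9024] to 4 decimals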



Retro Futurism is a term for old visions (typically from the 1950s and 1960s) of what the future (typically the 2000s) might be like. Left: A vision of the home of the future by artist Paul Alexander (1937-2021). Right: Another home of the future, from a 1957 Ford Motor Company advertising brochure.

Because I grew up in the 1960s, I lived through the era when futurism art was contemporary, and I’m quite familiar with it. But there actually wasn’t a whole lot of futurism illustration and art. Somewhat strangely, a lot of current artists are creating new versions of hypothetical retro futurism art, a sort of neo retro futurism.


Demo program:

# kmeans_categorical_demo.py
# one-over-n-hot encoding and
# equal-interval encoding

import numpy as np
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings("ignore")

def read_file(fn):
  lst = []
  f = open(fn, "r")
  for line in f:
    line = line.strip()  # remove NL
    if line.startswith("#"):
      continue
    # print(line)
    lst.append(line)
  f.close()
  return lst

def main():
  print("\nBegin k-means categorical clustering ")
  np.set_printoptions(precision=2, suppress=True,
    floatmode='fixed')

  fn1 = "categorical_data_raw.txt"
  lst = read_file(fn1)
  print("\nraw data: ")
  for i in range(len(lst)):
    print(lst[i])

  fn2 = "categorical_data_encoded.txt"
  # 0.25, 0, 0, 0,  0.75,  0.5
  encoded = np.loadtxt(fn2, delimiter=",",
   comments="#", usecols = (0, 1, 2, 3, 4, 5))
  print("\nencoded data: ")
  print(encoded)

  print("\napplying k-means clustering ")
  k = 3
  kmeans = KMeans(n_clusters=k,
    random_state=0, tol=0.0001, init='k-means++',
    n_init='auto').fit(encoded)

  # class sklearn.cluster.KMeans(n_clusters=8,
  # *, init='k-means++', n_init='auto', max_iter=300,
  # tol=0.0001, verbose=0, random_state=None,
  # copy_x=True, algorithm='lloyd')

  print("\nclustering: ")
  print(kmeans.labels_)

  print("\nclustered data: ")
  for kk in range(k):
    print("\nk = " + str(kk))
    for i in range(len(encoded)):
      cid = kmeans.labels_[i]
      if cid == kk:
        print(lst[i])

  print("\nEnd demo ")

if __name__ == "__main__":
  main()

Raw data:

# categorical_data_raw.txt
#
red,    medium, hard
red,    small,  soft
blue,   large,  hard
blue,   medium, hard
orange, medium, hard
green,  small,  soft
green,  medium, hard
green,  large,  soft

Encoded data:

# categorical_data_encoded.txt
#
0.25, 0, 0, 0,  0.50,  0.5
0.25, 0, 0, 0,  0.25,  0.0
0, 0.25, 0, 0,  0.75,  0.5
0, 0.25, 0, 0,  0.50,  0.5
0, 0, 0.25, 0,  0.50,  0.5
0, 0, 0, 0.25,  0.25,  0.0
0, 0, 0, 0.25,  0.50,  0.5
0, 0, 0, 0.25,  0.75,  0.0