My Top Ten Favorite Dr. Fu Manchu Movies

Dr. Fu Manchu was arguably the first super-villain. He was created by author Arthur Henry Ward (1883-1959) who used the pen name Sax Rohmer. Rohmer wrote 14 Fu Manchu novels from 1913 to 1959. The Fu Manchu novels were very popular and roughly 10 movies featuring the evil doctor and his daughter(s) were made from 1929 to 1969.

Here are my ten favorites (not a difficult choice because there are only ten), listed in chronological order. None of these movies is particularly good, but they all have a certain charm and interest if you like old movies as I do. Ian Fleming’s James Bond novel “Dr. No” and its film adaptation draw on many of the ideas in the Fu Manchu novels and movies.


1. The Mysterious Dr. Fu Manchu (1929) – Dr. Fu Manchu (actor Warner Oland) is a Chinese doctor. During the Boxer Rebellion (1899-1901), Manchu’s wife and daughter are accidentally killed by British soldiers. Manchu goes to England and vows revenge. Opposing Manchu are Nayland Smith (actor O.P. Heggie) of Scotland Yard and his pal Dr. Petrie (Neil Hamilton). In the end, Manchu is captured and drinks poison to commit suicide. This is the first of three films starring Warner Oland (who also played Charlie Chan) as Manchu. A very early sound film that has mostly historical interest rather than entertainment value. My grade: NA.


2. The Return of Dr. Fu Manchu (1930) – Because the previous film was so popular, Fu Manchu (Warner Oland) turns out to have survived, and he continues seeking revenge on those he holds responsible for the deaths of his wife and child. Essentially a continuation of the previous film. Manchu is defeated again by Nayland Smith (O.P. Heggie) and Dr. Petrie (Neil Hamilton). Like many early sound films, this one plays a bit like a silent film. A slow-moving film that has mostly historical interest. Notice the poster omits “Dr.” in the title. My grade: NA.


3. Daughter of the Dragon (1931) – The third and final film with Oland as Manchu. Princess Ling Moy (actress Anna May Wong) is the long-lost daughter of Fu Manchu. Manchu reveals this to her, and out of loyalty she becomes his evil assistant. Complicating matters is the fact that Ling Moy’s boyfriend Ah Kee is a Chinese agent seeking to capture her father. Furthermore, Ronald Petrie (a young aristocrat, not the Dr. Petrie of the other films) is a target of Manchu and Ling Moy, but she falls in love with him. Ah Kee shoots and kills Manchu about halfway through the movie, and later shoots and kills Ling Moy just before dying himself. Nayland Smith does not appear in this movie. Kind of a depressing ending. My grade: C.


4. The Mask of Fu Manchu (1932) – The sinister Dr. Fu Manchu (actor Boris Karloff) and a daughter Fah Lo See (actress Myrna Loy) compete with a British scientific expedition to find the sword of Genghis Khan which they believe will give them mystical power. Nayland Smith (actor Lewis Stone) and his young colleague Terrence Granville (actor Charles Starrett; no Dr. Petrie in this film) thwart Manchu and kill him using his own death ray. This is my favorite Fu Manchu movie mostly because of the superior acting by Karloff and Loy. My grade: B.


5. Drums of Fu Manchu (1943) – This movie was compiled from a 15-part serial shown in theaters in 1940-41. Here Fu Manchu (actor Henry Brandon) seeks to acquire the sceptre of Genghis Khan, which will unite all of Asia under his command. Nayland Smith (actor William Royle) and Dr. Petrie (Olaf Hytten) join with young Allan Parker (Robert Kellard) to defeat the evil doctor and his daughter Fah Lo Suee (actress Gloria Franklin). Unusually for the time, the villains escape at the end. Because the movie is a serial compilation, there is non-stop action but not much coherence. My grade: C+.


6. The Face of Fu Manchu (1965) – The movie opens with the apparent execution of Fu Manchu (actor Christopher Lee) in China. But a few months later in London, Nayland Smith (Nigel Green) and Dr. Petrie (Howard Marion-Crawford) deduce that Manchu is still alive. Manchu and his daughter Lin Tang (actress Tsai Chin) kidnap a scientist who has knowledge of how to use a rare poppy seed to create a deadly poison. In the end, it looks like Manchu and Tang have succeeded, but Smith has booby-trapped Manchu’s hideout and it’s destroyed. This is the first of five consecutive Fu Manchu films starring Christopher Lee as Manchu and Tsai Chin as daughter Lin Tang, released one per year from 1965 to 1969. Nayland Smith was played by three different actors (Nigel Green once, Douglas Wilmer twice, Richard Greene twice). My grade: B-.


7. The Brides of Fu Manchu (1966) – Dr. Fu Manchu (Christopher Lee) and his daughter Lin Tang (Tsai Chin) kidnap the daughters of scientists and then blackmail the fathers into helping create a secret sonic superweapon to conquer the world. Nayland Smith (Douglas Wilmer) and Dr. Petrie (Howard Marion-Crawford) defeat the evil doctor and rescue the girls. My grade: C+.


8. The Vengeance of Fu Manchu (1967) – In this one, Fu Manchu (Christopher Lee) and his evil daughter Lin Tang (Tsai Chin) replace Nayland Smith (Douglas Wilmer) with a lookalike. The idea is to create a giant worldwide organization of criminals and control the entire world. Once again Smith and Dr. Petrie (Howard Marion-Crawford) defeat the villains. My grade: C.


9. The Blood of Fu Manchu (1968) – The evil Fu Manchu (actor Christopher Lee) and his equally evil daughter Lin Tang (Tsai Chin) have established a new hideout in the Amazon jungle. They have developed a deadly poison that affects men but not women, and they plan to use women as carriers to assassinate political leaders around the world and seize control. Nayland Smith (actor Richard Greene) and Dr. Petrie (Howard Marion-Crawford) defeat the doctor again. My grade: C.


10. The Castle of Fu Manchu (1969) – Fu Manchu (Christopher Lee) and his daughter Lin Tang (actress Tsai Chin) are at it again, this time based in a castle in Istanbul. They are building a new secret superweapon, financed in part by opium smuggling. And once again, Nayland Smith (now working for Interpol, played by actor Richard Greene) and Dr. Petrie (Howard Marion-Crawford) foil the plan and rescue the hostages. My grade: C+.


Special Mention


In addition to the ten films listed above, there was a short-lived TV series titled “The Adventures of Dr. Fu Manchu”. The show ran for 13 episodes in 1956. It starred Glen Gordon as Dr. Fu Manchu, actress Laurette Luez as Manchu’s semi-unwilling slave/assistant Karameneh, Lester Matthews as Nayland Smith, and Clark Howat as Dr. Petrie. I kind of like these old shows even though they feel like typical 1950s adventure TV. At the end of every episode, Manchu would be defeated and he would pick up the black King’s bishop from an ornate chess board and break it in half in frustration. My grade: C-/B- depending on episode.


Notice the chess board is oriented incorrectly — the lower left square should be black, not white. Argh! Annoying that TV people can’t seem to ever get this correct.



Here are three representative book covers. Left: The first edition of the first novel from 1913. Center: A cover from 1959. Right: A cover from 1962.



“Multi-Class Classification Using LightGBM” in Visual Studio Magazine

I wrote an article titled “Multi-Class Classification Using LightGBM” in the May 2024 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/Articles/2024/05/02/LightGBM-multi-class-classification.aspx.

A multi-class classification problem is one where the goal is to predict a discrete variable that has three or more possible values. For example, you might want to predict a person’s political leaning (conservative, moderate, liberal) from sex, age, state of residence and annual income. There are many machine learning techniques for multi-class classification. One of the most powerful techniques is to use the LightGBM (lightweight gradient boosting machine) system.

LightGBM is a sophisticated, open-source, tree-based system that was introduced in 2017. LightGBM can perform multi-class classification, binary classification (predict one of two possible values), regression (predict a single numeric value) and ranking.

My article presents a complete end-to-end demo. LightGBM has three programming language interfaces — C, Python and R. The demo program uses the Python language API.

I used one of my standard synthetic datasets. The raw data look like:

F  24  michigan  29500.00  liberal
M  39  oklahoma  51200.00  moderate
F  63  nebraska  75800.00  conservative
M  36  michigan  44500.00  moderate
F  27  nebraska  28600.00  liberal
. . .

When using LightGBM, you encode categorical data using ordinal encoding. Unlike most machine learning classification techniques, you don’t need to normalize numeric data. The encoded data looks like:

1, 24, 0, 29500.00, 2
0, 39, 2, 51200.00, 1
1, 63, 1, 75800.00, 0
0, 36, 0, 44500.00, 1
1, 27, 1, 28600.00, 2
. . .
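
To make the encoding concrete, here is a minimal Python sketch (my illustration, not code from the article) of ordinal encoding using the mappings implied by the data above:

  sex_map = {"M": 0, "F": 1}
  state_map = {"michigan": 0, "nebraska": 1, "oklahoma": 2}
  politics_map = {"conservative": 0, "moderate": 1, "liberal": 2}

  def encode(line):
      sex, age, state, income, politics = line.split()
      return [sex_map[sex], int(age), state_map[state],
              float(income), politics_map[politics]]

  print(encode("F  24  michigan  29500.00  liberal"))
  # [1, 24, 0, 29500.0, 2]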

My article explains how to install Python and LightGBM for readers who are new to both.

The key statements that create and train the demo LightGBM multi-class classifier are:

  print("Creating and training LGBM multi-class model ")
  params = {
    # 'objective': 'multiclass',  # not needed
    'boosting_type': 'gbdt',  # default
    'num_leaves': 31,  # default
    'max_depth': -1,  # default (unlimited)
    'n_estimators': 50,  # default = 100
    'learning_rate': 0.05,  # default = 0.10
    'min_data_in_leaf': 5,  # default = 20
    'random_state': 0,
    'verbosity': -1  # fatal msgs only; default = 1 (info)
  }
  model = lgbm.LGBMClassifier(**params)  # scikit API
  model.fit(train_x, train_y)
  print("Done ")

The main challenge when using LightGBM is wading through the dozens of parameters. The LGBMClassifier class/object has 19 parameters (num_leaves, max_depth and so on) and behind the scenes there are 57 Learning Control Parameters (min_data_in_leaf, bagging_fraction and so on), for a total of 76 parameters to deal with. A large section of my article explains which parameters to change and which to leave alone.

Arguably, the two most powerful techniques for multi-class classification on non-trivial datasets are neural networks and tree boosting. In some recent multi-class classification challenges, LightGBM entries have dominated the contest leader board. This may be due, in part, to the fact that LightGBM can be used out-of-the-box, which leaves a lot of time for hyperparameter fine-tuning. Using a neural network classifier requires significantly more background knowledge and effort.



It’s sometimes difficult to classify films because several different genres can be represented. One of my favorite classifications is science fiction mystery movies.

Left: Dark City (1998) – A man wakes up next to a murdered prostitute. He can’t remember who he is or how he got there. He is pursued by the creepy Strangers. Who are they? Why is it always night time?

Center: Gog (1954) – A government investigator (Richard Egan) is sent to a super-secret underground desert laboratory complex to solve a series of bizarre deaths. Which of the many suspects is responsible?

Right: The Power (1968) – One of a group of six scientists, including one played by George Hamilton, has super mind control powers and is murdering the others in the group one by one. Who is it and how can he/she be stopped?



Data Anomaly Detection Using Principal Component Analysis (PCA) Reconstruction Error

One evening, while I was walking my two dogs, I thought about the possibility of looking for data anomalies by analyzing principal component analysis (PCA) reconstruction error. Bottom line: the technique works, but it just doesn’t feel right to me.

The ideas here are extremely complex and can only be explained by using a concrete example. I implemented the idea using raw C#. Using Python and the scikit library would have been much, much easier. I started with a small, 12-item subset of the Penguin dataset:

[ 0]     39.5     17.4    186.0   3800.0
[ 1]     40.3     18.0    195.0   3250.0
[ 2]     36.7     19.3    193.0   3450.0
[ 3]     38.9     17.8    181.0   3625.0
[ 4]     46.5     17.9    192.0   3500.0
[ 5]     45.4     18.7    188.0   3525.0
[ 6]     45.2     17.8    198.0   3950.0
[ 7]     46.1     18.2    178.0   3250.0
[ 8]     46.1     13.2    211.0   4500.0
[ 9]     48.7     14.1    210.0   4450.0
[10]     46.5     13.5    210.0   4550.0
[11]     45.4     14.6    211.0   4800.0

Each item is one of three species of penguin. The fields are bill length, bill width, flipper length, body mass.

I performed z-score standardization on the source data — this is required for PCA. Then I computed the eigenvalues and eigenvectors of the standardized data — this is one of the most complex operations in numerical programming. To compute the eigens, I used the singular value decomposition (SVD) technique (I could have used the classical covariance matrix technique).

Because the source data has 12 rows and 4 columns, there are 4 eigenvalues and 4 eigenvectors, each with 4 values. The percentages of variance explained by the 4 eigenvectors are 0.7801, 0.1578, 0.0409, 0.0211, and so the variance explained by just the first 2 eigenvectors is 0.7801 + 0.1578 = 0.9379.

I used the first 2 eigenvectors to reconstruct the source data. Then I computed the Euclidean distance as a measure of error between the source data and the reconstructed data:

[ 0]     39.6     17.8    191.4   3662.2  | recon err =  137.8913
[ 1]     40.3     18.2    189.1   3556.6  | recon err =  306.6433
[ 2]     36.5     18.6    188.2   3511.9  | recon err =   62.1100
[ 3]     39.1     18.5    187.7   3489.6  | recon err =  135.5649
[ 4]     46.4     17.7    189.2   3569.5  | recon err =   69.5266
[ 5]     45.3     18.2    186.8   3455.4  | recon err =   69.6141
[ 6]     45.0     16.8    195.1   3844.0  | recon err =  106.0892
[ 7]     46.3     18.9    182.0   3231.4  | recon err =   19.0492
[ 8]     46.3     13.9    211.7   4619.6  | recon err =  119.6264
[ 9]     48.7     14.1    209.0   4497.1  | recon err =   47.1339
[10]     46.6     13.9    211.1   4594.7  | recon err =   44.7474
[11]     45.2     13.9    211.7   4618.0  | recon err =  182.0142

Based on this analysis, the largest reconstruction error is 306.6433 which is associated with data item [1] and so this is “the most anomalous” in some sense. The source item [1] and its reconstructed item are:

original:      40.3     18.0    195.0   3250.0
reconstructed: 40.3     18.2    189.1   3556.6

The reconstruction error is dominated by the body mass term. Ugh. This makes sense because the magnitudes of the body mass values are much greater than those of the other variables. This means you’d probably have to normalize the source data first (to get all variable values into the same range) and then z-score standardize the data (to accommodate PCA).
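
For reference, here is a minimal Python / scikit-learn sketch of the same reconstruction-error idea (my illustration; the demo itself was raw C#, and exact error values may differ slightly depending on standardization details):

  import numpy as np
  from sklearn.decomposition import PCA

  X = np.array([
    [39.5, 17.4, 186.0, 3800.0], [40.3, 18.0, 195.0, 3250.0],
    [36.7, 19.3, 193.0, 3450.0], [38.9, 17.8, 181.0, 3625.0],
    [46.5, 17.9, 192.0, 3500.0], [45.4, 18.7, 188.0, 3525.0],
    [45.2, 17.8, 198.0, 3950.0], [46.1, 18.2, 178.0, 3250.0],
    [46.1, 13.2, 211.0, 4500.0], [48.7, 14.1, 210.0, 4450.0],
    [46.5, 13.5, 210.0, 4550.0], [45.4, 14.6, 211.0, 4800.0]])

  mu = X.mean(axis=0); sd = X.std(axis=0)
  Xs = (X - mu) / sd                             # z-score standardize
  pca = PCA(n_components=2).fit(Xs)              # uses SVD internally
  Xr = pca.inverse_transform(pca.transform(Xs))  # reconstruct
  Xrecon = (Xr * sd) + mu                        # back to original units
  err = np.linalg.norm(X - Xrecon, axis=1)       # Euclidean recon error
  print(err.argmax(), err.max())  # expect item [1] to be most anomalous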

So, technically, the PCA reconstruction error technique works. But the technique just doesn’t feel right to me, based on many years of experience.

The technique is very, very, very complex. And complex is almost always bad. Because applying PCA requires z-score standardized data, the technique works only with strictly numeric data, not mixed numeric and categorical data. And because reconstruction must use a discrete number of eigenvectors, the technique is not very granular.

There may be some scenarios where anomaly detection based on PCA reconstruction error is useful, but I suspect other techniques are better choices in almost all situations. But it was an interesting exploration anyway.

I usually post my demo code, but I’m not going to do so for this topic. The code is very long and very ugly.



Most of the experienced engineers I know have a good intuitive sense of when a software system design is overly complex. I’m no expert on bicycles, but my intuition tells me that these two examples might be a bit too complex.



One-Shot Learning, Few-Shot Learning, Zero-Shot Learning, and Fine-Tuning

The terms one-shot learning, few-shot learning, zero-shot learning, and fine-tuning don’t have universally agreed-upon definitions. All four terms are kinds of “transfer learning” where the goal is to start with an existing model and use it on a new problem. That said, here’s a set of four brief, incomplete, but reasonable, explanations.

One-Shot Learning – Usually applies to image classification. The goal is to classify an image when you only have one example of each class. For example, a bank may have just one example of each customer’s signature, and a new signature on a receipt must be classified as authentic or fraudulent. One common technique is to use what’s called a Siamese network architecture.

Few-Shot Learning – Usually applies to image classification. The goal is to classify an image when you only have a few (perhaps a dozen or so) examples of each class of labeled data. For example, you have a trained model that can classify hundreds of different classes of animals, but your system acquires just five images of a new previously unseen type of animal, such as an axolotl. One of many possible techniques is to generate hundreds of variations of the five new images (by stretching, horizontally inverting, etc.), add them to the original training data, and then retrain the model. A more sophisticated approach for few-shot learning is called “model agnostic meta-learning” (MAML), where a base model is trained specifically so that new classes can be trained quickly and easily.
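
As a concrete illustration of the generate-variations idea, here is a minimal Python sketch using the Pillow library (my illustration; the file name is hypothetical):

  from PIL import Image, ImageOps

  def make_variations(img):
      variants = [ImageOps.mirror(img)]  # horizontal flip
      for angle in (-10, -5, 5, 10):     # small rotations
          variants.append(img.rotate(angle))
      w, h = img.size                    # horizontal stretch, then crop
      variants.append(img.resize((int(w * 1.2), h)).crop((0, 0, w, h)))
      return variants

  img = Image.open("axolotl_01.jpg")  # one of the five new images
  augmented = make_variations(img)    # compose/repeat to get hundreds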

Zero-Shot Learning – Often applies to image classification. The goal is to classify an image using an existing model that has never been trained on the image. For example, you have a trained model that can classify hundreds of different classes of automobiles but you need it to classify a new previously unseen example, such as a Chrysler PT Cruiser. There’s no way to create new knowledge from nothing, so one common technique for zero-shot learning is kind of a cheat: you incorporate an auxiliary set of data images along with text information such as “the 2001 PT Cruiser has a boxy look that resembles the 2006 Chevrolet HHR.” The auxiliary information acts as a different kind of labeling.

Fine-Tuning – Often applies to language models. The goal is to start with a pre-trained model such as GPT-3 that understands basic English and knows general facts from Wikipedia, and add specialized information such as the human resources policies of a company.

A not-so-good strategy is to prepare new specialized training data and then completely retrain the base model, changing all of the billions of parameters/weights. A better strategy is to prepare new specialized training data and then train in a way that changes only the parameters/weights of the last few layers, or, even better, use the specialized training data to create a new layer (or a few layers) that can be appended to the base model, as sketched below. This last approach allows you to have just one base model, plus many relatively small modules for specialized tasks.
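
Here is a tiny PyTorch sketch of the freeze-the-base, append-a-new-layer idea (the base network below is just a stand-in for a real pre-trained model, and all sizes are made up):

  import torch
  import torch.nn as nn

  base = nn.Sequential(nn.Linear(768, 768), nn.ReLU())  # stand-in base
  for p in base.parameters():
      p.requires_grad = False  # freeze all base weights

  head = nn.Linear(768, 10)    # new task-specific layer
  model = nn.Sequential(base, head)

  # only the new head's parameters are updated during fine-tuning
  optimizer = torch.optim.Adam(head.parameters(), lr=1.0e-4)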

Fine-tuning is one of the most active areas of AI research. A relatively new development is called “continual agent learning”, where a software agent/module somehow learns-to-learn over its lifetime, such that it can eventually learn very quickly from its own experience.



I can fine-tune a machine learning model. But when I played guitar in a band, I had to use a tuning device instead of manual tuning.

Left: This amazing animation is from animusic.com and it shows an imaginary multi-stringed six-in-one instrument being played by automated wooden fingers. Wonderful! This screenshot doesn’t do it justice.

Right: Here’s an old photo of me playing with my band at the Disneyland Employees Summer Party, called the Banana Ball (because it originated with Jungle Cruise employees). I’m on the far right playing bass guitar. On the far left on rhythm guitar is my good friend Paul Ruiz. In the back on drums is my good friend Jeff Rhoads. The singer is George Trullinger, who went on to a long successful career as a professional singer in Las Vegas.



Time Series Regression Using a Standard Neural Network With C#

Time series regression (TSR) problems are very challenging. There are dozens of techniques — and the fact that there are so many techniques for TSR indicates that there’s no single best approach.

There’s been quite a bit of recent research activity that looks at attacking TSR problems using modern neural techniques, specifically systems that use transformer architecture. I decided to revisit one of my preferred techniques, which is to use a standard neural network. The ideas are a bit complicated and are best explained by a concrete example.

I used the Airline Passengers dataset. The source data looks like:

"1949-01";112
"1949-02";118
"1949-03";132
"1949-04";129
"1949-05";121
"1949-06";135
"1949-07";148
. . . 
"1960-12";432

There are 144 lines, with dates from January 1949 to December 1960. The values are number of airline passengers, in thousands, so the first item means there were a total of 112,000 airline passengers in January 1949 (not very many back then). When graphed, the data looks like this (where all passenger counts have been divided by 100):



Note: The Airline Passenger dataset originally appeared on page 531 of the first edition of the famous book “Time Series Analysis: Forecasting and Control” (1970) by G. Box and G. Jenkins.


I preprocessed the raw data to create a text file of sliding window values that looks like:

1.12, 1.18, 1.32, 1.29, 1.21
1.18, 1.32, 1.29, 1.21, 1.35
1.32, 1.29, 1.21, 1.35, 1.48
1.29, 1.21, 1.35, 1.48, 1.48
. . .
6.06, 5.08, 4.61, 3.90, 4.32

Each consecutive set of four values (a “window”) will be used to predict the next value. So the first input is (1.12, 1.18, 1.32, 1.29) and the value to predict is 1.21. Because of the offset, there are 140 training items.
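
The windowing itself is simple. A minimal Python sketch of the preprocessing (my illustration; the demo data file was prepared separately) looks like:

  series = [1.12, 1.18, 1.32, 1.29, 1.21, 1.35, 1.48]  # first 7 of 144
  window = 4
  for i in range(len(series) - window):
      vals = series[i : i + window + 1]  # 4 predictors plus 1 target
      print(", ".join("%0.2f" % v for v in vals))
  # 1.12, 1.18, 1.32, 1.29, 1.21
  # 1.18, 1.32, 1.29, 1.21, 1.35
  # 1.32, 1.29, 1.21, 1.35, 1.48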

I implemented a 4-12-1 neural network with tanh() hidden node activation and identity() output node activation. I trained the network using 10,000 iterations, with a learning rate of 0.01, and a batch size of 1. Interestingly, I consistently got better results using a batch size of 1 instead of the more usual size of 10 or so. In retrospect, this makes sense: with only 140 training items, a batch size of 1 gives many more weight updates per epoch, and on such a small dataset the extra gradient noise isn’t harmful.

The trained neural network model predicts with 0.8500 accuracy (119 out of 140 correct), where a correct prediction is defined to be one that is within 10% of the true value.

I implemented a Forecast() method that predicts beyond the source data. I started with the last set of four data points: (5.08, 4.61, 3.90, 4.32) and used them to predict for January 1961, and got 4.52. Then I used that prediction to create the next set of four input values: (4.61, 3.90, 4.32, 4.52) and used them to predict for February 1961. And so on.

I graphed the predicted passenger counts and the forecast counts:

The results look good, at least for a relatively short forecast. Much fun!



I stumbled across some wonderfully creative short videos on YouTube by a guy named Hey Anglomangler. Creepy and fascinating visions of a bizarre future or alternate universe. The videos look like they were created using some sort of AI tool.


Demo code. Replace “lt” (less than), “gt”, “lte”, “gte”, “and” with Boolean operator symbols.

using System;
using System.IO;
using System.Collections.Generic;

// data from Box, G., Jenkins, G., and Reinsel, G. (1976)
// "Time Series Analysis, Forecasting and Control."

namespace NeuralNetworkTimeSeries
{
  internal class NeuralTimeSeriesProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin neural network" +
        " times series demo");
      Console.WriteLine("Goal is to predict airline " +
        "passengers over time ");
      Console.WriteLine("Data from January 1949 to" +
        " December 1960 ");

      Console.WriteLine("\nLoading windowed and" +
        " normalized data from file ");

      string dataFile =
        "..\\..\\..\\Data\\airline_all.txt";
 
      double[][] dataX = Utils.MatLoad(dataFile,
        new int[] { 0, 1, 2, 3 }, ',', "#");
      double[] dataY =
        Utils.MatToVec(Utils.MatLoad(dataFile,
        new int[] { 4 }, ',', "#"));

      Console.WriteLine("\nFirst three X data: ");
      for (int i = 0; i "lt" 3; ++i)
        Utils.VecShow(dataX[i], 2, 6, 10, true);

      Console.WriteLine("\nFirst three target Y: ");
      for (int i = 0; i "lt" 3; ++i)
        Console.WriteLine(dataY[i].ToString("F2"));

      int numInput = 4; // number predictors
      int numHidden = 12;
      int numOutput = 1; // regression

      Console.WriteLine("\nCreating a " + numInput + "-" +
        numHidden + "-" + numOutput + " neural network");
      NeuralNetwork nn = new NeuralNetwork(numInput,
        numHidden, numOutput, 0);

      int maxEpochs = 10000;
      double lrnRate = 0.01;
      int batSize = 1;
      Console.WriteLine("\nSetting maxEpochs = " +
        maxEpochs);
      Console.WriteLine("Setting learnRate = " +
        lrnRate.ToString("F2"));
      Console.WriteLine("Setting batch size = " +
        batSize);

      Console.WriteLine("\nStarting training");
      nn.Train(dataX, dataY, lrnRate, batSize, maxEpochs);
      Console.WriteLine("Done");

      double[] weights = nn.GetWeights();
      Console.WriteLine("\nFinal neural network model" +
        " weights and biases:\n");
      Utils.VecShow(weights, 2, 7, 10, true);

      double acc = nn.Accuracy(dataX, dataY, 0.10);
      Console.WriteLine("\nModel accuracy (within 10%" +
        " actual) = " + acc.ToString("F4"));

      Console.WriteLine("\nForecasting ahead 12 steps ");
      double[] forecast = nn.Forecast(new double[] { 5.08,
        4.61, 3.90, 4.32 }, 12);
      Utils.VecShow(forecast, 2, 6, 12, true);
   
      Console.WriteLine("\nEnd time series demo ");
      Console.ReadLine();

    } // Main
  } // Program

  public class NeuralNetwork
  {
    private int ni; // number input nodes
    private int nh;
    private int no;

    private double[] iNodes;
    private double[][] ihWeights; // input-hidden
    private double[] hBiases;
    private double[] hNodes;

    private double[][] hoWeights; // hidden-output
    private double[] oBiases;
    private double[] oNodes;  // single val as array

    // gradients
    private double[][] ihGrads;
    private double[] hbGrads;
    private double[][] hoGrads;
    private double[] obGrads;

    private Random rnd;

    // ------------------------------------------------------

    public NeuralNetwork(int numIn, int numHid,
      int numOut, int seed)
    {
      this.ni = numIn;
      this.nh = numHid;
      this.no = numOut;  // 1 for regression

      this.iNodes = new double[numIn];

      this.ihWeights = MatCreate(numIn, numHid);
      this.hBiases = new double[numHid];
      this.hNodes = new double[numHid];

      this.hoWeights = MatCreate(numHid, numOut);
      this.oBiases = new double[numOut];  // [1]
      this.oNodes = new double[numOut];  // [1]

      this.ihGrads = MatCreate(numIn, numHid);
      this.hbGrads = new double[numHid];
      this.hoGrads = MatCreate(numHid, numOut);
      this.obGrads = new double[numOut];

      this.rnd = new Random(seed);
      this.InitWeights(); // all weights and biases
    } // ctor

    // ------------------------------------------------------

    private static double[][] MatCreate(int rows,
      int cols)
    {
      // helper for ctor
      double[][] result = new double[rows][];
      for (int i = 0; i "lt" rows; ++i)
        result[i] = new double[cols];
      return result;
    }

    // ------------------------------------------------------

    private void InitWeights() // helper for ctor
    {
      // weights and biases to small random values
      double lo = -0.01; double hi = +0.01;
      int numWts = (this.ni * this.nh) +
        (this.nh * this.no) + this.nh + this.no;
      double[] initialWeights = new double[numWts];
      for (int i = 0; i "lt" initialWeights.Length; ++i)
        initialWeights[i] =
          (hi - lo) * rnd.NextDouble() + lo;
      this.SetWeights(initialWeights);
    }

    // ------------------------------------------------------

    public void SetWeights(double[] wts)
    {
      // copy serialized weights and biases in wts[] 
      // to ih weights, ih biases, ho weights, ho biases
      int numWts = (this.ni * this.nh) +
        (this.nh * this.no) + this.nh + this.no;
      if (wts.Length != numWts)
        throw new Exception("Bad array in SetWeights");

      int k = 0; // points into wts param

      for (int i = 0; i "lt" this.ni; ++i)
        for (int j = 0; j "lt" this.nh; ++j)
          this.ihWeights[i][j] = wts[k++];
      for (int i = 0; i "lt" this.nh; ++i)
        this.hBiases[i] = wts[k++];
      for (int i = 0; i "lt" this.nh; ++i)
        for (int j = 0; j "lt" this.no; ++j)
          this.hoWeights[i][j] = wts[k++];
      for (int i = 0; i "lt" this.no; ++i)
        this.oBiases[i] = wts[k++];
    }

    // ------------------------------------------------------

    public double[] GetWeights()
    {
      int numWts = (this.ni * this.nh) +
        (this.nh * this.no) + this.nh + this.no;
      double[] result = new double[numWts];
      int k = 0;
      for (int i = 0; i "lt" ihWeights.Length; ++i)
        for (int j = 0; j "lt" this.ihWeights[0].Length; ++j)
          result[k++] = this.ihWeights[i][j];
      for (int i = 0; i "lt" this.hBiases.Length; ++i)
        result[k++] = this.hBiases[i];
      for (int i = 0; i "lt" this.hoWeights.Length; ++i)
        for (int j = 0; j "lt" this.hoWeights[0].Length; ++j)
          result[k++] = this.hoWeights[i][j];
      for (int i = 0; i "lt" this.oBiases.Length; ++i)
        result[k++] = this.oBiases[i];
      return result;
    }

    // ------------------------------------------------------

    public double ComputeOutput(double[] x)
    {
      double[] hSums = new double[this.nh]; // scratch 
      double[] oSums = new double[this.no]; // out sums

      for (int i = 0; i "lt" x.Length; ++i)
        this.iNodes[i] = x[i];
      // note: no need to copy x-values unless
      // you implement a ToString.
      // more efficient to simply use the X[] directly.

      // 1. compute i-h sum of weights * inputs
      for (int j = 0; j "lt" this.nh; ++j)
        for (int i = 0; i "lt" this.ni; ++i)
          hSums[j] += this.iNodes[i] *
            this.ihWeights[i][j]; // note +=

      // 2. add biases to hidden sums
      for (int i = 0; i "lt" this.nh; ++i)
        hSums[i] += this.hBiases[i];

      // 3. apply hidden activation
      for (int i = 0; i "lt" this.nh; ++i)
        this.hNodes[i] = HyperTan(hSums[i]);

      // 4. compute h-o sum of wts * hOutputs
      for (int j = 0; j "lt" this.no; ++j)
        for (int i = 0; i "lt" this.nh; ++i)
          oSums[j] += this.hNodes[i] *
            this.hoWeights[i][j];  // [1]

      // 5. add biases to output sums
      for (int i = 0; i "lt" this.no; ++i)
        oSums[i] += this.oBiases[i];

      // 6. apply output activation
      for (int i = 0; i "lt" this.no; ++i)
        this.oNodes[i] = Identity(oSums[i]);

      return this.oNodes[0];  // single value
    }

    // ------------------------------------------------------

    private static double HyperTan(double x)
    {
      if (x "lt" -10.0) return -1.0;
      else if (x "gt" 10.0) return 1.0;
      else return Math.Tanh(x);
    }

    // ------------------------------------------------------

    private static double Identity(double x)
    {
      return x;
    }

    // ------------------------------------------------------

    private void ZeroOutGrads()
    {
      for (int i = 0; i "lt" this.ni; ++i)
        for (int j = 0; j "lt" this.nh; ++j)
          this.ihGrads[i][j] = 0.0;

      for (int j = 0; j "lt" this.nh; ++j)
        this.hbGrads[j] = 0.0;

      for (int j = 0; j "lt" this.nh; ++j)
        for (int k = 0; k "lt" this.no; ++k)
          this.hoGrads[j][k] = 0.0;

      for (int k = 0; k "lt" this.no; ++k)
        this.obGrads[k] = 0.0;
    }  // ZeroOutGrads()

    // ------------------------------------------------------

    private void AccumGrads(double y)
    {
      double[] oSignals = new double[this.no];
      double[] hSignals = new double[this.nh];

      // 1. compute output node scratch signals 
      for (int k = 0; k "lt" this.no; ++k)
        oSignals[k] = 1 * (this.oNodes[k] - y);

      // 2. accum hidden-to-output gradients 
      for (int j = 0; j "lt" this.nh; ++j)
        for (int k = 0; k "lt" this.no; ++k)
          hoGrads[j][k] +=
           oSignals[k] * this.hNodes[j];

      // 3. accum output node bias gradients
      for (int k = 0; k "lt" this.no; ++k)
        obGrads[k] +=
         oSignals[k] * 1.0;  // 1.0 dummy 

      // 4. compute hidden node signals
      for (int j = 0; j "lt" this.nh; ++j)
      {
        double sum = 0.0;
        for (int k = 0; k "lt" this.no; ++k)
          sum += oSignals[k] * this.hoWeights[j][k];

        double derivative =
          (1 - this.hNodes[j]) *
          (1 + this.hNodes[j]);  // assumes tanh
        hSignals[j] = derivative * sum;
      }

      // 5. accum input-to-hidden gradients
      for (int i = 0; i "lt" this.ni; ++i)
        for (int j = 0; j "lt" this.nh; ++j)
          this.ihGrads[i][j] +=
            hSignals[j] * this.iNodes[i];

      // 6. accum hidden node bias gradients
      for (int j = 0; j "lt" this.nh; ++j)
        this.hbGrads[j] +=
          hSignals[j] * 1.0;  // 1.0 dummy
    } // AccumGrads

    // ------------------------------------------------------

    private void UpdateWeights(double lrnRate)
    {
      // assumes all gradients computed
      // 1. update input-to-hidden weights
      for (int i = 0; i "lt" this.ni; ++i)
      {
        for (int j = 0; j "lt" this.nh; ++j)
        {
          double delta = -1.0 * lrnRate *
            this.ihGrads[i][j];
          this.ihWeights[i][j] += delta;
        }
      }

      // 2. update hidden node biases
      for (int j = 0; j "lt" this.nh; ++j)
      {
        double delta = -1.0 * lrnRate *
          this.hbGrads[j];
        this.hBiases[j] += delta;
      }

      // 3. update hidden-to-output weights
      for (int j = 0; j "lt" this.nh; ++j)
      {
        for (int k = 0; k "lt" this.no; ++k)
        {
          double delta = -1.0 * lrnRate *
            this.hoGrads[j][k];
          this.hoWeights[j][k] += delta;
        }
      }

      // 4. update output node biases
      for (int k = 0; k "lt" this.no; ++k)
      {
        double delta = -1.0 * lrnRate *
          this.obGrads[k];
        this.oBiases[k] += delta;
      }
    } // UpdateWeights()

    // ------------------------------------------------------

    public void Train(double[][] trainX, double[] trainY,
      double lrnRate, int batSize, int maxEpochs)
    {
      int n = trainX.Length; 
      int batchesPerEpoch = n / batSize; 
      int freq = maxEpochs / 5;  // to show progress

      int[] indices = new int[n];
      for (int i = 0; i "lt" n; ++i)
        indices[i] = i;

      // ----------------------------------------------------
      //
      // for epoch = 0; epoch "lt" maxEpochs; ++epoch
      //   shuffle indices
      //   for batch = 0; batch "lt" bpe; ++batch
      //     for item = 0; item "lt" bs; ++item
      //       compute output
      //       accum grads
      //     end-item
      //     update weights
      //     zero-out grads
      //   end-batches
      // end-epochs
      //
      // ----------------------------------------------------

      for (int epoch = 0; epoch "lt" maxEpochs; ++epoch)
      {
        Shuffle(indices);
        int ptr = 0;  // points into indices
        for (int batIdx = 0; batIdx "lt" batchesPerEpoch;
          ++batIdx)
        {
          for (int i = 0; i "lt" batSize; ++i)
          {
            int ii = indices[ptr++];  // compute output
            double[] x = trainX[ii];
            double y = trainY[ii];
            this.ComputeOutput(x);  // into this.oNodes
            this.AccumGrads(y);
          }
          this.UpdateWeights(lrnRate);
          this.ZeroOutGrads(); // prep for next batch
        } // batches

        if (epoch % freq == 0)  // progress 
        {
          double mse = this.Error(trainX, trainY);
          double acc = this.Accuracy(trainX, trainY, 0.10);

          string s1 = "epoch: " + 
            epoch.ToString().PadLeft(4);
          string s2 = "  MSE = " + 
            mse.ToString("F4");
          string s3 = "  acc (10%) = " + 
            acc.ToString("F4");
          Console.WriteLine(s1 + s2 + s3);
        }
      } // epoch
    } // Train

    // ------------------------------------------------------

    private void Shuffle(int[] sequence)
    {
      for (int i = 0; i "lt" sequence.Length; ++i)
      {
        int r = this.rnd.Next(i, sequence.Length);
        int tmp = sequence[r];
        sequence[r] = sequence[i];
        sequence[i] = tmp;
        //sequence[i] = i; // for testing
      }
    } // Shuffle

    // ------------------------------------------------------

    public double Error(double[][] trainX, double[] trainY)
    {
      // MSE
      int n = trainX.Length;
      double sumSquaredError = 0.0;
      for (int i = 0; i "lt" n; ++i)
      {
        double predY = this.ComputeOutput(trainX[i]);
        double actualY = trainY[i];
        sumSquaredError += (predY - actualY) *
          (predY - actualY);
      }
      return sumSquaredError / n;
    } // Error

    // ------------------------------------------------------

    public double Accuracy(double[][] dataX,
      double[] dataY, double pctClose)
    {
      // percentage of predictions within pctClose of target
      int n = dataX.Length;
      int nCorrect = 0;
      int nWrong = 0;
      for (int i = 0; i "lt" n; ++i)
      {
        double predY = this.ComputeOutput(dataX[i]);
        double actualY = dataY[i];

        //Console.WriteLine("actual Y = " + 
        //  actualY.ToString("F2"));
        //Console.WriteLine("pred   Y = " + 
        //  predY.ToString("F2"));

        if (Math.Abs(predY - actualY) "lt"
          Math.Abs(pctClose * actualY))
          ++nCorrect;
        else
          ++nWrong;
      }
      return (nCorrect * 1.0) / (nCorrect + nWrong);
    }

    public double[] Forecast(double[] start, int steps)
    {
      double[] result = new double[steps];
      int n = start.Length;
      double[] currInput = new double[start.Length];
      for (int i = 0; i "lt" n; ++i)
        currInput[i] = start[i];

      for (int step = 0; step "lt" steps; ++step)
      {
        double pred = this.ComputeOutput(currInput);
        //Console.WriteLine(pred.ToString("F2"));
        result[step] = pred;

        // create new input
        for (int i = 0; i "lt" n-1; ++i)
          currInput[i] = currInput[i + 1];
 
        currInput[n - 1] = pred;
      }
      return result;
    }

    // ------------------------------------------------------

    public void SaveWeights(string fn)
    {
      FileStream ofs = new FileStream(fn, FileMode.Create);
      StreamWriter sw = new StreamWriter(ofs);

      double[] wts = this.GetWeights();
      for (int i = 0; i "lt" wts.Length; ++i)
        sw.WriteLine(wts[i].ToString("F8"));  // one per line
      sw.Close();
      ofs.Close();
    }

    public void LoadWeights(string fn)
    {
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      List"lt"double"gt" listWts = new List"lt"double"gt"();
      string line = "";  // one wt per line
      while ((line = sr.ReadLine()) != null)
      {
        // if (line.StartsWith(comment) == true)
        //   continue;
        listWts.Add(double.Parse(line));
      }
      sr.Close();
      ifs.Close();

      double[] wts = listWts.ToArray();
      this.SetWeights(wts);
    }

    // ------------------------------------------------------

  } // NeuralNetwork class


  public class Utils
  {
    // ------------------------------------------------------

    public static double[][] MatLoad(string fn,
     int[] usecols, char sep, string comment)
    {
      List"lt"double[]"gt" result = new List"lt"double[]"gt"();
      string line = "";
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true)
          continue;
        string[] tokens = line.Split(sep);
        List"lt"double"gt" lst = new List"lt"double"gt"();
        for (int j = 0; j "lt" usecols.Length; ++j)
          lst.Add(double.Parse(tokens[usecols[j]]));
        double[] row = lst.ToArray();
        result.Add(row);
      }
      sr.Close();
      ifs.Close();
      return result.ToArray();
    }

    // ------------------------------------------------------

    public static double[] MatToVec(double[][] m)
    {
      int rows = m.Length;
      int cols = m[0].Length;
      double[] result = new double[rows * cols];
      int k = 0;
      for (int i = 0; i "lt" rows; ++i)
        for (int j = 0; j "lt" cols; ++j)
          result[k++] = m[i][j];

      return result;
    }

    // ------------------------------------------------------

    public static void MatShow(double[][] m,
      int dec, int wid)
    {
      for (int i = 0; i "lt" m.Length; ++i)
      {
        for (int j = 0; j "lt" m[0].Length; ++j)
        {
          double v = m[i][j];
          if (Math.Abs(v) "lt" 1.0e-8) v = 0.0; // hack
          Console.Write(v.ToString("F" +
            dec).PadLeft(wid));
        }
        Console.WriteLine("");
      }
    }

    // ------------------------------------------------------

    public static void VecShow(int[] vec, int wid)
    {
      for (int i = 0; i "lt" vec.Length; ++i)
        Console.Write(vec[i].ToString().PadLeft(wid));
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static void VecShow(double[] vec,
      int dec, int wid, int lineLen, bool newLine)
    {
      for (int i = 0; i "lt" vec.Length; ++i)
      {
        if (i != 0 "and" i % lineLen == 0)
          Console.WriteLine("");
        double x = vec[i];
        if (Math.Abs(x) "lt" 1.0e-8) x = 0.0;
        Console.Write(x.ToString("F" +
          dec).PadLeft(wid));
      }
      if (newLine == true)
        Console.WriteLine("");
    }

  } // Utils

} // ns

Demo data:

# airline_all.txt
# 4 item window
#
1.12, 1.18, 1.32, 1.29, 1.21
1.18, 1.32, 1.29, 1.21, 1.35
1.32, 1.29, 1.21, 1.35, 1.48
1.29, 1.21, 1.35, 1.48, 1.48
1.21, 1.35, 1.48, 1.48, 1.36
1.35, 1.48, 1.48, 1.36, 1.19
1.48, 1.48, 1.36, 1.19, 1.04
1.48, 1.36, 1.19, 1.04, 1.18
1.36, 1.19, 1.04, 1.18, 1.15
1.19, 1.04, 1.18, 1.15, 1.26
1.04, 1.18, 1.15, 1.26, 1.41
1.18, 1.15, 1.26, 1.41, 1.35
1.15, 1.26, 1.41, 1.35, 1.25
1.26, 1.41, 1.35, 1.25, 1.49
1.41, 1.35, 1.25, 1.49, 1.70
1.35, 1.25, 1.49, 1.70, 1.70
1.25, 1.49, 1.70, 1.70, 1.58
1.49, 1.70, 1.70, 1.58, 1.33
1.70, 1.70, 1.58, 1.33, 1.14
1.70, 1.58, 1.33, 1.14, 1.40
1.58, 1.33, 1.14, 1.40, 1.45
1.33, 1.14, 1.40, 1.45, 1.50
1.14, 1.40, 1.45, 1.50, 1.78
1.40, 1.45, 1.50, 1.78, 1.63
1.45, 1.50, 1.78, 1.63, 1.72
1.50, 1.78, 1.63, 1.72, 1.78
1.78, 1.63, 1.72, 1.78, 1.99
1.63, 1.72, 1.78, 1.99, 1.99
1.72, 1.78, 1.99, 1.99, 1.84
1.78, 1.99, 1.99, 1.84, 1.62
1.99, 1.99, 1.84, 1.62, 1.46
1.99, 1.84, 1.62, 1.46, 1.66
1.84, 1.62, 1.46, 1.66, 1.71
1.62, 1.46, 1.66, 1.71, 1.80
1.46, 1.66, 1.71, 1.80, 1.93
1.66, 1.71, 1.80, 1.93, 1.81
1.71, 1.80, 1.93, 1.81, 1.83
1.80, 1.93, 1.81, 1.83, 2.18
1.93, 1.81, 1.83, 2.18, 2.30
1.81, 1.83, 2.18, 2.30, 2.42
1.83, 2.18, 2.30, 2.42, 2.09
2.18, 2.30, 2.42, 2.09, 1.91
2.30, 2.42, 2.09, 1.91, 1.72
2.42, 2.09, 1.91, 1.72, 1.94
2.09, 1.91, 1.72, 1.94, 1.96
1.91, 1.72, 1.94, 1.96, 1.96
1.72, 1.94, 1.96, 1.96, 2.36
1.94, 1.96, 1.96, 2.36, 2.35
1.96, 1.96, 2.36, 2.35, 2.29
1.96, 2.36, 2.35, 2.29, 2.43
2.36, 2.35, 2.29, 2.43, 2.64
2.35, 2.29, 2.43, 2.64, 2.72
2.29, 2.43, 2.64, 2.72, 2.37
2.43, 2.64, 2.72, 2.37, 2.11
2.64, 2.72, 2.37, 2.11, 1.80
2.72, 2.37, 2.11, 1.80, 2.01
2.37, 2.11, 1.80, 2.01, 2.04
2.11, 1.80, 2.01, 2.04, 1.88
1.80, 2.01, 2.04, 1.88, 2.35
2.01, 2.04, 1.88, 2.35, 2.27
2.04, 1.88, 2.35, 2.27, 2.34
1.88, 2.35, 2.27, 2.34, 2.64
2.35, 2.27, 2.34, 2.64, 3.02
2.27, 2.34, 2.64, 3.02, 2.93
2.34, 2.64, 3.02, 2.93, 2.59
2.64, 3.02, 2.93, 2.59, 2.29
3.02, 2.93, 2.59, 2.29, 2.03
2.93, 2.59, 2.29, 2.03, 2.29
2.59, 2.29, 2.03, 2.29, 2.42
2.29, 2.03, 2.29, 2.42, 2.33
2.03, 2.29, 2.42, 2.33, 2.67
2.29, 2.42, 2.33, 2.67, 2.69
2.42, 2.33, 2.67, 2.69, 2.70
2.33, 2.67, 2.69, 2.70, 3.15
2.67, 2.69, 2.70, 3.15, 3.64
2.69, 2.70, 3.15, 3.64, 3.47
2.70, 3.15, 3.64, 3.47, 3.12
3.15, 3.64, 3.47, 3.12, 2.74
3.64, 3.47, 3.12, 2.74, 2.37
3.47, 3.12, 2.74, 2.37, 2.78
3.12, 2.74, 2.37, 2.78, 2.84
2.74, 2.37, 2.78, 2.84, 2.77
2.37, 2.78, 2.84, 2.77, 3.17
2.78, 2.84, 2.77, 3.17, 3.13
2.84, 2.77, 3.17, 3.13, 3.18
2.77, 3.17, 3.13, 3.18, 3.74
3.17, 3.13, 3.18, 3.74, 4.13
3.13, 3.18, 3.74, 4.13, 4.05
3.18, 3.74, 4.13, 4.05, 3.55
3.74, 4.13, 4.05, 3.55, 3.06
4.13, 4.05, 3.55, 3.06, 2.71
4.05, 3.55, 3.06, 2.71, 3.06
3.55, 3.06, 2.71, 3.06, 3.15
3.06, 2.71, 3.06, 3.15, 3.01
2.71, 3.06, 3.15, 3.01, 3.56
3.06, 3.15, 3.01, 3.56, 3.48
3.15, 3.01, 3.56, 3.48, 3.55
3.01, 3.56, 3.48, 3.55, 4.22
3.56, 3.48, 3.55, 4.22, 4.65
3.48, 3.55, 4.22, 4.65, 4.67
3.55, 4.22, 4.65, 4.67, 4.04
4.22, 4.65, 4.67, 4.04, 3.47
4.65, 4.67, 4.04, 3.47, 3.05
4.67, 4.04, 3.47, 3.05, 3.36
4.04, 3.47, 3.05, 3.36, 3.40
3.47, 3.05, 3.36, 3.40, 3.18
3.05, 3.36, 3.40, 3.18, 3.62
3.36, 3.40, 3.18, 3.62, 3.48
3.40, 3.18, 3.62, 3.48, 3.63
3.18, 3.62, 3.48, 3.63, 4.35
3.62, 3.48, 3.63, 4.35, 4.91
3.48, 3.63, 4.35, 4.91, 5.05
3.63, 4.35, 4.91, 5.05, 4.04
4.35, 4.91, 5.05, 4.04, 3.59
4.91, 5.05, 4.04, 3.59, 3.10
5.05, 4.04, 3.59, 3.10, 3.37
4.04, 3.59, 3.10, 3.37, 3.60
3.59, 3.10, 3.37, 3.60, 3.42
3.10, 3.37, 3.60, 3.42, 4.06
3.37, 3.60, 3.42, 4.06, 3.96
3.60, 3.42, 4.06, 3.96, 4.20
3.42, 4.06, 3.96, 4.20, 4.72
4.06, 3.96, 4.20, 4.72, 5.48
3.96, 4.20, 4.72, 5.48, 5.59
4.20, 4.72, 5.48, 5.59, 4.63
4.72, 5.48, 5.59, 4.63, 4.07
5.48, 5.59, 4.63, 4.07, 3.62
5.59, 4.63, 4.07, 3.62, 4.05
4.63, 4.07, 3.62, 4.05, 4.17
4.07, 3.62, 4.05, 4.17, 3.91
3.62, 4.05, 4.17, 3.91, 4.19
4.05, 4.17, 3.91, 4.19, 4.61
4.17, 3.91, 4.19, 4.61, 4.72
3.91, 4.19, 4.61, 4.72, 5.35
4.19, 4.61, 4.72, 5.35, 6.22
4.61, 4.72, 5.35, 6.22, 6.06
4.72, 5.35, 6.22, 6.06, 5.08
5.35, 6.22, 6.06, 5.08, 4.61
6.22, 6.06, 5.08, 4.61, 3.90
6.06, 5.08, 4.61, 3.90, 4.32

Updating My JavaScript Regression Neural Network

Once or twice a year, I revisit my from-scratch JavaScript implementations of a neural network. The system has enough complexity that there are dozens of ideas that can be explored.

My latest regression version makes many small changes from previous versions. The primary change is that I refactored the train() method from a single, very large method to one that calls three helper functions: zeroOutGrads(), accumGrads(y), and updateWeights(lrnRate). This change required me to store the hidden node and output node gradients as class matrices and vectors rather than as objects local to the train() method.

For my demo program, I used one of my standard synthetic datasets. The goal is to predict a person’s income from sex, age, State, and political leaning. The 240-item tab-delimited raw data looks like:

F   24   michigan   29500.00   liberal
M   39   oklahoma   51200.00   moderate
F   63   nebraska   75800.00   conservative
M   36   michigan   44500.00   moderate
F   27   nebraska   28600.00   liberal
. . .

I encoded sex as M = -1, F = 1, and State as Michigan = 100, Nebraska = 010, Oklahoma = 001, and political leaning as conservative = 100, moderate = 010, liberal = 001. I normalized the numeric data. I divided age values by 100, and divided the target income values by 100,000. The resulting encoded and normalized comma-delimited data looks like:

 1, 0.24, 1, 0, 0, 0.2950, 0, 0, 1
-1, 0.39, 0, 0, 1, 0.5120, 0, 1, 0
 1, 0.63, 0, 1, 0, 0.7580, 1, 0, 0
-1, 0.36, 1, 0, 0, 0.4450, 0, 1, 0
 1, 0.27, 0, 1, 0, 0.2860, 0, 0, 1
. . .
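
A minimal Python sketch of this encoding scheme (my illustration; the actual preprocessing was done separately) looks like:

  sex_map = {"M": [-1], "F": [1]}
  state_map = {"michigan": [1, 0, 0], "nebraska": [0, 1, 0],
               "oklahoma": [0, 0, 1]}
  pol_map = {"conservative": [1, 0, 0], "moderate": [0, 1, 0],
             "liberal": [0, 0, 1]}

  def encode(line):  # one tab-delimited raw line
      sex, age, state, income, pol = line.split("\t")
      return (sex_map[sex] + [int(age) / 100.0] + state_map[state]
              + [float(income) / 100000.0] + pol_map[pol])

  print(encode("F\t24\tmichigan\t29500.00\tliberal"))
  # [1, 0.24, 1, 0, 0, 0.295, 0, 0, 1]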

I split the data into a 200-item set of training data and a 40-item set of test data.

My neural architecture was 8-25-1 with tanh() hidden node activation and identity() output node activation. For training I used a batch size of 10, a learning rate of 0.001, and 1,000 epochs.

The resulting model scored 0.9100 accuracy on the training data (182 out of 200 correct) and 0.9500 accuracy on the test data (38 out of 40 correct) where a correct income prediction is one that is within 10% of the true target income. These results are similar to those achieved by a PyTorch neural network and a LightGBM tree-based system.

Good fun!



Different versions of a computer program can have a very different look and feel. Different book cover art has a big influence on my perception of a book. Here are covers for three versions of Tolkien’s “The Two Towers”, the second book in the Lord of the Rings trilogy. Left: By artist Jack Gaughan (1930-1985), for the 1965 accidentally unauthorized edition. Center: By artist Barbara Remington (1929-2020), for the first U.S. authorized paperback 1965 edition. Right: Art by Tolkien (1892-1973), adapted for the 1973 edition.


Demo code. Very long! Replace “lt” (less than), “gt”, “lte”, “gte”, “and” with Boolean operator symbols.

// people_income.js
// node.js  ES6

// NN regression
// tanh, identity activation, MSE loss

let U = require("..\\Utils\\utilities_lib.js")
let FS = require("fs")

// ----------------------------------------------------------

class NeuralNet
{
  constructor(numInput, numHidden, numOutput, seed)
  {
    this.rnd = new U.Erratic(seed);  // pseudo-random

    this.ni = numInput; 
    this.nh = numHidden;
    this.no = numOutput;

    this.iNodes = U.vecMake(this.ni, 0.0);
    this.hNodes = U.vecMake(this.nh, 0.0);
    this.oNodes = U.vecMake(this.no, 0.0);

    this.ihWeights = U.matMake(this.ni, this.nh, 0.0);
    this.hoWeights = U.matMake(this.nh, this.no, 0.0);

    this.hBiases = U.vecMake(this.nh, 0.0);
    this.oBiases = U.vecMake(this.no, 0.0); // [1] 

    this.ihGrads = U.matMake(this.ni, this.nh, 0.0);
    this.hbGrads = U.vecMake(this.nh, 0.0);
    this.hoGrads = U.matMake(this.nh, this.no, 0.0);
    this.obGrads = U.vecMake(this.no, 0.0);

    this.initWeights();
  }

  initWeights()
  {
    let lo = -0.10;
    let hi = 0.10;
    for (let i = 0; i "lt" this.ni; ++i) {
      for (let j = 0; j "lt" this.nh; ++j) {
        this.ihWeights[i][j] = (hi - lo) * 
          this.rnd.next() + lo;
      }
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      for (let k = 0; k "lt" this.no; ++k) {
        this.hoWeights[j][k] = (hi - lo) * 
          this.rnd.next() + lo;
      }
    }
  } 

  computeOutput(X)
  {
    let hSums = U.vecMake(this.nh, 0.0);
    let oSums = U.vecMake(this.no, 0.0);
    
    this.iNodes = X;

    for (let j = 0; j "lt" this.nh; ++j) {
      for (let i = 0; i "lt" this.ni; ++i) {
        hSums[j] += this.iNodes[i] * this.ihWeights[i][j];
      }
      hSums[j] += this.hBiases[j];
      this.hNodes[j] = U.hyperTan(hSums[j]);
    }

    for (let k = 0; k "lt" this.no; ++k) {
      for (let j = 0; j "lt" this.nh; ++j) {
        oSums[k] += this.hNodes[j] * this.hoWeights[j][k];
      }
      oSums[k] += this.oBiases[k];
    }

    // this.oNodes = U.softmax(oSums);  // multi-class
    // this.oNodes = U.identity(oSums); // regression
    for (let k = 0; k "lt" this.no; ++k) {
      this.oNodes[k] = oSums[k];
    }

    let result = [];  // or use U.vecMake
    for (let k = 0; k "lt" this.no; ++k) {
      result[k] = this.oNodes[k];
    }
    return result[0];  // a single scalar value
  } // computeOutput()

  setWeights(wts)
  {
    // order: ihWts, hBiases, hoWts, oBiases
    let p = 0;

    for (let i = 0; i "lt" this.ni; ++i) {
      for (let j = 0; j "lt" this.nh; ++j) {
        this.ihWeights[i][j] = wts[p++];
      }
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      this.hBiases[j] = wts[p++];
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      for (let k = 0; k "lt" this.no; ++k) {
        this.hoWeights[j][k] = wts[p++];
      }
    }

    for (let k = 0; k "lt" this.no; ++k) {
      this.oBiases[k] = wts[p++];
    }
  } // setWeights()

  getWeights()
  {
    // order: ihWts, hBiases, hoWts, oBiases
    let numWts = (this.ni * this.nh) + this.nh +
      (this.nh * this.no) + this.no;
    let result = U.vecMake(numWts, 0.0);
    let p = 0;
    for (let i = 0; i "lt" this.ni; ++i) {
      for (let j = 0; j "lt" this.nh; ++j) {
        result[p++] = this.ihWeights[i][j];
      }
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      result[p++] = this.hBiases[j];
    }

    for (let j = 0; j "lt" this.nh; ++j) {
      for (let k = 0; k "lt" this.no; ++k) {
        result[p++] = this.hoWeights[j][k];
      }
    }

    for (let k = 0; k "lt" this.no; ++k) {
      result[p++] = this.oBiases[k];
    }
    return result;
  } // getWeights()

  shuffle(v)
  {
    // Fisher-Yates
    let n = v.length;
    for (let i = 0; i "lt" n; ++i) {
      let r = this.rnd.nextInt(i, n);
      let tmp = v[r];
      v[r] = v[i];
      v[i] = tmp;
    }
  }

  // --------------------------------------------------------
  // helpers for train(): zeroOutGrads(), accumGrads(y),
  //   updateWeights(lrnRate)
  // --------------------------------------------------------

  zeroOutGrads()
  {
    for (let i = 0; i "lt" this.ni; ++i)
      for (let j = 0; j "lt" this.nh; ++j)
        this.ihGrads[i][j] = 0.0;

    for (let j = 0; j "lt" this.nh; ++j)
      this.hbGrads[j] = 0.0;

    for (let j = 0; j "lt" this.nh; ++j)
      for (let k = 0; k "lt" this.no; ++k)
        this.hoGrads[j][k] = 0.0;

    for (let k = 0; k "lt" this.no; ++k)
      this.obGrads[k] = 0.0;
  }

  accumGrads(y)
  {
    // y is target scalar
    let oSignals = U.vecMake(this.no, 0.0);
    let hSignals = U.vecMake(this.nh, 0.0);

    // 1. compute output node scratch signals 
    for (let k = 0; k "lt" this.no; ++k) {
      let derivative = 1.0;  // CEE
      // let derivative =
      //  this.oNodes[k] * (1 - this.oNodes[k]); // MSE
      oSignals[k] = derivative *
        (this.oNodes[k] - y);  // CEE
    }

    // 2. accum hidden-to-output gradients 
    for (let j = 0; j "lt" this.nh; ++j)
      for (let k = 0; k "lt" this.no; ++k)
        this.hoGrads[j][k] += oSignals[k] * this.hNodes[j];

    // 3. accum output node bias gradients
    for (let k = 0; k "lt" this.no; ++k)
      this.obGrads[k] += oSignals[k] * 1.0;  // 1.0 dummy 

    // 4. compute hidden node signals
    for (let j = 0; j "lt" this.nh; ++j) {
      let sum = 0.0;
      for (let k = 0; k "lt" this.no; ++k)
        sum += oSignals[k] * this.hoWeights[j][k];

      let derivative =
        (1 - this.hNodes[j]) *
        (1 + this.hNodes[j]);  // assumes tanh
      hSignals[j] = derivative * sum;
    }

    // 5. accum input-to-hidden gradients
    for (let i = 0; i "lt" this.ni; ++i)
      for (let j = 0; j "lt" this.nh; ++j)
        this.ihGrads[i][j] += hSignals[j] * this.iNodes[i];

    // 6. accum hidden node bias gradients
    for (let j = 0; j "lt" this.nh; ++j)
      this.hbGrads[j] += hSignals[j] * 1.0;  // 1.0 dummy
  } // accumGrads
  
  updateWeights(lrnRate)
  {
    // assumes all gradients computed
    // 1. update input-to-hidden weights
    for (let i = 0; i "lt" this.ni; ++i) {
      for (let j = 0; j "lt" this.nh; ++j) {
        let delta = -1.0 * lrnRate * this.ihGrads[i][j];
        this.ihWeights[i][j] += delta;
      }
    }

    // 2. update hidden node biases
    for (let j = 0; j "lt" this.nh; ++j) {
      let delta = -1.0 * lrnRate * this.hbGrads[j];
      this.hBiases[j] += delta;
    }

    // 3. update hidden-to-output weights
    for (let j = 0; j "lt" this.nh; ++j) {
      for (let k = 0; k "lt" this.no; ++k) {
        let delta = -1.0 * lrnRate * this.hoGrads[j][k];
        this.hoWeights[j][k] += delta;
      }
    }

    // 4. update output node biases
    for (let k = 0; k "lt" this.no; ++k) {
      let delta = -1.0 * lrnRate * this.obGrads[k];
      this.oBiases[k] += delta;
    }
  } // updateWeights()

  // --------------------------------------------------------

  train(trainX, trainY, lrnRate, batSize, maxEpochs)
  {
    let n = trainX.length;  // 200
    let batchesPerEpoch = Math.trunc(n / batSize);  // 20
    let freq = Math.trunc(maxEpochs / 10);  // progress
    let indices = U.arange(n);

    // ----------------------------------------------------
    //
    // n = 200; bs = 10
    // batches per epoch = 200 / 10 = 20

    // for epoch = 0; epoch "lt" maxEpochs; ++epoch
    //   for batch = 0; batch "lt" bpe; ++batch
    //     for item = 0; item "lt" bs; ++item
    //       compute output
    //       accum grads
    //     end-item
    //     update weights
    //     zero-out grads
    //   end-batches
    //   shuffle indices
    // end-epochs
    //
    // ----------------------------------------------------

    for (let epoch = 0; epoch "lt" maxEpochs; ++epoch) {
      this.shuffle(indices);
      let ptr = 0;  // points into indices
      for (let batIdx = 0; batIdx "lt" batchesPerEpoch;
        ++batIdx) // 0, 1, . . 19
      {
        for (let i = 0; i "lt" batSize; ++i) { // 0 . . 9
          let ii = indices[ptr++];  // compute output
          let x = trainX[ii];
          let y = trainY[ii];
          this.computeOutput(x);  // into this.oNodes
          this.accumGrads(y);
        }
        this.updateWeights(lrnRate);
        this.zeroOutGrads(); // prep for next batch
      } // batches

      if (epoch % freq == 0) {
        let mse = 
          this.meanSqErr(trainX, trainY).toFixed(4);
        let acc = this.accuracy(trainX, trainY,
          0.10).toFixed(4);

        let s1 = "epoch: " +
          epoch.toString().padStart(6, ' ');
        let s2 = "   MSE = " + 
          mse.toString().padStart(8, ' ');
        let s3 = "   acc = " + acc.toString();

        console.log(s1 + s2 + s3);
      }
    } // epoch
  } // train

  // -------------------------------------------------------- 

  meanSqErr(dataX, dataY)
  {
    // for regression
    let sumSE = 0.0;
    for (let i = 0; i "lt" dataX.length; ++i) {
      let X = dataX[i];
      let Y = dataY[i];  // target income
      let oupt = this.computeOutput(X); 
      sumSE += (Y - oupt) * (Y - oupt); 
    }
    return sumSE / dataX.length;  // consider Root MSE
  } 

  accuracy(dataX, dataY, pctClose)
  {
    let nc = 0; let nw = 0;
    for (let i = 0; i "lt" dataX.length; ++i) { 
      let X = dataX[i];
      let y = dataY[i];  // target income
      let pred = this.computeOutput(X);
      if ( Math.abs(y - pred) "lt" Math.abs(y * pctClose) ) {
        ++nc;
      }
      else {
        ++nw;
      }
    }
    return nc / (nc + nw);
  }

  saveWeights(fn)
  {
    let wts = this.getWeights();
    let n = wts.length;
    let s = "";
    for (let i = 0; i "lt" n-1; ++i) {
      s += wts[i].toString() + ",";
    }
    s += wts[n-1];

    FS.writeFileSync(fn, s);
  }

  loadWeights(fn)
  {
    let n = (this.ni * this.nh) + this.nh +
      (this.nh * this.no) + this.no;
    let wts = U.vecMake(n, 0.0);
    let all = FS.readFileSync(fn, "utf8");
    let strVals = all.split(",");
    let nn = strVals.length;
    if (n != nn) {
      throw("Size error in NeuralNet.loadWeights()");
    }
    for (let i = 0; i "lt" n; ++i) {
      wts[i] = parseFloat(strVals[i]);
    }
    this.setWeights(wts);
  }

} // NeuralNet

// ----------------------------------------------------------

function main()
{
  // process.stdout.write("\033[0m");  // reset
  // process.stdout.write("\x1b[1m" + "\x1b[37m"); // white
  console.log("\nBegin JavaScript NN regression demo ");
  console.log("Predict income from sex, age, State, politics ");
  
  // 1. load data
  // -1  0.29  1 0 0  0.65400  0 0 1
  //  1  0.36  0 0 1  0.58300  1 0 0
  console.log("\nLoading data into memory ");
  let trainX = U.loadTxt(".\\Data\\people_train.txt", ",",
    [0,1,2,3,4,6,7,8], "#");
  let trainY = U.loadTxt(".\\Data\\people_train.txt", ",",
    [5], "#");
  trainY = U.matToVec(trainY);
  let testX = U.loadTxt(".\\Data\\people_test.txt", ",",
    [0,1,2,3,4,6,7,8], "#");
  let testY = U.loadTxt(".\\Data\\people_test.txt", ",",
    [5], "#");
  testY = U.matToVec(testY);

  // 2. create network
  console.log("\nCreating 8-25-1 tanh, identity MSE NN ");
  let seed = 0;
  let nn = new NeuralNet(8, 25, 1, seed);

  // 3. train network
  let lrnRate = 0.001;
  let maxEpochs = 1000;
  console.log("\nSetting learn rate = 0.001 ");
  console.log("Setting bat size = 10 ");
  nn.train(trainX, trainY, lrnRate, 10, maxEpochs);
  console.log("Training complete ");

  // 4. evaluate model
  console.log("\nComputing acc within 0.10 of true ");
  let trainAcc = nn.accuracy(trainX, trainY, 0.10);
  let testAcc = nn.accuracy(testX, testY, 0.10);
  console.log("Accuracy on training data = " +
    trainAcc.toFixed(4).toString()); 
  console.log("Accuracy on test data     = " +
    testAcc.toFixed(4).toString());

  // 5. save trained model
  fn = ".\\Models\\people_income_wts.txt";
  console.log("\nSaving model weights and biases to: ");
  console.log(fn);
  nn.saveWeights(fn);

  // 6. use trained model
  console.log("\nPredict for M 34 Oklahoma moderate ");
  let x = [-1, 0.34, 0,0,1, 0,1,0];
  let predicted = nn.computeOutput(x);
  console.log("\nPredicted income: ");
  console.log(predicted.toFixed(5).toString());

  //process.stdout.write("\033[0m");  // reset
  console.log("\nEnd demo");
}

main()

Code for utility functions:

// utilities_lib.js
// ES6

let FS = require('fs');

// ----------------------------------------------------------

function loadTxt(fn, delimit, usecols, comment) {
  // efficient but mildly complicated
  let all = FS.readFileSync(fn, "utf8");  // giant string
  all = all.trim();  // strip final crlf in file
  let lines = all.split("\n");  // array of lines

  // count number non-comment lines
  let nRows = 0;
  for (let i = 0; i "lt" lines.length; ++i) {
    if (!lines[i].startsWith(comment))
      ++nRows;
  }
  let nCols = usecols.length;
  let result = matMake(nRows, nCols, 0.0); 
 
  let r = 0;  // into lines
  let i = 0;  // into result[][]
  while (r "lt" lines.length) {
    if (lines[r].startsWith(comment)) {
      ++r;  // next row
    }
    else {
      let tokens = lines[r].split(delimit);
      for (let j = 0; j "lt" nCols; ++j) {
        result[i][j] = parseFloat(tokens[usecols[j]]);
      }
      ++r;
      ++i;
    }
  }

  return result;
}

// ----------------------------------------------------------

function arange(n)
{
  let result = [];
  for (let i = 0; i "lt" n; ++i) {
    result[i] = Math.trunc(i);
  }
  return result;
}

// ----------------------------------------------------------

class Erratic
{
  constructor(seed)
  {
    this.seed = seed + 0.5;  // avoid 0
  }

  next()
  {
    let x = Math.sin(this.seed) * 1000;
    let result = x - Math.floor(x);  // [0.0,1.0)
    this.seed = result;  // for next call
    return result;
  }

  nextInt(lo, hi)
  {
    let x = this.next();
    return Math.trunc((hi - lo) * x + lo);
  }
}

// ----------------------------------------------------------

function vecMake(n, val)
{
  let result = [];
  for (let i = 0; i "lt" n; ++i) {
    result[i] = val;
  }
  return result;
}

function matMake(rows, cols, val)
{
  let result = [];
  for (let i = 0; i "lt" rows; ++i) {
    result[i] = [];
    for (let j = 0; j "lt" cols; ++j) {
      result[i][j] = val;
    }
  }
  return result;
}

function matToOneHot(m, n)
{
  // convert ordinal (0,1,2 . .) to one-hot
  let rows = m.length;
  let result = matMake(rows, n, 0.0);
  for (let i = 0; i "lt" rows; ++i) {
    let k = Math.trunc(m[i][0]);  // 0,1,2 . .
    result[i] = vecMake(n, 0.0);  // [0.0  0.0  0.0]
    result[i][k] = 1.0;  // [ 0.0  1.0  0.0]
  }

  return result;
}

function matToVec(m)
{
  let r = m.length;
  let c = m[0].length;
  let result = 	vecMake(r*c, 0.0);
  let k = 0;
  for (let i = 0; i "lt" r; ++i) {
    for (let j = 0; j "lt" c; ++j) {
      result[k++] = m[i][j];
    }
  }
  return result;
}

function vecShow(vec, dec, wid, nl)
{
  for (let i = 0; i "lt" vec.length; ++i) {
    let x = vec[i];
    if (Math.abs(x) "lt" 0.000001) x = 0.0;  // avoid -0.00
    let xx = x.toFixed(dec);
    let s = xx.toString().padStart(wid, ' ');
    process.stdout.write(s);
    process.stdout.write(" ");
  }

  if (nl == true)
    process.stdout.write("\n");
}


function matShow(m, dec, wid)
{
  let rows = m.length;
  let cols = m[0].length;
  for (let i = 0; i "lt" rows; ++i) {
    for (let j = 0; j "lt" cols; ++j) {
      if (m[i][j] "gte" 0.0) {
        process.stdout.write(" ");  // + or - space
      }
      process.stdout.write(m[i][j].toFixed(dec));
      process.stdout.write("  ");
    }
    process.stdout.write("\n");
  }
}

function argmax(v)
{
  let result = 0;
  let m = v[0];
  for (let i = 0; i "lt" v.length; ++i) {
    if (v[i] "gt" m) {
      m = v[i];
      result = i;
    }
  }
  return result;
}

function hyperTan(x)
{
  if (x "lt" -10.0) {
    return -1.0;
  }
  else if (x "gt" 10.0) {
    return 1.0;
  }
  else {
    return Math.tanh(x);
  }
}

function logSig(x)
{
  if (x "lt" -10.0) {
    return 0.0;
  }
  else if (x "gt" 10.0) {
    return 1.0;
  }
  else {
    return 1.0 / (1.0 + Math.exp(-x));
  }
}

function vecMax(vec)
{
  let mx = vec[0];
  for (let i = 0; i "lt" vec.length; ++i) {
    if (vec[i] "gt" mx) {
      mx = vec[i];
    }
  }
  return mx;
}

function softmax(vec)
{
  //let m = Math.max(...vec);  // or 'spread' operator
  let m = vecMax(vec);
  let result = [];
  let sum = 0.0;
  for (let i = 0; i "lt" vec.length; ++i) {
    result[i] = Math.exp(vec[i] - m);
    sum += result[i];
  }
  for (let i = 0; i "lt" result.length; ++i) {
    result[i] = result[i] / sum;
  }
  return result;
}

module.exports = {
  vecMake,
  matMake,
  matToOneHot,
  matToVec,
  vecShow,
  matShow,
  argmax,
  loadTxt,
  arange,
  Erratic,
  hyperTan,
  logSig,
  vecMax,
  softmax
};

Training data:

# people_train.txt
#
# sex (-1 = male, 1 = female), age / 100,
# state (michigan = 100, nebraska = 010, oklahoma = 001),
# income / 100_000,
# politics (conservative = 100, moderate = 010, liberal = 001)
#
1, 0.24, 1, 0, 0, 0.2950, 0, 0, 1
-1, 0.39, 0, 0, 1, 0.5120, 0, 1, 0
1, 0.63, 0, 1, 0, 0.7580, 1, 0, 0
-1, 0.36, 1, 0, 0, 0.4450, 0, 1, 0
1, 0.27, 0, 1, 0, 0.2860, 0, 0, 1
1, 0.50, 0, 1, 0, 0.5650, 0, 1, 0
1, 0.50, 0, 0, 1, 0.5500, 0, 1, 0
-1, 0.19, 0, 0, 1, 0.3270, 1, 0, 0
1, 0.22, 0, 1, 0, 0.2770, 0, 1, 0
-1, 0.39, 0, 0, 1, 0.4710, 0, 0, 1
1, 0.34, 1, 0, 0, 0.3940, 0, 1, 0
-1, 0.22, 1, 0, 0, 0.3350, 1, 0, 0
1, 0.35, 0, 0, 1, 0.3520, 0, 0, 1
-1, 0.33, 0, 1, 0, 0.4640, 0, 1, 0
1, 0.45, 0, 1, 0, 0.5410, 0, 1, 0
1, 0.42, 0, 1, 0, 0.5070, 0, 1, 0
-1, 0.33, 0, 1, 0, 0.4680, 0, 1, 0
1, 0.25, 0, 0, 1, 0.3000, 0, 1, 0
-1, 0.31, 0, 1, 0, 0.4640, 1, 0, 0
1, 0.27, 1, 0, 0, 0.3250, 0, 0, 1
1, 0.48, 1, 0, 0, 0.5400, 0, 1, 0
-1, 0.64, 0, 1, 0, 0.7130, 0, 0, 1
1, 0.61, 0, 1, 0, 0.7240, 1, 0, 0
1, 0.54, 0, 0, 1, 0.6100, 1, 0, 0
1, 0.29, 1, 0, 0, 0.3630, 1, 0, 0
1, 0.50, 0, 0, 1, 0.5500, 0, 1, 0
1, 0.55, 0, 0, 1, 0.6250, 1, 0, 0
1, 0.40, 1, 0, 0, 0.5240, 1, 0, 0
1, 0.22, 1, 0, 0, 0.2360, 0, 0, 1
1, 0.68, 0, 1, 0, 0.7840, 1, 0, 0
-1, 0.60, 1, 0, 0, 0.7170, 0, 0, 1
-1, 0.34, 0, 0, 1, 0.4650, 0, 1, 0
-1, 0.25, 0, 0, 1, 0.3710, 1, 0, 0
-1, 0.31, 0, 1, 0, 0.4890, 0, 1, 0
1, 0.43, 0, 0, 1, 0.4800, 0, 1, 0
1, 0.58, 0, 1, 0, 0.6540, 0, 0, 1
-1, 0.55, 0, 1, 0, 0.6070, 0, 0, 1
-1, 0.43, 0, 1, 0, 0.5110, 0, 1, 0
-1, 0.43, 0, 0, 1, 0.5320, 0, 1, 0
-1, 0.21, 1, 0, 0, 0.3720, 1, 0, 0
1, 0.55, 0, 0, 1, 0.6460, 1, 0, 0
1, 0.64, 0, 1, 0, 0.7480, 1, 0, 0
-1, 0.41, 1, 0, 0, 0.5880, 0, 1, 0
1, 0.64, 0, 0, 1, 0.7270, 1, 0, 0
-1, 0.56, 0, 0, 1, 0.6660, 0, 0, 1
1, 0.31, 0, 0, 1, 0.3600, 0, 1, 0
-1, 0.65, 0, 0, 1, 0.7010, 0, 0, 1
1, 0.55, 0, 0, 1, 0.6430, 1, 0, 0
-1, 0.25, 1, 0, 0, 0.4030, 1, 0, 0
1, 0.46, 0, 0, 1, 0.5100, 0, 1, 0
-1, 0.36, 1, 0, 0, 0.5350, 1, 0, 0
1, 0.52, 0, 1, 0, 0.5810, 0, 1, 0
1, 0.61, 0, 0, 1, 0.6790, 1, 0, 0
1, 0.57, 0, 0, 1, 0.6570, 1, 0, 0
-1, 0.46, 0, 1, 0, 0.5260, 0, 1, 0
-1, 0.62, 1, 0, 0, 0.6680, 0, 0, 1
1, 0.55, 0, 0, 1, 0.6270, 1, 0, 0
-1, 0.22, 0, 0, 1, 0.2770, 0, 1, 0
-1, 0.50, 1, 0, 0, 0.6290, 1, 0, 0
-1, 0.32, 0, 1, 0, 0.4180, 0, 1, 0
-1, 0.21, 0, 0, 1, 0.3560, 1, 0, 0
1, 0.44, 0, 1, 0, 0.5200, 0, 1, 0
1, 0.46, 0, 1, 0, 0.5170, 0, 1, 0
1, 0.62, 0, 1, 0, 0.6970, 1, 0, 0
1, 0.57, 0, 1, 0, 0.6640, 1, 0, 0
-1, 0.67, 0, 0, 1, 0.7580, 0, 0, 1
1, 0.29, 1, 0, 0, 0.3430, 0, 0, 1
1, 0.53, 1, 0, 0, 0.6010, 1, 0, 0
-1, 0.44, 1, 0, 0, 0.5480, 0, 1, 0
1, 0.46, 0, 1, 0, 0.5230, 0, 1, 0
-1, 0.20, 0, 1, 0, 0.3010, 0, 1, 0
-1, 0.38, 1, 0, 0, 0.5350, 0, 1, 0
1, 0.50, 0, 1, 0, 0.5860, 0, 1, 0
1, 0.33, 0, 1, 0, 0.4250, 0, 1, 0
-1, 0.33, 0, 1, 0, 0.3930, 0, 1, 0
1, 0.26, 0, 1, 0, 0.4040, 1, 0, 0
1, 0.58, 1, 0, 0, 0.7070, 1, 0, 0
1, 0.43, 0, 0, 1, 0.4800, 0, 1, 0
-1, 0.46, 1, 0, 0, 0.6440, 1, 0, 0
1, 0.60, 1, 0, 0, 0.7170, 1, 0, 0
-1, 0.42, 1, 0, 0, 0.4890, 0, 1, 0
-1, 0.56, 0, 0, 1, 0.5640, 0, 0, 1
-1, 0.62, 0, 1, 0, 0.6630, 0, 0, 1
-1, 0.50, 1, 0, 0, 0.6480, 0, 1, 0
1, 0.47, 0, 0, 1, 0.5200, 0, 1, 0
-1, 0.67, 0, 1, 0, 0.8040, 0, 0, 1
-1, 0.40, 0, 0, 1, 0.5040, 0, 1, 0
1, 0.42, 0, 1, 0, 0.4840, 0, 1, 0
1, 0.64, 1, 0, 0, 0.7200, 1, 0, 0
-1, 0.47, 1, 0, 0, 0.5870, 0, 0, 1
1, 0.45, 0, 1, 0, 0.5280, 0, 1, 0
-1, 0.25, 0, 0, 1, 0.4090, 1, 0, 0
1, 0.38, 1, 0, 0, 0.4840, 1, 0, 0
1, 0.55, 0, 0, 1, 0.6000, 0, 1, 0
-1, 0.44, 1, 0, 0, 0.6060, 0, 1, 0
1, 0.33, 1, 0, 0, 0.4100, 0, 1, 0
1, 0.34, 0, 0, 1, 0.3900, 0, 1, 0
1, 0.27, 0, 1, 0, 0.3370, 0, 0, 1
1, 0.32, 0, 1, 0, 0.4070, 0, 1, 0
1, 0.42, 0, 0, 1, 0.4700, 0, 1, 0
-1, 0.24, 0, 0, 1, 0.4030, 1, 0, 0
1, 0.42, 0, 1, 0, 0.5030, 0, 1, 0
1, 0.25, 0, 0, 1, 0.2800, 0, 0, 1
1, 0.51, 0, 1, 0, 0.5800, 0, 1, 0
-1, 0.55, 0, 1, 0, 0.6350, 0, 0, 1
1, 0.44, 1, 0, 0, 0.4780, 0, 0, 1
-1, 0.18, 1, 0, 0, 0.3980, 1, 0, 0
-1, 0.67, 0, 1, 0, 0.7160, 0, 0, 1
1, 0.45, 0, 0, 1, 0.5000, 0, 1, 0
1, 0.48, 1, 0, 0, 0.5580, 0, 1, 0
-1, 0.25, 0, 1, 0, 0.3900, 0, 1, 0
-1, 0.67, 1, 0, 0, 0.7830, 0, 1, 0
1, 0.37, 0, 0, 1, 0.4200, 0, 1, 0
-1, 0.32, 1, 0, 0, 0.4270, 0, 1, 0
1, 0.48, 1, 0, 0, 0.5700, 0, 1, 0
-1, 0.66, 0, 0, 1, 0.7500, 0, 0, 1
1, 0.61, 1, 0, 0, 0.7000, 1, 0, 0
-1, 0.58, 0, 0, 1, 0.6890, 0, 1, 0
1, 0.19, 1, 0, 0, 0.2400, 0, 0, 1
1, 0.38, 0, 0, 1, 0.4300, 0, 1, 0
-1, 0.27, 1, 0, 0, 0.3640, 0, 1, 0
1, 0.42, 1, 0, 0, 0.4800, 0, 1, 0
1, 0.60, 1, 0, 0, 0.7130, 1, 0, 0
-1, 0.27, 0, 0, 1, 0.3480, 1, 0, 0
1, 0.29, 0, 1, 0, 0.3710, 1, 0, 0
-1, 0.43, 1, 0, 0, 0.5670, 0, 1, 0
1, 0.48, 1, 0, 0, 0.5670, 0, 1, 0
1, 0.27, 0, 0, 1, 0.2940, 0, 0, 1
-1, 0.44, 1, 0, 0, 0.5520, 1, 0, 0
1, 0.23, 0, 1, 0, 0.2630, 0, 0, 1
-1, 0.36, 0, 1, 0, 0.5300, 0, 0, 1
1, 0.64, 0, 0, 1, 0.7250, 1, 0, 0
1, 0.29, 0, 0, 1, 0.3000, 0, 0, 1
-1, 0.33, 1, 0, 0, 0.4930, 0, 1, 0
-1, 0.66, 0, 1, 0, 0.7500, 0, 0, 1
-1, 0.21, 0, 0, 1, 0.3430, 1, 0, 0
1, 0.27, 1, 0, 0, 0.3270, 0, 0, 1
1, 0.29, 1, 0, 0, 0.3180, 0, 0, 1
-1, 0.31, 1, 0, 0, 0.4860, 0, 1, 0
1, 0.36, 0, 0, 1, 0.4100, 0, 1, 0
1, 0.49, 0, 1, 0, 0.5570, 0, 1, 0
-1, 0.28, 1, 0, 0, 0.3840, 1, 0, 0
-1, 0.43, 0, 0, 1, 0.5660, 0, 1, 0
-1, 0.46, 0, 1, 0, 0.5880, 0, 1, 0
1, 0.57, 1, 0, 0, 0.6980, 1, 0, 0
-1, 0.52, 0, 0, 1, 0.5940, 0, 1, 0
-1, 0.31, 0, 0, 1, 0.4350, 0, 1, 0
-1, 0.55, 1, 0, 0, 0.6200, 0, 0, 1
1, 0.50, 1, 0, 0, 0.5640, 0, 1, 0
1, 0.48, 0, 1, 0, 0.5590, 0, 1, 0
-1, 0.22, 0, 0, 1, 0.3450, 1, 0, 0
1, 0.59, 0, 0, 1, 0.6670, 1, 0, 0
1, 0.34, 1, 0, 0, 0.4280, 0, 0, 1
-1, 0.64, 1, 0, 0, 0.7720, 0, 0, 1
1, 0.29, 0, 0, 1, 0.3350, 0, 0, 1
-1, 0.34, 0, 1, 0, 0.4320, 0, 1, 0
-1, 0.61, 1, 0, 0, 0.7500, 0, 0, 1
1, 0.64, 0, 0, 1, 0.7110, 1, 0, 0
-1, 0.29, 1, 0, 0, 0.4130, 1, 0, 0
1, 0.63, 0, 1, 0, 0.7060, 1, 0, 0
-1, 0.29, 0, 1, 0, 0.4000, 1, 0, 0
-1, 0.51, 1, 0, 0, 0.6270, 0, 1, 0
-1, 0.24, 0, 0, 1, 0.3770, 1, 0, 0
1, 0.48, 0, 1, 0, 0.5750, 0, 1, 0
1, 0.18, 1, 0, 0, 0.2740, 1, 0, 0
1, 0.18, 1, 0, 0, 0.2030, 0, 0, 1
1, 0.33, 0, 1, 0, 0.3820, 0, 0, 1
-1, 0.20, 0, 0, 1, 0.3480, 1, 0, 0
1, 0.29, 0, 0, 1, 0.3300, 0, 0, 1
-1, 0.44, 0, 0, 1, 0.6300, 1, 0, 0
-1, 0.65, 0, 0, 1, 0.8180, 1, 0, 0
-1, 0.56, 1, 0, 0, 0.6370, 0, 0, 1
-1, 0.52, 0, 0, 1, 0.5840, 0, 1, 0
-1, 0.29, 0, 1, 0, 0.4860, 1, 0, 0
-1, 0.47, 0, 1, 0, 0.5890, 0, 1, 0
1, 0.68, 1, 0, 0, 0.7260, 0, 0, 1
1, 0.31, 0, 0, 1, 0.3600, 0, 1, 0
1, 0.61, 0, 1, 0, 0.6250, 0, 0, 1
1, 0.19, 0, 1, 0, 0.2150, 0, 0, 1
1, 0.38, 0, 0, 1, 0.4300, 0, 1, 0
-1, 0.26, 1, 0, 0, 0.4230, 1, 0, 0
1, 0.61, 0, 1, 0, 0.6740, 1, 0, 0
1, 0.40, 1, 0, 0, 0.4650, 0, 1, 0
-1, 0.49, 1, 0, 0, 0.6520, 0, 1, 0
1, 0.56, 1, 0, 0, 0.6750, 1, 0, 0
-1, 0.48, 0, 1, 0, 0.6600, 0, 1, 0
1, 0.52, 1, 0, 0, 0.5630, 0, 0, 1
-1, 0.18, 1, 0, 0, 0.2980, 1, 0, 0
-1, 0.56, 0, 0, 1, 0.5930, 0, 0, 1
-1, 0.52, 0, 1, 0, 0.6440, 0, 1, 0
-1, 0.18, 0, 1, 0, 0.2860, 0, 1, 0
-1, 0.58, 1, 0, 0, 0.6620, 0, 0, 1
-1, 0.39, 0, 1, 0, 0.5510, 0, 1, 0
-1, 0.46, 1, 0, 0, 0.6290, 0, 1, 0
-1, 0.40, 0, 1, 0, 0.4620, 0, 1, 0
-1, 0.60, 1, 0, 0, 0.7270, 0, 0, 1
1, 0.36, 0, 1, 0, 0.4070, 0, 0, 1
1, 0.44, 1, 0, 0, 0.5230, 0, 1, 0
1, 0.28, 1, 0, 0, 0.3130, 0, 0, 1
1, 0.54, 0, 0, 1, 0.6260, 1, 0, 0

Test data:

# people_test.txt
#
-1, 0.51, 1, 0, 0, 0.6120, 0, 1, 0
-1, 0.32, 0, 1, 0, 0.4610, 0, 1, 0
1, 0.55, 1, 0, 0, 0.6270, 1, 0, 0
1, 0.25, 0, 0, 1, 0.2620, 0, 0, 1
1, 0.33, 0, 0, 1, 0.3730, 0, 0, 1
-1, 0.29, 0, 1, 0, 0.4620, 1, 0, 0
1, 0.65, 1, 0, 0, 0.7270, 1, 0, 0
-1, 0.43, 0, 1, 0, 0.5140, 0, 1, 0
-1, 0.54, 0, 1, 0, 0.6480, 0, 0, 1
1, 0.61, 0, 1, 0, 0.7270, 1, 0, 0
1, 0.52, 0, 1, 0, 0.6360, 1, 0, 0
1, 0.3, 0, 1, 0, 0.3350, 0, 0, 1
1, 0.29, 1, 0, 0, 0.3140, 0, 0, 1
-1, 0.47, 0, 0, 1, 0.5940, 0, 1, 0
1, 0.39, 0, 1, 0, 0.4780, 0, 1, 0
1, 0.47, 0, 0, 1, 0.5200, 0, 1, 0
-1, 0.49, 1, 0, 0, 0.5860, 0, 1, 0
-1, 0.63, 0, 0, 1, 0.6740, 0, 0, 1
-1, 0.3, 1, 0, 0, 0.3920, 1, 0, 0
-1, 0.61, 0, 0, 1, 0.6960, 0, 0, 1
-1, 0.47, 0, 0, 1, 0.5870, 0, 1, 0
1, 0.3, 0, 0, 1, 0.3450, 0, 0, 1
-1, 0.51, 0, 0, 1, 0.5800, 0, 1, 0
-1, 0.24, 1, 0, 0, 0.3880, 0, 1, 0
-1, 0.49, 1, 0, 0, 0.6450, 0, 1, 0
1, 0.66, 0, 0, 1, 0.7450, 1, 0, 0
-1, 0.65, 1, 0, 0, 0.7690, 1, 0, 0
-1, 0.46, 0, 1, 0, 0.5800, 1, 0, 0
-1, 0.45, 0, 0, 1, 0.5180, 0, 1, 0
-1, 0.47, 1, 0, 0, 0.6360, 1, 0, 0
-1, 0.29, 1, 0, 0, 0.4480, 1, 0, 0
-1, 0.57, 0, 0, 1, 0.6930, 0, 0, 1
-1, 0.2, 1, 0, 0, 0.2870, 0, 0, 1
-1, 0.35, 1, 0, 0, 0.4340, 0, 1, 0
-1, 0.61, 0, 0, 1, 0.6700, 0, 0, 1
-1, 0.31, 0, 0, 1, 0.3730, 0, 1, 0
1, 0.18, 1, 0, 0, 0.2080, 0, 0, 1
1, 0.26, 0, 0, 1, 0.2920, 0, 0, 1
-1, 0.28, 1, 0, 0, 0.3640, 0, 0, 1
-1, 0.59, 0, 0, 1, 0.6940, 0, 0, 1

Programmatically Analyzing Chess Games Using Stockfish With Python

One rainy Saturday afternoon, I thought I’d investigate the possibility of programmatically analyzing chess positions and entire chess games.

After spending some time on the Internet, I realized there were lots of possible ways to approach this problem. I ended up using a Python library of functions that interact with the Stockfish chess engine.

First I downloaded the Stockfish engine for Windows from https://stockfishchess.org/download/windows/ by clicking on the 64-bit link button. This downloaded the file stockfish-windows-x86-64.zip to my Downloads directory. I unzipped the file and copied the extracted files to a local directory C:\Python\Stockfish that I created. The engine itself is the file stockfish-windows-x86-64.exe, located in the subdirectory Stockfish\stockfish-windows-x86-64\stockfish.

You don’t run the Stockfish executable directly. Stockfish is just an engine; it needs an application program that sends commands to, and reads responses from, the executable.

I installed a Python library interface to Stockfish by opening a command shell and issuing the command “pip install stockfish”. The stockfish Python library is just an interface — it doesn’t include the Stockfish engine. The stockfish library is very slick and the documentation at https://pypi.org/project/stockfish/ is very good.

After a few hours of experimentation, I was able to programmatically analyze a chess game. Part of the output of my demo program looks like:

----------

position = 40 | white to move |  move # 21

position in FEN =
r2qr3/pp1b1pkp/2ppnn2/4pp2/3PP3/P3PQNP/BPP3P1/R4RK1 w - - 0 21

position:
+---+---+---+---+---+---+---+---+
| r |   |   | q | r |   |   |   | 8
+---+---+---+---+---+---+---+---+
| p | p |   | b |   | p | k | p | 7
+---+---+---+---+---+---+---+---+
|   |   | p | p | n | n |   |   | 6
+---+---+---+---+---+---+---+---+
|   |   |   |   | p | p |   |   | 5
+---+---+---+---+---+---+---+---+
|   |   |   | P | P |   |   |   | 4
+---+---+---+---+---+---+---+---+
| P |   |   |   | P | Q | N | P | 3
+---+---+---+---+---+---+---+---+
| B | P | P |   |   |   | P |   | 2
+---+---+---+---+---+---+---+---+
| R |   |   |   |   | R | K |   | 1
+---+---+---+---+---+---+---+---+
  a   b   c   d   e   f   g   h

Position evaluation = {'type': 'cp', 'value': 376}

Best moves in this position:
{'Move': 'g3f5', 'Centipawn': 399, 'Mate': None}
{'Move': 'd4e5', 'Centipawn': 185, 'Mate': None}
{'Move': 'f3f5', 'Centipawn': 54, 'Mate': None}
{'Move': 'h3h4', 'Centipawn': -173, 'Mate': None}
{'Move': 'a2e6', 'Centipawn': -269, 'Mate': None}

----------

The evaluation values, such as 399 and -269 above, are measured in centipawns, or 1/100 of a pawn. A positive value means the position favors White; a negative value means the position favors Black. Almost all popular chess programs convert these evaluations by dividing by 100. So the move 21.Ng3xf5 results in a position with an evaluation of +3.99 pawns in favor of White, and the move 21.Ba2xe6 results in a position with an evaluation of -2.69 pawns, that is, 2.69 pawns in favor of Black.
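
Converting an evaluation to pawn units is just a divide by 100. Here is a minimal sketch (the helper name is mine, not part of the stockfish library):

# convert a stockfish evaluation dict to pawn units (sketch)
def eval_to_pawns(ev):
  # ev looks like {'type': 'cp', 'value': 376}
  if ev['type'] == 'cp':
    return ev['value'] / 100.0  # centipawns to pawns
  return None  # 'mate' evaluations have no centipawn value

print(eval_to_pawns({'type': 'cp', 'value': 399}))   # 3.99
print(eval_to_pawns({'type': 'cp', 'value': -269}))  # -2.69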



Left: The Stockfish engine download page. Right: The online tool I used to convert PGN notation to FEN notation.



Left: The documentation for the Python Stockfish library. Right: The Nakamura-Topalov game at chessgames.com


I was analyzing a game between Hikaru Nakamura (White) and Veselin Topalov (Black) from 2017. I downloaded the game in PGN format from chessgames.com:

[Event "Champions Showdown in Saint Louis (Blitz)"]
[Site "St Louis, MO USA"]
[Date "2017.11.12"]
[EventDate "2017.10.21"]
[Round "12.1"]
[Result "1-0"]
[White "Hikaru Nakamura"]
[Black "Veselin Topalov"]
[ECO "C26"]
[WhiteElo "2774"]
[BlackElo "2749"]
[PlyCount "43"]

1. e4 e5 2. Nc3 Nf6 3. Bc4 Bc5 4. d3 c6 5. Bb3 d6 6. Nf3 O-O
7. h3 Nbd7 8. O-O Bb6 9. a3 Nc5 10. Ba2 Ne6 11. Ne2 Re8
12. Be3 Bxe3 13. fxe3 Qc7 14. Nh4 Qd8 15. Nf3 Bd7 16. Ng3 g6
17. d4 Qc7 18. Nh4 Qd8 19. Qf3 Kg7 20. Nhf5+ gxf5 21. Nxf5+
Kg6 22. Bxe6 1-0

I was unable to determine how to use the stockfish Python library with PGN data, so I had to convert the PGN data to FEN (“Forsyth–Edwards Notation”) data. Luckily I discovered a very nice online tool to do this at https://www.lutanho.net/pgn/pgn2fen.html. The game in FEN format is:

rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2
rnbqkbnr/pppp1ppp/8/4p3/4P3/2N5/PPPP1PPP/R1BQKBNR b KQkq - 1 2
rnbqkb1r/pppp1ppp/5n2/4p3/4P3/2N5/PPPP1PPP/R1BQKBNR w KQkq - 2 3
rnbqkb1r/pppp1ppp/5n2/4p3/2B1P3/2N5/PPPP1PPP/R1BQK1NR b KQkq - 3 3
. . .
r2qr3/pp1b1p1p/2ppnnk1/4pN2/3PP3/P3PQ1P/BPP3P1/R4RK1 w - - 1 22
r2qr3/pp1b1p1p/2ppBnk1/4pN2/3PP3/P3PQ1P/1PP3P1/R4RK1 b - - 0 22

Each line of data is a position from the game. The full FEN is listed below. The Wikipedia article on Forsyth–Edwards Notation has a good explanation.
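
As an alternative to the online conversion tool, the open source python-chess library can replay a PGN game and emit the FEN string after each half-move. This is a sketch under the assumption that the library is installed (“pip install chess”) and the PGN has been saved to a file (the file name here is hypothetical); I used the online tool for the demo:

# PGN to FEN using python-chess (sketch; not used in the demo)
import chess.pgn

with open("nakamura_topalov_2017.pgn") as f:  # hypothetical file name
  game = chess.pgn.read_game(f)

board = game.board()
print(board.fen())  # starting position
for move in game.mainline_moves():
  board.push(move)
  print(board.fen())  # position after each half-move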

Here are a few of the key lines of my demo program. These statements fire up the Stockfish engine:

from stockfish import Stockfish

loc = "C:\\Python\\Stockfish\\" + \
  "stockfish-windows-x86-64\\stockfish\\" + \
  "stockfish-windows-x86-64.exe"
stockfish = Stockfish(path=loc)

These lines set and display a position from the game:

game = ".\\game_04.txt"  # FEN data
f = open(game, "r")
. . .
  line = f.readline()  # read a line from the FEN data
  stockfish.set_fen_position(line)  # set position using FEN
  vis_pos = stockfish.get_board_visual() # ASCII visual
  print("position: ")
  print(vis_pos)
. . .

All in all it was a very interesting experiment and there are many ideas to explore. For example, chess positions where about half of the possible moves result in positive evaluations and the other half result in negative evaluations seem “risky” in some sense. I suspect some grandmaster chess players tend to make moves that lead to risky positions, rather than objectively best moves, because a risky position gives their opponent more chances to make a mistake. A rough sketch of this idea appears below.
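
A crude way to quantify riskiness: ask Stockfish for the top candidate moves and measure how evenly they split between positive and negative evaluations. The helper below is hypothetical (my own name and scoring logic), built only on the get_top_moves() call demonstrated above:

# riskiness score from the eval split of candidate moves (sketch)
def riskiness(stockfish, fen, n_moves=10):
  stockfish.set_fen_position(fen)
  moves = stockfish.get_top_moves(n_moves)
  n_pos = 0; n_neg = 0
  for m in moves:
    if m['Centipawn'] is None: continue  # forced-mate line
    if m['Centipawn'] > 0: n_pos += 1
    else: n_neg += 1
  if n_pos + n_neg == 0: return 0.0
  frac = n_pos / (n_pos + n_neg)
  return 1.0 - abs(frac - 0.5) * 2.0  # 1.0 = evenly split = risky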

I’m sure there’s a lot more to learn about the stockfish library, but even so, I think I’ve made a good start.



Chess grandmasters often have very warped personalities because they must dedicate essentially their entire life to chess. For example, Robert Fischer (1943-2008, 11th champion 1972-1975) was truly a bizarre human being — and not in a nice way. But several of the 17 modern world chess champions have reputations as being nice people. These nice guys of chess include Jose Raul Capablanca (1888-1942, 3rd champion 1921-1927), Max Euwe (1901-1981, 5th champion 1935-1937), Vasily Smyslov (1921-2010, 7th champion 1957-1958), Boris Spassky (b. 1937, 10th champion 1969-1972), and Viswanathan Anand (b. 1969, 15th champion 2007-2013).

Left: Jose Raul Capablanca. Center: Max Euwe. Right: Viswanathan Anand.


Demo program:

# stockfish_demo.py
# Anaconda3-2023.09-0  Python 3.11.5
# Windows 10/11

from stockfish import Stockfish

loc = "C:\\Python\\Stockfish\\" + \
  "stockfish-windows-x86-64\\stockfish\\" + \
  "stockfish-windows-x86-64.exe"

stockfish = Stockfish(path=loc)
stockfish.update_engine_parameters({"UCI_Elo": 2000})
p = stockfish.get_parameters()
print("\nstockfish parameters: ")
print(p)

game = ".\\game_04.txt"
f = open(game, "r")
pos_number = 0
while True:
  line = f.readline()
  if not line: break
  if line.startswith("#"): continue
  if line.startswith("["): continue
  line = line.strip()
  print("\n----------")
  print("\nposition = " + str(pos_number) +\
    " | ", end="")
  tokens = line.split(" ")
  if tokens[1] == "w":
    print("white to move | ", end="")
  elif tokens[1] == "b":
    print("black to move | ", end="")
  print(" move # " + str(tokens[-1]))
  print("\nposition in FEN = ")
  print(line)
  stockfish.set_fen_position(line)
  vis_pos = stockfish.get_board_visual()
  print("\nposition: ")
  print(vis_pos)

  curr_eval = stockfish.get_evaluation()
  print("Position evaluation = ", end="")
  print(curr_eval)

  bms = stockfish.get_top_moves(5)
  print("\nBest moves in this position:")
  for i in range(len(bms)):
    print(bms[i])

  pos_number += 1
  print("\n----------")
  # input()
f.close()

print("\nEnd analysis ")

Data:

# game_04.txt
#
# [Event "Champions Showdown in Saint Louis (Blitz)"]
# [Site "St Louis, MO USA"]
# [Date "2017.11.12"]
# [EventDate "2017.10.21"]
# [Round "12.1"]
# [Result "1-0"]
# [White "Hikaru Nakamura"]
# [Black "Veselin Topalov"]
# [ECO "C26"]
# [WhiteElo "2774"]
# [BlackElo "2749"]
# [PlyCount "43"]
#
# 1. e4 e5 2. Nc3 Nf6 3. Bc4 Bc5 4. d3 c6 5. Bb3 d6 6. Nf3 O-O
# 7. h3 Nbd7 8. O-O Bb6 9. a3 Nc5 10. Ba2 Ne6 11. Ne2 Re8
# 12. Be3 Bxe3 13. fxe3 Qc7 14. Nh4 Qd8 15. Nf3 Bd7 16. Ng3 g6
# 17. d4 Qc7 18. Nh4 Qd8 19. Qf3 Kg7 20. Nhf5+ gxf5 21. Nxf5+
# Kg6 22. Bxe6 1-0
#
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2
rnbqkbnr/pppp1ppp/8/4p3/4P3/2N5/PPPP1PPP/R1BQKBNR b KQkq - 1 2
rnbqkb1r/pppp1ppp/5n2/4p3/4P3/2N5/PPPP1PPP/R1BQKBNR w KQkq - 2 3
rnbqkb1r/pppp1ppp/5n2/4p3/2B1P3/2N5/PPPP1PPP/R1BQK1NR b KQkq - 3 3
rnbqk2r/pppp1ppp/5n2/2b1p3/2B1P3/2N5/PPPP1PPP/R1BQK1NR w KQkq - 4 4
rnbqk2r/pppp1ppp/5n2/2b1p3/2B1P3/2NP4/PPP2PPP/R1BQK1NR b KQkq - 0 4
rnbqk2r/pp1p1ppp/2p2n2/2b1p3/2B1P3/2NP4/PPP2PPP/R1BQK1NR w KQkq - 0 5
rnbqk2r/pp1p1ppp/2p2n2/2b1p3/4P3/1BNP4/PPP2PPP/R1BQK1NR b KQkq - 1 5
rnbqk2r/pp3ppp/2pp1n2/2b1p3/4P3/1BNP4/PPP2PPP/R1BQK1NR w KQkq - 0 6
rnbqk2r/pp3ppp/2pp1n2/2b1p3/4P3/1BNP1N2/PPP2PPP/R1BQK2R b KQkq - 1 6
rnbq1rk1/pp3ppp/2pp1n2/2b1p3/4P3/1BNP1N2/PPP2PPP/R1BQK2R w KQ - 2 7
rnbq1rk1/pp3ppp/2pp1n2/2b1p3/4P3/1BNP1N1P/PPP2PP1/R1BQK2R b KQ - 0 7
r1bq1rk1/pp1n1ppp/2pp1n2/2b1p3/4P3/1BNP1N1P/PPP2PP1/R1BQK2R w KQ - 1 8
r1bq1rk1/pp1n1ppp/2pp1n2/2b1p3/4P3/1BNP1N1P/PPP2PP1/R1BQ1RK1 b - - 2 8
r1bq1rk1/pp1n1ppp/1bpp1n2/4p3/4P3/1BNP1N1P/PPP2PP1/R1BQ1RK1 w - - 3 9
r1bq1rk1/pp1n1ppp/1bpp1n2/4p3/4P3/PBNP1N1P/1PP2PP1/R1BQ1RK1 b - - 0 9
r1bq1rk1/pp3ppp/1bpp1n2/2n1p3/4P3/PBNP1N1P/1PP2PP1/R1BQ1RK1 w - - 1 10
r1bq1rk1/pp3ppp/1bpp1n2/2n1p3/4P3/P1NP1N1P/BPP2PP1/R1BQ1RK1 b - - 2 10
r1bq1rk1/pp3ppp/1bppnn2/4p3/4P3/P1NP1N1P/BPP2PP1/R1BQ1RK1 w - - 3 11
r1bq1rk1/pp3ppp/1bppnn2/4p3/4P3/P2P1N1P/BPP1NPP1/R1BQ1RK1 b - - 4 11
r1bqr1k1/pp3ppp/1bppnn2/4p3/4P3/P2P1N1P/BPP1NPP1/R1BQ1RK1 w - - 5 12
r1bqr1k1/pp3ppp/1bppnn2/4p3/4P3/P2PBN1P/BPP1NPP1/R2Q1RK1 b - - 6 12
r1bqr1k1/pp3ppp/2ppnn2/4p3/4P3/P2PbN1P/BPP1NPP1/R2Q1RK1 w - - 0 13
r1bqr1k1/pp3ppp/2ppnn2/4p3/4P3/P2PPN1P/BPP1N1P1/R2Q1RK1 b - - 0 13
r1b1r1k1/ppq2ppp/2ppnn2/4p3/4P3/P2PPN1P/BPP1N1P1/R2Q1RK1 w - - 1 14
r1b1r1k1/ppq2ppp/2ppnn2/4p3/4P2N/P2PP2P/BPP1N1P1/R2Q1RK1 b - - 2 14
r1bqr1k1/pp3ppp/2ppnn2/4p3/4P2N/P2PP2P/BPP1N1P1/R2Q1RK1 w - - 3 15
r1bqr1k1/pp3ppp/2ppnn2/4p3/4P3/P2PPN1P/BPP1N1P1/R2Q1RK1 b - - 4 15
r2qr1k1/pp1b1ppp/2ppnn2/4p3/4P3/P2PPN1P/BPP1N1P1/R2Q1RK1 w - - 5 16
r2qr1k1/pp1b1ppp/2ppnn2/4p3/4P3/P2PPNNP/BPP3P1/R2Q1RK1 b - - 6 16
r2qr1k1/pp1b1p1p/2ppnnp1/4p3/4P3/P2PPNNP/BPP3P1/R2Q1RK1 w - - 0 17
r2qr1k1/pp1b1p1p/2ppnnp1/4p3/3PP3/P3PNNP/BPP3P1/R2Q1RK1 b - - 0 17
r3r1k1/ppqb1p1p/2ppnnp1/4p3/3PP3/P3PNNP/BPP3P1/R2Q1RK1 w - - 1 18
r3r1k1/ppqb1p1p/2ppnnp1/4p3/3PP2N/P3P1NP/BPP3P1/R2Q1RK1 b - - 2 18
r2qr1k1/pp1b1p1p/2ppnnp1/4p3/3PP2N/P3P1NP/BPP3P1/R2Q1RK1 w - - 3 19
r2qr1k1/pp1b1p1p/2ppnnp1/4p3/3PP2N/P3PQNP/BPP3P1/R4RK1 b - - 4 19
r2qr3/pp1b1pkp/2ppnnp1/4p3/3PP2N/P3PQNP/BPP3P1/R4RK1 w - - 5 20
r2qr3/pp1b1pkp/2ppnnp1/4pN2/3PP3/P3PQNP/BPP3P1/R4RK1 b - - 6 20
r2qr3/pp1b1pkp/2ppnn2/4pp2/3PP3/P3PQNP/BPP3P1/R4RK1 w - - 0 21
r2qr3/pp1b1pkp/2ppnn2/4pN2/3PP3/P3PQ1P/BPP3P1/R4RK1 b - - 0 21
r2qr3/pp1b1p1p/2ppnnk1/4pN2/3PP3/P3PQ1P/BPP3P1/R4RK1 w - - 1 22
r2qr3/pp1b1p1p/2ppBnk1/4pN2/3PP3/P3PQ1P/1PP3P1/R4RK1 b - - 0 22

Regression Example Using LightGBM (Light Gradient Boosting Machine)

I’ve been looking at the LightGBM (light gradient boosting machine) system lately. One morning before work, I figured I’d zap out a regression demo.

LightGBM is a sophisticated tree-based system that can perform classification (multi-class and binary), regression, and ranking.

There are three programming language interfaces to LightGBM — C, Python, R. I like the relatively easy-to-use Python scikit-learn API. LightGBM isn’t installed by default with the Anaconda Python distribution I use, so I installed it with the command “pip install lightgbm”.

For my demo, I used one of my standard synthetic datasets. The regression problem goal is to predict income from sex, age, State, and political leaning. The 240-item tab-delimited raw data looks like:

F   24   michigan   29500.00   liberal
M   39   oklahoma   51200.00   moderate
F   63   nebraska   75800.00   conservative
M   36   michigan   44500.00   moderate
F   27   nebraska   28600.00   liberal
. . .

For LightGBM, it’s best to use ordinal encoding for categorical variables. I encoded the sex variable as M = 0 and F = 1. I encoded State as Michigan = 0, Nebraska = 1, Oklahoma = 2. I encoded politics as conservative = 0, moderate = 1, liberal = 2.
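
For clarity, here is a minimal sketch of that encoding as a few Python dictionaries (the names are mine; the demo data was actually encoded ahead of time):

# ordinal encoding for the raw People data (sketch)
sex_map = {"M": 0, "F": 1}
state_map = {"michigan": 0, "nebraska": 1, "oklahoma": 2}
politics_map = {"conservative": 0, "moderate": 1, "liberal": 2}

raw = "F   24   michigan   29500.00   liberal".split()
encoded = [sex_map[raw[0]], int(raw[1]), state_map[raw[2]],
  float(raw[3]), politics_map[raw[4]]]
print(encoded)  # [1, 24, 0, 29500.0, 2]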

Because LGBM is tree-based, it’s not necessary to normalize numeric data.

I split the encoded data into a 200-item set of training data and a 40-item set of test data. The resulting comma-delimited encoded data looks like:

1, 24, 0, 29500.00, 2
0, 39, 2, 51200.00, 1
1, 63, 1, 75800.00, 0
0, 36, 0, 44500.00, 1
1, 27, 1, 28600.00, 2
. . .

The key statements of my demo program are:

import numpy as np
import lightgbm as lgbm  # Python scikit API

train_file = ".\\Data\\people_train.txt"
# sex, age, State, income, politics
#  0    1     2       3       4
x_train = np.loadtxt(train_file, usecols=[0,1,2,4],
  delimiter=",", comments="#", dtype=np.float64)
y_train = np.loadtxt(train_file, usecols=3,
  delimiter=",", comments="#", dtype=np.float64)

params = {
  'objective': 'regression', # not required
  'boosting_type': 'gbdt',  # default
  'num_leaves': 31,  # default
  'learning_rate': 0.05,  # default = 0.10
  'min_data_in_leaf': 2,  # default = 20
  'random_state': 99,  # default = None
  'verbosity': -1
}
model = lgbm.LGBMRegressor(**params) 
model.fit(x_train, y_train)

The main challenge when using LightGBM is wading through the dozens of parameters. There are 57 Learning Control Parameters (min_data_in_leaf, bagging_fraction, etc.), and the LGBMRegressor module has 19 parameters, for a total of 76 parameters to deal with. Here are the 19 model parameters:

boosting_type='gbdt', 
num_leaves=31,
max_depth=-1,
learning_rate=0.1,
n_estimators=100,
subsample_for_bin=200000,
objective=None,
class_weight=None,
min_split_gain=0.0,
min_child_weight=0.001,
min_child_samples=20,
subsample=1.0,
subsample_freq=0,
colsample_bytree=1.0,
reg_alpha=0.0,
reg_lambda=0.0,
random_state=None,
n_jobs=None,
importance_type='split',
**kwargs

Because the number of parameters is so large, you must rely on the default values and then try to find the handful of parameters that will create a good model. For my demo, I changed the learning rate from the default 0.10 to 0.05, changed random_state from the default None to an arbitrary value of 99 to get reproducible results, and changed min_data_in_leaf from the default of 20 to 2 — it had a big effect. I also set verbosity to -1 to suppress messages, but in a non-demo scenario you really want to see all system warning and error messages. The near-impossibility of fully understanding all the LightGBM parameters and their interactions is the main reason why I rarely use LightGBM.

Anyway, the LightGBM model predicted the 40-item test data with 85% accuracy (34 out of 40 correct), where a correct income prediction is one that’s within 10% of the true income. This is roughly comparable to the accuracy achieved by a neural network regression model. When LightGBM works, it often works very well. However, tree-based systems are highly susceptible to overfitting, although the LightGBM algorithms mitigate this.



There are many ways to generate income. A well-known movie theme is the attractive woman who is after a rich man — the “gold digger”. Here are three gold digger comedies that I like.

In “Heartbreakers” (2001) Max (actress Sigourney Weaver) and Page (actress Jennifer Love Hewitt) are a mother-daughter team of con artists. They conspire to get tycoon William Tensey (actor Gene Hackman) to propose marriage to Max. This movie has some scenes that I thought were hilarious.

In “Tommy Boy” (1995), widower “Big Tom” Callahan is the wealthy owner of an automobile parts company. Beverly (actress Bo Derek) tricks Big Tom into marrying her. Her plan is foiled by his son Tommy (actor Chris Farley), who has a heart of gold and a head of lead — but who rises to the challenge when the chips are down. Another movie with some very funny scenes, and it has become something of a cult favorite.

In “Gentlemen Prefer Blondes” (1953), Lorelei Lee (actress Marilyn Monroe) and her friend Dorothy (actress Jane Russell) are showgirls looking for husbands. Lorelei, who has a good heart, meets and falls in love with young scion Gus Esmond. Great scene at the end of the movie where the rich father of Gus says, “Young lady, you don’t fool me one bit. Have you got the nerve to stand there and expect me to believe that you don’t want to marry my son for his money?” Lorelei replies, “Of course not. I want to marry him for YOUR money.” And they all live happily ever after. A famous, iconic, and quite entertaining movie.


Demo program. Replace the “lt” (less than) with Boolean operator symbol.

# people_income_lgbm.py

import numpy as np
import lightgbm as lgbm

def accuracy(model, data_x, data_y, pct_close):
  n = len(data_x)
  n_correct = 0; n_wrong = 0
  for i in range(n):
    x = data_x[i].reshape(1, -1)
    y = data_y[i]  # true income
    pred = model.predict(x)  # predicted income []
    if np.abs(pred[0] - y) "lt" np.abs(pct_close * y):
      n_correct += 1
    else:
      n_wrong += 1
  return (n_correct * 1.0) / (n_correct + n_wrong)

def main():
  # 0. get started
  print("\nBegin People predict income using LightGBM ")
  print("Predict income from sex, age, State, politics ")
  np.random.seed(1)

  # 1. load data
  # sex, age, State, income, politics
  #  0    1     2       3       4
  print("\nLoading train and test data ")
  train_file = ".\\Data\\people_train.txt"
  test_file = ".\\Data\\people_test.txt"

  x_train = np.loadtxt(train_file, usecols=[0,1,2,4],
    delimiter=",", comments="#", dtype=np.float64)
  y_train = np.loadtxt(train_file, usecols=3,
    delimiter=",", comments="#", dtype=np.float64)

  x_test = np.loadtxt(test_file, usecols=[0,1,2,4],
    delimiter=",", comments="#", dtype=np.float64)
  y_test = np.loadtxt(test_file, usecols=3,
    delimiter=",", comments="#", dtype=np.float64)

  np.set_printoptions(precision=0, suppress=True)
  print("\nFirst few train data: ")
  for i in range(3):
    print(x_train[i], end="")
    print("  | " + str(y_train[i]))
  print(". . . ")

  # 2. create and train model
  print("\nCreating and training LightGBM regression model ")
  params = {
    'objective': 'regression',  # not required
    'boosting_type': 'gbdt',  # default
    'num_leaves': 31,  # default
    'learning_rate': 0.05,  # default = 0.10
    'feature_fraction': 1.0,  # default
    'min_data_in_leaf': 2,  # default = 20
    'random_state': 99,
    'verbosity': -1
  }
  model = lgbm.LGBMRegressor(**params)  # scikit API
  model.fit(x_train, y_train)
  print("Done ")

  # 3. evaluate model
  print("\nEvaluating model accuracy (within 0.10) ")
  acc_train = accuracy(model, x_train, y_train, 0.10)
  print("accuracy on train data = %0.4f " % acc_train)
  acc_test = accuracy(model, x_test, y_test, 0.10)
  print("accuracy on test data = %0.4f " % acc_test)

  # 4. use model
  print("\nPredicting income for M 35 Oklahoma moderate ")
  x = np.array([[0, 35, 2, 1]], dtype=np.float64)
  y_pred = model.predict(x)
  print("\nPredicted income = %0.2f " % y_pred[0])

  print("\nEnd demo ")

if __name__ == "__main__":
  main()

Training data:

# people_train.txt
# sex (M = 0, F = 1)
# age
# State (Michigan = 0, Nebraska = 1, Oklahoma = 2)
# income
# politics (conservative = 0, moderate = 1, liberal = 2)
#
1,24,0,29500.00,2
0,39,2,51200.00,1
1,63,1,75800.00,0
0,36,0,44500.00,1
1,27,1,28600.00,2
1,50,1,56500.00,1
1,50,2,55000.00,1
0,19,2,32700.00,0
1,22,1,27700.00,1
0,39,2,47100.00,2
1,34,0,39400.00,1
0,22,0,33500.00,0
1,35,2,35200.00,2
0,33,1,46400.00,1
1,45,1,54100.00,1
1,42,1,50700.00,1
0,33,1,46800.00,1
1,25,2,30000.00,1
0,31,1,46400.00,0
1,27,0,32500.00,2
1,48,0,54000.00,1
0,64,1,71300.00,2
1,61,1,72400.00,0
1,54,2,61000.00,0
1,29,0,36300.00,0
1,50,2,55000.00,1
1,55,2,62500.00,0
1,40,0,52400.00,0
1,22,0,23600.00,2
1,68,1,78400.00,0
0,60,0,71700.00,2
0,34,2,46500.00,1
0,25,2,37100.00,0
0,31,1,48900.00,1
1,43,2,48000.00,1
1,58,1,65400.00,2
0,55,1,60700.00,2
0,43,1,51100.00,1
0,43,2,53200.00,1
0,21,0,37200.00,0
1,55,2,64600.00,0
1,64,1,74800.00,0
0,41,0,58800.00,1
1,64,2,72700.00,0
0,56,2,66600.00,2
1,31,2,36000.00,1
0,65,2,70100.00,2
1,55,2,64300.00,0
0,25,0,40300.00,0
1,46,2,51000.00,1
0,36,0,53500.00,0
1,52,1,58100.00,1
1,61,2,67900.00,0
1,57,2,65700.00,0
0,46,1,52600.00,1
0,62,0,66800.00,2
1,55,2,62700.00,0
0,22,2,27700.00,1
0,50,0,62900.00,0
0,32,1,41800.00,1
0,21,2,35600.00,0
1,44,1,52000.00,1
1,46,1,51700.00,1
1,62,1,69700.00,0
1,57,1,66400.00,0
0,67,2,75800.00,2
1,29,0,34300.00,2
1,53,0,60100.00,0
0,44,0,54800.00,1
1,46,1,52300.00,1
0,20,1,30100.00,1
0,38,0,53500.00,1
1,50,1,58600.00,1
1,33,1,42500.00,1
0,33,1,39300.00,1
1,26,1,40400.00,0
1,58,0,70700.00,0
1,43,2,48000.00,1
0,46,0,64400.00,0
1,60,0,71700.00,0
0,42,0,48900.00,1
0,56,2,56400.00,2
0,62,1,66300.00,2
0,50,0,64800.00,1
1,47,2,52000.00,1
0,67,1,80400.00,2
0,40,2,50400.00,1
1,42,1,48400.00,1
1,64,0,72000.00,0
0,47,0,58700.00,2
1,45,1,52800.00,1
0,25,2,40900.00,0
1,38,0,48400.00,0
1,55,2,60000.00,1
0,44,0,60600.00,1
1,33,0,41000.00,1
1,34,2,39000.00,1
1,27,1,33700.00,2
1,32,1,40700.00,1
1,42,2,47000.00,1
0,24,2,40300.00,0
1,42,1,50300.00,1
1,25,2,28000.00,2
1,51,1,58000.00,1
0,55,1,63500.00,2
1,44,0,47800.00,2
0,18,0,39800.00,0
0,67,1,71600.00,2
1,45,2,50000.00,1
1,48,0,55800.00,1
0,25,1,39000.00,1
0,67,0,78300.00,1
1,37,2,42000.00,1
0,32,0,42700.00,1
1,48,0,57000.00,1
0,66,2,75000.00,2
1,61,0,70000.00,0
0,58,2,68900.00,1
1,19,0,24000.00,2
1,38,2,43000.00,1
0,27,0,36400.00,1
1,42,0,48000.00,1
1,60,0,71300.00,0
0,27,2,34800.00,0
1,29,1,37100.00,0
0,43,0,56700.00,1
1,48,0,56700.00,1
1,27,2,29400.00,2
0,44,0,55200.00,0
1,23,1,26300.00,2
0,36,1,53000.00,2
1,64,2,72500.00,0
1,29,2,30000.00,2
0,33,0,49300.00,1
0,66,1,75000.00,2
0,21,2,34300.00,0
1,27,0,32700.00,2
1,29,0,31800.00,2
0,31,0,48600.00,1
1,36,2,41000.00,1
1,49,1,55700.00,1
0,28,0,38400.00,0
0,43,2,56600.00,1
0,46,1,58800.00,1
1,57,0,69800.00,0
0,52,2,59400.00,1
0,31,2,43500.00,1
0,55,0,62000.00,2
1,50,0,56400.00,1
1,48,1,55900.00,1
0,22,2,34500.00,0
1,59,2,66700.00,0
1,34,0,42800.00,2
0,64,0,77200.00,2
1,29,2,33500.00,2
0,34,1,43200.00,1
0,61,0,75000.00,2
1,64,2,71100.00,0
0,29,0,41300.00,0
1,63,1,70600.00,0
0,29,1,40000.00,0
0,51,0,62700.00,1
0,24,2,37700.00,0
1,48,1,57500.00,1
1,18,0,27400.00,0
1,18,0,20300.00,2
1,33,1,38200.00,2
0,20,2,34800.00,0
1,29,2,33000.00,2
0,44,2,63000.00,0
0,65,2,81800.00,0
0,56,0,63700.00,2
0,52,2,58400.00,1
0,29,1,48600.00,0
0,47,1,58900.00,1
1,68,0,72600.00,2
1,31,2,36000.00,1
1,61,1,62500.00,2
1,19,1,21500.00,2
1,38,2,43000.00,1
0,26,0,42300.00,0
1,61,1,67400.00,0
1,40,0,46500.00,1
0,49,0,65200.00,1
1,56,0,67500.00,0
0,48,1,66000.00,1
1,52,0,56300.00,2
0,18,0,29800.00,0
0,56,2,59300.00,2
0,52,1,64400.00,1
0,18,1,28600.00,1
0,58,0,66200.00,2
0,39,1,55100.00,1
0,46,0,62900.00,1
0,40,1,46200.00,1
0,60,0,72700.00,2
1,36,1,40700.00,2
1,44,0,52300.00,1
1,28,0,31300.00,2
1,54,2,62600.00,0

Test data:

# people_test.txt
#
0,51,0,61200.00,1
0,32,1,46100.00,1
1,55,0,62700.00,0
1,25,2,26200.00,2
1,33,2,37300.00,2
0,29,1,46200.00,0
1,65,0,72700.00,0
0,43,1,51400.00,1
0,54,1,64800.00,2
1,61,1,72700.00,0
1,52,1,63600.00,0
1,30,1,33500.00,2
1,29,0,31400.00,2
0,47,2,59400.00,1
1,39,1,47800.00,1
1,47,2,52000.00,1
0,49,0,58600.00,1
0,63,2,67400.00,2
0,30,0,39200.00,0
0,61,2,69600.00,2
0,47,2,58700.00,1
1,30,2,34500.00,2
0,51,2,58000.00,1
0,24,0,38800.00,1
0,49,0,64500.00,1
1,66,2,74500.00,0
0,65,0,76900.00,0
0,46,1,58000.00,0
0,45,2,51800.00,1
0,47,0,63600.00,0
0,29,0,44800.00,0
0,57,2,69300.00,2
0,20,0,28700.00,2
0,35,0,43400.00,1
0,61,2,67000.00,2
0,31,2,37300.00,1
1,18,0,20800.00,2
1,26,2,29200.00,2
0,28,0,36400.00,2
0,59,2,69400.00,2

Clustering Mixed Categorical and Numeric Data Using k-Means With C#

Data clustering is the process of grouping data items together so that similar items are in the same group/cluster. For strictly numeric data, the k-means clustering technique is the simplest and most commonly used. For non-numeric (i.e., categorical) data, there are fairly complicated techniques that use entropy, Bayesian probability, or category utility. But clustering mixed categorical and numeric data is very tricky.

I use a technique for clustering mixed data that I haven’t seen described anywhere. Briefly, for numeric data, I use min-max normalization. For standard nominal categorical data, I use one-over-n-hot encoding. For binary categorical data, I use reduced one-over-n-hot encoding (0.0 and 0.5 rather than 0 and 1). For ordinal categorical data, I use equal-interval encoding. After normalizing and encoding this way, all values are between 0.0 and 1.0, so k-means can be used without modification.

The normalization and encoding is best explained using a concrete example. I created a synthetic 240-item dataset that looks like:

F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
. . .

Each line represents a person. The fields are sex, height, age, State, income, political leaning.

The encoded and normalized data looks like:

0.5, 0.25, 0.12, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.42, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.90, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.36, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.18, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
. . .

The sex variable is binary categorical so I encode M = 0.0 and F = 0.5.

The height variable is ordinal categorical so I use equal-interval encoding as short = 0.25, medium = 0.50, tall = 0.75.

The age variable is numeric so I use min-max normalization. The min age in the dataset is 18 and the max age is 68, so normalized age = (age - 18) / (68 - 18).

There are four possible values for the nominal categorical State variable so I encode them as Arkansas = 0.25 0 0 0, Colorado = 0 0.25 0 0, Delaware = 0 0 0.25 0, Illinois = 0 0 0 0.25.

The income variable is numeric so I use min-max normalization. The min income in the dataset is $20,300 and the max income is $81,800, so normalized income = (income - 20300) / (81800 - 20300).

There are three possible values for the nominal categorical politics variable so I encode them as conservative = 0.3333 0 0, moderate = 0 0.3333 0, liberal = 0 0 0.3333.
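
Pulling those rules together, here is a minimal sketch of the encoding for one raw row, written in Python (my own helper; the demo itself uses the C# NormAndEncode() function listed below):

# normalize and encode one raw row (sketch)
def encode_row(tokens):
  # tokens like ["F", "short", "24", "arkansas", "29500", "liberal"]
  row = [0.0] * 11
  row[0] = {"M": 0.0, "F": 0.5}[tokens[0]]        # binary
  row[1] = {"short": 0.25, "medium": 0.50,
            "tall": 0.75}[tokens[1]]              # ordinal
  row[2] = (float(tokens[2]) - 18.0) / (68.0 - 18.0)  # age min-max
  states = ["arkansas", "colorado", "delaware", "illinois"]
  row[3 + states.index(tokens[3])] = 0.25         # one-over-n-hot
  row[7] = (float(tokens[4]) - 20300.0) / (81800.0 - 20300.0)
  politics = ["conservative", "moderate", "liberal"]
  row[8 + politics.index(tokens[5])] = 0.3333     # one-over-n-hot
  return row

print(encode_row(["F", "short", "24", "arkansas", "29500", "liberal"]))
# matches the first encoded row shown above (income value ~ 0.1496)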

I fed the encoded and normalized data to a C# implementation of k-means clustering. I used k = 3 clusters and got this clustering:

Clustering with k=3 seed=0
Done

Result clustering:
  0  1  2  0  0  2  2  0  0  1  0  0  0  1  2  2 . . .
Result WCSS = 49.3195

The seed value controls the initial random cluster assignments. Different seed values should give very similar (but not necessarily identical) results. If different seed values give significantly different results, the k-means technique is not a good choice for the dataset.

The clustering result means item [0] belongs to cluster 0, item [1] belongs to cluster 1, item [2] belongs to cluster 2, item [3] belongs to cluster 0, and so on. The WCSS (within-cluster sum of squares) is the value that k-means attempts to minimize, so smaller values are better.
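
For reference, WCSS is the sum, over all items, of the squared Euclidean distance from each item to the mean/centroid of its assigned cluster. A minimal sketch in Python:

# within-cluster sum of squares (sketch)
def wcss(data, clustering, means):
  total = 0.0
  for i in range(len(data)):
    m = means[clustering[i]]  # mean of item i's cluster
    total += sum((x - mu) ** 2 for x, mu in zip(data[i], m))
  return total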

Another way to view the clustering results is by-cluster:

cluster 0 | count = 89 :
   0    3    4    7    8   10   11   12   17   18   19  . . .

cluster 1 | count = 77 :
   1    9   13   16   21   30   31   36   37   38   42  . . .

cluster 2 | count = 74 :
   2    5    6   14   15   20   22   23   25   26   27  . . .

This means data items [0], [3], [4], etc. are in cluster 0, and so on. A third way to view the results is source data by cluster. For cluster 1:

cluster 1:
[  1]  M tall 39 delaware 51200 moderate
[  9]  M tall 39 delaware 47100 liberal
[ 13]  M tall 33 colorado 46400 moderate
 . . .

So cluster 1 looks like the “tall male mid-30s” cluster. The demo program concludes by displaying the 3 means/centroids for the clusters:

Means:
[  0]    0.3  0.40  0.19  0.09  0.06  0.09  0.02  0.2568  0.1049  0.1123  0.1161
[  1]    0.0  0.63  0.66  0.08  0.06  0.08  0.03  0.6758  0.0390  0.1645  0.1299
[  2]    0.5  0.32  0.69  0.07  0.08  0.05  0.05  0.6542  0.1576  0.1531  0.0225

The data items assigned to cluster 0 average to (0.3, 0.40, 0.19, 0.09, 0.06, 0.09, 0.02, 0.2568, 0.1049, 0.1123, 0.1161). All the data items assigned to cluster 0 are closer to that mean/centroid vector than to the other two means/centroids. And so on.

Compared to specialized techniques for clustering mixed categorical and numeric data (such as k-prototypes clustering) an advantage of the technique described here is that you can use any k-means implementation. For example, I passed the normalized and encoded data to the scikit-learn library KMeans module and got identical results. I’ve listed that Python program at the very bottom of this post.
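
The scikit-learn call is only a few lines. This is a minimal sketch, assuming the encoded data has been saved to a file like the one used by the C# demo (the full program is at the bottom of the post); KMeans reports the WCSS as inertia_:

# k-means on the encoded mixed data via scikit-learn (sketch)
import numpy as np
from sklearn.cluster import KMeans

X = np.loadtxt(".\\Data\\people_encoded.txt",
  delimiter=",", comments="#")  # 240 x 11
km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
print(km.labels_[:16])  # cluster IDs for the first few items
print(km.inertia_)      # WCSS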



I’m a big fan of old 1950s science fiction movies. Here’s a cluster of three movies that I like, which feature very slow-moving threats.

Left: In “Caltiki the Immortal Monster” (1959), Caltiki is a big blob monster that lives in ancient Mayan ruins. He moves at a glacial pace, yet somehow manages to trap several archeologists. I watched this at least one hundred times on TV when I was young.

Center: In “From Hell It Came” (1957), native Kimo is falsely accused of murder and is executed. Kimo’s body is placed in a hollow tree stump — that has been exposed to radiation from atomic tests. Bad idea. Tree-Kimo may be the slowest threat in sci fi movie history, but I like this film anyway.

Right: In “The Creeping Unknown” (1955), also known as “The Quatermass Xperiment”, Dr. Quatermass oversees a first-men-into-space effort. Three go up. Only one returns. He’s infected with something and becomes a blob-like creature that threatens to grow until it overwhelms the planet.


Demo program. Replace “lt” (less than), “gt”, “lte”, “gte” with Boolean operator symbols. (My blog editor often chokes on these symbols).

using System;
using System.IO;
using System.Collections.Generic;

namespace ClusterMixedKMeans
{
  internal class ClusterMixedProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin mixed data k-means" +
        " using C# ");

      string rf =
        "..\\..\\..\\Data\\people_raw_space.txt";
      string[] rawFileArray = FileLoad(rf, "#");

      Console.WriteLine("\nRaw source data: ");
      for (int i = 0; i "lt" 4; ++i)
      {
        Console.Write("[" + i.ToString().PadLeft(3) + "]  ");
        Console.WriteLine(rawFileArray[i]);
      }
      Console.WriteLine(" . . . ");

      // preprocessed data version
      string fn =
        "..\\..\\..\\Data\\people_encoded.txt";
      double[][] X = MatLoad(fn,
        new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 },
        ',', "#");

      // programmatic version
      //string rf =
      //  "..\\..\\..\\Data\\people_raw_space.txt";
      //double[][] X = NormAndEncode(rf, ' ', "#");

      Console.WriteLine("\nEncoded data: ");
      // decimals to display
      int[] decs = new int[] { 1, 2,2,2,2,2,2, 4,4,4,4 };
      MatShow(X, decs, 4, true);

      Console.WriteLine("\nClustering with k=3 seed=0");
      KMeans km = new KMeans(X, k:3, seed:0);
      // km.trials = X.Length * 5; // set n trials explicit
      int[] clustering = km.Cluster();
      Console.WriteLine("Done ");
      
      Console.WriteLine("\nResult clustering: ");
      VecShow(clustering, 3, 16);
      Console.WriteLine("Result WCSS = " + 
        km.bestWCSS.ToString("F4"));

      List"lt"int"gt"[] clusterLists = 
        ItemsByCluster(clustering, k:3);
      Console.WriteLine("\nItem indices by cluster ID: ");
      ShowItemIndicesByCluster(clusterLists, 12);

      Console.WriteLine("\nSource data by cluster ID: ");
      ShowItemsByCluster(clusterLists, rawFileArray, 3);

      Console.WriteLine("\nMeans: ");
      MatShow(km.bestMeans, decs, nRows:3,
        showIndices:true);
 
      Console.WriteLine("\nEnd demo ");
      Console.ReadLine();
    } // Main

    // ------------------------------------------------------
    // helper: NormAndEncode() for this data only
    // ------------------------------------------------------

    static double[][] NormAndEncode(string fn, char delim,
      string comment)
    {
      // specific to this demo data
      // F,short,24,arkansas,29500,liberal
      // M,tall,39,delaware,51200,moderate
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      string line = "";
      string[] tokens = null;

      double[][] result = new double[240][];
      for (int k = 0; k "lt" 240; ++k)
        result[k] = new double[11];

      int i = 0;
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true) continue;
        line = line.Trim();
        tokens = line.Split(delim);

        // sex
        string sexStr = tokens[0].Trim();
        if (sexStr == "M") result[i][0] = 0.0;
        else if (sexStr == "F") result[i][0] = 0.5;
        // height
        string heightStr = tokens[1].Trim();
        if (heightStr == "short") result[i][1] = 0.25;
        else if (heightStr == "medium") result[i][1] = 0.50;
        else if (heightStr == "tall") result[i][1] = 0.75;
        // age
        double age = double.Parse(tokens[2].Trim());
        double ageMin = 18.0;
        double ageMax = 68.0;
        result[i][2] = (age - ageMin) / (ageMax - ageMin);
        // State
        string stateStr = tokens[3].Trim();
        if (stateStr == "arkansas") result[i][3] = 0.25;
        else if (stateStr == "colorado") result[i][4] = 0.25;
        else if (stateStr == "delaware") result[i][5] = 0.25;
        else if (stateStr == "illinois") result[i][6] = 0.25;
        // income
        double income = double.Parse(tokens[4]);
        double incomeMin = 20300.0;
        double incomeMax = 81800.0;
        result[i][7] = 
          (income - incomeMin) / (incomeMax - incomeMin);
        // political leaning
        string politicsStr = tokens[5].Trim();
        if (politicsStr == "conservative") 
          result[i][8] = 0.3333;
        else if (politicsStr == "moderate") 
          result[i][9] = 0.3333;
        else if (politicsStr == "liberal") 
          result[i][10] = 0.3333;

        ++i;  // next row
      }
      return result;
    }

    // ------------------------------------------------------
    // helpers specifically for k-means: ItemsByCluster(),
    // ShowItemIndicesByCluster(), ShowItemsByCluster()
    // ------------------------------------------------------

    static List"lt"int"gt"[] ItemsByCluster(int[] clustering,
      int k)
    {
      // this.clustering is like [2, 0, 1, 1, . . ]
      List"lt"int"gt"[] result = new List"lt"int"gt"[k];
      // array of Lists of int
      for (int cid = 0; cid "lt" k; ++cid)
        result[cid] = new List"lt"int"gt"();

      int n = clustering.Length;
      for (int i = 0; i "lt" n; ++i)
      {
        int clusterID = clustering[i];
        result[clusterID].Add(i);
      }
      return result;
    }

    // ------------------------------------------------------

    static void ShowItemIndicesByCluster(List"lt"int"gt"[]
      arr, int nItemsPerCluster)
    {
      // nItemsPerCluster limits display
      for (int cid = 0; cid "lt" arr.Length; ++cid)
      {
        Console.WriteLine("\ncluster " + cid + 
          " | count = " + arr[cid].Count + " : ");
        int nShow = nItemsPerCluster;  // local copy so the
        if (arr[cid].Count "lt" nShow) // param isn't clamped
          nShow = arr[cid].Count;      // across iterations
        for (int i = 0; i "lt" nShow; ++i)
        {
          Console.Write(arr[cid][i].ToString().
            PadLeft(4) + " ");
        }
        if (nShow "lt" arr[cid].Count)
          Console.Write(" . . . ");
        Console.WriteLine("");
      }
    }

    // ------------------------------------------------------

    static void ShowItemsByCluster(List"lt"int"gt"[] arr,
      string[] rawData, int nItemsPerCluster)
    {
      // nItemsPerCluster limits display
      for (int cid = 0; cid "lt" arr.Length; ++cid)
      {
        Console.WriteLine("\ncluster " + cid + ": ");
        int nShow = nItemsPerCluster;  // local copy
        if (arr[cid].Count "lt" nShow)
          nShow = arr[cid].Count;
        for (int i = 0; i "lt" nShow; ++i)
        {
          int idx = arr[cid][i];
          string s = rawData[idx];
          Console.Write("[" + idx.ToString().
            PadLeft(3) + "]  ");
          Console.WriteLine(s);
        }
        if (nShow "lt" arr[cid].Count)
          Console.WriteLine(" . . . ");
        else Console.WriteLine("");
      }
    }

    // ------------------------------------------------------
    // general helpers:
    // MatShow(), VecShow(), FileLoad(), MatLoad()
    // ------------------------------------------------------

    // ------------------------------------------------------

    static void MatShow(double[][] m, int[] decs,
      int nRows, bool showIndices)
    {
      // decs[] = number decimals to display for each column
      for (int i = 0; i "lt" nRows; ++i)
      {
        if (showIndices == true)
          Console.Write("[" + i.ToString().
            PadLeft(3) + "]  ");
        for (int j = 0; j "lt" m[0].Length; ++j)
        {
          double v = m[i][j];
          Console.Write(v.ToString("F" + decs[j]).
            PadLeft(decs[j] + 4));
        }
        Console.WriteLine("");
      }
      if (nRows "lt" m.Length)
        Console.WriteLine(" . . . ");
    }

    // ------------------------------------------------------

    static void VecShow(int[] vec, int wid, int nItems)
    {
      if (vec.Length "lt" nItems) nItems = vec.Length;
      for (int i = 0; i "lt" nItems; ++i)
      {
        Console.Write(vec[i].ToString().PadLeft(wid));
      }
      if (nItems "lt" vec.Length) Console.Write(" . . . ");
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    static string[] FileLoad(string fn, string comment)
    {
      List"lt"string"gt" lst = new List"lt"string"gt"();
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      string line = "";
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment)) continue;
        line = line.Trim();
        lst.Add(line);
      }
      sr.Close(); ifs.Close();
      string[] result = lst.ToArray();
      return result;
    }

    // ------------------------------------------------------

    static double[][] MatLoad(string fn, int[] usecols,
      char sep, string comment)
    {
      // self-contained
      int nRows = 0;
      string line = "";
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      while ((line = sr.ReadLine()) != null)
        if (line.StartsWith(comment) == false)
          ++nRows;
      sr.Close(); ifs.Close();  // could reset fp instead

      int nCols = usecols.Length;
      double[][] result = new double[nRows][];
      for (int r = 0; r "lt" nRows; ++r)
        result[r] = new double[nCols];

      line = "";
      string[] tokens = null;
      ifs = new FileStream(fn, FileMode.Open);
      sr = new StreamReader(ifs);

      int i = 0;
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true)
          continue;
        tokens = line.Split(sep);
        for (int j = 0; j "lt" nCols; ++j)
        {
          int k = usecols[j];  // into tokens
          result[i][j] = double.Parse(tokens[k]);
        }
        ++i;
      }
      sr.Close(); ifs.Close();
      return result;
    }

    // ------------------------------------------------------
  } // Program

  public class KMeans
  {
    // all members public for easier debugging
    public double[][] data;
    public int k;
    public int N;
    public int dim;
    public int trials;  // to find best
    public int maxIter; // inner loop
    public Random rnd;
    public int[] clustering; // scratch not final
    public double[][] means; // scratch not final

    public int[] bestClustering;
    public double[][] bestMeans; // allocated in Cluster()
    public double bestWCSS;

    // ------------------------------------------------------
    // public methods:
    //   KMeans(), Cluster()
    //
    // private methods:
    //   Initialize(), Shuffle(), SumSquared(), WCSS(),
    //   EucDistance(), ArgMin(), AreEqual(), Copy(),
    //   UpdateMeans(), UpdateClustering(), ClusterOnce()
    // ------------------------------------------------------

    public KMeans(double[][] data, int k, int seed)
    {
      this.data = data;  // by ref
      this.k = k;  // assumes k is 2 or greater
      this.N = data.Length;
      this.dim = data[0].Length;
      this.trials = N * 5;   // for Cluster()
      this.maxIter = N * 2;  // sanity for ClusterOnce()
      this.Initialize(seed); // seed, means, clustering
    }

    public int[] Cluster()
    {
      // special case k = 1
      if (this.k == 1)
      {
        // single mean of all data
        // (reset the scratch means first -- Initialize()
        // has already written values into them)
        for (int j = 0; j "lt" this.dim; ++j)
          this.means[0][j] = 0.0;
        for (int i = 0; i "lt" this.N; ++i)
          for (int j = 0; j "lt" this.dim; ++j)
            this.means[0][j] += this.data[i][j];
        for (int j = 0; j "lt" this.dim; ++j)
          this.means[0][j] /= this.N;
        this.bestMeans = Copy(this.means);

        // all items belong to cluster 0
        for (int i = 0; i "lt" this.N; ++i)
          this.clustering[i] = 0;

        // WCSS
        double wcss = 0.0;
        for (int i = 0; i "lt" this.N; ++i)
          wcss += SumSquared(this.bestMeans[0],
            this.data[i]);
        this.bestWCSS = wcss;

        return this.clustering;
      }

      // k = 2 or greater
      this.bestWCSS = this.WCSS();  // initial clustering
      this.bestClustering = Copy(this.clustering);
      this.bestMeans = Copy(this.means);

      for (int i = 0; i "lt" this.trials; ++i)
      {
        this.Initialize(i);  // new seed, means, clustering
        int[] clustering = this.ClusterOnce();
        double wcss = this.WCSS();
        if (wcss "lt" this.bestWCSS)
        {
          this.bestWCSS = wcss;
          this.bestClustering = Copy(clustering);
          this.bestMeans = Copy(this.means);
        }
      }
      return this.bestClustering;
    } // Cluster()

    private int[] ClusterOnce()
    {
      int sanityCt = 1;
      while (sanityCt "lte" this.maxIter)  // N * 2
      {
        // stop when the clustering no longer changes, or
        // when an update would produce an empty cluster
        if (this.UpdateClustering() == false) break;
        if (this.UpdateMeans() == false) break;
        ++sanityCt;
      }
      // consider warning if sanityCt "gt" maxIter
      return this.clustering;
    } // ClusterOnce()

    private void Initialize(int seed)
    {
      this.rnd = new Random(seed);
      this.clustering = new int[this.N];  // scratch
      this.means = new double[this.k][];  // scratch
      for (int i = 0; i "lt" this.k; ++i)
        this.means[i] = new double[this.dim];

      // initial clustering
      // Random Partition (not Forgy or k-means++)
      int[] indices = new int[this.N];
      for (int i = 0; i "lt" this.N; ++i)
        indices[i] = i;
      Shuffle(indices);
      for (int i = 0; i "lt" this.k; ++i)  // first k items
        this.clustering[indices[i]] = i;
      for (int i = this.k; i "lt" this.N; ++i)
        this.clustering[indices[i]] =
          this.rnd.Next(0, this.k); // remaining items
      this.UpdateMeans();
    }

    private void Shuffle(int[] indices)
    {
      // Fisher-Yates mini-algorithm
      int n = indices.Length;
      for (int i = 0; i "lt" n; ++i)
      {
        int r = this.rnd.Next(i, n);
        int tmp = indices[i];
        indices[i] = indices[r];
        indices[r] = tmp;
      }
    }

    private static double SumSquared(double[] v1,
      double[] v2)
    {
      // used by EucDistance() and WCSS()
      int dim = v1.Length;
      double sum = 0.0;
      for (int i = 0; i "lt" dim; ++i)
        sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
      return sum;
    }

    private static double EucDistance(double[] item,
      double[] mean)
    {
      double ss = SumSquared(item, mean);
      return Math.Sqrt(ss);
    }

    private static int ArgMin(double[] v)
    {
      // index of smallest value in v
      int minIdx = 0;
      double minVal = v[0];
      for (int i = 1; i "lt" v.Length; ++i)
      {
        if (v[i] "lt" minVal)
        {
          minVal = v[i];
          minIdx = i;
        }
      }
      return minIdx;
    }

    private static bool AreEqual(int[] a1, int[] a2)
    {
      // to check if clustering has changed
      int dim = a1.Length;
      for (int i = 0; i "lt" dim; ++i)
        if (a1[i] != a2[i]) return false;
      return true;
    }

    private static int[] Copy(int[] arr)
    {
      // called by Cluster()
      // make a copy of new best clustering
      int dim = arr.Length;
      int[] result = new int[dim];
      for (int i = 0; i "lt" dim; ++i)
        result[i] = arr[i];
      return result;
    }

    private static double[][] Copy(double[][] matrix)
    {
      // make a copy of new best means
      int nr = matrix.Length;
      int nc = matrix[0].Length;
      double[][] result = new double[nr][];
      for (int i = 0; i "lt" nr; ++i)
        result[i] = new double[nc];
      for (int i = 0; i "lt" nr; ++i)
        for (int j = 0; j "lt" nc; ++j)
          result[i][j] = matrix[i][j];
      return result;
    }

    private bool UpdateMeans()
    {
      // first, verify no zero-counts
      // should never happen
      int[] counts = new int[this.k];
      for (int i = 0; i "lt" this.N; ++i)
      {
        int cid = this.clustering[i];
        ++counts[cid];
      }
      for (int kk = 0; kk "lt" this.k; ++kk)
      {
        if (counts[kk] == 0)
          throw
            new Exception("0-count in UpdateMeans()");
      }

      // compute proposed new means
      for (int kk = 0; kk "lt" this.k; ++kk)
        counts[kk] = 0;  // reset
      double[][] newMeans = new double[this.k][];
      for (int i = 0; i "lt" this.k; ++i)
        newMeans[i] = new double[this.dim];
      for (int i = 0; i "lt" this.N; ++i)
      {
        int cid = this.clustering[i];
        ++counts[cid];
        for (int j = 0; j "lt" this.dim; ++j)
          newMeans[cid][j] += this.data[i][j];
      }
      for (int kk = 0; kk "lt" this.k; ++kk)
        if (counts[kk] == 0)
          return false;  // bad attempt to update

      for (int kk = 0; kk "lt" this.k; ++kk)
        for (int j = 0; j "lt" this.dim; ++j)
          newMeans[kk][j] /= counts[kk];

      // copy new means
      for (int kk = 0; kk "lt" this.k; ++kk)
        for (int j = 0; j "lt" this.dim; ++j)
          this.means[kk][j] = newMeans[kk][j];

      return true;
    } // UpdateMeans()

    private bool UpdateClustering()
    {
      // first, verify no zero-counts
      int[] counts = new int[this.k];
      for (int i = 0; i "lt" this.N; ++i)
      {
        int cid = this.clustering[i];
        ++counts[cid];
      }
      // should never happen
      for (int kk = 0; kk "lt" this.k; ++kk)
      {
        if (counts[kk] == 0)
          throw new
            Exception("0-count in UpdateClustering()");
      }

      // proposed new clustering
      int[] newClustering = new int[this.N];
      for (int i = 0; i "lt" this.N; ++i)
        newClustering[i] = this.clustering[i];

      double[] distances = new double[this.k];
      for (int i = 0; i "lt" this.N; ++i)
      {
        for (int kk = 0; kk "lt" this.k; ++kk)
        {
          distances[kk] =
            EucDistance(this.data[i], this.means[kk]);
          int newID = ArgMin(distances);
          newClustering[i] = newID;
        }
      }

      if (AreEqual(this.clustering, newClustering) == true)
        return false;  // no change; short-circuit

      // make sure no count went to 0
      for (int i = 0; i "lt" this.k; ++i)
        counts[i] = 0;  // reset
      for (int i = 0; i "lt" this.N; ++i)
      {
        int cid = newClustering[i];
        ++counts[cid];
      }
      for (int kk = 0; kk "lt" this.k; ++kk)
        if (counts[kk] == 0)
          return false;  // bad update attempt

      // no 0 counts so update
      for (int i = 0; i "lt" this.N; ++i)
        this.clustering[i] = newClustering[i];

      return true;
    } // UpdateClustering()
    
    private double WCSS()
    {
      // within-cluster sum of squares
      double sum = 0.0;
      for (int i = 0; i "lt" this.N; ++i)
      {
        int cid = this.clustering[i];
        double[] mean = this.means[cid];
        double ss = SumSquared(this.data[i], mean);
        sum += ss;
      }
      return sum;
    }

  } // class KMeans
} // ns

Raw data:

# people_raw_space.txt
# space delimited
#
F short 24 arkansas 29500 liberal
M tall 39 delaware 51200 moderate
F short 63 colorado 75800 conservative
M medium 36 illinois 44500 moderate
F short 27 colorado 28600 liberal
F short 50 colorado 56500 moderate
F medium 50 illinois 55000 moderate
M tall 19 delaware 32700 conservative
F short 22 illinois 27700 moderate
M tall 39 delaware 47100 liberal
F short 34 arkansas 39400 moderate
M medium 22 illinois 33500 conservative
F medium 35 delaware 35200 liberal
M tall 33 colorado 46400 moderate
F short 45 colorado 54100 moderate
F short 42 illinois 50700 moderate
M tall 33 colorado 46800 moderate
F tall 25 delaware 30000 moderate
M medium 31 colorado 46400 conservative
F short 27 arkansas 32500 liberal
F short 48 illinois 54000 moderate
M tall 64 illinois 71300 liberal
F medium 61 colorado 72400 conservative
F short 54 illinois 61000 conservative
F short 29 arkansas 36300 conservative
F short 50 delaware 55000 moderate
F medium 55 illinois 62500 conservative
F medium 40 illinois 52400 conservative
F short 22 arkansas 23600 liberal
F short 68 colorado 78400 conservative
M tall 60 illinois 71700 liberal
M tall 34 delaware 46500 moderate
M medium 25 delaware 37100 conservative
M short 31 illinois 48900 moderate
F short 43 delaware 48000 moderate
F short 58 colorado 65400 liberal
M tall 55 illinois 60700 liberal
M tall 43 colorado 51100 moderate
M tall 43 delaware 53200 moderate
M medium 21 arkansas 37200 conservative
F short 55 delaware 64600 conservative
F short 64 colorado 74800 conservative
M tall 41 illinois 58800 moderate
F medium 64 delaware 72700 conservative
M medium 56 illinois 66600 liberal
F short 31 delaware 36000 moderate
M tall 65 delaware 70100 liberal
F tall 55 illinois 64300 conservative
M short 25 arkansas 40300 conservative
F short 46 delaware 51000 moderate
M tall 36 illinois 53500 conservative
F short 52 illinois 58100 moderate
F short 61 delaware 67900 conservative
F short 57 delaware 65700 conservative
M tall 46 colorado 52600 moderate
M tall 62 arkansas 66800 liberal
F short 55 illinois 62700 conservative
M medium 22 delaware 27700 moderate
M tall 50 illinois 62900 conservative
M tall 32 illinois 41800 moderate
M short 21 delaware 35600 conservative
F medium 44 colorado 52000 moderate
F short 46 illinois 51700 moderate
F short 62 colorado 69700 conservative
F short 57 illinois 66400 conservative
M medium 67 illinois 75800 liberal
F short 29 arkansas 34300 liberal
F short 53 illinois 60100 conservative
M tall 44 arkansas 54800 moderate
F medium 46 colorado 52300 moderate
M tall 20 illinois 30100 moderate
M medium 38 illinois 53500 moderate
F short 50 colorado 58600 moderate
F short 33 colorado 42500 moderate
M tall 33 colorado 39300 moderate
F short 26 colorado 40400 conservative
F short 58 arkansas 70700 conservative
F tall 43 illinois 48000 moderate
M medium 46 arkansas 64400 conservative
F short 60 arkansas 71700 conservative
M tall 42 arkansas 48900 moderate
M tall 56 delaware 56400 liberal
M short 62 colorado 66300 liberal
M short 50 arkansas 64800 moderate
F short 47 illinois 52000 moderate
M tall 67 colorado 80400 liberal
M tall 40 delaware 50400 moderate
F short 42 colorado 48400 moderate
F short 64 arkansas 72000 conservative
M medium 47 arkansas 58700 liberal
F medium 45 colorado 52800 moderate
M tall 25 delaware 40900 conservative
F short 38 arkansas 48400 conservative
F short 55 delaware 60000 moderate
M tall 44 arkansas 60600 moderate
F medium 33 arkansas 41000 moderate
F short 34 delaware 39000 moderate
F short 27 colorado 33700 liberal
F short 32 colorado 40700 moderate
F tall 42 illinois 47000 moderate
M short 24 delaware 40300 conservative
F short 42 colorado 50300 moderate
F short 25 delaware 28000 liberal
F short 51 colorado 58000 moderate
M medium 55 colorado 63500 liberal
F short 44 arkansas 47800 liberal
M short 18 arkansas 39800 conservative
M tall 67 colorado 71600 liberal
F short 45 delaware 50000 moderate
F short 48 arkansas 55800 moderate
M short 25 colorado 39000 moderate
M tall 67 arkansas 78300 moderate
F short 37 delaware 42000 moderate
M short 32 arkansas 42700 moderate
F short 48 arkansas 57000 moderate
M tall 66 delaware 75000 liberal
F tall 61 arkansas 70000 conservative
M medium 58 delaware 68900 moderate
F short 19 arkansas 24000 liberal
F short 38 delaware 43000 moderate
M medium 27 arkansas 36400 moderate
F short 42 arkansas 48000 moderate
F short 60 arkansas 71300 conservative
M tall 27 delaware 34800 conservative
F tall 29 colorado 37100 conservative
M medium 43 arkansas 56700 moderate
F medium 48 arkansas 56700 moderate
F medium 27 delaware 29400 liberal
M tall 44 arkansas 55200 conservative
F short 23 colorado 26300 liberal
M tall 36 colorado 53000 liberal
F short 64 delaware 72500 conservative
F short 29 delaware 30000 liberal
M short 33 arkansas 49300 moderate
M tall 66 colorado 75000 liberal
M medium 21 delaware 34300 conservative
F short 27 arkansas 32700 liberal
F short 29 arkansas 31800 liberal
M tall 31 arkansas 48600 moderate
F short 36 delaware 41000 moderate
F short 49 colorado 55700 moderate
M short 28 arkansas 38400 conservative
M medium 43 delaware 56600 moderate
M medium 46 colorado 58800 moderate
F short 57 arkansas 69800 conservative
M short 52 delaware 59400 moderate
M tall 31 delaware 43500 moderate
M tall 55 arkansas 62000 liberal
F short 50 arkansas 56400 moderate
F short 48 colorado 55900 moderate
M medium 22 delaware 34500 conservative
F short 59 delaware 66700 conservative
F short 34 arkansas 42800 liberal
M tall 64 arkansas 77200 liberal
F short 29 delaware 33500 liberal
M medium 34 colorado 43200 moderate
M medium 61 arkansas 75000 liberal
F short 64 delaware 71100 conservative
M short 29 arkansas 41300 conservative
F short 63 colorado 70600 conservative
M medium 29 colorado 40000 conservative
M tall 51 arkansas 62700 moderate
M tall 24 delaware 37700 conservative
F medium 48 colorado 57500 moderate
F short 18 arkansas 27400 conservative
F short 18 arkansas 20300 liberal
F short 33 colorado 38200 liberal
M medium 20 delaware 34800 conservative
F short 29 delaware 33000 liberal
M short 44 delaware 63000 conservative
M tall 65 delaware 81800 conservative
M tall 56 arkansas 63700 liberal
M medium 52 delaware 58400 moderate
M medium 29 colorado 48600 conservative
M tall 47 colorado 58900 moderate
F medium 68 arkansas 72600 liberal
F short 31 delaware 36000 moderate
F short 61 colorado 62500 liberal
F short 19 colorado 21500 liberal
F tall 38 delaware 43000 moderate
M tall 26 arkansas 42300 conservative
F short 61 colorado 67400 conservative
F short 40 arkansas 46500 moderate
M medium 49 arkansas 65200 moderate
F medium 56 arkansas 67500 conservative
M short 48 colorado 66000 moderate
F short 52 arkansas 56300 liberal
M tall 18 arkansas 29800 conservative
M tall 56 delaware 59300 liberal
M medium 52 colorado 64400 moderate
M medium 18 colorado 28600 moderate
M tall 58 arkansas 66200 liberal
M tall 39 colorado 55100 moderate
M tall 46 arkansas 62900 moderate
M medium 40 colorado 46200 moderate
M medium 60 arkansas 72700 liberal
F short 36 colorado 40700 liberal
F short 44 arkansas 52300 moderate
F short 28 arkansas 31300 liberal
F short 54 delaware 62600 conservative
M medium 51 arkansas 61200 moderate
M short 32 colorado 46100 moderate
F short 55 arkansas 62700 conservative
F short 25 delaware 26200 liberal
F medium 33 delaware 37300 liberal
M medium 29 colorado 46200 conservative
F short 65 arkansas 72700 conservative
M tall 43 colorado 51400 moderate
M short 54 colorado 64800 liberal
F short 61 colorado 72700 conservative
F short 52 colorado 63600 conservative
F short 30 colorado 33500 liberal
F short 29 arkansas 31400 liberal
M tall 47 delaware 59400 moderate
F short 39 colorado 47800 moderate
F short 47 delaware 52000 moderate
M medium 49 arkansas 58600 moderate
M tall 63 delaware 67400 liberal
M medium 30 arkansas 39200 conservative
M tall 61 delaware 69600 liberal
M medium 47 delaware 58700 moderate
F short 30 delaware 34500 liberal
M medium 51 delaware 58000 moderate
M medium 24 arkansas 38800 moderate
M short 49 arkansas 64500 moderate
F medium 66 delaware 74500 conservative
M tall 65 arkansas 76900 conservative
M short 46 colorado 58000 conservative
M tall 45 delaware 51800 moderate
M short 47 arkansas 63600 conservative
M tall 29 arkansas 44800 conservative
M tall 57 delaware 69300 liberal
M medium 20 arkansas 28700 liberal
M medium 35 arkansas 43400 moderate
M tall 61 delaware 67000 liberal
M short 31 delaware 37300 moderate
F short 18 arkansas 20800 liberal
F medium 26 delaware 29200 liberal
M medium 28 arkansas 36400 liberal
M tall 59 delaware 69400 liberal

Encoded and normalized data:


# people_encoded.txt
#
# sex (M = 0.0, F = 0.5)
# height (short = 0.25, medium = 0.50, tall = 0.75)
# age (min = 18, max = 68)
# State [Arkansas = (0.25 0 0 0), Colorado = (0 0.25 0 0),
#   Delaware (0 0 0.25 0), Illinois (0 0 0 0.25)]
# income (min = 20,300.00 max = 81,800.00)
# politics [(conservative = 0.3333 0 0), moderate (0 0.3333 0),
#   liberal (0 0 0.3333)]
# 
0.5, 0.25, 0.12, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.42, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.90, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.36, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.18, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.64, 0.00, 0.25, 0.00, 0.00, 0.5886, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.64, 0.00, 0.00, 0.00, 0.25, 0.5642, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.02, 0.00, 0.00, 0.25, 0.00, 0.2016, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.08, 0.00, 0.00, 0.00, 0.25, 0.1203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.42, 0.00, 0.00, 0.25, 0.00, 0.4358, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.32, 0.25, 0.00, 0.00, 0.00, 0.3106, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.08, 0.00, 0.00, 0.00, 0.25, 0.2146, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.34, 0.00, 0.00, 0.25, 0.00, 0.2423, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.30, 0.00, 0.25, 0.00, 0.00, 0.4244, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.54, 0.00, 0.25, 0.00, 0.00, 0.5496, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.48, 0.00, 0.00, 0.00, 0.25, 0.4943, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.30, 0.00, 0.25, 0.00, 0.00, 0.4309, 0.0000, 0.3333, 0.0000
0.5, 0.75, 0.14, 0.00, 0.00, 0.25, 0.00, 0.1577, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.26, 0.00, 0.25, 0.00, 0.00, 0.4244, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.18, 0.25, 0.00, 0.00, 0.00, 0.1984, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.60, 0.00, 0.00, 0.00, 0.25, 0.5480, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.92, 0.00, 0.00, 0.00, 0.25, 0.8293, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.86, 0.00, 0.25, 0.00, 0.00, 0.8472, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.72, 0.00, 0.00, 0.00, 0.25, 0.6618, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.22, 0.25, 0.00, 0.00, 0.00, 0.2602, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.64, 0.00, 0.00, 0.25, 0.00, 0.5642, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.74, 0.00, 0.00, 0.00, 0.25, 0.6862, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.44, 0.00, 0.00, 0.00, 0.25, 0.5220, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.08, 0.25, 0.00, 0.00, 0.00, 0.0537, 0.0000, 0.0000, 0.3333
0.5, 0.25, 1.00, 0.00, 0.25, 0.00, 0.00, 0.9447, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.84, 0.00, 0.00, 0.00, 0.25, 0.8358, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.32, 0.00, 0.00, 0.25, 0.00, 0.4260, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.14, 0.00, 0.00, 0.25, 0.00, 0.2732, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.26, 0.00, 0.00, 0.00, 0.25, 0.4650, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.50, 0.00, 0.00, 0.25, 0.00, 0.4504, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.80, 0.00, 0.25, 0.00, 0.00, 0.7333, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.74, 0.00, 0.00, 0.00, 0.25, 0.6569, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.50, 0.00, 0.25, 0.00, 0.00, 0.5008, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.50, 0.00, 0.00, 0.25, 0.00, 0.5350, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.06, 0.25, 0.00, 0.00, 0.00, 0.2748, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.74, 0.00, 0.00, 0.25, 0.00, 0.7203, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.92, 0.00, 0.25, 0.00, 0.00, 0.8862, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.46, 0.00, 0.00, 0.00, 0.25, 0.6260, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.92, 0.00, 0.00, 0.25, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.76, 0.00, 0.00, 0.00, 0.25, 0.7528, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.26, 0.00, 0.00, 0.25, 0.00, 0.2553, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.94, 0.00, 0.00, 0.25, 0.00, 0.8098, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.74, 0.00, 0.00, 0.00, 0.25, 0.7154, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.14, 0.25, 0.00, 0.00, 0.00, 0.3252, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.56, 0.00, 0.00, 0.25, 0.00, 0.4992, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.36, 0.00, 0.00, 0.00, 0.25, 0.5398, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.68, 0.00, 0.00, 0.00, 0.25, 0.6146, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.86, 0.00, 0.00, 0.25, 0.00, 0.7740, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.78, 0.00, 0.00, 0.25, 0.00, 0.7382, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.56, 0.00, 0.25, 0.00, 0.00, 0.5252, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.88, 0.25, 0.00, 0.00, 0.00, 0.7561, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.74, 0.00, 0.00, 0.00, 0.25, 0.6894, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.08, 0.00, 0.00, 0.25, 0.00, 0.1203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.64, 0.00, 0.00, 0.00, 0.25, 0.6927, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.28, 0.00, 0.00, 0.00, 0.25, 0.3496, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.06, 0.00, 0.00, 0.25, 0.00, 0.2488, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.52, 0.00, 0.25, 0.00, 0.00, 0.5154, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.56, 0.00, 0.00, 0.00, 0.25, 0.5106, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.88, 0.00, 0.25, 0.00, 0.00, 0.8033, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.78, 0.00, 0.00, 0.00, 0.25, 0.7496, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.98, 0.00, 0.00, 0.00, 0.25, 0.9024, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.22, 0.25, 0.00, 0.00, 0.00, 0.2276, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.70, 0.00, 0.00, 0.00, 0.25, 0.6472, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.52, 0.25, 0.00, 0.00, 0.00, 0.5610, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.56, 0.00, 0.25, 0.00, 0.00, 0.5203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.04, 0.00, 0.00, 0.00, 0.25, 0.1593, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.40, 0.00, 0.00, 0.00, 0.25, 0.5398, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.64, 0.00, 0.25, 0.00, 0.00, 0.6228, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.30, 0.00, 0.25, 0.00, 0.00, 0.3610, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.30, 0.00, 0.25, 0.00, 0.00, 0.3089, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.16, 0.00, 0.25, 0.00, 0.00, 0.3268, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.80, 0.25, 0.00, 0.00, 0.00, 0.8195, 0.3333, 0.0000, 0.0000
0.5, 0.75, 0.50, 0.00, 0.00, 0.00, 0.25, 0.4504, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.56, 0.25, 0.00, 0.00, 0.00, 0.7171, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.84, 0.25, 0.00, 0.00, 0.00, 0.8358, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.48, 0.25, 0.00, 0.00, 0.00, 0.4650, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.76, 0.00, 0.00, 0.25, 0.00, 0.5870, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.88, 0.00, 0.25, 0.00, 0.00, 0.7480, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.64, 0.25, 0.00, 0.00, 0.00, 0.7236, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.58, 0.00, 0.00, 0.00, 0.25, 0.5154, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.98, 0.00, 0.25, 0.00, 0.00, 0.9772, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.44, 0.00, 0.00, 0.25, 0.00, 0.4894, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.48, 0.00, 0.25, 0.00, 0.00, 0.4569, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.92, 0.25, 0.00, 0.00, 0.00, 0.8407, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.58, 0.25, 0.00, 0.00, 0.00, 0.6244, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.54, 0.00, 0.25, 0.00, 0.00, 0.5285, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.14, 0.00, 0.00, 0.25, 0.00, 0.3350, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.40, 0.25, 0.00, 0.00, 0.00, 0.4569, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.74, 0.00, 0.00, 0.25, 0.00, 0.6455, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.52, 0.25, 0.00, 0.00, 0.00, 0.6553, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.30, 0.25, 0.00, 0.00, 0.00, 0.3366, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.32, 0.00, 0.00, 0.25, 0.00, 0.3041, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.18, 0.00, 0.25, 0.00, 0.00, 0.2179, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.28, 0.00, 0.25, 0.00, 0.00, 0.3317, 0.0000, 0.3333, 0.0000
0.5, 0.75, 0.48, 0.00, 0.00, 0.00, 0.25, 0.4341, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.12, 0.00, 0.00, 0.25, 0.00, 0.3252, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.48, 0.00, 0.25, 0.00, 0.00, 0.4878, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.14, 0.00, 0.00, 0.25, 0.00, 0.1252, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.66, 0.00, 0.25, 0.00, 0.00, 0.6130, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.74, 0.00, 0.25, 0.00, 0.00, 0.7024, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.52, 0.25, 0.00, 0.00, 0.00, 0.4472, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.00, 0.25, 0.00, 0.00, 0.00, 0.3171, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.98, 0.00, 0.25, 0.00, 0.00, 0.8341, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.54, 0.00, 0.00, 0.25, 0.00, 0.4829, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.60, 0.25, 0.00, 0.00, 0.00, 0.5772, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.14, 0.00, 0.25, 0.00, 0.00, 0.3041, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.98, 0.25, 0.00, 0.00, 0.00, 0.9431, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.38, 0.00, 0.00, 0.25, 0.00, 0.3528, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.28, 0.25, 0.00, 0.00, 0.00, 0.3642, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.60, 0.25, 0.00, 0.00, 0.00, 0.5967, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.96, 0.00, 0.00, 0.25, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.86, 0.25, 0.00, 0.00, 0.00, 0.8081, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.80, 0.00, 0.00, 0.25, 0.00, 0.7902, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.02, 0.25, 0.00, 0.00, 0.00, 0.0602, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.40, 0.00, 0.00, 0.25, 0.00, 0.3691, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.18, 0.25, 0.00, 0.00, 0.00, 0.2618, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.48, 0.25, 0.00, 0.00, 0.00, 0.4504, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.84, 0.25, 0.00, 0.00, 0.00, 0.8293, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.18, 0.00, 0.00, 0.25, 0.00, 0.2358, 0.3333, 0.0000, 0.0000
0.5, 0.75, 0.22, 0.00, 0.25, 0.00, 0.00, 0.2732, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.50, 0.25, 0.00, 0.00, 0.00, 0.5919, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.60, 0.25, 0.00, 0.00, 0.00, 0.5919, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.18, 0.00, 0.00, 0.25, 0.00, 0.1480, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.52, 0.25, 0.00, 0.00, 0.00, 0.5675, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.10, 0.00, 0.25, 0.00, 0.00, 0.0976, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.36, 0.00, 0.25, 0.00, 0.00, 0.5317, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.92, 0.00, 0.00, 0.25, 0.00, 0.8488, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.22, 0.00, 0.00, 0.25, 0.00, 0.1577, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.30, 0.25, 0.00, 0.00, 0.00, 0.4715, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.96, 0.00, 0.25, 0.00, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.06, 0.00, 0.00, 0.25, 0.00, 0.2276, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.18, 0.25, 0.00, 0.00, 0.00, 0.2016, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.22, 0.25, 0.00, 0.00, 0.00, 0.1870, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.26, 0.25, 0.00, 0.00, 0.00, 0.4602, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.36, 0.00, 0.00, 0.25, 0.00, 0.3366, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.62, 0.00, 0.25, 0.00, 0.00, 0.5756, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.20, 0.25, 0.00, 0.00, 0.00, 0.2943, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.50, 0.00, 0.00, 0.25, 0.00, 0.5902, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.56, 0.00, 0.25, 0.00, 0.00, 0.6260, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.78, 0.25, 0.00, 0.00, 0.00, 0.8049, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.68, 0.00, 0.00, 0.25, 0.00, 0.6358, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.26, 0.00, 0.00, 0.25, 0.00, 0.3772, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.74, 0.25, 0.00, 0.00, 0.00, 0.6780, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.64, 0.25, 0.00, 0.00, 0.00, 0.5870, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.60, 0.00, 0.25, 0.00, 0.00, 0.5789, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.08, 0.00, 0.00, 0.25, 0.00, 0.2309, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.82, 0.00, 0.00, 0.25, 0.00, 0.7545, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.32, 0.25, 0.00, 0.00, 0.00, 0.3659, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.92, 0.25, 0.00, 0.00, 0.00, 0.9252, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.22, 0.00, 0.00, 0.25, 0.00, 0.2146, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.32, 0.00, 0.25, 0.00, 0.00, 0.3724, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.86, 0.25, 0.00, 0.00, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.92, 0.00, 0.00, 0.25, 0.00, 0.8260, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.22, 0.25, 0.00, 0.00, 0.00, 0.3415, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.90, 0.00, 0.25, 0.00, 0.00, 0.8179, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.22, 0.00, 0.25, 0.00, 0.00, 0.3203, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.66, 0.25, 0.00, 0.00, 0.00, 0.6894, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.12, 0.00, 0.00, 0.25, 0.00, 0.2829, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.60, 0.00, 0.25, 0.00, 0.00, 0.6049, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.00, 0.25, 0.00, 0.00, 0.00, 0.1154, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.00, 0.25, 0.00, 0.00, 0.00, 0.0000, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.30, 0.00, 0.25, 0.00, 0.00, 0.2911, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.04, 0.00, 0.00, 0.25, 0.00, 0.2358, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.22, 0.00, 0.00, 0.25, 0.00, 0.2065, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.52, 0.00, 0.00, 0.25, 0.00, 0.6943, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.94, 0.00, 0.00, 0.25, 0.00, 1.0000, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.76, 0.25, 0.00, 0.00, 0.00, 0.7057, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.68, 0.00, 0.00, 0.25, 0.00, 0.6195, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.22, 0.00, 0.25, 0.00, 0.00, 0.4602, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.58, 0.00, 0.25, 0.00, 0.00, 0.6276, 0.0000, 0.3333, 0.0000
0.5, 0.50, 1.00, 0.25, 0.00, 0.00, 0.00, 0.8504, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.26, 0.00, 0.00, 0.25, 0.00, 0.2553, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.86, 0.00, 0.25, 0.00, 0.00, 0.6862, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.02, 0.00, 0.25, 0.00, 0.00, 0.0195, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.40, 0.00, 0.00, 0.25, 0.00, 0.3691, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.16, 0.25, 0.00, 0.00, 0.00, 0.3577, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.86, 0.00, 0.25, 0.00, 0.00, 0.7659, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.44, 0.25, 0.00, 0.00, 0.00, 0.4260, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.62, 0.25, 0.00, 0.00, 0.00, 0.7301, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.76, 0.25, 0.00, 0.00, 0.00, 0.7675, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.60, 0.00, 0.25, 0.00, 0.00, 0.7431, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.68, 0.25, 0.00, 0.00, 0.00, 0.5854, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.00, 0.25, 0.00, 0.00, 0.00, 0.1545, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.76, 0.00, 0.00, 0.25, 0.00, 0.6341, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.68, 0.00, 0.25, 0.00, 0.00, 0.7171, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.00, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.80, 0.25, 0.00, 0.00, 0.00, 0.7463, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.42, 0.00, 0.25, 0.00, 0.00, 0.5659, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.56, 0.25, 0.00, 0.00, 0.00, 0.6927, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.44, 0.00, 0.25, 0.00, 0.00, 0.4211, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.84, 0.25, 0.00, 0.00, 0.00, 0.8520, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.36, 0.00, 0.25, 0.00, 0.00, 0.3317, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.52, 0.25, 0.00, 0.00, 0.00, 0.5203, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.20, 0.25, 0.00, 0.00, 0.00, 0.1789, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.72, 0.00, 0.00, 0.25, 0.00, 0.6878, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.66, 0.25, 0.00, 0.00, 0.00, 0.6650, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.28, 0.00, 0.25, 0.00, 0.00, 0.4195, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.74, 0.25, 0.00, 0.00, 0.00, 0.6894, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.14, 0.00, 0.00, 0.25, 0.00, 0.0959, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.30, 0.00, 0.00, 0.25, 0.00, 0.2764, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.22, 0.00, 0.25, 0.00, 0.00, 0.4211, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.94, 0.25, 0.00, 0.00, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.50, 0.00, 0.25, 0.00, 0.00, 0.5057, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.72, 0.00, 0.25, 0.00, 0.00, 0.7236, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.86, 0.00, 0.25, 0.00, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.68, 0.00, 0.25, 0.00, 0.00, 0.7041, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.24, 0.00, 0.25, 0.00, 0.00, 0.2146, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.22, 0.25, 0.00, 0.00, 0.00, 0.1805, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.58, 0.00, 0.00, 0.25, 0.00, 0.6358, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.42, 0.00, 0.25, 0.00, 0.00, 0.4472, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.58, 0.00, 0.00, 0.25, 0.00, 0.5154, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.62, 0.25, 0.00, 0.00, 0.00, 0.6228, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.90, 0.00, 0.00, 0.25, 0.00, 0.7659, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.24, 0.25, 0.00, 0.00, 0.00, 0.3073, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.86, 0.00, 0.00, 0.25, 0.00, 0.8016, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.58, 0.00, 0.00, 0.25, 0.00, 0.6244, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.24, 0.00, 0.00, 0.25, 0.00, 0.2309, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.66, 0.00, 0.00, 0.25, 0.00, 0.6130, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.12, 0.25, 0.00, 0.00, 0.00, 0.3008, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.62, 0.25, 0.00, 0.00, 0.00, 0.7187, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.96, 0.00, 0.00, 0.25, 0.00, 0.8813, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.94, 0.25, 0.00, 0.00, 0.00, 0.9203, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.56, 0.00, 0.25, 0.00, 0.00, 0.6130, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.54, 0.00, 0.00, 0.25, 0.00, 0.5122, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.58, 0.25, 0.00, 0.00, 0.00, 0.7041, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.22, 0.25, 0.00, 0.00, 0.00, 0.3984, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.78, 0.00, 0.00, 0.25, 0.00, 0.7967, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.04, 0.25, 0.00, 0.00, 0.00, 0.1366, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.34, 0.25, 0.00, 0.00, 0.00, 0.3756, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.86, 0.00, 0.00, 0.25, 0.00, 0.7593, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.26, 0.00, 0.00, 0.25, 0.00, 0.2764, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.00, 0.25, 0.00, 0.00, 0.00, 0.0081, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.16, 0.00, 0.00, 0.25, 0.00, 0.1447, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.20, 0.25, 0.00, 0.00, 0.00, 0.2618, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.82, 0.00, 0.00, 0.25, 0.00, 0.7984, 0.0000, 0.0000, 0.3333

Python k-means program gives identical results:

# kmeans_scikit_demo.py

import numpy as np
from sklearn.cluster import KMeans

fn = ".\\Data\\people_encoded.txt"
X = np.loadtxt(fn, usecols=[0,1,2,3,4,5,6,7,8,9,10],
  delimiter=',', comments="#", dtype=np.float64)
print("\nsource data:")
print(X)
km = KMeans(n_clusters=3, random_state=1, init='random')
km.fit(X)
print("\nclustering (first 12) = ")
print(km.labels_[0:12])
print("\nWCSS = %0.4f " % km.inertia_)
print("\ncounts: ")
print(np.sum(km.labels_ == 0))
print(np.sum(km.labels_ == 1))
print(np.sum(km.labels_ == 2))
print("\nmeans: ")
np.set_printoptions(precision=2)
print(km.cluster_centers_)

Output:

C:\VSM\ClusterMixedKMeans: python kmeans_scikit_demo.py

source data:
[[0.5  0.25 0.12 ... 0.   0.   0.33]
 [0.   0.75 0.42 ... 0.   0.33 0.  ]
 [0.5  0.25 0.9  ... 0.33 0.   0.  ]
 ...
 [0.5  0.5  0.16 ... 0.   0.   0.33]
 [0.   0.5  0.2  ... 0.   0.   0.33]
 [0.   0.75 0.82 ... 0.   0.   0.33]]

clustering (first 12) =
[0 1 2 0 0 2 2 0 0 1 0 0]

WCSS = 49.3195

counts:
89
77
74

means:
[[0.26 0.4  0.19 0.09 0.06 0.09 0.02 0.26 0.1  0.11 0.12]
 [0.   0.63 0.66 0.08 0.06 0.08 0.03 0.68 0.04 0.16 0.13]
 [0.5  0.32 0.69 0.07 0.08 0.05 0.05 0.65 0.16 0.15 0.02]]
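
As a sanity check, the km.inertia_ value can be recomputed directly from the labels and cluster centers, because WCSS is just the sum of squared Euclidean distances from each item to its assigned center. These lines are not part of the script above, but could be appended to the end of it:

# sanity check (not in the demo script): recompute WCSS
# from labels and centers; should match km.inertia_
wcss = 0.0
for i in range(len(X)):
  center = km.cluster_centers_[km.labels_[i]]
  diff = X[i] - center
  wcss += np.dot(diff, diff)
print("recomputed WCSS = %0.4f " % wcss)  # 49.3195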

Data Anomaly Detection For Mixed Data Using a Self-Organizing Map (SOM) From Scratch With JavaScript

Several days ago, I put together a demo of data anomaly detection for mixed numeric and categorical data using a self-organizing map (SOM), implemented from scratch in C#. A few days later, I refactored the C# version to Python. And then, for this blog post, I figured I’d refactor the system to raw JavaScript.

Refactoring a non-trivial system from one language to another always gives me new insights into the system, algorithms, and data structures, as well as features of the two programming languages involved.

A self-organizing map (SOM) is a data structure and associated algorithms that can be used to cluster data. Each cluster has a representative vector, somewhat similar to the way each cluster in k-means clustering has an associated mean/centroid. Data items that are assigned to a SOM cluster but are far (Euclidean distance) from the cluster representative vector are anomalous.
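
To make the idea concrete: the anomaly score for a data item is just the Euclidean distance between the item's encoded vector and its cluster's representative vector. A minimal Python sketch, using made-up vectors rather than values from the demo:

# anomaly score = Euclidean distance from an encoded item
# to its cluster representative vector (made-up values)
import math

def euc_distance(v1, v2):
  return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

rep  = [0.50, 0.33, 0.22, 0.21]   # cluster representative
item = [0.00, 0.25, 0.58, 0.70]   # encoded data item
print("distance = %0.4f" % euc_distance(item, rep))
# larger distance means more anomalous within the cluster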

I made a 240-item set of synthetic data that looks like:

F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
. . .

The fields are sex, height, age, State, income, political leaning.

Because SOM clustering uses Euclidean distance, the data must be normalized and encoded. I used min-max normalization on the age (min = 18, max = 68) and income (min = $20,300, max = $81,800) columns. I used one-over-n-hot encoding on the sex, State, and political leaning columns. I used equal-interval encoding for the height column, because it has a natural order.
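
To make the encoding concrete, here is a short Python sketch (written for illustration; it's not part of the demo program) that encodes a single raw item using exactly these rules:

# encode one raw People item per the scheme above
def encode(sex, height, age, state, income, politics):
  result = [0.0] * 11
  result[0] = 0.5 if sex == "F" else 0.0     # M = 0.0, F = 0.5
  result[1] = {"short": 0.25, "medium": 0.50,
               "tall": 0.75}[height]         # equal-interval
  result[2] = (age - 18.0) / (68.0 - 18.0)   # min-max
  states = ["arkansas", "colorado", "delaware", "illinois"]
  result[3 + states.index(state)] = 0.25     # one-over-n-hot
  result[7] = (income - 20300.0) / (81800.0 - 20300.0)
  pols = ["conservative", "moderate", "liberal"]
  result[8 + pols.index(politics)] = 0.3333  # one-over-n-hot
  return result

print(encode("F", "short", 24, "arkansas", 29500, "liberal"))
# first item -> [0.5, 0.25, 0.12, 0.25, 0.0, 0.0, 0.0,
#   0.1496..., 0.0, 0.0, 0.3333]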

The resulting normalized and encoded data looks like:

0.5, 0.25, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
. . .

I set up the demo SOM map as size 2-by-2 for a total of 4 map nodes. Creating a SOM map is an iterative process that requires a steps_max value (I used 1,000) and a lrn_rate_max value (I used 2.00). SOM maps are very sensitive to these values, and they must be determined by trial and error. I monitored the SOM map building every 200 iterations by computing the sum of Euclidean distances (SED) between map node vectors and data items assigned to the map node / cluster:

Computing SOM clustering
map build step 0     |  SED = 311.4767
map build step 200   |  SED = 229.7895
map build step 400   |  SED = 160.0903
map build step 600   |  SED = 122.9567
map build step 800   |  SED = 105.7636
Done
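
The heart of the map-building loop is surprisingly short. At each step, the learning rate and the neighborhood range decay linearly toward zero; a random data item is selected; and every map node within the (shrinking) Manhattan grid distance of the item's best matching unit (BMU) is pulled toward the item. Here is a self-contained Python sketch of the same update rule, applied to a tiny made-up 2D dataset rather than the People data:

# sketch of the SOM update rule on a tiny 2D toy dataset
import math, random

data = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
map_rows, map_cols, dim = 2, 2, 2
steps_max, lrn_rate_max = 100, 0.5
rng = random.Random(1)
som = [[[rng.random() for _ in range(dim)]
       for _ in range(map_cols)] for _ in range(map_rows)]

def closest_node(x):
  # (row, col) of the map node vector closest to x
  best_d, best_rc = float("inf"), (0, 0)
  for i in range(map_rows):
    for j in range(map_cols):
      d = math.dist(x, som[i][j])
      if d < best_d:
        best_d, best_rc = d, (i, j)
  return best_rc

for step in range(steps_max):
  pct_left = 1.0 - step / steps_max          # linear decay
  curr_range = pct_left * (map_rows + map_cols)
  curr_lrn = pct_left * lrn_rate_max
  x = data[rng.randrange(len(data))]         # random item
  bi, bj = closest_node(x)                   # the BMU
  for i in range(map_rows):
    for j in range(map_cols):
      # only move nodes near the BMU (Manhattan distance
      # on the grid), by a fraction of the difference
      if abs(bi - i) + abs(bj - j) <= curr_range:
        for d in range(dim):
          som[i][j][d] += curr_lrn * (x[d] - som[i][j][d])

print(som)  # node vectors have migrated toward the data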

Each of the 4 map nodes is identified by a [row][col] pair of indices. The four resulting map node vectors are:

SOM map nodes:
[0][0] :  0.00  0.66  0.70  0.08  0.09  0.03  0.05  0.71  0.02  0.12  0.19
[0][1] :  0.50  0.33  0.22  0.08  0.07  0.09  0.01  0.21  0.02  0.08  0.23
[1][0] :  0.50  0.31  0.70  0.05  0.10  0.07  0.04  0.66  0.17  0.14  0.02
[1][1] :  0.00  0.53  0.19  0.03  0.06  0.11  0.05  0.32  0.16  0.18  0.00

It’s important to look at the SOM mapping to determine if the steps_max and lrn_rate_max parameter values are good. The 240 data items were assigned to map nodes according to this distribution:

SOM mapping:
[0][0] : 43 items
[0][1] : 49 items
[1][0] : 77 items
[1][1] : 71 items

These counts seem reasonable. My demo has a function to display the [r][c] cluster ID for each data item. The first four cluster assignments are:

clustering:
X[0] :  0  1
X[1] :  1  1
X[2] :  1  0
X[3] :  1  1
. . .

After the SOM map was constructed, I analyzed the data, looking for the data item assigned to each cluster/node that is farthest from the map node representative vector:

Analyzing

node [0][0] :
  most anomalous data idx =  229
  0.00  0.25  0.58  0.25  0.00  0.00  0.00  0.70  0.33  0.00  0.00
  M  short   47  arkansas  63600  conservative
  distance = 0.6081

node [0][1] :
  most anomalous data idx =  179
  0.50  0.75  0.40  0.00  0.00  0.25  0.00  0.37  0.00  0.33  0.00
  F  tall    38  delaware  43000  moderate
  distance = 0.6267

node [1][0] :
  most anomalous data idx =   99
  0.50  0.75  0.48  0.00  0.00  0.00  0.25  0.43  0.00  0.33  0.00
  F  tall    42  illinois  47000  moderate
  distance = 0.6505

node [1][1] :
  most anomalous data idx =  232
  0.00  0.50  0.04  0.25  0.00  0.00  0.00  0.14  0.00  0.00  0.33
  M  medium  20  arkansas  28700  liberal
  distance = 0.5363

I displayed the index of the anomalous data item, its normalized and encoded form, its raw form, and the distance from the item to its map node vector. In a non-demo scenario, these data items would be examined to determine if they are in fact anomalies, and if so, what might be the cause.
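
The scan itself is simple: for each map node, walk the data indices assigned to that node and keep the index with the largest Euclidean distance to the node vector. A small Python sketch with made-up values (the data here is hypothetical, not from the demo):

# per-node anomaly scan: find the assigned item farthest
# from the node's representative vector (made-up values)
import math

def most_anomalous(node_vec, assigned, data):
  # assigned = list of data indices mapped to this node
  worst_idx, worst_dist = assigned[0], -1.0
  for idx in assigned:
    d = math.dist(data[idx], node_vec)
    if d > worst_dist:
      worst_dist, worst_idx = d, idx
  return worst_idx, worst_dist

data = [[0.00, 0.00], [0.10, 0.00], [0.90, 0.90]]
node_vec = [0.05, 0.00]
print(most_anomalous(node_vec, [0, 1, 2], data))
# -> (2, 1.2379...) : item 2 is farthest from the node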

An interesting exploration!



Inanimate objects normally don’t go around trying to kill people. Here are three anomalies in movies. Left: In “Rubber” (2010), a tire in the desert becomes sentient and has psychokinetic powers. A clever, funny, strange, experimental horror film. Center: In “Amityville 4: The Evil Escapes” (1989), a haunted house has an evil lamp. A full-on horror film that’s pretty scary. Right: “Killer Sofa” (2019) is a comedy-horror film made in New Zealand, where, I guess, they call recliner chairs sofas. I thought this film about a chair that is possessed by an evil spirit was very nicely done. I especially liked the scenes where the chair shuffles around the apartment in which it lives.


Demo code. Replace “lt” (less than), “gt”, “lte”, “gte”, and “and” with the corresponding Boolean operator symbols.

// anomaly_som.js
// self-organizing map (SOM) anomaly detection

let FS = require('fs');  // to read data file

// ----------------------------------------------------------

class ClusterSOM
{
  constructor(X, mapRows, mapCols, seed)
  {
    this.mapRows = mapRows;
    this.mapCols = mapCols;
    this.data = X;  // by ref
    this.seed = seed + 0.5;  // avoid 0
    let dim = X[0].length;
    this.map = this.makeMap(mapRows, mapCols, dim);
    for (let i = 0; i "lt" mapRows; ++i) {
      for (let j = 0; j "lt" mapCols; ++j) {
        for (let k = 0; k "lt" dim; ++k) {
          this.map[i][j][k] = this.next();  // random
        }
      }
    }
    this.mapping = 
      this.makeMap(mapRows, mapCols, 1); // matrix of lists
  }

  // --------------------------------------------------------
  // methods: cluster(), getClustering(), analyze()
  // --------------------------------------------------------

  cluster(lrnRateMax, stepsMax) 
  {
    let n = this.data.length;
    let dim = this.data[0].length;
    let rangeMax = this.mapRows + this.mapCols;

    for (let step = 0; step "lt" stepsMax; ++step) {

      if (step % Math.trunc(stepsMax / 5) == 0) {
        process.stdout.write("map build step = ");
        process.stdout.write(step.toString().
          padStart(4, ' '));
        let sum = 0.0;
        for (let ix = 0; ix "lt" n; ++ix) {
          let RC = this.closestNode(ix);

          //console.log(RC[0].toString());
          //console.log(RC[1].toString());

          let item = this.data[ix];
          let node = this.map[RC[0]][RC[1]];
          let dist = this.eucDistance(item, node);
          sum += dist;
        }
        console.log("  |  SED = " + 
          sum.toFixed(4).toString().padStart(9, ' '));
      } // show progress

      let pctLeft = 1.0 - ((step * 1.0) / stepsMax);
      let currRange = (pctLeft * rangeMax);
      let currLrnRate = pctLeft * lrnRateMax;

      // pick a random index
      let idx = this.nextInt(0, n);
      let bmuRC = this.closestNode(idx);
      // move each map node
      for (let i = 0; i "lt" this.mapRows; ++i) {
        for (let j = 0; j "lt" this.mapCols; ++j) {
          if (this.manhattDist(bmuRC[0],
            bmuRC[1], i, j) "lte" currRange) {
            for (let d = 0; d "lt" dim; ++d) {
              this.map[i][j][d] = this.map[i][j][d] +
                currLrnRate * (this.data[idx][d] -
                  this.map[i][j][d]);
            } // d
          } // if
        } // j
      } // i

    } // step
    // map has been created

    // compute mapping
    for (let idx = 0; idx "lt" n; ++idx) {
      let rc = this.closestNode(idx);
      let r = rc[0]; let c = rc[1];
      this.mapping[r][c].push(idx);
    }

    // each mapping list was created with a dummy 0.0 first
    // value; remove it (assumes every map node received
    // at least one data item)
    for (let i = 0; i "lt" this.mapRows; ++i) {
      for (let j = 0; j "lt" this.mapCols; ++j) {
        if (this.mapping[i][j].length "gte" 2)
          this.mapping[i][j].shift();  // remove first
      }
    }

    return;
  } // cluster()

  // --------------------------------------------------------

  getClustering()
  {
    // cluster (r,c) ID for every data item
    let n = this.data.length;
    let result = this.matMake(n, 2, 0.0);
    for (let i = 0; i "lt" this.mapRows; ++i) {
      for (let j = 0; j "lt" this.mapCols; ++j) {
        for (let k = 0; k "lt" this.mapping[i][j].length;
          ++k) {
          let idx = this.mapping[i][j][k];
          result[idx][0] = i;
          result[idx][1] = j;
        }
      }
    }
    return result;
  }

  // --------------------------------------------------------

  analyze(rawFileArray)
  {
    for (let i = 0; i "lt" this.mapRows; ++i) {
      for (let j = 0; j "lt" this.mapCols; ++j) {
        let nodeVec = this.map[i][j];
        let largeDist = 0.0;
        let anomIdx = 0;
        for (let k = 0; k "lt" this.mapping[i][j].length;
          ++k) {
          let idx = this.mapping[i][j][k];
          let item = this.data[idx];
          let dist = this.eucDistance(nodeVec, item);
          if (dist "gt" largeDist) {
            largeDist = dist;
            anomIdx = idx;
          }
        } // k

        console.log("\nnode [" + i.toString() + "][" +
          j.toString() + "] : ");
        console.log("  most anomalous data idx = " +
          anomIdx.toFixed(0).toString().padStart(4, " "));

        for (let jj = 0; jj "lt" this.data[anomIdx].length;
          ++jj) {
          //process.stdout.write(this.data[anomIdx][jj].
          //  toFixed(4).toString().padStart(8));
          process.stdout.write(this.data[anomIdx][jj].
            toFixed(2).toString().padStart(6));
        }
        console.log("");

        console.log("  " + rawFileArray[anomIdx]);
        console.log("  distance = " + 
          largeDist.toFixed(4).toString());
      } // j
    } // i
  }

  // --------------------------------------------------------
  // helper functions: makeMap(), matMake(), vecMake(),
  //  closestNode(), eucDistance(), manhattDist(), next(),
  //  nextInt()
  // --------------------------------------------------------

  makeMap(d1, d2, d3)
  {
    let result = [];
    for (let i = 0; i "lt" d1; ++i) {
      result[i] = [];
      for (let j = 0; j "lt" d2; ++j) {
        result[i][j] = [];
        for (let k = 0; k "lt" d3; ++k) {
          result[i][j][k] = 0.0;
        }
      }
    }
    return result;
  }

  // --------------------------------------------------------

  matMake(rows, cols, val)
  {
    let result = [];
    for (let i = 0; i "lt" rows; ++i) {
      result[i] = [];
      for (let j = 0; j "lt" cols; ++j) {
        result[i][j] = val;
      }
    }
    return result;
  }

  // --------------------------------------------------------

  vecMake(n, val)
  {
    let result = [];
    for (let i = 0; i "lt" n; ++i) {
      result[i] = val;
    }
    return result;
  }

  // --------------------------------------------------------

  closestNode(idx)
  {
    // r,c of map vec closest to data[idx]
    let smallDist = 100000000.0;
    let result = this.vecMake(2, 0);
    result[0] = -1;
    result[1] = -1;
    for (let i = 0; i "lt" this.map.length; ++i) {
      for (let j = 0; j "lt" this.map[0].length; ++j) {
        let dist = this.eucDistance(this.data[idx],
          this.map[i][j]);
        if (dist "lt" smallDist) {
          smallDist = dist;
          result[0] = i;
          result[1] = j;
        }
      }
    }
    return result;
  } // closestNode()

  // --------------------------------------------------------

  eucDistance(v1, v2)
  {
    let dim = v1.length;
    let sum = 0.0;
    for (let i = 0; i "lt" dim; ++i)
      sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
    return Math.sqrt(sum);    
  }

  // --------------------------------------------------------

  manhattDist(r1, c1, r2, c2)
  {
    return Math.abs(r1 - r2) + Math.abs(c1 - c2);
  }

  // --------------------------------------------------------

  next() // next double
  {
    let x = Math.sin(this.seed) * 1000;
    let result = x - Math.floor(x);  // [0.0,1.0)
    this.seed = result;  // for next call
    return result;
  }

  // --------------------------------------------------------

  nextInt(lo, hi)
  {
    let x = this.next();
    return Math.trunc((hi - lo) * x + lo);
  }

  // --------------------------------------------------------

} // class ClusterSOM


// ----------------------------------------------------------
// helpers for main(): loadTxt(), fileLoad(),
//  matShow(), vecShow()
// ----------------------------------------------------------

function loadTxt(fn, delimit, usecols, comment) 
{
  // efficient but mildly complicated
  let all = FS.readFileSync(fn, "utf8");  // giant string
  all = all.trim();  // strip final crlf in file
  let lines = all.split("\n");  // array of lines

  // count number non-comment lines
  let nRows = 0;
  for (let i = 0; i "lt" lines.length; ++i) {
    if (!lines[i].startsWith(comment))
      ++nRows;
  }
  let nCols = usecols.length;
  let result = [];
  for (let i = 0; i "lt" nRows; ++i) {
    result[i] = [];
    for (let j = 0; j "lt" nCols; ++j) {
      result[i][j] = 0.0;
    }
  }
  
  let r = 0;  // into lines
  let i = 0;  // into result[][]
  while (r "lt" lines.length) {
    if (lines[r].startsWith(comment)) {
      ++r;  // next row
    }
    else {
      let tokens = lines[r].split(delimit);
      for (let j = 0; j "lt" nCols; ++j) {
        result[i][j] = parseFloat(tokens[usecols[j]]);
      }
      ++r;
      ++i;
    }
  }

  return result;
} // loadTxt()

// ----------------------------------------------------------

function fileLoad(fn, comment) 
{
  // efficient but mildly complicated
  let all = FS.readFileSync(fn, "utf8");  // giant string
  all = all.trim();  // strip final crlf in file
  let lines = all.split("\n");  // array of lines

  let result = [];
  for (let i = 0; i "lt" lines.length; ++i) {
    if (!lines[i].startsWith(comment)) {
      result.push(lines[i].trim());
    }
  }

  return result;
} // fileLoad()

// ----------------------------------------------------------

function matShow(m, dec, wid, showIndices)
{
  let rows = m.length;
  let cols = m[0].length;
  for (let i = 0; i "lt" rows; ++i) {
    if (showIndices == true)
      process.stdout.write("[" + i.toString().
        padStart(3, ' ') + "]");
    for (let j = 0; j "lt" cols; ++j) {
      let v = m[i][j];
      if (Math.abs(v) "lt" 0.000001) v = 0.0;  // avoid -0
      let vv = v.toFixed(dec);
      let s = vv.toString().padStart(wid, ' ');
      process.stdout.write(s);
      process.stdout.write("  ");
    }
    process.stdout.write("\n");
  }
}

// ----------------------------------------------------------

function vecShow(vec, dec, wid)
{
  for (let i = 0; i "lt" vec.length; ++i) {
    let x = vec[i].toFixed(dec);
    let s = x.toString().padStart(wid, ' ');
    process.stdout.write(s);
    process.stdout.write(" ");
  }
  process.stdout.write("\n");
}

// ----------------------------------------------------------

function main()
{
  console.log("\nBegin self-organizing" +
    " map (SOM) anomaly analysis using JavaScript ");

  // 1. load data
  console.log("\nLoading 240-item People dataset ");
  let rf = ".\\Data\\people_raw.txt";
  let rawFileArray = fileLoad(rf, "#");
  // for (let i = 0; i "lt" rawFileArray.length; ++i) {
  //   console.log(rawFileArray[i]);
  // }

  let fn = ".\\Data\\people_240.txt";
  let X = loadTxt(fn, ",", [0,1,2,3,4,5,6,7,8,9,10], "#");
  // matShow(X, 1, 8, true);
  console.log("\nFirst three normalized and encoded: ");
  for (let i = 0; i "lt" 3; ++i) {
    vecShow(X[i], 4, 8);
  }

  // 2. create ClusterSOM object and cluster
  let mapRows = 2;
  let mapCols = 2;
  let lrnRateMax = 2.00;
  let stepsMax = 1000;
  console.log("\nSetting mapRows = " + 
    mapRows.toString());
  console.log("Setting mapCols = " + 
    mapCols.toString());
  console.log("Setting  lrnRateMax = " + 
    lrnRateMax.toFixed(3).toString());
  console.log("Setting stepsMax = " + 
    stepsMax.toString());

  console.log("\nComputing SOM clustering ");
  let seed = 1;
  let som = new ClusterSOM(X, mapRows, mapCols, seed);
  som.cluster(lrnRateMax, stepsMax);
  console.log("Done ");

  // 3. show the SOM map and mapping
  console.log("\nSOM map nodes: ");
  for (let i = 0; i "lt" mapRows; ++i) {
    for (let j = 0; j "lt" mapCols; ++j) {
      process.stdout.write("[" + i.toString() +
        "][" + j.toString() + "] : ");
      //vecShow(som.map[i][j], 4, 7);
      vecShow(som.map[i][j], 2, 5);
    }
  }

  console.log("\nSOM mapping: ");
  for (let i = 0; i "lt" mapRows; ++i) {
    for (let j = 0; j "lt" mapCols; ++j) {
      // show count
      process.stdout.write("[" + i.toString() + "][" +
        j.toString() + "] : ");
      console.log(som.mapping[i][j].length.toString() +
        " items ");
    }
  }

  // 4. show clustering result
  console.log("\nclustering: ");
  let clustering = som.getClustering();
  for (let i = 0; i < 4; ++i) { // first four
    process.stdout.write("X[" + i.toString() + "] : ");
    vecShow(clustering[i], 0, 2);
  }

  // 5. anomaly analysis
  console.log("\nAnalyzing ");
  som.analyze(rawFileArray);
 
  console.log("\nEnd SOM anomaly demo");
}

main();
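
The program uses only plain JavaScript plus the built-in Node.js fs module (aliased as FS earlier in the program). Assuming the code is saved as som_anomaly.js (a file name used here just for illustration), the demo can be run from a shell with the command "node som_anomaly.js".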

Raw data:

# people_raw.txt
#
F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
F  short   50  colorado  56500  moderate
F  medium  50  illinois  55000  moderate
M  tall    19  delaware  32700  conservative
F  short   22  illinois  27700  moderate
M  tall    39  delaware  47100  liberal
F  short   34  arkansas  39400  moderate
M  medium  22  illinois  33500  conservative
F  medium  35  delaware  35200  liberal
M  tall    33  colorado  46400  moderate
F  short   45  colorado  54100  moderate
F  short   42  illinois  50700  moderate
M  tall    33  colorado  46800  moderate
F  tall    25  delaware  30000  moderate
M  medium  31  colorado  46400  conservative
F  short   27  arkansas  32500  liberal
F  short   48  illinois  54000  moderate
M  tall    64  illinois  71300  liberal
F  medium  61  colorado  72400  conservative
F  short   54  illinois  61000  conservative
F  short   29  arkansas  36300  conservative
F  short   50  delaware  55000  moderate
F  medium  55  illinois  62500  conservative
F  medium  40  illinois  52400  conservative
F  short   22  arkansas  23600  liberal
F  short   68  colorado  78400  conservative
M  tall    60  illinois  71700  liberal
M  tall    34  delaware  46500  moderate
M  medium  25  delaware  37100  conservative
M  short   31  illinois  48900  moderate
F  short   43  delaware  48000  moderate
F  short   58  colorado  65400  liberal
M  tall    55  illinois  60700  liberal
M  tall    43  colorado  51100  moderate
M  tall    43  delaware  53200  moderate
M  medium  21  arkansas  37200  conservative
F  short   55  delaware  64600  conservative
F  short   64  colorado  74800  conservative
M  tall    41  illinois  58800  moderate
F  medium  64  delaware  72700  conservative
M  medium  56  illinois  66600  liberal
F  short   31  delaware  36000  moderate
M  tall    65  delaware  70100  liberal
F  tall    55  illinois  64300  conservative
M  short   25  arkansas  40300  conservative
F  short   46  delaware  51000  moderate
M  tall    36  illinois  53500  conservative
F  short   52  illinois  58100  moderate
F  short   61  delaware  67900  conservative
F  short   57  delaware  65700  conservative
M  tall    46  colorado  52600  moderate
M  tall    62  arkansas  66800  liberal
F  short   55  illinois  62700  conservative
M  medium  22  delaware  27700  moderate
M  tall    50  illinois  62900  conservative
M  tall    32  illinois  41800  moderate
M  short   21  delaware  35600  conservative
F  medium  44  colorado  52000  moderate
F  short   46  illinois  51700  moderate
F  short   62  colorado  69700  conservative
F  short   57  illinois  66400  conservative
M  medium  67  illinois  75800  liberal
F  short   29  arkansas  34300  liberal
F  short   53  illinois  60100  conservative
M  tall    44  arkansas  54800  moderate
F  medium  46  colorado  52300  moderate
M  tall    20  illinois  30100  moderate
M  medium  38  illinois  53500  moderate
F  short   50  colorado  58600  moderate
F  short   33  colorado  42500  moderate
M  tall    33  colorado  39300  moderate
F  short   26  colorado  40400  conservative
F  short   58  arkansas  70700  conservative
F  tall    43  illinois  48000  moderate
M  medium  46  arkansas  64400  conservative
F  short   60  arkansas  71700  conservative
M  tall    42  arkansas  48900  moderate
M  tall    56  delaware  56400  liberal
M  short   62  colorado  66300  liberal
M  short   50  arkansas  64800  moderate
F  short   47  illinois  52000  moderate
M  tall    67  colorado  80400  liberal
M  tall    40  delaware  50400  moderate
F  short   42  colorado  48400  moderate
F  short   64  arkansas  72000  conservative
M  medium  47  arkansas  58700  liberal
F  medium  45  colorado  52800  moderate
M  tall    25  delaware  40900  conservative
F  short   38  arkansas  48400  conservative
F  short   55  delaware  60000  moderate
M  tall    44  arkansas  60600  moderate
F  medium  33  arkansas  41000  moderate
F  short   34  delaware  39000  moderate
F  short   27  colorado  33700  liberal
F  short   32  colorado  40700  moderate
F  tall    42  illinois  47000  moderate
M  short   24  delaware  40300  conservative
F  short   42  colorado  50300  moderate
F  short   25  delaware  28000  liberal
F  short   51  colorado  58000  moderate
M  medium  55  colorado  63500  liberal
F  short   44  arkansas  47800  liberal
M  short   18  arkansas  39800  conservative
M  tall    67  colorado  71600  liberal
F  short   45  delaware  50000  moderate
F  short   48  arkansas  55800  moderate
M  short   25  colorado  39000  moderate
M  tall    67  arkansas  78300  moderate
F  short   37  delaware  42000  moderate
M  short   32  arkansas  42700  moderate
F  short   48  arkansas  57000  moderate
M  tall    66  delaware  75000  liberal
F  tall    61  arkansas  70000  conservative
M  medium  58  delaware  68900  moderate
F  short   19  arkansas  24000  liberal
F  short   38  delaware  43000  moderate
M  medium  27  arkansas  36400  moderate
F  short   42  arkansas  48000  moderate
F  short   60  arkansas  71300  conservative
M  tall    27  delaware  34800  conservative
F  tall    29  colorado  37100  conservative
M  medium  43  arkansas  56700  moderate
F  medium  48  arkansas  56700  moderate
F  medium  27  delaware  29400  liberal
M  tall    44  arkansas  55200  conservative
F  short   23  colorado  26300  liberal
M  tall    36  colorado  53000  liberal
F  short   64  delaware  72500  conservative
F  short   29  delaware  30000  liberal
M  short   33  arkansas  49300  moderate
M  tall    66  colorado  75000  liberal
M  medium  21  delaware  34300  conservative
F  short   27  arkansas  32700  liberal
F  short   29  arkansas  31800  liberal
M  tall    31  arkansas  48600  moderate
F  short   36  delaware  41000  moderate
F  short   49  colorado  55700  moderate
M  short   28  arkansas  38400  conservative
M  medium  43  delaware  56600  moderate
M  medium  46  colorado  58800  moderate
F  short   57  arkansas  69800  conservative
M  short   52  delaware  59400  moderate
M  tall    31  delaware  43500  moderate
M  tall    55  arkansas  62000  liberal
F  short   50  arkansas  56400  moderate
F  short   48  colorado  55900  moderate
M  medium  22  delaware  34500  conservative
F  short   59  delaware  66700  conservative
F  short   34  arkansas  42800  liberal
M  tall    64  arkansas  77200  liberal
F  short   29  delaware  33500  liberal
M  medium  34  colorado  43200  moderate
M  medium  61  arkansas  75000  liberal
F  short   64  delaware  71100  conservative
M  short   29  arkansas  41300  conservative
F  short   63  colorado  70600  conservative
M  medium  29  colorado  40000  conservative
M  tall    51  arkansas  62700  moderate
M  tall    24  delaware  37700  conservative
F  medium  48  colorado  57500  moderate
F  short   18  arkansas  27400  conservative
F  short   18  arkansas  20300  liberal
F  short   33  colorado  38200  liberal
M  medium  20  delaware  34800  conservative
F  short   29  delaware  33000  liberal
M  short   44  delaware  63000  conservative
M  tall    65  delaware  81800  conservative
M  tall    56  arkansas  63700  liberal
M  medium  52  delaware  58400  moderate
M  medium  29  colorado  48600  conservative
M  tall    47  colorado  58900  moderate
F  medium  68  arkansas  72600  liberal
F  short   31  delaware  36000  moderate
F  short   61  colorado  62500  liberal
F  short   19  colorado  21500  liberal
F  tall    38  delaware  43000  moderate
M  tall    26  arkansas  42300  conservative
F  short   61  colorado  67400  conservative
F  short   40  arkansas  46500  moderate
M  medium  49  arkansas  65200  moderate
F  medium  56  arkansas  67500  conservative
M  short   48  colorado  66000  moderate
F  short   52  arkansas  56300  liberal
M  tall    18  arkansas  29800  conservative
M  tall    56  delaware  59300  liberal
M  medium  52  colorado  64400  moderate
M  medium  18  colorado  28600  moderate
M  tall    58  arkansas  66200  liberal
M  tall    39  colorado  55100  moderate
M  tall    46  arkansas  62900  moderate
M  medium  40  colorado  46200  moderate
M  medium  60  arkansas  72700  liberal
F  short   36  colorado  40700  liberal
F  short   44  arkansas  52300  moderate
F  short   28  arkansas  31300  liberal
F  short   54  delaware  62600  conservative
M  medium  51  arkansas  61200  moderate
M  short   32  colorado  46100  moderate
F  short   55  arkansas  62700  conservative
F  short   25  delaware  26200  liberal
F  medium  33  delaware  37300  liberal
M  medium  29  colorado  46200  conservative
F  short   65  arkansas  72700  conservative
M  tall    43  colorado  51400  moderate
M  short   54  colorado  64800  liberal
F  short   61  colorado  72700  conservative
F  short   52  colorado  63600  conservative
F  short   30  colorado  33500  liberal
F  short   29  arkansas  31400  liberal
M  tall    47  delaware  59400  moderate
F  short   39  colorado  47800  moderate
F  short   47  delaware  52000  moderate
M  medium  49  arkansas  58600  moderate
M  tall    63  delaware  67400  liberal
M  medium  30  arkansas  39200  conservative
M  tall    61  delaware  69600  liberal
M  medium  47  delaware  58700  moderate
F  short   30  delaware  34500  liberal
M  medium  51  delaware  58000  moderate
M  medium  24  arkansas  38800  moderate
M  short   49  arkansas  64500  moderate
F  medium  66  delaware  74500  conservative
M  tall    65  arkansas  76900  conservative
M  short   46  colorado  58000  conservative
M  tall    45  delaware  51800  moderate
M  short   47  arkansas  63600  conservative
M  tall    29  arkansas  44800  conservative
M  tall    57  delaware  69300  liberal
M  medium  20  arkansas  28700  liberal
M  medium  35  arkansas  43400  moderate
M  tall    61  delaware  67000  liberal
M  short   31  delaware  37300  moderate
F  short   18  arkansas  20800  liberal
F  medium  26  delaware  29200  liberal
M  medium  28  arkansas  36400  liberal
M  tall    59  delaware  69400  liberal
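
The raw data was normalized and encoded in a separate preprocessing step. The sex variable is encoded as M = 0.0, F = 0.5. Judging from the header comments and the first few rows of the people_240.txt file below, height appears to be equal-interval encoded (short = 0.25, medium = 0.50, tall = 0.75), age and income appear to be min-max normalized, and state and political leaning appear to be one-hot encoded and scaled (by 0.25 and 0.3333 respectively). Here is a minimal sketch of that scheme; the encodeRow() function is a reconstruction for illustration and is not part of the demo program:

// minimal sketch, reconstructed from the people_240.txt comments
function encodeRow(line)
{
  let t = line.trim().split(/\s+/);  // sex height age state income politics
  let sex = (t[0] == "M") ? 0.0 : 0.5;
  let ht = { "short": 0.25, "medium": 0.50, "tall": 0.75 }[t[1]];
  let age = (parseFloat(t[2]) - 18.0) / (68.0 - 18.0);  // min-max
  let state = [0.0, 0.0, 0.0, 0.0];  // scaled one-hot
  state[["arkansas", "colorado", "delaware", "illinois"].indexOf(t[3])] = 0.25;
  let inc = (parseFloat(t[4]) - 20300.0) / (81800.0 - 20300.0);  // min-max
  let pol = [0.0, 0.0, 0.0];  // scaled one-hot
  pol[["conservative", "moderate", "liberal"].indexOf(t[5])] = 0.3333;
  return [sex, ht, age].concat(state, [inc], pol);
}

// encodeRow("F  short   24  arkansas  29500  liberal") gives
// [0.5, 0.25, 0.12, 0.25, 0, 0, 0, 0.1496, 0, 0, 0.3333]
// (to four decimals), matching the first line of people_240.txt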

Normalized and encoded data:

# people_240.txt
#
# sex (M = 0.0, F = 0.5)
# height (short, medium, tall)
# age (min = 18, max = 68)
# State (Arkansas, Colorado, Delaware, Illinois)
# income (min = $20,300, max = $81,800)
# political leaning (conservative, moderate, liberal)
#
0.5, 0.25, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6400, 0.00, 0.25, 0.00, 0.00, 0.5886, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.6400, 0.00, 0.00, 0.00, 0.25, 0.5642, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.0200, 0.00, 0.00, 0.25, 0.00, 0.2016, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0800, 0.00, 0.00, 0.00, 0.25, 0.1203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.4200, 0.00, 0.00, 0.25, 0.00, 0.4358, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3200, 0.25, 0.00, 0.00, 0.00, 0.3106, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.00, 0.25, 0.2146, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.3400, 0.00, 0.00, 0.25, 0.00, 0.2423, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.4244, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5400, 0.00, 0.25, 0.00, 0.00, 0.5496, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.00, 0.00, 0.25, 0.4943, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.4309, 0.0000, 0.3333, 0.0000
0.5, 0.75, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.1577, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.2600, 0.00, 0.25, 0.00, 0.00, 0.4244, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.1984, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6000, 0.00, 0.00, 0.00, 0.25, 0.5480, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9200, 0.00, 0.00, 0.00, 0.25, 0.8293, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.8472, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7200, 0.00, 0.00, 0.00, 0.25, 0.6618, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.2602, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6400, 0.00, 0.00, 0.25, 0.00, 0.5642, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6862, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.4400, 0.00, 0.00, 0.00, 0.25, 0.5220, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0800, 0.25, 0.00, 0.00, 0.00, 0.0537, 0.0000, 0.0000, 0.3333
0.5, 0.25, 1.0000, 0.00, 0.25, 0.00, 0.00, 0.9447, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.8400, 0.00, 0.00, 0.00, 0.25, 0.8358, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3200, 0.00, 0.00, 0.25, 0.00, 0.4260, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.2732, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.2600, 0.00, 0.00, 0.00, 0.25, 0.4650, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.4504, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8000, 0.00, 0.25, 0.00, 0.00, 0.7333, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6569, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5000, 0.00, 0.25, 0.00, 0.00, 0.5008, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.5350, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0600, 0.25, 0.00, 0.00, 0.00, 0.2748, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7400, 0.00, 0.00, 0.25, 0.00, 0.7203, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9200, 0.00, 0.25, 0.00, 0.00, 0.8862, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.4600, 0.00, 0.00, 0.00, 0.25, 0.6260, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.7600, 0.00, 0.00, 0.00, 0.25, 0.7528, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2553, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9400, 0.00, 0.00, 0.25, 0.00, 0.8098, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.7154, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.1400, 0.25, 0.00, 0.00, 0.00, 0.3252, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.5600, 0.00, 0.00, 0.25, 0.00, 0.4992, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3600, 0.00, 0.00, 0.00, 0.25, 0.5398, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6800, 0.00, 0.00, 0.00, 0.25, 0.6146, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.7740, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7800, 0.00, 0.00, 0.25, 0.00, 0.7382, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.5252, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8800, 0.25, 0.00, 0.00, 0.00, 0.7561, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7400, 0.00, 0.00, 0.00, 0.25, 0.6894, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.25, 0.00, 0.1203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.6400, 0.00, 0.00, 0.00, 0.25, 0.6927, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.2800, 0.00, 0.00, 0.00, 0.25, 0.3496, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.0600, 0.00, 0.00, 0.25, 0.00, 0.2488, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.5200, 0.00, 0.25, 0.00, 0.00, 0.5154, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5600, 0.00, 0.00, 0.00, 0.25, 0.5106, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8800, 0.00, 0.25, 0.00, 0.00, 0.8033, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7800, 0.00, 0.00, 0.00, 0.25, 0.7496, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.9800, 0.00, 0.00, 0.00, 0.25, 0.9024, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.2276, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7000, 0.00, 0.00, 0.00, 0.25, 0.6472, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5610, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.5203, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.0400, 0.00, 0.00, 0.00, 0.25, 0.1593, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.4000, 0.00, 0.00, 0.00, 0.25, 0.5398, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6400, 0.00, 0.25, 0.00, 0.00, 0.6228, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.3610, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.3089, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1600, 0.00, 0.25, 0.00, 0.00, 0.3268, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8000, 0.25, 0.00, 0.00, 0.00, 0.8195, 0.3333, 0.0000, 0.0000
0.5, 0.75, 0.5000, 0.00, 0.00, 0.00, 0.25, 0.4504, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.5600, 0.25, 0.00, 0.00, 0.00, 0.7171, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8358, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.4800, 0.25, 0.00, 0.00, 0.00, 0.4650, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.7600, 0.00, 0.00, 0.25, 0.00, 0.5870, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.8800, 0.00, 0.25, 0.00, 0.00, 0.7480, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.6400, 0.25, 0.00, 0.00, 0.00, 0.7236, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5800, 0.00, 0.00, 0.00, 0.25, 0.5154, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9800, 0.00, 0.25, 0.00, 0.00, 0.9772, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4400, 0.00, 0.00, 0.25, 0.00, 0.4894, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.25, 0.00, 0.00, 0.4569, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.9200, 0.25, 0.00, 0.00, 0.00, 0.8407, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5800, 0.25, 0.00, 0.00, 0.00, 0.6244, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.5400, 0.00, 0.25, 0.00, 0.00, 0.5285, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.3350, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4000, 0.25, 0.00, 0.00, 0.00, 0.4569, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.7400, 0.00, 0.00, 0.25, 0.00, 0.6455, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.6553, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.3000, 0.25, 0.00, 0.00, 0.00, 0.3366, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3200, 0.00, 0.00, 0.25, 0.00, 0.3041, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1800, 0.00, 0.25, 0.00, 0.00, 0.2179, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2800, 0.00, 0.25, 0.00, 0.00, 0.3317, 0.0000, 0.3333, 0.0000
0.5, 0.75, 0.4800, 0.00, 0.00, 0.00, 0.25, 0.4341, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.1200, 0.00, 0.00, 0.25, 0.00, 0.3252, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4800, 0.00, 0.25, 0.00, 0.00, 0.4878, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.1252, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6600, 0.00, 0.25, 0.00, 0.00, 0.6130, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.7400, 0.00, 0.25, 0.00, 0.00, 0.7024, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.4472, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.3171, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9800, 0.00, 0.25, 0.00, 0.00, 0.8341, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5400, 0.00, 0.00, 0.25, 0.00, 0.4829, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5772, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.1400, 0.00, 0.25, 0.00, 0.00, 0.3041, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9800, 0.25, 0.00, 0.00, 0.00, 0.9431, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3800, 0.00, 0.00, 0.25, 0.00, 0.3528, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2800, 0.25, 0.00, 0.00, 0.00, 0.3642, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5967, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9600, 0.00, 0.00, 0.25, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.8600, 0.25, 0.00, 0.00, 0.00, 0.8081, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.8000, 0.00, 0.00, 0.25, 0.00, 0.7902, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0200, 0.25, 0.00, 0.00, 0.00, 0.0602, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.4000, 0.00, 0.00, 0.25, 0.00, 0.3691, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.2618, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4800, 0.25, 0.00, 0.00, 0.00, 0.4504, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8293, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.1800, 0.00, 0.00, 0.25, 0.00, 0.2358, 0.3333, 0.0000, 0.0000
0.5, 0.75, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.2732, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5000, 0.25, 0.00, 0.00, 0.00, 0.5919, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.6000, 0.25, 0.00, 0.00, 0.00, 0.5919, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.1800, 0.00, 0.00, 0.25, 0.00, 0.1480, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5675, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1000, 0.00, 0.25, 0.00, 0.00, 0.0976, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.3600, 0.00, 0.25, 0.00, 0.00, 0.5317, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8488, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.1577, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.3000, 0.25, 0.00, 0.00, 0.00, 0.4715, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9600, 0.00, 0.25, 0.00, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0600, 0.00, 0.00, 0.25, 0.00, 0.2276, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1800, 0.25, 0.00, 0.00, 0.00, 0.2016, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.1870, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.2600, 0.25, 0.00, 0.00, 0.00, 0.4602, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.3600, 0.00, 0.00, 0.25, 0.00, 0.3366, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6200, 0.00, 0.25, 0.00, 0.00, 0.5756, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.2943, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.5000, 0.00, 0.00, 0.25, 0.00, 0.5902, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.6260, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.7800, 0.25, 0.00, 0.00, 0.00, 0.8049, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.6800, 0.00, 0.00, 0.25, 0.00, 0.6358, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.3772, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.7400, 0.25, 0.00, 0.00, 0.00, 0.6780, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.6400, 0.25, 0.00, 0.00, 0.00, 0.5870, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.5789, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0800, 0.00, 0.00, 0.25, 0.00, 0.2309, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8200, 0.00, 0.00, 0.25, 0.00, 0.7545, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.3200, 0.25, 0.00, 0.00, 0.00, 0.3659, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.9200, 0.25, 0.00, 0.00, 0.00, 0.9252, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.2146, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.3200, 0.00, 0.25, 0.00, 0.00, 0.3724, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.8600, 0.25, 0.00, 0.00, 0.00, 0.8894, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.9200, 0.00, 0.00, 0.25, 0.00, 0.8260, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.3415, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9000, 0.00, 0.25, 0.00, 0.00, 0.8179, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.3203, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.6600, 0.25, 0.00, 0.00, 0.00, 0.6894, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1200, 0.00, 0.00, 0.25, 0.00, 0.2829, 0.3333, 0.0000, 0.0000
0.5, 0.50, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.6049, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.1154, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.0000, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3000, 0.00, 0.25, 0.00, 0.00, 0.2911, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0400, 0.00, 0.00, 0.25, 0.00, 0.2358, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2200, 0.00, 0.00, 0.25, 0.00, 0.2065, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.5200, 0.00, 0.00, 0.25, 0.00, 0.6943, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9400, 0.00, 0.00, 0.25, 0.00, 1.0000, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7600, 0.25, 0.00, 0.00, 0.00, 0.7057, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6800, 0.00, 0.00, 0.25, 0.00, 0.6195, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.4602, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5800, 0.00, 0.25, 0.00, 0.00, 0.6276, 0.0000, 0.3333, 0.0000
0.5, 0.50, 1.0000, 0.25, 0.00, 0.00, 0.00, 0.8504, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2553, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.6862, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.0200, 0.00, 0.25, 0.00, 0.00, 0.0195, 0.0000, 0.0000, 0.3333
0.5, 0.75, 0.4000, 0.00, 0.00, 0.25, 0.00, 0.3691, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.1600, 0.25, 0.00, 0.00, 0.00, 0.3577, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.7659, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.4400, 0.25, 0.00, 0.00, 0.00, 0.4260, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.7301, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.7600, 0.25, 0.00, 0.00, 0.00, 0.7675, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.6000, 0.00, 0.25, 0.00, 0.00, 0.7431, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.6800, 0.25, 0.00, 0.00, 0.00, 0.5854, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.1545, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7600, 0.00, 0.00, 0.25, 0.00, 0.6341, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6800, 0.00, 0.25, 0.00, 0.00, 0.7171, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.0000, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8000, 0.25, 0.00, 0.00, 0.00, 0.7463, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.4200, 0.00, 0.25, 0.00, 0.00, 0.5659, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.5600, 0.25, 0.00, 0.00, 0.00, 0.6927, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.4400, 0.00, 0.25, 0.00, 0.00, 0.4211, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.8400, 0.25, 0.00, 0.00, 0.00, 0.8520, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.3600, 0.00, 0.25, 0.00, 0.00, 0.3317, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.5200, 0.25, 0.00, 0.00, 0.00, 0.5203, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.1789, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.7200, 0.00, 0.00, 0.25, 0.00, 0.6878, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.6600, 0.25, 0.00, 0.00, 0.00, 0.6650, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.2800, 0.00, 0.25, 0.00, 0.00, 0.4195, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.7400, 0.25, 0.00, 0.00, 0.00, 0.6894, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.1400, 0.00, 0.00, 0.25, 0.00, 0.0959, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.3000, 0.00, 0.00, 0.25, 0.00, 0.2764, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2200, 0.00, 0.25, 0.00, 0.00, 0.4211, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.9400, 0.25, 0.00, 0.00, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5000, 0.00, 0.25, 0.00, 0.00, 0.5057, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.7200, 0.00, 0.25, 0.00, 0.00, 0.7236, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.8600, 0.00, 0.25, 0.00, 0.00, 0.8520, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.6800, 0.00, 0.25, 0.00, 0.00, 0.7041, 0.3333, 0.0000, 0.0000
0.5, 0.25, 0.2400, 0.00, 0.25, 0.00, 0.00, 0.2146, 0.0000, 0.0000, 0.3333
0.5, 0.25, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.1805, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.6358, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.4200, 0.00, 0.25, 0.00, 0.00, 0.4472, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.5154, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.6228, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.9000, 0.00, 0.00, 0.25, 0.00, 0.7659, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2400, 0.25, 0.00, 0.00, 0.00, 0.3073, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.8016, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.5800, 0.00, 0.00, 0.25, 0.00, 0.6244, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.2400, 0.00, 0.00, 0.25, 0.00, 0.2309, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.6600, 0.00, 0.00, 0.25, 0.00, 0.6130, 0.0000, 0.3333, 0.0000
0.0, 0.50, 0.1200, 0.25, 0.00, 0.00, 0.00, 0.3008, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.6200, 0.25, 0.00, 0.00, 0.00, 0.7187, 0.0000, 0.3333, 0.0000
0.5, 0.50, 0.9600, 0.00, 0.00, 0.25, 0.00, 0.8813, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.9400, 0.25, 0.00, 0.00, 0.00, 0.9203, 0.3333, 0.0000, 0.0000
0.0, 0.25, 0.5600, 0.00, 0.25, 0.00, 0.00, 0.6130, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.5400, 0.00, 0.00, 0.25, 0.00, 0.5122, 0.0000, 0.3333, 0.0000
0.0, 0.25, 0.5800, 0.25, 0.00, 0.00, 0.00, 0.7041, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.2200, 0.25, 0.00, 0.00, 0.00, 0.3984, 0.3333, 0.0000, 0.0000
0.0, 0.75, 0.7800, 0.00, 0.00, 0.25, 0.00, 0.7967, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.0400, 0.25, 0.00, 0.00, 0.00, 0.1366, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.3400, 0.25, 0.00, 0.00, 0.00, 0.3756, 0.0000, 0.3333, 0.0000
0.0, 0.75, 0.8600, 0.00, 0.00, 0.25, 0.00, 0.7593, 0.0000, 0.0000, 0.3333
0.0, 0.25, 0.2600, 0.00, 0.00, 0.25, 0.00, 0.2764, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.0000, 0.25, 0.00, 0.00, 0.00, 0.0081, 0.0000, 0.0000, 0.3333
0.5, 0.50, 0.1600, 0.00, 0.00, 0.25, 0.00, 0.1447, 0.0000, 0.0000, 0.3333
0.0, 0.50, 0.2000, 0.25, 0.00, 0.00, 0.00, 0.2618, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.8200, 0.00, 0.00, 0.25, 0.00, 0.7984, 0.0000, 0.0000, 0.3333