James D. McCaffrey

Computing the Hand Probabilities for an Old 1970s Five-Reel Poker Machine

Posted on April 9, 2024 by jamesdmccaffrey

One morning before work, I decided to put together a simulation to compute the poker hand probabilities for an old 1970s vintage five-reel poker machine I bought from eBay. (It was my first ever eBay purchase).

The machine reel configuration is very strange:

Machine reels:
[1]: 2d 3c 3c 4h 7d 8c 9h Ts Qd Kc As
[2]: 2s 2s 4d 5c 6h 7s 9d Tc Jh Qs Ah
[3]: 2h 3s 5d 6c 7h 8s Td Jc Qh Ks Ac
[4]: 2c 3h 4s 6d 7c 8h 9s Jd Qc Kh Ad
[5]: 3d 4c 5h 5s 6s 8d 9c Th Js Kd Xc

Each reel has 11 cards. Reel #1 has two Three of clubs. Reel #2 has two Two of spades. Reel #5 has a Joker wildcard (“Xc”). And reel #5 has the Five of hearts and the Five of spades, while Reel #4 has no Five at all, and therefore it’s impossible to get four Fives without the use of the Joker wildcard.

Left: This is the machine I bought that has a very strange reel configuration described in this article. It is an exact Hong Kong copy by the “Best Ever” company of one made by the well-known Japanese WACO toy company circa 1971. Right: Here’s an old machine I analyzed in an earlier post. The reel configuration is much more logical. The machine was made by the Chas Fey Company circa 1895.

My motivation came from watching the movie “Oppenheimer” (2024) with my pal Ken L. The movie centers around the Manhattan Project and the development of the first atomic bomb. A key part of the Manhattan Project was the invention of computer simulation, for intractable math probability problems, by Stanislaw Ulam. The technique was secretly code-named the Monte Carlo method, after the famous casino.

Trying to mathematically calculate the probabilities of each of the 11 possible hands (HighCard, OnePair, TwoPair, ThreeKind, Straight, Flush, FullHouse, FourKind, StraightFlush, RoyalFlush, FiveKind) for the old machine would be an absolute nightmare. But using my five-card poker library to run a simulation to compute hand probabilities was fairly easy.

First, I modified my standard poker library code to handle a Joker card type and a FiveKind poker hand type. Next I wrote a helper function to convert a five-card poker hand that has a Joker (from reel #5) to the best possible standard five card hand. The idea is to replace the Joker with every possible one of the 52 cards and track which card produces the best hand.

I ran a simulation of 10,000,000 pulls of the old poker machine and got these results:

                machine     standard deck
HighCard        0.39524370  0.50117700
OnePair         0.46452310  0.42256900
TwoPair         0.05732700  0.04753900
ThreeKind       0.06588560  0.02112800
Straight        0.00556710  0.00392500
Flush           0.00404670  0.00196500
FullHouse       0.00423250  0.00144100
FourKind        0.00272860  0.00024010
StraightFlush   0.00040510  0.00001390
RoyalFlush      0.00001340  0.00000154
FiveKind        0.00002720  0.00000000

Apparently, the presence of the Joker wildcard on reel #5 and the duplicate Three of clubs and Two of spades cards on reels #1 and #2, significantly raise the probabilities of all hands better than the HighCard. Also, the machine probability of getting five of a kind is roughly twice that of the probability of a royal flush (but both probabilities are very small).

A very fun and interesting experiment on a cold and wet Pacific Northwest morning.

I used to play poker in Las Vegas but now I do so only very rarely. Playing poker requires a big time commitment and I don’t like being physically squished between two other players while sitting at the poker table. Here are two AI-generated images from the prompt “poker player with a drink”. Based on my experience, the image on the right is far closer to reality than the image on the left.

Demo code. Replace “lt” (less-than), “gt”, “lte”, “gte”, “and” with Boolean operator symbols.

using System;
using System.IO;

namespace PokerJoker
{
  internal class PkoerJokerProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin poker machine simulation ");

      // WACO desktop (big) version
      string[] reels = new string[5];
      reels[0] = "2d 3c 3c 4h 7d 8c 9h Ts Qd Kc As";
      reels[1] = "2s 2s 4d 5c 6h 7s 9d Tc Jh Qs Ah";
      reels[2] = "2h 3s 5d 6c 7h 8s Td Jc Qh Ks Ac";
      reels[3] = "2c 3h 4s 6d 7c 8h 9s Jd Qc Kh Ad";
      reels[4] = "3d 4c 5h 5s 6s 8d 9c Th Js Kd Xc";

      Console.WriteLine("\nMachine reels: ");
      for (int i = 0; i "lt" 5; ++i)
      {
        Console.WriteLine("[" + (i+1) + "]: " + reels[i]);
      }

      string[] handTypes = new string[] { "HighCard",
        "OnePair", "TwoPair" , "ThreeKind" , "Straight",
        "Flush" , "FullHouse",  "FourKind",
        "StraightFlush", "RoyalFlush", "FiveKind" };

      double[] stdProbs = new double[] { 0.50117700,
        0.42256900, 0.04753900, 0.02112800, 0.00392500,
        0.00196500, 0.00144100, 0.00024010, 0.00001390,
        0.00000154, 0.00000000};

      int nTrials = 10_000_000;
      Console.WriteLine("\nPlaying machine " +
        nTrials + " times");
      double[] probs = MachineProbs(reels, nTrials, 0);
      Console.WriteLine("");

      Console.WriteLine("          " +
        "      machine     standard deck");
      for (int i = 0; i "lt" 11; ++i)
      {
        Console.Write(handTypes[i].ToString().
          PadRight(14) + "  ");
        Console.Write(probs[i].ToString("F8") + "  ");
        Console.WriteLine(stdProbs[i].ToString("F8"));
      }

      Console.WriteLine("\nEnd demo ");
      Console.ReadLine();

    } // Main

    static double[] MachineProbs(string[] reels,
      int nTrials, int seed)
    {
      Random rnd = new Random(seed);
      ////string[] reels = new string[5];
      //// WACO desktop (big) version
      //reels[0] = "2d 3c 3c 4h 7d 8c 9h Ts Qd Kc As";
      //reels[1] = "2s 2s 4d 5c 6h 7s 9d Tc Jh Qs Ah";
      //reels[2] = "2h 3s 5d 6c 7h 8s Td Jc Qh Ks Ac";
      //reels[3] = "2c 3h 4s 6d 7c 8h 9s Jd Qc Kh Ad";
      //reels[4] = "3d 4c 5h 5s 6s 8d 9c Th Js Kd Xc";

      int[] counts = new int[11]; // HighCard thru FiveKind
      double[] probs = new double[11];  // 11 hand types

      for (int t = 0; t "lt" nTrials; ++t)
      {
        string hs = "";
        for (int i = 0; i "lt" 5; ++i) // each reel
        {
          int j = rnd.Next(0, 11); // 0 to 10 inclusive
          int idx = 3 * j;  // 0 to 33: 0, 3, 6 . . 30
          string cs = reels[i].Substring(idx, 2);
          hs = hs + cs;  // like "Kc9h2d5dXc"
        }

        Hand h = new Hand(hs);  // List of 5 cards
        if (Hand.HasJoker(h) == true)
          h = MakeBestHand(h);

        //if (h.GetHandTypeStr() == "FiveKind")
        //{
        //  Console.WriteLine("found five - kind at trial " + t);
        //  Console.ReadLine();
        //}
        int hIdx = h.GetHandTypeInt();  // high c to 5 k
        ++counts[hIdx];
      }

      for (int i = 0; i "lt" 11; ++i)
        probs[i] = (1.0 * counts[i]) / nTrials;

      return probs;
    }

    static Hand MakeBestHand(Hand h)
    {
      // assumes h has Joker at [4]
      // make a copy of input h
      Hand copyInput = new Hand(h.cards);
      Hand currBest = new Hand("2c3d4h5s7c"); // worst

      // replace Joker with all possible 52 cards
      for (int r = 2; r "lt" 15; ++r) // 2 thru A
      {
        for (int s = 0; s "lt" 4; ++s) // c, d, h, s
        {
          Card check = new Card(r, s);
          copyInput.cards[4] = check;
          Hand candidate = new Hand(copyInput.cards);

          if (Hand.Compare(candidate, currBest) == +1)
            currBest = candidate;
        }
      }

      return currBest;
    } // MakeBestHand()

  } // Program

  // ===== Card =============================================

  public class Card : IComparable"lt"Card"gt"
  {
    // 0,1 not used. 11=J, 12=Q, 13=K, 14=A, 15=Joker
    public int rank;
    // 0=clubs, 1=diamonds, 2=hearts, 3=spades
    public int suit;

    public Card(int rank, int suit)
    {
      this.rank = rank; this.suit = suit;
    }

    public Card(string c)
    {
      char rnk = c[0]; char sut = c[1];
      if (rnk == 'T') this.rank = 10;
      else if (rnk == 'J') this.rank = 11;
      else if (rnk == 'Q') this.rank = 12;
      else if (rnk == 'K') this.rank = 13;
      else if (rnk == 'A') this.rank = 14;
      else if (rnk == 'X') this.rank = 15;
      else this.rank = int.Parse(rnk.ToString());

      if (sut == 'c') this.suit = 0;
      else if (sut == 'd') this.suit = 1;
      else if (sut == 'h') this.suit = 2;
      else if (sut == 's') this.suit = 3;
    }

    public override string ToString()
    {
      string rnk = ""; string sut = "";
      if (this.rank == 10) rnk = "T";
      else if (this.rank == 11) rnk = "J";
      else if (this.rank == 12) rnk = "Q";
      else if (this.rank == 13) rnk = "K";
      else if (this.rank == 14) rnk = "A";
      else if (this.rank == 15) rnk = "X";
      else rnk = this.rank.ToString();

      if (this.suit == 0) sut = "c";
      else if (this.suit == 1) sut = "d";
      else if (this.suit == 2) sut = "h";
      else if (this.suit == 3) sut = "s";

      return rnk + sut;
    }

    public int CompareTo(Card other)
    {
      // sort cards in a hand from low (Two) to high (Ace)
      if (this.rank.CompareTo(other.rank) == 0)
      {
        return this.suit.CompareTo(other.suit);
      }
      return this.rank.CompareTo(other.rank);
    }

  } // class Card

  // ========================================================
  // ===== Hand =============================================

  public class Hand
  {
    // 5-card poker hand
    // Hand types: "FiveKind",
    // "RoyalFlush", "StraightFlush",
    // "FourKind", "FullHouse", "Flush" , "Straight",
    // "ThreeKind", "TwoPair", "OnePair", "HighCard"

    public List"lt"Card"gt" cards;

    public Hand(List"lt"Card"gt" lst)
    {
      this.cards = new List"lt"Card"gt"();
      for (int i = 0; i "lt" 5; ++i)
        this.cards.Add(lst[i]);
      this.cards.Sort();
    }

    public Hand(Card c1, Card c2, Card c3,
      Card c4, Card c5)
    {
      this.cards = new List"lt"Card"gt"();
      this.cards.Add(c1); this.cards.Add(c2);
      this.cards.Add(c3); this.cards.Add(c4);
      this.cards.Add(c5);
      this.cards.Sort();
    }

    public Hand(string s) // s like "Js3h7d7cXc"
    {
      this.cards = new List"lt"Card"gt"();
      this.cards.Add(new Card(s.Substring(0, 2)));
      this.cards.Add(new Card(s.Substring(2, 2)));
      this.cards.Add(new Card(s.Substring(4, 2)));
      this.cards.Add(new Card(s.Substring(6, 2)));
      this.cards.Add(new Card(s.Substring(8, 2)));
      this.cards.Sort();
    }

    //public override string ToString()
    //{
    //  string h = "";
    //  for (int i = 0; i "lt" 4; ++i)
    //    h += this.cards[i].ToString() + "-";
    //  h += this.cards[4];
    //  return h;
    //}

    public string ToString(bool hyphen = false)
    {
      string h = "";
      for (int i = 0; i "lt" 4; ++i)
      {
        h += this.cards[i].ToString();
        if (hyphen == true) h += "-";
      }
      h += this.cards[4];
      return h;
    }

    // ------------------------------------------------------

    // Hand Type methods:
    // GetHandTypeStr(), GetHandTypeInt(),
    //
    // IsFiveKind(),
    // IsRoyalFlush(), IsStraightFlush(), 
    // IsFourKind(), IsFullHouse(), IsFlush(),
    // IsStraight(), IsThreeKind(), IsTwoPair(),
    // IsOnePair(), IsHighCard()
    //
    // helpers: HasJoker(), HasFlush(), HasStraight(),

    // ------------------------------------------------------

    public string GetHandTypeStr()
    {
      if (IsFiveKind(this) == true)
        return "FiveKind";
      else if (IsRoyalFlush(this) == true)
        return "RoyalFlush";
      else if (IsStraightFlush(this) == true)
        return "StraightFlush";
      else if (IsFourKind(this) == true)
        return "FourKind";
      else if (IsFullHouse(this) == true)
        return "FullHouse";
      else if (IsFlush(this) == true)
        return "Flush";
      else if (IsStraight(this) == true)
        return "Straight";
      else if (IsThreeKind(this) == true)
        return "ThreeKind";
      else if (IsTwoPair(this) == true)
        return "TwoPair";
      else if (IsOnePair(this) == true)
        return "OnePair";
      else if (IsHighCard(this) == true)
        return "HighCard";
      else
        return "Unknown";
    }

    // ------------------------------------------------------

    public int GetHandTypeInt()
    {
      if (IsFiveKind(this) == true)
        return 10;
      else if (IsRoyalFlush(this) == true)
        return 9;
      else if (IsStraightFlush(this) == true)
        return 8;
      else if (IsFourKind(this) == true)
        return 7;
      else if (IsFullHouse(this) == true)
        return 6;
      else if (IsFlush(this) == true)
        return 5;
      else if (IsStraight(this) == true)
        return 4;
      else if (IsThreeKind(this) == true)
        return 3;
      else if (IsTwoPair(this) == true)
        return 2;
      else if (IsOnePair(this) == true)
        return 1;
      else if (IsHighCard(this) == true)
        return 0;
      else
        return -1;
    }

    // ------------------------------------------------------

    private static bool HasFlush(Hand h)
    {
      if ((h.cards[0].suit == h.cards[1].suit) "and"
          (h.cards[1].suit == h.cards[2].suit) "and"
          (h.cards[2].suit == h.cards[3].suit) "and"
          (h.cards[3].suit == h.cards[4].suit))
        return true;

      return false;
    }

    // ------------------------------------------------------

    private static bool HasStraight(Hand h)
    {
      // check special case of Ace-low straight
      // 2, 3, 4, 5, A when sorted
      if (h.cards[0].rank == 2 "and"
        h.cards[1].rank == 3 "and"
        h.cards[2].rank == 4 "and"
        h.cards[3].rank == 5 "and"
        h.cards[4].rank == 14)
        return true;

      // otherwise, check for 5 consecutive
      if ((h.cards[0].rank == h.cards[1].rank - 1) "and"
        (h.cards[1].rank == h.cards[2].rank - 1) "and"
        (h.cards[2].rank == h.cards[3].rank - 1) "and"
        (h.cards[3].rank == h.cards[4].rank - 1))
        return true;

      return false;
    }

    // ------------------------------------------------------

    public static bool HasJoker(Hand h)
    {
      // if sorted, a Joker will be at [4]
      if (h.cards[4].rank == 15)
        return true;
      else
        return false;
    }

    // ------------------------------------------------------

    private static bool IsFiveKind(Hand h)
    {
      if ((h.cards[0].rank == h.cards[1].rank) "and"
          (h.cards[1].rank == h.cards[2].rank) "and"
          (h.cards[2].rank == h.cards[3].rank) "and"
          (h.cards[3].rank == h.cards[4].rank))
        return true;
      else
        return false;
    }

    // ------------------------------------------------------

    private static bool IsRoyalFlush(Hand h)
    {
      if (HasStraight(h) == true "and" HasFlush(h) == true
        "and" h.cards[0].rank == 10)
        return true;
      else
        return false;
    }

    // ------------------------------------------------------

    private static bool IsStraightFlush(Hand h)
    {
      if (HasStraight(h) == true "and" HasFlush(h) == true
        "and" h.cards[0].rank != 10)
        return true;
      else
        return false;
    }

    // ------------------------------------------------------

    private static bool IsFourKind(Hand h)
    {
      // AAAA B or B AAAA if sorted
      if ((h.cards[0].rank == h.cards[1].rank) "and"
        (h.cards[1].rank == h.cards[2].rank) "and"
        (h.cards[2].rank == h.cards[3].rank) "and"
        (h.cards[3].rank != h.cards[4].rank))
        return true;

      if ((h.cards[1].rank == h.cards[2].rank) "and"
        (h.cards[2].rank == h.cards[3].rank) "and"
        (h.cards[3].rank == h.cards[4].rank) "and"
        (h.cards[0].rank != h.cards[1].rank))
        return true;

      return false;
    }

    // ------------------------------------------------------

    private static bool IsFullHouse(Hand h)
    {
      // AAA BB or BB AAA if sorted
      if ((h.cards[0].rank == h.cards[1].rank) "and"
        (h.cards[1].rank == h.cards[2].rank) "and"
        (h.cards[3].rank == h.cards[4].rank) "and"
        (h.cards[2].rank != h.cards[3].rank))
        return true;

      // BB AAA
      if ((h.cards[0].rank == h.cards[1].rank) "and"
        (h.cards[2].rank == h.cards[3].rank) "and"
        (h.cards[3].rank == h.cards[4].rank) "and"
        (h.cards[1].rank != h.cards[2].rank))
        return true;

      return false;
    }

    // ------------------------------------------------------

    private static bool IsFlush(Hand h)
    {
      if (HasFlush(h) == true "and"
        HasStraight(h) == false)
        return true; // no StraightFlush or RoyalFlush
      else
        return false;
    }

    // ------------------------------------------------------

    private static bool IsStraight(Hand h)
    {
      if (HasStraight(h) == true "and"
        HasFlush(h) == false) // no SF or RF
        return true;
      else
        return false;
    }

    // ------------------------------------------------------

    private static bool IsThreeKind(Hand h)
    {
      // AAA B C or B AAA C or B C AAA if sorted
      if ((h.cards[0].rank == h.cards[1].rank) "and"
        (h.cards[1].rank == h.cards[2].rank) "and"
        (h.cards[2].rank != h.cards[3].rank) "and"
        (h.cards[3].rank != h.cards[4].rank))
        return true;

      if ((h.cards[1].rank == h.cards[2].rank) "and"
        (h.cards[2].rank == h.cards[3].rank) "and"
        (h.cards[0].rank != h.cards[1].rank) "and"
        (h.cards[3].rank != h.cards[4].rank))
        return true;

      if ((h.cards[2].rank == h.cards[3].rank) "and"
        (h.cards[3].rank == h.cards[4].rank) "and"
        (h.cards[0].rank != h.cards[1].rank) "and"
        (h.cards[1].rank != h.cards[2].rank))
        return true;

      return false;
    }

    // ------------------------------------------------------

    private static bool IsTwoPair(Hand h)
    {
      // AA BB C or AA C BB or C AA BB if sorted
      if ((h.cards[0].rank == h.cards[1].rank) "and"
        (h.cards[2].rank == h.cards[3].rank) "and"
        (h.cards[1].rank != h.cards[2].rank) "and"
        (h.cards[3].rank != h.cards[4].rank))
        return true;  // AA BB C

      if ((h.cards[0].rank == h.cards[1].rank) "and"
        (h.cards[3].rank == h.cards[4].rank) "and"
        (h.cards[1].rank != h.cards[2].rank) "and"
        (h.cards[2].rank != h.cards[3].rank))
        return true;  // AA C BB

      if ((h.cards[1].rank == h.cards[2].rank) "and"
        (h.cards[3].rank == h.cards[4].rank) "and"
        (h.cards[0].rank != h.cards[1].rank) "and"
        (h.cards[2].rank != h.cards[3].rank))
        return true;  // C AA BB

      return false;
    }

    // ------------------------------------------------------

    private static bool IsOnePair(Hand h)
    {
      // AA B C D or B AA C D or B C AA D or B C D AA
      if ((h.cards[0].rank == h.cards[1].rank) "and"
        (h.cards[1].rank != h.cards[2].rank) "and"
        (h.cards[2].rank != h.cards[3].rank) "and"
        (h.cards[3].rank != h.cards[4].rank))
        return true;  // AA B C D

      if ((h.cards[1].rank == h.cards[2].rank) "and"
        (h.cards[0].rank != h.cards[1].rank) "and"
        (h.cards[2].rank != h.cards[3].rank) "and"
        (h.cards[3].rank != h.cards[4].rank))
        return true;  // B AA C D

      if ((h.cards[2].rank == h.cards[3].rank) "and"
        (h.cards[0].rank != h.cards[1].rank) "and"
        (h.cards[1].rank != h.cards[2].rank) "and"
        (h.cards[3].rank != h.cards[4].rank))
        return true;  // B C AA D

      if ((h.cards[3].rank == h.cards[4].rank) "and"
        (h.cards[0].rank != h.cards[1].rank) "and"
        (h.cards[1].rank != h.cards[2].rank) "and"
        (h.cards[2].rank != h.cards[3].rank))
        return true;  // B C D AA

      return false;
    }

    // ------------------------------------------------------

    private static bool IsHighCard(Hand h)
    {
      if (HasFlush(h) == true)
        return false;
      else if (HasStraight(h) == true)
        return false;
      else
      {
        // all remaining have at least one pair
        if ((h.cards[0].rank == h.cards[1].rank) ||
            (h.cards[1].rank == h.cards[2].rank) ||
            (h.cards[2].rank == h.cards[3].rank) ||
            (h.cards[3].rank == h.cards[4].rank))
          return false;
      }

      return true;
    }

    // ------------------------------------------------------

    // Hand comparison methods
    // Hand.Compare() calls:
    // BreakTieFiveKind(),
    // BreakTieStraightFlush(), BreakTieFourKind(),
    // BreakTieFullHouse(), BreakTieFlush(),
    // BreakTieStraight(), BreakTieThreeKind(),
    // BreakTieTwoPair(), BreakTieOnePair(),
    // BreakTieHighCard()

    // ------------------------------------------------------

    public static int Compare(Hand h1, Hand h2)
    {
      // -1 if h1 "lt" h2, +1 if h1 "gt" h2, 0 if h1 == h2

      int h1Idx = h1.GetHandTypeInt();  // like 6
      int h2Idx = h2.GetHandTypeInt();

      // different hand types - easy
      if (h1Idx "lt" h2Idx)
        return -1;
      else if (h1Idx "gt" h2Idx)
        return +1;
      else // same hand types so break tie
      {
        string h1HandType = h1.GetHandTypeStr();
        string h2HandType = h2.GetHandTypeStr();

        if (h1HandType != h2HandType)
          Console.WriteLine("Logic error ");

        if (h1HandType == "RoyalFlush")
          return 0; // two Royal Flush always tie
        else if (h1HandType == "StraightFlush")
          return BreakTieStraightFlush(h1, h2);
        else if (h1HandType == "FourKind")
          return BreakTieFourKind(h1, h2);
        else if (h1HandType == "FullHouse")
          return BreakTieFullHouse(h1, h2);
        else if (h1HandType == "Flush")
          return BreakTieFlush(h1, h2);
        else if (h1HandType == "Straight")
          return BreakTieStraight(h1, h2);
        else if (h1HandType == "ThreeKind")
          return BreakTieThreeKind(h1, h2);
        else if (h1HandType == "TwoPair")
          return BreakTieTwoPair(h1, h2);
        else if (h1HandType == "OnePair")
          return BreakTieOnePair(h1, h2);
        else if (h1HandType == "HighCard")
          return BreakTieHighCard(h1, h2);
        else if (h1HandType == "FiveKind")
          return BreakTieFiveKind(h1, h2);
      }
      return -2;  // error
    }

    // ------------------------------------------------------

    private static int BreakTieFiveKind(Hand h1, 
      Hand h2)
    {
      // all 5 same rank so check any card
      if (h1.cards[0].rank "lt" h2.cards[0].rank)
        return -1;
      else if (h1.cards[0].rank "gt" h2.cards[0].rank)
        return 1;
      else if (h1.cards[0].rank == h2.cards[0].rank)
        return 0;
      else
        throw new Exception("Fatal logic BreakTieFiveKind");
    }

    // ------------------------------------------------------

    private static int BreakTieStraightFlush(Hand h1,
      Hand h2)
    {
      // check special case of Ace-low straight flush
      // check one or two Ace-low hands
      // h1 is Ace-low, h2 not Ace-low, h1 is less
      if ((h1.cards[0].rank == 2 "and"
        h1.cards[4].rank == 14) "and"  // because sorted!
        !(h2.cards[0].rank == 2 "and"
        h2.cards[4].rank == 14))
        return -1;

      //  h1 not Ace-low, h2 is Ace-low, h1 is better
      else if (!(h1.cards[0].rank == 2 "and"
        h1.cards[4].rank == 14) "and"
        (h2.cards[0].rank == 2 "and"
        h2.cards[4].rank == 14))
        return +1;
      //  two Ace-low hands
      else if ((h1.cards[0].rank == 2 "and"
        h1.cards[4].rank == 14) "and"  // Ace-low
        (h2.cards[0].rank == 2 "and"
        h2.cards[4].rank == 14))  // Ace-low
        return 0;

      //  no Ace-low straight flush so check high cards
      if (h1.cards[4].rank "lt" h2.cards[4].rank)
        return -1;
      else if (h1.cards[4].rank "gt" h2.cards[4].rank)
        return 1;
      else
        return 0;
    }

    private static int BreakTieFourKind(Hand h1, Hand h2)
    {
      // AAAA-B or B-AAAA
      // the off-card is at [0] or at [4]
      // find h1 four-card and off-card ranks
      int h1FourRank; int h1OffRank;
      if (h1.cards[0].rank == h1.cards[1].rank)
      {
        // 1st two cards same so off-rank at [4]
        h1FourRank = h1.cards[0].rank;
        h1OffRank = h1.cards[4].rank;
      }
      else
      {
        // 1st two cards diff so off-rank at [0]
        h1FourRank = h1.cards[4].rank;
        h1OffRank = h1.cards[0].rank;
      }

      int h2FourRank; int h2OffRank;
      if (h2.cards[0].rank == h2.cards[1].rank)
      {
        h2FourRank = h2.cards[0].rank;
        h2OffRank = h2.cards[4].rank;
      }
      else
      {
        h2FourRank = h2.cards[4].rank;
        h2OffRank = h2.cards[0].rank;
      }

      if (h1FourRank "lt" h2FourRank) // like 4K, 4A
        return -1;
      else if (h1FourRank "gt" h2FourRank)
        return +1;
      else // both hands have same four-kind (mult. decks)
      {
        if (h1OffRank "lt" h2OffRank)
          return -1;  // like 3c 9c9d9h9s "lt" Qd 9c9d9h9s
        else if (h1OffRank "gt" h2OffRank)
          return +1;  // like Jc 4c4d4h4s "gt" 9s 4c4d4h4s
        else if (h1OffRank == h2OffRank)
          return 0;
      }
      throw new Exception("Fatal logic BreakTieFourKind");
    }

    private static int BreakTieFullHouse(Hand h1, Hand h2)
    {
      // determine high rank (3 kind) and low rank (2 kind)
      // JJJ 55 or 33 KKK
      // if [1] == [2] 3 kind at [0][1][2]
      // if [1] != [2] 3 kind at [2][3][4]
      int h1ThreeRank; int h1TwoRank;
      if (h1.cards[1].rank == h1.cards[2].rank)
      {
        // if [1] == [2] 3 kind at [0][1][2]
        h1ThreeRank = h1.cards[0].rank;
        h1TwoRank = h1.cards[4].rank;
      }
      else
      {
        // if [1] != [2] 3 kind at [2][3][4]
        h1ThreeRank = h1.cards[4].rank;
        h1TwoRank = h1.cards[0].rank;
      }

      int h2ThreeRank; int h2TwoRank;
      if (h2.cards[1].rank == h2.cards[2].rank)
      {
        // if [1] == [2] 3 kind at [0][1][2]
        h2ThreeRank = h2.cards[0].rank;
        h2TwoRank = h2.cards[4].rank;
      }
      else
      {
        // if [1] != [2] 3 kind at [2][3][4]
        h2ThreeRank = h2.cards[4].rank;
        h2TwoRank = h2.cards[0].rank;
      }

      if (h1ThreeRank "lt" h2ThreeRank)
        return -1;
      else if (h1ThreeRank "gt" h2ThreeRank)
        return +1;
      else // both hands same three-kind (mult. decks)
      {
        if (h1TwoRank "lt" h2TwoRank)
          return -1;  // like 3c3d 9c9d9h "lt" QdQs 9c9d9h
        else if (h1TwoRank "gt" h2TwoRank)
          return +1;  // like 3c3d 9c9d9h "gt" 2d2s 9c9d9h
        else if (h1TwoRank == h2TwoRank)
          return 0;
      }
      throw new Exception("Fatal logic BreakTieFullHouse");
    }

    private static int BreakTieFlush(Hand h1, Hand h2)
    {
      // compare rank of high cards
      if (h1.cards[4].rank "lt" h2.cards[4].rank)
        return -1;
      else if (h1.cards[4].rank "gt" h2.cards[4].rank)
        return +1;
      // high cards equal so check at [3]
      else if (h1.cards[3].rank "lt" h2.cards[3].rank)
        return -1;
      else if (h1.cards[3].rank "gt" h2.cards[3].rank)
        return +1;
      // and so on
      else if (h1.cards[2].rank "lt" h2.cards[2].rank)
        return -1;
      else if (h1.cards[2].rank "gt" h2.cards[2].rank)
        return +1;
      //
      else if (h1.cards[1].rank "lt" h2.cards[1].rank)
        return -1;
      else if (h1.cards[1].rank "gt" h2.cards[1].rank)
        return +1;
      //
      else if (h1.cards[0].rank "lt" h2.cards[0].rank)
        return -1;
      else if (h1.cards[0].rank "gt" h2.cards[0].rank)
        return +1;
      //
      else
        return 0; // all ranks the same!
    }

    private static int BreakTieStraight(Hand h1, Hand h2)
    {
      // both hands are straights but one could be Ace-low
      // check special case of one or two Ace-low hands
      // h1 is Ace-low, h2 not Ace-low. h1 is less
      if ((h1.cards[0].rank == 2 "and"  // Ace-low (sorted!)
        h1.cards[4].rank == 14) "and"
        !(h2.cards[0].rank == 2 "and"
        h2.cards[4].rank == 14))
        return -1;
      // h1 not Ace-low, h2 is Ace-low, h1 is better
      else if (!(h1.cards[0].rank == 2 "and"
        h1.cards[4].rank == 14) "and"
        (h2.cards[0].rank == 2 "and"
        h2.cards[4].rank == 14))
        return +1;
      // two Ace-low hands
      else if ((h1.cards[0].rank == 2 "and"
        h1.cards[4].rank == 14) "and"
        (h2.cards[0].rank == 2 "and"
        h2.cards[4].rank == 14))
        return 0;

      // no Ace-low hands so just check high card
      if (h1.cards[4].rank "lt" h2.cards[4].rank)
        return -1;
      else if (h1.cards[4].rank "gt" h2.cards[4].rank)
        return +1;
      else if (h1.cards[4].rank == h2.cards[4].rank)
        return 0;
      else
        throw new
          Exception("Fatal logic BreakTieStraight");
    }

    private static int BreakTieThreeKind(Hand h1, Hand h2)
    {
      // assumes multiple decks possible
      // (TTT L H) or (L TTT H) or (L H TTT)
      int h1ThreeRank = 0; int h1LowRank = 0;
      int h1HighRank = 0;
      if (h1.cards[0].rank == h1.cards[1].rank "and"
        h1.cards[1].rank == h1.cards[2].rank)
      {
        h1ThreeRank = h1.cards[0].rank;
        h1LowRank = h1.cards[3].rank;
        h1HighRank = h1.cards[4].rank;
      }
      else if (h1.cards[1].rank == h1.cards[2].rank "and"
        h1.cards[2].rank == h1.cards[3].rank)
      {
        h1LowRank = h1.cards[0].rank;
        h1ThreeRank = h1.cards[1].rank;
        h1HighRank = h1.cards[4].rank;
      }
      else if (h1.cards[2].rank == h1.cards[3].rank "and"
        h1.cards[3].rank == h1.cards[4].rank)
      {
        h1LowRank = h1.cards[0].rank;
        h1HighRank = h1.cards[1].rank;
        h1ThreeRank = h1.cards[4].rank;
      }

      int h2ThreeRank = 0; int h2LowRank = 0;
      int h2HighRank = 0;
      if (h2.cards[0].rank == h2.cards[1].rank "and"
        h2.cards[1].rank == h2.cards[2].rank)
      {
        h2ThreeRank = h2.cards[0].rank;
        h2LowRank = h2.cards[3].rank;
        h2HighRank = h2.cards[4].rank;
      }
      else if (h2.cards[1].rank == h2.cards[2].rank "and"
        h2.cards[2].rank == h2.cards[3].rank)
      {
        h2LowRank = h2.cards[0].rank;
        h2ThreeRank = h2.cards[1].rank;
        h2HighRank = h2.cards[4].rank;
      }
      else if (h2.cards[2].rank == h2.cards[3].rank "and"
        h2.cards[3].rank == h2.cards[4].rank)
      {
        h2LowRank = h2.cards[0].rank;
        h2HighRank = h2.cards[1].rank;
        h2ThreeRank = h2.cards[4].rank;
      }

      if (h1ThreeRank "lt" h2ThreeRank)
        return -1;
      else if (h1ThreeRank "gt" h2ThreeRank)
        return +1;
      // both hands three-kind same (mult. decks)
      else if (h1HighRank "lt" h2HighRank)
        return -1;
      else if (h1HighRank "gt" h2HighRank)
        return +1;
      //
      else if (h1LowRank "lt" h2LowRank)
        return -1;
      else if (h1LowRank "gt" h2LowRank)
        return +1;
      //
      else // wow!
        return 0;
    }

    private static int BreakTieTwoPair(Hand h1, Hand h2)
    {
      // (LL X HH) or (LL HH X) or (X LL HH)
      int h1LowRank = 0; int h1HighRank = 0;
      int h1OffRank = 0;
      if (h1.cards[0].rank == h1.cards[1].rank "and"
        h1.cards[3].rank == h1.cards[4].rank)
      {
        // (LL X HH)
        h1LowRank = h1.cards[0].rank;
        h1HighRank = h1.cards[4].rank;
        h1OffRank = h1.cards[2].rank;
      }
      else if (h1.cards[0].rank == h1.cards[1].rank "and"
        h1.cards[2].rank == h1.cards[3].rank)
      {
        // (LL HH X)
        h1LowRank = h1.cards[0].rank;
        h1HighRank = h1.cards[2].rank;
        h1OffRank = h1.cards[4].rank;
      }
      else if (h1.cards[1].rank == h1.cards[2].rank "and"
        h1.cards[3].rank == h1.cards[4].rank)
      {
        // (X LL HH)
        h1LowRank = h1.cards[1].rank;
        h1HighRank = h1.cards[3].rank;
        h1OffRank = h1.cards[0].rank;
      }

      int h2LowRank = 0; int h2HighRank = 0;
      int h2OffRank = 0;
      if (h2.cards[0].rank == h2.cards[1].rank "and"
        h2.cards[3].rank == h2.cards[4].rank)
      {
        // (LL X HH)
        h2LowRank = h2.cards[0].rank;
        h2HighRank = h2.cards[4].rank;
        h2OffRank = h2.cards[2].rank;
      }
      else if (h2.cards[0].rank == h2.cards[1].rank "and"
        h2.cards[2].rank == h2.cards[3].rank)
      {
        // (LL HH X)
        h2LowRank = h2.cards[0].rank;
        h2HighRank = h2.cards[2].rank;
        h2OffRank = h2.cards[4].rank;
      }
      else if (h2.cards[1].rank == h2.cards[2].rank "and"
        h2.cards[3].rank == h2.cards[4].rank)
      {
        // (X LL HH)
        h2LowRank = h2.cards[1].rank;
        h2HighRank = h2.cards[3].rank;
        h2OffRank = h2.cards[0].rank;
      }

      if (h1HighRank "lt" h2HighRank)
        return -1;
      else if (h1HighRank "gt" h2HighRank)
        return +1;
      else if (h1LowRank "lt" h2LowRank)
        return -1;
      else if (h1LowRank "gt" h2LowRank)
        return +1;
      else if (h1OffRank "lt" h2OffRank)
        return -1;
      else if (h1OffRank "gt" h2OffRank)
        return +1;
      else
        return 0;
    }

    private static int BreakTieOnePair(Hand h1, Hand h2)
    {
      // (PP L M H) or (L PP M H)
      // or (L M PP H) or (L M H PP)
      int h1PairRank = 0; int h1LowRank = 0;
      int h1MediumRank = 0; int h1HighRank = 0;
      if (h1.cards[0].rank == h1.cards[1].rank)
      {
        // (PP L M H)
        h1PairRank = h1.cards[0].rank;
        h1LowRank = h1.cards[2].rank;
        h1MediumRank = h1.cards[3].rank;
        h1HighRank = h1.cards[4].rank;
      }
      else if (h1.cards[1].rank == h1.cards[2].rank)
      {
        // (L PP M H)
        h1PairRank = h1.cards[1].rank;
        h1LowRank = h1.cards[0].rank;
        h1MediumRank = h1.cards[3].rank;
        h1HighRank = h1.cards[4].rank;
      }
      else if (h1.cards[2].rank == h1.cards[3].rank)
      {
        // (L M PP H)
        h1PairRank = h1.cards[2].rank;
        h1LowRank = h1.cards[0].rank;
        h1MediumRank = h1.cards[1].rank;
        h1HighRank = h1.cards[4].rank;
      }
      else if (h1.cards[3].rank == h1.cards[4].rank)
      {
        // (L M H PP)
        h1PairRank = h1.cards[4].rank;
        h1LowRank = h1.cards[0].rank;
        h1MediumRank = h1.cards[1].rank;
        h1HighRank = h1.cards[2].rank;
      }

      int h2PairRank = 0; int h2LowRank = 0;
      int h2MediumRank = 0; int h2HighRank = 0;
      if (h2.cards[0].rank == h2.cards[1].rank)
      {
        // (PP L M H)
        h2PairRank = h2.cards[0].rank;
        h2LowRank = h2.cards[2].rank;
        h2MediumRank = h2.cards[3].rank;
        h2HighRank = h2.cards[4].rank;
      }
      else if (h2.cards[1].rank == h2.cards[2].rank)
      {
        // (L PP M H)
        h2PairRank = h2.cards[1].rank;
        h2LowRank = h2.cards[0].rank;
        h2MediumRank = h2.cards[3].rank;
        h2HighRank = h2.cards[4].rank;
      }
      else if (h2.cards[2].rank == h2.cards[3].rank)
      {
        // (L M PP H)
        h2PairRank = h2.cards[2].rank;
        h2LowRank = h2.cards[0].rank;
        h2MediumRank = h2.cards[1].rank;
        h2HighRank = h2.cards[4].rank;
      }
      else if (h2.cards[3].rank == h2.cards[4].rank)
      {
        // (L M H PP)
        h2PairRank = h2.cards[4].rank;
        h2LowRank = h2.cards[0].rank;
        h2MediumRank = h2.cards[1].rank;
        h2HighRank = h2.cards[2].rank;
      }

      if (h1PairRank "lt" h2PairRank)
        return -1;
      else if (h1PairRank "gt" h2PairRank)
        return +1;
      //
      else if (h1HighRank "lt" h2HighRank)
        return -1;
      else if (h1HighRank "gt" h2HighRank)
        return +1;
      //
      else if (h1MediumRank "lt" h2MediumRank)
        return -1;
      else if (h1MediumRank "gt" h2MediumRank)
        return +1;
      //
      else if (h1LowRank "lt" h2LowRank)
        return -1;
      else if (h1LowRank "gt" h2LowRank)
        return +1;
      //
      else
        return 0;
    }

    private static int BreakTieHighCard(Hand h1, Hand h2)
    {
      if (h1.cards[4].rank "lt" h2.cards[4].rank)
        return -1;
      else if (h1.cards[4].rank "gt" h2.cards[4].rank)
        return +1;
      //
      else if (h1.cards[3].rank "lt" h2.cards[3].rank)
        return -1;
      else if (h1.cards[3].rank "gt" h2.cards[3].rank)
        return +1;
      //
      else if (h1.cards[2].rank "lt" h2.cards[2].rank)
        return -1;
      else if (h1.cards[2].rank "gt" h2.cards[2].rank)
        return +1;
      //
      else if (h1.cards[1].rank "lt" h2.cards[1].rank)
        return -1;
      else if (h1.cards[1].rank "gt" h2.cards[1].rank)
        return +1;
      //
      else if (h1.cards[0].rank "lt" h2.cards[0].rank)
        return -1;
      else if (h1.cards[0].rank "gt" h2.cards[0].rank)
        return +1;
      //
      else
        return 0;
    }

    // ------------------------------------------------------

  } // class Hand


} // ns

Posted in Poker | Leave a comment

Nearest Centroid Classification From Scratch Using C#

Posted on April 8, 2024 by jamesdmccaffrey

Nearest centroid classification is arguably the simplest form of machine learning classification. The means (aka centroids) for each class in the training data are computed. To classify a new item x, you pick the class associated with the mean that is closest to x.

I put together a demo using the C# language. I used a subset of the Penguin Dataset. The raw data looks like:

0, 39.5, 17.4, 186, 3800
0, 40.3, 18.0, 195, 3250
0, 36.7, 19.3, 193, 3450
. . .
2, 44.0, 13.6, 208, 4350
2, 42.7, 13.7, 208, 3950

Each line of data represents a penguin. The fields are species to predict (0 = Adelie, 1 = Chinstrap, 2 = Gentoo), and the predictors are bill length, bill depth, flipper length, body mass. The full Penguin Dataset has 345 items, where 333 are valid (no missing values). My demo data has 60 training items (20 of each species), and 30 test items (10 of each species). All of the 90 demo items are females because nearest centroid classification doesn’t handle categorical predictors, such as sex, very well.

My demo program normalizes the training and test data using the divide-by-constant technique. The divisor values are (60.0, 20.0, 200.0, 5000.0) so that all the normalized values are between 0.0 and 1.0 (nearly). Because nearest centroid classification uses Euclidean distance, data must be normalized so that predictors with large magnitudes (such as body mass) don’t overwhelm predictors with small magnitudes. The normalized data looks like:

0.6583   0.8700   0.9300   0.7600
0.6717   0.9000   0.9750   0.6500
0.6117   0.9650   0.9650   0.6900
0.6483   0.8900   0.9050   0.7250
. . .

The training method just computes the means of the three classes of the normalized data:

[0]   0.6344   0.9100   0.9250   0.6813
[1]   0.7731   0.8828   0.9510   0.7050
[2]   0.7495   0.6992   1.0565   0.9230

To classify a new data item like x = (39.5, 17.4, 186, 3800), the data is normalized, and then the normalized input is compared to each mean, and the class (0, 1, 2) of the closest mean is the prediction. The model scores 0.9833 accuracy on the training data (59 out of 60 correct) and 1.0000 accuracy on the test data (30 out of 30 correct). Quite impressive.

Looking at the three centroids/means, a low value for predictor [0] (bill length) identifies class 0. A low value for predictor [1] (bill width) identifies, or a high value for predictor [3] (body mass) identifies class 2. Nearest centroid classification works well for this data only because it’s an easy problem. In most cases, nearest centroid classification doesn’t work so well, and is mostly useful as a baseline for more powerful classification techniques.

It’s not clear to me if nearest centroid classification can be used with mixed numeric and categorical data. I suspect that by using one-over-n-hot encoding on regular categorical data, and equal-interval encoding on ordered categorical data, nearest centroid classification will work well. But I need to explore this question.

Nearest centroid classification can be easily adapted to a simple anomaly detection system. Data items that are farthest from a mean are anomalous in some way. In situations where the source data doesn’t have an explicit class label, it’s possible to designate any non-numeric column as an implied class label.

Computing the distance between a centroid and a vector is relatively easy. It’s not so easy to compute a distance between two images. Here are three nice portraits that an Internet image search says are a close distance from each other. Left: By Dodi Ballada. Center: By Emerico Toth. Right: By Mona Niko.

Demo data. Replace “lt” (less-than), “gt”, “lte”, “gte”, “and” with Boolean operator symbols.

using System;
using System.IO;

// simple means-based classification

namespace NearestCentroidClassification
{
  internal class NearestCentroidProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin nearest " +
        "centroid classification demo ");

      // 1. load data
      Console.WriteLine("\nLoading penguin subset " +
        "train and test data ");
      string trainFile =
        "..\\..\\..\\Data\\penguin_train_60.txt";
      double[][] trainX = MatLoad(trainFile,
        new int[] { 1, 2, 3, 4 }, ',', "#");
      Console.WriteLine("\nX raw:");
      MatShow(trainX, 1, 9, 4, true);

      double[] divisors =
        new double[] { 60.0, 20.0, 200.0, 5000.0 };
      double[][] trainX_Norm =
        MatNormalize(trainX, divisors);
      Console.WriteLine("\nNormalized X: ");
      MatShow(trainX_Norm, 4, 9, 4, true);

      int[] trainY = VecLoad(trainFile, 0, "#");
      //VecShow(trainY, 3);

      string testFile =
        "..\\..\\..\\Data\\penguin_test_30.txt";
      double[][] testX = MatLoad(testFile,
        new int[] { 1, 2, 3, 4 }, ',', "#");
      double[][] testX_Norm =
        MatNormalize(testX, divisors);

      int[] testY = VecLoad(testFile, 0, "#");
      //VecShow(testY, 3);

      Console.WriteLine("\nCreating " +
        "NearestCentroidClassifier object ");
      int numClasses = 3;
      NearestCentroidClassifier ncc =
        new NearestCentroidClassifier(numClasses);
      Console.WriteLine("Training classifier ");
      ncc.Train(trainX_Norm, trainY);
      Console.WriteLine("Done ");

      Console.WriteLine("\nClass means: ");
      MatShow(ncc.means, 4, 9, 3, true);

      double accTrain = ncc.Accuracy(trainX_Norm, trainY);
      Console.WriteLine("\nAccuracy on train: " +
        accTrain.ToString("F4"));

      double accTest = ncc.Accuracy(testX_Norm, testY);
      Console.WriteLine("Accuracy on test: " +
        accTest.ToString("F4"));

      Console.WriteLine("\nPrediciting species" +
        " for x = 46.5, 17.9, 192, 3500");

      string[] speciesNames = new string[] { "Adelie",
        "Chinstrap", "Gentoo" };
      double[] xRaw = { 46.5, 17.9, 192, 3500 }; // c = 1
      double[] xNorm = VecNormalize(xRaw, divisors);
      Console.Write("Normalized x =");
      VecShow(xNorm, 4, 9);

      int species = ncc.Predict(xNorm);
      Console.WriteLine(species);
      Console.WriteLine(speciesNames[species]);

      Console.WriteLine("\nEnd demo ");
      Console.ReadLine();
    } // Main

    // ------------------------------------------------------

    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment)
    {
      // count number of non-comment lines
      int nRows = 0;
      string line = "";
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      while ((line = sr.ReadLine()) != null)
        if (line.StartsWith(comment) == false)
          ++nRows;
      sr.Close(); ifs.Close();

      // make result matrix
      int nCols = usecols.Length;
      double[][] result = new double[nRows][];
      for (int r = 0; r "lt" nRows; ++r)
        result[r] = new double[nCols];

      line = "";
      string[] tokens = null;
      ifs = new FileStream(fn, FileMode.Open);
      sr = new StreamReader(ifs);

      int i = 0;
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true)
          continue;
        tokens = line.Split(sep);
        for (int j = 0; j "lt" nCols; ++j)
        {
          int k = usecols[j];  // into tokens
          result[i][j] = double.Parse(tokens[k]);
        }
        ++i;
      }
      sr.Close(); ifs.Close();
      return result;
    }

    // ------------------------------------------------------

    public static int[] VecLoad(string fn, int usecol,
      string comment)
    {
      char dummySep = ',';
      double[][] tmp = MatLoad(fn, new int[] { usecol },
        dummySep, comment);
      int n = tmp.Length;
      int[] result = new int[n];
      for (int i = 0; i "lt" n; ++i)
        result[i] = (int)tmp[i][0];
      return result;
    }

    // ------------------------------------------------------

    public static double[][] MatNormalize(double[][] X,
      double[] divisors)
    {
      int n = X.Length; int dim = X[0].Length;
      double[][] result = new double[n][];
      for (int i = 0; i "lt" n; ++i)
        result[i] = new double[dim];
      for (int j = 0; j "lt" dim; ++j)
        for (int i = 0; i "lt" n; ++i)
          result[i][j] = X[i][j] / divisors[j];

      return result;
    } // MatNormal

    // ------------------------------------------------------

    public static double[] VecNormalize(double[] x,
      double[] divisors)
    {
      int n = x.Length;
      double[] result = new double[n];
      for (int i = 0; i "lt" n; ++i)
        result[i] = x[i] / divisors[i];
      return result;
    }

    // ------------------------------------------------------

    public static void MatShow(double[][] M, int dec,
      int wid, int numRows, bool showIndices)
    {
      double small = 1.0 / Math.Pow(10, dec);
      for (int i = 0; i "lt" numRows; ++i)
      {
        if (showIndices == true)
        {
          int pad = M.Length.ToString().Length;
          Console.Write("[" + i.ToString().
            PadLeft(pad) + "]");
        }
        for (int j = 0; j "lt" M[0].Length; ++j)
        {
          double v = M[i][j];
          if (Math.Abs(v) "lt" small) v = 0.0;
          Console.Write(v.ToString("F" + dec).
            PadLeft(wid));
        }
        Console.WriteLine("");
      }
      if (numRows "lt" M.Length) Console.WriteLine(". . .");
    }

    // ------------------------------------------------------

    public static void VecShow(int[] vec, int wid)
    {
      int n = vec.Length;
      for (int i = 0; i "lt" n; ++i)
      {
        if (i != 0 "and" i % 10 == 0) Console.WriteLine("");
        Console.Write(vec[i].ToString().PadLeft(wid));
      }
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static void VecShow(double[] vec, int decimals,
      int wid)
    {
      int n = vec.Length;
      for (int i = 0; i "lt" n; ++i)
        Console.Write(vec[i].ToString("F" + decimals).
          PadLeft(wid));
      Console.WriteLine("");
    }

    // ------------------------------------------------------

  } // Program

  public class NearestCentroidClassifier
  {
    public int numClasses;
    public double[][] means;  // mean of each class

    public NearestCentroidClassifier(int numClasses)
    {
      this.numClasses = numClasses;
    }

    public void Train(double[][] trainX, int[] trainY)
    {
      // compute mean of each class
      int n = trainX.Length;
      int dim = trainX[0].Length;

      this.means = new double[this.numClasses][];
      for (int c = 0; c "lt" numClasses; ++c)
        this.means[c] = new double[dim];

      double[][] sums = new double[this.numClasses][];
      for (int c = 0; c "lt" numClasses; ++c)
        sums[c] = new double[dim];

      int[][] counts = new int[this.numClasses][];
      for (int c = 0; c "lt" numClasses; ++c)
        counts[c] = new int[dim];

      for (int i = 0; i "lt" n; ++i)
      {
        int c = trainY[i];
        for (int j = 0; j "lt" dim; ++j)
        {
          sums[c][j] += trainX[i][j];
          ++counts[c][j];
        }
      }

      for (int c = 0; c "lt" this.numClasses; ++c)
        for (int j = 0; j "lt" dim; ++j)
          this.means[c][j] = sums[c][j] / counts[c][j];

      // // less efficient but more clear
      //for (int c = 0; c "lt" this.numClasses; ++c)
      //{
      //  for (int j = 0; j "lt" dim; ++j) // each col
      //  {
      //    double colSum = 0.0;
      //    int colCount = 0;
      //    for (int i = 0; i "lt" n; ++i)  // each row
      //    {
      //      if (trainY[i] != c) continue;
      //      colSum += trainX[i][j];
      //      ++colCount;
      //    }
      //    this.means[c][j] = colSum / colCount;
      //  } // each col
      // } // each class

    } // Train

    public int Predict(double[] x)
    {
      double[] distances = new double[this.numClasses];
      for (int c = 0; c "lt" this.numClasses; ++c)
        distances[c] = EucDistance(x, this.means[c]);

      double smallestDist = distances[0];
      int result = 0;
      for (int c = 0; c "lt" this.numClasses; ++c)
      {
        if (distances[c] "lt" smallestDist)
        {
          smallestDist = distances[c];
          result = c;
        }
      }
      return result;
    }

    // -------------------------------------------------------

    public double Accuracy(double[][] dataX, int[] dataY)
    {
      int nCorrect = 0;
      int nWrong = 0;
      int n = dataX.Length;
      for (int i = 0; i "lt" n; ++i)
      {
        int c = this.Predict(dataX[i]);
        if (c == dataY[i])
          ++nCorrect;
        else
          ++nWrong;
      }
      return (nCorrect * 1.0) / (nCorrect + nWrong);
    }

    private double EucDistance(double[] v1, double[] v2)
    {
      int dim = v1.Length;
      double sum = 0.0;
      for (int d = 0; d "lt" dim; ++d)
        sum += (v1[d] - v2[d]) * (v1[d] - v2[d]);
      return Math.Sqrt(sum);
    }

  } // class NearestCentroidClassifier

} // ns

Training data:

# penguin_train_60.txt
# species (0 = Adelie, 1 = Chinstrap, 2 = Gentoo)
# bill length, bill depth, flipper length, body mass
# all female
#
0, 39.5, 17.4, 186, 3800
0, 40.3, 18.0, 195, 3250
0, 36.7, 19.3, 193, 3450
0, 38.9, 17.8, 181, 3625
0, 41.1, 17.6, 182, 3200
0, 36.6, 17.8, 185, 3700
0, 38.7, 19.0, 195, 3450
0, 34.4, 18.4, 184, 3325
0, 37.8, 18.3, 174, 3400
0, 35.9, 19.2, 189, 3800
0, 35.3, 18.9, 187, 3800
0, 40.5, 17.9, 187, 3200
0, 37.9, 18.6, 172, 3150
0, 39.5, 16.7, 178, 3250
0, 39.5, 17.8, 188, 3300
0, 36.4, 17.0, 195, 3325
0, 42.2, 18.5, 180, 3550
0, 37.6, 19.3, 181, 3300
0, 36.5, 18.0, 182, 3150
0, 36.0, 18.5, 186, 3100
1, 46.5, 17.9, 192, 3500
1, 45.4, 18.7, 188, 3525
1, 45.2, 17.8, 198, 3950
1, 46.1, 18.2, 178, 3250
1, 46.0, 18.9, 195, 4150
1, 46.6, 17.8, 193, 3800
1, 47.0, 17.3, 185, 3700
1, 45.9, 17.1, 190, 3575
1, 58.0, 17.8, 181, 3700
1, 46.4, 18.6, 190, 3450
1, 42.4, 17.3, 181, 3600
1, 43.2, 16.6, 187, 2900
1, 46.7, 17.9, 195, 3300
1, 50.5, 18.4, 200, 3400
1, 46.4, 17.8, 191, 3700
1, 40.9, 16.6, 187, 3200
1, 42.5, 16.7, 187, 3350
1, 47.5, 16.8, 199, 3900
1, 47.6, 18.3, 195, 3850
1, 46.9, 16.6, 192, 2700
2, 46.1, 13.2, 211, 4500
2, 48.7, 14.1, 210, 4450
2, 46.5, 13.5, 210, 4550
2, 45.4, 14.6, 211, 4800
2, 43.3, 13.4, 209, 4400
2, 40.9, 13.7, 214, 4650
2, 45.5, 13.7, 214, 4650
2, 45.8, 14.6, 210, 4200
2, 42.0, 13.5, 210, 4150
2, 46.2, 14.5, 209, 4800
2, 45.1, 14.5, 215, 5000
2, 46.5, 14.5, 213, 4400
2, 42.9, 13.1, 215, 5000
2, 48.2, 14.3, 210, 4600
2, 42.8, 14.2, 209, 4700
2, 45.1, 14.5, 207, 5050
2, 49.1, 14.8, 220, 5150
2, 42.6, 13.7, 213, 4950
2, 44.0, 13.6, 208, 4350
2, 42.7, 13.7, 208, 3950

Test data:

# penguin_test_30.txt
0, 37.0, 16.9, 185, 3000
0, 36.0, 17.9, 190, 3450
0, 39.6, 17.7, 186, 3500
0, 35.0, 17.9, 190, 3450
0, 34.5, 18.1, 187, 2900
0, 39.0, 17.5, 186, 3550
0, 36.5, 16.6, 181, 2850
0, 35.7, 16.9, 185, 3150
0, 37.6, 17.0, 185, 3600
0, 36.4, 17.1, 184, 2850
1, 46.2, 17.5, 187, 3650
1, 45.5, 17.0, 196, 3500
1, 50.9, 17.9, 196, 3675
1, 50.1, 17.9, 190, 3400
1, 49.8, 17.3, 198, 3675
1, 48.1, 16.4, 199, 3325
1, 45.7, 17.3, 193, 3600
1, 42.5, 17.3, 187, 3350
1, 45.2, 16.6, 191, 3250
1, 45.6, 19.4, 194, 3525
2, 45.3, 13.7, 210, 4300
2, 43.6, 13.9, 217, 4900
2, 45.5, 13.9, 210, 4200
2, 44.9, 13.3, 213, 5100
2, 46.6, 14.2, 210, 4850
2, 45.1, 14.4, 210, 4400
2, 46.5, 14.4, 217, 4900
2, 43.8, 13.9, 208, 4300
2, 43.2, 14.5, 208, 4450
2, 45.3, 13.8, 208, 4200

Posted in Machine Learning | 1 Comment

Data Anomaly Detection Using a Self-Organizing Map (SOM) with C#

Posted on April 5, 2024 by jamesdmccaffrey

A self-organizing map (SOM) is a data structure and associated algorithms that can be used to cluster data. Each cluster has a representative vector. Data items that assigned to a SOM cluster but are far (Euclidean distance) from the cluster vector are anomalous.

I put together a demo. I used a 165-item subset of the Penguin Dataset. Each item represents a penguin. The raw data looks like:

Adelie     39.5  17.4  186  3800  Female
Chinstrap  46.5  17.9  192  3500  Female
Gentoo     46.1  13.2  211  4500  Female
. . .

The fields are species, bill length, bill width, flipper length, body mass, sex. The full Penguin Dataset has 333 valid items with no NA values. I used only the 165 Female items.

Because SOM clustering uses Euclidean distance, the data must be normalized. I used min-max normalization. The min and max values for the four numeric columns are:

# bill len min = 32.1 max = 58.0
# bill wid min = 13.1 max = 20.7
# flip len min = 172 max = 222
# body mass min = 2700 max = 5200

The equation is x’ = (x – min) / (max – min) where x’ is the normalized value, x is a raw value, min is smallest column value, max is largest column value. Each min-max normalized value is between 0 (the smallest value in a column) and 1 (the largest in a column).

I removed the species column and the sex column. The resulting normalized data looks like:

0.2857, 0.5658, 0.2800, 0.4400
0.3166, 0.6447, 0.4600, 0.2200
0.1776, 0.8158, 0.4200, 0.3000
. . .

I set up the demo SOM map as 2-by-2 for a total of 4 map nodes. Creating a SOM map is an iterative process that requires a stepsMax value (I used 1,000) and a lrnRateMax value (I used 0.75). These values must be determined by trial and error. I monitored the SOM map building every 200 iterations by computing the sum of Euclidean distances (SED) between map node vectors and data items assigned to the map node / cluster:

Computing SOM clustering
map build step =    0  |  SED =  102.9645
map build step =  200  |  SED =   95.4465
map build step =  400  |  SED =   77.5610
map build step =  600  |  SED =   51.4788
map build step =  800  |  SED =   50.1617
Done

Each of the 4 map nodes is identified by a [row][col] pair of indices. The four resulting map node vectors are:

SOM map nodes:
[0][0] :   0.5343  0.1602  0.8311  0.8046
[0][1] :   0.6010  0.6424  0.4009  0.3653
[1][0] :   0.3029  0.4344  0.5074  0.4359
[1][1] :   0.2014  0.5919  0.2638  0.2344

The 165 data items were assigned to map nodes according to this distribution:

SOM mapping:
[0][0] : 58 items
[0][1] : 30 items
[1][0] :  7 items
[1][1] : 70 items

After the SOM map was constructed, I analyzed the data, looking for the data item assigned to each cluster/node that is farthest from the map node vector:

node [0][0] :
   most anomalous data idx = 126
   0.4093  0.0789  0.7200  0.5000
   Gentoo  42.7  13.7  208  3950  Female
   distance to node vec = 0.3569

node [0][1] :
   most anomalous data idx = 54
   0.2896  1.0000  0.3800  0.4800
   Adelie  39.6  20.7  191  3900  Female
   distance to node vec = 0.4883

node [1][0] :
   most anomalous data idx = 42
   0.1390  0.6447  0.6000  0.3400
   Adelie  35.7  18.0  202  3550  Female
   distance to node vec = 0.2980

node [1][1] :
   most anomalous data idx = 68
   0.0000  0.3158  0.3200  0.1400
   Adelie  32.1  15.5  188  3050  Female
   distance to node vec = 0.3590

I displayed the index of the anomalous data item, its normalized form, its raw form, and the distance from the item to its map node vector. In a non-demo scenario, these data items would be examined to determine if they are in fact anomalies, and if so, what might be the cause.

It’s not clear of SOM anomaly detection can be applied to categorical data or to mixed numeric and categorical data. It might be possible to use one-over-n-hot encoding for standard categorical data, i.e. red = (0.33, 0, 0), blue = (0, 0.33, 0), green = (0, 0, 0.33) and use equal-interval encoding for categorical data that has inherent ordering, i.e., short = 0.25, medium = 0.50, tall = 0.75.

Interesting stuff.

Like many of the guys I work with, I fell in love with math as a young man via playing card games and dice games. Here are three sets of anomalous dice. Left: A multi-dice spinner from Mythroll Armory. You spin the big knob and the colored ball bearing indicates the result. Center: Self-rolling dice from Boogie Dice. They have a sound sensor and internal vibrating motor. You snap your fingers or clap, and the dice spring into action and roll themselves. Right: The Spinner of Fate is like a roulette wheel.

Demo code. Replace “lt” (less than), “gt”, “lte”, “gte”, “and” with Boolean operator symbols.

using System;
using System.Collections.Generic;
using System.IO;

// code based on the Wikipedia article algorithm

namespace AnomalySOM
{
  class AnomalySOMProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("\nBegin self-organizing" +
        " map (SOM) anomaly analysis using C#");

      // 1. load data
      Console.WriteLine("\nLoading 165-item Penguin subset ");

      string rf = "..\\..\\..\\Data\\penguin_raw.txt";
      string[] rawFileArray = AnomalySOM.FileLoad(rf, "#");

      string fn = "..\\..\\..\\Data\\penguin_165.txt";
      double[][] X = AnomalySOM.MatLoad(fn,
        new int[] { 0, 1, 2, 3 }, ',', "#");
      Console.WriteLine("\nX first three rows" +
        " of normalized data: ");
      AnomalySOM.MatShow(X, 4, 8, 3, true);

      // 2. create AnomalySOM object and cluster
      int mapRows = 2;
      int mapCols = 2;
      double lrnRateMax = 0.75;
      int stepsMax = 1000;
      Console.WriteLine("\nSetting map size = " +
        mapRows + "x" + mapCols);
      Console.WriteLine("Setting lrnRateMax = " +
        lrnRateMax.ToString("F2"));
      Console.WriteLine("Setting stepsMax = " +
        stepsMax);

      Console.WriteLine("\nComputing SOM clustering ");
      AnomalySOM som = new AnomalySOM(X, mapRows,
        mapCols, seed: 0);
      som.Cluster(lrnRateMax, stepsMax);
      Console.WriteLine("Done ");

      // 3. show the SOM map and mapping
      Console.WriteLine("\nSOM map nodes: ");
      for (int i = 0; i "lt" mapRows; ++i)
      {
        for (int j = 0; j "lt" mapCols; ++j)
        {
          Console.Write("[" + i + "][" + j + "] : ");
          AnomalySOM.VecShow(som.map[i][j], 4, 8);
        }
      }

      Console.WriteLine("\nSOM mapping: ");
      for (int i = 0; i "lt" mapRows; ++i)
      {
        for (int j = 0; j "lt" mapCols; ++j)
        {
          Console.Write("[" + i + "][" + j + "] : ");
          //AnomalySOM.ListShow(som.mapping[i][j]);
          Console.WriteLine(som.mapping[i][j].
            Count + " items ");
        }
      }

      // 4. show mapping as an array
      //Console.WriteLine("\nClustering as array: ");
      //int[][] clustering = som.GetClustering();
      //for (int idx = 0; idx "lt" clustering.Length;
      //++idx)
      //{
      //  Console.Write(idx + ":" + "(" + 
      //    clustering[idx][0] + "," + 
      //    clustering[idx][1] + ")" +
      //    " ");
      //}
      //Console.WriteLine("");

      // 5. analyze for anomalies
      Console.WriteLine("\nAnalysis: ");
      som.Analyze(rawFileArray);

      Console.WriteLine("\nEnd SOM anomaly demo");
      Console.ReadLine();
    } // Main

    // ------------------------------------------------------

  } // Program

  public class AnomalySOM
  {
    //public int k;  // number clusters
    public int mapRows;
    public int mapCols;
    public double[][] data;  // data to cluster
    public double[][][] map;  // [r][c][vec]
    public List"lt"int"gt"[][] mapping;  // [r][c](List)
    public Random rnd;  // to initialize map cells

    // ------------------------------------------------------
    // methods: ctor(), Cluster(), GetClustering()
    // ------------------------------------------------------

    public AnomalySOM(double[][] X, int mapRows,
      int mapCols, int seed)
    {
      this.mapRows = mapRows; // for map
      this.mapCols = mapCols; // for map
      int dim = X[0].Length; // 4 for penguin data

      this.rnd = new Random(seed);
      this.data = X; // by ref
 
      // init map nodes to random vectors in [0, 1]
      this.map =
        new double[mapRows][][];  // [r][c][vec]
      for (int i = 0; i "lt" mapRows; ++i)
      {
        this.map[i] = new double[mapCols][];
        for (int j = 0; j "lt" mapCols; ++j)
        {
          this.map[i][j] = new double[dim];
          for (int d = 0; d "lt" dim; ++d)
            this.map[i][j][d] = this.rnd.NextDouble();
        }
      }

      this.mapping =
        new List"lt"int"gt"[mapRows][]; // [r][c][lst]
      for (int i = 0; i "lt" mapRows; ++i)
      {
        this.mapping[i] = new List"lt"int"gt"[mapCols];
        for (int j = 0; j "lt" mapCols; ++j)
          this.mapping[i][j] = new List"lt"int"gt"();
      }
    } // ctor

    // ------------------------------------------------------

    public void Analyze(string[] rawFileArray)
    {
      for (int i = 0; i "lt" this.mapRows; ++i)
      {
        for (int j = 0; j "lt" this.mapCols; ++j)
        {
          double[] nodeVec = this.map[i][j];
          double largeDist = 0.0;
          int mostAnomIdx = 0;
          for (int k = 0; k "lt" this.mapping[i][j].Count;
            ++k)
          {
            // index of item assigned to curr map node
            int idx = this.mapping[i][j][k]; 
            double[] item = this.data[idx];
            double dist = 
              AnomalySOM.EucDist(nodeVec, item);
            if (dist "gt" largeDist)
            {
              largeDist = dist;
              mostAnomIdx = idx;
            }
          }
          Console.WriteLine("\nnode [" + i + "][" +
            j + "] : ");
          Console.WriteLine("   most anomalous " +
            "data idx = " + mostAnomIdx);
          Console.Write(" ");
          AnomalySOM.VecShow(this.data[mostAnomIdx],
            4, 8);
          Console.WriteLine("   " + 
            rawFileArray[mostAnomIdx]);
          Console.WriteLine("   distance to" +
            " node vec = " + largeDist.ToString("F4"));
        } // j
      } // i

    } // Analyze()

    // ------------------------------------------------------

    public void Cluster(double lrnRateMax, int stepsMax)
    {
      int n = this.data.Length;
      int dim = this.data[0].Length;
      int rangeMax = this.mapRows + this.mapCols;

      // compute the map
      for (int step = 0; step "lt" stepsMax; ++step)
      {
        // show progress 5 + 1 times
        if (step % (int)(stepsMax / 5) == 0)
        {
          Console.Write("map build step = " +
            step.ToString().PadLeft(4));
          double sum = 0.0;  // sum Euclidean dists
          for (int ix = 0; ix "lt" n; ++ix)
          {
            int[] RC = ClosestNode(ix);
            int kk = RC[1];  // RC[0] is always 0
            double[] item = this.data[ix];
            double[] node = this.map[0][kk];
            double dist = EucDist(item, node);
            sum += dist;  // accumulate
          }
          Console.WriteLine("  |  SED = " +
            sum.ToString("F4").PadLeft(9));
        }

        double pctLeft = 1.0 - ((step * 1.0) / stepsMax);
        int currRange = (int)(pctLeft * rangeMax);
        double currLrnRate = pctLeft * lrnRateMax;
        // Pick random data index
        int idx = this.rnd.Next(0, n);
        // Get (row,col) of closest map node -- 'bmu'
        int[] bmuRC = ClosestNode(idx);
        // Move each map mode closer to the bmu
        for (int i = 0; i "lt" this.mapRows; ++i)
        {
          for (int j = 0; j "lt" mapCols; ++j)
          {
            if (ManDist(bmuRC[0],
              bmuRC[1], i, j) "lte" currRange)
            {
              for (int d = 0; d "lt" dim; ++d)
                this.map[i][j][d] = this.map[i][j][d] +
                  currLrnRate * (this.data[idx][d] -
                  this.map[i][j][d]);
            }
          } // j
        } // i
      } // step
      // map has been created

      // compute mapping
      for (int idx = 0; idx "lt" n; ++idx)
      {
        // node map coords of node closest to data(idx)
        int[] rc = ClosestNode(idx);
        int r = rc[0]; int c = rc[1];
        this.mapping[r][c].Add(idx);
      }

      // results in this.mapping
      return;
    } // Cluster()

    // ------------------------------------------------------

    public int[][] GetClustering()
    {
      // cluster ID = (r, c) for every data item
      int n = this.data.Length;
      int[][] result = new int[n][];  // ID for each item
      for (int i = 0; i "lt" n; ++i)
        result[i] = new int[2]; // row, col

      for (int i = 0; i "lt" this.mapRows; ++i)
      {
        for (int j = 0; j "lt" this.mapCols; ++j)
        {
          for (int k = 0; k "lt" this.mapping[i][j].Count;
            ++k)
          {
            int dataIdx = this.mapping[i][j][k];
            result[dataIdx][0] = i;
            result[dataIdx][1] = j;
          }
        }
      }
      return result;
    }

    // ------------------------------------------------------
    // helpers: ManDist(), EucDist(), ClosestNode()
    // ------------------------------------------------------

    private static int ManDist(int r1, int c1,
      int r2, int c2)
    {
      // Manhattan distance between two SOM map cells
      return Math.Abs(r1 - r2) + Math.Abs(c1 - c2);
    }

    // ------------------------------------------------------

    private static double EucDist(double[] v1,
      double[] v2)
    {
      // Euclidean distance between two data items
      double sum = 0;
      for (int i = 0; i "lt" v1.Length; ++i)
        sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
      return Math.Sqrt(sum);
    }

    // ------------------------------------------------------

    private int[] ClosestNode(int idx)
    {
      // r,c coords in map of node closest to data[idx]
      double smallDist = double.MaxValue;
      int[] result = new int[] { 0, 0 };  // (row, col)
      for (int i = 0; i "lt" this.map.Length; ++i)
      {
        for (int j = 0; j "lt" this.map[0].Length; ++j)
        {
          double dist = EucDist(this.data[idx],
            this.map[i][j]);
          if (dist "lt" smallDist)
          {
            smallDist = dist;
            result[0] = i;
            result[1] = j;
          }
        }
      }
      return result;
    }

    // ------------------------------------------------------

    // misc. public utility functions for convenience
    // MatLoad(), FileLoad, VecLoad(), MatShow(),
    // VecShow(), ListShow()

    // ------------------------------------------------------

    public static double[][] MatLoad(string fn,
      int[] usecols, char sep, string comment)
    {
      // count number of non-comment lines
      int nRows = 0;
      string line = "";
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      while ((line = sr.ReadLine()) != null)
        if (line.StartsWith(comment) == false)
          ++nRows;
      sr.Close(); ifs.Close();

      // make result matrix
      int nCols = usecols.Length;
      double[][] result = new double[nRows][];
      for (int r = 0; r "lt" nRows; ++r)
        result[r] = new double[nCols];

      line = "";
      string[] tokens = null;
      ifs = new FileStream(fn, FileMode.Open);
      sr = new StreamReader(ifs);

      int i = 0;
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment) == true)
          continue;
        tokens = line.Split(sep);
        for (int j = 0; j "lt" nCols; ++j)
        {
          int k = usecols[j];  // into tokens
          result[i][j] = double.Parse(tokens[k]);
        }
        ++i;
      }
      sr.Close(); ifs.Close();
      return result;
    }

    // ------------------------------------------------------

    public static string[] FileLoad(string fn,
      string comment)
    {
      List"lt"string"gt" lst = new List"lt"string"gt"();
      FileStream ifs = new FileStream(fn, FileMode.Open);
      StreamReader sr = new StreamReader(ifs);
      string line = "";
      while ((line = sr.ReadLine()) != null)
      {
        if (line.StartsWith(comment)) continue;
        line = line.Trim();
        lst.Add(line);
      }
      sr.Close(); ifs.Close();
      string[] result = lst.ToArray();
      return result;
    }

    // ------------------------------------------------------

    public static int[] VecLoad(string fn, int usecol,
      string comment)
    {
      char dummySep = ',';
      double[][] tmp = MatLoad(fn, new int[] { usecol },
        dummySep, comment);
      int n = tmp.Length;
      int[] result = new int[n];
      for (int i = 0; i "lt" n; ++i)
        result[i] = (int)tmp[i][0];
      return result;
    }

    // ------------------------------------------------------

    public static void MatShow(double[][] M, int dec,
      int wid, int numRows, bool showIndices)
    {
      double small = 1.0 / Math.Pow(10, dec);
      for (int i = 0; i "lt" numRows; ++i)
      {
        if (showIndices == true)
        {
          int pad = M.Length.ToString().Length;
          Console.Write("[" + i.ToString().
            PadLeft(pad) + "]");
        }
        for (int j = 0; j "lt" M[0].Length; ++j)
        {
          double v = M[i][j];
          if (Math.Abs(v) "lt" small) v = 0.0;
          Console.Write(v.ToString("F" + dec).
            PadLeft(wid));
        }
        Console.WriteLine("");
      }
      if (numRows "lt" M.Length)
        Console.WriteLine(". . . ");
    }

    // ------------------------------------------------------

    public static void VecShow(int[] vec, int wid)
    {
      int n = vec.Length;
      for (int i = 0; i "lt" n; ++i)
        Console.Write(vec[i].ToString().PadLeft(wid));
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static void VecShow(double[] vec, int decimals,
      int wid)
    {
      int n = vec.Length;
      for (int i = 0; i "lt" n; ++i)
        Console.Write(vec[i].ToString("F" + decimals).
          PadLeft(wid));
      Console.WriteLine("");
    }

    // ------------------------------------------------------

    public static void ListShow(List"lt"int"gt" lst)
    {
      int n = lst.Count;
      for (int i = 0; i "lt" n; ++i)
      {
        Console.Write(lst[i] + " ");
      }
      Console.WriteLine("");
    }

    // ------------------------------------------------------
  } // class SOM
} // ns

Raw data:

# penguin_raw.txt
#
# species, bill len, bill wid, flip len, body mass
# only female
# bill len min = 32.1 max = 58.0
# bill wid min = 13.1 max = 20.7
# flip len min = 172 max = 222
# body mass min = 2700 max = 5200
#
Adelie  39.5  17.4  186  3800  Female
Adelie  40.3  18.0  195  3250  Female
Adelie  36.7  19.3  193  3450  Female
Adelie  38.9  17.8  181  3625  Female
Adelie  41.1  17.6  182  3200  Female
Adelie  36.6  17.8  185  3700  Female
Adelie  38.7  19.0  195  3450  Female
Adelie  34.4  18.4  184  3325  Female
Adelie  37.8  18.3  174  3400  Female
Adelie  35.9  19.2  189  3800  Female
Adelie  35.3  18.9  187  3800  Female
Adelie  40.5  17.9  187  3200  Female
Adelie  37.9  18.6  172  3150  Female
Adelie  39.5  16.7  178  3250  Female
Adelie  39.5  17.8  188  3300  Female
Adelie  36.4  17.0  195  3325  Female
Adelie  42.2  18.5  180  3550  Female
Adelie  37.6  19.3  181  3300  Female
Adelie  36.5  18.0  182  3150  Female
Adelie  36.0  18.5  186  3100  Female
Adelie  37.0  16.9  185  3000  Female
Adelie  36.0  17.9  190  3450  Female
Adelie  39.6  17.7  186  3500  Female
Adelie  35.0  17.9  190  3450  Female
Adelie  34.5  18.1  187  2900  Female
Adelie  39.0  17.5  186  3550  Female
Adelie  36.5  16.6  181  2850  Female
Adelie  35.7  16.9  185  3150  Female
Adelie  37.6  17.0  185  3600  Female
Adelie  36.4  17.1  184  2850  Female
Adelie  35.5  16.2  195  3350  Female
Adelie  35.9  16.6  190  3050  Female
Adelie  33.5  19.0  190  3600  Female
Adelie  39.6  17.2  196  3550  Female
Adelie  35.5  17.5  190  3700  Female
Adelie  40.9  16.8  191  3700  Female
Adelie  36.2  16.1  187  3550  Female
Adelie  34.6  17.2  189  3200  Female
Adelie  36.7  18.8  187  3800  Female
Adelie  37.3  17.8  191  3350  Female
Adelie  36.9  18.6  189  3500  Female
Adelie  38.9  18.8  190  3600  Female
Adelie  35.7  18.0  202  3550  Female
Adelie  34.0  17.1  185  3400  Female
Adelie  36.2  17.3  187  3300  Female
Adelie  38.1  18.6  190  3700  Female
Adelie  33.1  16.1  178  2900  Female
Adelie  35.0  17.9  192  3725  Female
Adelie  37.7  16.0  183  3075  Female
Adelie  37.9  18.6  193  2925  Female
Adelie  38.6  17.2  199  3750  Female
Adelie  38.1  17.0  181  3175  Female
Adelie  38.1  16.5  198  3825  Female
Adelie  39.7  17.7  193  3200  Female
Adelie  39.6  20.7  191  3900  Female
Adelie  38.6  17.0  188  2900  Female
Adelie  35.7  17.0  189  3350  Female
Adelie  36.2  17.2  187  3150  Female
Adelie  40.2  17.0  176  3450  Female
Adelie  35.2  15.9  186  3050  Female
Adelie  38.8  17.6  191  3275  Female
Adelie  39.0  17.1  191  3050  Female
Adelie  38.5  17.9  190  3325  Female
Adelie  36.8  18.5  193  3500  Female
Adelie  38.1  17.6  187  3425  Female
Adelie  35.6  17.5  191  3175  Female
Adelie  37.0  16.5  185  3400  Female
Adelie  40.2  17.1  193  3400  Female
Adelie  32.1  15.5  188  3050  Female
Adelie  37.3  16.8  192  3000  Female
Adelie  36.6  18.4  184  3475  Female
Adelie  36.0  17.8  195  3450  Female
Adelie  36.0  17.1  187  3700  Female
Chinstrap  46.5  17.9  192  3500  Female
Chinstrap  45.4  18.7  188  3525  Female
Chinstrap  45.2  17.8  198  3950  Female
Chinstrap  46.1  18.2  178  3250  Female
Chinstrap  46.0  18.9  195  4150  Female
Chinstrap  46.6  17.8  193  3800  Female
Chinstrap  47.0  17.3  185  3700  Female
Chinstrap  45.9  17.1  190  3575  Female
Chinstrap  58.0  17.8  181  3700  Female
Chinstrap  46.4  18.6  190  3450  Female
Chinstrap  42.4  17.3  181  3600  Female
Chinstrap  43.2  16.6  187  2900  Female
Chinstrap  46.7  17.9  195  3300  Female
Chinstrap  50.5  18.4  200  3400  Female
Chinstrap  46.4  17.8  191  3700  Female
Chinstrap  40.9  16.6  187  3200  Female
Chinstrap  42.5  16.7  187  3350  Female
Chinstrap  47.5  16.8  199  3900  Female
Chinstrap  47.6  18.3  195  3850  Female
Chinstrap  46.9  16.6  192  2700  Female
Chinstrap  46.2  17.5  187  3650  Female
Chinstrap  45.5  17.0  196  3500  Female
Chinstrap  50.9  17.9  196  3675  Female
Chinstrap  50.1  17.9  190  3400  Female
Chinstrap  49.8  17.3  198  3675  Female
Chinstrap  48.1  16.4  199  3325  Female
Chinstrap  45.7  17.3  193  3600  Female
Chinstrap  42.5  17.3  187  3350  Female
Chinstrap  45.2  16.6  191  3250  Female
Chinstrap  45.6  19.4  194  3525  Female
Chinstrap  46.8  16.5  189  3650  Female
Chinstrap  45.7  17.0  195  3650  Female
Chinstrap  43.5  18.1  202  3400  Female
Chinstrap  50.2  18.7  198  3775  Female
Gentoo  46.1  13.2  211  4500  Female
Gentoo  48.7  14.1  210  4450  Female
Gentoo  46.5  13.5  210  4550  Female
Gentoo  45.4  14.6  211  4800  Female
Gentoo  43.3  13.4  209  4400  Female
Gentoo  40.9  13.7  214  4650  Female
Gentoo  45.5  13.7  214  4650  Female
Gentoo  45.8  14.6  210  4200  Female
Gentoo  42.0  13.5  210  4150  Female
Gentoo  46.2  14.5  209  4800  Female
Gentoo  45.1  14.5  215  5000  Female
Gentoo  46.5  14.5  213  4400  Female
Gentoo  42.9  13.1  215  5000  Female
Gentoo  48.2  14.3  210  4600  Female
Gentoo  42.8  14.2  209  4700  Female
Gentoo  45.1  14.5  207  5050  Female
Gentoo  49.1  14.8  220  5150  Female
Gentoo  42.6  13.7  213  4950  Female
Gentoo  44.0  13.6  208  4350  Female
Gentoo  42.7  13.7  208  3950  Female
Gentoo  45.3  13.7  210  4300  Female
Gentoo  43.6  13.9  217  4900  Female
Gentoo  45.5  13.9  210  4200  Female
Gentoo  44.9  13.3  213  5100  Female
Gentoo  46.6  14.2  210  4850  Female
Gentoo  45.1  14.4  210  4400  Female
Gentoo  46.5  14.4  217  4900  Female
Gentoo  43.8  13.9  208  4300  Female
Gentoo  43.2  14.5  208  4450  Female
Gentoo  45.3  13.8  208  4200  Female
Gentoo  45.7  13.9  214  4400  Female
Gentoo  45.8  14.2  219  4700  Female
Gentoo  43.5  14.2  220  4700  Female
Gentoo  47.7  15.0  216  4750  Female
Gentoo  46.5  14.8  217  5200  Female
Gentoo  46.4  15.0  216  4700  Female
Gentoo  47.5  14.2  209  4600  Female
Gentoo  45.2  13.8  215  4750  Female
Gentoo  49.1  14.5  212  4625  Female
Gentoo  47.4  14.6  212  4725  Female
Gentoo  44.9  13.8  212  4750  Female
Gentoo  43.4  14.4  218  4600  Female
Gentoo  47.5  14.0  212  4875  Female
Gentoo  47.5  15.0  218  4950  Female
Gentoo  45.5  14.5  212  4750  Female
Gentoo  44.5  14.7  214  4850  Female
Gentoo  46.9  14.6  222  4875  Female
Gentoo  48.4  14.4  203  4625  Female
Gentoo  48.5  15.0  219  4850  Female
Gentoo  47.2  15.5  215  4975  Female
Gentoo  41.7  14.7  210  4700  Female
Gentoo  43.3  14.0  208  4575  Female
Gentoo  50.5  15.2  216  5000  Female
Gentoo  43.5  15.2  213  4650  Female
Gentoo  46.2  14.1  217  4375  Female
Gentoo  47.2  13.7  214  4925  Female
Gentoo  46.8  14.3  215  4850  Female
Gentoo  45.2  14.8  212  5200  Female

Normalized data:

# penguin_165.txt
#
# bill len, bill wid, flip len, body mass
# all female
#
0.2857, 0.5658, 0.2800, 0.4400
0.3166, 0.6447, 0.4600, 0.2200
0.1776, 0.8158, 0.4200, 0.3000
0.2625, 0.6184, 0.1800, 0.3700
0.3475, 0.5921, 0.2000, 0.2000
0.1737, 0.6184, 0.2600, 0.4000
0.2548, 0.7763, 0.4600, 0.3000
0.0888, 0.6974, 0.2400, 0.2500
0.2201, 0.6842, 0.0400, 0.2800
0.1467, 0.8026, 0.3400, 0.4400
0.1236, 0.7632, 0.3000, 0.4400
0.3243, 0.6316, 0.3000, 0.2000
0.2239, 0.7237, 0.0000, 0.1800
0.2857, 0.4737, 0.1200, 0.2200
0.2857, 0.6184, 0.3200, 0.2400
0.1660, 0.5132, 0.4600, 0.2500
0.3900, 0.7105, 0.1600, 0.3400
0.2124, 0.8158, 0.1800, 0.2400
0.1699, 0.6447, 0.2000, 0.1800
0.1506, 0.7105, 0.2800, 0.1600
0.1892, 0.5000, 0.2600, 0.1200
0.1506, 0.6316, 0.3600, 0.3000
0.2896, 0.6053, 0.2800, 0.3200
0.1120, 0.6316, 0.3600, 0.3000
0.0927, 0.6579, 0.3000, 0.0800
0.2664, 0.5789, 0.2800, 0.3400
0.1699, 0.4605, 0.1800, 0.0600
0.1390, 0.5000, 0.2600, 0.1800
0.2124, 0.5132, 0.2600, 0.3600
0.1660, 0.5263, 0.2400, 0.0600
0.1313, 0.4079, 0.4600, 0.2600
0.1467, 0.4605, 0.3600, 0.1400
0.0541, 0.7763, 0.3600, 0.3600
0.2896, 0.5395, 0.4800, 0.3400
0.1313, 0.5789, 0.3600, 0.4000
0.3398, 0.4868, 0.3800, 0.4000
0.1583, 0.3947, 0.3000, 0.3400
0.0965, 0.5395, 0.3400, 0.2000
0.1776, 0.7500, 0.3000, 0.4400
0.2008, 0.6184, 0.3800, 0.2600
0.1853, 0.7237, 0.3400, 0.3200
0.2625, 0.7500, 0.3600, 0.3600
0.1390, 0.6447, 0.6000, 0.3400
0.0734, 0.5263, 0.2600, 0.2800
0.1583, 0.5526, 0.3000, 0.2400
0.2317, 0.7237, 0.3600, 0.4000
0.0386, 0.3947, 0.1200, 0.0800
0.1120, 0.6316, 0.4000, 0.4100
0.2162, 0.3816, 0.2200, 0.1500
0.2239, 0.7237, 0.4200, 0.0900
0.2510, 0.5395, 0.5400, 0.4200
0.2317, 0.5132, 0.1800, 0.1900
0.2317, 0.4474, 0.5200, 0.4500
0.2934, 0.6053, 0.4200, 0.2000
0.2896, 1.0000, 0.3800, 0.4800
0.2510, 0.5132, 0.3200, 0.0800
0.1390, 0.5132, 0.3400, 0.2600
0.1583, 0.5395, 0.3000, 0.1800
0.3127, 0.5132, 0.0800, 0.3000
0.1197, 0.3684, 0.2800, 0.1400
0.2587, 0.5921, 0.3800, 0.2300
0.2664, 0.5263, 0.3800, 0.1400
0.2471, 0.6316, 0.3600, 0.2500
0.1815, 0.7105, 0.4200, 0.3200
0.2317, 0.5921, 0.3000, 0.2900
0.1351, 0.5789, 0.3800, 0.1900
0.1892, 0.4474, 0.2600, 0.2800
0.3127, 0.5263, 0.4200, 0.2800
0.0000, 0.3158, 0.3200, 0.1400
0.2008, 0.4868, 0.4000, 0.1200
0.1737, 0.6974, 0.2400, 0.3100
0.1506, 0.6184, 0.4600, 0.3000
0.1506, 0.5263, 0.3000, 0.4000
0.5560, 0.6316, 0.4000, 0.3200
0.5135, 0.7368, 0.3200, 0.3300
0.5058, 0.6184, 0.5200, 0.5000
0.5405, 0.6711, 0.1200, 0.2200
0.5367, 0.7632, 0.4600, 0.5800
0.5598, 0.6184, 0.4200, 0.4400
0.5753, 0.5526, 0.2600, 0.4000
0.5328, 0.5263, 0.3600, 0.3500
1.0000, 0.6184, 0.1800, 0.4000
0.5521, 0.7237, 0.3600, 0.3000
0.3977, 0.5526, 0.1800, 0.3600
0.4286, 0.4605, 0.3000, 0.0800
0.5637, 0.6316, 0.4600, 0.2400
0.7104, 0.6974, 0.5600, 0.2800
0.5521, 0.6184, 0.3800, 0.4000
0.3398, 0.4605, 0.3000, 0.2000
0.4015, 0.4737, 0.3000, 0.2600
0.5946, 0.4868, 0.5400, 0.4800
0.5985, 0.6842, 0.4600, 0.4600
0.5714, 0.4605, 0.4000, 0.0000
0.5444, 0.5789, 0.3000, 0.3800
0.5174, 0.5132, 0.4800, 0.3200
0.7259, 0.6316, 0.4800, 0.3900
0.6950, 0.6316, 0.3600, 0.2800
0.6834, 0.5526, 0.5200, 0.3900
0.6178, 0.4342, 0.5400, 0.2500
0.5251, 0.5526, 0.4200, 0.3600
0.4015, 0.5526, 0.3000, 0.2600
0.5058, 0.4605, 0.3800, 0.2200
0.5212, 0.8289, 0.4400, 0.3300
0.5676, 0.4474, 0.3400, 0.3800
0.5251, 0.5132, 0.4600, 0.3800
0.4402, 0.6579, 0.6000, 0.2800
0.6988, 0.7368, 0.5200, 0.4300
0.5405, 0.0132, 0.7800, 0.7200
0.6409, 0.1316, 0.7600, 0.7000
0.5560, 0.0526, 0.7600, 0.7400
0.5135, 0.1974, 0.7800, 0.8400
0.4324, 0.0395, 0.7400, 0.6800
0.3398, 0.0789, 0.8400, 0.7800
0.5174, 0.0789, 0.8400, 0.7800
0.5290, 0.1974, 0.7600, 0.6000
0.3822, 0.0526, 0.7600, 0.5800
0.5444, 0.1842, 0.7400, 0.8400
0.5019, 0.1842, 0.8600, 0.9200
0.5560, 0.1842, 0.8200, 0.6800
0.4170, 0.0000, 0.8600, 0.9200
0.6216, 0.1579, 0.7600, 0.7600
0.4131, 0.1447, 0.7400, 0.8000
0.5019, 0.1842, 0.7000, 0.9400
0.6564, 0.2237, 0.9600, 0.9800
0.4054, 0.0789, 0.8200, 0.9000
0.4595, 0.0658, 0.7200, 0.6600
0.4093, 0.0789, 0.7200, 0.5000
0.5097, 0.0789, 0.7600, 0.6400
0.4440, 0.1053, 0.9000, 0.8800
0.5174, 0.1053, 0.7600, 0.6000
0.4942, 0.0263, 0.8200, 0.9600
0.5598, 0.1447, 0.7600, 0.8600
0.5019, 0.1711, 0.7600, 0.6800
0.5560, 0.1711, 0.9000, 0.8800
0.4517, 0.1053, 0.7200, 0.6400
0.4286, 0.1842, 0.7200, 0.7000
0.5097, 0.0921, 0.7200, 0.6000
0.5251, 0.1053, 0.8400, 0.6800
0.5290, 0.1447, 0.9400, 0.8000
0.4402, 0.1447, 0.9600, 0.8000
0.6023, 0.2500, 0.8800, 0.8200
0.5560, 0.2237, 0.9000, 1.0000
0.5521, 0.2500, 0.8800, 0.8000
0.5946, 0.1447, 0.7400, 0.7600
0.5058, 0.0921, 0.8600, 0.8200
0.6564, 0.1842, 0.8000, 0.7700
0.5907, 0.1974, 0.8000, 0.8100
0.4942, 0.0921, 0.8000, 0.8200
0.4363, 0.1711, 0.9200, 0.7600
0.5946, 0.1184, 0.8000, 0.8700
0.5946, 0.2500, 0.9200, 0.9000
0.5174, 0.1842, 0.8000, 0.8200
0.4788, 0.2105, 0.8400, 0.8600
0.5714, 0.1974, 1.0000, 0.8700
0.6293, 0.1711, 0.6200, 0.7700
0.6332, 0.2500, 0.9400, 0.8600
0.5830, 0.3158, 0.8600, 0.9100
0.3707, 0.2105, 0.7600, 0.8000
0.4324, 0.1184, 0.7200, 0.7500
0.7104, 0.2763, 0.8800, 0.9200
0.4402, 0.2763, 0.8200, 0.7800
0.5444, 0.1316, 0.9000, 0.6700
0.5830, 0.0789, 0.8400, 0.8900
0.5676, 0.1579, 0.8600, 0.8600
0.5058, 0.2237, 0.8000, 1.0000

Posted in Machine Learning | Leave a comment

“Microsoft Data Scientist Weighs In on AI Hallucinations” on the Pure AI Web Site

Posted on April 4, 2024 by jamesdmccaffrey

I contributed some technical information in an article titled “Microsoft Data Scientist Weighs In on AI Hallucinations” on the March 2024 edition of the Pure AI web site. See https://pureai.com/Articles/2024/03/15/hallucinations.aspx.

An AI system hallucination is a euphemism for an incorrect result. Recent research indicates that AI systems such as OpenAI’s ChatGPT chatbot and Google’s Gemini (formerly Bard) hallucinate much more frequently than you might guess, and that in many cases, incorrect results are surprisingly difficult to detect.

I’m quoted:

“When AI natural language systems, such as OpenAI ChatGPT and Google Gemini, generate their output, at each step, the probabilities of every possible next word are computed,” McCaffrey said.

“For example, if a current output is THE QUICK BROWN … then the probability that the next word is FOX is very high, likely about 0.98, and the probabilities of other continuations, such as HELICOPTER, will be very small. The AI system usually selects the next word that has the highest probability.

However, an internal setting called the temperature, controls the creativity of the AI output. In ChatGPT, a high temperature value makes the AI system more creative by selecting a continuation with a lower probability. This can lead to factual errors.

The problems arise when an AI system is trying to generate factual information, usually in the form of results that contain numbers. Furthermore, when an AI system is asked to provide references for its responses, it’s not uncommon for over 50 percent of the references to be inaccurate, with incorrect authors or even references that are complete fabrications.”

Google’s Gemini Images tool took AI hallucinations to a hilarious level. Left: The Founding Fathers apparently were led by a Black woman. I had no idea. Right: And according to Google, George Washington was Black. My history education must have been sorely lacking for me not to know this.

I know enough about AI image generation to know for sure that Google’s wokeness was deliberate, in the sense that the system was explicitly instructed to include Black people regardless of reality. Interestingly, some Google AI images generated by the prompt “inner-city criminals” contained only White and Asian thugs, and no Black people at all. Again, the system must have been explicitly instructed to respond that way — the results could not have happened without explicit instructions. The “criminal” images were quickly expunged from all Google search results. Shades of Orwell.

Posted in Machine Learning | Leave a comment

Data Clustering from Scratch Using a Self-Organizing Map (SOM) with JavaScript

Posted on April 3, 2024 by jamesdmccaffrey

Data clustering is a classical machine learning technique to group similar data items together. By far the most common clustering technique is k-means clustering. Three other more-or-less common clustering techniques include DBSCAN (“density based spatial clustering of applications with noise”), Gaussian mixture model clustering, and Spectral clustering.

A fifth clustering technique is self-organizing map (SOM) clustering. I recently implemented an SOM clustering function from scratch using C#. And then a few days later I refactored the C# implementation to Python.

Just for fun, I decided to refactor my C#/Python implementation of SOM clustering to JavaScript. (I’m one of the few people I know who likes coding with raw JavaScript).

Compared to other techniques, SOM clustering produces clustering where the clusters are constructed so that clusters with similar IDs are closer to each other than clusters with dissimilar cluster IDs. For example, suppose you set k = 5 and so each data item falls into one of clusters 0, 1, 2, 3, 4. Data items in clusters 0 and 1 are relatively closer to each other than data items in clusters 0 and 3.

Self-organizing maps can be used for data explorations other than clustering. A standard, non-clustering SOM creates a square matrix, such as 4-by-4, called the map. Each cell of the SOM map holds a vector that is representative of a cluster. But for a clustering SOM, even though you could create a square matrix map, it makes more sense to create a 1-by-k matrix/vector map.

For my demo program, I used a small 12-item subset of the Penguin dataset. The data is:

0, 39.5, 17.4, 186, 3800
0, 40.3, 18.0, 195, 3250
0, 36.7, 19.3, 193, 3450
0, 38.9, 17.8, 181, 3625
1, 46.5, 17.9, 192, 3500
1, 45.4, 18.7, 188, 3525
1, 45.2, 17.8, 198, 3950
1, 46.1, 18.2, 178, 3250
2, 46.1, 13.2, 211, 4500
2, 48.7, 14.1, 210, 4450
2, 46.5, 13.5, 210, 4550
2, 45.4, 14.6, 211, 4800

Each line represents a penguin. The columns are species (0 = Adelie, 1 = Chinstrap, 2 = Gentoo), bill length, bill depth, flipper length, body mass. The full Penguin dataset has 345 items (333 items that have no missing values).

The demo does not use the species labels, but notice that the data is artificially organized so that it’s already clustered.

The key calling statements in my demo are:

  // 1. load data
  let fn = ".\\Data\\penguin_12.txt";
  let X = loadTxt(fn, ",", [1,2,3,4], "#");
  
  // 2. standardize data
  let stdX = ClusterSOM.matStandard(X);
  
  // 3. create ClusterSOM object and cluster
  let k = 3;
  let lrnRateMax = 0.50;
  let stepsMax = 1000;
  let seed = 9;
  som = new ClusterSOM(stdX, k, seed);
  som.cluster(lrnRateMax, stepsMax);

  // 4. (show some internal data structures)

  // 5. show clustering result
  console.log("\nclustering: ");
  let clustering = som.getClustering();
  vecShow(clustering, 0, 3);

The clustering result is:

  2  2  2  2  1  1  1  1  0  0  0  0

Data items in clusters 2 and 1 are close. Data items in clusters 1 and 0 are close. Data items in clusters 2 and 0 are relatively farther apart.

SOM clustering is an iterative process and requires a maximum learning rate (the learning rate is decreased during SOM map construction) and a maximum number of iteration steps.

The source data should be normalized/standardized so that columns with large magnitudes, like body mass, don’t overwhelm columns with small magnitudes, like bill length. The standardized data looks like:

[  0]  -1.1657     0.3300    -0.8774    -0.1661
[  1]  -0.9475     0.6163    -0.0943    -1.2103
[  2]  -1.9291     1.2366    -0.2683    -0.8306
. . .

The penguin species labels are not used.

The demo 1-by-3 SOM map has three cells, one for each cluster. Each map cell has a representative vector that you can think of as a central vector of the data items assigned to the cluster. For the demo data, the constructed SOM map is:

k = 0:    0.7293   -1.3665    1.2570    1.3363
k = 1:    0.5848    0.6721   -0.7431   -0.7339
k = 2:   -1.3175    0.6553   -0.6516   -0.6548

Each data item is assigned to the cluster that has the closest map node:

k = 0:  8 9 10 11
k = 1:  4 5 6 7
k = 2:  0 1 2 3

This means data items 8, 9, 10, 11 are assigned to cluster 0 because those items are closest to the map node 0 vector: (0.7293, -1.3665, 1.2570, 1.3363).

The demo program iterates 1000 steps. This was determined by preliminary trial and error when the creation of the SOM map was monitored by computing the sum of the Euclidean distances (SED) within each cluster. When the SED stopped decreasing significantly at about step 1000, that’s a sign that the map build steps can stop.

The demo program computes the distance between each map node vector:

[ 0]  0.00    3.53    3.99
[ 1]  3.53    0.00    1.91
[ 2]  3.99    1.91    0.00

The largest distance is between cluster 0 and cluster 2 (3.99), as expected.

I am a big fan of early science fiction movies. I noticed that a cluster of four films from the 1950s all appear to use the same model spaceship. I couldn’t find any solid information, but these screenshots are convincing.

Left: “Flight to Mars” (1951). A crew finds that Mars is inhabited. Some Martians want to invade Earth but some Martians help the crew escape back to Earth.

Center Left: “World Without End” (1956). A crew from Earth orbits Mars in preparation for a later landing. On the return trip they time warp to the year 2508. In a post nuclear war world, they find peaceful remnants of civilization and mutants.

Center Right: “It! The Terror fromBeyond Space” (1958). A rescue mission to Mars finds only one survivor of the first landing there. On the return trip to Earth, a vicious alien animal sneaks on board and menaces the crew. This movie is the direct inspiration for “Alien” (1979).

Right: “The Queen of Outer Space” (1958). A crew from Earth lands on Venus and finds a planet ruled by women and an evil queen. The Earth men overthrow the queen and establish friendly relations with the Venusian women.

Demo program. Replace “lt” (less-than), “gt”, “lte”, “gte”, “and” with Boolean operator symbols.

// cluster_som.js
// self-organizing map (SOM) clustering

let FS = require('fs');  // to read data file

// ----------------------------------------------------------

class ClusterSOM
{
  constructor(X, k, seed)
  {
    let nRows = 1;
    let nCols = k;
    let dim = X[0].length;

    this.k = k;
    this.data = X;
    this.seed = seed + 0.5;  // avoid 0
    this.map = this.makeMap(1, k, dim);
    for (let i = 0; i "lt" nRows; ++i) {
      for (let j = 0; j "lt" nCols; ++j) {
        for (let k = 0; k "lt" dim; ++k) {
          this.map[i][j][k] = this.next();  // random
        }
      }
    }
    this.mapping = this.makeMap(1, k, 1); // list
  }

  // --------------------------------------------------------
  // methods: cluster(), getBetweenNodeDists(),
  //  getClustering()
  // --------------------------------------------------------

  cluster(lrnRateMax, stepsMax) 
  {
    let n = this.data.length;
    let dim = this.data[0].length;
    let nRows = 1;
    let nCols = this.k;
    let rangeMax = nRows + nCols;

    for (let step = 0; step "lt" stepsMax; ++step) {

      if (step % Math.trunc(stepsMax / 5) == 0) {
        process.stdout.write("map build step = ");
        process.stdout.write(step.toString().
          padStart(4, ' '));
        let sum = 0.0;
        for (let ix = 0; ix "lt" n; ++ix) {
          let RC = this.closestNode(ix);
          let kk = RC[1];
          let item = this.data[ix];
          let node = this.map[0][kk];
          let dist = this.eucDistance(item, node);
          sum += dist;
        }
        console.log("  |  SED = " + sum.toFixed(4).toString().
          padStart(9, ' '));
      } // show progress

      let pctLeft = 1.0 - ((step * 1.0) / stepsMax);
      let currRange = (pctLeft * rangeMax);
      let currLrnRate = pctLeft * lrnRateMax;

      // pick a random index
      let idx = this.nextInt(0, n);
      let bmuRC = this.closestNode(idx);
      // move each map node
      for (let i = 0; i "lt" nRows; ++i) {
        for (let j = 0; j "lt" nCols; ++j) {
          if (this.manhattDist(bmuRC[0],
            bmuRC[1], i, j) "lte" currRange) {
            for (let d = 0; d "lt" dim; ++d) {
              this.map[i][j][d] = this.map[i][j][d] +
                currLrnRate * (this.data[idx][d] -
                  this.map[i][j][d]);
            } // d
          } // if
        } // j
      } // i

    } // step
    // map has been created

    // compute mapping
    for (let idx = 0; idx "lt" n; ++idx) {
      let rc = this.closestNode(idx);
      let r = rc[0]; let c = rc[1];
      this.mapping[r][c].push(idx);
    }
    // mapping has dummy 0.0 first values
    // remove them
    for (let i = 0; i "lt" nRows; ++i) {
      for (let j = 0; j "lt" nCols; ++j) {
        if (this.mapping[i][j].length "gte" 2)
          this.mapping[i][j].shift();  // remove first
      }
    }

    return;
  } // cluster()

  // --------------------------------------------------------

  getBetweenNodeDists()
  {
    // between map node distances
    let result = this.matMake(this.k, this.k, 0.0);
    for (let i = 0; i "lt" this.k; ++i) {
      for (let j = i; j "lt" this.k; ++j) {
        let dist = this.eucDistance(this.map[0][i],
          this.map[0][j]);
        result[i][j] = dist;
        result[j][i] = dist;
      }
    }
    return result;
  }

  // --------------------------------------------------------

  getClustering()
  {
    // cluster ID for every data item
    let n = this.data.length;
    let result = this.vecMake(n, 0.0); // ID each item
    for (let kk = 0; kk "lt" this.k; ++kk) {
      for (let i = 0; i "lt" this.mapping[0][kk].length;
        ++i) {
        let dataIdx = this.mapping[0][kk][i];
        result[dataIdx] = kk;
      }
    }
    return result;
  }

  // --------------------------------------------------------
  // helper functions: makeMap(), matMake(), vecMake(),
  //  closestNode(), eucDistance(), manhattDist(), next(),
  //  nextInt(), matStandard()
  // --------------------------------------------------------

  makeMap(d1, d2, d3)
  {
    let result = [];
    for (let i = 0; i "lt" d1; ++i) {
      result[i] = [];
      for (let j = 0; j "lt" d2; ++j) {
        result[i][j] = [];
        for (let k = 0; k "lt" d3; ++k) {
          result[i][j][k] = 0.0;
        }
      }
    }
    return result;
  }

  // --------------------------------------------------------

  matMake(rows, cols, val)
  {
    let result = [];
    for (let i = 0; i "lt" rows; ++i) {
      result[i] = [];
      for (let j = 0; j "lt" cols; ++j) {
        result[i][j] = val;
      }
    }
    return result;
  }

  // --------------------------------------------------------

  vecMake(n, val)
  {
    let result = [];
    for (let i = 0; i "lt" n; ++i) {
      result[i] = val;
    }
    return result;
  }

  // --------------------------------------------------------

  closestNode(idx)
  {
    // r,c of map vec closest to data[idx]
    let smallDist = this.eucDistance(this.data[0],
      this.map[0][0]);
    let result = this.vecMake(2, 0);
    for (let i = 0; i "lt" this.map.length; ++i) {
      for (let j = 0; j "lt" this.map[0].length; ++j) {
        let dist = this.eucDistance(this.data[idx],
          this.map[i][j]);
        if (dist "lt" smallDist) {
          smallDist = dist;
          result[0] = i;
          result[1] = j;
        }
      }
    }
    return result;
  } // closestNode()

  // --------------------------------------------------------

  eucDistance(v1, v2)
  {
    let dim = v1.length;
    let sum = 0.0;
    for (let i = 0; i "lt" dim; ++i)
      sum += (v1[i] - v2[i]) * (v1[i] - v2[i]);
    return Math.sqrt(sum);    
  }

  // --------------------------------------------------------

  manhattDist(r1, c1, r2, c2)
  {
    return Math.abs(r1 - r2) + Math.abs(c1 - c2);
  }

  // --------------------------------------------------------

  next() // next double
  {
    let x = Math.sin(this.seed) * 1000;
    let result = x - Math.floor(x);  // [0.0,1.0)
    this.seed = result;  // for next call
    return result;
  }

  // --------------------------------------------------------

  nextInt(lo, hi)
  {
    let x = this.next();
    return Math.trunc((hi - lo) * x + lo);
  }

  // --------------------------------------------------------

  static matStandard(data)  // static so can use in main()
  {
    // scikit style z-score biased normalization
    let nRows = data.length;
    let nCols = data[0].length;
    let result = [];
    for (let i = 0; i "lt" nRows; ++i) {
      result[i] = [];
      for (let j = 0; j "lt" nCols; ++j) {
        result[i][j] = 0.0;
      }
    }

    // compute column means
    let means = [];
    for (let j = 0; j "lt" nCols; ++j)
      means[j] = 0.0;
    for (let j = 0; j "lt" nCols; ++j) {
      let sum = 0.0;
      for (let i = 0; i "lt" nRows; ++i) {
        sum += data[i][j];
      }
      means[j] = sum / nRows;
    }

    // compute (biased) stds
    let sds = [];
    for (let j = 0; j "lt" nCols; ++j)
      sds[j] = 0.0;
    for (let j = 0; j "lt" nCols; ++j) {
      let sum = 0.0;
      for (let i = 0; i "lt" nRows; ++i) {
        sum += (data[i][j] - means[j]) * 
          (data[i][j] - means[j]);
      }
      sds[j] = Math.sqrt(sum / nRows);  // biased
    } // j

    // normalize
    for (let j = 0; j "lt" nCols; ++j) {
      for (let i = 0; i "lt" nRows; ++i) {
        result[i][j] = 
          (data[i][j] - means[j]) / sds[j];
      }
    } // j

    return result;

  } // matStandard()

} // class ClusterSOM


// ----------------------------------------------------------
// helpers for main(): loadTxt(), matShow(), vecShow()
// ----------------------------------------------------------

function loadTxt(fn, delimit, usecols, comment) 
{
  // efficient but mildly complicated
  let all = FS.readFileSync(fn, "utf8");  // giant string
  all = all.trim();  // strip final crlf in file
  let lines = all.split("\n");  // array of lines

  // count number non-comment lines
  let nRows = 0;
  for (let i = 0; i "lt" lines.length; ++i) {
    if (!lines[i].startsWith(comment))
      ++nRows;
  }
  nCols = usecols.length;
  let result = [];
  for (let i = 0; i "lt" nRows; ++i) {
    result[i] = [];
    for (let j = 0; j "lt" nCols; ++j) {
      result[i][j] = 0.0;
    }
  }
  
  let r = 0;  // into lines
  let i = 0;  // into result[][]
  while (r "lt" lines.length) {
    if (lines[r].startsWith(comment)) {
      ++r;  // next row
    }
    else {
      let tokens = lines[r].split(delimit);
      for (let j = 0; j "lt" nCols; ++j) {
        result[i][j] = parseFloat(tokens[usecols[j]]);
      }
      ++r;
      ++i;
    }
  }

  return result;
} // loadTxt()

// ----------------------------------------------------------

function matShow(m, dec, wid, showIndices)
{
  let rows = m.length;
  let cols = m[0].length;
  for (let i = 0; i "lt" rows; ++i) {
    if (showIndices == true)
      process.stdout.write("[" + i.toString().
        padStart(3, ' ') + "]");
    for (let j = 0; j "lt" cols; ++j) {
      let v = m[i][j];
      if (Math.abs(v) "lt" 0.000001) v = 0.0  // avoid -0
      let vv = v.toFixed(dec);
      let s = vv.toString().padStart(wid, ' ');
      process.stdout.write(s);
      process.stdout.write("  ");
    }
    process.stdout.write("\n");
  }
}

// ----------------------------------------------------------

function vecShow(vec, dec, wid)
{
  for (let i = 0; i "lt" vec.length; ++i) {
    let x = vec[i].toFixed(dec);
    let s = x.toString().padStart(wid, ' ');
    process.stdout.write(s);
    process.stdout.write(" ");
  }
  process.stdout.write("\n");
}

// ----------------------------------------------------------

function main()
{
  console.log("\nBegin self-organizing" +
    " map (SOM) clustering using JavaScript ");

  // 1. load data
  console.log("\nLoading 12-item Penguin subset ");
  let fn = ".\\Data\\penguin_12.txt";
  let X = loadTxt(fn, ",", [1,2,3,4], "#");
  console.log("\nX: ");
  matShow(X, 1, 8, true);

  // 2. standardize data
  let stdX = ClusterSOM.matStandard(X);
  console.log("\nStandardized data: ");
  matShow(stdX, 4, 9, true);

  // 3. create ClusterSOM object and cluster
  let k = 3;
  let lrnRateMax = 0.50;
  let stepsMax = 1000;
  console.log("\nSetting num clusters k = " + 
    k.toString());
  console.log("Setting  lrnRateMax = " + 
    lrnRateMax.toFixed(3).toString());
  console.log("Setting stepsMax = " + 
    stepsMax.toString());

  console.log("\nComputing SOM clustering ");
  let seed = 9;
  som = new ClusterSOM(stdX, k, seed);
  som.cluster(lrnRateMax, stepsMax);
  console.log("Done ");

  // 4. show the SOM
  console.log("\nSOM map nodes: ");
  for (let kk = 0; kk "lt" k; ++kk) {
    process.stdout.write("k = " + 
      kk.toString() + ": ");
    vecShow(som.map[0][kk], 4, 9);
  }

  console.log("\nSOM mapping: ");
  for (let kk = 0; kk "lt" k; ++kk) {
    process.stdout.write("k = " + kk + ":   ");
      vecShow(som.mapping[0][kk], 0, 3);
  }

  let betweenDists = som.getBetweenNodeDists();
  console.log("\nBetween map node distances: ");
  matShow(betweenDists, 2, 6, true);

  // 5. show clustering result
  console.log("\nclustering: ");
  let clustering = som.getClustering();
  vecShow(clustering, 0, 3);
 
  console.log("\nEnd SOM clustering demo");
}

main()

Posted in JavaScript, Machine Learning | 1 Comment

A Lightweight Five-Card Poker Library Using Python

Posted on April 2, 2024 by jamesdmccaffrey

One morning before work, I decided to implement a lightweight five-card poker library using Python. My library has a Card class, a Hand class, and a SingleDeck class. The three main functions are: 1.) classify a hand (like “FullHouse”), 2.) compare two hands to determine which hand is better, 3.) deal a hand from a deck of 52 cards.

I didn’t implement my poker library starting from nothing — I refactored my existing C# poker library.

There are two ways to create a Card object:

  c1 = Card((14,3))  # Ace of spades
  print(c1.to_string())

  c2 = Card("Td")  # Ten of diamonds
  print(c2.to_string())

The first constructor accepts a (rank, suit) tuple. The rank values are 2 = Two, 3 = Three, . . 10 = Ten, 11 = Jack, 12 = Queen, 13 = King, 14 = Ace. Rank values of 0 and 1 are not used. The suit values are 0 = clubs, 1 = diamonds, 2 = hearts, 3 = spades.

There are three main ways to create a Hand object:

  h1 = Hand("7cTsJc8d9h")
  print(h1.to_string())

  h2 = Hand(Card("6s"), Card("Ah"), Card("6h"),
    Card("Ac"), Card("6d"))
  print(h2.to_string()) 

  lst = []
  lst.append(Card("5c")); lst.append(Card("5d"))
  lst.append(Card("9c")); lst.append(Card("9d"))
  lst.append(Card("Qh"))
  h3 = Hand(lst)
  print(h3.to_string())

The first constuctor accepts an easy-to-interpret string such as “7cTsJc8d9h”. The second constructor accepts five individual Card objects. The third constructor accepts a List of five Card objects.

Hand objects are sorted from low card (“2c”) to high card (“As”). The sorting makes a hand easier to interpret, and easier to classify and compare.

There are two methods to classify a Hand object. The get_handtype_str() method returns one of ten strings: “HighCard”, “OnePair”, “TwoPair” , “ThreeKind” , “Straight”, “Flush” , “FullHouse”, “FourKind”, “StraightFlush”, “RoyalFlush”. The get_handtype_int() method returns integer 0 (high card) through 9 (royal flush).

  print(h1.get_handtype_str() + " = ", end="")  # Straight
  print(h1.get_handtype_int())  # 4

  print(h2.get_handtype_str() + " = ", end="")  # FullHouse
  print(h2.get_handtype_int())  # 6

  print(h3.get_handtype_str() + " = ", end="")  # TwoPair
  print(h3.get_handtype_int())  # 2

There is a static Hand.compare(h1, h2) function. It returns -1 if h1 is less than h2, returns +1 if h1 is greater than h2, returns 0 if h1 equals h2.

  cmp = Hand.compare(h1, h2)  # -1 Straight less FullHouse
  cmp = Hand.compare(h2, h3)  # 1 FullHouse greater TwoPair

The SingleDeck class has a deal_hand() method and a deal_list_cards() method. The deal_hand() method returns a Hand object containing five Card objects. The deal_list_cards() method return a List of n Card objects.

  d1 = SingleDeck(seed=0)
  d1.shuffle()
  d1.show()

  h4 = d1.deal_hand()
  print(h4.to_string())

  list_cards = d1.deal_list_cards(38) # remove 38 cards
  d1.show()  # 9 cards left in deck

The poker library can be used in several ways. You can compute the probabilities of different hands using a simulation. You can find the best five-card hand from seven cards. And so on. I’ll post some examples at some point.

I love old mechanical and electro-mechanical coin operated devices. Left: This Sittman-Pitt machine from 1891 is arguably the world’s first slot machine. It didn’t have automatic payout, so most resources state that the “Liberty Bell” machine from 1895, which did have automatic payout, by the Chas Fey company, was the first true slot machine. Right: The “Pok-O-Reel” machine was manufactured about 1932 by the Groetchen Tool and Manufacturing Company.

Demo code. Replace “lt” (less-than), “gt”, “lte”, “gte” with Boolean operator symbols.

# poker.py

# a library of five-card poker functions
# Card, Hand, Deck classes

import numpy as np

# -----------------------------------------------------------

class Card:
  def __init__(self, *args):
    # like string "Jc" or tuple (11,3)
    # rank: 2=Two, 3=Three . . 14=Ace
    # suit: 0=clubs, 1=diamonds, 2=hearts, 3=spades

    if isinstance(args[0], str):
      rnk = args[0][0]; sut = args[0][1]
      if rnk == 'T': self.rank = 10
      elif rnk == 'J': self.rank = 11
      elif rnk == 'Q': self.rank = 12
      elif rnk == 'K': self.rank = 13
      elif rnk == 'A': self.rank = 14
      else: self.rank = int(rnk) 

      if sut == 'c': self.suit = 0
      elif sut == 'd': self.suit = 1
      elif sut == 'h': self.suit = 2
      elif sut == 's': self.suit = 3
    elif isinstance(args[0], tuple):
      self.rank = int(args[0][0])
      self.suit = int(args[0][1])

  def to_string(self):
    rnk = ""; sut = ""
    if self.rank == 10: rnk = "T"
    elif self.rank == 11: rnk = "J"
    elif self.rank == 12: rnk = "Q"
    elif self.rank == 13: rnk = "K"
    elif self.rank == 14: rnk = "A"
    else: rnk = str(self.rank)

    if self.suit == 0: sut = "c"
    elif self.suit == 1: sut = "d"
    elif self.suit == 2: sut = "h"
    elif self.suit == 3: sut = "s"

    return rnk + sut

  # ---------------------------------------------------------
# ----- end Card --------------------------------------------

class Hand:
  # 5-card poker hand. sorted List of Card objects
  # Hand types: "RoyalFlush", "StraightFlush",
  # "FourKind", "FullHouse", "Flush" , "Straight",
  # "ThreeKind", "TwoPair", "OnePair", "HighCard"

  def __init__(self, *args):
    self.cards = []
    if isinstance(args[0], list):
      for i in range(5):
        # c = Card(args[0][i])
        self.cards.append(args[0][i])
    elif len(args) == 5:
      self.cards.append(args[0])
      self.cards.append(args[1])
      self.cards.append(args[2])
      self.cards.append(args[3])
      self.cards.append(args[4])
    elif isinstance(args[0], str):
      for i in range(0,10,2):
        cs = args[0][i:i+2]
        c = Card(cs)
        self.cards.append(c)

    self.cards.sort(key=lambda x: (x.rank, x.suit))
  
  def to_string(self):
    s = ""
    for i in range(5):
      s += self.cards[i].to_string()
    return s

  # ---------------------------------------------------------

  # Hand Type methods:
  # get_handtype_str(), get_handtype_int() call:
  # 
  # is_royal_flush(), is_straight_flush(), 
  # is_four_kind(), is_full_house(), is_flush(),
  # is_straight(), is_three_kind(), is_two_pair(),
  # is_one_pair(), is_high_card()
  #
  # Helpers: has_straight(), has_flush()

  # ---------------------------------------------------------

  def get_handtype_str(self):
    if self.is_royal_flush() == True:
      return "RoyalFlush"
    elif self.is_straight_flush() == True:
      return "StraightFlush"
    elif self.is_four_kind() == True:
      return "FourKind"
    elif self.is_full_house() == True:
      return "FullHouse"
    elif self.is_flush() == True:
      return "Flush"
    elif self.is_straight() == True:
      return "Straight"
    elif self.is_three_kind() == True:
      return "ThreeKind"
    elif self.is_two_pair() == True:
      return "TwoPair"
    elif self.is_one_pair() == True:
      return "OnePair"
    elif self.is_high_card() == True:
      return "HighCard"
    else:
      return "Unknown"

  # ---------------------------------------------------------

  def get_handtype_int(self):
    if self.is_royal_flush() == True:
      return 9
    elif self.is_straight_flush() == True:
      return 8
    elif self.is_four_kind() == True:
      return 7
    elif self.is_full_house() == True:
      return 6
    elif self.is_flush() == True:
      return 5
    elif self.is_straight() == True:
      return 4
    elif self.is_three_kind() == True:
      return 3
    elif self.is_two_pair() == True:
      return 2
    elif self.is_one_pair() == True:
      return 1
    elif self.is_high_card() == True:
      return 0
    else:
      return -1

  # ---------------------------------------------------------

  def has_flush(self):
    if (self.cards[0].suit == self.cards[1].suit) and \
       (self.cards[1].suit == self.cards[2].suit) and \
       (self.cards[2].suit == self.cards[3].suit) and \
       (self.cards[3].suit == self.cards[4].suit):
      return True

    return False

  # ---------------------------------------------------------

  def has_straight(self):
    # check special case of Ace-low straight
    # 2, 3, 4, 5, A when sorted
    if self.cards[0].rank == 2 and \
      self.cards[1].rank == 3 and \
      self.cards[2].rank == 4 and \
      self.cards[3].rank == 5 and \
      self.cards[4].rank == 14:
        return True
    # otherwise, check for 5 consecutive card ranks
    if (self.cards[0].rank == self.cards[1].rank - 1) and \
      (self.cards[1].rank == self.cards[2].rank - 1) and \
      (self.cards[2].rank == self.cards[3].rank - 1) and \
      (self.cards[3].rank == self.cards[4].rank - 1):
      return True

    return False

  # ---------------------------------------------------------

  def is_royal_flush(self):
    if self.has_straight() == True and \
      self.has_flush() == True and \
        self.cards[0].rank == 10:  # assumes hand is sorted
      return True
    else:
      return False

  # ---------------------------------------------------------

  def is_straight_flush(self):
    if self.has_straight() == True and \
      self.has_flush() == True and \
        self.cards[0].rank != 10:  # no Royal Flush
      return True
    else:
      return False

  # ---------------------------------------------------------

  def is_four_kind(self):
    # AAAA B or B AAAA if sorted
    if (self.cards[0].rank == self.cards[1].rank) and \
      (self.cards[1].rank == self.cards[2].rank) and \
      (self.cards[2].rank == self.cards[3].rank) and \
      (self.cards[3].rank != self.cards[4].rank):
      return True

    if (self.cards[1].rank == self.cards[2].rank) and \
      (self.cards[2].rank == self.cards[3].rank) and \
      (self.cards[3].rank == self.cards[4].rank) and \
      (self.cards[0].rank != self.cards[1].rank):
      return True

    return False

  # ---------------------------------------------------------

  def is_full_house(self):
    # AAA BB or BB AAA if sorted
    if (self.cards[0].rank == self.cards[1].rank) and \
      (self.cards[1].rank == self.cards[2].rank) and \
      (self.cards[3].rank == self.cards[4].rank) and \
      (self.cards[2].rank != self.cards[3].rank):
      return True

    # BB AAA
    if (self.cards[0].rank == self.cards[1].rank) and \
      (self.cards[2].rank == self.cards[3].rank) and \
      (self.cards[3].rank == self.cards[4].rank) and \
      (self.cards[1].rank != self.cards[2].rank):
      return True

    return False

  # ---------------------------------------------------------

  def is_flush(self):
    if self.has_flush() == True and \
      self.has_straight() == False:
      return True  # no StraightFlush or RoyalFlush
    else:
      return False

  # ---------------------------------------------------------

  def is_straight(self):
    if self.has_straight() == True and \
      self.has_flush() == False:
      return True
    else:
      return False

  # ---------------------------------------------------------

  def is_three_kind(self):
    # AAA B C or B AAA C or B C AAA if sorted
    if (self.cards[0].rank == self.cards[1].rank) and \
      (self.cards[1].rank == self.cards[2].rank) and \
      (self.cards[2].rank != self.cards[3].rank) and \
      (self.cards[3].rank != self.cards[4].rank):
      return True

    if (self.cards[1].rank == self.cards[2].rank) and \
      (self.cards[2].rank == self.cards[3].rank) and \
      (self.cards[0].rank != self.cards[1].rank) and \
      (self.cards[3].rank != self.cards[4].rank):
      return True

    if (self.cards[2].rank == self.cards[3].rank) and \
      (self.cards[3].rank == self.cards[4].rank) and \
      (self.cards[0].rank != self.cards[1].rank) and \
      (self.cards[1].rank != self.cards[2].rank):
      return True

    return False

  # ---------------------------------------------------------

  def is_two_pair(self):
    # AA BB C or AA C BB or C AA BB if sorted
    if (self.cards[0].rank == self.cards[1].rank) and \
      (self.cards[2].rank == self.cards[3].rank) and \
      (self.cards[1].rank != self.cards[2].rank) and \
      (self.cards[3].rank != self.cards[4].rank):
      return True  # AA BB C

    if (self.cards[0].rank == self.cards[1].rank) and \
      (self.cards[3].rank == self.cards[4].rank) and \
      (self.cards[1].rank != self.cards[2].rank) and \
      (self.cards[2].rank != self.cards[3].rank):
      return True  # AA C BB

    if (self.cards[1].rank == self.cards[2].rank) and \
      (self.cards[3].rank == self.cards[4].rank) and \
      (self.cards[0].rank != self.cards[1].rank) and \
      (self.cards[2].rank != self.cards[3].rank):
      return True  # C AA BB

    return False

  # ---------------------------------------------------------

  def is_one_pair(self):
    # AA B C D or B AA C D or B C AA D or B C D AA
    if (self.cards[0].rank == self.cards[1].rank) and \
      (self.cards[1].rank != self.cards[2].rank) and \
      (self.cards[2].rank != self.cards[3].rank) and \
      (self.cards[3].rank != self.cards[4].rank):
      return True  # AA B C D

    if (self.cards[1].rank == self.cards[2].rank) and \
      (self.cards[0].rank != self.cards[1].rank) and \
      (self.cards[2].rank != self.cards[3].rank) and \
      (self.cards[3].rank != self.cards[4].rank):
      return True  # B AA C D

    if (self.cards[2].rank == self.cards[3].rank) and \
      (self.cards[0].rank != self.cards[1].rank) and \
      (self.cards[1].rank != self.cards[2].rank) and \
      (self.cards[3].rank != self.cards[4].rank):
      return True  # B C AA D

    if (self.cards[3].rank == self.cards[4].rank) and \
      (self.cards[0].rank != self.cards[1].rank) and \
      (self.cards[1].rank != self.cards[2].rank) and \
      (self.cards[2].rank != self.cards[3].rank):
      return True  # B C D AA

    return False

  # ---------------------------------------------------------

  def is_high_card(self):
    if self.has_flush() == True:
      return False
    elif self.has_straight() == True:
      return False
    else:
      # all remaining have at least one pair
      if (self.cards[0].rank == self.cards[1].rank) or \
        (self.cards[1].rank == self.cards[2].rank) or \
        (self.cards[2].rank == self.cards[3].rank) or \
        (self.cards[3].rank == self.cards[4].rank):
        return False

    return True

  # ---------------------------------------------------------

  # Hand comparison methods
  # Hand.compare() calls:
  # break_tie_straight_flush(), break_tie_four_kind(),
  # break_tie_full_house(), break_tie_flush(),
  # break_tie_straight(), break_tie_three_kind(),
  # break_tie_two_pair(), break_tie_one_pair(),
  # break_tie_high_card()

  # ---------------------------------------------------------

  @staticmethod
  def compare(h1, h2):
    # -1 if h1 "lt" h2, +1 if h1 "gt" h2, 0 if h1 == h2

    h1_idx = h1.get_handtype_int()  # like 6
    h2_idx = h2.get_handtype_int()

    # different hand types - easy
    if h1_idx "lt" h2_idx:
      return -1
    elif h1_idx "gt" h2_idx:
      return 1
    else: # same hand types so break tie
      h1_handtype = h1.get_handtype_str()
      h2_handtype = h2.get_handtype_str()

      if h1_handtype != h2_handtype:
        print("Fatal Logic Error in compare()")

      if h1_handtype == "RoyalFlush":
        return 0  # two Royal Flush always tie
      elif h1_handtype == "StraightFlush":
        return Hand.break_tie_straight_flush(h1, h2)
      elif h1_handtype == "FourKind":
        return Hand.break_tie_four_kind(h1, h2)
      elif h1_handtype == "FullHouse":
        return Hand.break_tie_full_house(h1, h2)
      elif h1_handtype == "Flush":
        return Hand.break_tie_flush(h1, h2)
      elif h1_handtype == "Straight":
        return Hand.break_tie_straight(h1, h2)
      elif h1_handtype == "ThreeKind":
        return Hand.break_tie_three_kind(h1, h2)
      elif h1_handtype == "TwoPair":
        return Hand.break_tie_two_pair(h1, h2)
      elif h1_handtype == "OnePair":
        return Hand.break_tie_one_pair(h1, h2)
      elif h1_handtype == "HighCard":
        return Hand.break_tie_high_card(h1, h2)

    return -2  # error

  # ---------------------------------------------------------

  @staticmethod
  def break_tie_straight_flush(h1, h2):
    # check one or two Ace-low hands
    # h1 is Ace-low, h2 not Ace-low.h1, is less
    if (h1.cards[0].rank == 2 and \
        h1.cards[4].rank == 14) and \
       not (h2.cards[0].rank == 2 and
       h2.cards[4].rank == 14):
      return -1
 
    # h1 not Ace-low, h2 is Ace-low, h1 is better
    if not (h1.cards[0].rank == 2 and \
         h1.cards[4].rank == 14) and \
       (h2.cards[0].rank == 2 and \
        h2.cards[4].rank == 14):
      return 1
  
    # two Ace-low hands
    if (h1.cards[0].rank == 2 and \
        h1.cards[4].rank == 14) and \
       (h2.cards[0].rank == 2 and \
        h2.cards[4].rank == 14):
      return 0

    # no Ace-low straight flush so check high cards
    if h1.cards[4].rank "lt" h2.cards[4].rank:
      return -1
    elif h1.cards[4].rank "gt" h2.cards[4].rank:
      return 1
    else:
      return 0
  
  # ---------------------------------------------------------

  @staticmethod
  def break_tie_four_kind(h1, h2):
    # AAAA B or B AAAA
    # the off-card is at [0] or at [4]
    # find h1 four-card and off-card ranks
    h1_four_rank = -1; h1_off_rank = -1
    if h1.cards[0].rank == h1.cards[1].rank:
      # 1st two cards same so off-rank at [4]
      h1_four_rank = h1.cards[0].rank
      h1_off_rank = h1.cards[4].rank
    else:
      # 1st two cards diff so off-rank at [0]
      h1_four_rank = h1.cards[4].rank
      h1_off_rank = h1.cards[0].rank

    h2_four_rank = -1; h2_off_rank = -1
    if h2.cards[0].rank == h2.cards[1].rank:
      h2_four_rank = h2.cards[0].rank
      h2_off_rank = h2.cards[4].rank
    else:
      h2_four_rank = h2.cards[4].rank
      h2_off_rank = h2.cards[0].rank

    if h1_four_rank "lt" h2_four_rank: # like 4K, 4A
      return -1
    elif h1_four_rank "gt" h2_four_rank:
      return 1
    else: # both hands have same four-kind (mult. decks)
      if h1_off_rank "lt" h2_off_rank:
        return -1  # like 3c 9c9d9h9s "lt" Qd 9c9d9h9s
      elif h1_off_rank "gt" h2_off_rank:
        return 1  # like Jc 4c4d4h4s "gt" 9s 4c4d4h4s
      elif h1_off_rank == h2_off_rank:
        return 0

    print("Fatal logic break_tie_four_kind")

  # ---------------------------------------------------------
  
  @staticmethod
  def break_tie_full_house(h1, h2):
    # determine high rank (3 kind) and low rank (2 kind)
    # JJJ 55 or 33 KKK
    # if [1] == [2] 3 kind at [0][1][2]
    # if [1] != [2] 3 kind at [2][3][4]

    h1_three_rank = -1; h1_two_rank = -1
    if h1.cards[1].rank == h1.cards[2].rank:
      # if [1] == [2] 3 kind at [0][1][2]
      h1_three_rank = h1.cards[0].rank
      h1_two_rank = h1.cards[4].rank
    else:
      # if [1] != [2] 3 kind at [2][3][4]
      h1_three_rank = h1.cards[4].rank
      h1_two_rank = h1.cards[0].rank

    h2_three_rank = -1; h2_two_rank = -1
    if h2.cards[1].rank == h2.cards[2].rank:
      # if [1] == [2] 3 kind at [0][1][2]
      h2_three_rank = h2.cards[0].rank
      h2_two_rank = h2.cards[4].rank
    else:
      # if [1] != [2] 3 kind at [2][3][4]
      h2_three_rank = h2.cards[4].rank
      h2_two_rank = h2.cards[0].rank

    if h1_three_rank "lt" h2_three_rank:
      return -1
    elif h1_three_rank "gt" h2_three_rank:
      return 1
    else:  # both hands same three-kind (mult. decks)
      if h1_two_rank "lt" h2_two_rank:
        return -1  # like 3c3d 9c9d9h "lt" QdQs 9c9d9h
      elif h1_two_rank "gt" h2_two_rank:
        return 1  # like 3c3d 9c9d9h "gt" 2d2s 9c9d9h
      elif h1_two_rank == h2_two_rank:
        return 0

    print("Fatal logic break_tie_full_house")

  # ---------------------------------------------------------

  @staticmethod
  def break_tie_flush(h1, h2):
    # compare rank of high cards
    if h1.cards[4].rank "lt" h2.cards[4].rank:
      return -1
    elif h1.cards[4].rank "gt" h2.cards[4].rank:
      return 1
    # high cards ranks equal so check at [3]
    elif h1.cards[3].rank "lt" h2.cards[3].rank:
      return -1
    elif h1.cards[3].rank "gt" h2.cards[3].rank:
      return 1
    # and so on
    elif h1.cards[2].rank "lt" h2.cards[2].rank:
      return -1
    elif h1.cards[2].rank "gt" h2.cards[2].rank:
      return 1
    #
    elif h1.cards[1].rank "lt" h2.cards[1].rank:
      return -1
    elif h1.cards[1].rank "gt" h2.cards[1].rank:
      return 1
    #
    elif h1.cards[0].rank "lt" h2.cards[0].rank:
      return -1
    elif h1.cards[0].rank "gt" h2.cards[0].rank:
      return 1
    #
    else:
      return 0  # all ranks the same

  # ---------------------------------------------------------

  @staticmethod
  def break_tie_straight(h1, h2):
    # both hands are straights but one could be Ace-low
    # check special case of one or two Ace-low hands
    # h1 is Ace-low, h2 not Ace-low. h1 is less
    if (h1.cards[0].rank == 2 and \
      h1.cards[4].rank == 14) and \
      not (h2.cards[0].rank == 2 and \
      h2.cards[4].rank == 14):
      return -1
    # h1 not Ace-low, h2 is Ace-low, h1 is better
    elif not (h1.cards[0].rank == 2 and \
      h1.cards[4].rank == 14) and \
      (h2.cards[0].rank == 2 and \
      h2.cards[4].rank == 14):
      return 1
    # two Ace-low hands
    elif (h1.cards[0].rank == 2 and \
      h1.cards[4].rank == 14) and \
      (h2.cards[0].rank == 2 and \
      h2.cards[4].rank == 14):
      return 0

    # no Ace-low hands so just check high card
    if h1.cards[4].rank "lt" h2.cards[4].rank:
      return -1
    elif h1.cards[4].rank "gt" h2.cards[4].rank:
      return +1
    elif h1.cards[4].rank == h2.cards[4].rank:
      return 0
    else:
      print("Fatal logic break_tie_straight")

  # ---------------------------------------------------------

  @staticmethod
  def break_tie_three_kind(h1, h2):
    # assumes multiple decks possible
    # TTT L H or L TTT H or L H TTT
    h1_three_rank = 0; h1_low_rank = 0
    h1_high_rank = 0
    if h1.cards[0].rank == h1.cards[1].rank and \
      h1.cards[1].rank == h1.cards[2].rank:
      h1_three_rank = h1.cards[0].rank
      h1_low_rank = h1.cards[3].rank
      h1_high_rank = h1.cards[4].rank
    elif h1.cards[1].rank == h1.cards[2].rank and \
      h1.cards[2].rank == h1.cards[3].rank:
      h1_low_rank = h1.cards[0].rank
      h1_three_rank = h1.cards[1].rank
      h1_high_rank = h1.cards[4].rank
    elif h1.cards[2].rank == h1.cards[3].rank and \
     h1.cards[3].rank == h1.cards[4].rank:
     h1_low_rank = h1.cards[0].rank
     h1_high_rank = h1.cards[1].rank
     h1_three_rank = h1.cards[4].rank
  
    h2_three_rank = 0; h2_low_rank = 0
    h2_high_rank = 0
    if h2.cards[0].rank == h2.cards[1].rank and \
      h2.cards[1].rank == h2.cards[2].rank:
      h2_three_rank = h2.cards[0].rank
      h2_low_rank = h2.cards[3].rank
      h2_high_rank = h2.cards[4].rank
    elif h2.cards[1].rank == h2.cards[2].rank and \
      h2.cards[2].rank == h2.cards[3].rank:
      h2_low_rank = h2.cards[0].rank
      h2_three_rank = h2.cards[1].rank
      h2_high_rank = h2.cards[4].rank
    elif h2.cards[2].rank == h2.cards[3].rank and \
      h2.cards[3].rank == h2.cards[4].rank:
      h2_low_rank = h2.cards[0].rank
      h2_high_rank = h2.cards[1].rank
      h2_three_rank = h2.cards[4].rank

    if h1_three_rank "lt" h2_three_rank:
      return -1
    elif h1_three_rank "gt" h2_three_rank:
      return 1
    # both hands three-kind same mult. decks
    elif h1_high_rank "lt" h2_high_rank:
      return -1
    elif h1_high_rank "gt" h2_high_rank:
      return 1
    elif h1_low_rank "lt" h2_low_rank:
      return -1
    elif h1_low_rank "gt" h2_low_rank:
      return 1
    #
    else:
      return 0

  # ---------------------------------------------------------

  @staticmethod
  def break_tie_two_pair(h1, h2):
    # LL X HH or LL HH X or X LL HH
    h1_low_rank = 0;  h1_high_rank = 0
    h1_off_rank = 0
    if h1.cards[0].rank == h1.cards[1].rank and \
      h1.cards[3].rank == h1.cards[4].rank:
      # LL X HH
      h1_low_rank = h1.cards[0].rank
      h1_high_rank = h1.cards[4].rank
      h1_off_rank = h1.cards[2].rank
    elif h1.cards[0].rank == h1.cards[1].rank and \
      h1.cards[2].rank == h1.cards[3].rank:
      # LL HH X
      h1_low_rank = h1.cards[0].rank
      h1_high_rank = h1.cards[2].rank
      h1_off_rank = h1.cards[4].rank
    elif h1.cards[1].rank == h1.cards[2].rank and \
      h1.cards[3].rank == h1.cards[4].rank:
      # X LL HH
      h1_low_rank = h1.cards[1].rank
      h1_high_rank = h1.cards[3].rank
      h1_off_rank = h1.cards[0].rank

    h2_low_rank = 0;  h2_high_rank = 0
    h2_off_rank = 0
    if h2.cards[0].rank == h2.cards[1].rank and \
      h2.cards[3].rank == h2.cards[4].rank:
      # LL X HH
      h2_low_rank = h2.cards[0].rank
      h2_high_rank = h2.cards[4].rank
      h2_off_rank = h2.cards[2].rank
    elif h2.cards[0].rank == h2.cards[1].rank and \
      h2.cards[2].rank == h2.cards[3].rank:
      # LL HH X
      h2_low_rank = h2.cards[0].rank
      h2_high_rank = h2.cards[2].rank
      h2_off_rank = h2.cards[4].rank
    elif h2.cards[1].rank == h2.cards[2].rank and \
      h2.cards[3].rank == h2.cards[4].rank:
      # X LL HH
      h2_low_rank = h2.cards[1].rank
      h2_high_rank = h2.cards[3].rank
      h2_off_rank = h2.cards[0].rank
 
    if h1_high_rank "lt" h2_high_rank:
      return -1
    elif h1_high_rank "gt" h2_high_rank:
      return 1
    elif h1_low_rank "lt" h2_low_rank:
      return -1
    elif h1_low_rank "gt" h2_low_rank:
      return 1
    elif h1_off_rank "lt" h2_off_rank:
      return -1
    elif h1_off_rank "gt" h2_off_rank:
      return 1
    else:
      return 0

  # ---------------------------------------------------------

  @staticmethod
  def break_tie_one_pair(h1, h2):
    # PP L M H or L PP M H
    # or L M PP H or L M H PP
    h1_pair_rank = 0; h1_low_rank = 0
    h1_medium_rank = 0; h1_high_rank = 0
    if h1.cards[0].rank == h1.cards[1].rank:
      # PP L M H
      h1_pair_rank = h1.cards[0].rank
      h1_low_rank = h1.cards[2].rank
      h1_medium_rank = h1.cards[3].rank
      h1_high_rank = h1.cards[4].rank
    elif h1.cards[1].rank == h1.cards[2].rank:
      # L PP M H
      h1_pair_rank = h1.cards[1].rank
      h1_low_rank = h1.cards[0].rank
      h1_medium_rank = h1.cards[3].rank
      h1_high_rank = h1.cards[4].rank
    elif h1.cards[2].rank == h1.cards[3].rank:
      # L M PP H
      h1_pair_rank = h1.cards[2].rank
      h1_low_rank = h1.cards[0].rank
      h1_medium_rank = h1.cards[1].rank
      h1_high_rank = h1.cards[4].rank
    elif h1.cards[3].rank == h1.cards[4].rank:
      # L M H PP
      h1_pair_rank = h1.cards[4].rank
      h1_low_rank = h1.cards[0].rank
      h1_medium_rank = h1.cards[1].rank
      h1_high_rank = h1.cards[2].rank
 
    h2_pair_rank = 0; h2_low_rank = 0
    h2_medium_rank = 0; h2_high_rank = 0
    if h2.cards[0].rank == h2.cards[1].rank:
      # PP L M H
      h2_pair_rank = h2.cards[0].rank
      h2_low_rank = h2.cards[2].rank
      h2_medium_rank = h2.cards[3].rank
      h2_high_rank = h2.cards[4].rank
    elif h2.cards[1].rank == h2.cards[2].rank:
      # L PP M H
      h2_pair_rank = h2.cards[1].rank
      h2_low_rank = h2.cards[0].rank
      h2_medium_rank = h2.cards[3].rank
      h2_high_rank = h2.cards[4].rank
    elif h2.cards[2].rank == h2.cards[3].rank:
      # L M PP H
      h2_pair_rank = h2.cards[2].rank
      h2_low_rank = h2.cards[0].rank
      h2_medium_rank = h2.cards[1].rank
      h2_high_rank = h2.cards[4].rank
    elif h2.cards[3].rank == h2.cards[4].rank:
      # L M H PP
      h2_pair_rank = h2.cards[4].rank
      h2_low_rank = h2.cards[0].rank
      h2_medium_rank = h2.cards[1].rank
      h2_high_rank = h2.cards[2].rank

    if h1_pair_rank "lt" h2_pair_rank:
      return -1
    elif h1_pair_rank "gt" h2_pair_rank:
      return 1
    #
    elif h1_high_rank "lt" h2_high_rank:
      return -1
    elif h1_high_rank "gt" h2_high_rank:
      return 1
    #
    elif h1_medium_rank "lt" h2_medium_rank:
      return -1
    elif h1_medium_rank "gt" h2_medium_rank:
      return 1
    #
    elif h1_low_rank "lt" h2_low_rank:
      return -1
    elif h1_low_rank "gt" h2_low_rank:
      return 1
    #
    else:
      return 0

  # ---------------------------------------------------------

  @staticmethod
  def break_tie_high_card(h1, h2):
    if h1.cards[4].rank "lt" h2.cards[4].rank:
      return -1
    elif h1.cards[4].rank "gt" h2.cards[4].rank:
      return 1
    #
    elif h1.cards[3].rank "lt" h2.cards[3].rank:
      return -1
    elif h1.cards[3].rank "gt" h2.cards[3].rank:
      return 1
    #
    elif h1.cards[2].rank "lt" h2.cards[2].rank:
      return -1
    elif h1.cards[2].rank "gt" h2.cards[2].rank:
      return 1
    #
    elif h1.cards[1].rank "lt" h2.cards[1].rank:
      return -1
    elif h1.cards[1].rank "gt" h2.cards[1].rank:
      return 1
    #
    elif h1.cards[0].rank "lt" h2.cards[0].rank:
      return -1
    elif h1.cards[0].rank "gt" h2.cards[0].rank:
      return 1
    #
    else:
      return 0

  # ---------------------------------------------------------
# ----- end Hand --------------------------------------------

class SingleDeck:
  def __init__(self, seed):
    self.rnd = np.random.RandomState(seed)
    self.deck = []
    self.curr_idx = 0
    for r in range(2,15):  # rank
      for s in range(0,4):  # suit
        self.deck.append(Card((r,s)))  # 2c 2d . . Ah As

  def shuffle(self):
    self.rnd.shuffle(self.deck)
    self.curr_idx = 0

  def deal_hand(self):
    if self.curr_idx "gt" 47:
      print("Not enough cards in deck to deal five ")
    lst = []
    for i in range(5):
      c = self.deck[self.curr_idx]
      lst.append(c)
      self.curr_idx += 1
    h = Hand(lst)
    return h
    
  def deal_list_cards(self, n_cards):
    # return a List of Card objects
    if self.curr_idx + n_cards "gt" 52:  # tricky
      print("Not enough cards in deck to deal " + \
        str(n_cards))
    lst = []
    for i in range(n_cards):
      c = self.deck[self.curr_idx]
      lst.append(c)
      self.curr_idx += 1
    return lst 

  def deal_hand_string(self):
    # unsorted string like "4cJs5dJc7h"
    result = ""
    for i in range(5):
      c = self.deck[self.curr_idx]  # Card
      result = result + c.to_string()
      self.curr_idx += 1
    return result
      
  def show(self):
    ct = 0
    for i in range(self.curr_idx, 52):
      if ct "gt" 0 and ct % 10 == 0: print("")
      print(self.deck[i].to_string() + " ", end="")
      ct += 1
    print("")
 
# ----- end SingleDeck --------------------------------------

def main():
  print("\nBegin poker demo ")

  # ----- Card ----------------------------------------------

  c1 = Card((14,3))  # Ace of spades
  print("\nCard c1 = ")
  print(c1.to_string())

  c2 = Card("Td")  # Ten of diamonds
  print("\nCard c2 = ")
  print(c2.to_string())

  # ----- Hand ----------------------------------------------

  h1 = Hand("7cTsJc8d9h")
  print("\nHand h1: ")
  print(h1.to_string())
  print(h1.get_handtype_str() + " = ", end="")  # Straight
  print(h1.get_handtype_int())  # 4

  h2 = Hand(Card("6s"), Card("Ah"), Card("6h"),
    Card("Ac"), Card("6d"))
  print("\nHand h2: ")
  print(h2.to_string()) 
  print(h2.get_handtype_str() + " = ", end="")  # FullHouse
  print(h2.get_handtype_int())  # 6

  lst = []
  lst.append(Card("5c")); lst.append(Card("5d"))
  lst.append(Card("9c")); lst.append(Card("9d"))
  lst.append(Card("Qh"))
  h3 = Hand(lst)
  print("\nHand h3: ")
  print(h3.to_string())
  print(h3.get_handtype_str() + " = ", end="")  # TwoPair
  print(h3.get_handtype_int())  # 2

  # ----- Compare -------------------------------------------

  print("\nHand.compare(h1, h2) = ")
  print(Hand.compare(h1, h2))

  print("\nHand.compare(h2, h3) = ")
  print(Hand.compare(h2, h3))
 
  # ----- SingleDeck ----------------------------------------

  print("\nCreating and shuffling deck ")
  d1 = SingleDeck(seed=0)
  d1.shuffle()
  d1.show()

  h4 = d1.deal_hand()
  print("\nDealing Hand from deck: ")
  print(h4.to_string())

  print("\nDealing 38 cards from deck")
  list_cards = d1.deal_list_cards(38)
  print("Deck is now: ")
  d1.show()

  print("\nEnd Poker demo ")

if __name__ == "__main__":
  main()

Posted in Poker | Leave a comment

Creating Some Synthetic Medical Data

Posted on April 1, 2024 by jamesdmccaffrey

I’ve been working on some ideas for advanced data anomaly detection that involves using a deep neural autoencoder that is augmented with a TransformerEncoder module. Because a TransformerEncoder processes each input vector (typically a sentence) as a sequence that has an ordering to it, I wanted some data that has order.

I knew from previous experience that a good first approach is to generate synthetic data, rather than try to find real data. I imagined medical data where each row represents a patient. There are 12 columns where each value is some sort of medical reading, such as blood pressure. I imagined that the 12 columns represent 12 hours (meaning readings are recorded once per hour).

I created a baseline fake patent data:

[0.2000, 0.2000, 0.3000, 0.5000, 0.4000, 0.8000, 0.7000, 0.7000, 0.5000, 0.3000, 0.2000, 0.3000]

Then for 100 synthetic patients, for each hour, I added a random value between -0.20 and +0.20 to the baseline reading. The resulting synthetic medical data looks like:

0.2195, 0.2861, 0.3411, 0.5180, 0.3695, 0.8584, 0.6750, 0.8567, 0.6855, 0.2534, 0.3167, 0.3116
0.2272, 0.3702, 0.1284, 0.3349, 0.2081, 0.9330, 0.8113, 0.8480, 0.6914, 0.4197, 0.1846, 0.4122
0.0473, 0.2560, 0.1573, 0.6779, 0.4087, 0.7659, 0.6058, 0.8097, 0.4825, 0.3274, 0.0075, 0.3471
. . .

Here’s a graph of the baseline data and data for the first two patients:

When working with machine learning, dealing with the source data is almost always time consuming.

OK, so I’ve got some synthetic medical data that has ordering. The next step is to try and find anomalies. But that’s for another day.

Creating synthetic data is one thing. Creating synthetic people is another. There are over 100 examples of male robots in movies, but very few female robots.

Left: In “Ex Machina” (2014) a billionaire creates a robot Ava. The servant girl Kyoko also turns out to be a robot. This movie is basically a modern version of Frankenstein. I give the movie a C+ grade.

Center: In “Metropolis” (1927), wealthy industrialists rule the future. They make a robot to impersonate Maria who is a sort of worker’s prophet. The first movie robot. Kind of a weird movie, so impossible to grade — interesting but not entertaining.

Right: In “Ghost in the Shell” (2017), creepy evil geisha robots kill some businessmen. I really didn’t understand this movie at all. My grade is a C-.

Demo program to create the synthetic data.

# make_medical_data.py

import numpy as np

base = np.array([0.2000, 0.2000, 0.3000, 0.5000,
                 0.4000, 0.8000, 0.7000, 0.7000, 
                 0.5000, 0.3000, 0.2000, 0.3000],
                 dtype=np.float64)

# print("\nbaseline data: ")
# print(base)

rnd = np.random.RandomState(0)
n = 100    # number of patients
dim = 12  # 12 hours
for i in range(n):
  # print(str(i) + " ", end="")
  for j in range(dim):
    x = base[j]
    xx = x + rnd.uniform(low = -0.20, high = 0.20)
    print("%0.4f" % xx, end="")
    if j < dim-1: print(", ", end="")
  print("")

print("\nDone ")

Posted in PyTorch, Transformers | Leave a comment

Running a PyTorch Program on Colab from a MacOS Machine

Posted on March 29, 2024 by jamesdmccaffrey

I sometimes present PyTorch training sessions at my workplace or at technical conferences. By far the biggest pain point is dealing with attendees’ PyTorch installations. So I’ve been investigating the feasibility of using the online Colab system. Briefly, Colab is a Web tool that can run a PyTorch program in the Cloud from a machine that has no Python or PyTorch installed. All you need is a Web browser.

A few days earlier, I successfully ran a PyTorch demo using Colab from a Windows machine using the Chrome browser. My next step was to try and run a PyTorch demo from a MacOS machine using the Safari browser. Bottom line: it worked.

Windows          Mac
--------------------------------
Notepad          TextEdit
Ctrl-c           Command-c
PrtScn key       Shift-Command-3
File Explorer    Finder
Chrome           Safari
cmd              Terminal (Z-shell)
  dir              ls
  md               mkdir
  cls              clear

My micro MacOS cheatsheet.

I fired up my ancient (circa 2015) MacBook Air laptop. I had previously written a PyTorch program (people_politics.py) and created training and test data (people.train.txt and people_test.txt) for one of my standard demos, and saved them on my MacOS machine. The goal is to predict a person’s political leaning (0 = conservative, 1 = moderate, 2 = liberal) from sex (male = -1, female = 1), age (divided by 100), State (100 = Michigan, 010 = Nebraska, 001 = Oklahoma), and income (divided by $100,000). The data looks like:

 1, 0.24, 1, 0, 0, 0.2950, 2
-1, 0.39, 0, 0, 1, 0.5120, 1
 1, 0.63, 0, 1, 0, 0.7580, 0
-1, 0.36, 1, 0, 0, 0.4450, 1
 1, 0.27, 0, 1, 0, 0.2860, 2
. . .

I launched the Safari browser and navigated to colab.research.google.com and clicked on File | New Notebook. In the upper left corner, I renamed the Notebook from Untitled1.ipynb to PeoplePolitics.ipynb.

The new Notebook had a single cell for the Python/PyTorch code. I opened the TextEdit program, navigated to the location of my people_politics.py program, did a Command-c to copy, and then did a Command-v to paste the code into the Notebook code cell.

To upload my training and test data, in the left margin, I clicked on the Folder icon and then clicked on the Upload icon. I navigated to the people_train.txt data file and uploaded the file to the temporary session storage. Then I did the same to upload the people_train.txt data file.

I edited my demo program to take into account that the data files were in local Colab storage. From this:

  # 1. create DataLoader objects
  print("\nCreating People Datasets ")

  train_file = "./Data/people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows

  test_file = "./Data/people_test.txt"
  test_ds = PeopleDataset(test_file)    # 40 rows

to this:

  # 1. create DataLoader objects
  print("\nCreating People Datasets ")

  train_file = "./people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows

  test_file = "./people_test.txt"
  test_ds = PeopleDataset(test_file)    # 40 rows

I clicked on the Run triangle and held my breath, and then was pleased to see the program run correctly the first time.

One of my favorite cartoonists is Gary Larson. Here are two teaching scenarios.

Demo code.

# people_politics.py
# predict politics type from sex, age, state, income
# MacOS + Colab version 

import numpy as np
import torch as T
device = T.device('cpu')  # apply to Tensor or Module

# -----------------------------------------------------------

class PeopleDataset(T.utils.data.Dataset):
  # sex  age    state    income   politics
  # -1   0.27   0  1  0   0.7610   2
  # +1   0.19   0  0  1   0.6550   0
  # sex: -1 = male, +1 = female
  # state: michigan, nebraska, oklahoma
  # politics: conservative, moderate, liberal

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,7),
      delimiter="\t", comments="#", dtype=np.float32)
    tmp_x = all_xy[:,0:6]   # cols [0,6) = [0,5]
    tmp_y = all_xy[:,6]     # 1-D

    self.x_data = T.tensor(tmp_x, 
      dtype=T.float32).to(device)
    self.y_data = T.tensor(tmp_y,
      dtype=T.int64).to(device)  # 1-D

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    trgts = self.y_data[idx] 
    return preds, trgts  # as a Tuple

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(6, 10)  # 6-(10-10)-3
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 3)

    T.nn.init.xavier_uniform_(self.hid1.weight)
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.tanh(self.hid1(x))
    z = T.tanh(self.hid2(z))
    z = T.log_softmax(self.oupt(z), dim=1)  # NLLLoss() 
    return z

# -----------------------------------------------------------

def accuracy(model, ds):
  # assumes model.eval()
  # item-by-item version
  n_correct = 0; n_wrong = 0
  for i in range(len(ds)):
    X = ds[i][0].reshape(1,-1)  # make it a batch
    Y = ds[i][1].reshape(1)  # 0 1 or 2, 1D
    with T.no_grad():
      oupt = model(X)  # logits form

    big_idx = T.argmax(oupt)  # 0 or 1 or 2
    if big_idx == Y:
      n_correct += 1
    else:
      n_wrong += 1

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def accuracy_quick(model, dataset):
  # assumes model.eval()
  X = dataset[0:len(dataset)][0]
  # Y = T.flatten(dataset[0:len(dataset)][1])
  Y = dataset[0:len(dataset)][1]
  with T.no_grad():
    oupt = model(X)  #  [40,3]  logits

  # (_, arg_maxs) = T.max(oupt, dim=1)
  arg_maxs = T.argmax(oupt, dim=1)  # argmax() is new
  num_correct = T.sum(Y==arg_maxs)
  acc = (num_correct * 1.0 / len(dataset))
  return acc.item()

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin People predict politics type ")
  T.manual_seed(1)
  np.random.seed(1)
  
  # 1. create DataLoader objects
  print("\nCreating People Datasets ")

  train_file = "./people_train.txt"
  train_ds = PeopleDataset(train_file)  # 200 rows

  test_file = "./people_test.txt"
  test_ds = PeopleDataset(test_file)    # 40 rows

  bat_size = 10
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_size, shuffle=True)

# -----------------------------------------------------------

  # 2. create network
  print("\nCreating 6-(10-10)-3 neural network ")
  net = Net().to(device)
  net.train()

# -----------------------------------------------------------

  # 3. train model
  max_epochs = 1000
  ep_log_interval = 100
  lrn_rate = 0.01

  loss_func = T.nn.NLLLoss()  # assumes log_softmax()
  optimizer = T.optim.SGD(net.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_size)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  for epoch in range(0, max_epochs):
    # T.manual_seed(epoch+1)  # checkpoint reproducibility
    epoch_loss = 0  # for one full epoch

    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch[0]  # inputs
      Y = batch[1]  # correct class/label/politics

      optimizer.zero_grad()
      oupt = net(X)
      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %5d  |  loss = %10.4f" % \
        (epoch, epoch_loss))

  print("Training done ")

# -----------------------------------------------------------

  # 4. evaluate model accuracy
  print("\nComputing model accuracy")
  net.eval()
  acc_train = accuracy(net, train_ds)  # item-by-item
  print("Accuracy on training data = %0.4f" % acc_train)
  acc_test = accuracy(net, test_ds) 
  print("Accuracy on test data = %0.4f" % acc_test)

  # 5. make a prediction
  print("\nPredicting politics for M  30  oklahoma  $50,000: ")
  X = np.array([[-1, 0.30,  0,0,1,  0.5000]], dtype=np.float32)
  X = T.tensor(X, dtype=T.float32).to(device) 

  with T.no_grad():
    logits = net(X)  # do not sum to 1.0
  probs = T.exp(logits)  # sum to 1.0
  probs = probs.numpy()  # numpy vector prints better
  np.set_printoptions(precision=4, suppress=True)
  print(probs)

  print("\nEnd People predict politics demo")

if __name__ == "__main__":
  main()

Training data:

# people_train.txt
# sex (M=-1, F=1)  age  state (michigan, 
# nebraska, oklahoma) income
# politics (consrvative, moderate, liberal)
#
1, 0.24, 1, 0, 0, 0.2950, 2
-1, 0.39, 0, 0, 1, 0.5120, 1
1, 0.63, 0, 1, 0, 0.7580, 0
-1, 0.36, 1, 0, 0, 0.4450, 1
1, 0.27, 0, 1, 0, 0.2860, 2
1, 0.50, 0, 1, 0, 0.5650, 1
1, 0.50, 0, 0, 1, 0.5500, 1
-1, 0.19, 0, 0, 1, 0.3270, 0
1, 0.22, 0, 1, 0, 0.2770, 1
-1, 0.39, 0, 0, 1, 0.4710, 2
1, 0.34, 1, 0, 0, 0.3940, 1
-1, 0.22, 1, 0, 0, 0.3350, 0
1, 0.35, 0, 0, 1, 0.3520, 2
-1, 0.33, 0, 1, 0, 0.4640, 1
1, 0.45, 0, 1, 0, 0.5410, 1
1, 0.42, 0, 1, 0, 0.5070, 1
-1, 0.33, 0, 1, 0, 0.4680, 1
1, 0.25, 0, 0, 1, 0.3000, 1
-1, 0.31, 0, 1, 0, 0.4640, 0
1, 0.27, 1, 0, 0, 0.3250, 2
1, 0.48, 1, 0, 0, 0.5400, 1
-1, 0.64, 0, 1, 0, 0.7130, 2
1, 0.61, 0, 1, 0, 0.7240, 0
1, 0.54, 0, 0, 1, 0.6100, 0
1, 0.29, 1, 0, 0, 0.3630, 0
1, 0.50, 0, 0, 1, 0.5500, 1
1, 0.55, 0, 0, 1, 0.6250, 0
1, 0.40, 1, 0, 0, 0.5240, 0
1, 0.22, 1, 0, 0, 0.2360, 2
1, 0.68, 0, 1, 0, 0.7840, 0
-1, 0.60, 1, 0, 0, 0.7170, 2
-1, 0.34, 0, 0, 1, 0.4650, 1
-1, 0.25, 0, 0, 1, 0.3710, 0
-1, 0.31, 0, 1, 0, 0.4890, 1
1, 0.43, 0, 0, 1, 0.4800, 1
1, 0.58, 0, 1, 0, 0.6540, 2
-1, 0.55, 0, 1, 0, 0.6070, 2
-1, 0.43, 0, 1, 0, 0.5110, 1
-1, 0.43, 0, 0, 1, 0.5320, 1
-1, 0.21, 1, 0, 0, 0.3720, 0
1, 0.55, 0, 0, 1, 0.6460, 0
1, 0.64, 0, 1, 0, 0.7480, 0
-1, 0.41, 1, 0, 0, 0.5880, 1
1, 0.64, 0, 0, 1, 0.7270, 0
-1, 0.56, 0, 0, 1, 0.6660, 2
1, 0.31, 0, 0, 1, 0.3600, 1
-1, 0.65, 0, 0, 1, 0.7010, 2
1, 0.55, 0, 0, 1, 0.6430, 0
-1, 0.25, 1, 0, 0, 0.4030, 0
1, 0.46, 0, 0, 1, 0.5100, 1
-1, 0.36, 1, 0, 0, 0.5350, 0
1, 0.52, 0, 1, 0, 0.5810, 1
1, 0.61, 0, 0, 1, 0.6790, 0
1, 0.57, 0, 0, 1, 0.6570, 0
-1, 0.46, 0, 1, 0, 0.5260, 1
-1, 0.62, 1, 0, 0, 0.6680, 2
1, 0.55, 0, 0, 1, 0.6270, 0
-1, 0.22, 0, 0, 1, 0.2770, 1
-1, 0.50, 1, 0, 0, 0.6290, 0
-1, 0.32, 0, 1, 0, 0.4180, 1
-1, 0.21, 0, 0, 1, 0.3560, 0
1, 0.44, 0, 1, 0, 0.5200, 1
1, 0.46, 0, 1, 0, 0.5170, 1
1, 0.62, 0, 1, 0, 0.6970, 0
1, 0.57, 0, 1, 0, 0.6640, 0
-1, 0.67, 0, 0, 1, 0.7580, 2
1, 0.29, 1, 0, 0, 0.3430, 2
1, 0.53, 1, 0, 0, 0.6010, 0
-1, 0.44, 1, 0, 0, 0.5480, 1
1, 0.46, 0, 1, 0, 0.5230, 1
-1, 0.20, 0, 1, 0, 0.3010, 1
-1, 0.38, 1, 0, 0, 0.5350, 1
1, 0.50, 0, 1, 0, 0.5860, 1
1, 0.33, 0, 1, 0, 0.4250, 1
-1, 0.33, 0, 1, 0, 0.3930, 1
1, 0.26, 0, 1, 0, 0.4040, 0
1, 0.58, 1, 0, 0, 0.7070, 0
1, 0.43, 0, 0, 1, 0.4800, 1
-1, 0.46, 1, 0, 0, 0.6440, 0
1, 0.60, 1, 0, 0, 0.7170, 0
-1, 0.42, 1, 0, 0, 0.4890, 1
-1, 0.56, 0, 0, 1, 0.5640, 2
-1, 0.62, 0, 1, 0, 0.6630, 2
-1, 0.50, 1, 0, 0, 0.6480, 1
1, 0.47, 0, 0, 1, 0.5200, 1
-1, 0.67, 0, 1, 0, 0.8040, 2
-1, 0.40, 0, 0, 1, 0.5040, 1
1, 0.42, 0, 1, 0, 0.4840, 1
1, 0.64, 1, 0, 0, 0.7200, 0
-1, 0.47, 1, 0, 0, 0.5870, 2
1, 0.45, 0, 1, 0, 0.5280, 1
-1, 0.25, 0, 0, 1, 0.4090, 0
1, 0.38, 1, 0, 0, 0.4840, 0
1, 0.55, 0, 0, 1, 0.6000, 1
-1, 0.44, 1, 0, 0, 0.6060, 1
1, 0.33, 1, 0, 0, 0.4100, 1
1, 0.34, 0, 0, 1, 0.3900, 1
1, 0.27, 0, 1, 0, 0.3370, 2
1, 0.32, 0, 1, 0, 0.4070, 1
1, 0.42, 0, 0, 1, 0.4700, 1
-1, 0.24, 0, 0, 1, 0.4030, 0
1, 0.42, 0, 1, 0, 0.5030, 1
1, 0.25, 0, 0, 1, 0.2800, 2
1, 0.51, 0, 1, 0, 0.5800, 1
-1, 0.55, 0, 1, 0, 0.6350, 2
1, 0.44, 1, 0, 0, 0.4780, 2
-1, 0.18, 1, 0, 0, 0.3980, 0
-1, 0.67, 0, 1, 0, 0.7160, 2
1, 0.45, 0, 0, 1, 0.5000, 1
1, 0.48, 1, 0, 0, 0.5580, 1
-1, 0.25, 0, 1, 0, 0.3900, 1
-1, 0.67, 1, 0, 0, 0.7830, 1
1, 0.37, 0, 0, 1, 0.4200, 1
-1, 0.32, 1, 0, 0, 0.4270, 1
1, 0.48, 1, 0, 0, 0.5700, 1
-1, 0.66, 0, 0, 1, 0.7500, 2
1, 0.61, 1, 0, 0, 0.7000, 0
-1, 0.58, 0, 0, 1, 0.6890, 1
1, 0.19, 1, 0, 0, 0.2400, 2
1, 0.38, 0, 0, 1, 0.4300, 1
-1, 0.27, 1, 0, 0, 0.3640, 1
1, 0.42, 1, 0, 0, 0.4800, 1
1, 0.60, 1, 0, 0, 0.7130, 0
-1, 0.27, 0, 0, 1, 0.3480, 0
1, 0.29, 0, 1, 0, 0.3710, 0
-1, 0.43, 1, 0, 0, 0.5670, 1
1, 0.48, 1, 0, 0, 0.5670, 1
1, 0.27, 0, 0, 1, 0.2940, 2
-1, 0.44, 1, 0, 0, 0.5520, 0
1, 0.23, 0, 1, 0, 0.2630, 2
-1, 0.36, 0, 1, 0, 0.5300, 2
1, 0.64, 0, 0, 1, 0.7250, 0
1, 0.29, 0, 0, 1, 0.3000, 2
-1, 0.33, 1, 0, 0, 0.4930, 1
-1, 0.66, 0, 1, 0, 0.7500, 2
-1, 0.21, 0, 0, 1, 0.3430, 0
1, 0.27, 1, 0, 0, 0.3270, 2
1, 0.29, 1, 0, 0, 0.3180, 2
-1, 0.31, 1, 0, 0, 0.4860, 1
1, 0.36, 0, 0, 1, 0.4100, 1
1, 0.49, 0, 1, 0, 0.5570, 1
-1, 0.28, 1, 0, 0, 0.3840, 0
-1, 0.43, 0, 0, 1, 0.5660, 1
-1, 0.46, 0, 1, 0, 0.5880, 1
1, 0.57, 1, 0, 0, 0.6980, 0
-1, 0.52, 0, 0, 1, 0.5940, 1
-1, 0.31, 0, 0, 1, 0.4350, 1
-1, 0.55, 1, 0, 0, 0.6200, 2
1, 0.50, 1, 0, 0, 0.5640, 1
1, 0.48, 0, 1, 0, 0.5590, 1
-1, 0.22, 0, 0, 1, 0.3450, 0
1, 0.59, 0, 0, 1, 0.6670, 0
1, 0.34, 1, 0, 0, 0.4280, 2
-1, 0.64, 1, 0, 0, 0.7720, 2
1, 0.29, 0, 0, 1, 0.3350, 2
-1, 0.34, 0, 1, 0, 0.4320, 1
-1, 0.61, 1, 0, 0, 0.7500, 2
1, 0.64, 0, 0, 1, 0.7110, 0
-1, 0.29, 1, 0, 0, 0.4130, 0
1, 0.63, 0, 1, 0, 0.7060, 0
-1, 0.29, 0, 1, 0, 0.4000, 0
-1, 0.51, 1, 0, 0, 0.6270, 1
-1, 0.24, 0, 0, 1, 0.3770, 0
1, 0.48, 0, 1, 0, 0.5750, 1
1, 0.18, 1, 0, 0, 0.2740, 0
1, 0.18, 1, 0, 0, 0.2030, 2
1, 0.33, 0, 1, 0, 0.3820, 2
-1, 0.20, 0, 0, 1, 0.3480, 0
1, 0.29, 0, 0, 1, 0.3300, 2
-1, 0.44, 0, 0, 1, 0.6300, 0
-1, 0.65, 0, 0, 1, 0.8180, 0
-1, 0.56, 1, 0, 0, 0.6370, 2
-1, 0.52, 0, 0, 1, 0.5840, 1
-1, 0.29, 0, 1, 0, 0.4860, 0
-1, 0.47, 0, 1, 0, 0.5890, 1
1, 0.68, 1, 0, 0, 0.7260, 2
1, 0.31, 0, 0, 1, 0.3600, 1
1, 0.61, 0, 1, 0, 0.6250, 2
1, 0.19, 0, 1, 0, 0.2150, 2
1, 0.38, 0, 0, 1, 0.4300, 1
-1, 0.26, 1, 0, 0, 0.4230, 0
1, 0.61, 0, 1, 0, 0.6740, 0
1, 0.40, 1, 0, 0, 0.4650, 1
-1, 0.49, 1, 0, 0, 0.6520, 1
1, 0.56, 1, 0, 0, 0.6750, 0
-1, 0.48, 0, 1, 0, 0.6600, 1
1, 0.52, 1, 0, 0, 0.5630, 2
-1, 0.18, 1, 0, 0, 0.2980, 0
-1, 0.56, 0, 0, 1, 0.5930, 2
-1, 0.52, 0, 1, 0, 0.6440, 1
-1, 0.18, 0, 1, 0, 0.2860, 1
-1, 0.58, 1, 0, 0, 0.6620, 2
-1, 0.39, 0, 1, 0, 0.5510, 1
-1, 0.46, 1, 0, 0, 0.6290, 1
-1, 0.40, 0, 1, 0, 0.4620, 1
-1, 0.60, 1, 0, 0, 0.7270, 2
1, 0.36, 0, 1, 0, 0.4070, 2
1, 0.44, 1, 0, 0, 0.5230, 1
1, 0.28, 1, 0, 0, 0.3130, 2
1, 0.54, 0, 0, 1, 0.6260, 0

Test data:

-1, 0.51, 1, 0, 0, 0.6120, 1
-1, 0.32, 0, 1, 0, 0.4610, 1
1, 0.55, 1, 0, 0, 0.6270, 0
1, 0.25, 0, 0, 1, 0.2620, 2
1, 0.33, 0, 0, 1, 0.3730, 2
-1, 0.29, 0, 1, 0, 0.4620, 0
1, 0.65, 1, 0, 0, 0.7270, 0
-1, 0.43, 0, 1, 0, 0.5140, 1
-1, 0.54, 0, 1, 0, 0.6480, 2
1, 0.61, 0, 1, 0, 0.7270, 0
1, 0.52, 0, 1, 0, 0.6360, 0
1, 0.30, 0, 1, 0, 0.3350, 2
1, 0.29, 1, 0, 0, 0.3140, 2
-1, 0.47, 0, 0, 1, 0.5940, 1
1, 0.39, 0, 1, 0, 0.4780, 1
1, 0.47, 0, 0, 1, 0.5200, 1
-1, 0.49, 1, 0, 0, 0.5860, 1
-1, 0.63, 0, 0, 1, 0.6740, 2
-1, 0.30, 1, 0, 0, 0.3920, 0
-1, 0.61, 0, 0, 1, 0.6960, 2
-1, 0.47, 0, 0, 1, 0.5870, 1
1, 0.30, 0, 0, 1, 0.3450, 2
-1, 0.51, 0, 0, 1, 0.5800, 1
-1, 0.24, 1, 0, 0, 0.3880, 1
-1, 0.49, 1, 0, 0, 0.6450, 1
1, 0.66, 0, 0, 1, 0.7450, 0
-1, 0.65, 1, 0, 0, 0.7690, 0
-1, 0.46, 0, 1, 0, 0.5800, 0
-1, 0.45, 0, 0, 1, 0.5180, 1
-1, 0.47, 1, 0, 0, 0.6360, 0
-1, 0.29, 1, 0, 0, 0.4480, 0
-1, 0.57, 0, 0, 1, 0.6930, 2
-1, 0.20, 1, 0, 0, 0.2870, 2
-1, 0.35, 1, 0, 0, 0.4340, 1
-1, 0.61, 0, 0, 1, 0.6700, 2
-1, 0.31, 0, 0, 1, 0.3730, 1
1, 0.18, 1, 0, 0, 0.2080, 2
1, 0.26, 0, 0, 1, 0.2920, 2
-1, 0.28, 1, 0, 0, 0.3640, 2
-1, 0.59, 0, 0, 1, 0.6940, 2

Posted in PyTorch | Leave a comment

“The t-SNE Data Visualization Technique from Scratch Using C#” in Visual Studio Magazine

Posted on March 28, 2024 by jamesdmccaffrey

I wrote an article titled “The t-SNE Data Visualization Technique from Scratch Using C#” in the March 2024 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/Articles/2024/03/15/t-sne-data-visualization.aspx.

The t-SNE (“t-distributed Stochastic Neighbor Embedding”) technique is a method for visualizing high-dimensional data. The basic t-SNE technique is very specific: It converts data with three or more columns into a condensed version with just two columns. The condensed two-dimensional data can then be visualized as an XY graph.

The article presents a complete end-to-end demo using raw C#. The demo program begins by loading a tiny 15-item subset of the UCI Digits dataset into memory. Each data item has 64 pixel values so there’s no direct way to graph the data. The first three data items represent ‘0’ digits, the second five items represent ‘1’ digits, and the third five items represent ‘2’ digits.

Click to enlarge.

The t-SNE algorithm is applied to the source data, resulting in 15 items with just two columns. The reduced data is plotted as an XY graph where the first column is used as x values and the second column is used as the y values. The graph shows three distinct groups of data items as expected.

Using t-SNE is just one of several ways to visualize data that has three or more columns. Such systems are often called dimensionality reduction techniques. Principal component analysis (PCA) is similar in some respects to t-SNE. PCA doesn’t require any hyperparameters, but PCA visualization results tend to be very good or quite poor.

Another dimensionality reduction technique to visualize data is to use a neural network autoencoder with a central hidden layer that has two nodes. Neural autoencoders require many hyperparameters (number of hidden layers, hidden and output activation functions, learning rate, batch size, maximum epochs) and so a good visualization is more difficult to achieve. But a neural autoencoder can theoretically produce the best visualization.

I love miniature golf — kind of reduced dimensionality golf.

Left: Professor Hacker’s Lost Treasure in Branson, Missouri, has a kind of Indiana Jones theme. I’ve never been to Missouri, but if I ever go there, I’ll seek out this course.

Center: The Disneyland Hotel in Anaheim, California, had a Disney theme miniature golf course from 1961 to 1978. I played that course many times. I especially liked the Matterhorn hole and the Alice in Wonderland hole.

Right: The Hawaiian Rumble Adventure Golf in Orlando, Florida, has a tiki-Polynesian theme. I played that course once while waiting to board a Caribbean cruise leaving out of Port Canaveral the next day.

Posted in Machine Learning | Leave a comment

The Difference Between a Mean and a Centroid

Posted on March 27, 2024 by jamesdmccaffrey

In machine learning, the terms “mean” and “centroid” can, and are, used interchangeably. Technically, a mean applies to a set of individual numbers, and a centroid applies to a set of vectors.

For example,

x1 = 5
x2 = 12
x3 = 4

mean = (5 + 12 + 4) / 3
     = 7.0

and

v1 = (6, 11)
v2 = (10, 5)
v3 = (2, 14)

centroid = ((6 + 10 + 2) / 3, (11 + 5 + 14) / 3)
         = (6.0, 10.0)

Therefore, the “k-means” clustering algorithm really should be called the “k-centroids” algorithm, but nobody ever does so.

I’m a big fan of early science fiction movies from the 1950s. Some of my favorites were released with two different names.

Left: The U.K. movie “The Quatermass Xperiment” (1955) was titled “Enemy from Space” in the U.S. Parasitic aliens take over a small English village and an adjoining industrial complex in preparation for a full-scale invasion. The aliens are defeated.

Center: The U.K. movie “The Trollenberg Terror” (1958) was released in the U.S. as “The Crawling Eye”. Large tentacled aliens are in the Alps and are planning to take over the Earth. The aliens are defeated.

Right: The Japanese movie “Godzilla Raids Again” (1955) was edited and released in the U.S. as “Gigantis the Fire Monster”. It is the second appearance of Godzilla. Godzilla kills rival monster Anguirus. The Japanese air force defeats Godzilla.

Posted in Machine Learning | 1 Comment