How to Fine-Tune a Transformer Architecture NLP Model in Visual Studio Magazine

I wrote an article titled “How to Fine-Tune a Transformer Architecture NLP Model” in the November 2021 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/11/16/fine-tune-nlp-model.aspx.

My article describes how to fine-tune a pretrained Transformer Architecture model for natural language processing. Specifically, the article explains how to fine-tune a condensed version of a pretrained BERT model to create binary classifier for a subset of the IMDB movie review dataset. The goal is sentiment analysis — accept the text of a movie review (such as, “This movie was a great waste of my time”) and output class 0 (negative review) or class 1 (positive review).

You can think of a pretrained transformer architecture (TA) model as sort of an English language expert. But the TA expert doesn’t know anything about movies and so you provide additional training to fine-tune the model so that it understands the difference between a positive movie review and a negative review.

I present a demo program. It begins by loading a small 200-item subset of the IMDB movie review dataset into memory. The full IMDB dataset has 50,000 movie reviews — 25,000 reviews for training and 25,000 reviews for testing, where there are 12,500 positive and 12,500 negative reviews. Working with the full dataset is very time-consuming so the demo data uses just the first 100 positive training reviews and the first 100 negative training reviews.

The reviews are read into memory then converted to a data structure that holds integer tokens. For example, the word “movie” has token ID = 3185. The tokenized movie reviews data structure is fed to a PyTorch Dataset object that is used to send batches of tokenized reviews and their associated labels to training code.

After the movie review data has been prepared, the demo loads a pretrained DistilBERT model into memory. DistilBERT is a condensed (“distilled”), but still large, version of the huge BERT model. The uncased version of DistilBERT has 66 million weights and biases. Then the demo fine-tunes the pretrained model by training the model using standard PyTorch techniques. The demo concludes by saving the fine-tuned model to file.

Working with natural language processing (NLP) is very challenging.

Writing a novel is arguably the highest form of NLP. Here are three books I read over and over. Left: “A Princess of Mars” (1912) by Edgar Rice Burroughs. Center: “Starship Troopers” (1959) by Robert A. Heinlein. Right: “Tom Swift and His Diving Seacopter” (1956) by “Victor Appleton II” (ghost writer James Lawrence).

The Traveling Salesman Problem Using Simulated Annealing In C#

The Traveling Salesman Problem (TSP) is a standard combinatorial optimization problem. The goal is to find the best route that visits all cities, where best route usually means shortest distance.

Simulated Annealing is a very simple heuristic for solving combinatorial optimization problems. Briefly:

make an initial guess route
loop
generate a new route
if new route is better, accept route
if new route is worse, accept
route anyway with a small probability
decrease accept_probability slightly
end-loop
return best route found


The devil is in the details: defining how to generate a new route and modifying the small acceptance probability of accepting a worse route.

I implemented simulated annealing for TSP using the Python language for an article in Visual Studio Magazine. Here is a C# implementation of the VSM article program.

The C# program results (top) are slightly different than the Python program results (bottom) because simulated annealing is probabilistic and the C# and Python random number generation functions are a bit different.

Because Python and C# are both C-family languages, it was relatively easy to refactor the original Python code to C# code. I did have to implement a Shuffle() function in C# to replace the built-in Python shuffle() function, but I’ve done that many times (using the Fisher-Yates mini-algorithm).

Good fun.

When I was just out of college, I worked on a cruise ship for a couple of years. I got to visit dozens of interesting countries — but all of them bordered the ocean. I know next to nothing about land-locked places such as these three adjacent Central Asian countries. Left: Kazakhstan (population 19 million). Center: Uzbekistan (population 34 million). Right: Kyrgyzstan (population 7 million). I wonder how similar their spoken and written languages are.

Demo C# code. Replace “lt” and “gt” with symbols — my blogging editor chokes on these symbols.

using System;

namespace TSP_Annealing
{
class TSP_Annealing_Program
{
static void Main(string[] args)
{
Console.WriteLine("\nBegin TSP simulated annealing C# ");

int nCities = 20;
Console.WriteLine("\nSetting num_cities = " + nCities);
Console.WriteLine("\nOptimal solution is 0, 1, 2, . . " +
(nCities - 1));
Console.WriteLine("Optimal soln has total distance = " +
(nCities - 1));
Random rnd = new Random(2);
int maxIter = 25000;
double startTemperature = 10000.0;
double alpha = 0.98;

Console.WriteLine("\nSettings: ");
Console.WriteLine("max_iter = " + maxIter);
Console.WriteLine("start_temperature = " +
startTemperature.ToString("F1"));
Console.WriteLine("alpha = " + alpha);

Console.WriteLine("\nStarting solve() ");
int[] soln = Solve(nCities, rnd, maxIter,
startTemperature, alpha);
Console.WriteLine("Finished solve() ");

Console.WriteLine("\nBest solution found: ");
ShowArray(soln);
double dist = TotalDist(soln);
Console.WriteLine("\nTotal distance = " +
dist.ToString("F1"));

Console.WriteLine("\nEnd demo ");
}  // Main()

static double TotalDist(int[] route)
{
double d = 0.0;  // total distance between cities
int n = route.Length;
for (int i = 0; i "lt" n - 1; ++i)
{
if (route[i] "lt" route[i + 1])
d += (route[i + 1] - route[i]) * 1.0;
else
d += (route[i] - route[i + 1]) * 1.5;
}
return d;
}

static double Error(int[] route)
{
int n = route.Length;
double d = TotalDist(route);
double minDist = n - 1;
return d - minDist;
}

static int[] Adjacent(int[] route, Random rnd)
{
int n = route.Length;
int[] result = (int[])route.Clone();  // shallow is OK
int i = rnd.Next(0, n); int j = rnd.Next(0, n);
int tmp = result[i];
result[i] = result[j]; result[j] = tmp;
return result;
}

static void Shuffle(int[] route, Random rnd)
{
// Fisher-Yates algorithm
int n = route.Length;
for (int i = 0; i "lt" n; ++i)
{
int rIndx = rnd.Next(i, n);
int tmp = route[rIndx];
route[rIndx] = route[i];
route[i] = tmp;
}
}

static void ShowArray(int[] arr)
{
int n = arr.Length;
Console.Write("[ ");
for (int i = 0; i "lt" n; ++i)
Console.WriteLine(" ]");
}

static int[] Solve(int nCities, Random rnd, int maxIter,
double startTemperature, double alpha)
{
double currTemperature = startTemperature;
int[] soln = new int[nCities];
for (int i = 0; i "lt" nCities; ++i) { soln[i] = i; }
Shuffle(soln, rnd);
Console.WriteLine("Initial guess: ");
ShowArray(soln);

double err = Error(soln);
int iteration = 0;
while (iteration "lt" maxIter && err "gt" 0.0)
{
if (adjErr "lt" err)  // better route
{
}
else
{
double acceptProb =
double p = rnd.Next();
if (p "lt" acceptProb)  // accept anyway
{
}
}

if (currTemperature "lt" 0.00001)
currTemperature = 0.00001;
else
currTemperature *= alpha;
++iteration;
}  // while

return soln;
}  // Solve()

}  // Program class
}  // ns


NFL 2021 Week 13 Predictions – Zoltar Likes Vegas Favorites Bengals and Dolphins

Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #13 of the 2021 season. It usually takes Zoltar about four weeks to hit his stride and takes humans about eight weeks to get up to speed, so weeks six through nine are usually Zoltar’s sweet spot. After week nine, injuries start having a big effect.

Zoltar:      saints  by    2  dog =     cowboys    Vegas:     cowboys  by    5
Zoltar:  buccaneers  by    3  dog =     falcons    Vegas:  buccaneers  by  9.5
Zoltar:   cardinals  by    0  dog =       bears    Vegas:   cardinals  by    7
Zoltar:     bengals  by    5  dog =    chargers    Vegas:     bengals  by  1.5
Zoltar:     vikings  by    2  dog =       lions    Vegas:     vikings  by  7.5
Zoltar:       colts  by    4  dog =      texans    Vegas:       colts  by    7
Zoltar:      chiefs  by    6  dog =     broncos    Vegas:      chiefs  by  9.5
Zoltar:    dolphins  by    6  dog =      giants    Vegas:    dolphins  by  2.5
Zoltar:      eagles  by    0  dog =        jets    Vegas:      eagles  by  6.5
Zoltar:     raiders  by    4  dog =    redskins    Vegas:     raiders  by  1.5
Zoltar:      ravens  by    0  dog =    steelers    Vegas:      ravens  by    3
Zoltar:        rams  by   10  dog =     jaguars    Vegas:        rams  by   12
Zoltar:    seahawks  by    1  dog = fortyniners    Vegas: fortyniners  by  3.5
Zoltar:       bills  by    4  dog =    patriots    Vegas:       bills  by  3.5


Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I usually use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion. In middle weeks I sometimes go ultra-aggressive and use a 1.0-point threshold.

Note: Because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is much too strongly biased towards Vegas underdogs. I need to fix this.

For week #13:

1. Zoltar likes Vegas underdog Saints against the Cowboys
2. Zoltar likes Vegas underdog Falcons against the Buccaneers
3. Zoltar likes Vegas underdog Bears against the Cardinals
4. Zoltar likes Vegas underdog Lions against the Vikings
5. Zoltar likes Vegas underdog Broncos against the Chiefs
6. Zoltar likes Vegas underdog Jets against the Eagles
7. Zoltar likes Vegas underdog Seahawks against the 49ers
8. Zoltar likes Vegas favorite Bengals over the Chargers
9. Zoltar likes Vegas favorite Dolphins over the Giants

For example, a bet on the underdog Saints against the Cowboys will pay off if the Saints win by any score, or if the favored Cowboys win but by less than the point spread of 5.0 points (in other words, by 4 points or less; if the Cowboys win by exactly 5 points the wager is a push).

Theoretically, if you must bet $110 to win$100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #12, against the Vegas point spread, Zoltar went 5-2 (using the standard 3.0 points as the advice threshold; using a conservative 4.0 points threshold Zoltar was 4-0). Overall, for the season, Zoltar is 46-35 against the spread (~56%).

Just for fun, I track how well Zoltar does when just trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #12, just predicting the winning team, Zoltar went 9-6 which is poor — as has been the case for the past few weeks, there were many upsets in week #12.

In week #12, just predicting the winning team, Vegas — “the wisdom of the crowd” also went 9-6, the same as Zoltar.

Zoltar sometimes predicts a 0-point margin of victory, which means the two teams are evenly matched. There are three such games in week #13. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.

My system is named after the Zoltar fortune teller machine you can find in arcades. Fortune teller machines have been around for many years. Center: This “Gypsy” machine was made about 1906 and is worth over $1 million dollars. Right: This is “Princess Doraldina”, made about 1928. Posted in Zoltar | Leave a comment Data Pipeline Processing in Machine Learning A common task when designing and implementing machine learning systems is reading data from a file (or SQL table) then performing a sequence of tasks on that data in order to get the data ready for consumption by an ML model. I don’t enjoy working on ML data processing pipelines because there’s always a tradeoff between efficiency, performance, and maintainability. This means there’s never a completely optimal design. For example, I am currently working on a new algorithm for positive and unlabeled learning (PUL). My system uses PyTorch. There is a small set of positive data (class 1 = patient has disease), and a large set of negative data (class 0 = no disease). The data must be read into memory, the predictors and target labels must be separated, the unlabeled data must be divided into a set for training and a set for scoring, the data must be converted to PyTorch tensors, some noise vectors have to be generated, and then the positive, unlabeled-in-training, and noise data must be combined. Whew! There are many dozens of possible sequences of actions in which to perform this pipeline processing. And there are dozens of design alternatives too. In the end it’s very satisfying when complex data pipelines are constructed. When I was new to software development, it was usually good enough just to get a system to work correctly. But with experience, it becomes increasingly important to implement systems and data pipelines that are efficient (for example, avoiding unnecessary file read operations), manageable (easy for other developers to understand and modify if necessary), and robust (resistant to many, but not all, errors). Two questionable designs. Left: An early concept for automobile seat belts (seat belts were not common until 1968). Right: I’m a software engineer, not a civil engineer, but I suspect this wheelchair ramp is less accessible than intended. Demo code: # test.py # test Dataset for Patient data PUL import numpy as np import torch as T device = T.device("cpu") # apply to Tensor or Module # ---------------------------------------------------------- # 1 2 3 4 5 6 # 3456789012345678901234567890123456789012345678901234567890 # ---------------------------------------------------------- class PatientDataset(T.utils.data.Dataset): # label age chol blood heart actual # 1 0.39 1 0 0 0.5432 1 0 0 1 # -9 0.29 0 0 1 0.4985 0 1 0 0 # . . . # [0] [1] [2 3 4] [5] [6 7 8] [9] def __init__(self, pos_file, unl_file, seed): self.rnd = np.random.RandomState(seed) self.reset_called = False # ---------------------------------------------------------- # 1. read positives x, y into memory as tensors tmp_pos_xy = np.loadtxt(pos_file, delimiter="\t", usecols=range(0,10), comments="#", dtype=np.float32) self.pos_x = \ T.tensor(tmp_pos_xy[:,1:9], dtype=T.float32).to(device) self.pos_y = \ T.tensor(tmp_pos_xy[:,0], dtype=T.int64).to(device) self.num_pos = len(self.pos_x) # 20 # 2. read all unlabeleds x, y, act into memory as tensors tmp_unl_xya = np.loadtxt(unl_file, delimiter="\t", usecols=range(0,10), comments="#", dtype=np.float32) self.unl_x = \ T.tensor(tmp_unl_xya[:,1:9], dtype=T.float32).to(device) self.unl_y = \ T.tensor(tmp_unl_xya[:,0], dtype=T.int64).to(device) self.unl_a = \ T.tensor(tmp_unl_xya[:,9], dtype=T.int64).to(device) self.num_unl = len(self.unl_x) # ---------------------------------------------------------- # 3. set sizes for noise, unlabeleds in-training self.num_noise = int(self.num_pos / 10) # not much noise self.num_unl_in = self.num_pos * 2 # more unlabeleds # 4. generate noise x, y (no actual) self.noise_x = None # see reset() self.noise_y = None # 5. break unlabeleds into in-train, out-train self.indices_all = np.arange(self.num_unl) # [0..179] self.indices_in_train = None self.indices_out_train = None self.unl_in_train_x = None self.unl_in_train_y = None self.unl_out_train_x = None self.unl_out_train_y = None # 6. concatenate pos, noise, unlabeled_in self.data_x = None self.data_y = None # ---------------------------------------------------------- def __len__(self): return len(self.data_x) # num_pos + num_noise + num_unl_in def __getitem__(self, idx): if self.reset_called == False: print("FATAL: you should call reset() before using ") preds = self.data_x[idx] target = self.data_y[idx] sample = { 'predictors' : preds, 'target' : target } return sample # ---------------------------------------------------------- def reset(self, unl_k): # new noise, new unlabeled-in, out to data_x, data_y # 1. noise tmp_noise_x = \ self.rnd.random((self.num_noise,8)).astype(np.float32) self.noise_x = \ T.tensor(tmp_noise_x, dtype=T.float32).to(device) tmp_noise_y = np.zeros(self.num_noise, dtype=np.int64) self.noise_y = \ T.tensor(tmp_noise_y, dtype=T.int64).to(device) # 2. unlabeled-in, out self.rnd.shuffle(self.indices_all) self.indices_in_train = \ self.indices_all[0:self.num_unl_in] self.indices_out_train = \ self.indices_all[self.num_unl_in:self.num_unl] self.unl_in_train_x = self.unl_x[self.indices_in_train] self.unl_in_train_y = self.unl_y[self.indices_in_train] self.unl_in_train_y[:] = unl_k # set unlabeled y values self.unl_out_train_x = self.unl_x[self.indices_out_train] self.unl_out_train_y = self.unl_y[self.indices_out_train] # 3. concatenate pos, noise, unlabeled_in self.data_x = \ T.cat((self.pos_x, self.noise_x, self.unl_in_train_x), \ dim=0) self.data_y = \ T.cat((self.pos_y, self.noise_y, self.unl_in_train_y), \ dim=0) self.reset_called = True # ---------------------------------------------------------- def main(): # 0. get started print("\nBegin Patient PUL Dataset test ") T.manual_seed(1) np.random.seed(1) # 1. prepare for training multiple times print("\nCreating Dataset \n") pos_file = ".\\Data\\patients_positive.txt" unl_file = ".\\Data\\patients_unlabeled.txt" patient_ds = PatientDataset(pos_file, unl_file, seed=1) patient_ds.reset(unl_k=99) # dummy value data_ldr = T.utils.data.DataLoader(patient_ds, \ batch_size=25, shuffle=False) for (batch_idx, batch) in enumerate(data_ldr): print(batch_idx) X = batch['predictors'] # inputs Y = batch['target'] # targets print(X) print(Y) input() print("\nEnd test ") if __name__ == "__main__": main()  Posted in Machine Learning | Leave a comment Multivariate Gaussian Probability Density Function from Scratch (Almost) I was writing some code that needed to compute the probability density function (PDF) value for a multidimensional Gaussian vector. The regular one-dimensional Gaussian function PDF is the bell-shaped curve. For example if mean = u = 0.0, and standard deviation = sd = 1.0, and x = 2.5 then pdf(x) is the value of the curve at x = 2.5, which is about 0.01753. The pdf() value isn’t a probability, but larger pdf() values mean more likely. In many scenarios, because pdf() values are often very small, using the log of the pdf() is more convenient mathematically. For the example above, logpdf(x) = log(0.01753) = -4.0438. The multivariate case extends this idea. For example, suppose dimension = k = 3, mean = u = [1.0, 3.0, 2.0], covariance = the 3×3 identity matrix, and x = [1.5, 1.5, 1.5]. Each value is a vector with 3 elements, and instead of a single standard deviation value, a 3×3 covariance matrix is used. Because x = [1.5, 1.5, 1.5] is fairly close to the mean of u = [1.0, 3.0, 2.0], the pdf() value for x would be smaller than the pdf() value for x2 = [7.0, 9.5, 8.0] because x2 is farther away from u. To compute logpdf() I used the built-in scipy.stats.multivariate_normal.logpdf() function from the SciPy library. But I wondered how easy it would be to remove the SciPy dependency and compute a logpdf() function using just NumPy — sort of from scratch. I found the equation for the pdf() for a multivariate Gaussian distribution on Wikipedia (after wading through a lot of noise). The equation uses matrix transpose, inverse, determinant, and multiplication, but NumPy has functions for all of those. I coded up a program-defined logpdf() function using just NumPy functions. Good fun. There are always tradeoffs between implementing a function from scratch vs. using an off-the-shelf library function. Three from-scratch origami style paper dresses. More interesting than off-the-shelf clothing but probably less practical. Demo code: # multivariate_normal_pdf_demo.py import numpy as np import scipy.stats as sps def my_logpdf(x, u, covar): k = len(x) # dimension a = np.transpose(x - u) b = np.linalg.inv(covar) c = x - u d = np.matmul(a, b) e = np.matmul(d, c) numer = np.exp(-0.5 * e) f = (2 * np.pi)**k g = np.linalg.det(covar) denom = np.sqrt(f * g) pdf = numer / denom return np.log(pdf) print("\nBegin multivariate Gaussian log-pdf demo ") u = np.array([1.0, 3.0, 2.0], dtype=np.float32) v = np.array([1.0, 1.0, 1.0], dtype=np.float32) covar = np.diag(v) x = np.array([1.5, 1.5, 1.5], dtype=np.float32) print("\nu = ", end=""); print(u) print("\ncovar = ") print(covar) print("\nx = ", end=""); print(x) logp1 = sps.multivariate_normal.logpdf(x, u, covar) print("\nlogpdf(x) using scipy logpdf() = %0.4f " % logp1) logp2 = my_logpdf(x, u, covar) print("\nlogpdf(x) using my_logpdf() = %0.4f " % logp2) print("\nEnd demo ")  Posted in Machine Learning | Leave a comment My Top Ten Favorite Science Fiction Movies of the 2000s There are many science fiction movies that I like from the decade of 2000 to 2009. Here are my 10 favorites. They’re not necessarily the best movies of the decade, but they’re ones I’d take with me if I was going on a long trip. 1. The Chronicles of Riddick (2004) – A wildly creative movie where Richard Riddick (played by actor Vin Diesel) is a Furyan warrior who defies the powerful race of Necromongers. I liked the action, plot, and cinematography. The movie was a box office disappointment and got poor reviews from critics, but it’s one of my favorite sci fi films of the 2000s. 2. Minority Report (2002) – A mystery story set in the year 2054. The PreCrime department arrests criminals before they commit crimes, based on the ability of three psychics called Precogs to see the future. Actor Tom Cruise plays Chief of PreCrime John Anderton, who is framed for a future murder. 3. Deja Vu (2006) – Agent Douglas Carlin (played by Denzel Washington) of the Bureau of Alcohol, Tobacco and Firearms (ATF) is sent back in time to prevent a domestic terrorist attack that takes place on a passenger ferry in New Orleans. Very clever plot with time travel paradoxes nicely explained. 4. Star Trek (2009) – A reboot of the original TV series in an alternate reality. Captain James T. Kirk (Chris Pine) and Spock (Zachary Quinto) are given command of the spaceship USS Enterprise. They combat Nero (Eric Bana), a Romulan from the future. 5. Iron Man (2008) – Wealthy businessman and engineering genius Tony Stark (actor Robert Downey Jr.) is captured by Arab terrorists but builds an amazing suit of armor and escapes. Stark later upgrades the suit and defeats his evil business partner played by actor Jeff Bridges. I don’t count this as a superhero movie because the ideas of Iron Man’s suit are (vaguely) feasible. 6. Paycheck (2003) – Michael Jennings (actor Ben Affleck) is a brilliant engineer who analyzes technologies and recreates them. To protect himself from legal problems, he wipes his memory clean after each job. On one job he builds a device to see into the future. I really liked the part where he left himself with 20 seemingly random objects that turn out to save him from capture: pair of sunglasses, bus pass, paperclip, a fortune from a fortune cookie, etc. 7. Space Cowboys (2000) – Four old astronauts from the beginning of the space program, who are now in their 60s (played by actors Tommy Lee Jones, Clint Eastwood, James Garner, and Donald Sutherland), are brought out of retirement. They must work with young astronauts to travel into space to deactivate an orbiting satellite that has nuclear missiles. 8. Surrogates (2009) – In the near future, people sit in pods and live their lives through remote controlled surrogate androids. FBI agent Tom Greer (actor Bruce Willis) investigates mysterious deaths. In the end, the surrogate system is destroyed and people must live as themselves. 9. The Cave (2005) – A group of cave-divers and scientists become trapped while exploring a water-filled cave system in Romania. They run into a pack of deadly creatures who turn out to be mutated humans, bats, and other animals. Very scary movie for me. Another movie I like that got bad reviews. 10. A Sound of Thunder (2005) – In 2055, a company offers rich people the chance to travel back in time to the days of dinosaurs. One time traveler accidentally steps on a butterfly which completely changes the history of evolution on the planet — and not for the better. This movie got almost unanimously terrible reviews and ended several careers, but I think it’s quite entertaining. Honorable Mention There were many movies that I liked but that didn’t quite make my top 10 for the decade. Here are a few: The Core (2003) – Scientists travel to the center of the Earth to explode a nuclear device to stabilize the planet. Dreamcatcher (2004) – Evil shape-shifting aliens invade Earth but are ultimately defeated by a different type of good alien. Mimic 2 (2001) – A human-sized mutated insect preys on inner city residents and wants to mate with a human female. Outlander (2008) – Kainan (Jim Caviezel) is an alien warrior who crash lands on Earth in the year 600 AD and kills an alien monster. And there were several very popular movies in the decade that I just didn’t like very much: Avatar (2009), three of the Star Wars movies (2002, 2005, 2008), A.I. Artificial Intelligence (2001), Planet of the Apes (2001), Terminator 3: Rise of the Machines (2003), War of the Worlds (2005), Transformers (2007). Posted in Top Ten | Leave a comment Researchers Explore Differential Privacy on Pure AI I contributed to an article titled “Researchers Explore Differential Privacy” on November 2021 edition of the Pure AI web site. See https://pureai.com/articles/2021/11/08/differential-privacy.aspx. In a nutshell, differential privacy consists of a set of techniques designed to prevent the leakage of sensitive data. In most situations, when a dataset is queried, instead of returning an exact answer, a small amount of random noise is added to the query result. The query result is now an approximation, but an adversary cannot easily use the result to discover sensitive information. Differential privacy is complicated and cannot be explained quickly. But loosely, one way to think about DP is that an algorithm is differentially private if you generate a data query result and can’t use the result to determine information about a particular person. For example, suppose you have an artificially tiny dataset of just five items with sensitive age information: ID Name Age ----------------- 001 Adams 40 002 Baker 30 003 Chang 50 004 Dunne 20 005 Eason 36  And suppose you can query the dataset for average age of two or more data items. If the average age of people 001 – 003 is 40.0 (therefore sum of ages is 120.0) and the average age of people 001 – 004 is 35.0 (therefore sum of ages is 140.0) then the age of person 004 is the difference of the sums = 140.0 – 120.0 = 20.0 years. Sensitive data has been leaked so this system is not differentially private. When adding noise to a query result, the noise is usually selected from the Laplace distribution which has theoretical advantages over the Gaussian (bell-shaped) distribution. The term “differential” in differential privacy is related to the idea of query results that differ by one item. Differentially private systems allow aggregate results, such as the average age of residents in a city from census data, to be made available without revealing specific information about an individual. Differential privacy is similar in some respects to cryptography in the sense that DP is paradoxically simple and complicated at the same time. In cryptography, to make a message secret all you have to do is scramble the message in some way. But scrambling a message so that an adversary cannot de-scramble it is astonishingly tricky. In the same way, to make a data query differentially private, all you have to do is add some random noise to the result. But the details and tradeoffs between query result accuracy and system privacy are quite tricky. There’s a lot that can go wrong in computer security. And there’s a lot that can go wrong when taking photographs with monkeys. Left: A baby monkey climbing on arm is not a happy experience. Center: The monkey is a little bit too unfriendly. Right: The male monkey is feeling a little bit too romantic. Posted in Machine Learning | Leave a comment NFL 2021 Week 12 Predictions – Zoltar Likes the Packers to Cover Against the Rams Zoltar is my NFL football prediction computer program. It uses reinforcement learning and a neural network. Here are Zoltar’s predictions for week #12 of the 2021 season. It usually takes Zoltar about four weeks to hit his stride and takes humans about eight weeks to get up to speed, so weeks six through nine are usually Zoltar’s sweet spot. After week nine, injuries start having a big effect. Zoltar: bears by 0 dog = lions Vegas: bears by 4.5 Zoltar: cowboys by 6 dog = raiders Vegas: cowboys by 7 Zoltar: bills by 0 dog = saints Vegas: bills by 3 Zoltar: steelers by 0 dog = bengals Vegas: bengals by 4 Zoltar: colts by 1 dog = buccaneers Vegas: buccaneers by 3 Zoltar: texans by 6 dog = jets Vegas: texans by 3 Zoltar: falcons by 0 dog = jaguars Vegas: falcons by 1 Zoltar: dolphins by 4 dog = panthers Vegas: panthers by 1 Zoltar: titans by 0 dog = patriots Vegas: patriots by 2.5 Zoltar: eagles by 0 dog = giants Vegas: eagles by 3 Zoltar: chargers by 0 dog = broncos Vegas: chargers by 3.5 Zoltar: packers by 6 dog = rams Vegas: packers by 1.5 Zoltar: fortyniners by 2 dog = vikings Vegas: fortyniners by 2.5 Zoltar: ravens by 6 dog = browns Vegas: ravens by 5.5 Zoltar: redskins by 2 dog = seahawks Vegas: seahawks by 3.5  Zoltar theoretically suggests betting when the Vegas line is “significantly” different from Zoltar’s prediction. In mid-season I usually use 3.0 points difference but for the first few weeks of the season I go a bit more conservative and use 4.0 points difference as the advice threshold criterion. In middle weeks I sometimes go ultra-aggressive and use a 1.0-point threshold. Note: Because of Zoltar’s initialization (all teams regress to an average power rating) and other algorithms, Zoltar is much too strongly biased towards Vegas underdogs. I need to fix this. For week #12: 1. Zoltar likes Vegas underdog Lions against the Bears 2. Zoltar likes Vegas underdog Steelers against the Bengals 3. Zoltar likes Vegas underdog Colts against the Buccaneers 4. Zoltar likes Vegas underdog Dolphins against the Panthers 5. Zoltar likes Vegas underdog Broncos against the Chargers 6. Zoltar likes Vegas favorite Packers over the Rams 7. Zoltar likes Vegas underdog Redskins against the Seahawks For example, a bet on the underdog Lions against the Bears will pay off if the Lions win by any score, or if the favored Bears win but by less than the point spread of 4.5 points (in other words, by 4 points or less). Theoretically, if you must bet$110 to win \$100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #11, against the Vegas point spread, Zoltar went 3-4 (using the standard 3.0 points as the advice threshold). Overall, for the season, Zoltar is 41-33 against the spread (~55%).

Just for fun, I track how well Zoltar does when just trying to predict just which team will win a game. This isn’t useful except for parlay betting. In week #11, just predicting the winning team, Zoltar went 8-7 which is very poor — as has been the case for the past few weeks, there were many upsets in week #11.

In week #11, just predicting the winning team, Vegas — “the wisdom of the crowd” also went 9-6 which is poor but better than Zoltar.

Zoltar sometimes predicts a 0-point margin of victory, which means the two teams are evenly matched. There are seven such games in week #12. In those situations, to pick a winner (only so I can track raw number of correct predictions) in the first few weeks of the season, Zoltar picks the home team to win. After that, Zoltar uses his algorithms to pick a winner.

Left: Electric football was invented in the late 1940s. This is a 1960s era version. The game is still very popular today. Center: Strat-O-Matic football was introduced in 1968 and was intended for teen boys as well as adults. The statistics of the game fascinated me when I was young and probably influenced my love of mathematics. Right: The 3M Pro Football game was introduced in 1966 but is no longer manufactured. I’ve played it a few times and enjoyed it a lot.

Working with Multi-Dimensional Arrays is Difficult

I often teach introductory machine learning to software engineers who have a lot of programming experience but have little experience with ML. One of the biggest challenges for engineers who are new to ML is something nobody ever talks about — working with multi-dimensional arrays.

The NumPy library has dozens of ndarray (“n-dimensional array”) functions. I was working on a problem where I needed to concatenate/merge two 2D arrays. The np.concatenate() function is overkill in the sense that it has many optional parameters.

Sometimes I prefer to implement helper functions from scratch, probably because I worked for many years in C/C++/C# environments where implementing from scratch was the standard approach for multi-dimensional array functions. Unfortunately, when working with Python, a from-scratch implementation of an array function is usually much slower than a library implementation because Python from-scratch looping is slow.

The moral of the story is that implementing ML systems has many layers — it’s difficult just to get a system to work, but there are also many subtle issues such as implementing-from-scratch vs. using library functions.

I was in Manhattan, New York City, for several days not too long ago. I spent one full day walking around from the Brooklyn Bridge up to Central Park. I was struck by the way buildings are concatenated together, producing incredible urban density. People who live in such conditions should read about the Rat Utopia experiment by John Calhoun.

Demo code:

# concat_demo.py

import numpy as np

def my_concat_rows(a, b):
(a_rows, a_cols) = a.shape  # mxn
(b_rows, b_cols) = b.shape  # rxn
result = np.zeros((a_rows + b_rows, a_cols))
for i in range(a_rows):
for j in range(a_cols):
result[i][j] = a[i][j]  # or result[i,j] = a[i,j]
for i in range(b_rows):  # 0, 1, ..
for j in range(a_cols):
result[i+a_rows][j] = b[i][j]
return result

a = np.array([[1.1, 1.2, 1.3],
[2.1, 2.2, 2.3],
[3.1, 3.2, 3.3],
[4.1, 4.2, 4.3]])  # 4x3

b = np.array([[5.1, 5.2, 5.3],
[6.1, 6.2, 6.3]])  # 2x3

print("\na = ")
print(a)
print("\nb = ")
print(b)

c = np.concatenate((a,b), axis=0) # by rows
print("\nc = ")
print(c)

d = my_concat_rows(a, b)
print("\nd = ")
print(d)


The Kendall Tau Distance For Permutations Example Python Code

Suppose you have two mathematical permutations such as p1 = (2,0,3,1) and p2 = (0,2,1,3) and you want to measure the distance / difference between the permutations. This is a surprisingly tricky question. One approach is to compute the Kendall Tau (KT) distance.

The KT distance is the number of pairs of permutation elements that have a different ordering in the two permutations. The idea is best explained by example. For p1 and p2 above, there are 6 possible pairs of elements. Using the elements of p1 as the basis the pairs are: 2-0, 2-3, 2-1, 0-3, 0-1, 3-1. In p2, the 2-0 pair order is reversed (0-2 in p2), and the 3-1 pair is reversed (1-3 in p2), but the other four pairs have the same ordering, and so the KT distance is 2. The normalized KT distance is 2/6 = 0.6667 where 6 is the total number of pairs.

The Wikipedia article on KT distance is awful, but the Wikipedia article does present a good example (on the right).

OK. I don’t like to criticize. But the Wikipedia entry on Kendal Tau distance is one of the weakest technical articles I’ve come across . . ever. The KT article is often misleading, is contradictory in several places, and has obvious errors. The only good part about the Wikipedia article is an example where it shows how to compute KT distance for 1-based permutations p1 = (1,2,3,4,5) and p2 =(3,4,1,2,5).

The Wikipedia article presents a Python implementation of KT distance. The code computes distance incorrectly (returning twice as many mis-orderings as correct) but then computes a correct value of normalized distance by dividing by twice as many possible orderings!

The Wikipedia article also mixes information about Kendall Tau correlation (a value between -1 and +1) with normalized KT distance (a value between 0 and 1) — similar concepts but different. Additionally, the Wikipedia article presents some C language code that is misleading because it computes KT distance that works only for a special case when one of the permutations is (0,1,2,3,..) or (1,2,3,..).

Here’s my Python implementation of KT distance:

def my_kendall_tau_dist(x, y):
# x, y are 0-based lists or np.arrays
n = len(x)
index_of = [None] * n  # lookup into y
for i in range(n):
v = y[i]; index_of[v] = i

d = 0  # distance = number pair misorderings
for i in range(n):
for j in range(i+1, n):
if index_of[x[i]] "gt" index_of[x[j]]:  # replace "gt"
d += 1
normer = n * (n - 1) / 2.0
nd = d / normer  # normalized distance
return (d, nd)


The KT distance is unusual in the sense that it’s easy to state and understand, but the implementation has many subtleties.

The moral of the story is that Wikipedia is a great resource but you shouldn’t assume the information there is always correct.

Poker and card games are all about permutations. Some people can toss playing cards with amazing skill. My college roommate Ed Koolish was an expert card thrower. Center: The current world record for farthest card thrown is held by Rick Smith who threw a card 216 feet, 4 inches on December 2, 2002. Here Smith throws cards at a target. Left and Right: I don’t think there are any competitions for card tossing style.

Demo code:

# kendall_tau_dist_demo.py

# kendall tau distance for permutations
# kt distance is not same as kt correlation
# replace "lt" and "gt" with symbols

import numpy as np

def my_kendall_tau_dist(x, y):
# number bubble sort swaps = number misordered pairs
# x, y are 0-based lists or np.arrays
n = len(x)
index_of = [None] * n  # lookup into y
for i in range(n):
v = y[i]; index_of[v] = i

d = 0  # distance = number pair misorderings
for i in range(n):
for j in range(i+1, n):
if index_of[x[i]] "gt" index_of[x[j]]:
d += 1
normer = n * (n - 1) / 2.0
nd = d / normer  # normalized distance
return (d, nd)

def normalised_kendall_tau_distance(values1, values2):
# from Wikipedia
n = len(values1)
assert len(values2) == n, \
"Both lists have to be of equal length"
i, j = np.meshgrid(np.arange(n), np.arange(n))
a = np.argsort(values1)
b = np.argsort(values2)
ndisordered = \
np.logical_or(np.logical_and(a[i] "lt" a[j], \
b[i] "gt" b[j]), np.logical_and(a[i] "gt" a[j], \
b[i] "lt" b[j])).sum()
normed = ndisordered / (n * (n-1))  # no 2
return ndisordered, normed

print("\nBegin kendall tau distance demo \n")

x = [0,3,1,6,2,5,4]  # list
y = [4,5,2,6,1,3,0]  # reversed x
z = [1,3,2,6,5,4,0]

print("x = ", end=""); print(x)
print("y = ", end=""); print(y)
print("z = ", end=""); print(z)

d1, nd1 = normalised_kendall_tau_distance(x, y)
d2, nd2 = my_kendall_tau_dist(x, y)
d3, nd3 = my_kendall_tau_dist(x, z)
d4, nd4 = my_kendall_tau_dist(y, z)
d5, nd5 = my_kendall_tau_dist(x, x)

print("\nxy dist and norm dist using Wikipedia kt(): \
%3d  %0.4f " % (d1,nd1))
print("xy dist and norm dist using my_kt_dist()  : \
%3d  %0.4f " % (d2,nd2))
print("xz dist and norm dist using my_kt_dist()  : \
%3d  %0.4f " % (d3,nd3))
print("yz dist and norm dist using my_kt_dist()  : \
%3d  %0.4f " % (d4,nd4))
print("xx dist and norm dist using my_kt_dist()  : \
%3d  %0.4f " % (d5,nd5))

print("\nWikipedia 1-based example: ")
x = np.array([1,2,3,4,5], dtype=np.int64)  # np.array
y = np.array([3,4,1,2,5], dtype=np.int64)
print("x = ", end=""); print(x)
print("y = ", end=""); print(y)
x -= 1  # convert to 0-based
y -= 1
(d, nd) = my_kendall_tau_dist(x, y)
print("norm distance = %0.4f " % nd)

print("\nEnd demo ")

Posted in Machine Learning | 2 Comments