An alternative to gradient descent is evolutionary optimization. Evolutionary optimization doesn't use gradients, but evolutionary techniques require much, much more processing power, which is why they're rarely used.

Evolutionary optimization maintains a population of possible solutions (good weights). In an iterative process, two "good" possible solutions are selected and then combined to create a new, presumably better, possible solution. Therefore, one of the many sub-problems when using evolutionary optimization is selecting "good" possible solutions.

At any point in time, you don't always want to pick the two best solutions, because a non-best possible solution could still have good characteristics.

There are several techniques to choose good, but not necessarily best items. The most common techniques are roulette wheel selection and tournament selection.

Suppose you have 10 possible solutions, and their associated errors are [0.1, 0.2, . . 1.0]. So the best solution is at [0] and the second best is at [1] and so on. To use tournament selection, you select a random subset, and then pick the best from the random subset. Suppose you set the percentage of the subset to 0.4 (40%). This is often called the tau value. Then suppose the 40% random subset items are [5, 3, 6, 4] and so the associated errors are [0.6, 0.4, 0.7, 0.5]. From this subset, you’d pick item [3] because it has the smallest error.

The tau value controls selection pressure. If tau = 1.0 you always examine all the items, and so you'll always get the best item. If tau is small, say 0.2 (20%), then you have a much greater chance of getting a non-best item.

Here is some C# code for tournament selection:

```
static int Select(double[] errors, double tau, Random rnd)
{
  // pick best from a random tau-percent of the population
  int popSize = errors.Length;
  int numItems = (int)(popSize * tau);
  int[] allIndices = new int[popSize];
  for (int i = 0; i < popSize; ++i)
    allIndices[i] = i;
  Shuffle(allIndices, rnd);
  int bestIdx = allIndices[0];
  double bestErr = errors[allIndices[0]];
  for (int i = 0; i < numItems; ++i)
  {
    int idx = allIndices[i];
    if (errors[idx] < bestErr)
    {
      bestIdx = idx;
      bestErr = errors[idx];
    }
  }
  return bestIdx;
}
```

The idea is to use a Shuffle() function to scramble the order of the indices to pick a few randomly. Shuffle() uses the Fisher-Yates mini-algorithm:

```
static void Shuffle(int[] vec, Random rnd)
{
  int n = vec.Length;
  for (int i = 0; i < n; ++i)
  {
    int ri = rnd.Next(i, n);
    int tmp = vec[ri];
    vec[ri] = vec[i];
    vec[i] = tmp;
  }
}
```

For evolutionary optimization, you want two good items that are not the same:

```
static int[] SelectTwo(double[] errors, double tau, Random rnd)
{
  int[] result = new int[2];
  int ct = 0;
  result[0] = Select(errors, tau, rnd);
  while ((result[1] = Select(errors, tau, rnd)) == result[0] && ct < 100)
    ++ct;
  return result;
}
```

Here I just use brute force to repeatedly pick a second item until it’s not the same as the first. I set a sanity stop of 100 tries.

Notice the SelectTwo() function calls Select() which calls Shuffle(). When writing complex software, it’s usually a good idea to mask complexity by refactoring into helper functions.
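Putting the three functions together, a minimal console demo might look like this (a sketch; the errors array is the 10-item example from above, and the seed value is arbitrary):

```csharp
using System;

class TournamentDemo
{
  public static void Shuffle(int[] vec, Random rnd)
  {
    // Fisher-Yates shuffle
    int n = vec.Length;
    for (int i = 0; i < n; ++i)
    {
      int ri = rnd.Next(i, n);
      int tmp = vec[ri]; vec[ri] = vec[i]; vec[i] = tmp;
    }
  }

  public static int Select(double[] errors, double tau, Random rnd)
  {
    // pick best from a random tau-percent of the population
    int popSize = errors.Length;
    int numItems = (int)(popSize * tau);
    int[] allIndices = new int[popSize];
    for (int i = 0; i < popSize; ++i) allIndices[i] = i;
    Shuffle(allIndices, rnd);
    int bestIdx = allIndices[0];
    double bestErr = errors[allIndices[0]];
    for (int i = 0; i < numItems; ++i)
    {
      int idx = allIndices[i];
      if (errors[idx] < bestErr) { bestIdx = idx; bestErr = errors[idx]; }
    }
    return bestIdx;
  }

  public static int[] SelectTwo(double[] errors, double tau, Random rnd)
  {
    int[] result = new int[2];
    int ct = 0;
    result[0] = Select(errors, tau, rnd);
    while ((result[1] = Select(errors, tau, rnd)) == result[0] && ct < 100)
      ++ct;
    return result;
  }

  static void Main()
  {
    double[] errors = { 0.1, 0.2, 0.3, 0.4, 0.5,
      0.6, 0.7, 0.8, 0.9, 1.0 };
    Random rnd = new Random(0);
    int[] parents = SelectTwo(errors, 0.4, rnd);
    Console.WriteLine("selected parents: " +
      parents[0] + " " + parents[1]);
  }
}
```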

*The Venice Carnival has featured beautiful masks and costumes since the 12th century.*

Perceptron classification is arguably the most rudimentary machine learning (ML) technique. The perceptron technique can be used for binary classification, for example predicting if a person is male or female based on numeric predictors such as age, height, weight, and so on.

From a practical point of view, perceptron classification is useful mostly to provide a baseline result for comparison with more powerful ML techniques such as logistic regression and k-nearest neighbors. Perceptron classification is also interesting from a historical point of view as a predecessor to neural networks.

Perceptron classification is quite simple to implement but the technique only works well with simple data that is completely, or nearly, linearly separable.

In my article, I show a demo with a 10-item subset of the well-known Banknote Authentication dataset. The goal is to predict if a banknote (think euro or dollar bill) is authentic (coded -1) or a forgery (coded +1) based on four predictor values (image variance, skewness, kurtosis, and entropy).

My demo uses a variation of perceptron classification called averaged perceptron.
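The article explains the details, but the core idea is easy to sketch. A standard perceptron nudges its weights after each misclassified training item; an averaged perceptron additionally keeps a running sum of the weights after every training item and uses the averaged weights for prediction, which smooths out the oscillation of the raw weights. Here is a minimal sketch (my own illustration with made-up data, not the article's demo code; assumes class labels of -1 and +1):

```csharp
using System;

class AveragedPerceptronSketch
{
  // train an averaged perceptron; labels y[i] are -1 or +1
  // returns averaged weights, with the bias in the last cell
  public static double[] Train(double[][] x, int[] y,
    int maxEpochs, double lr)
  {
    int dim = x[0].Length;
    double[] w = new double[dim + 1];    // weights + bias in last cell
    double[] acc = new double[dim + 1];  // running sum of weights
    int count = 0;
    for (int ep = 0; ep < maxEpochs; ++ep)
    {
      for (int i = 0; i < x.Length; ++i)
      {
        double z = w[dim];  // bias
        for (int j = 0; j < dim; ++j) z += w[j] * x[i][j];
        int pred = (z >= 0.0) ? +1 : -1;
        if (pred != y[i])  // misclassified: nudge the weights
        {
          for (int j = 0; j < dim; ++j) w[j] += lr * y[i] * x[i][j];
          w[dim] += lr * y[i];
        }
        for (int j = 0; j <= dim; ++j) acc[j] += w[j];  // accumulate
        ++count;
      }
    }
    for (int j = 0; j <= dim; ++j) acc[j] /= count;  // average
    return acc;
  }

  public static int Predict(double[] item, double[] wts)
  {
    int dim = item.Length;
    double z = wts[dim];  // bias
    for (int j = 0; j < dim; ++j) z += wts[j] * item[j];
    return (z >= 0.0) ? +1 : -1;
  }

  static void Main()
  {
    // tiny linearly separable dummy data
    double[][] x = {
      new double[] { 1.0, 2.0 }, new double[] { 2.0, 1.0 },
      new double[] { 5.0, 6.0 }, new double[] { 6.0, 5.0 } };
    int[] y = { -1, -1, +1, +1 };
    double[] wts = Train(x, y, 20, 1.0);
    Console.WriteLine(Predict(new double[] { 1.5, 1.5 }, wts));  // -1
  }
}
```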

Although perceptron classification is simple and elegant, logistic regression is only slightly more complex and usually gives better results.

Some of my colleagues have asked me why averaged perceptron classification is part of the new ML.NET library. As it turns out, averaged perceptron was the first classifier algorithm implemented in the predecessor to ML.NET, an internal Microsoft library from Microsoft Research named TMSN, which was later renamed to TLC. The averaged perceptron classifier was implemented first because it is so simple. It was retained from version to version, not because of its practical value, but because removing it would require quite a bit of effort.

*The word “perceptron” was derived from “perception”. Here are three random images from an Internet search for “perception art”.*

```
Zoltar: chiefs by 3 dog = fortyniners Vegas: chiefs by 1
```

Both Zoltar and Las Vegas slightly favor the Kansas City Chiefs over the San Francisco 49ers.

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. Therefore, Zoltar doesn’t really have a recommendation for this game. But, if Zoltar was forced to pick, he’d say bet on the Chiefs. Such a bet would pay off if the Chiefs win by more than 1 point (in other words, 2 points or more). If the Chiefs win by exactly 1 point the bet is a push. Any other result would be a loss of the bet.

===

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.
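The 53% figure comes from simple breakeven arithmetic:

```
Let p = probability of winning a bet.
Breakeven: p * $100 = (1 - p) * $110
So p = 110 / 210, which is approximately 0.524, or about 53%.
```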

In week #20 Zoltar went 0-0 against the Vegas point spread because he had no hypothetical recommendations.

For the 2019 season, through week #20, Zoltar is an OK 54-34 (61% accuracy) against the Vegas spread.

Just for fun, I track how well Zoltar and Las Vegas do when trying to predict only which team will win (but not by how much). This isn’t useful except for parlay betting.

Just predicting winners, Zoltar was 2-0. Las Vegas was also 2-0 last week. Both Zoltar and Vegas correctly picked the Chiefs to beat the Titans, and the 49ers to beat the Packers.

For the season, just picking winners, Zoltar is a pretty decent 178-85 (67%) and Vegas is also pretty good at 170-90 (65%).

Note: Vegas has fewer games than Zoltar because Vegas had three pick’em games. Also there was one tie game in week #1 (Lions at Cardinals).

*My system is named after the Zoltar fortune teller machine you can find in arcades. Arcade Zoltar is named after the magic machine in the 1988 fantasy movie “Big” starring Tom Hanks. And the movie machine was probably named after the 1960s era Zoltan fortune teller arcade machine.*

I instantly ran into an unexpected problem — I could not find summary data anywhere. After spending a lot of time looking, I got mildly irritated. I decided that I wouldn’t allow myself to be defeated in my quest for data.

The SAT organization publishes a summary report in PDF format every year. I found and opened each annual report from 2000 to 2019, manually extracted the SAT math scores, dropped the numbers into Excel, and made a graph. The process was quite time-consuming.

*SAT math scores from 2000 – 2019 are remarkably stable.*

The first thing I noticed is that SAT math scores are remarkably stable over time. The uptick in scores starting in 2017 was due to a change in the SAT test, not a sudden surge of math ability in high school seniors. Put another way, all efforts that have been aimed at reducing the achievement gap between groups have had virtually no effect whatsoever. Interestingly, a few years ago it was speculated that family income has a great effect on math achievement but that hypothesis/myth has been thoroughly debunked – the poorest-family majority race students score much higher on math than the richest-family minority students. What this means is anybody’s guess.

The second thing I noticed is that the SAT people stopped breaking down scores by race and gender. For all years before 2017 the annual reports were broken down so you could see the scores by race and gender, but from 2017 onward, gender was combined for each race. Why this change to reduce information occurred is beyond me, but changes in reporting like this are usually motivated by political factors rather than math factors. Perhaps the fact that Black females consistently score dramatically lower in math than other groups is not a politically happy result.

```
year   wm   wf  white   bm   bf  black
2000  549  514   530   436  419   426
2001  550  515   531   436  419   426
2002  552  517   533   438  419   427
2003  552  518   534   436  420   426
2004  550  514   531   438  420   427
2005  554  520   536   442  424   431
2006  555  520   536   438  423   429
2007  553  519   534   437  423   429
2008  555  521   537   434  420   426
2009  555  520   536   435  420   426
2010  555  519   536   436  422   428
2011  552  520   535   435  422   427
2012  554  520   536   436  422   428
2013  552  519   534   436  423   429
2014  552  519   534   435  423   429
2015  551  518   534   435  422   428
2016  550  518   533   430  422   425
2017   na   na   553    na   na   462
2018   na   na   557    na   na   463
2019   na   na   553    na   na   457
```

There’s no big moral to this story. The point is that even in a digital age, sometimes data is difficult to access. And in the end, numbers are just numbers; applying statistics that describe a group to an individual person is rarely a good idea.

*Through 2016, SAT reported scores by race and gender (for example, the 2003 report is on the left) but starting in 2017 scores were combined by race (2017 report on right). Why the SAT people did this is a mystery to me.*

I was looking at multiclass logistic regression recently. Regular logistic regression is a binary classification technique, for example, predicting if a person is male (0) or female (1) based on predictors/features such as height, shoe size, income, and so on.

*This demo shows that it’s possible to use Softmax for binary logistic regression but you have to hack a bit by using a dummy set of 0-value weights and bias — not a good idea.*

Multiclass logistic regression is an extension that can predict a variable that takes one of three or more values, for example, predicting if a person is a political conservative, moderate, or liberal.

*Note: The word "multiclass" is not a dictionary word so it should really be spelled as "multi-class" with a hyphen. But, as is often the case, machine learning terminology ignores convention and creates terms on the fly. I find the habit of researchers and engineers creating words to be quite annoying.*

For regular logistic regression, suppose you have four predictors (x0, x1, x2, x3). The output is computed like so:

```
z = (w0 * x0) + (w1 * x1) + (w2 * x2) + (w3 * x3) + b
p = logsig(z)
```

where w0, w1, w2, w3 are weights and b is the bias. The p value will be between 0 and 1. The generic logsig(a) function is:

logsig(a) = 1.0 / (1.0 + exp(-a))

Notice you have to be extremely careful with the minus signs.

Suppose you have three classes. Multiclass logistic regression output is computed as:

```
z0 = (w00 * x0) + (w10 * x1) + (w20 * x2) + (w30 * x3) + b0
z1 = (w01 * x0) + (w11 * x1) + (w21 * x2) + (w31 * x3) + b1
z2 = (w02 * x0) + (w12 * x1) + (w22 * x2) + (w32 * x3) + b2
P = softmax(z0, z1, z2)
```

Here w is a weights matrix where the first index represents the predictor and the second index is the class. So w31 is the weight for predictor [3] and class [1]. The P result is a vector with three values that sum to 1 so that they can be interpreted as probabilities. The generic softmax function for three values is defined as:

```
sum = exp(z0) + exp(z1) + exp(z2)
P0 = exp(z0) / sum
P1 = exp(z1) / sum
P2 = exp(z2) / sum
```

Notice there are no minus signs here.
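To make the computation concrete, here is a minimal C# implementation of softmax (a sketch; subtracting the max value before exponentiating is a standard trick, not shown in the equations above, that prevents arithmetic overflow for large z values without changing the result):

```csharp
using System;

class SoftmaxDemo
{
  public static double[] Softmax(double[] z)
  {
    // subtract max for numerical stability:
    // softmax(z) == softmax(z - max)
    double max = z[0];
    for (int i = 1; i < z.Length; ++i)
      if (z[i] > max) max = z[i];
    double sum = 0.0;
    double[] result = new double[z.Length];
    for (int i = 0; i < z.Length; ++i)
    {
      result[i] = Math.Exp(z[i] - max);
      sum += result[i];
    }
    for (int i = 0; i < z.Length; ++i)
      result[i] /= sum;  // now the values sum to 1.0
    return result;
  }

  static void Main()
  {
    double[] p = Softmax(new double[] { 1.0, 2.0, 3.0 });
    Console.WriteLine("{0:F4} {1:F4} {2:F4}", p[0], p[1], p[2]);
    // 0.0900 0.2447 0.6652
  }
}
```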

Now, as it turns out, there is a very close but complex mathematical relationship between logistic sigmoid and softmax. (The Wikipedia article on logistic regression explains it quite well). It’s possible to use variations of logistic sigmoid or softmax for either binary or multiclass logistic regression, but from an engineering perspective, for binary logistic regression you should use logistic sigmoid and for multiclass logistic regression you should use softmax. Period.

*Three illustrations by artist Klaus Burgle (1926-2015). He did many German science fiction book and magazine covers in the 1950s and 1960s.*

The 2020 Las Vegas event is just around the corner — March 1-6. See https://vslive.com/Home.aspx.

*A couple of screenshots of the event Web site. I’m in the image on the right. I don’t look that good in real life.*

Before I describe the details, let me cut to the chase and mention that early registration is good through tomorrow, Friday, Jan. 16, 2020. You can save $400.

VS Live is one of the longest-running technical conferences — I think this is the 27th consecutive year. The fact that VS Live has such longevity is a strong testament to its quality. When I talk to attendees at VS Live, it’s not uncommon for them to tell me they’ve attended many times.

As the name of the event suggests, VS Live is intended primarily for engineers and managers who work with the Microsoft technology stack. Unlike some conferences that have a lot of thinly veiled Marketing and Sales content, VS Live is primarily an educational event. I've learned a ton of useful and valuable information at every event.

*Two photos from last year’s event in Las Vegas.*

I mentally categorize the technical events I attend by size: small (200 to 500 attendees), medium (500 to 2000 attendees), large (more than 2000 attendees). Each size has strengths and weaknesses. VS Live falls into my small category and its primary advantage is that the size fosters impromptu conversations with other speakers and attendees, where some of the most interesting information is exchanged.

All good conferences like VS Live are expensive. But VS Live delivers good value for the money in my opinion. It's usually not feasible to pay for such an event out of pocket, but many companies will fund your attendance as part of training. The conference Web site has a Sell Your Boss page at https://vslive.com/events/las-vegas-2020/information/sell-your-boss.aspx.

In the end, only you can decide if attending VS Live makes sense for you. So I encourage you to check out the Web site and look the agenda over.

*Left: “The Hangover Bail Bonds” — “because last night was no movie” – best motto award. Center: “Jesus Christ Bail Bonds” – most optimistic award. Right: “It Wasn’t Me Bail Bonds” – best name award.*

```
Zoltar: chiefs by 6 dog = titans Vegas: chiefs by 7.5
Zoltar: fortyniners by 1 dog = packers Vegas: fortyniners by 7.5
```

Zoltar theoretically suggests betting when the Vegas line is more than 3.0 points different from Zoltar’s prediction. Therefore, for week #20 Zoltar has one hypothetical suggestion.

Zoltar likes the Vegas underdog Green Bay Packers against the San Francisco 49ers. Zoltar thinks the 49ers are just a tiny 1 point better than the Packers, but Las Vegas thinks the 49ers are 7.5 points better than the Packers.

A bet on the Packers will pay off if the Packers win by any score or if the 49ers win but by less than 7.5 points (in other words, 7 points or fewer).

*Update: Oops. I forgot to factor in the 49ers home field advantage. Zoltar retracts his recommendation on the Packers.*

===

Theoretically, if you must bet $110 to win $100 (typical in Vegas) then you’ll make money if you predict at 53% accuracy or better. But realistically, you need to predict at 60% accuracy or better.

In week #19 Zoltar went 0-1 against the Vegas point spread. Zoltar incorrectly liked the Vegas underdog Texans against the Chiefs. The Texans jumped off to a huge 21-0 lead and then . . . the rest of the game was not pretty for anyone who bet on the Texans.

For the 2019 season, through week #19, Zoltar is an OK 54-34 (61% accuracy) against the Vegas spread.

Just for fun, I track how well Zoltar and Las Vegas do when trying to predict only which team will win (but not by how much). This isn’t useful except for parlay betting.

Just predicting winners, Zoltar was a good 3-1. Las Vegas was also 3-1 last week. Both Zoltar and Vegas thought the Ravens would beat the Titans but the Titans won handily.

For the season, just picking winners, Zoltar is a pretty decent 176-85 (67%) and Vegas is also pretty good at 168-90 (65%).

Note: Vegas has fewer games than Zoltar because Vegas had three pick’em games. Also there was one tie game in week #1 (Lions at Cardinals).

*My system is named after the Zoltar fortune teller machine. Here are two anonymous fortune tellers plus Rita Repulsa who appeared in an Internet image search for “fortune teller”. I think maybe the crystal-like ball in Rita’s staff influenced the result.*

Basic LR binary classification is relatively simple, and the results are somewhat interpretable (as opposed to some other machine learning classification techniques, notably neural networks). But basic LR binary classification has two key weaknesses. First, the technique only works well when the training data is simple, meaning mostly linearly separable. (Note: Weirdly, LR binary classification can perform extremely poorly when the training data is completely linearly separable; the reasons are quite mathematically complicated.) Second, basic LR binary classification only works when you want to predict a binary result, as opposed to multi-class classification where you want to predict a variable that can take three or more discrete values.

There are two main extensions of basic LR that deal with these two weaknesses. The first is kernel logistic regression, which allows LR binary classification to deal with complex data that is not linearly separable. The second is multiclass logistic regression, which allows LR to deal with predicting a variable that can take three or more discrete values.

I haven’t looked at kernel logistic regression in quite a while so I thought it’d be fun to code up a demo to refresh my memory. I decided to use raw (no libraries) C# but the technique can be used with any programming language.

First, I created some dummy training data shown in the graph below. There are 21 training items. There are two predictor variables, which you can think of as a person’s age and weight. The goal is to predict a class which you can think of as male = 0 or female = 1. The data isn’t linearly separable so basic LR won’t work very well.

Kernel logistic regression is much more complicated than basic LR. For each of the 21 training items, a weight (often called an alpha value) is computed. Then, to make a prediction for a new item, you compute a sum over all training items: each training item's alpha times a kernel function applied to the item-to-predict and that training item. Then you add the bias-alpha to the sum and take the logistic sigmoid of the sum.

The result will be a p-value between 0.0 and 1.0. If the p-value is less than 0.5 the prediction is class 0, otherwise the prediction is class 1. There are many possible kernel functions and each possibility has one or more parameters — the choice of kernel and its parameters are hyperparameters that must be determined by trial and error.

```
static double ComputeOutput(double[] x, double[] alphas,
  double sigma, double[][] trainX)
{
  // x is the item to predict
  // the bias is in the last cell of alphas[]
  int n = trainX.Length;  // number of training items
  double sum = 0.0;
  for (int i = 0; i < n; ++i)
    sum += alphas[i] * Kernel(x, trainX[i], sigma);
  sum += alphas[n];  // add the bias
  return LogSig(sum);  // result is in [0.0, 1.0]
}
```
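The Kernel() and LogSig() helper functions aren't shown. Based on the sigma parameter, a radial basis function (RBF) kernel is the likely choice (that's my assumption, since RBF is the most common kernel); a sketch of the two helpers:

```csharp
using System;

class KernelHelpers
{
  public static double Kernel(double[] v1, double[] v2, double sigma)
  {
    // radial basis function (RBF) kernel
    double sumSq = 0.0;
    for (int i = 0; i < v1.Length; ++i)
      sumSq += (v1[i] - v2[i]) * (v1[i] - v2[i]);
    return Math.Exp(-sumSq / (2.0 * sigma * sigma));
  }

  public static double LogSig(double x)
  {
    // logistic sigmoid, guarding against exp() overflow
    if (x < -20.0) return 0.0;
    else if (x > 20.0) return 1.0;
    else return 1.0 / (1.0 + Math.Exp(-x));
  }

  static void Main()
  {
    Console.WriteLine(Kernel(new double[] { 1.0, 2.0 },
      new double[] { 1.0, 2.0 }, 1.0));  // identical vectors give 1
    Console.WriteLine(LogSig(0.0));      // 0.5
  }
}
```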

Anyway, good fun. When I get some free time I'll tidy up my code, write up an explanation, and publish it in the online Visual Studio Magazine — my resource of choice for information about machine learning using Microsoft technologies.

*Three images from an Internet search for “kernel art”. Left: A carved olive pit (kernel/seed). Center: Intricate cut paper sculpture. Right: I have no idea why this illustration is related to “kernel”, but it’s interesting.*

The IMPACT Conference is run by the Computer Measurement Group (CMG, https://www.cmg.org/). CMG is a not-for-profit organization that has been around for many years. It’s essentially a community of members who have a wide range of technical roles.

My keynote talk will be “Four Recent Advances in AI and Machine Learning for IT Organizations”. I will talk about ML/AI technologies and techniques that are available today, some ideas that will likely become practical within about two years, and some ideas that are a bit further out (perhaps five years).

Let me explain why I’m going to the IMPACT Conference and why you might want to consider attending this year or next year.

I get many requests to speak about machine learning and AI. This is due mostly to simple supply and demand. Machine learning and AI are potentially applicable to just about any scenario you can imagine — computer security, infrastructure optimization, legal, health care, consumer retail, and on and on. This creates a lot of demand for expert speakers at hundreds of conferences.

However, there aren’t many experts in machine learning and AI. And, I hate to be harsh, but many of my colleagues who are ML/AI experts are *terrible* speakers. I’m a very good speaker only because I had thousands of hours of practice during my days as a university professor. Public speaking is a skill that must be learned, just like most skills.

*The IMPACT Conference organizers sent me this banner image. https://cmgimpact.com/. They told me that if you register for the conference and use the code McCaffrey10 you can get a 10% discount. By Grabthar’s hammer, what a value!*

I’ve noticed that at most conferences, I always learn useful information, but it happens in unpredictable ways. Some useful information comes from the formal presentations, but just as often I pick up valuable information in unplanned, ad hoc conversations. And then sometimes the value comes from talking to someone at a conference Expo. The key is keeping your eyes and ears open and not being afraid to engage in conversations with strangers (conference attendees that is; it’s not such a good idea to engage in conversations with strangers at 3:00 AM on the Vegas Strip — trust me).

In addition to picking up useful information, I always return to work from a conference with renewed energy and enthusiasm, and I’m sure I’m more productive for my company.

Anyway, if you are a technology professional, check out the IMPACT Conference to see if it meets your career goals. Most technical conferences, even reasonably priced ones like IMPACT, are a bit too expensive for you to pay out-of-pocket. But even though quantifying the value of attending a conference is subjective, you can always make a strong case to your employer to fund you.

*The value of art is more subjective than anything in technology. Here are three paintings by artist Karol Bok. I like his work a lot but I can’t quantify its value.*

If the predictor variables are numeric, such as age, years of education, annual income and so on, then the decision rule will consist of clauses with less-than and greater-than-or-equal conditions. But what if one or more of the predictor variables is categorical, such as eye color with possible values blue, green, brown, hazel? Unlike many machine learning classifier algorithms, such as naive Bayes, a decision tree works fine with categorical data if you encode the values using an ordinal 0, 1, 2, . . scheme.

Suppose you have some raw data that looks like:

```
 age   edu   eye    party
=========================
38.0  12.0  green     0
22.0  16.0  blue      1
64.0  14.0  hazel     2
 . . .
```

If you encode the eye color so that blue = 0, green = 1, brown = 2, hazel = 3 you get:

```
 age   edu   eye   party
========================
38.0  12.0   1.0     0
22.0  16.0   0.0     1
64.0  14.0   3.0     2
 . . .
```

And now if you apply decision tree creation code, your decision rules will resemble, "IF age >= 40.0 AND edu < 14.0 AND eye < 2.0 THEN party = 1 (republican)". The eye < 2.0 clause is equivalent to eye = 0.0 (blue) or eye = 1.0 (green).

Decision trees require quite a bit of tuning. If you make your tree go deep enough, you'll eventually isolate each eye color if that's necessary. For example, "IF . . eye < 2.0 AND eye >= 1.0 . ." means eye color equals 1 (green).

Now some machine learning libraries take a much more complex approach and accept unencoded categorical data. The decision tree creation code splits using equal and not-equal conditions. For example, "IF age < 55.0 AND eye = hazel AND . ." However, implementing this scheme requires nearly twice as much effort/code as the simple approach of using ordinal encoding for categorical data.
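The ordinal encoding scheme is easy to demo. Here is a short, self-contained sketch (my own illustration, using the example rule from above) showing how an encoded eye value feeds a decision rule:

```csharp
using System;
using System.Collections.Generic;

class OrdinalEncodingDemo
{
  static void Main()
  {
    // ordinal encoding: blue = 0, green = 1, brown = 2, hazel = 3
    var eyeMap = new Dictionary<string, double> {
      { "blue", 0.0 }, { "green", 1.0 },
      { "brown", 2.0 }, { "hazel", 3.0 } };

    double age = 45.0, edu = 12.0;
    double eye = eyeMap["green"];  // 1.0

    // example rule: IF age >= 40.0 AND edu < 14.0 AND eye < 2.0
    //   THEN party = 1
    // the "eye < 2.0" clause matches blue (0.0) or green (1.0)
    if (age >= 40.0 && edu < 14.0 && eye < 2.0)
      Console.WriteLine("predicted party = 1");
    else
      Console.WriteLine("no rule match");
    // prints "predicted party = 1"
  }
}
```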

Now to be sure, I’ve left out some details and there are a few exceptions.

*Having different colored eyes is called heterochromia. Actresses Jane Seymour, Mila Kunis, and Kate Bosworth have heterochromia. I have a variation of the condition called central heterochromia where the outer part of my retina is green and the inner part is brown. Weirdly, the percentages of each color of my eyes vary significantly depending on my adrenaline level. Unlike the three talented actresses pictured, I have exactly zero non-technical talent.*