Training a Generative Adversarial Network (GAN)

A generative adversarial network (GAN) is a complex deep neural system that can be used to generate fake data based on a set of real data. This can be useful in several scenarios, including generating additional training data for a neural classifier when you have limited training data for one of the classes you’re trying to predict. GANs are most often used for image data but GANs can be used with any kind of data.

I understand GANs quite well now. But as with any kind of software system, I didn’t understand GANs until I implemented one in code and dissected every line of code. Here’s a screenshot of a GAN I created that generates fake ‘3’ UCI Digits (crude 8×8 grayscale images of a handwritten ‘3’):

While I was learning how to create and use GANs, I searched the Internet looking for information. I found many diagrams. However, in retrospect, it is clear that almost all of the information I found about GANs was posted by developers who really didn’t understand GANs at all. Post after post was basically a highly simplistic diagram that grotesquely understated how complex GANs are. Here’s an example of one of the diagrams I found on the Internet:

Now don’t get me wrong — the simple diagram is very good in the sense that it gives someone a general idea of how GANs work. But the diagram is so over-simplified it’s misleading for anyone who wants to actually implement a GAN, especially the training code. I figured I could create a more detailed diagram of how to train a GAN, so I fired up PowerPoint and created a diagram that shows one training iteration:

The first observation is that training a GAN is complicated. The second indirect observation is that, because of the complexity, there are a huge number of hyperparameters and design decisions. This makes GANs very difficult to work with. But like most things in computer science, once you understand GANs (after many hours of exploration), they’re really quite easy and fascinating.

Diagrams for computer science topics can have different levels of detail. Diagrams without much detail are good to understand general ideas, and diagrams with a lot of detail are better for understanding implementation.

When I get a chance, I’ll tidy up my demo code then write up an explanation, and then post the information here on my blog or on Visual Studio Magazine.


Street scenes of Victorian England by two of my favorite artists. The paintings have different levels of detail but that’s what art is all about. Left: A very detailed “Foregate Street, Chester” by Louise Rayner (1832-1924). Chester is a walled city in England. Right: A more abstract “London Street at Night” by John Atkinson Grimshaw (1836-1893). Grimshaw was one of the first artists to master city night scenes after the invention of electric lights made such scenes possible.

Posted in PyTorch | Leave a comment

Why I Dislike XGBoost and Why I Like XGBoost

First, the title of this blog post is moderately click-bait. I dislike many characteristics of XGBoost but I like some of them too.

XGBoost (“extreme gradient boosting”) is a huge library of many functions, with hundreds of parameters and possible argument values. XGBoost can create machine learning models for multi-class classification, binary classification, and regression. XGBoost works by creating many decision trees; each tree learns to improve on the previous trees using an error gradient. The results of all the trees are aggregated to produce the final predictions.

I rarely use XGBoost but a lot of my colleagues who are relatively new to machine learning use XGBoost so I take XGBoost out for a spin every now and then.

The pros of XGBoost: 1.) It usually works very well for relatively simple tabular data style problems (multi-class and binary classification and regression). 2.) You don’t need to know much theory to get a prediction model up and running.

The cons: 1.) Because XGBoost is essentially a black box, you have to learn an enormous number of parameters to use it to its potential. 2.) Because XGBoost is a black box, you don’t gain new skills that transfer to other scenarios. For example, a knowledge of deep neural networks à la Keras or PyTorch allows you to create all kinds of interesting and complex systems. (But the tradeoff is you have to learn a lot of ML theory.)

I installed the xgboost library (v1.4.1) using pip with the command “pip install xgboost”. Then I coded up a demo that classifies the Iris dataset. I cheated a bit (to save time) and used scikit to load the 150-item, 3-class Iris data and split it into 135 training items (90%) and 15 test items (10%). In a non-demo scenario, data preparation takes a long time.

The Iris dataset is not challenging so it’s not a surprise that the trained model predicted all 15 test items correctly.

OK, that’s my dose of XGBoost for a while. Not great fun, but good-enough fun.



Artist Linda Wooten created this beautiful dollhouse-sized model of a pirate tavern. It reminds me of my days working on the Pirates of the Caribbean ride at Disneyland when I was a student in college at UC Irvine. Creating a detailed model with custom parts is like creating a machine learning model using a low-level library like PyTorch. Creating a pirate tavern model using Lego would be like creating a machine learning model using a high-level tool like XGBoost — quicker but not as beautiful.


# xgb_iris.py

# xgboost 1.4.1
# Python 1.8.0-CPU  Windows 10

import numpy as np
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split

def main():
  print("\nBegin Iris classification using XGBoost ")
  np.set_printoptions(precision=4, suppress=True)

  iris = datasets.load_iris()
  X = iris.data
  y = iris.target
  X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.10, random_state=1)

  d_train = xgb.DMatrix(X_train, label=y_train)
  d_test = xgb.DMatrix(X_test, label=y_test)

  params = {
    'max_depth': 3,  # tree depth 
    'eta': 0.3,      # training step
    'objective': 'multi:softprob',
    'eval_metric' : 'merror',
    'num_class': 3
  } 

  num_rounds = 5
  model = xgb.train(params, d_train, num_boost_round=num_rounds)

  preds = model.predict(d_test)  # one row of class probabilities per test item
  print("\npredicted class probabilities: ")
  print(preds)

  print("\npredicted classes: ")
  pc = np.argmax(preds, axis=1)
  print(pc)

  print("\nactual classes: ")
  print(y_test)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
Posted in Machine Learning | 2 Comments

Tomek Links for Pruning Imbalanced Data

Imbalanced data occurs when you have machine learning training data with many items of one class and very few items of the other class. For example, some medical data might have many thousands of data items that are “no disease” (class 0) but only a few data items that are “disease” (class 1).

Most ML prediction systems don’t work well with highly imbalanced data because the majority class items overwhelm the minority class items. So, in most cases you must prune away some of the majority class items and/or generate some new synthetic minority class items.

The idea of Tomek links is to identify majority class data items to delete. In a nutshell, a Tomek link occurs between two data items that have different classes, but are the nearest neighbors to each other. The idea is best understood by a diagram:

In the diagram, look at item (0.2, 0.3), which is class 0. The nearest neighbor to (0.2, 0.3) is (0.3, 0.2) and because it is a different class, the two items form a Tomek link.

On the other hand, the data item at (0.3, 0.9) has a nearest neighbor at (0.5, 0.9) but because both items are the same class, they don’t form a Tomek link.

The data in the diagram has a second Tomek link between (0.7, 0.4) and (0.8, 0.4).

Tomek links usually occur near a decision boundary (the pair in the lower left) or when one of the two items is “noise” (the pair on the right).

When you find a pair of data items that form a Tomek link, the item that has the majority class is a good candidate for deleting because it is either ambiguous (when near a decision boundary) or noise.
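
The Tomek link idea can be sketched in a few lines of plain Python. This is a minimal illustration, not production code. The six coordinates are the ones mentioned in the text, but the class labels (other than the class 0 item at (0.2, 0.3) and its class 1 neighbor) are assumptions, and the function names are made up for the example.

```python
# Minimal sketch: find Tomek links in a tiny 2D dataset (pure Python).
# Coordinates come from the text; most class labels are assumed.
import math

points = [(0.2, 0.3), (0.3, 0.2), (0.3, 0.9), (0.5, 0.9), (0.7, 0.4), (0.8, 0.4)]
labels = [0, 1, 0, 0, 0, 1]  # 0 = majority class, 1 = minority class (assumed)

def nearest_neighbor(i, pts):
    """Return the index of the item closest to pts[i], excluding itself."""
    others = [j for j in range(len(pts)) if j != i]
    return min(others, key=lambda j: math.dist(pts[i], pts[j]))

def find_tomek_links(pts, labs):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and have different class labels."""
    links = []
    for i in range(len(pts)):
        j = nearest_neighbor(i, pts)
        if i < j and nearest_neighbor(j, pts) == i and labs[i] != labs[j]:
            links.append((i, j))
    return links

links = find_tomek_links(points, labels)
# In each link, the majority class item (label 0) is the deletion candidate.
candidates = [i if labels[i] == 0 else j for (i, j) in links]
print(links)
print(candidates)
```

The pair at (0.3, 0.9) and (0.5, 0.9) are mutual nearest neighbors too, but they share a class so they are not reported as a link.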

As a general rule of thumb, you must be very cautious when pruning away items from imbalanced datasets. But you must also be cautious when generating synthetic minority class data, because the synthetic items might mask majority class data.



A bit too much fun while drinking can lead to personal imbalance, which in turn can lead to beer-box masking. I know this from personal experience in college but the miracle of Internet image search provides concrete visual evidence too.

Posted in Machine Learning | Leave a comment

Researchers Explore Intelligent Sampling of Huge ML Datasets to Reduce Costs and Maintain Model Fairness

I contributed to an article titled “Researchers Explore Intelligent Sampling of Huge ML Datasets to Reduce Costs and Maintain Model Fairness” in the May 2021 edition of the online Pure AI site. See https://pureai.com/articles/2021/05/03/intelligent-ai-sampling.aspx.

Researchers devised a new technique to select an intelligent sample from a huge file of machine learning training data. The technique is called loss-proportional sampling. Briefly, a preliminary, crude prediction model is created using all source data, then loss (prediction error) information from the crude model is used to select a sample that is superior to a randomly selected sample.

The researchers demonstrated that using an intelligent sample of training data can generate prediction models that are fair. Additionally, the smaller size of the sample data enables quicker model training, which reduces the electrical energy required to train, which in turn reduces the CO2 emissions generated while training the model.

The ideas are best explained by an artificial example. Suppose you want to create a sophisticated machine learning model that predicts the credit worthiness of a loan application, 0 = reject application, 1 = approve application. You have an enormous file of training data — perhaps billions of historical data items. Each data item has predictor variables such as applicant age, sex, race, income, debt, savings and so on, and a class label indicating if the loan was repaid, 0 = failed to repay, 1 = successfully repaid loan.

To create an intelligent loss-proportional sample, you start by creating a crude binary classification model using the entire large source dataset. The most commonly used crude binary classification technique is called logistic regression. Using modern techniques, training a logistic regression model using an enormous data file is almost always feasible, unlike creating a more sophisticated model using a deep neural network.

After you have trained the crude model, you run all items in the large source dataset through the model. This will generate a loss (error) value for each source item, which is a measure of how far the prediction is from the actual class label. For example, the loss information might look like:


(large source dataset)
item  prediction  actual   loss   prob of selection
[0]    0.80         1      0.04   1.5 * 0.04 = 0.06
[1]    0.50         0      0.25   1.5 * 0.25 = 0.37
[2]    0.90         1      0.01   1.5 * 0.01 = 0.02
[3]    0.70         1      0.09   1.5 * 0.09 = 0.13
. . .
[N]    0.85         1      0.02   1.5 * 0.02 = 0.03

Here the loss value is the square of the difference between the prediction and the actual class label, but there are many other loss functions that can be used. In general, the loss values for data items that have rare features will be greater than the loss values for normal data items.

Next, you map the loss for each source data item to a probability. In the example above, each loss value is multiplied by a constant lambda = 1.5. Now, suppose you want a sample of data that is 10 percent of the size of the large source data. You iterate through the source dataset. For each item, you select it and add it to the sample with its associated probability. In the example above, item [0] would be selected with prob = 0.06 (likely not selected), then item [1] would be selected with prob = 0.37 (more likely) and so on. You repeat until the sample has the desired number of data items.
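
Here is a rough sketch of a single selection pass, using the made-up predictions and labels from the table above. The function names are hypothetical, and capping the probability at 1.0 is my assumption for the case where lambda times the loss exceeds 1. Repeating passes until the sample reaches a target size is not shown.

```python
# Minimal sketch of one pass of loss-proportional sampling (pure Python).
# The predictions and labels are the made-up values from the table above.
import random

def selection_probs(preds, actuals, lam=1.5):
    """Squared-error loss per item, scaled by lam and capped at 1.0."""
    return [min(1.0, lam * (p - y) ** 2) for p, y in zip(preds, actuals)]

def sample_indices(probs, rng):
    """Keep item i with probability probs[i]; return the kept indices."""
    return [i for i, pr in enumerate(probs) if rng.random() < pr]

preds = [0.80, 0.50, 0.90, 0.70, 0.85]
actuals = [1, 0, 1, 1, 1]
probs = selection_probs(preds, actuals)
rng = random.Random(0)  # fixed seed so the run is repeatable
print([round(pr, 3) for pr in probs])
print(sample_indices(probs, rng))
```

Items with large loss (the ones the crude model gets badly wrong) are the ones most likely to end up in the sample.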

The ideas are fully explained in a 2013 research paper titled “Loss-Proportional Subsampling for Subsequent ERM” by P. Mineiro and N. Karampatziakis. The paper is available online in several locations. Note that the title of the paper uses the term “subsampling” rather than “sampling.” This just means that a large source dataset is considered to be a sample from all possible problem data. Therefore, selecting from the source dataset gives a subsample.

In the early days of machine learning, there was often a lack of labeled training data. But, increasingly, machine learning efforts have access to enormous datasets, which makes techniques for intelligent sampling more and more important. The history of computer hardware and software is fascinating.


Left: Nicolas Temese created this beautiful diorama of an IBM 1401 mainframe computer system. The 1401 was introduced in 1959. It was one of the very first mass-produced machines. In the early 1960s there were about 20,000 computers on the entire planet and about half of these were 1401s. Right: This is a 1/12 scale diorama of Tony Stark’s (Iron Man) workshop system. It was created by Sherwyn Lazaga who works for a company called StudioGenesis.


Posted in Machine Learning | Leave a comment

Neural Networks, Dogs, and JavaScript

I was walking my two dogs, Riley and Kevin, early one wet Pacific Northwest Saturday morning. I enjoy walking and thinking while my dogs do their dog-thing and look for rabbits. My dogs have never caught a rabbit but they’re endlessly optimistic.

Many of my colleagues at my workplace say that they, like me, do a lot of their technical thinking while walking.


Left: The path I walk most often, behind my house. Right: My second choice path, around Yellow Lake, across the street from my house.

I had an idea that I was exploring mentally and the idea involved coding a demo using the JavaScript programming language. I hadn’t used JavaScript for several months. I know from experience that before starting an experiment that uses a programming language I haven’t used in a while, my best approach is to do a warm-up with that language. My usual warm-up of choice to refresh my memory of JavaScript syntax and language idioms is to look at an example of neural network input-output, implemented from scratch (no libraries).

So that’s what I did.

It’s easy to take potshots at the JavaScript language. It’s true that the language allows you to go wrong in many ways. But in a weird way, the riskiness of JavaScript has a strange appeal. When coding with JavaScript, you have to be very, very careful. This is, in part, why so many JavaScript frameworks exist. If I owned a startup company of some sort and my product or service was based on raw no-libraries JavaScript, I’d be terrified unless I had absolutely expert, and I mean absolutely expert, developers. Junior developers with only a few years of experience would be an inevitable disaster.

But raw JavaScript, when used properly, has a kind of beauty to it. (Well, in my mental world anyway.)

I like using a neural network as a warm-up example to refresh language knowledge, because a NN requires all the basic language features: basic syntax, loops, functions, decision control, arrays, IO, and so on.

After about an hour, I had my warm-up example running. I structured the program with a top-level main() function.

The demo program defines a NeuralNet class. The addition of classes to the JavaScript language a few years ago (in ES6 I think) was a huge step forward in my opinion. The addition of the “let” keyword was also huge.

My demo creates a simple neural network with 3 input nodes, 4 hidden nodes, and 2 output nodes. The NN weights and biases are set to arbitrary values. The demo feeds [1.0, 2.0, 3.0] to the network, and the network outputs (0.4920, 0.5080) which I verified was correct by doing the calculations by hand.

OK, good fun for me. My dogs, once again, did not catch a rabbit, but I successfully refreshed my memory of JavaScript, with its flawed beauty and syntax.



Freckles are sort of like the flawed beauty of JavaScript — it all depends on how you look at it. In medieval times, freckles were sometimes considered witches marks. Things did not go well for people who were labeled as witches. Today, some people go out of their way to hide freckles using makeup, but on the other hand some people apply fake freckles to their faces. Here are three images from an Internet search for “models with freckles”.


Code below. (long) Continue reading

Posted in Machine Learning | Leave a comment

A Minimal Binary Search Tree Example Using Python

I don’t use tree data structures very often. When I expect to use a tree, I usually refresh my memory by coding a minimal example — coding trees is tricky and a warm-up example saves time in the long run.

Specifically, I’m thinking I might implement an Isolation Forest for anomaly detection. An Isolation Forest partitions a dataset into chunks, and the partitioned data can be efficiently stored in a tree data structure.

I did a quick search of the Internet, looking for examples of a binary tree using Python. I found quite a few, but I quickly realized that my goal was to re-familiarize myself with tree code, and so I set out to code a minimal example from scratch.

I have implemented tree data structures many times, but even so I was surprised, once again, to remember how tricky tree code is.

My demo was as simple as possible. Each node holds an integer. I implemented a search tree — all values to the left of a given node are less than the value in the given node, and all values greater are to the right.

I used a recursive method to display the tree in an inorder way — left sub-tree, then root, then right sub-tree. I used an iterative (non-recursive) method to insert a new value into the tree.

When I was a college professor, my favorite class to teach was Data Structures. Back in those days, we used the Pascal programming language, which I liked a lot. I taught the Data Structures class at least a hundred times. Even so, I was pleasantly surprised that I remembered the details of implementing a tree insert() method.

The idea for insert() is to start two references, p and q (essentially pointers), at the root node and then walk them down the tree with p in front and q trailing. You move p left or right according to the value to insert. When you hit the end of the tree, p will be None (similar to a null pointer) and q will point to the last node visited, which is the parent at the insertion location. You insert a new node as the left child or the right child of q, depending on whether the value to insert is less than or greater than the value q is pointing at.
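
The insert and display ideas can be sketched roughly like this. The class and method names are illustrative choices, not the demo code itself, and sending duplicate values to the right is an assumption.

```python
# Minimal sketch of a binary search tree with an iterative insert (pure Python).
# Class and method names are illustrative only.
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

class BinarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, value):
        """Walk p down the tree with q trailing one step behind."""
        new_node = Node(value)
        if self.root is None:
            self.root = new_node
            return
        p = self.root
        q = self.root
        while p is not None:
            q = p
            if value < p.value:
                p = p.left
            else:
                p = p.right  # duplicates go to the right (an assumption)
        # p fell off the tree; q is the parent at the insertion location
        if value < q.value:
            q.left = new_node
        else:
            q.right = new_node

    def inorder(self, node, result):
        """Recursive inorder walk: left sub-tree, root, right sub-tree."""
        if node is not None:
            self.inorder(node.left, result)
            result.append(node.value)
            self.inorder(node.right, result)
        return result

tree = BinarySearchTree()
for v in [50, 30, 70, 20, 40, 60, 80]:
    tree.insert(v)
print(tree.inorder(tree.root, []))  # values come out in sorted order
```

Because the tree is a search tree, the inorder walk visits the values from smallest to largest, which makes a handy sanity check for insert().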

The warm-up exercise was good fun, and a good refresher about the details of binary tree structures and their syntax. If I ever do decide to implement an Isolation Forest using a tree to hold the partitions, I won’t get stuck by the details of the tree part of the implementation.



Internet image search for “stuck in tree”. Left: A tourist in Florida is getting a too-close look at a banyan tree. Left-center: A woman is not having a good experience with a tree swing. Right-center: Classic fireman tree rescue. Right: I have no idea what’s going on in this photo.


Code below. Continue reading

Posted in Machine Learning | Leave a comment

Generating Synthetic ‘1’ Digits for the UCI Digits Dataset Using a Variational Autoencoder

A variational autoencoder (VAE) is a deep neural system that can generate synthetic data items. One possible use of a VAE is to generate synthetic minority class items (those with very few instances) for an imbalanced dataset. At least in theory; I’ve never seen it done in practice.

So, I decided to code up a little experiment. I used the PyTorch neural library (my current library of choice), but the same ideas can be implemented using TensorFlow or Keras. I started with the UCI Digits dataset. It has 3,823 training images. Each image is a crude handwritten digit from ‘0’ to ‘9’, represented by 8 by 8 (64) pixels where each pixel is a grayscale value between 0 and 16. There are about 380 of each digit. (Annoyingly, the dataset isn’t exactly evenly distributed.) I filtered out the 389 ‘1’ digits into a source data file. Then I trained a VAE on the ‘1’ digits. Then I used the trained VAE to generate five synthetic ‘1’ images. I was satisfied with the results.



It did take a bit of experimentation to tune the architecture of the VAE. My final architecture was 64-32-[4,4]-4-32-64. The interior [4,4] means that each digit is condensed to a distribution mean vector with four values and a distribution standard deviation (in the form of log-variance) vector with four values. VAEs are like riding a bicycle — easy once you know how but rather complicated when you don’t.
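
To make the [4,4]-4 part of the architecture a bit more concrete, here is a small sketch of the usual VAE sampling step that turns a mean vector and a log-variance vector into one latent vector. This is the standard reparameterization idea written in plain Python, not the demo's PyTorch code, and the numeric values are made up.

```python
# Sketch of the sampling step between the [4,4] layer and the 4-value
# latent vector. The numeric values are made up for illustration.
import math
import random

def sample_latent(mean, logvar, rng):
    """z = mean + stddev * noise, where stddev = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]

rng = random.Random(1)
mean = [0.5, -1.0, 0.0, 2.0]       # hypothetical encoder output
logvar = [-2.0, -2.0, -2.0, -2.0]  # hypothetical encoder output
z = sample_latent(mean, logvar, rng)
print(z)  # four values, which would be fed to the decoder (4-32-64)
```

Storing log-variance rather than standard deviation means the network can output any real number and the standard deviation still comes out positive.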

The tuning of the architecture determines the fidelity of the synthetic generated images. If the VAE is too good, it will memorize the training data. If the VAE is weak, the synthetic data will be too dissimilar to the source training data.

For my demo, I generated five synthetic ‘1’ digits. They seemed good — they all look like real UCI Digits ‘1’ images but they weren’t simple copies.

When I get some free time, I’ll tidy up my demo code, add some explanation, and publish it all in Visual Studio Magazine.

Good fun.



Starting with the mixed media portrait on the left, I did repeated Internet image searches for similar images to manually generate two synthetic versions of the original image. The starting image on the left is by artist Hans Jochem Bakker. The center image is by artist Daniel Arrhakis. The image on the right is by artist Randy Monteith.

I’m not sure how well my synthetic image generation idea worked, but I like all three portraits anyway. Finding beauty is always a good use of time.

Posted in Machine Learning | 2 Comments

Automatic Stopping for Logistic Regression Training

A few days ago I did some thought experiments about different schemes to automatically stop training a logistic regression model. I was motivated by the poor performance of the scikit library LogisticRegression model with default parameters.


I coded up a demo program using PyTorch and an auto-stop condition as described in this post. The results seemed pretty good.

After quite a bit of thought, I decided to stop based on average squared error per training item. It’s best explained by example. Suppose your training data has just three items and the variable to predict, sex, is encoded as 0 = male, 1 = female, and each item has just two predictor values, age and income:

age   income  sex
----------------
0.39  0.54000  0
0.28  0.32000  1
0.40  0.64000  0

Now suppose that at some point during training the model’s computed outputs are [0.20, 0.60, 0.30]. The squared error terms for each of the three data items are:

(0 - 0.20)^2 = 0.04
(1 - 0.60)^2 = 0.16
(0 - 0.30)^2 = 0.09

The average squared error for the three items is (0.04 + 0.16 + 0.09) / 3 = 0.097. I call this average squared error to distinguish it from the mean squared error that’s used as the loss function for training. Both metrics give the same value.

To make a long story short, after a bit of experimentation, setting an auto-stop condition of average squared error less than 0.20 seemed to work pretty well.
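
The stopping check itself is tiny. Here is a sketch in plain Python using the three-item example; the function names are made up, and in a real training loop the outputs would come from the model after each epoch (or every few epochs).

```python
# Sketch of the auto-stop check using the three-item example (pure Python).
def avg_squared_error(targets, outputs):
    """Average of (target - output)^2 over all training items."""
    total = sum((t - o) ** 2 for t, o in zip(targets, outputs))
    return total / len(targets)

def should_stop(targets, outputs, threshold=0.20):
    """True when the average squared error drops below the threshold."""
    return avg_squared_error(targets, outputs) < threshold

targets = [0, 1, 0]
outputs = [0.20, 0.60, 0.30]
print(round(avg_squared_error(targets, outputs), 3))
print(should_stop(targets, outputs))
```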


Below: Photoshop artist James Fridman accepts requests from his fans to “fix” their photos. Fridman, hilariously, never knows when to stop.

Posted in Machine Learning | Leave a comment

Using Reinforcement Learning for Anomaly Detection

I often think about new machine learning ideas in two different but related ways. One approach is to look at a specific, practical problem and then mentally examine my collection of ML techniques to see if I have a way to solve the problem. The second approach is to start by thinking about my ML techniques repertoire and then do some thought experiments that combine or modify two or more techniques to see if the hypothetical new technique can solve a problem.

So, one day while I was walking my dogs, my brain came up with an idea using a technique called Q-learning, from the field of Reinforcement Learning, to identify anomalies in a dataset.

Q-learning is a clever technique that can be used to find the best path through a maze. Each position in the maze is a State, and each possible move from a given position is an Action. Using an idea called the Bellman equation, you can construct a table that assigns a Q value (“quality”) for every possible Action in every State. Then to solve the maze from the starting state, you repeatedly take the Action to move to the position/State that has the highest Q value.
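
Here is a minimal sketch of the maze idea in plain Python. Everything specific in it is made up for illustration: the five-state maze layout, the rewards (10 for reaching the goal, a small penalty otherwise), the learning rate, and the discount factor. It uses purely random exploration, which is one simple choice among several.

```python
# Minimal Q-learning sketch on a tiny hypothetical maze (pure Python).
# The maze layout, rewards, and hyperparameters are all made up.
import random

# moves[s] lists the states reachable from state s; state 4 is the goal
moves = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [4]}
GOAL = 4

def train(moves, goal, episodes=500, lr=0.5, gamma=0.9, seed=0):
    """Build the Q table by exploring the maze with random moves."""
    rng = random.Random(seed)
    Q = {s: {t: 0.0 for t in nxt} for s, nxt in moves.items()}
    starts = [s for s in moves if s != goal]
    for _ in range(episodes):
        s = rng.choice(starts)
        while s != goal:
            t = rng.choice(moves[s])  # Action = move to state t
            reward = 10.0 if t == goal else -0.1
            future = 0.0 if t == goal else max(Q[t].values())
            # Bellman-style update toward reward + discounted future value
            Q[s][t] += lr * (reward + gamma * future - Q[s][t])
            s = t
    return Q

def greedy_path(Q, start, goal, max_steps=20):
    """Repeatedly take the move with the highest Q value."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        s = path[-1]
        path.append(max(Q[s], key=Q[s].get))
    return path

Q = train(moves, GOAL)
print(greedy_path(Q, 0, GOAL))
```

State 2 is a dead end, so a trained table should steer the greedy walk away from it.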

See https://jamesmccaffrey.wordpress.com/2018/10/22/q-learning-using-python/ for a concrete example of Q-learning.

So my rather strange idea for an anomaly detection technique is the following. Suppose you have a dataset. Each item in the dataset is a State. You want to move through the dataset one item at a time to go from the first data item/State to the last data item/State. Using Q-learning, you construct a Q value for moving from each data item to another data item. And then . . . drum roll please . . . data items that have the lowest Q values are anomalies.

Maybe.

The next step is to do some experiments by actually writing code. Many of my ideas that combine techniques don’t work out. But some do.



The Taylor Aerocar is arguably the most successful of many attempts to combine an automobile and an airplane. Six were designed and built by a man named Moulton Taylor from 1949 to 1960. One is still flying. Software engineering experiments are much less risky than mechanical engineering experiments.

Posted in Machine Learning | Leave a comment

Some Thoughts About Dealing With Imbalanced Training Data

Suppose you have a binary classification problem where there are many items of one class, but very few of the other class. For example, with medical data, you might have many thousands of data items representing people who are class 0 (no disease), but only a few dozen items that are class 1 (have the disease). Or with security data, you could have thousands of data items that are class 0 (normal) but only a few items that are class 1 (security risk). Such datasets are called imbalanced.

If you train a prediction model using all of the imbalanced training data, the items with the majority class will overwhelm the items with the minority class. The resulting model will likely predict the majority class for any input.

The same ideas apply to multi-class classification. For simplicity, I’ll assume a binary classification scenario.

There are two approaches for dealing with imbalanced training data. You can delete some of the majority items so you have roughly equal numbers of both classes. Or you can generate new synthetic minority items. In practice, best results are usually obtained by combining techniques: delete some majority class items and also generate some synthetic minority class items.

Both of these general approaches have dozens of variations. The fact that there are dozens of techniques indicates that no one technique works best.

A typical example of generating synthetic data is the SMOTE technique (“synthetic minority over-sampling technique”). It’s very simple. You repeatedly select a minority class data item, A, at random. Then you find the five nearest neighbor items of A, call them (B, C, D, E, F). Then you randomly select one of the nearest neighbors, suppose it’s E. Then you construct a new synthetic minority item by computing the average of A and E (with small random noise added).
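
A SMOTE-style generator can be sketched in a few lines of plain Python. The data values and function name are made up. One difference from the description above: this sketch picks a random point on the line between A and E, which is how SMOTE is usually described, rather than the exact average plus noise. Setting the fraction to 0.5 gives the plain average.

```python
# Minimal SMOTE-style sketch (pure Python). The data values are made up.
import math
import random

def smote_item(minority, k, rng):
    """Create one synthetic item between a random minority item A and
    one of A's k nearest minority neighbors."""
    a = rng.choice(minority)
    others = [x for x in minority if x is not a]
    neighbors = sorted(others, key=lambda x: math.dist(a, x))[:k]
    e = rng.choice(neighbors)
    frac = rng.random()  # 0.5 would give the plain average of A and E
    return [ai + frac * (ei - ai) for ai, ei in zip(a, e)]

rng = random.Random(0)
minority = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.10],
            [0.30, 0.30], [0.25, 0.15], [0.12, 0.28], [0.40, 0.35]]
synthetic = [smote_item(minority, 5, rng) for _ in range(3)]
print(synthetic)
```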

There are many variations of the SMOTE technique. But they all assume that your data items are strictly numeric — you can’t directly find k-nearest neighbors on categorical data, and you can’t find an average of two categorical data items.

A typical example of deleting majority class items is called “down-sample and up-weight”. In its most basic form, you delete 50% (or whatever) of randomly selected majority class items, a factor of 2. Then, during training, when you compute the loss value for a majority item, you multiply the loss by 2. The idea is that there are really twice as many majority items as you’re training on, so you should weight a majority class item twice as much. This approach seems odd at first but because you have roughly equal numbers of majority and minority class items during training, the majority class items are less likely to overwhelm minority class items.
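
The basic form of down-sample and up-weight might look something like this sketch. The function name, the data, and the (item, label, weight) tuple layout are all illustrative assumptions.

```python
# Sketch of down-sample and up-weight (pure Python). Data is made up.
import random

def downsample_upweight(items, labels, majority=0, factor=2, seed=0):
    """Keep 1/factor of the majority items at random and give each kept
    majority item a loss weight of factor; minority items get weight 1."""
    rng = random.Random(seed)
    maj = [i for i, y in enumerate(labels) if y == majority]
    keep = set(rng.sample(maj, len(maj) // factor))
    result = []
    for i, (x, y) in enumerate(zip(items, labels)):
        if y != majority:
            result.append((x, y, 1.0))
        elif i in keep:
            result.append((x, y, float(factor)))
    return result

items = list(range(12))      # stand-ins for 12 data items
labels = [0] * 10 + [1] * 2  # 10 majority, 2 minority
for x, y, w in downsample_upweight(items, labels):
    print(x, y, w)  # multiply each item's loss by w during training
```

The total weight of the kept majority items equals the original majority count, which is the point of the up-weighting.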

There are many variations of the down-sample and up-weight technique. An interesting paper written by one of my work colleagues starts by training a crude (but fast to create) prediction model on all data, and then uses the loss information generated by the crude model to intelligently select majority items to delete.

In general, whenever I read about a problem that uses a classical statistics or classical ML technique (such as k-nearest neighbors), I ponder using one of the deep neural techniques in my personal ML mental toolkit.

My first thought was that generating synthetic minority class items sounds like a problem that is well-suited for a variational autoencoder (VAE). A VAE is specifically designed to generate synthetic data. So, you’d train a VAE on the minority class items, then use the trained VAE to generate synthetic minority class items. Simple.

One advantage of the VAE approach compared to SMOTE is that a VAE can work with both numeric and categorical data. However, a VAE is much more complex than a k-NN based approach, possibly too complex for an average data scientist. I zapped together a proof of concept using PyTorch, where I created and trained a VAE to generate synthetic ‘1’ digits from the UCI Digits dataset. Each image is 8 by 8 grayscale pixels. The synthetic ‘1’ looked decent enough. However, when I generated several synthetic ‘1’ digits, they seemed too close to each other. This suggests my VAE is too good; it’s overfitting. One way to deal with a VAE that overfits is to adjust its architecture.

As always, thought experiments are a good start, but I’d need to code some experiments to see what will actually happen.


No face has features that are perfectly balanced. In fact, a certain amount of imbalance in facial features contributes to attractiveness. Here are four attractive celebrities who have one eye that is noticeably smaller than the other. From left to right: Paris Hilton, Ariana Grande, Ryan Gosling, Angelina Jolie.


Posted in Machine Learning | Leave a comment