Principal Component Analysis (PCA) From Scratch Using Python

If you have some data with many features, principal component analysis (PCA) is a classical statistics technique that can be used to transform your data to a set with fewer features. This is called dimensionality reduction. For example, suppose you are looking at the MNIST image dataset. Each image has 28 x 28 = 784 features/pixels. This is difficult to work with, so you might want to use PCA to reduce the dimensionality from 784 down to 2 so you could graph the data on a 2D chart.

I coded up a short demo that does PCA two ways: 1.) using the scikit library PCA module, and 2.) using a PCA function implemented from scratch. (I got the from-scratch code from a Web site about t-SNE dimensionality reduction.)

I created a dummy dataset with 10 items, where each item has 6 features/columns:

[0.3750, 0.3750, 0.6250, 0.5000, 1.0000, 0.1250]
[0.5000, 1.0000, 0.3125, 0.6875, 1.0000, 0.0000]
[0.9375, 0.7500, 0.8125, 0.1250, 1.0000, 0.7500]
[0.4375, 0.6875, 0.5000, 0.4375, 1.0000, 0.1875]
[0.6875, 0.7500, 0.4375, 0.3750, 1.0000, 0.0625]
[0.5625, 0.6250, 0.5625, 0.2500, 0.9375, 0.0000]
[0.5625, 0.8125, 0.1875, 0.5000, 0.7500, 0.0000]
[0.9375, 0.5000, 0.8125, 0.2500, 1.0000, 0.2500]
[0.7500, 0.5625, 0.8125, 0.3750, 0.8125, 0.0625]
[0.9375, 0.5000, 0.7500, 0.3125, 1.0000, 0.1875]

I reduced the data to dim=2. Using the scikit PCA module I got:

 [-0.1089 -0.3185]
 [-0.5097  0.2176]
 [ 0.625   0.3629]
 [-0.1712  0.0079]
 [-0.1368  0.0754]
 [-0.0845 -0.1206]
 [-0.4504  0.1383]
 [ 0.3954 -0.0828]
 [ 0.1349 -0.1853]
 [ 0.3062 -0.0949]

Using the custom PCA function I got:

 [-0.1089  0.3185]
 [-0.5097 -0.2176]
 [ 0.625  -0.3629]
 [-0.1712 -0.0079]
 [-0.1368 -0.0754]
 [-0.0845  0.1206]
 [-0.4504 -0.1383]
 [ 0.3954  0.0828]
 [ 0.1349  0.1853]
 [ 0.3062  0.0949]

Notice the signs of the second component are reversed. This is annoying but OK. It happens because an eigenvector is only determined up to its sign, so different implementations can return components with flipped signs.
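
If the sign difference matters, you can simply flip the affected column after the fact. A minimal sketch, using the x_reduced variable that holds the custom results in the demo code below:

# the sign of an eigenvector is arbitrary, so flipping an
# entire column of components is harmless
x_reduced[:, 1] = -x_reduced[:, 1]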

The scikit module has a built-in explained_variance_ratio_ property which gave (0.619, 0.1927). This means the first component explains 61.9% of the variance and the second component explains an additional 19.27%, so the total variance explained by the two components is 81.17%.

To compute the percentage of variance explained using the custom function, I computed PCA for 6 components, the same as the number of features in the source data. Each component (column) has a variance so I computed those and summed them. The percentage of variance explained by the first component is the variance of the first component divided by the sum of the variances.

An alternative approach for dimensionality reduction is to use a neural autoencoder. PCA and autoencoders each have pros and cons. PCA is (mostly) deterministic but gives a slightly weaker representation, with components ordered by the amount of variance they explain. Autoencoders often give a better representation of the source data, but autoencoders are not deterministic in the sense that you can get very different results depending upon the neural architecture and training hyperparameters used.



One way to think about PCA is that it finds a reduced essence of data items. A book nook is a small decorative diorama, about the size of a book, that captures the essence of a concept. Left: A Harry Potter series book nook. Center: A generic book collection book nook. Right: A Hobbit / Lord of the Rings book nook.


Code below.

# pca_demo.py
import numpy as np
from sklearn import decomposition

def pca(X, n_dim):
  # center the data
  means = np.mean(X, axis=0)
  X = X - means
  square_m = np.dot(X.T, X)  # scatter matrix
  (evals, evecs) = np.linalg.eig(square_m)
  # eig() does not order its results, so sort the eigenvectors
  # by decreasing eigenvalue before projecting
  idx = np.argsort(evals)[::-1]
  evecs = evecs[:, idx]
  result = np.dot(X, evecs[:, 0:n_dim])
  return result

x_data = np.array([
[0.3750, 0.3750, 0.6250, 0.5000, 1.0000, 0.1250],
[0.5000, 1.0000, 0.3125, 0.6875, 1.0000, 0.0000],
[0.9375, 0.7500, 0.8125, 0.1250, 1.0000, 0.7500],
[0.4375, 0.6875, 0.5000, 0.4375, 1.0000, 0.1875],
[0.6875, 0.7500, 0.4375, 0.3750, 1.0000, 0.0625],
[0.5625, 0.6250, 0.5625, 0.2500, 0.9375, 0.0000],
[0.5625, 0.8125, 0.1875, 0.5000, 0.7500, 0.0000],
[0.9375, 0.5000, 0.8125, 0.2500, 1.0000, 0.2500],
[0.7500, 0.5625, 0.8125, 0.3750, 0.8125, 0.0625],
[0.9375, 0.5000, 0.7500, 0.3125, 1.0000, 0.1875]],
 dtype=np.float32)

print("\nBegin PCA demo ")
np.set_printoptions(precision=4, suppress=True)
print("\nSource data: ")
print(x_data)

pca_2 = decomposition.PCA(2) 
pca_2.fit(x_data)
x_reduced = pca_2.transform(x_data)
print("\nComponents from scikit PCA() function: ")
print(x_reduced)
print("\nPercent variance explained via scikit: ")
ve = pca_2.explained_variance_ratio_
print(ve)

x_reduced = pca(x_data, 2)
print("\nComponents from custom pca() function: ")
print(x_reduced)

# compute percent variance explained
x_transform = pca(x_data, 6)  # all columns
v = np.var(x_transform, axis=0, ddof=1)
sv = np.sum(v)
ve = np.array([v[0] / sv, v[1] / sv])
print("\nPercent variance explained computed: ")
print(ve)

print("\nEnd demo ")
Posted in Machine Learning | Leave a comment

Logistic Regression Using PyTorch with L-BFGS in Visual Studio Magazine

I wrote an article titled “Logistic Regression Using PyTorch with L-BFGS” in the June 2021 edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/06/23/logistic-regression-pytorch.aspx.

Logistic regression is one of many machine learning techniques for binary classification — predicting one of two possible discrete values. In my article I give an end-to-end demo that predicts the sex (male = 0, female = 1) of a hospital patient based on age, county of residence, blood monocyte count and hospitalization history.

A logistic regression prediction equation looks somewhat like:

z = (w0 * age) + (w1 * county) + 
    (w2 * monocyte) + (w3 * history) + b
p = 1 / (1 + exp(-z))

The p value will be between 0 and 1. A p value less than 0.5 indicates a prediction of class 0; a p value greater than 0.5 indicates class 1.
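
Here is a minimal sketch of the prediction equation in Python. The weight, bias, and input values are made up for illustration and are not values from the article:

import numpy as np

w = np.array([0.30, -0.20, 0.10, 0.40])  # w0, w1, w2, w3 (made-up values)
b = 0.15                                 # bias (made-up value)
x = np.array([0.36, 0.50, 0.55, 1.00])   # age, county, monocyte, history (normalized)

z = np.dot(w, x) + b                     # z = 0.613
p = 1.0 / (1.0 + np.exp(-z))             # p = 0.65 (approx.) so predict class 1 = female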


Left: A screenshot of a demo run from the article. Right: Data and a graph for a simple problem that I used to explain how logistic regression works.

Finding the values of the wi weights and the bias b is called training the model. The idea is to try different values of the weights and the bias to find the values that give the best results on training data that has known input values (age, etc.) and correct target values (sex = 0 or 1).

There is no closed form solution to find the best values of the weights and the bias, so there are at least a dozen major algorithms to estimate the weights and bias values. These optimization algorithms (in math terms you are minimizing error/loss) include stochastic gradient descent, iterated Newton-Raphson, Nelder-Mead (aka amoeba method), particle swarm optimization, evolutionary optimization, and . . . L-BFGS (“limited memory Broyden Fletcher Goldfarb Shanno”) algorithm.

The L-BFGS algorithm uses the Calculus first derivative (gradient) and an estimate of the second derivative (Hessian). This requires all data to be in memory but produces very fast training.
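
Here is a minimal sketch of the L-BFGS training pattern in PyTorch (not the article's exact code), assuming tensors train_x with shape (n, 4) and train_y with shape (n, 1) already exist:

import torch

model = torch.nn.Sequential(
  torch.nn.Linear(4, 1),   # one weight per predictor, plus a bias
  torch.nn.Sigmoid())      # squash z to a p value between 0 and 1
loss_func = torch.nn.BCELoss()
opt = torch.optim.LBFGS(model.parameters(), lr=0.05, max_iter=100)

def closure():
  # L-BFGS re-evaluates the loss several times per step(), so the
  # forward and backward passes live in a closure
  opt.zero_grad()
  loss = loss_func(model(train_x), train_y)
  loss.backward()
  return loss

opt.step(closure)  # a single call runs up to max_iter internal iterations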

To summarize, there are many techniques to create a binary classification model that uses an equation with weights and biases. Each of these equation-based techniques can be trained using one of many optimization algorithms. My article explained one specific scenario: logistic regression with L-BFGS using the PyTorch code library.

Three advantages of using PyTorch logistic regression with L-BFGS optimization are:

1. The simplicity of logistic regression compared to techniques like support vector machines
2. The flexibility of PyTorch compared to rigid high level systems such as scikit-learn
3. The speed of L-BFGS compared to most forms of stochastic gradient descent

Three disadvantages of the technique presented in the article are:

1. The crudeness of logistic regression compared to much more powerful models such as deep neural binary classifiers
2. Longer development time compared to ready-to-use models like the scikit-learn LogisticRegression() class
3. The requirement that all training data fit into memory compared to techniques that allow batch processing



Binary classification is sort of the quintessential form of machine learning. Chess is quintessentially binary, with the black and white pieces, two players, and black and white squares on the chessboard. I grew up loving chess which probably has something to do with why I love computer science and machine learning. Here are four images from an Internet image search for “chess queen Halloween costume”. I don’t find these images particularly appealing, but I do think they’re interesting in a binary sort of way.

Posted in PyTorch | Leave a comment

Using Built-In Library Functions vs. Using A Custom Function

I ran into some code that made me think about the nature of coding and computer science. I was reviewing some PyTorch documentation example code and saw a statement that looked something like:

foo = torch.tensor(foo[:-1]).cumsum(dim=0).to(device) 

From the context of the code, I knew what the result of the statement would be, but the syntax of the statement was not obvious to me at all. The problem is to compute offsets of some concatenated arrays. Suppose:

arr1 = [11 15 17]
         0  1  2

arr2 = [21 28]
         0  1

arr3 = [33 39 35 34]
         0  1  2  3

arr4 = [44 46 43]
         0  1  2

The lengths of the arrays are [3, 2, 4, 3].

If the four arrays were concatenated:

crr = [11 15 17 21 28 33 39 35 34 44 46 43]
        0  1  2  3  4  5  6  7  8  9 10 11

The offsets of where each small array starts in the concatenated array would be: [0, 3, 5, 9]. So, to cut to the chase, you can compute offsets from concatenated arrays by computing the cumulative sums of their lengths. That is what the mysterious documentation code did.
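
Here is a minimal sketch of the cumulative-sum idea using NumPy (not the exact documentation statement):

import numpy as np

lengths = np.array([3, 2, 4, 3], dtype=np.int64)
# shift right with a leading 0, then take cumulative sums of the lengths
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
print(offsets)  # [0 3 5 9]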

On the other hand, you could write a custom function like so:

def compute_offsets(arr):
  # arr is a numpy array of lengths
  n = len(arr)
  result = np.zeros(n, dtype=np.int64)
  result[0] = 0
  for i in range(1, n):
    result[i] = result[i-1] + arr[i-1]
  return result 

Which approach is better? If you use the built-in cumsum() function, you take on a dependency (which could break your system if the container library changed implementation or interface) and the intent of the calling code is not obvious (although you could add a comment). If you write a custom function, you have extra lines of code, and there’s a chance you could have a bug in even the simplest code.

The reason this particular example got me thinking was that in my opinion, the pros and cons of library vs. custom are just about evenly balanced for this code. Choosing between using a library function or writing a custom function is rarely a black or white decision. One of the benefits of experience in software development is that you gain an intuitive sense of when to use library code and when to write custom code.



Horst P. Horst (1906-1999) was a famous photographer in the 1930s and 1940s. Here are three of his photographs of well-known fashion models of the time. Left: model Agneta Fischer. Center: model Bettina Bolegard. Right: model Georgia Carroll. The looks of Fischer and Bolegard appear very modern but Carroll’s looks seem quite 1940-ish to me (but trust me, I’m no expert on looks of the 1940s or of today either).

Posted in Miscellaneous | Leave a comment

Tokenizing Text Using the Basic English Algorithm

In natural language processing (NLP) problems you must tokenize the source text. This means you must split each word/token, usually convert to lower case, replace some punctuation, and so on. In Python, the spaCy library and the NLTK (natural language toolkit) library have many NLP functions including tokenizers. The TorchText library has tokenizers too.

In some NLP problem scenarios you need a custom tokenizer to deal with oddities of your source text. One rainy weekend day I sat down on my couch and refactored the TorchText “basic_english” tokenizer source code. The algorithm is:

1. Convert the source text to lower case.
2. Add a space before and after single-quote, period, comma, left paren, right paren, exclamation point, and question mark.
3. Replace colon, semicolon, and (br /) with a space.
4. Remove double-quote characters.
5. Split on whitespace.

Somewhat unfortunately, the TorchText source code uses a Regular Expression approach — ugh. I am not a fan of regular expressions. When they work you’re fine but trying to debug a regular expression is a trip to the seventh level of Hell.

All of the TorchText tokenizer regular expression functionality is string replacement, so it would be an easy task to yank out the regular expression and refactor using simple string replacement. This would make it easy to deal with situations where you want to retain the capitalization of certain words that have important meaning in your source text. For example, if you were working with calendar date text, you might want to retain capitalization of months January, February, March, and so on, because “march” can be a verb, “may” can be a modal verb, and so on.

Using string replacement is slower than regular expressions but easier to customize. The code could look like:

class MyTokenizer:
  # use simple string replacement instead of RE
  def tokenize(self, line):
    line = line.lower()
    # pad punctuation with spaces so each becomes a separate token
    for c in ["\'", ".", ",", "(", ")", "!", "?"]:
      line = line.replace(c, " " + c + " ")
    # replace colon, semicolon, and the (br /) tag with a space
    for c in [":", ";", "<br />"]:
      line = line.replace(c, " ")
    line = line.replace("\"", "")  # remove double-quote
    return line.split()            # split on whitespace

my_toker = MyTokenizer()
line = "Blah, blah, whatever."
result = my_toker.tokenize(line)
# result: ['blah', ',', 'blah', ',', 'whatever', '.']

Whenever I write code, I try to avoid external dependencies. Writing a custom text tokenizer is one way to eliminate an external dependency for NLP problems.



The term art tokenization can mean several things. One meaning is to create a set of blockchain tokens that establish ownership of an expensive piece of art like a Van Gogh or da Vinci. This allows partial ownership of a piece of art, and allows art to be traded in much the same way that company shares are traded in the stock market. The weird idea is that the physical presence of a work of art isn’t too important when digital images of it can be shown.

Here are three examples of science-inspired art. Will any of them ever be worth millions of dollars? Who knows — maybe. Left: by artist Fabian Oefner. Center: by Igor Siwanowicz. Right: by an anonymous middle school student, posted by his teacher, Jim Dodson.

Posted in Machine Learning | Leave a comment

Sentiment Analysis using a PyTorch Neural Network with an EmbeddingBag Layer

In computer science, and life, it helps to be smart but it’s also important to have determination. I’m not the smartest guy in the Universe, but once a problem gets stuck in my head it will stay there until it gets solved.

I’ve been looking at sentiment analysis using a PyTorch neural network with an EmbeddingBag layer. I started by looking at an example in the PyTorch documentation, but that example used the AG News dataset which has 1,000,000 short news snippets, which makes it extremely difficult to work with when you’re trying to dissect the example. Additionally, the demo used a built-in torchtext.datasets.AG_NEWS() class which magically serves up data in a special format — in real life you must deal with data wrangling yourself.

So, over the past couple of months I’ve been slowly but surely dissecting the documentation example so that I could create my own system. I hit a milestone recently when I got a complete end-to-end example working. I created 20 tiny movie reviews, each of which is labeled as 0 (negative sentiment) or 1 (positive). The goal is to train a neural model to correctly classify a tiny review as positive or negative.

In most natural language processing (NLP) problem scenarios, each word in a sequence/sentence is converted to an integer index using a Vocabulary object, and then the index representing the word is converted to a numeric vector of about 100 values, called a word embedding. Each word embedding is sequentially fed to an extremely complex neural system — typically an LSTM for moderate length input or a Transformer for long length input.

An EmbeddingBag layer converts an entire sequence/sentence to a numeric vector. This is dramatically simpler than a word embedding approach — but still extremely tricky (just like all NLP problems).
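
Here is a minimal sketch of an EmbeddingBag layer in isolation. The vocabulary size, embedding dimension, and word indexes are made up for illustration:

import torch
import torch.nn as nn

emb_bag = nn.EmbeddingBag(num_embeddings=30, embedding_dim=8, mode="mean")

# two reviews packed into one flat tensor of word indexes
text = torch.tensor([4, 7, 3, 13, 6, 2, 19, 21, 4, 5], dtype=torch.int64)
offsets = torch.tensor([0, 6], dtype=torch.int64)  # review 1 starts at [0], review 2 at [6]

result = emb_bag(text, offsets)
print(result.shape)  # torch.Size([2, 8]) -- one 8-value vector per review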

I intend to tidy up my demo program and write up an explanation and then publish it in Microsoft Visual Studio Magazine. Even though the demo program is only about 200 lines long, it is very dense in terms of ideas so my explanation will likely take two or three articles.

People who aren’t programmers or developers or data scientists don’t understand our world (if you’re reading this blog post, you are probably part of “our world”). We don’t relentlessly work on difficult problems because of some external force — we do so because our brains are wired that way.



The history of computer science is one largely of men who had relentless determination to create. Left: Wilhelm Schickard (1592–1635) designed, but did not build, a “calculating clock” that would have performed addition, subtraction, multiplication and division. Center: The Z1 mechanical computer was built by Konrad Zuse (1910-1995) in 1937. It weighed about 2,000 pounds and had 20,000 parts. The Z1 contained almost all the parts of a modern computer but wasn’t reliable. Right: In 1978, some MIT students built a tinker toy mechanical computer from 10,000 parts and fishing line. It was hard-wired to play tic-tac-toe.

Posted in PyTorch | 2 Comments

Researchers Explore Techniques to Reduce the Size of Deep Neural Networks on Pure AI

I contributed to an article titled “Researchers Explore Techniques to Reduce the Size of Deep Neural Networks” in the June 2021 edition of the online Pure AI Web site. See https://pureai.com/articles/2021/06/02/reduce-networks.aspx.

The motivation for reducing the size of deep neural networks is simple: even on the most powerful hardware available, huge models can take weeks or months to train, which uses a lot of electrical energy, which costs a lot of money and generates a lot of CO2 emissions. Reducing the size of a deep neural network is often called compression (in general) or pruning (for specific techniques).

The article describes three recent research papers. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” (2019) by J. Frankle and M. Carbin showed strong evidence that huge neural networks can be pruned to a fraction of their original size, and the pruned model can achieve nearly the same accuracy as the huge source model. The result is mostly theoretical because the paper started with a huge trained model and then pruned it.



“SNIP: Single-Shot Network Pruning Based on Connection Sensitivity” (2020) by N. Lee, T. Ajanthan and P. Torr demonstrated a relatively simple technique to prune a huge neural network before training it.

“Picking Winning Tickets Before Training by Preserving Gradient Flow” (2020) by C. Wang, G. Zhang and R. Grosse is essentially an extension of the SNIP technique to address what the authors felt were some minor weaknesses.

I was quoted in the article. I commented, “There’s a general feeling among my colleagues that huge deep neural models have become so difficult to train that only organizations with large budgets can successfully explore them.” I added, “Deep neural compression techniques that are effective and simple to implement will allow companies and academic institutions that don’t have huge budgets and resources to work with state of the art neural architectures.”

And I said, “Over the past three decades, there has been an ebb and flow as advances in hardware technologies, such as GPU processing, allow deep neural networks with larger sizes. I speculate that at some point in the future, quantum computing will become commonplace, and when that happens, the need for compressing huge neural networks will go away. But until quantum computing arrives, research on neural network compression will continue to be important.”


Three relatively obscure reduced-size comic book super heroes. Left: Tinyman first appeared in “Captain Marvel” #2 (1966). Tinyman was the sidekick for Captain Marvel who could . . get ready . . separate his body parts. Center: The Wasp first appeared in “Tales to Astonish” #44 (1963). She was the sidekick to Ant-Man. Right: Elasti-Girl first appeared in “My Greatest Adventure” #80 (1963). She was an Olympic swimming gold medalist who gained the power to shrink or grow.

Posted in Machine Learning | 2 Comments

Top Ten Science Fiction Movies with Pterodactyls

There are many science fiction movies that feature Pteranodons, pterosaurs, and pterodactyls. Scientifically they’re different but basically they’re all flying reptiles. Here are ten of my favorites, listed in no particular order.


1. Rodan (1956) – Japanese miners dig a little too deep and unearth Rodan and his mate. After much destruction, Rodan is killed by a volcanic eruption and his mate commits suicide rather than live without him.


2. The Extraordinary Adventures of Adele Blanc-Sec (2010) – Adele is a writer in early 1900s Paris and has amazing adventures with reanimated mummies, parapsychology, and a pet pterosaur.


3. Jurassic Park III (2001) – This is the one where a divorced couple trick paleontologist Dr. Alan Grant into going to the now-deserted Isla Sorna to find their son. They are menaced by a pterosaur in a giant birdcage-like structure.


4. King Kong (1933) – Filmmaker Carl Denham and a crew go to Skull Island and find the giant ape. Kong saves Ann Darrow from a pterodactyl.


5. One Million Years B.C. (1966) – Sort of a prehistoric Romeo and Juliet with a happier ending. Tumak and Loana are from different tribes. A pterodactyl grabs Loana but she survives.


6. The Land That Time Forgot (1974) – Based on an Edgar Rice Burroughs novel. During World War I, a freighter is sunk by a German U-boat. Some of the survivors and the crew end up on an uncharted island in the South Atlantic where they find several types of dinosaurs including a hungry pterodactyl.


7. The People That Time Forgot (1976) – Sequel to “The Land that Time Forgot”. A rescue party is looking for survivors but their amphibious plane is damaged by a pterodactyl and forced down on the island.


8. The Valley of Gwangi (1969) – In Mexico in the early 1900s, a struggling rodeo discovers Forbidden Valley. The valley has dinosaurs including a pterodactyl and a T-rex named Gwangi.


9. Voyage to the Planet of Prehistoric Women (1968) – An exploration party travels to Venus where they find all kinds of interesting things including a creature that resembles a pterodactyl. This movie is not to be confused with “Women of the Prehistoric Planet” (1966), or “Prehistoric Women” (1967), or “Voyage to the Prehistoric Planet” (1965).


10. When Dinosaurs Ruled the Earth (1970) – Two tribes of prehistoric people come into contact. Blonde cavewoman Sanna ends up with dark-haired caveman Tara. Tara is attacked by a pterodactyl but kills it.



Honorable Mention


11. The Lost World (1925) – A silent film based on a 1912 story by Arthur Conan Doyle (of Sherlock Holmes fame). Explorers find a land with dinosaurs on top of an almost-inaccessible plateau in South America. The special effects astonished movie goers of the time, many of whom believed the dinosaurs to be real.


12. The Land Unknown (1957) – An expedition to Antarctica is exploring with a helicopter when they collide with a pterodactyl and descend to an underground land with dinosaurs.


13. At the Earth’s Core (1976) – Based on an Edgar Rice Burroughs novel. A scientist creates an incredible burrowing machine. An expedition travels deep under the Earth where they find evil pterodactyl-like Mahars who cruelly rule human slaves.


14. Jurassic World (2015) – This is the fourth movie in the Jurassic series. Another attempt at a live-dinosaur amusement park. Horrifying sequence where a woman is grabbed by a pterosaur, dropped into a giant tank of marine dinosaurs, and then eaten.


15. Journey to the Beginning of Time (1955) – This Czech film has excellent special effects which hold up well even today. A group of young people travel down a river that leads them back in time.


16. Legend of Dinosaurs (1977) – A somewhat obscure Japanese film that features a Rhamphorhynchus flying reptile fighting against a Plesiosaurus aquatic dinosaur. Also known as “Legend of Dinosaurs & Monster Birds”.


17. Caveman (1981) – A comedy featuring Ringo Starr and Shelley Long. Ringo and his tribe-mates steal a pterodactyl egg which doesn’t please the mother.


Some Direct to TV or Video Films


Pterodactyl (2005) – Some American college students have the bad luck to be in Romania when some preserved pterodactyl eggs hatch.

Cowgirls vs. Pterodactyls (2021) – In the old West, a schoolteacher’s husband is snatched by a pterodactyl and so she enlists the help of a prostitute to save him.

Terrordactyl (2016) – A bunch of pterodactyls, plus one really huge pterodactyl, attack Los Angeles. This film is actually quite good.

Posted in Top Ten | Leave a comment

Serving Up PyTorch Training Data Using The DataLoader collate_fn Parameter

When creating a deep neural network, writing code to prepare the training data and serve it up in batches to the network is almost always difficult and time consuming. A regular PyTorch DataLoader works great for tabular style data where all inputs have the same length. For variable length inputs, one approach is to use a DataLoader with a custom collate_fn function.

I was looking at an example of natural language processing which uses an EmbeddingBag layer. Briefly, an EmbeddingBag treats a sentence as one whole unit and converts the sentence into a vector of numeric values. This is different from a regular Embedding layer that converts each word in a sentence to a numeric vector.

To handle the training data I needed to use a custom DataLoader. A regular DataLoader accepts a PyTorch Dataset object, which must be implemented to fetch one item at a time. A custom DataLoader accepts a list of tuples and uses a program-defined collate_fn function to parse the list of tuples.

The idea is quite tricky and is best explained by example. I started with 8 movie reviews:

0, This was a BAD movie.
1, I liked this film! Highly recommended.
0, Just awful
1, Good film, acting
0, Don't waste your time - A real dud
0, Terrible
1, Great movie.
0, This was a waste of talent.

A label of 0 means negative review, 1 means positive review. The first step is to create a Vocabulary object that maps each possible word and punctuation character to a numeric index. That’s a difficult problem in itself.

The next step is to read the data into memory as a list of tuples. The resulting list would be:

[(0, This was a BAD movie.),
 (1, I liked this film! Highly recommended.),
 . .
 (0, This was a waste of talent.)]

Next, the goal is to write code to batch up batch_size=3 reviews at a time and arrange them as a vector of labels, a vector of reviews converted to indexes, and a vector of offsets that indicate where each review starts. For example, the first batch would be:

labels : tensor([0, 1, 0])
reviews: tensor([ 4,  7,  3, 13,  6,  2,
                 19, 21,  4,  5,  9, 18, 24,  2,
                 20, 12])
offsets: tensor([ 0,  6, 14])

The first review is this=4, was=7, a=3, bad=13, movie=6, (period)=2. The three movie reviews are combined into a single tensor. The first review starts at [0], the second review starts at [6], and the third review starts at [14].

The code to convert the list of tuples into the desired format must be implemented in a collate_fn function that is passed to the DataLoader. Writing the collate_fn function was very tricky and difficult, and took me several hours.
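
Here is a minimal sketch of what such a collate_fn could look like. The tokenize() helper, which converts a review string into a list of vocabulary indexes, is hypothetical:

import torch

def collate_data(batch):
  # batch is a list of (label, review_text) tuples from the Dataset
  label_lst = []
  index_lst = []
  len_lst = [0]
  for (lbl, txt) in batch:
    label_lst.append(lbl)
    idxs = tokenize(txt)  # hypothetical: review text to list of word indexes
    index_lst.append(torch.tensor(idxs, dtype=torch.int64))
    len_lst.append(len(idxs))
  labels = torch.tensor(label_lst, dtype=torch.int64)
  reviews = torch.cat(index_lst)  # all reviews in one flat tensor
  offsets = torch.tensor(len_lst[:-1]).cumsum(dim=0)  # where each review starts
  return (labels, reviews, offsets)

# usage:
# train_ldr = torch.utils.data.DataLoader(review_list, batch_size=3,
#   shuffle=True, collate_fn=collate_data)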

The moral of the story is that working with natural language processing training data is difficult.



One of the many reasons why natural language processing problems are difficult is that English language words can have multiple meanings. The Merriam-Webster dictionary lists 23 different meanings for the word “model”. Left: The Ford Model T, built from 1908 to 1927, was the first affordable automobile. Left center: A fashion model is often an aspiration for what an ideal woman might look like. Right center: A scale model of a Hobbit house. Right: This image popped up from an Internet search for “machine learning model”. I must be doing something wrong — my ML models do not manifest themselves as glowing balls of energy.


Code below. Long.

Posted in PyTorch | Leave a comment

Integrated Gradient Interpretability Explained

One of many techniques to explain why a neural network produced a particular prediction is called integrated gradient. The idea is difficult to understand if you’re not familiar with it. So I’ll try to give an explanation that’s as informal as possible.

Suppose your problem scenario is to predict the letter defined by a 3×3 image. There are just 5 different letters: ‘C’, ‘H’, ‘L’, ‘T’, ‘Z’. Each of the 9 pixels is a grayscale value, 0 to 16. You have created a 9-(10-8)-5 deep neural network. One of your images is correctly classified as a ‘T’ but you want to know why your model made that prediction.

The integrated gradient technique computes a value between 0 and 1 for each pixel in the image being analyzed. The computed values are based on Calculus gradients which are kind of an indirect measure of error computed by a neural model.

You first set up a baseline image, typically all 0s or all 1s. Then you programmatically create several intermediate images that get closer and closer to the image to analyze. The integrated gradient technique can work with any kind of data, but image analysis is the most common scenario.

Next, you run each of the images through the trained model and compute the average gradient associated with each input pixel. This is tricky. Because the neural network is 9-(10-8)-5, each input node/pixel fans out through 10 weights to the first hidden layer. But you only want one gradient per input pixel, so you take the average of the 10 gradients for each input pixel.

At this point each of the 9 pixels has 4 average gradients (one for each of the 4 images you have). For each pixel, you integrate the curve you would get if you graphed those gradients against the interpolation steps. For example, suppose that for the pixel at (0,0) in the image to analyze, the 4 associated gradients are 0.2, 0.3, 0.6, 0.8.

Integrating means computing the area under the curve. Computing the exact integral is not feasible in practice, so you use a technique called a Riemann Sum to estimate it. You don’t really need to make a graph; you just estimate the integral as if there were one.
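
A minimal sketch of a trapezoid-style Riemann sum estimate, using the four example gradient values above:

import numpy as np

grads = np.array([0.2, 0.3, 0.6, 0.8])          # gradients for one pixel
alphas = np.linspace(0.0, 1.0, num=len(grads))  # interpolation steps from 0.0 to 1.0

# average adjacent gradients, then multiply by the step width
avg_grads = (grads[:-1] + grads[1:]) / 2.0
estimated_integral = np.sum(avg_grads) * (alphas[1] - alphas[0])
print(estimated_integral)  # approx. 0.467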

After you estimate the integrals for each of the 9 pixels, you normalize them so each is between 0.0 and 1.0. The normalized integrated gradient values are measures of how important each pixel was when making the prediction – a large value means that pixel was relatively more important than a pixel with a smaller value.


Example from TensorFlow documentation. Does not show the intermediate images used.

The TensorFlow documentation has a worked example. The image to analyze was a fireboat. The integrated gradient technique shows that it was the water spray from the boat that most influenced the model’s classification. This would maybe lead you to investigate how well the model would do when presented with an image of a fireboat that isn’t spraying water.

The integrated gradient technique is useful when a model makes an incorrect prediction too. The image below was incorrectly classified as a vacuum. When the integrated gradient analysis was applied, it showed that the narrow railing next to a wide staircase railing, which does in fact resemble an upright vacuum cleaner, was responsible for the misclassification.


A neural model incorrectly classified a child in a stairwell as a vacuum cleaner. You can see why — the classifier focused on the railings instead of the child.

The integrated gradient technique seems a bit overly-complex but the research paper explains the mathematical motivations. See “Axiomatic Attribution for Deep Networks” (2017) by M. Sundararajan, A. Taly, and Q. Yan.



Interpreting art images is probably more difficult than interpreting neural network image classifiers. Here are three mixed media portraits from an Internet image search for “enigmatic portrait”. I have no idea what any of these portraits mean but they look nice to me. Left: by artist Aiden Kringen. Center: by artist Arnaud Bauville. Right: by artist Mstislav Pavlov.

Posted in Machine Learning | Leave a comment

An Example of the Python SciPy line_search() Function

The term “line search” refers to a classical numerical optimization technique to minimize a function. The Python SciPy code library has a line_search() function that is a helper function for a line search but line_search() doesn’t actually do a line search. (Classic example of a bad function name.)

I did a quick Internet search, looking for a code example of using the line_search() function to actually perform a line search minimization and found . . . nothing. This raised an immediate red flag. I found a couple of examples of how to call the line_search() function, all of them copied from the SciPy documentation. But there were no examples of using line_search() to do a line search. So, I set out to do so.

First I set up a function to minimize: f(x,y) = 4x^2 + 3y^2 + 1. The function has an obvious minimum at (x,y) = (0,0), where the value of the function is 1.

The function is convex, which loosely means it doesn’t have severe peaks and valleys, which in turn means it’s relatively easy to find the minimum.

The line_search() function accepts a function to minimize, the gradient function for the function to minimize, a starting point, and a search direction. The function returns six values but the most important one is result[0], which is called the alpha value. The alpha value is a step size: you multiply the search direction by alpha and add the result to the start point, and the new point will be closer to the minimum.

For example, let f(x,y) = 4x^2 + 3y^2 + 1, and suppose the start point is xk = (10, 10) and the direction is pk = (-0.10, -0.10). And then suppose the alpha return result is 4.0. This means a new point that’s closer to the minimum is:

xk = xk + (alpha * pk)
   = (10, 10) + 4 * (-0.10, -0.10)
   = (10, 10) + (-0.40, -0.40)
   = (9.60, 9.60)

which is in fact closer to the minimum at (0, 0).

It turns out that line_search() returns None for the alpha value if it can’t make any progress. So, to find the minimum of a function func, using a gradient grad, the steps in pseudo-code are:

set start point xk
set direction pk
loop until alpha is None
  alpha = line_search(func, grad, xk, pk)
  xk = xk + (alpha * pk)
end-loop

I did some experiments and there were lots of problems. Briefly, line search worked well for a simple convex function with just one single independent variable, but line search didn’t work well for functions of two or more independent variables.

Just for fun, I coded up a demo to find the minimum of f(x,y) = 4x^2 + 3y^2 + 1 using a machine learning approach with a learning rate (instead of using the line_search() approach). In pseudo-code:

set start point xk
set learn rate lr
loop until satisfied
  g = grad(xk)
  xk = xk - (lr * g)
end-loop

The idea here is that the gradient gives information about what direction and how far to move xk so that the value of the function will get smaller.



Three examples of portraits from artists whose style is based on lines, from moderately line-focused to extremely line-focused. Left: By artist Ikenaga Yasunari. Center: By artist Mads Berg. Right: By artist Ichiro Tsuruta.


# line_search_demo.py

import numpy as np
from scipy.optimize import line_search
import warnings
warnings.filterwarnings("ignore")

def func(x):
  # x is a np.float32 vector
  return 4*(x[0]**2) + 3*(x[1]**2) + 1  # scalar

def grad(x):
  return np.array([8*x[0], 6*x[1]], dtype=np.float32)

def main():
  print("\nBegin minimize f(x,y) = 4x^2 + 3y^2 + 1 demo")

  print("\nGradient descent ala machine learning, LR = 0.10: ")
  xk = np.array([10.0, 10.0], dtype=np.float32)
  lr = 0.10
  for i in range(15):
    g = grad(xk)
    xk = xk - (lr * g)
    fv = func(xk)
    print("i = %4d  x,y = %0.6f %0.6f fv = %0.4f" % (i, xk[0], \
xk[1], fv))

  print("\nUsing scipy line_search(): ")
  xk = np.array([10.0, 10.0], dtype=np.float32)  # start
  pk = np.array([-0.10, -0.10], dtype=np.float32)  # direction

  for i in range(200):
    results = line_search(func, grad, xk, pk)  # will warn
    # print(results)
    alpha = results[0]
    if alpha is None:
      print("line_search done ")
      break

    xk = xk + alpha * pk
    fv = func(xk)
    print("i = %4d  x,y = %0.6f %0.6f fv = %0.4f" % (i, xk[0], \
xk[1], fv))

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
Posted in Machine Learning | Leave a comment