## Example of Kernel Density Estimation (KDE) Using SciPy

I was chit-chatting with a colleague a couple of days ago and the topic of kernel density estimation (KDE) came up. The first hurdle when trying to understand KDE is figuring out exactly what kind of problem the technique solves. Briefly, suppose you have a set of data points, for example the heights of 21 men, normalized by subtracting the average height and dividing by the standard deviation of the heights. The source data might look like [1.76, 0.40, 0.98, . . -2.55]. KDE is a classical statistics technique that can determine an estimated probability density function (PDF) from which your source data might have come. The blue-bars histogram is the source data. The red line is the estimated PDF computed from the source data.

A probability density function is a curve where the total area under the curve is 1.

Just for statistical hoots, I coded up a quick demo using the stats.gaussian_kde() function from the SciPy library. There are many ways to estimate a PDF. The “gaussian” in the name of the SciPy function indicates that many Gaussian kernel functions are used behind the scenes to determine the estimated PDF function.

In my demo, I hard-coded 21 data points that were loosely Gaussian distributed then used the stats.gaussian_kde() function to estimate the distribution from which the 21 data points were drawn. The graph looked pretty good.

I’ve been studying classical statistics and computational statistics for many years. I’ve never run into a real-life problem where I needed KDE, but you never know; it’s important to have as many tools in your technical skillset as possible.

Minimalist portraits are an estimate of reality. Left: By artist Jules Tillman. Center: By artist Malika Favre. Right: By artist Rokas Aleliunas.

```
# kde_demo.py
# estimate a PDF from data points using
# scipy stats.gaussian_kde()

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

print("\nBegin kernel density estimation demo ")

np.random.seed(0)

x_data = np.array([
   1.76,  0.40,  0.98,  2.24,  1.87, -0.98,
   0.95, -0.15, -0.10,  0.41,  0.14,  1.45,
   0.76,  0.12,  0.44,  0.33,  1.49, -0.21,
   0.31, -0.85, -2.55])
print("\nSource data points (normal): ")
print(x_data)

print("\nGenerating estimated PDF function from source x_data ")
gkde_obj = stats.gaussian_kde(x_data)

x_pts = np.linspace(-4, +4, 41)
print("\nFeeding points to KDE estimated PDF: ")
estimated_pdf = gkde_obj.evaluate(x_pts)

# print("\nEstimated y data points from KDE: ")
# print(estimated_pdf)

y_normal = stats.norm.pdf(x_pts)  # true standard normal PDF

plt.figure()
plt.hist(x_data, bins=7, density=True)
plt.plot(x_pts, estimated_pdf, label="kde estimated PDF",
  color="r")
plt.plot(x_pts, y_normal, label="standard normal PDF",
  color="g")
plt.legend()
plt.show()

print("\nEnd demo ")
```

## Dealing With PyTorch Library Versioning

It’s a major challenge to deal with the compatibility of the dozens of Python-based libraries needed to work with the PyTorch neural network library.

At the time, I was using PyTorch version 1.8 (for CPU) along with the Anaconda3-2020.02 distribution, which contains Python 3.7.6 and roughly 500 mostly-compatible Python libraries. For natural language processing I was using TorchText 0.9, which is significantly different from its 0.8 predecessor. For image processing I was using TorchVision 0.9 (the fact that both TorchText and TorchVision are at version 0.9 is a coincidence).

One morning, I noticed that PyTorch version 1.9 had been released a few days earlier. So I decided to install it and run my basic Iris Dataset demo program as a sanity check. I knew before I started that there would be problems. Briefly, I uninstalled PyTorch 1.8, and during installation of the 1.9 version I got a warning that my TorchVision 0.9 would not work with the upgraded PyTorch 1.9. This was somewhat expected: TorchText and TorchVision typically lag behind new versions of PyTorch by two to three months.

Furthermore, even though I didn’t get a warning about it, my existing TorchText 0.9 version also failed to load with the latest PyTorch 1.9 version. But my basic multi-class classification demo program on the Iris Dataset worked fine.

The moral to all of this is that even though Python / PyTorch versioning compatibility is much better than it used to be, versioning compatibility is still a major headache.

In a weird way, dealing with PyTorch versioning is a hidden barrier to entry for data scientists who are learning deep neural techniques with PyTorch. I often teach employees at my tech company, and I always tell beginners that they should not underestimate how difficult it is to deal with PyTorch versioning.

Versions of automobiles. A blue 1973 Porsche 911 had a big influence on my life. In 1973, one of my college roommates was Scott Putney. Scott’s father lived in Sherman Oaks and had his brand new blue Porsche Targa stolen. The car was miraculously recovered in Florida. Scott and I flew to Florida (actually Alabama), picked up the car and drove it to California. We had many adventures, including getting thrown in jail in Junction, Texas. I had great admiration for Scott’s father and always aspired to be successful enough to own my own Porsche someday. This seemed out of reach because I was poor and studying cognitive psychology in school, which wasn’t exactly a top money-making major.

Left: An early Porsche model 901 (from 1964). Center: A 1973 Targa model, just like the one Scott and I drove across the country. Right: I eventually fulfilled my goal when I bought a model 996 (1999). I still own the car, mostly as a reminder of the value of persistence and striving for a long-term goal. And I still stay in touch with Scott and our other roommate, Ed Koolish.

Code for my sanity check Iris Dataset demo program below. Long.

## Zoltar 2021 NFL Football Predictions – Preparing Schedule Data

Zoltar is my NFL football prediction computer program. It uses a deep neural network and a form of reinforcement learning. I’ve been running Zoltar for several years and always enjoy the challenge — success or failure is unambiguous.

It’s not enough to merely predict which team will win. A prediction system must predict which team will win and by how many points so a hypothetical wager can be placed.

Las Vegas maintains a point spread, which is best explained by example. Suppose the Rams are playing the Bears and the Vegas point spread is that the Rams will win by 5.0 points. This would normally be expressed as “Rams -5.0 Bears” (the favored team gets a negative point value).

I currently implement Zoltar using the C# language. The three key data files are 1.) schedule data, 2.) Las Vegas point spread data, 3.) results data.

If you place a wager on the Rams, you win your wager only if the Rams win by more than 5 points. You lose if the Rams win but by fewer than 5 points (i.e., 4 points or less), or if the Bears win by any score, or if the game is a tie. If you place a wager on the Bears, you win your wager if the Bears win by any score, or if the game is a tie, or if the Rams win but by fewer than 5 points. You lose your wager only if the Rams win by more than 5 points.

In all cases in the Rams-favored-by-5 example, if the Rams win by exactly 5 points, all wagers are a push and money is returned. Because this is a real nuisance, in many cases the Vegas point spread is something like “Rams -5.5 Bears” (the Rams are favored by 5.5 points). This prevents a push on the game because no team can win by x.5 points.
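The payoff rules above can be sketched as a small helper function. This is just an illustration I’m adding here (Zoltar itself is implemented in C#, and this hypothetical function is not part of it):

```python
def wager_result(spread, favorite_margin, bet_on_favorite):
  # apply the point-spread payoff rules described above
  # spread: points the favorite is favored by (e.g. 5.0 or 5.5)
  # favorite_margin: points the favorite actually won by
  #   (0 for a tie, negative if the favorite lost)
  # bet_on_favorite: True if the wager was on the favorite
  # returns "win", "lose", or "push"
  if favorite_margin == spread:
    return "push"  # money returned; impossible for x.5 spreads
  favorite_covers = favorite_margin > spread
  if bet_on_favorite:
    return "win" if favorite_covers else "lose"
  return "lose" if favorite_covers else "win"

# Rams -5.0 Bears examples:
print(wager_result(5.0, 7, True))   # Rams win by 7: Rams bet wins
print(wager_result(5.0, 4, False))  # Rams win by only 4: Bears bet wins
print(wager_result(5.0, 5, True))   # Rams win by exactly 5: push
```

Note that with a spread like 5.5, the first branch can never be taken for an integer margin, which is exactly why x.5 spreads prevent a push.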

I started preparing Zoltar for the 2021 season, which starts on Thursday, September 9, by setting up the schedule data. I got this information from the Web site at pro-football-reference.com. Dealing with machine learning data is always full of glitches. For example, schedules will surely change as games are postponed due to weather and other unforeseen circumstances. And the Washington Redskins team changed their name to the hideous “Washington Football Team” temporarily while they try to determine a better name. And there are tentatively two games that might be played in the UK; those will really be neutral-site games, which must be handled differently from regular games because the home field advantage is significant.

Anyway, I got Zoltar 2021 to compile and read in schedule data from a text file. Next, I’ll have to do some testing with dummy test game-results data.

Left: My system is named after the Zoltar fortune teller machine you can find in arcades. Center: A photo of the “sports book” at the MGM Grand hotel in Las Vegas, one of the world’s largest sports books. Right: An example of a sheet bettors can use to figure out what they want to bet on. The “PK” means “pick-em,” which means neither team is favored. The M/L is the “money line,” which is a different way to bet. If I had infinite time, I’d create a version of Zoltar that analyzes money line bets.

## Researchers Use Machine Learning Techniques to Detect Compromised Network Accounts on the Pure AI Web Site

I contributed to an article titled “Researchers Use Machine Learning Techniques to Detect Compromised Network Accounts” on the Pure AI web site. See https://pureai.com/articles/2021/07/06/ml-detect.aspx. The article describes how researchers and engineers (including me) developed a successful system that detects compromised enterprise network user accounts. The effort is named Project Qidemon. The core of Qidemon is a deep neural autoencoder architecture. In one experiment with real-life data, Qidemon examined 20,000 user accounts and successfully found seven of 13 known compromised accounts. In many scenarios, network users have broad permissions to view and modify resources. A single instance of a malicious network internal event can have catastrophic consequences for a company.

A deep neural autoencoder accepts input, such as network user login activity, and learns a model that predicts its input. The difference between actual input and computed output is called reconstruction error. The fundamental idea is that inputs that have high reconstruction error are anomalous, and therefore accounts associated with those inputs warrant close inspection. Internally, an autoencoder creates a compressed representation of its source data.

For example, suppose an input to a deep neural autoencoder is (time = 21:30:00, recent failed attempts = 1, IP address ID = 73027). A prediction output such as (time = 21:27:35, recent failed attempts = 1.5, IP address ID = 73027) would have low reconstruction error, but a prediction output such as (time = 06:25:00, recent failed attempts = 3.5, IP address ID = 99999) would have high reconstruction error.
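The reconstruction error idea can be sketched with a tiny NumPy function. This is a simplification I’m adding for illustration; the numeric encodings below are made up, and a real system would carefully normalize fields like time and IP address ID before comparing:

```python
import numpy as np

def reconstruction_error(x, x_hat):
  # mean squared difference between an input and the
  # autoencoder's predicted (reconstructed) output
  x = np.asarray(x, dtype=np.float64)
  x_hat = np.asarray(x_hat, dtype=np.float64)
  return np.mean((x - x_hat) ** 2)

# hypothetical input: (normalized time, failed attempts, normalized IP ID)
x     = [0.90, 1.0, 0.73]
close = [0.89, 1.5, 0.73]  # similar reconstruction: low error
far   = [0.27, 3.5, 0.99]  # dissimilar reconstruction: high error

print(reconstruction_error(x, close) < reconstruction_error(x, far))  # True
```

Accounts whose inputs consistently produce high reconstruction error are flagged as anomalous and inspected.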

The Qidemon system was compared to a system that used Principal Component Analysis (PCA). PCA is a classical statistics technique that is somewhat similar to a deep neural autoencoder in the sense that both models create a compressed representation of their source data. The compressed representation can be used to reconstruct the source data and so a reconstruction error metric can be calculated. Experiments showed that the Qidemon anomaly detection system significantly outperforms the PCA-based system.

I was quoted in the article:

McCaffrey noted that, “Deep neural systems have had amazing successes in areas such as image recognition, speech and natural language processing, but progress in security systems has been slower. In my opinion, this project represents a significant step forward.”

When playing poker, players who have anomalous hands (very good hands like a Full House or very bad hands like a Pair of Fours) try not to reveal anything with their expressions. This is called a poker face. Three images from the Internet of literal poker faces.

## Differential Evolution Optimization Example Using Python

An Evolutionary Algorithm (EA) is one of many algorithms that are loosely based on the biological ideas of genetic crossover and mutation. Differential Evolution (DE) is a specific type of EA that has a bit of structure. I’m very familiar with evolutionary algorithms, and I’d seen Differential Evolution mentioned several times in research papers, but I had never investigated DE closely; for me that means implementing an example in code. In high-level pseudo-code, one form of a basic Evolutionary Algorithm is:

```
create a population of possible solutions
loop
  pick two good solutions
  combine them using crossover to create child
  mutate child slightly
  if child is good then
    replace a bad solution with child
  end-if
end-loop
return best solution found
```

Differential Evolution differs from a general Evolutionary Algorithm in how the crossover and mutation operations are defined. In high-level pseudo-code, one basic form of DE is:

```
create a population of possible solutions
loop
  for-each possible solution
    pick three random solutions
    combine the three to create a mutation
    combine curr solution with mutation = child
    if child is better than curr solution then
      replace curr solution with child
    end-if
  end-for
end-loop
return best solution found
```

Just for fun I coded up a short demo of Differential Evolution. The goal was to solve the Rastrigin function with dim = 3. The Rastrigin function has many peaks and valleys which makes it difficult to solve. The function has a known minimum value of 0.0 at (0, 0, 0).

The implementation was easy for me because I’ve implemented similar algorithms many times.

Even though Differential Evolution algorithms have somewhat standard forms, there are still several variations of DE. For example, when picking three existing solutions to use for creating the mutation, instead of picking three at random you can pick the best solution plus two random solutions.

Good fun. Evolution of women’s and men’s hairstyles, 1950s through the 1990s.

Code below. Not thoroughly tested.

```
# diff_evo_demo.py
# use differential evolution to solve
# Rastrigin function for dim = 3
# Python 3.7.6

import numpy as np

def rastrigin_error(x, dim):
  # f(X) = Sum[xj^2 - 10*cos(2*pi*xj)] + 10n
  z = 0.0
  for j in range(dim):
    z += x[j]**2 - (10.0 * np.cos(2*np.pi*x[j]))
  z += dim * 10.0
  # return squared difference from true min at 0.0
  err = (z - 0.0)**2
  return err

def main():
  print("\nBegin Differential Evolution optimization demo ")
  print("Goal is to minimize Rastrigin function with dim = 3 ")
  print("Function has known min = 0.0 at (0, 0, 0) ")
  np.random.seed(1)
  np.set_printoptions(precision=5, suppress=True, sign=" ")

  dim = 3
  pop_size = 10
  F = 0.5   # mutation factor
  cr = 0.7  # crossover probability
  max_gen = 100
  print("\nSetting pop_size = %d, F = %0.1f, cr = %0.2f, \
max_gen = %d " % (pop_size, F, cr, max_gen))

  # create array-of-arrays population of random solutions
  print("\nCreating random solutions and computing their error ")
  population = np.random.uniform(-5.0, 5.0, (pop_size, dim))
  popln_errors = np.zeros(pop_size)
  for i in range(pop_size):
    popln_errors[i] = rastrigin_error(population[i], dim)
  # find initial best solution and its error
  # best_idx = np.argmin(popln_errors)
  # best_error = popln_errors[best_idx]

  # main processing loop
  for g in range(max_gen):
    for i in range(pop_size):  # each possible soln in pop
      # pick 3 other distinct possible solns
      indices = np.arange(pop_size)  # [0, 1, 2, . . ]
      np.random.shuffle(indices)
      for j in range(3):
        if indices[j] == i: indices[j] = indices[pop_size-1]

      # use the 3 others to create a mutation
      a = indices[0]; b = indices[1]; c = indices[2]
      mutation = population[a] + F * (population[b] \
        - population[c])
      for k in range(dim):  # clamp mutation to [-5, +5]
        if mutation[k] < -5.0: mutation[k] = -5.0
        elif mutation[k] > 5.0: mutation[k] = 5.0

      # use the mutation and curr item to create a
      # new solution item
      new_soln = np.zeros(dim)
      for k in range(dim):
        p = np.random.random()   # between 0.0 and 1.0
        if p < cr:  # usually
          new_soln[k] = mutation[k]
        else:
          new_soln[k] = population[i][k]  # use current item

      # replace curr soln with new soln if new soln is better
      new_soln_err = rastrigin_error(new_soln, dim)
      if new_soln_err < popln_errors[i]:
        population[i] = new_soln
        popln_errors[i] = new_soln_err

    # after all popln items have been processed,
    # find curr best soln
    best_idx = np.argmin(popln_errors)
    best_error = popln_errors[best_idx]
    if g % 10 == 0:
      print("Generation = %4d best error = %10.4f \
best_soln = " % (g, best_error), end="")
      print(population[best_idx])

  # show final result
  best_idx = np.argmin(popln_errors)
  best_error = popln_errors[best_idx]
  print("\nFinal best error = %0.4f  best_soln = " % \
    best_error, end="")
  print(population[best_idx])

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
```

## Computing PCA Using NumPy Without Scikit

Principal component analysis (PCA) is a classical statistics technique that can do data dimensionality reduction. This can be used to graph high-dimensional data (if you reduce the dim to 2), to clean data (by reconstructing the data from its reduced form), or to do anomaly detection (by comparing reconstructed data items to original source items).

When I need to do dimensionality reduction, I almost always use a deep neural autoencoder. But classical PCA can be useful in some situations, such as when you’re dealing with legacy code or when you want a baseline for comparison with other techniques.

The results on the left, from the Scikit PCA, are the same as the results from my NumPy-scratch function on the right. Principal components are allowed to vary by sign (which is a weird idea).

Computing PCA from absolute scratch is not feasible. PCA is one of the most complicated forms of numerical programming. At its heart PCA must compute eigenvalues and eigenvectors — extremely difficult. The most common way to perform PCA when using Python is to use the PCA() class in the Scikit library decomposition module. Using the Scikit PCA is fine but you take on a big dependency and get a lot of PCA functionality you may not need.

At one level of abstraction lower, you can compute PCA using either the NumPy or SciPy libraries. I set out one weekend morning to implement PCA using only NumPy. To cut to the chase, here is what I eventually came up with:

```
def my_pca(x):
  # returns transformed x, prin components, var explained
  dim = x.shape[1]  # n_cols
  means = np.mean(x, axis=0)
  z = x - means  # avoid changing x
  square_m = np.dot(z.T, z)
  (evals, evecs) = np.linalg.eig(square_m)  # 'right-hand'
  trans_x = np.dot(z, evecs[:,0:dim])
  prin_comp = evecs.T  # principal components are eigenvecs T
  v = np.var(trans_x, axis=0, ddof=1)  # col sample var
  sv = np.sum(v)
  ve = v / sv
  # order everything based on variance explained
  ordering = np.argsort(ve)[::-1]  # sort order high to low
  trans_x = trans_x[:,ordering]
  prin_comp = prin_comp[ordering,:]
  ve = ve[ordering]
  return (trans_x, prin_comp, ve)
```

The function accepts a NumPy 2D array. It emits the data in transformed form, a 2D array of principal components that can be used to reconstruct the source data, and an array that holds the percentage of variance explained by each component.

To test my sort-of-from-scratch PCA function, I loaded the 150-item Iris Dataset (it has dim=4) and computed a variety of PCA-related values using both the Scikit library and my own function. The results are almost the same — some of the principal components vary by sign. This is OK because principal components can vary by sign, without affecting any values that use them. (This is a strange idea but explaining would take pages, so just trust me).
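The sign ambiguity is easy to demonstrate with a small NumPy check (the matrix here is made up purely for illustration): if v is an eigenvector of A with eigenvalue lam, then -v is too, and np.linalg.eig() is free to return either one.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # small symmetric matrix

evals, evecs = np.linalg.eig(A)
v = evecs[:, 0]   # first eigenvector (column)
lam = evals[0]    # its eigenvalue

# both v and -v satisfy the eigenvector equation A v = lam v,
# so any eigenvector-based result can differ by sign
print(np.allclose(A @ v, lam * v))        # True
print(np.allclose(A @ (-v), lam * (-v)))  # True
```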

The trickiest part of my custom function was making sure that it returned everything (transformed data, principal components, variances) ordered by the percentage of variance explained by each component, so that output values matched the Scikit version.

In the end, my custom function isn’t directly useful: the code is a bit tricky and the small gain of avoiding the Scikit dependency isn’t worth the effort of a custom function. But the experiment was still a big win: I learned a lot and have a rock-solid understanding of how PCA works.

There are many science fiction movies where aliens disguise themselves as humans (a sort of human reconstruction). Left: “It Came From Outer Space” (1953). A spacecraft with large jellyfish-like aliens crashes in the desert. The aliens are benign and can take on the appearance of humans to collect things they need to repair their spaceship, but the aliens can’t act 100% human (reconstruction error). The situation is resolved peacefully and the aliens resume their journey. Center: “Invaders From Mars” (1953). Martians land on Earth. They are most definitely not benign. The Martians can control people by inserting a device into their necks. Again, the controlled humans act a bit oddly. The aliens are defeated in the end. Right: “Invasion of the Body Snatchers” (1956). Alien pods show up in a small California town. The pods grow into human shapes and then take over and dispose of the source human if the human falls asleep.


## Particle Swarm Optimization Variants

Particle swarm optimization (PSO) is a meta-heuristic that can be used to construct a specific algorithm to find the minimum of an error function. In theory, PSO could improve neural network training because PSO does not use calculus gradients like virtually every standard training algorithm (stochastic gradient descent and all of its variations, such as Adam optimization).

In practice, however, PSO is much too slow for very large, very deep neural networks. PSO maintains a collection (swarm) of possible solutions (particles). The “position” of a particle is a possible solution. For example, if some particle has position (4.5, -0.8, 3.2, 6.6, -1.7), this means (4.5, -0.8, 3.2, 6.6, -1.7) is a possible solution for, say, the five weights of a tiny neural network.

At a very high level, PSO looks like:

```
create a swarm of random possible solutions
loop max_iterations
  for-each particle/solution in swarm
    compute a velocity
    use velocity to update particle/solution
  end-for
end-loop
return best particle/solution found by any particle
```

In code, each particle/solution is repeatedly updated like so:

```
for k in range(dim):
  swarm[i].position[k] += swarm[i].velocity[k]
```

In words, each particle (swarm[i]) component (position[k]) is the current component plus a “velocity”. A new velocity is computed as:

```
swarm[i].velocity[k] = ( (w * swarm[i].velocity[k]) +
  (c1 * r1 * (swarm[i].best_part_pos[k] -
    swarm[i].position[k])) +
  (c2 * r2 * (best_swarm_pos[k] -
    swarm[i].position[k])) )
```

It’s not immediately obvious but the new velocity of a particle is based on the best position/solution the particle has seen so far (best_part_pos) and the best position/solution seen by any particle in the swarm (best_swarm_pos).

There are three constants (w, c1, c2) and two randomization terms (r1, r2). The w constant is called inertia, the c1 constant is called cognitive/particle/self, and the c2 constant is called social/swarm/group. Most implementations of PSO use w = 0.729, c1 = 1.49445, c2 = 1.49445.

Because PSO is a loosely defined meta-heuristic rather than a tightly defined algorithm, there are many hundreds of reasonable variations that researchers can explore.

Many of the PSO variations involve using variable/adaptive values for w, c1, c2 rather than fixed constants. Other PSO variations do periodic resets or use multiple swarms to prevent the swarm from focusing too closely on one solution (premature convergence) that might not be optimal.

Four examples of the adaptive approach include “Linear Decreasing Inertia Weight” PSO (LDIW-PSO) which slowly decreases the w-inertia value, “Chaotic Descending Inertia Weight” PSO (CDIW-PSO) which adjusts the w-inertia value in a more randomized way, “Random Inertia Weight and Evolutionary Strategy” PSO (REPSO) which uses random w-inertia values plus some ideas from simulated annealing optimization, and “Adaptive Particle Swarm Optimization” (APSO) which uses the mutation idea from genetic algorithms.

In some sense all of these variations of PSO are really just excuses to publish yet another research paper. The explorations are legitimate but they have almost no practical value: there are so many hyperparameters involved that any PSO variant can be shown to be better than any other variant by fiddling with the hyperparameter values.

But in the long run, optimization algorithms based on PSO and other similar meta-heuristics (spiral optimization, evolutionary algorithms, bacterial foraging optimization, fireworks optimization, etc.) have great potential when vastly increased processing power becomes available, perhaps through quantum computing breakthroughs, or perhaps just through a slow but steady increase in classical computing hardware.

An example of combining computer science with art. I stumbled across a 2020 research paper “Generative Art with Swarm Landscapes” by D. de Andrade, N. Fachada, C. Fernandes, and A. Rosa. The authors took the Mona Lisa, digitized it, and used the pixel values to define an error function. Then they applied particle swarm optimization and graphed the paths explored by the particles as they searched the image. Art like this is interesting and has a particular (semi-pun) form of beauty, but I much prefer human, non-computer-generated art.

Basic PSO demo code below.

```
# particle_swarm_demo.py
# python 3.7.6
# demo of particle swarm optimization (PSO)
# solves Rastrigin's function

import random
import math    # cos() for Rastrigin
import copy    # array-copying convenience
import sys     # max float

# ------------------------------------

def show_vector(vector):
  for i in range(len(vector)):
    if i % 8 == 0:  # 8 columns
      print("\n", end="")
    if vector[i] >= 0.0:
      print(' ', end="")
    print("%.4f" % vector[i], end="")  # 4 decimals
    print(" ", end="")
  print("\n")

def error(position):
  err = 0.0
  for i in range(len(position)):
    xi = position[i]
    # one of several minor Rastrigin variants
    err += (xi * xi) - (10 * math.cos(2 * math.pi * xi)) + 10
  return err

# ------------------------------------

class Particle:
  def __init__(self, dim, minx, maxx, seed):
    self.rnd = random.Random(seed)
    self.position = [0.0 for i in range(dim)]
    self.velocity = [0.0 for i in range(dim)]
    self.best_part_pos = [0.0 for i in range(dim)]

    for i in range(dim):
      self.position[i] = ((maxx - minx) *
        self.rnd.random() + minx)
      self.velocity[i] = ((maxx - minx) *
        self.rnd.random() + minx)

    self.error = error(self.position)  # curr error
    self.best_part_pos = copy.copy(self.position)
    self.best_part_err = self.error  # best error

def Solve(max_epochs, n, dim, minx, maxx):
  rnd = random.Random(0)

  # create n random particles
  swarm = [Particle(dim, minx, maxx, i) for i in range(n)]

  best_swarm_pos = [0.0 for i in range(dim)]  # not necess.
  best_swarm_err = sys.float_info.max  # swarm best
  for i in range(n):  # check each particle
    if swarm[i].error < best_swarm_err:
      best_swarm_err = swarm[i].error
      best_swarm_pos = copy.copy(swarm[i].position)

  epoch = 0
  w = 0.729     # inertia
  c1 = 1.49445  # cognitive (particle)
  c2 = 1.49445  # social (swarm)

  while epoch < max_epochs:

    if epoch % 10 == 0 and epoch > 1:
      print("Epoch = " + str(epoch) +
        " best error = %.3f" % best_swarm_err)

    for i in range(n):  # process each particle

      # compute new velocity of curr particle
      for k in range(dim):
        r1 = rnd.random()  # randomizations
        r2 = rnd.random()

        swarm[i].velocity[k] = ( (w * swarm[i].velocity[k]) +
          (c1 * r1 * (swarm[i].best_part_pos[k] -
          swarm[i].position[k])) +
          (c2 * r2 * (best_swarm_pos[k] -
          swarm[i].position[k])) )

        if swarm[i].velocity[k] < minx:
          swarm[i].velocity[k] = minx
        elif swarm[i].velocity[k] > maxx:
          swarm[i].velocity[k] = maxx

      # compute new position using new velocity
      for k in range(dim):
        swarm[i].position[k] += swarm[i].velocity[k]

      # compute error of new position
      swarm[i].error = error(swarm[i].position)

      # is new position a new best for the particle?
      if swarm[i].error < swarm[i].best_part_err:
        swarm[i].best_part_err = swarm[i].error
        swarm[i].best_part_pos = copy.copy(swarm[i].position)

      # is new position a new best overall?
      if swarm[i].error < best_swarm_err:
        best_swarm_err = swarm[i].error
        best_swarm_pos = copy.copy(swarm[i].position)

    # for-each particle
    epoch += 1
  # while
  return best_swarm_pos
# end Solve

print("\nBegin particle swarm optimization \
using Python demo\n")
dim = 3
print("Goal is to solve Rastrigin's function in " +
  str(dim) + " variables")
print("Function has known min = 0.0 at (", end="")
for i in range(dim-1):
  print("0, ", end="")
print("0)")

num_particles = 50
max_iterations = 100

print("Setting num_particles  = " + str(num_particles))
print("Setting max_iterations = " + str(max_iterations))
print("\nStarting PSO algorithm\n")

best_position = Solve(max_iterations, num_particles,
  dim, -10.0, 10.0)

print("\nPSO completed\n")
print("\nBest solution found:")
show_vector(best_position)
err = error(best_position)
print("Error of best solution = %.6f" % err)

print("\nEnd particle swarm demo\n")
```

## Sentiment Analysis Using a PyTorch EmbeddingBag Layer in Visual Studio Magazine

I wrote an article titled “Sentiment Analysis Using a PyTorch EmbeddingBag Layer” in the July 2021 edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/07/06/sentiment-analysis.aspx. Natural language processing (NLP) problems are very difficult. A common type of NLP problem is sentiment analysis. The goal of sentiment analysis is to predict whether some text is positive (class 1) or negative (class 0). For example, a movie review of “This was the worst film I’ve seen in years” would be classified as negative.

In situations where the text to analyze is long, say several sentences with a total of 40 words or more, two popular approaches for sentiment analysis are to use an LSTM (long short-term memory) network or a Transformer Architecture network. These two approaches are very difficult to implement. For situations where the text to analyze is short, the PyTorch code library has a relatively simple EmbeddingBag class that can be used to create an effective NLP prediction model. In my article, I present a complete end-to-end demo. The source data is 20 short movie reviews. I explain nine steps:

1. How to create and use a tokenizer object
2. How to create and use a Vocab object
3. How to create an EmbeddingBag layer and use it in a neural network
4. How to design a custom collating function for use by a DataLoader
5. How to design a neural network that uses all these components
6. How to train the network
7. How to evaluate the prediction accuracy of the trained model
8. How to use the model to make a prediction for a movie review
9. How to integrate all the pieces into a complete working program

The two key distinguishing characteristics of an NLP system that uses an EmbeddingBag layer are 1.) an EmbeddingBag layer system is much simpler than a system that uses a regular Embedding layer. Note however that EmbeddingBag layer systems are still very complex. And 2.) an EmbeddingBag layer system does not use information about the order of words in a sentence. This means EmbeddingBag layer systems work best with short-input sentences (perhaps about 20 words or fewer).

Deep neural systems have completely revolutionized natural language processing. Years ago, I worked on one of the very first Internet search engines, before Google existed. Dealing with free-form input that users typed into the search box was a big headache. We had to do all kinds of crazy things like handcrafted stemming (reducing words to simple form) and lemmatization (dealing with different contexts, such as the meaning of “pound” to a doctor in the US and to a banker in the UK).

Airport Bags. Left: We all know some women who pack like this for a two-day trip. Left-Center: Anyone who has children can understand this photo of a child embedded into a bag. Right-Center: Many people put something on their luggage to easily identify it. Right: Rolling luggage has wheels on it for a purpose, but maybe not this.

## Jensen-Shannon Distance Example

The Jensen-Shannon distance measures the difference between two probability distributions. For example, suppose P = [0.36, 0.48, 0.16] and Q = [0.30, 0.50, 0.20]. The Jensen-Shannon distance between the two probability distributions is 0.0508. If two distributions are the same, the Jensen-Shannon distance between them is 0. Jensen-Shannon distance is based on the Kullback-Leibler divergence. In words, to compute Jensen-Shannon between P and Q, you first compute M as the average of P and Q, and then Jensen-Shannon is the square root of the average of KL(P,M) and KL(Q,M). In symbols:

JS(P,Q) = sqrt( [KL(P,M) + KL(Q,M)] / 2 )
where M = (P + Q) / 2

So the key to computing JS is understanding how to compute KL.

KL(P,Q) = Sum_i( p[i] * ln(p[i] / q[i]) )

Here’s a worked example of Jensen-Shannon distance:

```P = [0.36, 0.48, 0.16]
Q = [0.30, 0.50, 0.20]

M = 1/2 * (P + Q)
  = [0.33, 0.49, 0.18]

KL(P,M) = 0.36 * ln(0.36 / 0.33) +
          0.48 * ln(0.48 / 0.49) +
          0.16 * ln(0.16 / 0.18)
        = 0.002582

KL(Q,M) = 0.30 * ln(0.30 / 0.33) +
          0.50 * ln(0.50 / 0.49) +
          0.20 * ln(0.20 / 0.18)
        = 0.002580

JS(P,Q) = sqrt[ (KL(P,M) + KL(Q,M)) / 2 ]
        = sqrt[ (0.002582 + 0.002580) / 2 ]
        = 0.050803
```

The Jensen-Shannon distance is symmetric, meaning JS(P,Q) = JS(Q,P). This is in contrast to Kullback-Leibler divergence which is not symmetric, meaning KL(P,Q) != KL(Q,P) in general.
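To see the asymmetry of KL concretely, here is a quick NumPy check using the same P and Q (a minimal sketch; for these two particular distributions the KL values happen to be nearly equal, but they are not identical):

```python
# kl_asymmetry.py
# Kullback-Leibler divergence is not symmetric: KL(P,Q) != KL(Q,P).

import numpy as np

def KL(p, q):
  # Kullback-Leibler divergence "from q to p", using natural log
  return float(np.sum(p * np.log(p / q)))

p = np.array([0.36, 0.48, 0.16])
q = np.array([0.30, 0.50, 0.20])

print("KL(P,Q) = %0.6f" % KL(p, q))  # 0.010338
print("KL(Q,P) = %0.6f" % KL(q, p))  # 0.010343
```

Jensen-Shannon distance avoids this asymmetry by measuring each distribution against the common average M.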

Jensen-Shannon is not used very often. I’m not sure why this is so.

The Python SciPy library has an implementation of Jensen-Shannon distance (scipy.spatial.distance.jensenshannon), but JS is easy to compute from scratch using a program-defined function if you want to avoid an external dependency.

Jensen-Shannon and Kullback-Leibler are both examples of mathematical f-divergence functions that calculate the difference between two probability distributions. Other f-divergences include Hellinger distance, Pearson Chi-Square divergence, and alpha divergence.
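As a small side illustration (my own sketch, not part of the article), Hellinger distance is another f-divergence that is easy to compute from scratch for the same P and Q:

```python
# hellinger_demo.py
# Hellinger distance between two discrete probability distributions:
# H(P,Q) = (1/sqrt(2)) * || sqrt(P) - sqrt(Q) ||_2

import numpy as np

def hellinger(p, q):
  # scaled Euclidean distance between the square-root vectors
  return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q))**2)) / np.sqrt(2.0))

p = np.array([0.36, 0.48, 0.16])
q = np.array([0.30, 0.50, 0.20])

print("Hellinger(P,Q) = %0.6f" % hellinger(p, q))  # 0.050824
```

Like Jensen-Shannon distance, Hellinger distance is symmetric and always falls between 0 and 1.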

Jensen-Shannon distance combines two simple ideas: the average of two probability distributions and Kullback-Leibler divergence. The result is useful. I don’t know much about jewelry, but I’ve always thought that semi-precious stones like onyx and malachite are far more interesting than fancy gems like diamonds and rubies. And the combination of simple stones can be very appealing (to my eye anyway). Left: A necklace made from lapis lazuli and turquoise. Center: A necklace made from coral and onyx. Right: A necklace made from malachite and amethyst.

```# jensen_shannon_distance.py
# example of Jensen-Shannon distance

import numpy as np
from scipy.spatial import distance

def KL(p, q):
  # Kullback-Leibler divergence "from q to p"
  # p and q are np array prob distributions
  n = len(p)
  total = 0.0
  for i in range(n):
    total += p[i] * np.log(p[i] / q[i])
  return total

def JS(p, q):
  m = 0.5 * (p + q)  # avg of P and Q
  left = KL(p, m)
  right = KL(q, m)
  return np.sqrt((left + right) / 2)

def main():
  print("\nBegin Jensen-Shannon distance demo ")
  np.set_printoptions(precision=4, suppress=True, sign=" ")

  p = np.array([0.36, 0.48, 0.16], dtype=np.float32)
  q = np.array([0.30, 0.50, 0.20], dtype=np.float32)

  print("\nThe P distribution is: ")
  print(p)
  print("The Q distribution is: ")
  print(q)

  js_pq = JS(p, q)
  js_qp = JS(q, p)

  print("\nJS(P,Q) dist = %0.6f " % js_pq)
  print("JS(Q,P) dist = %0.6f " % js_qp)

  js_sp = distance.jensenshannon(p, q)
  print("\nJS(P,Q) dist using scipy = %0.6f " % js_sp)

  print("\nEnd demo ")

if __name__ == "__main__":
  main()
```

## Whatever Happened to Percolation Theory?

Many years ago I was quite interested in mathematical percolation theory. But over the past 10 years or so I haven’t seen many new research papers published on the topic.

This figure illustrates some of the main ideas of percolation: if some edges are randomly connected, how likely is it that there’s a path from the top to the bottom? In this example, there’s a 7×7 square grid of nodes, which gives 84 possible edges (42 left-right and 42 up-down). Edges were connected randomly with p = 0.4 (which means the probability that an edge wasn’t connected was 0.6). After the random connections were added, 27 of the 84 edges were connected, so the actual proportion of connected edges in this particular grid is 27/84 ≈ 0.32.

The resulting grid does in fact have a connection from top to bottom. The connected path contains 12 of the 27 connected edges, so 12/27 ≈ 0.44 of the connected edges are part of the path.

There has been a huge amount of research done on percolation theory. One of the best-known results is that, for bond percolation on a square grid, the critical probability of connection is 0.5: if edges are connected with probability greater than 0.5, a connected path from top to bottom is likely (and for large grids, nearly certain). Because this example used p = 0.4, it was a bit lucky that a connected path exists.
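A claim like this is easy to check by simulation. The sketch below (my own code, not from the post) uses union-find to estimate the probability that a randomly wired square grid percolates from top to bottom, using the 7×7 grid from the example and comparing p = 0.4 (below the threshold) with p = 0.6 (above it):

```python
# percolation_sim.py
# Monte Carlo estimate of the chance that a randomly wired square
# grid of nodes has a connected path from the top row to the bottom row.

import random

def find(parent, x):
  # find root with path halving
  while parent[x] != x:
    parent[x] = parent[parent[x]]
    x = parent[x]
  return x

def union(parent, a, b):
  ra, rb = find(parent, a), find(parent, b)
  if ra != rb:
    parent[ra] = rb

def percolates(n, p, rng):
  # cells 0..n*n-1, plus virtual top (n*n) and bottom (n*n+1) nodes
  parent = list(range(n * n + 2))
  top, bot = n * n, n * n + 1
  for c in range(n):
    union(parent, c, top)                 # top row touches virtual top
    union(parent, (n - 1) * n + c, bot)   # bottom row touches virtual bottom
  for r in range(n):
    for c in range(n):
      node = r * n + c
      if c + 1 < n and rng.random() < p:  # left-right edge connected
        union(parent, node, node + 1)
      if r + 1 < n and rng.random() < p:  # up-down edge connected
        union(parent, node, node + n)
  return find(parent, top) == find(parent, bot)

def estimate(n, p, trials, seed=0):
  rng = random.Random(seed)
  hits = sum(percolates(n, p, rng) for _ in range(trials))
  return hits / trials

print("7x7 grid, p=0.4: est. percolation prob = %0.3f" % estimate(7, 0.4, 2000))
print("7x7 grid, p=0.6: est. percolation prob = %0.3f" % estimate(7, 0.6, 2000))
```

On a small 7×7 grid the transition is not sharp, so p = 0.4 still percolates a noticeable fraction of the time, but the estimated probability jumps well above one-half once p exceeds 0.5.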

Note: Let me point out that I have (deliberately) butchered several of the vocabulary terms and a few of the ideas to keep the explanation simple and understandable. If you are a mathematician who works with percolation theory, please don’t slaughter me in the Comments to this post.

As with any math topic, there are gazillions of generalizations. There are many kinds of grids (more accurately lattices), many kinds of connections (bond vs. site, etc.), 3D versions, n-dimensional versions, and on and on. Percolation theory has many relationships with graph theory, fractals, lattice theory, cellular automata, and other areas of math.

So, back to my original question: Why are there so few new research papers on percolation theory? First, maybe there are lots of new research papers on percolation and I’m just not seeing them. But if the number of research papers is in fact declining, I suspect that either 1.) percolation theory has not proven to be particularly useful in applied areas (such as oil moving through porous rock, or materials science), or 2.) all the main ideas in percolation theory have been explored. But this is speculation on my part. Percolation means “the slow passage of a liquid through a filtering medium.” But in informal usage, percolation means giving some thought to a plan or idea. Here are four examples from an Internet search for “bad idea crime” where criminals should have percolated a bit more. (Interestingly, none of the stories with these photos identified the criminal — perhaps too embarrassing). Left: A man in a wheelchair in Brazil tries to rob a store while holding a gun with his feet. Did not go well. Left-Center: A man in Philadelphia tries to rob a store using a banana. Not a good idea. Right-Center: A man in Missouri robbed a convenience store, but a few hours later he tried to grab a gun away from a police officer. Did not go well. Right: A woman in Ohio tried to disguise herself using a cow costume. The mugshot shows her plan needed more thought.