Parzen Window Probability Density Function Estimation

Suppose you have a sample of data, for example a bunch of people’s heights, or maybe the times between arrivals of Web requests to a server. You might want to estimate the underlying probability density function (PDF) that generated the sample data.

In my two examples above, it’s well known that people’s heights usually follow a Normal, bell-shaped curve distribution. And times between arrivals often follow an Exponential distribution. But for many sets of sample data, the underlying distribution may be unknown.

There are dozens of classical statistics techniques to determine the underlying distribution for a set of sample data. One approach is called the Parzen window technique. It’s also known as kernel density estimation.

Briefly, if you have a set of n sample values X = (x1, x2, . . . , xn), you can estimate the probability density at any value x with:

f(x) = (1 / (n * h)) * Sum[i = 1 to n] K( (x - xi) / h )

h = 1.06 * sd(X) * n^(-1/5)

K(u) = (1 / sqrt(2 * pi)) * exp(-u^2 / 2)

Ugh. What a mess. But the equations aren’t as bad as they appear. The f is the approximating function. It needs a smoothing parameter h and a kernel function K. The h shown is called Silverman’s rule of thumb. The K shown is the Gaussian kernel.

I coded up a demo. First, I generated a sample of 30 values from a Normal distribution with mean = 0 and standard deviation = 1, which is a bell-shaped curve, centered about 0, with most data between -3 and +3. In a real problem I wouldn’t know the underlying distribution.

Then I estimated the PDF using the sample data. The graph shows the estimate is pretty close to the true distribution. Different choices of h and K would give significantly different results; having to pick h and K is a major weakness of Parzen window estimation.
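Here is a minimal sketch of that kind of demo, using just NumPy. This isn’t my exact demo code; the random seed and the test points are arbitrary.

# parzen_sketch.py
import numpy as np

np.random.seed(0)
sample = np.random.normal(loc=0.0, scale=1.0, size=30)  # 30 values from N(0,1)

n = len(sample)
h = 1.06 * np.std(sample) * n ** (-1.0 / 5.0)  # Silverman's rule of thumb

def kde(x, data, h):
  # Gaussian-kernel Parzen window estimate of the density at x
  u = (x - data) / h
  k = np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)
  return np.sum(k) / (len(data) * h)

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
  print("estimated f(%+.1f) = %.4f" % (x, kde(x, sample, h)))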

The moral of the story is that the more techniques you know, the more flexible you become. But some topics you can live without. I think Parzen window PDF estimation is probably too rarely used for you to spend much time on it. But it’s an interesting technique.



“The Goldfish Window” (1916), Frederick Childe Hassam. Currier Museum of Art, Manchester, NH

Posted in Miscellaneous

A Recap of Science Fiction Movies of 2017

Now that 2017 is well over, I’ve had a chance to see most of the year’s main science fiction films. It was an OK, but not great, year for science fiction. Here are my top ten science fiction films from 2017.

1. The Great Wall – Matt Damon, some other guy, and colorful Chinese soldiers battle scary Tao Tieh monsters attacking the Great Wall of China around 1050 AD. Most critics didn’t like this movie, but I thought it was brilliant. I give this (only marginally sci-fi) movie a solid A grade.


2. Life – Six politically-correct-and-marketing-friendly-diverse people (with Ryan Reynolds and Jake Gyllenhaal as the token over-represented group males) are the crew of an orbiting space station. An unmanned probe from Mars brings a tiny life form onto the station. What could go wrong? Option #1 – be careful with it. Option #2 – zap it with electricity and keep your hand real close to it. Had these people never seen ANY science fiction film before? My grade is a B.


3. Valerian and the City of a Thousand Planets – Oh my goodness. This film was produced and directed by Luc Besson, who did one of my favorite movies of all time, The Fifth Element. Luc, Luc, Luc. Why did you choose an actor for Valerian who has the physique of a 15-year-old girl? Why did you choose an actress for Laureline who is possibly the most annoying, whining woman in the universe? And what type of medication were you on when you wrote dialog such as, “Can I help? I’m a good driver.” In spite of everything, the movie held my interest and I give it a B- grade.


4. Alien: Covenant – Like many of my friends, I’m kind of aliened-out, so I had low expectations for this film. Bottom line: it was much better than I thought it’d be, especially coming after the absolutely terrible 2012 Prometheus. The plot was complicated but made sense, and the acting was excellent. I would have ranked this movie a step or two higher but it didn’t have anything wildly new. My grade is a B-.


5. Blade Runner 2049 – Sigh. A disappointment. Like most of my geek friends, I have the original 1982 Blade Runner on my list of all-time favorites. So I had huge expectations. Ryan Gosling tracks down rogue replicants (artificial humans). This is one of those films that seemed to do everything right (acting, story, cinematography) but for some reason the movie was less than the sum of its parts. I think the movie was just too long at over two and one-half hours. My grade is a C+.


6. Ghost in the Shell – If this movie had been made 15 years ago, it would have been incredible. But all the ideas in the movie — cyborgs, weird dystopian future — seemed like we’d seen them before. Because we have — many of these ideas were introduced in the original early 1990s Japanese manga. This movie was a lot of fun to look at but the plot was rather predictable. I didn’t like the vaguely creepy way Scarlett Johansson always looked somewhat naked. My grade is a C.


7. Star Wars: The Last Jedi – How do I explain why I didn’t like this movie? It is one of the better Star Wars movies (which isn’t saying very much) but the movie is a paint-by-the-numbers exercise in marketing. Fierce empowered (add other cliché adjectives here) heroine who becomes expert with light saber after 15 minutes of training from Luke? Check. Chubby repair woman who instantly becomes kick-butt fighter? Check. Obligatory daring multicultural kiss? Check. Cute big-eyed animals for marketing to children? Check. Evil rich corporate types mistreating poor people and space horses? Check. And why the heck did the good guys’ flying machines have to drag a leg-thing onto the snow, making them easy targets for the bad guys? The movie did hold my attention but still, it’s a pandering-to-the-masses turd. My grade is a C-.


8. Geostorm – I like Gerard Butler. I like eco-disaster sci-fi. I like special effects. But this movie was just boring, boring, boring. I can be entertained by almost any science fiction film, but this was a snoozer beyond belief. I give the film a D.


9. War for the Planet of the Apes – Another ape movie. Yawn. My grade is a D. Can we please stop? To be fair, the movie was well-liked by most critics and audiences, but I am tired of ape movies.


10. The Dark Tower – Bad acting, incomprehensible plot. Idris Elba is the last of the Gunslingers. Huh? Matthew McConaughey is the Man in Black, an ageless deceiver and sorcerer. Huh? I give this film a solid F. One of the worst major sci-fi movies I’ve seen in a long time.


Notes:

Guardians of the Galaxy Vol. 2 just didn’t make much of an impression, good or bad, on me.

Logan is a superhero movie, and I don’t count superhero movies as general science fiction films.

Power Rangers was a NO-OP for me. I did like the early 1990s original TV show episodes.

Posted in Top Ten

What is a Machine Learning Epoch?

There are several popular machine learning code libraries, including TensorFlow, CNTK, and Keras. The exact meaning of the term “epoch” can vary from library to library. I coded a little demo program to determine exactly what an epoch is for the Keras library.

My demo had 120 training items. I specified epochs = 2 and batch_size = 40. Because the batch size is 40, it takes 120 / 40 = 3 batches to process all 120 items, so the Keras fit() function (which does the training) performed 2 * 3 = 6 weight update operations.

The moral is that the number of epochs to specify depends in part on the batch size you choose.

Compared to TensorFlow and CNTK, Keras is the easiest library to use (although by no means is it easy to learn) but Keras is the least flexible (meaning customization is difficult). For my demo, I created a Callback object that prints messages at the beginning and end of each epoch and batch:

import keras as K  # the Callback base class lives in keras.callbacks

class MySnooper(K.callbacks.Callback):
  def on_batch_begin(self, batch, logs={}):
    print("  batch %d begin" % batch)

  def on_batch_end(self, batch, logs={}):
    print("  batch end")

  def on_epoch_begin(self, epoch, logs={}):
    print("epoch %d begin" % epoch)

  def on_epoch_end(self, epoch, logs={}):
    print("epoch end")

Then to train:

# set up train_x and train_y
# set up and compile model
my_snooper = MySnooper()
model.fit(train_x, train_y, batch_size=40, epochs=2,
  verbose=0, callbacks=[my_snooper])
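
With 120 items, batch_size=40, and epochs=2, the callback messages should appear in roughly this order (reconstructed from the print statements above, not captured verbatim):

epoch 0 begin
  batch 0 begin
  batch end
  batch 1 begin
  batch end
  batch 2 begin
  batch end
epoch end
epoch 1 begin
  batch 0 begin
  batch end
  batch 1 begin
  batch end
  batch 2 begin
  batch end
epoch end

Six batch begin/end pairs, which matches the 2 * 3 = 6 weight update operations.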

Learning how to use TensorFlow, CNTK, Keras, and other machine learning code libraries requires a lot of time and patience. But the process is intellectually stimulating and fun.



“Star Dancers Epoch” – Gabriel Gajdos

Posted in Keras, Machine Learning

Saving and Displaying Keras Model Weights

Keras is a code library for creating deep neural networks. After you create and train a Keras model, you can save the model to file in several ways. One Keras function allows you to save just the model weights and bias values. For example,

model.save_weights(".\\Models\\iris_model_wts.h5")

Somewhat unfortunately (in my opinion), Keras uses the HDF5 binary format when saving. I am not a fan of HDF5.

I coded up a demo to make sure I fully understood how Keras works when saving weights. I used the all-too-common Iris Dataset. My model had 4 input nodes for the sepal length, sepal width, petal length, petal width. I used just 2 hidden processing nodes for simplicity. The model had 3 output nodes for the probabilities of setosa, versicolor, virginica.
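
A minimal sketch of a 4-2-3 model along those lines (not my exact demo code; the activations and optimizer here are assumptions):

import keras as K

model = K.models.Sequential()
model.add(K.layers.Dense(units=2, input_dim=4,
  activation='tanh'))                   # 4 inputs -> 2 hidden nodes
model.add(K.layers.Dense(units=3,
  activation='softmax'))                # 3 output probabilities
model.compile(loss='categorical_crossentropy',
  optimizer='adam', metrics=['accuracy'])

# after training with model.fit(...):
model.save_weights(".\\Models\\iris_model_wts.h5")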

After training and saving, I made a prediction for an unknown Iris with features (6.0, 3.0, 5.0, 1.0) and got (0.0729, 0.7161, 0.2110).

In order to view an HDF5 binary file, you need to install HDF5 to get a boat load of programs including the h5dump utility. Luckily, I found a pre-built Windows self-extracting installer at https://support.hdfgroup.org/HDF5/release/obtain518.html.

First I got the structure of the weights file by entering the command:

h5dump -n iris_model_wts.h5

This showed me:

FILE_CONTENTS {
 group      /
 group      /dense_1
 group      /dense_1/dense_1
 dataset    /dense_1/dense_1/bias:0
 dataset    /dense_1/dense_1/kernel:0
 group      /dense_2
 group      /dense_2/dense_2
 dataset    /dense_2/dense_2/bias:0
 dataset    /dense_2/dense_2/kernel:0
 }

With that info, I could get the input-to-hidden weights by entering the command:

h5dump -d /dense_1/dense_1/kernel:0 iris_model_wts.h5

This showed me the 4×2 = 8 weights:

DATA {
 (0,0): 0.216208, -0.12038,
 (1,0): 0.489124, -1.38508,
 (2,0): -0.863519, 0.785177,
 (3,0): -0.986479, 0.993221
}

In a similar way, I got the hidden biases, the hidden-to-output weights, and the output biases. To verify the weights, I used Excel to manually calculate the output values for (6.0, 3.0, 5.0, 1.0) and got the correct (0.0729, 0.7161, 0.2110) result.

To recap, one way to view the weights and biases of a Keras neural network is to save the weights using model.save_weights() (or the entire model plus the weights using the model.save() function) as a binary HDF5 file and then view the file with the h5dump utility.

An easier alternative is to save the model using model.save(filepath), later load it using K.models.load_model(filepath), fetch the weights using model.get_weights(), and display the returned NumPy array results using the print() function.
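
That alternative approach looks roughly like this (a sketch; the file path is just an example):

import keras as K

model.save(".\\Models\\iris_model.h5")    # model structure + weights
model2 = K.models.load_model(".\\Models\\iris_model.h5")
wts = model2.get_weights()   # list of NumPy arrays: [W1, b1, W2, b2]
for arr in wts:
  print(arr)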

Conclusion: I don’t like HDF5. Much too complex.

Posted in Keras, Machine Learning

Preparing the MNIST Dataset for Use by Keras

The MNIST (Modified National Institute of Standards and Technology) image dataset is well-known in machine learning. It consists of 60,000 training images and 10,000 test images. Each image is 28×28 pixels (784 pixel values) and represents a handwritten digit from ‘0’ to ‘9’.

Keras is a popular machine learning library. There are plenty of examples of using Keras on the MNIST dataset but almost all start out along the lines of:

from keras.datasets import mnist  # get the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

That’s great if you want to load the pre-configured data file that comes with Keras, but what is going on behind the magic? And what if you need to load a dataset different from MNIST, one that’s not already available?

So I took a few minutes during my lunch break to explore getting MNIST and loading MNIST from scratch. None of the steps is difficult, but there are a lot of steps. At a high level, briefly:

1. get the four MNIST g-zipped files
2. unzip the four files (they’re now in binary)
3. convert the four binary files into text files
4. load the four text files into memory

The g-zipped files can be found at http://yann.lecun.com/exdb/mnist/. On Windows you need the 7-Zip utility program to extract the files.

I wrote a little Python program that reads an image file and its corresponding label file, converts the image pixel values to integers between 0 and 255, and converts the labels to one-hot vectors such as 0 0 1 0 0 0 0 0 0 0 for ‘2’.

To see if my conversion worked, I wrote a second utility program that loads an image file (train or test) into memory and then displays a specified image. My demo shows the first training image, which is a ‘5’.

The moral of the story is that when working with machine learning, preparing your data is a huge part of the overall effort. It’s not uncommon for data preparation to take 90% or even more of your total time and effort.

# convert_binary_to_text.py
#
# go to http://yann.lecun.com/exdb/mnist/ and
# download the four g-zipped files:
# train-images-idx3-ubyte.gz (60,000 train images) 
# train-labels-idx1-ubyte.gz (60,000 train labels) 
# t10k-images-idx3-ubyte.gz  (10,000 test images) 
# t10k-labels-idx1-ubyte.gz  (10,000 test labels) 
# 
# use the 7-Zip program to unzip the four files.
# I recommend adding a .bin extension to remind
# you they're in a proprietary binary format
#
# run the script twice, once for train data,
# once for test data, changing the 4 file names as appropriate

# import numpy as np
# import keras as K
# use pure Python only

def convert(img_bin_file, lbl_bin_file,
            img_txt_file, lbl_txt_file, n_images):

  img_bf = open(img_bin_file, "rb")    # binary image pixels
  lbl_bf = open(lbl_bin_file, "rb")    # binary labels

  img_tf = open(img_txt_file, "w")     # text image pixels
  lbl_tf = open(lbl_txt_file, "w")     # text labels

  img_bf.read(16)   # discard image header info
  lbl_bf.read(8)    # discard label header info

  for i in range(n_images):   # number images requested 

    # do labels first for no particular reason
    lbl = ord(lbl_bf.read(1))  # get label like '3' (one byte) 
    encoded = [0] * 10         # make one-hot vector
    encoded[lbl] = 1
    for k in range(10):   # k, not i, to avoid shadowing the outer loop variable
      lbl_tf.write(str(encoded[k]))
      if k != 9: lbl_tf.write(" ")  # like 0 0 0 1 0 0 0 0 0 0 
    lbl_tf.write("\n")

    # now do the image pixels
    for j in range(784):  # get 784 vals for each image file
      val = ord(img_bf.read(1))
      img_tf.write(str(val))
      if j != 783: img_tf.write(" ")  # avoid trailing space 
    img_tf.write("\n")  # next image

  img_bf.close(); lbl_bf.close();  # close the binary files
  img_tf.close(); lbl_tf.close()   # close the text files

def main():

  convert(".\\UnzippedBinary\\train-images.idx3-ubyte.bin",
          ".\\UnzippedBinary\\train-labels.idx1-ubyte.bin",
          ".\\mnist_train_images_3.txt",
          ".\\mnist_train_labels_3.txt",
          n_images = 3)  # first n images

if __name__ == "__main__":
  main()

# show_image.py

import numpy as np
import matplotlib.pyplot as plt

def display(img_txt_file, idx):
  # assumes an image file is 784 space-delimited
  # int values between 0-255

  data = np.loadtxt(img_txt_file, delimiter = " ")

  img = np.array(data[idx], dtype=np.float32)
  img = img.reshape((28,28))
  plt.imshow(img, cmap=plt.get_cmap('gray'))
  plt.show()

def main():
  print("\nBegin show MNIST image demo \n")

  img_file = ".\\mnist_train_images_3.txt"
  display(img_file, idx=0)  # first image

  print("\nEnd \n")

if __name__ == "__main__":
  main()
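
For completeness, step 4 (loading the text files into memory for use by Keras) might look roughly like this, assuming the file names created above. This is a sketch, not my exact code.

# load_mnist_text.py
import numpy as np

train_x = np.loadtxt(".\\mnist_train_images_3.txt",
  delimiter=" ", dtype=np.float32)   # shape (n_images, 784)
train_y = np.loadtxt(".\\mnist_train_labels_3.txt",
  delimiter=" ", dtype=np.float32)   # shape (n_images, 10), one-hot

train_x /= 255.0   # optional: scale pixel values to [0, 1]
print(train_x.shape, train_y.shape)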
Posted in Machine Learning

The Distance Between Two Non-Numeric Items

I came up with a novel way to measure the distance between two non-numeric items. Let me explain.

Suppose you have two numeric items like (5.0, 2.0, 3.0) and (4.0, 0.0, 3.0). You can easily calculate a distance between the items. For example, the Euclidean distance is sqrt((5-4)^2 + (2-0)^2 + (3-3)^2) = sqrt(5) = 2.24.

But if the items are non-numeric, calculating a meaningful distance is very hard. Suppose v1 = (red, short, heavy) and v2 = (blue, medium, heavy). You could say that the two items differ in two spots, so their distance is 2.0. But this doesn’t take into account that there could be dozens of colors (red, blue, green, . . . ) and only three lengths (short, medium, long). A full discussion of the difficulties of measuring distance between non-numeric items would take pages. Just trust me, it’s an extremely difficult problem.

My novel idea uses what’s called category utility (CU). CU is a metric I ran across in an obscure research paper several years ago. CU measures the goodness of a clustering of a dataset of non-numeric data. CU works by computing an information-theoretic measure of gain. Larger CU values mean more information gain due to the clustering.

So my idea is this. If you want to measure the distance between item v1 and v2 that belong to a non-numeric dataset, calculate a CU1 for a clustering that has item v1 in a cluster by itself (and all the other items in a second cluster). Then calculate CU2 for a clustering that has item v2 in a cluster by itself (and all the other items in a second cluster). The difference in CU values is a measure of the difference in information gain, and can be used as a distance.
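
A rough sketch of the idea in Python, using the standard definition of category utility (the function names here are just for illustration, not my actual demo code):

def category_utility(data, clustering, k):
  # data: list of tuples of categorical values, e.g. ('red', 'short', 'heavy')
  # clustering: cluster index (0 .. k-1) assigned to each item
  n = len(data)
  n_attrs = len(data[0])

  # unconditional term: sum over attributes and values of P(a = v)^2
  uncond = 0.0
  for a in range(n_attrs):
    counts = {}
    for row in data:
      counts[row[a]] = counts.get(row[a], 0) + 1
    uncond += sum((cnt / n) ** 2 for cnt in counts.values())

  # conditional term, weighted by P(cluster)
  cond = 0.0
  for c in range(k):
    members = [data[i] for i in range(n) if clustering[i] == c]
    if len(members) == 0:
      continue
    p_c = len(members) / n
    inner = 0.0
    for a in range(n_attrs):
      counts = {}
      for row in members:
        counts[row[a]] = counts.get(row[a], 0) + 1
      inner += sum((cnt / len(members)) ** 2 for cnt in counts.values())
    cond += p_c * inner

  return (cond - uncond) / k

def cu_distance(data, i, j):
  # distance = difference between CU of "item i alone vs. the rest"
  # and CU of "item j alone vs. the rest"
  n = len(data)
  clus_i = [0 if t == i else 1 for t in range(n)]
  clus_j = [0 if t == j else 1 for t in range(n)]
  return abs(category_utility(data, clus_i, 2) -
             category_utility(data, clus_j, 2))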

I coded up a demo, and somewhat to my surprise, the idea seems to work very well. This could be an important result and I wish I had time to explore it.

Posted in Machine Learning

Encoding Words for Machine Learning Analysis using Word2Vec

Neural networks understand only numbers. Therefore, if you are working with text, words must be converted into numbers. Suppose you have a corpus, meaning a document or set of documents of interest. You could assign an integer to each word. For example, if the text started with “In the beginning” then you could set “In” = 1, “the” = 2, “beginning” = 3, and so on.

But assigning values like this just doesn’t work very well because of how neural networks operate. Briefly, because 1 and 2 are close together numerically, “In” and “the” would be considered very close.

The Word2Vec (“word to vector”) system is one of the best ways to encode words. Briefly, each word is assigned a vector of numbers in a very clever way so that similar words have similar numeric values in the vector. There are several implementations of Word2Vec but I prefer the one in the gensim (the name originally stood for “generate similar” text) Python library.

I wrote a short demo. First I installed the gensim Python package using “pip install gensim”. Then I wrote a Python script. My dummy corpus consisted of just three sentences. In a real scenario, your corpus could be huge, such as all of Wikipedia, or hundreds of thousands of news stories. I hard-coded my corpus like so:

sentences = [['In', 'the', 'beginning', 'God', 'created',
              'the', 'heaven', 'and', 'the', 'earth.', 
              'And', 'the', 'earth', 'was', 'without',
              'form,', 'and', 'void;', 'and', 'darkness',
              'was', 'upon', 'the', 'face', 'of', 'the',
              'deep.', 'And', 'the', 'Spirit', 'of', 'God',
              'moved', 'upon', 'the', 'face', 'of', 'the',
              'waters.']]

In a real problem, setting up the corpus is the hard part. You have to deal with punctuation, capitalization, and so on. In this demo I hard-coded the corpus as a list-of-lists. In a non-demo scenario, I’d likely read a corpus from a (UTF-8) text file like: sentences = word2vec.Text8Corpus('C:\\Data\\Corpuses\\whatever.txt').

I built a model, specifying 10 values for each word vector (in a realistic large corpus, you’d use something like 100 or 200 values per word). Then I displayed the values for the word ‘earth’:

[ 0.01721778 -0.03160927 -0.01329765
-0.03671417 0.03356135 -0.03182576
-0.00196723 0.01548103 -0.02937444
0.04018674]

If you had a neural network, this is what you’d feed the network instead of the word ‘earth’. The Word2Vec library has all kinds of additional capabilities. It’s a remarkable library.

# word_to_vec_demo.py

from gensim.models import word2vec
import logging

logging.basicConfig(
  format='%(asctime)s : %(levelname)s : %(message)s',
  level=logging.INFO)

sentences = [['In', 'the', 'beginning', 'God', 'created', 'the',
 'heaven', 'and', 'the', 'earth.', 'And', 'the', 'earth', 'was',
 'without', 'form,', 'and', 'void;', 'and', 'darkness', 'was',
 'upon', 'the', 'face', 'of', 'the', 'deep.', 'And', 'the',
 'Spirit', 'of', 'God', 'moved', 'upon', 'the', 'face',  'of',
 'the', 'waters.']]

print("\nBegin training model on corpus")
model = word2vec.Word2Vec(sentences, size=10, min_count=1)
print("Model created \n")

print("Vector for \'earth\' is: \n")
print(model.wv['earth'])

print("\nEnd demo")


The Multiverse

Posted in Machine Learning