Generating IMDB Training and Test Files from Source Files

A well-known benchmark dataset for machine learning is the IMDB Movie Review Dataset. There are 50,000 written reviews that are labeled positive (good movie) or negative. Therefore, the dataset can be used to create a sentiment analysis model.

The Keras library has a pre-packaged version of the dataset but I wanted to generate training and test data myself, directly from the 50,000 source text files.

As I expected, generating the data from source material was very tricky and time-consuming. But I definitely learned a lot about processing files using Python.

I wrote a program called make_data_files.py that does just that. First I read 12,500 positive training reviews, 12,500 negative training reviews, 12,500 positive test reviews, and 12,500 negative test reviews into memory. Next I created a vocabulary dictionary of distinct words where the key is a word (like ‘the’) and the value is the rank by frequency (‘the’ is 1 because it’s most common).
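
The heart of the vocabulary step looks something like the sketch below. This is a simplified version, not the exact code in make_data_files.py; the all_reviews list and the crude whitespace tokenization are just illustrative assumptions.

# simplified sketch of the vocabulary-building step
import collections

def make_vocab(all_reviews):
  # all_reviews: list of raw review strings (all 50,000 reviews)
  counter = collections.Counter()
  for review in all_reviews:
    counter.update(review.lower().split())  # crude tokenization
  vocab = {}  # key = word, value = rank by frequency (1 = most common)
  for rank, (word, count) in enumerate(counter.most_common(), start=1):
    vocab[word] = rank
  return vocab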

I used the vocabulary dictionary to encode the raw text because neural models only understand numbers. I mostly followed the Keras format. Keras uses a value of 0 for padding when making all reviews the same length. Keras uses a value of 1 to indicate the start of a sequence — I had no use for this marker so I dropped it. Keras uses 2 to indicate “out-of-vocabulary”, an unknown word, and so did I.

Each encoded word is offset by 3. So ‘the’ maps to 1 (most frequent), but is encoded as 1 + 3 = 4. This leaves 0, 1, and 2 free for the special values. Weirdly, in the Keras scheme the value 3 is never used. OK, whatever.
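
To make the scheme concrete, here is a minimal encoding sketch, assuming a vocab dictionary like the one sketched above; the max_rank cutoff is just an illustrative parameter, not something from the real program.

# encode one review using the offset-of-3 scheme described above
def encode_review(review, vocab, max_rank=20000):
  # vocab: word -> frequency rank (1 = most common)
  # special values: 0 = padding, 1 = start (dropped), 2 = out-of-vocab
  encoded = []
  for word in review.lower().split():
    rank = vocab.get(word)
    if rank is None or rank > max_rank:
      encoded.append(2)         # unknown / out-of-vocabulary word
    else:
      encoded.append(rank + 3)  # 'the' has rank 1 so it encodes as 4
  return encoded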

After doing some experiments, I realized that the Keras version of the IMDB data has a different train-test split from the source data. For example, the raw data (from http://ai.stanford.edu/~amaas/data/sentiment/) has a short six-word negative review, “read the book forget the movie”, in the negative test data (file 6850_2.txt), but the Keras dataset stores this review as a negative training review. I’m not sure why this is so.

The moral of the story is that using pre-packaged data when exploring machine learning is convenient, but in a realistic scenario you have to generate the data yourself, and that’s almost always very difficult (and not very much fun — but developers’ definitions of fun are a bit different from those of ordinary people).



In spite of being interviewed on TV by a pretty news reporter, this man is probably not having fun.

Posted in Keras, Machine Learning | Leave a comment

I Give a Talk on LSTM Networks

I recently gave a talk on LSTM (“long short-term memory”) networks. An LSTM cell is a small but complex software module that can be used to make predictions for sequence data. In particular, LSTM networks can be used to make predictions for natural language problems.

What makes LSTM networks different from ordinary neural networks is that LSTMs have a memory. For example, if I asked you to predict what word follows, “I ordered ___ .” you’d have trouble. But if I asked you to predict what follows, “The choice was fish or steak and I was in the mood for seafood so I ordered ___ .” you’d guess “fish” because the context words gave you a big hint.

LSTMs are very complex and I struggled to explain them clearly. Based on my experience learning about LSTMs, I presented LSTMs from three perspectives: an architecture diagram, a set of math equations, and computer code.
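
For reference, here is the standard formulation (essentially the Wikipedia version) of a basic LSTM cell:

f(t) = sigmoid( Wf x(t) + Uf h(t-1) + bf )    # forget gate
i(t) = sigmoid( Wi x(t) + Ui h(t-1) + bi )    # input gate
o(t) = sigmoid( Wo x(t) + Uo h(t-1) + bo )    # output gate
c(t) = f(t) * c(t-1) + i(t) * tanh( Wc x(t) + Uc h(t-1) + bc )    # new cell state
h(t) = o(t) * tanh( c(t) )                    # new output

Here the W matrices act on the current input x(t), the U matrices act on the previous output h(t-1), and ‘*’ means element-wise multiplication.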

Many machine learning concepts and techniques are relatively simple. But not LSTMs — they are very difficult to fully understand. But like almost anything, it is possible to understand LSTMs. It just takes some time.



“F. D. Crockett and Steamer Piankatank off Stringray Point Circa 1930 – Chesapeake Bay”, John_Barber.

Posted in Machine Learning | Leave a comment

Inspecting the IMDB Dataset – Indexing

At any given time, I tend to have a lot of mini-projects going on. These are things that I know are going to take many days of work, so I spend an hour or two at a time on each. One such project is to fully understand the IMDB movie review sentiment analysis problem.

Completely describing the problem would take a full page, but briefly, there are 50,000 movie reviews written by ordinary people. Each review is labeled positive or negative. The goal is to build a prediction model.

Previously, I explored the IMDB dataset and discovered that there are about 90,000 distinct words used in the reviews. The most common word is “the”, which isn’t surprising.

Today I looked at how the data is encoded for use by the Keras library. The idea is that Keras has a pre-packaged version of the IMDB dataset so I’m going to dissect it, and then reproduce it from the raw data.

In my demo, I loaded the pre-packaged data but kept only the training reviews (25,000 of the 50,000 reviews) that were 12 words in length or fewer. It turns out that there are 8 of these super-short reviews. Seven were negative and one was positive.

The pre-packaged data has converted each word into an index value — neural models only understand numbers. From the output and the documentation, I see that 0 is used for padding and so doesn’t represent a word. The number 1 marks the start of a review. And a value of 2 represents an unknown word — the mapping indices were determined from the training data, so some words in the test data will be unknown. But I see a 2 in the training data — I don’t understand how that can happen.
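
For the record, the loading-and-filtering part of my demo was roughly like the sketch below. It’s a sketch only: it assumes the default load_data() arguments and that the leading start-of-sequence token is counted separately, so a review of 12 words or fewer shows up as 13 or fewer index values.

# rough sketch: find the super-short training reviews
from keras.datasets import imdb

(train_x, train_y), (test_x, test_y) = imdb.load_data()

for (review, label) in zip(train_x, train_y):
  if len(review) <= 13:  # start token + at most 12 word indices
    print(label, review)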

My next step will be to reverse map the index numbers into words to see exactly how the mapping works.
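
The reverse mapping should look roughly like the following sketch. It assumes the offsets described in the Keras documentation (0 = padding, 1 = start, 2 = unknown, actual word index = frequency rank + 3); the <PAD>, <START>, <UNK> labels are just my own placeholders.

# sketch: map index values back to words
from keras.datasets import imdb

word_to_idx = imdb.get_word_index()  # e.g. 'the' -> 1
idx_to_word = { (rank + 3) : word for (word, rank) in word_to_idx.items() }
idx_to_word[0] = "<PAD>"
idx_to_word[1] = "<START>"
idx_to_word[2] = "<UNK>"

def decode_review(encoded_review):
  return " ".join(idx_to_word.get(i, "<UNK>") for i in encoded_review)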



If I wrote a movie review of “Dr. No” (1962), it would definitely be positive. The first Bond film.

Posted in Keras, Machine Learning | Leave a comment

I Give a Talk on Neural Network Fundamentals

When learning most things related to computer science, I think the most difficult part is the first few steps. I recently gave a talk intended to be an absolute introduction to neural networks. This topic forced me to think long and hard about what to include in the talk, and more importantly, what not to include.

I used a combination of pictures and code. I strongly believe you can’t understand neural networks without seeing code, but I also strongly believe you need diagrams to understand the code.

My canonical demo program illustrated the NN input-output mechanism. This included:

The idea of nodes and layers
Weights and biases
Weight initialization
Sum-of-products computation
Hidden layer activation (just tanh for now)
Output layer softmax activation

As I just mentioned, the important idea here is what I left out of the discussion. Things like back-propagation and stochastic gradient descent, data normalization and encoding, cross entropy error, and so on, are must-know topics, but based on my experience, presenting such topics too early does more harm than good.

Anyway, good fun for me. No matter how many times I present a topic, I always gain a new insight or two.

# nn_io.py
# Anaconda3 (Python 3.5.2, NumPy 1.11.1)

import numpy as np
import math

def show_vector(v, dec):
  fmt = "% ." + str(dec) + "f" # like '% .4f'
  for i in range(len(v)):
    print(fmt % v[i] + '  ', end='')
  print('')
  
def show_matrix(m, dec):
  for i in range(len(m)):
    show_vector(m[i], dec)
  
# -----
	
class NeuralNetwork:

  def __init__(self, num_input, num_hidden, num_output):
    self.ni = num_input
    self.nh = num_hidden
    self.no = num_output
	
    self.i_nodes = np.zeros(shape=[self.ni], dtype=np.float32)
    self.h_nodes = np.zeros(shape=[self.nh], dtype=np.float32)
    self.o_nodes = np.zeros(shape=[self.no], dtype=np.float32)
	
    self.ih_weights = np.zeros(shape=[self.ni,self.nh],
      dtype=np.float32)
    self.ho_weights = np.zeros(shape=[self.nh,self.no],
      dtype=np.float32)
	
    self.h_biases = np.zeros(shape=[self.nh], dtype=np.float32)
    self.o_biases = np.zeros(shape=[self.no], dtype=np.float32)
	
    self.rnd = np.random.RandomState(1)
    self.initialize_weights()
 	
  def set_weights(self, weights):
    if len(weights) != self.total_weights(self.ni, \
      self.nh, self.no):
      print("Warning: len(weights) error in set_weights()")	

    idx = 0
    for i in range(self.ni):
      for j in range(self.nh):
        self.ih_weights[i,j] = weights[idx]
        idx += 1
		
    for j in range(self.nh):
      self.h_biases[j] = weights[idx]
      idx += 1

    for j in range(self.nh):
      for k in range(self.no):
        self.ho_weights[j,k] = weights[idx]
        idx += 1
	  
    for k in range(self.no):
      self.o_biases[k] = weights[idx]
      idx += 1
	  
  def get_weights(self):
    tw = self.total_weights(self.ni, self.nh, self.no)
    result = np.zeros(shape=[tw], dtype=np.float32)
    idx = 0  # points into result
    
    for i in range(self.ni):
      for j in range(self.nh):
        result[idx] = self.ih_weights[i,j]
        idx += 1
		
    for j in range(self.nh):
      result[idx] = self.h_biases[j]
      idx += 1

    for j in range(self.nh):
      for k in range(self.no):
        result[idx] = self.ho_weights[j,k]
        idx += 1
	  
    for k in range(self.no):
      result[idx] = self.o_biases[k]
      idx += 1
	  
    return result
 	
  def initialize_weights(self):
    num_wts = NeuralNetwork.total_weights(self.ni,
      self.nh, self.no)
    wts = np.float32(self.rnd.uniform(-0.01, 0.01,
     (num_wts)))
    self.set_weights(wts)

  def compute_outputs(self, x_values):
    print("\n ih_weights: ")
    show_matrix(self.ih_weights, 2)
	
    print("\n h_biases: ")
    show_vector(self.h_biases, 2)
	
    print("\n ho_weights: ")
    show_matrix(self.ho_weights, 2)
  
    print("\n o_biases: ")
    show_vector(self.o_biases, 2)  
  
    h_sums = np.zeros(shape=[self.nh], dtype=np.float32)
    o_sums = np.zeros(shape=[self.no], dtype=np.float32)

    for i in range(self.ni):
      self.i_nodes[i] = x_values[i]

    for j in range(self.nh):
      for i in range(self.ni):
        h_sums[j] += self.i_nodes[i] * self.ih_weights[i,j]

    for j in range(self.nh):
      h_sums[j] += self.h_biases[j]
	  
    print("\n pre-tanh activation hidden node values: ")
    show_vector(h_sums, 4)

    for j in range(self.nh):
      self.h_nodes[j] = self.hypertan(h_sums[j])
	  
    print("\n after activation hidden node values: ")
    show_vector(self.h_nodes, 4)

    for k in range(self.no):
      for j in range(self.nh):
        o_sums[k] += self.h_nodes[j] * self.ho_weights[j,k]

    for k in range(self.no):
      o_sums[k] += self.o_biases[k]
	  
    print("\n pre-softmax output values: ")
    show_vector(o_sums, 4)

    soft_out = self.softmax(o_sums)
    for k in range(self.no):
      self.o_nodes[k] = soft_out[k]
	  
    result = np.zeros(shape=self.no, dtype=np.float32)
    for k in range(self.no):
      result[k] = self.o_nodes[k]
	  
    return result
	
  @staticmethod
  def hypertan(x):
    if x < -20.0:    # clamp to avoid overflow for extreme inputs
      return -1.0
    elif x > 20.0:
      return 1.0
    else:
      return math.tanh(x)

  @staticmethod
  def softmax(o_sums):
    result = np.zeros(shape=[len(o_sums)], dtype=np.float32)
    div = 0.0
    for k in range(len(o_sums)):
      div += math.exp(o_sums[k])
    for k in range(len(result)):
      result[k] =  math.exp(o_sums[k]) / div
    return result
	
  @staticmethod
  def total_weights(n_input, n_hidden, n_output):
   tw = (n_input * n_hidden) + (n_hidden * n_output) + \
     n_hidden + n_output
   return tw

# end class NeuralNetwork

def main():
  print("\nBegin NN demo \n")

  num_input = 3
  num_hidden = 4
  num_output = 2
  print("Creating a %d-%d-%d neural network " \
    % (num_input, num_hidden, num_output) )
  nn = NeuralNetwork(num_input, num_hidden, num_output)
  
  print("\nSetting weights and biases ")
  num_wts = NeuralNetwork.total_weights(num_input, \
    num_hidden, num_output)
  wts = np.zeros(shape=[num_wts], dtype=np.float32)  # 26 cells
  for i in range(len(wts)):
    wts[i] = ((i+1) * 0.01)  # [0.01, 0.02, . . 0.26 ]
  nn.set_weights(wts)
 
  x_values = np.array([1.0, 2.0, 3.0], dtype=np.float32)
  print("\nInput values are: ")
  show_vector(x_values, 1)
  
  y_values = nn.compute_outputs(x_values)
  print("\nOutput values are: ")
  show_vector(y_values, 4)

  print("\nEnd demo \n")
   
if __name__ == "__main__":
  main()
Posted in Machine Learning | Leave a comment

Inspecting the IMDB Dataset – Vocabulary Set

I set out to explore the well-known IMDB movie review sentiment analysis problem. I quickly realized that the exploration was going to take several days of work, so I’m taking the process one step at a time. This is the very first step of an unknown number of steps.

The IMDB dataset consists of the text of 50,000 movie reviews from ordinary people. A review can be positive (rated 7-10 stars), negative (1-4 stars), or neutral (5 or 6 stars). Therefore you can use the dataset to train a sentiment analysis model.

The 50,000 reviews are randomly divided into a 25,000-item training set and a 25,000-item test set. Each set has 12,500 positive and 12,500 negative reviews (no neutral reviews).

The raw data is located at http://ai.stanford.edu/~amaas/data/sentiment/ but is very difficult to work with because each review is in a separate text file. The Keras library has a preprocessed version of the IMDB dataset so I’m using it.

In this first step I wanted to just examine the dataset vocabulary — in this case, how many distinct words are there, and what are the most common words. As it turns out, there are 88,584 distinct words in the positive-reviews training dataset. The five most common are “the”, “and”, “a”, “of”, “to”, which isn’t all that surprising. One of the most common words (about 20th as I recall) is “br” — because the reviews had HTML br tags and the Keras version of the dataset left them in.

There are many words that appear only once, for example “copywrite” (misspelled), “artbox” (huh?), and “l” (letter ell, probably intended to be the digit one).

My little exploration code loads the Keras version of the IMDB dataset into memory, then fetches the built-in dictionary where a word like “the” is the key and the value is the ranking by frequency — 1 for “the” because it is the most common.

# inspect_imdb_data.py

import numpy as np  # not used here
import keras as K
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

# allow the Windows cmd shell to deal with wacky characters
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)

from keras.datasets import imdb

print("\nInspect IMDB dataset \n") 

word_to_idx_dict = imdb.get_word_index()
n = len(word_to_idx_dict)
print("Number distinct words in reviews = %d \n" % n)

words_freq_list = []
for (k,v) in word_to_idx_dict.items():
  words_freq_list.append((k,v))

sorted_list = sorted(words_freq_list, key=lambda x: x[1])

print("Ten most common words: \n")
print(sorted_list[0:10])

print("\nLast five least common words: \n")
print(sorted_list[-5:])

I transferred the dictionary items into a list and sorted the list, which was an interesting sub-problem. And I ran into a very nasty issue where some of the reviews had wacky characters (typically non-English things like umlauts) and my Windows cmd shell couldn’t display them and blew up. I found a solution: redirecting stdout through a UTF-8 writer.

Anyway, I now understand the IMDB dataset vocabulary. Next step is — well, not sure yet. But eventually I will gain full understanding of the dataset.



The scene with thousands of steps from the 2006 movie “The Fall” was beautiful but scary.

Posted in Keras, Machine Learning | 1 Comment

Implementing an LSTM Cell using Python

Just for fun, while I was eating breakfast one morning, I decided to code up an LSTM cell using Python. So I did.

An LSTM cell is a complex software module that accepts input (as a vector), generates output, and maintains cell state. If you connect an LSTM cell with some additional plumbing, you get an LSTM network. These networks can be used with sequence data, such as a sequence of words in a sentence.

I used as my base reference the description given in the Wikipedia entry on the topic. There are many, many variations of LSTMs, and I used the simplest.

It was a good exercise and it reinforced my understanding of LSTMs and of the NumPy dot() function (matrix multiplication), the multiply() function (Hadamard, element-wise matrix multiplication), and element-wise addition (implemented with add() or the overloaded ‘+’ operator).

LSTMs are very interesting. At some point I’ll take a stab at hooking up a full LSTM network, and then training the network, which will not be a trivial task.

# lstm_io.py

import numpy as np
np.set_printoptions(precision=4)

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def compute_outputs(xt, h_prev, c_prev,
      Wf, Wi, Wo, Wc,
      Uf, Ui, Uo, Uc,
      bf, bi, bo, bc):

  ft = sigmoid(np.dot(Wf,xt) + np.dot(Uf,h_prev) + bf)
  it = sigmoid(np.dot(Wi,xt) + np.dot(Ui,h_prev) + bi)
  ot = sigmoid(np.dot(Wo,xt) + np.dot(Uo,h_prev) + bo)
  ct = np.multiply(ft, c_prev) + \
    np.multiply(it, np.tanh(np.dot(Wc,xt) + \
    np.dot(Uc, h_prev) + bc))
  ht = np.multiply(ot, np.tanh(ct))
  return (ht, ct)

# =========================================================

def main():
  print("\nBegin LSTM demo\n")

  xt = np.array([[1.0], [2.0]], dtype=np.float32)
  h_prev = np.zeros(shape=(3,1), dtype=np.float32)
  c_prev = np.zeros(shape=(3,1), dtype=np.float32)

  W = np.array([[0.01, 0.02],
                [0.03, 0.04],
                [0.05, 0.06]], dtype=np.float32)

  U = np.array([[0.07, 0.08, 0.09],
                [0.10, 0.11, 0.12],
                [0.13, 0.14, 0.15]], dtype=np.float32)

  b = np.array([[0.16], [0.17], [0.18]], dtype=np.float32)

  Wf = np.copy(W); Wi = np.copy(W)
  Wo = np.copy(W); Wc = np.copy(W)

  Uf = np.copy(U); Ui = np.copy(U)
  Uo = np.copy(U); Uc = np.copy(U)
  
  bf = np.copy(b); bi = np.copy(b)
  bo = np.copy(b); bc = np.copy(b)

  print("Sending input = (1.0, 2.0) \n")

  (ht, ct) = compute_outputs(xt, h_prev, c_prev, Wf, Wi,
    Wo, Wc, Uf, Ui, Uo, Uc, bf, bi, bo, bc)
  print("output = ")
  print(ht)
  print("")
  print("new cell state = ")
  print(ct)
  print("\n")

  h_prev = np.copy(ht)
  c_prev = np.copy(ct)
  xt = np.array([[3.0], [4.0]], dtype=np.float32) 

  print("Sending input = (3.0, 4.0) \n")

  (ht, ct) = compute_outputs(xt, h_prev, c_prev, Wf, Wi,
    Wo, Wc, Uf, Ui, Uo, Uc, bf, bi, bo, bc)
  print("output = ")
  print(ht)
  print("")
  print("new cell state = ")
  print(ct)
  
  print("\nEnd \n")

if __name__ == "__main__":
  main()
Posted in Machine Learning | 1 Comment

SAT Math Scores and Gender

I’ve taught at the high school and university levels, so I’m always interested in education. With the high school year coming to a close (in the U.S.), there’s a lot of interest in college admissions. Based on my experience, high school GPA and high school class rank are very limited as predictors of college success.

Grading in high school is incredibly subjective. And class rank is entirely dependent on the overall quality of a particular high school.

The SAT exam (Scholastic Aptitude Test) and the ACT exam (American College Testing) are much better indicators of academic aptitude, according to much research. I’d heard that the SAT exam had been heavily revised for 2017 so I thought I’d investigate.

It was quite difficult to find raw data — I rarely trust derived data on the Web. Eventually I found the official SAT scores. For example, see https://reports.collegeboard.org/pdf/total-group-2016.pdf.

The SAT has a math section and a verbal section (which combines reading and writing). I looked at the math scores (from 2000 to 2017) because evaluating writing is rather subjective too.

As the graph here shows, there was apparently a big change in the math part of the test because the scores for both males and females jumped greatly. Information about the changes is hazy, but as best I can determine, the SAT organization has been under pressure to reduce the difference between male and female scores, so the math section removed “quantitative comparison” questions, which were more difficult for females than for males. An example of such an eliminated question is: Which value is smaller, 5/x or 5/(2x), if x is a positive number?

So, what does this all mean? Statistics are just statistics and aren’t meaningful when applied to an individual person. The gap in math achievement between males and females, which has been essentially constant since 1972, persists. But what that gender achievement gap means is open to interpretation.

By the way, girls tend to score higher on the reading/writing section of the SAT. And for both math and reading/writing, the difference in SAT scores by race is huge (a full standard deviation) compared to the difference by gender. Again, what these differences mean is not relevant for an individual person.



Male brains (top) tend to differ significantly from female brains (bottom) in neural connectivity patterns. Source: Verma, R., Proceedings of National Academy of Sciences.

Posted in Miscellaneous | Leave a comment