Getting the IMDB Dataset for PyTorch – The Bad Old torchtext Way

The ultimate goal of a project I’ve been working on is to create a prediction system on the IMDB data using a from-scratch Transformer built with PyTorch. A major sub-problem is writing code to read the IMDB data into memory and serve it up in batches for training.

The IMDB dataset has 25,000 movie reviews for training, and 25,000 for testing. There are 12,500 positive reviews and 12,500 negative reviews in each set. The raw IMDB data is a beast to work with because each review is in a separate text file. Yes, that’s 50,000 separate text files.

A screenshot is worth a thousand words.

PyTorch has a companion torchtext package for datasets, but torchtext is completely out-of-date and is in the process of being totally replaced. I didn’t want to wait until the new torchtext is ready (I’ve already waited months), so I decided to bite the bullet and work with the current, poorly designed torchtext APIs.

I wrote a demo program. The first problem was that torchtext doesn’t have an easy way to just read part of the data. When experimenting, I didn’t want to wait for two minutes each time just to load data. So, I ran my program once to get all data stored on my local machine, and then pruned away all but 100 positive training, 100 negative training, 100 positive test, and 100 negative test reviews.

Next, I spent several hours experimenting and reading documentation, until I was satisfied I knew enough about loading IMDB data (in the bad old way) to move on to the Transformer code.

First, I created two Field objects:

import torch as T
import torchtext as tt
import numpy as np
device = T.device("cpu")

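# NOTE: the start of this call (and the LABEL definition) is missing
# from the post; the legacy tt.data.Field constructor is assumed
TEXT = tt.data.Field(include_lengths=True,
  batch_first=True, tokenize="basic_english")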

This code would take a couple of pages to fully explain, but the main gist should be clear. Next, I called the torchtext IMDB() class to read the 400 reviews into memory, and then created a vocabulary of the 3,500 most common words in the 200 training reviews:

bat_sz = 2
max_review_len = 60
max_vocab_words = 3_500

train_ds, test_ds = tt.datasets.IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train_ds, max_size=max_vocab_words-2) 

print("\nFirst 6 words in vocabulary: ")
for i in range(6):
  s = TEXT.vocab.itos[i]
  print(i, s)

This code also is very complicated. Each word gets an ID. ID = 2 is the most common word, which is actually the period character. IDs 3, 4, 5 are “the” “,” “and” as you might expect. ID = 0 is reserved for words that don’t appear in the vocabulary and are therefore unknown. ID = 1 is reserved for padding to make every movie review in a batch of reviews the same length. (I used a batch size of 2 to keep things simple.)
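The ID scheme can be illustrated without torchtext. The sketch below is a plain-Python approximation, not the torchtext implementation: the function name build_vocab, the special token strings, the alphabetical tie-breaking, and the two toy reviews are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(token_lists, max_size):
    # IDs 0 and 1 are reserved: 0 = unknown word, 1 = padding
    itos = ["<unk>", "<pad>"]
    counts = Counter(tok for toks in token_lists for tok in toks)
    # most frequent tokens first; ties broken alphabetically (an assumption)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos += [tok for tok, _ in ordered[:max_size]]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

reviews = [["the", "movie", "was", "great", "."],
           ["the", "plot", "was", "thin", "."]]
itos, stoi = build_vocab(reviews, max_size=4)

# words that are not in the vocabulary map to ID 0
ids = [stoi.get(tok, 0) for tok in ["the", "acting", "was", "great", "."]]
print(itos)
print(ids)
```

Here "acting" is not among the four most common tokens, so it maps to the unknown ID of 0.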

Next, I used the wacky BucketIterator to batchify two reviews at a time. I had to artificially set shuffle=False because there’s no way to get reproducible results otherwise. Ugly. The BucketIterator tries to fetch reviews with about the same length so that padding will be minimized.

I enumerated the training iterator (which is another rabbit hole because it isn’t a regular Python iterator) and filtered just for batch [33], which had two very short reviews. The code prunes any review that’s longer than 60 words down to 60 words (actually tokens rather than words).

train_itr, test_itr = \
  tt.data.BucketIterator.splits((train_ds, test_ds), \
  shuffle=False, batch_size=bat_sz)

print("\nbatch[33] = \n")
for bat_idx, batch in enumerate(train_itr):
  if bat_idx != 33: continue  # show just review [33]
  print("\nbat_idx = " + str(bat_idx))
  inpt = batch.text[0].to(device)
  lbl = (batch.label - 1).to(device)  # ??

  if inpt.size(1) > max_review_len:
    inpt = inpt[:, 0:max_review_len]

. . .
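The padding and truncation steps can be sketched in plain Python with lists instead of tensors. This is an illustration only: pad_and_truncate and the token IDs are made up, and the sketch truncates before padding, whereas the demo slices a batch that the iterator has already padded. The result is the same.

```python
PAD_ID = 1  # ID reserved for padding, as described above

def pad_and_truncate(batch, max_len):
    # cut every review down to at most max_len tokens
    clipped = [ids[:max_len] for ids in batch]
    # pad shorter reviews so every review in the batch has the same length
    width = max(len(ids) for ids in clipped)
    return [ids + [PAD_ID] * (width - len(ids)) for ids in clipped]

batch = [[12, 7, 45, 3, 2], [9, 30, 2]]
result = pad_and_truncate(batch, max_len=4)
print(result)
```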

The demo displays the batch of two reviews with the words in numeric form, and then with the words in string form as a sanity check.

In the screenshot, notice the many Warning messages that all of this code is deprecated and won’t be valid after the new torchtext is released. But, I’ve already worked with the new torchtext, and I’m confident I won’t have any trouble with it.

There’s still a metric ton of details I don’t understand about the bad-old-torchtext, but one of the keys to successful learning in my field is to know when to decide, “OK, I don’t fully understand xxx, but I need to move on. I know enough to make progress on my main task, so I’ll mentally log xxx and figure it out later.”

The phrase “bite the bullet” means to face an unpleasant situation with fortitude. The origin of the phrase is unknown but it might have come from the idea that a wounded soldier should bite onto a bullet to bear the pain of a wound. Here are three of my favorite pistol designs. Left: A Colt model 1892. Arguably the first modern revolver. It served with the U.S. Army and Navy until World War II, and in city police forces until the 1980s. It fires a .38 caliber “long” cartridge which is no longer available except by special order. This one was passed to me by my father and is one of my most cherished memories of him. Center: A Colt Detective Special revolver from the 1990s. It fires a .38 special cartridge. It is a piece of art in some ways because of its manufacturing precision. Right: A Ruger ultra-compact semi-automatic pistol. It fires a .380 cartridge. It has a form of beauty because of its extremely efficient design.


import torch as T
import torchtext as tt
import numpy as np

device = T.device("cpu")

# -----------------------------------------------------------

def main():
  print("\nBegin get PyTorch IMDB data - old style ")

  print("\nGetting data and building vocabulary ")
  # NOTE: the start of this call (and the LABEL definition) is missing
  # from the post; the legacy tt.data.Field constructor is assumed
  TEXT = tt.data.Field(include_lengths=True,
    batch_first=True, tokenize="basic_english")

  bat_sz = 2
  max_review_len = 60
  # max_vocab_words = 50_000
  max_vocab_words = 3_500

  train_ds, test_ds = tt.datasets.IMDB.splits(TEXT, LABEL)
  TEXT.build_vocab(train_ds, max_size=max_vocab_words-2) 

  print("\nFirst 6 words in vocabulary: ")
  for i in range(6):
    s = TEXT.vocab.itos[i]
    print(i, s)

  # no way to get reproducible results 
  #  with shuffle=True (default) 
  train_itr, test_itr = \
    tt.data.BucketIterator.splits((train_ds, test_ds), \
    shuffle=False, batch_size=bat_sz)

  # for batch in train_itr:
  print("\nbatch[33] = \n")
  for bat_idx, batch in enumerate(train_itr):
    if bat_idx != 33: continue  # show just review [33]
    print("\nbat_idx = " + str(bat_idx))
    inpt = batch.text[0].to(device)
    lbl = (batch.label - 1).to(device)  # ??

    if inpt.size(1) > max_review_len:
      inpt = inpt[:, 0:max_review_len]


    for i in range(inpt.shape[0]):
      for j in range(inpt.shape[1]):
        if j % 12 == 0: print("")  # words per line
        n = inpt[i][j]
        s = TEXT.vocab.itos[n]
        print("%15s " % s, end="")

  print("\nEnd getting IMDB data demo \n")

if __name__ == "__main__":
  main()

Posted in PyTorch

Neural Regression Using PyTorch: Defining a Network

I wrote an article titled “Neural Regression Using PyTorch: Defining a Network” in the February 2021 edition of the online Microsoft Visual Studio Magazine.

The article is the second in a series of four articles where I explain how to create a neural regression model.

The goal of a regression problem is to predict a single numeric value. There are several classical statistics techniques for regression problems. Neural regression solves a regression problem using a neural network.

The recurring problem over the series of articles is to predict the price of a house based on four predictor variables: its area in square feet, air conditioning (yes or no), style (“art_deco,” “bungalow,” “colonial”) and local school (“johnson,” “kennedy,” “lincoln”).

The demo program presented in the article begins by creating Dataset and DataLoader objects which have been designed to work with the house data. Next, the demo creates an 8-(10-10)-1 deep neural network. The demo prepares training by setting up a loss function (mean squared error), a training optimizer function (Adam) and parameters for training (learning rate and max epochs).

The demo trains the neural network for 500 epochs in batches of 10 items. An epoch is one complete pass through the training data. The training data has 200 items, so one training epoch consists of processing 20 batches of 10 training items.

During training, the demo computes and displays a measure of the current error (also called loss) every 50 epochs. Because error slowly decreases, it appears that training is succeeding. Behind the scenes, the demo program saves checkpoint information after every 50 epochs so that if the training machine crashes, training can be resumed without having to start from the beginning.

After training the network, the demo program computes the prediction accuracy of the model based on whether or not the predicted house price is within 10 percent of the true house price. The accuracy on the training data is 93.00 percent (186 out of 200 correct) and the accuracy on the test data is 92.50 percent (37 out of 40 correct). Because the two accuracy values are similar, it is likely that model overfitting has not occurred.
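The within-10-percent accuracy idea can be sketched in a few lines of plain Python. This is not the article's code: the function name, the list-based interface, and the four made-up price pairs are assumptions for illustration.

```python
def accuracy(predicted, actual, pct_close=0.10):
    # a prediction is correct if it is within pct_close of the true value
    n_correct = sum(
        1 for p, a in zip(predicted, actual)
        if abs(p - a) < abs(pct_close * a))
    return n_correct / len(actual)

# made-up normalized house prices (price / 1,000,000)
preds = [0.49, 0.30, 0.81, 0.62]
actuals = [0.50, 0.40, 0.80, 0.60]
acc = accuracy(preds, actuals)
print("accuracy = %0.4f" % acc)
```

Only the second prediction misses by more than 10 percent, so the sketch reports 3 out of 4 correct.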

Next, the demo uses the trained model to make a prediction on a new, previously unseen house. The raw input is (air conditioning = “no”, square feet area = 2300, style = “colonial”, school = “kennedy”). The raw input is normalized and encoded as (air conditioning = -1, area = 0.2300, style = 0,0,1, school = 0,1,0). The computed output price is 0.49104896 which is equivalent to $491,048.96 because the raw house prices were all normalized by dividing by 1,000,000.
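The normalization and encoding of the new house can be sketched as below. The function and constant names are hypothetical, and encoding air conditioning = “yes” as +1 is an assumption, because the example only shows “no” as -1. The category orderings follow the lists given earlier in the post.

```python
STYLES = ["art_deco", "bungalow", "colonial"]
SCHOOLS = ["johnson", "kennedy", "lincoln"]

def one_hot(value, categories):
    # 1 in the position of the matching category, 0 elsewhere
    return [1 if value == c else 0 for c in categories]

def encode_house(air_cond, sq_feet, style, school):
    ac = 1 if air_cond == "yes" else -1  # "yes" as +1 is an assumption
    area = sq_feet / 10_000              # 2300 becomes 0.2300
    return [ac, area] + one_hot(style, STYLES) + one_hot(school, SCHOOLS)

x = encode_house("no", 2300, "colonial", "kennedy")
print(x)

# prices were normalized by dividing by 1,000,000, so multiply to undo
price = 0.49104896 * 1_000_000
print("predicted price = $%0.2f" % price)
```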

The demo program concludes by saving the trained model using the state dictionary approach. This is the most common of three standard techniques.

The first step when designing a PyTorch neural network class for a regression problem is to determine its architecture. Neural architecture design includes the number of input and output nodes, the number of hidden layers and the number of nodes in each hidden layer, the activation functions for the hidden and output layers, and the initialization algorithms for the hidden and output layer nodes.

The number of input nodes is determined by the number of predictor values (after normalization and encoding), eight in the case of the House data. For most regression problems, there is just one output node, which holds the numeric value to predict. It is possible for a neural regression system to predict two or more numeric values, but these problems are quite rare.

The demo network uses two hidden layers, each with 10 nodes, resulting in an 8-(10-10)-1 network. The number of hidden layers and the number of nodes in each layer are hyperparameters. Their values must be determined by trial and error guided by experience. The term “AutoML” is sometimes used for any system that programmatically, to some extent, tries to determine good hyperparameter values.
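One way to get a feel for the size of an 8-(10-10)-1 architecture is to count its weights and biases. The sketch below assumes fully connected layers with one bias per node, which is the usual setup but is an assumption here. The function name is made up.

```python
def count_parameters(layer_sizes):
    # each pair of adjacent layers has n_in * n_out weights plus n_out biases
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

n_params = count_parameters([8, 10, 10, 1])
print(n_params)  # (8*10 + 10) + (10*10 + 10) + (10*1 + 1)
```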

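More hidden layers and more hidden nodes are not always better. The Universal Approximation Theorem (sometimes called the Cybenko Theorem) says, loosely, that a neural network with just one hidden layer and enough hidden nodes can approximate any continuous function as closely as you like. So, in principle, for any architecture with multiple hidden layers there is a roughly equivalent architecture that has just one hidden layer, although it may need many more nodes. As a rough illustration (a rule of thumb, not part of the theorem), a neural network that has two hidden layers with 5 nodes each is comparable to a network that has one hidden layer with 25 nodes.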

More cats is not better. More fingers is not better. More tennis balls is not better.

Posted in PyTorch

Yet Another Buffon’s Needle Simulation Using Python

I remember being amazed years ago when I first read about Buffon’s Needle problem. You can estimate the value of pi (~3.1416) by dropping a needle on a floor made from wooden slats, and counting how many times the needle crosses the edge of a slat. Remarkable.

If the wood slats are w wide, and the needle is n long (where n is less than w), and you drop the needle many times and it crosses the edge of a slat with frequency f, then the value of pi_est = (2 * n) / (f * w). For example, if slats are w = 3 inches wide, a needle is n = 1 inch long, and you drop the needle 1,000 times, and it crossed an edge 210 times, then f = 210 / 1000 = 0.210 and pi_est = (2 * 1) / (0.21 * 3) = 3.1746.
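The formula comes from the fact that, for a needle shorter than the slat width, the probability of crossing an edge is (2 * n) / (pi * w). Solving for pi and substituting the observed frequency f gives the estimate. The worked example can be checked with a small helper (the function and parameter names are made up for illustration):

```python
def estimate_pi(needle_len, slat_width, num_drops, num_crossings):
    # observed crossing frequency
    f = num_crossings / num_drops
    # P(cross) = 2n / (pi * w), so pi is approximately 2n / (f * w)
    return (2.0 * needle_len) / (f * slat_width)

pi_est = estimate_pi(needle_len=1.0, slat_width=3.0,
                     num_drops=1000, num_crossings=210)
print("pi_est = %0.4f" % pi_est)
```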

Just for fun, I decided to code up a demo simulation. I saw many demos on the Internet but I didn’t want to be influenced by them, so I designed my simulation from scratch. In pseudo-code:

loop many times
  generate a random 1st point
  generate a random angle
  find 2nd point
  if line between points crosses edge
    num_hits = num_hits + 1
  else
    num_misses = num_misses + 1
compute f
compute pi_est

There were quite a few minor details. I set up a scenario with a slat width of 1.0 and a needle length of 0.5 and used 10,000 simulated drops. I didn’t use any fancy trigonometry or geometry (mostly because I was too lazy to look up the equations).

I set up the simulation so that the first end point of a dropped needle had x-coordinate from 0 to 3, and therefore the second end point had x-coordinate from -0.5 to 3.5. A dropped needle could cross at x = 0, x = 1, x = 2, or x = 3.

In one run of the demo, with 10,000 simulated drops, the needle hit an edge 3175 times so f = 0.3175 and the estimated value of pi was 3.1496. I kind of cheated by trying different seed values for the random number generator until I got a really nice estimate of pi.

The Space Needle in Seattle (~600 ft, 1962) is a fantastic design and inspired many similar towers. Here are four. Skylon Tower (Niagara Falls, Canada, ~500 ft, 1965). Sky Tower (Auckland, ~750 ft, 1997). Stratosphere Hotel (Las Vegas, ~1100 ft, 1996). Macau Tower (Macau, ~1000 ft, 2001). I’ve been to all five of these cities but never had the desire to go to the top of the towers. For some reason, a view from a high location doesn’t interest me that much.

# Buffon's Needle

import numpy as np

print("\nBegin Buffon's Needle problem simulation \n")

width = 1.0     # floor slat width 
needle = 0.5   # needle length
num_hits = 0
num_misses = 0

x_lo = 0.0; x_hi = 3.0   # 1st end point
y_lo = 0.0; y_hi = 4.0

print("Starting simulation")
for i in range(10000):
  x = (x_hi - x_lo) * np.random.random() + x_lo
  y = (y_hi - y_lo) * np.random.random() + y_lo
  angle = np.radians(360.0 * np.random.random())  # 0 to 2pi

  xx = x + needle * np.cos(angle)    # 2nd end point
  yy = y + needle * np.sin(angle)

  if xx < x:
    (x, xx) = (xx, x)
    (y, yy) = (yy, y)
  # (x,y) now to left of (xx,yy)

  # a hit is a needle that straddles a slat edge at x = 0, 1, 2, or 3
  if (x < 0.0 and xx > 0.0) \
    or (x < 1.0 and xx > 1.0) \
    or (x < 2.0 and xx > 2.0) \
    or (x < 3.0 and xx > 3.0):
    num_hits += 1
  else:
    num_misses += 1

print("Simulation done \n")

pr = (num_hits * 1.0) / (num_hits + num_misses) # frequency
pi_est = (2.0 * needle) / (pr * width)

print("Needle length: %0.1f " % needle)
print("Slat width: %0.1f " % width)
print("Number hits: " + str(num_hits))
print("Number misses: " + str(num_misses))
print("Probability of hitting crack: %0.4f" % pr)
print("Estimate of pi: %0.4f" % pi_est)

print("\nEnd simulation ")
Posted in Miscellaneous

Anomaly Detection Using Simplistic VAE Reconstruction Error

A standard technique for anomaly detection (well, since about 2017) is to feed source data (often log files) to a deep neural autoencoder (AE) and create a model. Then you feed each data item to the trained model and compare the computed output with the input to calculate reconstruction error. Data items with large reconstruction error are anomalous in some way.

There has been a lot of recent research into the idea of anomaly detection using a variational autoencoder (VAE). This idea is relatively new and mostly unexplored. A VAE is conceptually more complicated than an AE. Internally, a VAE computes two forms of error (typically cross entropy error and Kullback-Leibler divergence), which is a complex topic. Most of the new research in anomaly detection using a VAE looks at using the internal forms of error. I started exploring these ideas and quickly realized that it’s a huge topic and so I needed to start simply and proceed in a logical way.

As an initial exploration, I decided to use a VAE for anomaly detection in the simplest possible way, which is to ignore the complex inner workings of a VAE and its ability to generate synthetic data. Instead, the simplest idea is to just feed the VAE real data items (instead of noise as is used when generating synthetic data), compute an output, and compare the computed output with the input. Put another way, the idea is to use a VAE exactly as if it were an AE.

So, I put together a demo program. I used a dummy dataset of 240 items where each item is an Employee. The raw data looks like:

M 19 concord 32700.00 mgmt
F 22 boulder 27700.00 supp
M 39 anaheim 47100.00 tech
. . .

The normalized data looks like:

0  0.19  0 0 1  0.3270  1 0 0
1  0.22  0 1 0  0.2770  0 1 0
0  0.39  1 0 0  0.4710  0 0 1
. . .
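The encoding can be inferred from the three rows shown: sex is M = 0 and F = 1, age is divided by 100, income is divided by 100,000, and city and job are one-hot encoded. The sketch below reproduces that mapping. The function name and the category orderings are assumptions worked out from the sample rows, not taken from the demo program.

```python
CITIES = ["anaheim", "boulder", "concord"]
JOBS = ["mgmt", "supp", "tech"]

def one_hot(value, categories):
    # 1 in the position of the matching category, 0 elsewhere
    return [1 if value == c else 0 for c in categories]

def encode_employee(sex, age, city, income, job):
    return ([0 if sex == "M" else 1, age / 100]
            + one_hot(city, CITIES)
            + [income / 100_000]
            + one_hot(job, JOBS))

row = encode_employee("M", 19, "concord", 32700.00, "mgmt")
print(row)
```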

The simplistic anomaly detection technique using VAE reconstruction error worked as expected. In the demo, I created a VAE model using the dummy 240 Employee items. Then I set up one Employee item:

M  39  concord  $51,200  supp

normalized to:

0, 0.39, 0, 0, 1, 0.512, 0, 1, 0

and fed it to the VAE. The computed output (not shown in the screenshot) was:

0.48, 0.43, 0.35, 0.30, 0.31, 0.52, 0.29, 0.44, 0.27

The reconstruction error is:

err = [(0 - 0.48)^2 + (0.39 - 0.43)^2 + . . . + (0 - 0.27)^2] / 9
= 0.1521
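The reconstruction error is just mean squared error between the input and the computed output. The sketch below recomputes it from the values displayed above. Because those output values are shown rounded to two decimals, the result (roughly 0.155) is close to, but not exactly, the 0.1521 reported by the demo, which presumably used the unrounded outputs. The function name is made up.

```python
def reconstruction_error(x, y):
    # mean of the squared differences between input x and output y
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

x = [0, 0.39, 0, 0, 1, 0.512, 0, 1, 0]                        # input item
y = [0.48, 0.43, 0.35, 0.30, 0.31, 0.52, 0.29, 0.44, 0.27]    # VAE output
err = reconstruction_error(x, y)
print("err = %0.4f" % err)
```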

My next steps will be to explore anomaly detection using the complex internal error representations of a VAE. It’s a big topic but an interesting topic. It would be possible to, quite literally, devote a lifetime to exploring anomaly detection using deep neural architectures such as AEs, VAEs, and Transformers. Good fun.

The Coyote had many complex plans to catch the Roadrunner. None of them worked, but at least they all failed in entertaining ways. Part of the fun was viewing the Coyote’s setup and then anticipating what was going to go wrong. Rocket. Catapult. Spring.

Posted in PyTorch

I Learn New Code-Based Technologies by Refactoring Example Programs

When I’m learning a new technology that involves writing code, one technique that I find very useful is to find an example in the technical documentation for the topic, then disassemble the example code and reconstruct it.

I’ve been on a multi-month mission to understand machine learning Transformer architecture. Relative to number of lines-of-code, Transformer architectures are by far the most complex software systems I’ve ever worked with. Transformer architecture systems are used for natural language processing, but my long term goal is to use them for other purposes (e.g., anomaly detection).

My neural network library of choice for complex systems is PyTorch (I prefer Keras for simple systems). I found an example of a program that uses a Transformer architecture in the PyTorch documentation, and I set out to refactor that example.

I spent quite a few hours over several weeks. First I got the documentation example program to run, which was no small task in itself. The documentation example reads in the WikiText dataset, which is a bunch of Wikipedia articles (about 2 million words). The dummy task is to accept a sequence of words such as “The Battle of Gettysburg was fought in” and predict the sequence offset by one word: “Battle of Gettysburg was fought in 1863”.
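The offset-by-one idea is easy to see with plain Python lists. This is only an illustration of how the source and target sequences relate, using whitespace splitting instead of a real tokenizer. The function name is made up.

```python
def make_src_tgt(tokens):
    # the target is the source shifted one position to the left
    src = tokens[:-1]
    tgt = tokens[1:]
    return src, tgt

words = "the battle of gettysburg was fought in 1863".split()
src, tgt = make_src_tgt(words)
print(src)
print(tgt)
```

At every position, the target holds the word that follows the corresponding source word.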

As is often the case with technical documentation, the example Transformer program was an afterthought, and was very rudimentary.

My refactoring process started by inserting dozens of print statements into the program to examine variable values and shapes, and then running the demo roughly 100 times. I quickly realized that loading 2 million words every run took way too long, so I pruned the WikiText train, test, and validation files down to about 300 lines of text each.

I reorganized the code, combining some statements into functions and disassembling some functions into separate statements when I thought it made the code easier to understand. I removed a lot of noise from the documentation example that obscured the main ideas of Transformer architecture: things like timing, processing the validation set, and overly complicated calculation of the loss values during training. I also changed many of the variable names. Many variable names in the documentation were poorly chosen, such as “bptt” for the maximum sequence length (number of words) to process. But I changed some non-hideous variable names too, because it helped me understand how the code worked.

I’d like to be able to say that I now fully understand the PyTorch Transformer documentation WikiText example, but the reality is that even that dummy example is incredibly complex, and so I know I have many more hours of exploration. However, the Transformer knowledge map is slowly but surely becoming more clear to me as I look at increasing levels of detail.

I used to ski a lot. Wherever I went — Tahoe, Mammoth Mountain, Whistler-Blackcomb, etc. — I noticed that all the ski trail maps looked very similar. It turns out that the majority of ski trail maps were created by one guy — James Niehues. Top: Whistler-Blackcomb. Bottom: Niehues in action. Amazing.


# Python 3.7.6  PyTorch 1.7.0
# Windows 10  CPU

# the three WikiText files (train, test, valid) were pruned
# after the initial download to about 300 lines of text each 

# NOTE: several calls below were partly missing from the post and are
# restored from the PyTorch documentation example this demo refactors

import io
import numpy as np
import torch as T
import torchtext as tt

device = T.device("cpu")

# -----------------------------------------------------------

class WikiTextDataset(T.utils.data.Dataset):

  def __init__(self, trn_tst_vld, bat_size, seq_len):
    self.bat_size = bat_size # used only to get equal lengths
    self.seq_len = seq_len 
    url = ""  # the WikiText archive URL is missing from the post

    arch = tt.utils.download_from_url(url)
    (test_path, valid_path, train_path) = \
      tt.utils.extract_archive(arch)
    tok = tt.data.utils.get_tokenizer("basic_english")
    self.vocab = tt.vocab.build_vocab_from_iterator(map(tok, \
      iter(io.open(train_path, encoding="utf8"))))

    if trn_tst_vld == "train":
      r_iter = iter(io.open(train_path, encoding="utf8"))
    elif trn_tst_vld == "test":
      r_iter = iter(io.open(test_path, encoding="utf8"))
    elif trn_tst_vld == "valid":
      r_iter = iter(io.open(valid_path, encoding="utf8"))

    all_data_lst = [T.tensor([self.vocab[token] \
      for token in tok(item)], dtype=T.int64) for item in r_iter]
    all_data = T.cat(tuple(filter(lambda t: t.numel() > 0, \
      all_data_lst)))  # get rid of empty tensors

    # batch all data into cols instead of rows. strange.
    n = len(all_data) // self.bat_size  # num words each col

    all_data = all_data[0:(n*self.bat_size)]  # trim
    self.all_data = all_data.view(self.bat_size, -1).t().\
      contiguous().to(device)  # very tricky

  def __len__(self):
    return len(self.all_data) - 1

  def __getitem__(self, idx):
    src = self.all_data[idx]    # all columns
    tgt = self.all_data[idx+1]  # start at next word
    return (src, tgt)  # must flatten tgt when called

# -----------------------------------------------------------

class TransformerModel(T.nn.Module):
  # ntoken is vocabulary size
  # ninp is embed_dim
  # nhead is passed to EncoderLayer
  # nhid is number hidden nodes in NN part of EncoderLayer
  # nlayers is number of EncoderLayer blocks in Encoder
  # dropout is used by PositionalEncoding AND EncoderLayer

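  def __init__(self, ntoken, ninp, nhead, nhid, nlayers, \
    dropout):
    super(TransformerModel, self).__init__()
    self.pos_encoder = PositionalEncoding(ninp, dropout)
    encoder_layers = T.nn.TransformerEncoderLayer(ninp, \
      nhead, nhid, dropout)
    self.transformer_encoder = \
      T.nn.TransformerEncoder(encoder_layers, nlayers)
    self.encoder = T.nn.Embedding(ntoken, ninp)
    self.ninp = ninp
    self.decoder = T.nn.Linear(ninp, ntoken)
    self.init_weights()

  def init_weights(self):
    # NOTE: body partly missing from the post; restored from the
    # PyTorch documentation example
    lim = 0.1
    self.encoder.weight.data.uniform_(-lim, lim)
    self.decoder.bias.data.zero_()
    self.decoder.weight.data.uniform_(-lim, lim)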

  def make_mask(self, sz):
    mask = (T.triu(T.ones(sz, sz)) == 1).\
      transpose(0, 1).to(device)
    mask = mask.float().masked_fill(mask == 0, \
      float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

  def forward(self, src, src_mask):
    src = self.encoder(src) * np.sqrt(self.ninp)
    src = self.pos_encoder(src)
    output = self.transformer_encoder(src, src_mask)
    output = self.decoder(output)
    return output

# -----------------------------------------------------------

class PositionalEncoding(T.nn.Module):

  def __init__(self, d_model, drop_p=0.1, max_len=5000):
    super(PositionalEncoding, self).__init__()
    self.dropout = T.nn.Dropout(p=drop_p)

    pe = T.zeros(max_len, d_model)
    position = \
      T.arange(0, max_len, dtype=T.float).unsqueeze(1)
    div_term = T.exp(T.arange(0, d_model, 2).float() * \
      (-np.log(10000.0) / d_model))
    pe[:, 0::2] = T.sin(position * div_term)
    pe[:, 1::2] = T.cos(position * div_term)
    pe = pe.unsqueeze(0).transpose(0, 1)
    self.register_buffer('pe', pe)

  def forward(self, x):
    x = x + self.pe[:x.size(0), :]
    return self.dropout(x)

# -----------------------------------------------------------

def print_inpt_as_words(data, vocab):
  for i in range(len(data)):
    for j in range(len(data[0])):
      v = data[i][j]
      s = vocab.itos[v]
      print("%16s " % s, end="")

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin Transformer documentation refactored demo ")

  # 1. prepare data
  seq_len = 20    # called "bptt" in documentation demo
  batch_size = 10 # number of columns, not rows (!)

  train_ds = WikiTextDataset("train", bat_size=batch_size,\
    seq_len=seq_len)
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=seq_len, shuffle=False)  # NOTE !!
  # DataLoader assumes rows is the batch size, so 
  #   must use seq_len as the batch_size parameter

  # 2. create model
  ntokens = len(train_ds.vocab.stoi)
  emsize = 30
  nhid = 100    # in the internal NN
  nlayers = 2   # number transformer blocks
  nhead = 2     # multi-head attention
  drop_p = 0.2  # joint value for position and each encoder

  model = TransformerModel(ntokens, emsize, nhead, \
    nhid, nlayers, drop_p).to(device)

# -----------------------------------------------------------

  # 3. train
  epochs = 3
  loss_func = T.nn.CrossEntropyLoss()
  lr = 1.0 # learning rate
  optimizer = T.optim.SGD(model.parameters(), lr=lr)
  scheduler = T.optim.lr_scheduler.StepLR(optimizer, \
    1.0, gamma=0.95)

  for epoch in range(1, epochs + 1):
    print("Epoch " + str(epoch))

    src_mask = model.make_mask(seq_len)
    for bat_idx, batch in enumerate(train_ldr):
      inpt = batch[0]
      tgts = batch[1].flatten()

      # print(inpt)
      # print_inpt_as_words(inpt, train_ds.vocab)
      # print(inpt.shape)
      # print(tgts)
      # print(tgts.shape)
      # input()

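      if inpt.size(0) != seq_len:  # small seq_len batch
        src_mask = model.make_mask(inpt.size(0))

      # NOTE: the zero_grad / backward / step calls are missing from the
      # post; the standard PyTorch training steps are assumed here
      optimizer.zero_grad()
      oupt = model(inpt, src_mask)
      loss_val = loss_func(oupt.view(-1, ntokens), tgts)
      loss_val.backward()
      T.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
      optimizer.step()

      log_interval = 10  # batch interval
      if bat_idx % log_interval == 0 and bat_idx > 0:
        print(" batch %6d  batch loss %0.4f" % \
          (bat_idx, loss_val.item()))

    scheduler.step()  # decay the learning rate once per epoch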

  # 4. TODO: save model

  print("\nEnd refactored Transformer documentation demo")

if __name__ == "__main__":
  main()

Posted in PyTorch

The Golden Age of Science Fiction Movies

I was discussing the 2019 version of the movie “Godzilla” with a friend of mine. We both agreed the special effects were excellent even though the story was weak. By coincidence, I had just watched the 1956 version of “Godzilla” (an Americanized version of the original 1954 Japanese film). I was reminded of how the period from 1950 to 1959 was a golden age for science fiction movies. There are at least 40 movies from that decade that I’d rate as B+ or better.

Here are 10 movies from that period. They’re not necessarily the best ones, but they’re ones I’d take with me on a trip if I could only take 10. Listed by release date.

1. The Thing from Another World (1951) – Scientists at a military base near the North Pole find a crashed flying saucer embedded in ice. The flying saucer has a passenger who thaws out. The scientists try to communicate with The Thing. Not a good idea. Intelligent plot, good production values, excellent acting.

2. Invaders from Mars (1953) – A young boy thinks he sees a flying saucer land in the sandpits behind his house during a thunderstorm at night. The next morning his dad investigates and comes back . . different. Everyone I know who first saw this movie when they were 10 years old or younger, had nightmares for years. Including me. Don’t go to the sandpits!

3. 20,000 Leagues Under the Sea (1954) – A Disney production. During the 1860s, Captain Nemo with his submarine Nautilus wants to stop war. Fantastic special effects highlighted by the Nautilus itself and the fight with the giant squid.

4. Gog (1954) – Mysterious deaths at a secret underground laboratory in the desert. Gog is one of the research projects – an artificial intelligence robot. With a flame thrower.

5. Them! (1954) – The original giant insect (ants) movie. The scenes shot in the desert during a wind storm are very tense and are among the most memorable in the history of science fiction films.

6. Godzilla (1956) – The scene where Godzilla first appears, rising up over the crest of a hill on a small island, is another scene that is burned into my memory.

7. Forbidden Planet (1956) – This movie holds up well, in terms of visual effects, sound effects, and story, more than 60 years after it was made. In the 23rd century, the crew of starship C-57D lands on planet Altair IV to determine what happened to a scientific expedition from 20 years previously. They find only two survivors, Dr. Morbius and his daughter Altaira. What happened?

8. Quatermass 2 (1957) – A British film known as “Enemy From Space” in the U.S. Something strange is going on in the small village of Winnerden Flats. It turns out to be parasitic aliens preparing for invasion. Luckily professor Bernard Quatermass (not “Quartermass”) figures out what’s going on.

9. The Trollenberg Terror (1958) – Another British film, known as “The Crawling Eye” in the U.S. This movie is not highly regarded by critics, but it’s one of my favorites. A strange fog descends upon a mountain in the Swiss Alps. Those who go into the fog do not come back. It’s an alien invasion where the aliens are giant eyeballs with long antennae-like arms.

10. The Atomic Submarine (1959) – Another film that’s not highly regarded by most people but it’s iconic to me. In “the near future” relative to 1959, ships and submarines are disappearing in the oceans near the North Pole. It’s an alien space ship that can travel underwater. The crew of the nuclear submarine Tigershark find the alien ship, ram it, go inside, and find a very unpleasant occupant.

Posted in Top Ten

A Preliminary Look at the New torchtext Library for PyTorch

The PyTorch neural network code library has several closely related libraries, such as torchvision for image processing and torchtext for natural language processing. The existing torchtext library has common datasets such as the IMDB dataset for sentiment analysis. The torchtext library also has many functions that work with the datasets, such as functions to load a dataset, parse a dataset, and build a vocabulary of words from a dataset.

Unfortunately, the torchtext library has two big problems. First, the current torchtext Dataset objects are not compatible with standard PyTorch Dataset objects. Second, the current torchtext API is really weird and ugly.

When you use a current torchtext dataset, you get all kinds of warnings that everything in the library is being deprecated. I came across a nice blog post that describes the new API for the new torchtext library, which is under development.

In order to experiment with the new torchtext library, I had to install the experimental daily builds of PyTorch and torchtext. I found a whl file for the January 30, 2021 build of PyTorch, and the build for the same day for torchtext. I downloaded the torch and torchtext whl files to my local machine. I uninstalled my current torch and torchtext modules, and then installed the experimental nightly builds of PyTorch and torchtext without problem. I was very lucky because nightly builds are often wildly unstable.

I coded up a little demo. The demo loads the IMDB dataset, and splits it into training and test sets. Then the demo creates a vocabulary from the training data. And then the demo extracts a short movie review I found at index position [93] in the training data.

Here are some of the key lines of code in the demo:

import torchtext as tt
toker = tt.data.utils.get_tokenizer("basic_english")
train_ds, test_ds = \
  tt.experimental.datasets.IMDB(tokenizer=toker)  # nightly API
tmp_vocab = train_ds.get_vocab()
vocab = tt.vocab.Vocab(counter=tmp_vocab.freqs, \
  max_size=14_000, min_freq=10)
for idx, (label, txt) in enumerate(train_ds):
  # idx is 0, 1, ..
  # label is 0 or 1 (the sentiment)
  # txt is like [29, 70, 10, . . .] (the review)

The new torchtext Dataset object has the same structure as the standard PyTorch Dataset object, and so it can be used with a DataLoader. At the time I’m writing this post, the new API doesn’t have a nice way to serve up batches of data that have similar review lengths (the current API has a BucketIterator class that does that). You can easily write code to adapt the new API to serve up batches with similar review lengths, but maybe the new API will have a built-in way to do this when the API is released.
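Here is a minimal sketch of one way to group reviews of similar length into batches. It is plain Python and does not use torchtext at all. The function name make_length_batches and the toy reviews are hypothetical, and each review is assumed to be a list of token ids. This is not how BucketIterator is implemented; it only illustrates the idea.

```python
import random

def make_length_batches(reviews, batch_size, seed=0):
  # reviews: list of token-id lists (a hypothetical stand-in
  # for the txt items of a torchtext dataset)
  # returns a list of batches; each batch is a list of indices
  idxs = sorted(range(len(reviews)), key=lambda i: len(reviews[i]))
  batches = [idxs[i:i+batch_size]
    for i in range(0, len(idxs), batch_size)]
  random.Random(seed).shuffle(batches)  # shuffle batch order only
  return batches

reviews = [[4, 9], [7, 7, 7, 7, 7, 7], [5], [8, 3, 2],
  [6, 6, 6, 6, 6], [1, 2, 3, 4]]  # made-up token ids
batches = make_length_batches(reviews, batch_size=2)
for b in batches:
  print(b, [len(reviews[i]) for i in b])
```

Sorting by length and then shuffling only the order of the batches keeps padding small while still varying the order in which batches are seen.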

The new torchtext library API is a big improvement over the current API. I’m looking forward to using the new library and its API when the library reaches stability and is released.

In IMDB movie reviews, each movie gets a text review and a rating from 1 to 10 stars. In the IMDB machine learning dataset, movies that were rated from 1 to 4 stars are classified as class 0 (bad), and movies that were rated from 7 to 10 stars are classified as class 1 (good). Movies rated 5 or 6 stars are available, but not used in the main dataset. Here are three movies that have posters that are much better (in my opinion anyway) than the movies themselves. Left: “The Brain Eaters” (1958) has a rating of 4.8 stars (class 0) but I rate the poster a 9 out of 10. Center: “Beginning of the End” (1957) has a rating of 3.8 stars (class 0) but I rate the poster an 8 out of 10. Right: “Target Earth” (1954) has a rating of 4.6 stars (class 0) but I rate the poster 8.5 out of 10.

# the old way to access datasets is being revamped

import torchtext as tt
import time

def print_w_time(msg):
  # print a message followed by the current date-time
  print(msg + "  ", end="")
  dt = time.strftime("%Y_%m_%d-%H_%M_%S")
  print(dt)

def main():
  print("\nBegin demo of new torchtext interface for IMDB ")

  print_w_time("\nFetching IMDB using basic_english tokenizer ")
  toker = tt.data.utils.get_tokenizer("basic_english")
  train_ds, test_ds = \
    tt.experimental.datasets.IMDB(tokenizer=toker)  # nightly API
  print_w_time("Data has been fetched ")

  print_w_time("\nCreating vocabulary, min_freq=10 ")
  tmp_vocab = train_ds.get_vocab()
  vocab = tt.vocab.Vocab(counter=tmp_vocab.freqs,
    max_size=14_000, min_freq=10)
  print_w_time("Vocabulary created ")

  print("\nExamining short train item [93] ")
  for idx, (label, txt) in enumerate(train_ds):
    if idx == 93:
      print(str(idx) + "  " + str(label) + "  " + str(txt))
      for i in range(len(txt)):
        if i % 16 == 0: print("")
        n = txt[i].item()
        if n > 13_999:
          s = "[unk]"
        else:
          s = vocab.itos[n]
        print(s + " ", end="")
  print("\nEnd demo \n")

if __name__ == "__main__":
  main()

Neural Regression Classification Using PyTorch: Preparing Data

I wrote an article titled “Neural Regression Classification Using PyTorch: Preparing Data” in the February 2021 edition of Microsoft Visual Studio Magazine.

There are three basic types of neural networks for tabular data (ordinary rows and columns data that can be placed in a table-like object). A neural network regression model predicts a single numeric value, such as the income of a person, based on predictors like age, sex, occupation, and so on. A neural network binary classification model predicts a discrete variable that can be just one of two possible values, such as sex, based on predictors like income, age, marital status, and so on. A neural network multi-class classification model predicts a discrete value that can be one of three or more possible values, such as a person’s political leaning (conservative, moderate, liberal) based on predictors such as age, sex, income, and so on.

The February article is the first in a series of four that cover: 1.) preparing data for regression, 2.) designing a neural network for regression, 3.) training a neural regression model, 4.) using a neural regression model.

The recurring example over the four articles is predicting the price of a house based on air conditioning (yes or no), square feet area, style (“art_deco”, “bungalow”, “colonial”), and local school (“johnson”, “kennedy”, “lincoln”). The data used is artificial but is based on the well-known Boston Area Housing Dataset where the goal is to predict the average price of a house in one of 506 towns near Boston, based on 13 predictor variables such as average house age, percentage of minority residents, tax rate and so on.
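To make the data preparation concrete, here is a minimal sketch of turning one raw house record into a row of nine numbers. Everything about the encoding is an assumption made for illustration: the -1/+1 coding for air conditioning, the one-hot coding for style and school, dividing area by 10,000 and price by 1,000,000, and the function name encode_house. The column order is chosen so that price lands at index [5], which is where the Dataset code in this post looks for it.

```python
def encode_house(ac, sq_feet, style, price, school):
  # all scaling and encoding choices here are assumptions
  styles = ["art_deco", "bungalow", "colonial"]
  schools = ["johnson", "kennedy", "lincoln"]
  row = [1.0 if ac == "yes" else -1.0]     # [0] air conditioning
  row.append(sq_feet / 10_000)             # [1] area, scaled
  row += [1.0 if style == s else 0.0 for s in styles]    # [2,3,4]
  row.append(price / 1_000_000)            # [5] price, scaled
  row += [1.0 if school == s else 0.0 for s in schools]  # [6,7,8]
  return row

# a made-up house, not taken from the article's data file
row = encode_house("no", 1275, "bungalow", 328_400, "kennedy")
print(row)
```

Scaling the area and the price keeps all values in roughly the same small range, which usually makes training a neural network easier.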

Serving up batches of data for training a network and evaluating the accuracy of a trained model is a bit trickier than you might expect if you’re new to PyTorch. In the early days of PyTorch, the most common approach was to write completely custom code. You can still write one-off code for loading data, but now the most common approach is to implement Dataset and DataLoader objects. Briefly, a Dataset object loads all training or test data into memory, and a DataLoader object serves up the data in batches.

import numpy as np
import torch as T

class HouseDataset(T.utils.data.Dataset):
  def __init__(self, src_file, m_rows=None):
    all_xy = np.loadtxt(src_file, max_rows=m_rows,
      usecols=[0,1,2,3,4,5,6,7,8], delimiter="\t",
      comments="#", skiprows=0, dtype=np.float32)

    tmp_x = all_xy[:,[0,1,2,3,4,6,7,8]]  # predictors
    tmp_y = all_xy[:,5].reshape(-1,1)    # price, as a column

    self.x_data = T.tensor(tmp_x, \
      dtype=T.float32)
    self.y_data = T.tensor(tmp_y, \
      dtype=T.float32)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx,:]  # or just [idx]
    price = self.y_data[idx,:] 
    return (preds, price)       # tuple of matrices

A Dataset object can be consumed by a DataLoader object that serves up the data in batches:

train_ds = HouseDataset(src, m_rows=5)
train_ldr = T.utils.data.DataLoader(train_ds,
  batch_size=2, shuffle=True)
for epoch in range(2):
  print("\n\n Epoch = " + str(epoch))
  for (bat_idx, batch) in enumerate(train_ldr):
    X = batch[0]  # batch is tuple of two matrices
    Y = batch[1]
    print("bat_idx = " + str(bat_idx))
    print(X)  # predictors
    print(Y)  # target house price
  . . .

Preparing data for a neural network is not fun or glamorous, but it’s necessary and if you have the right mindset, data preparation can be interesting.

Predicting the price of art is not a problem that’s well-suited for neural regression. Three portraits by well-known artists. Left: By artist Csaba Markus. Center: By artist Manuel Nunez. Right: By artist Ichiro Tsuruta. I don’t have any idea of the price of these, and I’d rather just enjoy the art than try to analyze it. All three styles are very different but very appealing. To me at least, the three portraits complement each other nicely, and I think that the three portraits presented together look nicer than any of the portraits individually.


A Look at the HuggingFace NLP Libraries – I’m Impressed

Transformer architecture systems for natural language processing problems are among the most complex software systems I’ve ever worked with. While I was wading through resources on the Internet, I kept seeing references to something called HuggingFace. I avoided exploring those references because the name “HuggingFace” sounded amateurish.

I finally got around to investigating HuggingFace and I was very impressed with my initial experience.

First I installed the HuggingFace system by launching a command shell and typing the command “pip install transformers” and . . . installation was quick and error-free. This was a very good sign.

Next I went to the HuggingFace documentation, and found a Quick Tour page. I created a tiny program based on the demo code there and . . . the demo ran perfectly. In the world of Python NLP this isn’t a minor miracle; it’s a major miracle.

Installation was painless. A minor miracle in Python world.

The API was simple and therefore beautiful.

The tiny demo set up a “pipeline” object for sentiment analysis. When run, a trained Transformer based language model was downloaded to my machine, along with an associated tokenizer. (In subsequent runs, the program checks to see if the model is already there to avoid an unnecessary download operation).

I set up three tiny dummy movie reviews:

 0 "This was a great movie in every way!"
 1 "It was a complete waste of my time."
 2 "Maybe viewable - alternate universe"

I fed the three reviews to the classifier object and got these results:

 0  This was a great movie in every way!  POSITIVE  0.9999
 1  It was a complete waste of my time.   NEGATIVE  0.9998
 2  Maybe viewable - alternate universe   POSITIVE  0.6740

The classifier was quite confident for the first two mini-reviews (0.9999 and 0.9998) but not very confident for the third review, which is how I designed the dummy reviews. Nice! Note: Interpreting confidence scores is tricky — see any of my posts on “calibration” on this Web site.
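Each result is a dictionary with a "label" string and a "score" value. Here is a minimal sketch of comparing the predictions with known target labels. The results list is hard-coded with the values shown above so that the snippet runs without downloading a model; the target labels 1, 0, 1 and the helper name to_class are assumptions for illustration.

```python
# hard-coded stand-in for classifier(reviews); the values are
# the ones shown in the output above
results = [
  {"label": "POSITIVE", "score": 0.9999},
  {"label": "NEGATIVE", "score": 0.9998},
  {"label": "POSITIVE", "score": 0.6740}]
targets = [1, 0, 1]  # assumed: 1 = positive, 0 = negative

def to_class(result):
  # map the label string to the 0/1 encoding of the targets
  return 1 if result["label"] == "POSITIVE" else 0

preds = [to_class(r) for r in results]
n_correct = sum(1 for (p, t) in zip(preds, targets) if p == t)
acc = n_correct / len(targets)
print("predicted = " + str(preds))
print("accuracy = %0.4f" % acc)
```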

After only a brief look at HuggingFace, I don’t have enough information to give a solid opinion, but I really like what I saw. As I was looking at the documentation, I kept thinking to myself, “Yes, this is just how I’d have done it — as simply as possible.” HuggingFace gets a tip of my hat.

I had never heard of the term “hugging face” before. An Internet search told me it’s an emoji that means excitement and happiness. I literally have never ever used an emoji so I forgive myself for not liking the name HuggingFace until I used their code.

I didn’t like the “HuggingFace” name initially, but there are company and product names that are more unfortunate. Left: “Stubbs Prosthetics and Orthotics Inc.” is a company in Chattanooga, Tennessee. I hope they don’t charge an arm and a leg for their products. Center: “KidsExchange” is a store in Raleigh, North Carolina. I suspect many parents have wanted to trade their children in at some point. Right: Batavo “Batmilk” is a yogurt product in Brazil. I wonder if the product is hung upside down in supermarkets.


import numpy as np  # not needed
from transformers import pipeline

print("\nBegin demo \n")

classifier = pipeline('sentiment-analysis')

reviews  = \
  ["This was a great movie in every way!",
   "It was a complete waste of my time.",
   "Maybe viewable - alternate universe"]
targets = [1,0,1]  # not used yet
results = classifier(reviews)

print("\nResults: ")
for i in range(3):
  rvw = reviews[i]
  lbl = results[i]["label"]
  scr = results[i]["score"]
  print(" %2s %38s %10s   %0.4f " % (i, rvw, lbl, scr))

print("\nEnd demo ")

Refactoring the PyTorch Documentation Transformer Example Data Loading Code

I’ve been on a long mission to understand neural Transformer architecture. Transformer systems can be used for natural language processing problems such as sequence-to-sequence scenarios like translating English sentences into German. Transformer systems have replaced LSTM systems for most NLP tasks. But Transformer systems are brutally complicated.

It’s not possible to learn extremely complex topics in a purely sequential manner so I’ve been probing at Transformer architecture from many different directions. Today I just finished a major exploration. I dissected one part of the primary Transformer example in the PyTorch documentation — the data loading and batching code.

The documentation example program reads in about 2 million words of text from Wikipedia articles. The goal is to predict the next n words, given a set of n words.

Suppose the source text is just 73 words:

The quick brown fox jumps over the lazy dog. Meanwhile, 
the red robin flew low over the field, looking for food.
In the barn, some chickens were acting bold and were
pretending they could fly like eagles. The pet dogs were
oblivious to most of the activity and were mostly dreaming
of getting a bone to chew on. The cats regarded the dogs
with disdain, but then cats regard all creatures with disdain.

Conceptually, the demo program data-loading code first chunks the source data into columns like so:

the        low       bold        to        chew
quick      over      and         most      on
brown      the       were        of        the
fox        field     pretending  the       cats
jumps      looking   they        activity  regarded
over       for       could       and       the
the        food      fly         were      dogs
lazy       in        like        mostly    with
dog        the       eagles      dreaming  disdain
meanwhile  barn      the         of        but
the        some      pet         getting   then
red        chickens  dogs        a         cats
robin      were      were        bone      regard
flew       acting    oblivious   to        all

The number of columns, or chunks, is called the “batch_size” in the documentation — a mildly misleading name in my opinion. Here, the chunks/batch_size is set to 5. Because there are 73 words total, each column gets 14 words, and the 3 leftover words (“creatures”, “with”, “disdain”) are discarded.
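The chunking step can be sketched in plain Python. This is only an illustration of the idea, not the documentation code: the tokenize function is a crude stand-in that lowercases and drops punctuation (as the conceptual table above does), and the name chunk_into_columns is hypothetical.

```python
import re

text = (
  "The quick brown fox jumps over the lazy dog. Meanwhile, "
  "the red robin flew low over the field, looking for food. "
  "In the barn, some chickens were acting bold and were "
  "pretending they could fly like eagles. The pet dogs were "
  "oblivious to most of the activity and were mostly dreaming "
  "of getting a bone to chew on. The cats regarded the dogs "
  "with disdain, but then cats regard all creatures with disdain.")

def tokenize(s):
  # crude stand-in tokenizer: lowercase, keep runs of letters
  return re.findall(r"[a-z]+", s.lower())

def chunk_into_columns(words, chunks):
  n = len(words) // chunks   # words per column
  kept = words[0:n*chunks]   # trim the leftover words
  cols = [kept[c*n:(c+1)*n] for c in range(chunks)]
  # transpose so that each row holds one word from every column
  rows = [[cols[c][r] for c in range(chunks)] for r in range(n)]
  return rows, words[n*chunks:]

words = tokenize(text)
rows, leftover = chunk_into_columns(words, chunks=5)
print(len(words), len(rows), leftover)
for row in rows[0:4]:
  print(row)
```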

The demo code serves up batches of training data with a size specified by a variable named “bptt” — another poor name. I called this variable “seq_len”. If bptt/seq_len is set to 4, then the first batch of source input items is:

the       low       bold         to       chew
quick     over      and          most     on
brown     the       were         of       the
fox       field     pretending   the      cats

The first batch of target items to predict is:

quick     over      and          most       on
brown     the       were         of         the
fox       field     pretending   the        cats
jumps     looking   they         activity   regarded

The items to predict are just the input items offset by one word. However, the targets are flattened to a 1-D tensor:

quick  over  and  most  on  brown . . . regarded

Very strange.

The second batch of source input items would be:

jumps     looking   they       activity   regarded
over      for       could      and        the
the       food      fly        were       dogs
lazy      in        like       mostly     with

And the second batch of targets is:

over      for       could      and        the
the       food      fly        were       dogs
lazy      in        like       mostly     with
dog       the       eagles     dreaming   disdain

which would be flattened to:

over  for  could  and  the  the . . . disdain

And so on. This training data scheme and format used by the documentation example is very weird, and it took me a long time to figure it out.
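The input/target scheme described above can be sketched in plain Python using words instead of integer ids. The function name get_batch and its details are my own paraphrase of the scheme, written for illustration; it is not copied from the documentation example.

```python
import re

text = (
  "The quick brown fox jumps over the lazy dog. Meanwhile, "
  "the red robin flew low over the field, looking for food. "
  "In the barn, some chickens were acting bold and were "
  "pretending they could fly like eagles. The pet dogs were "
  "oblivious to most of the activity and were mostly dreaming "
  "of getting a bone to chew on. The cats regarded the dogs "
  "with disdain, but then cats regard all creatures with disdain.")
words = re.findall(r"[a-z]+", text.lower())

chunks = 5   # number of columns
seq_len = 4  # rows per batch
n = len(words) // chunks
# grid[r][c] is the word at row r of column c
grid = [[words[c*n + r] for c in range(chunks)] for r in range(n)]

def get_batch(grid, i, seq_len):
  # inputs are rows i .. i+k-1; targets are the same rows
  # shifted down by one, flattened row by row
  k = min(seq_len, len(grid) - 1 - i)
  src = grid[i:i+k]
  tgt = [w for row in grid[i+1:i+1+k] for w in row]
  return src, tgt

src0, tgt0 = get_batch(grid, 0, seq_len)
src1, tgt1 = get_batch(grid, 4, seq_len)
print(src0[0], tgt0[0:6])
print(src1[0], tgt1[0:6])
```

Because the targets need one row beyond the inputs, the final batch can hold fewer than seq_len rows.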

The words aren’t served up as strings; instead, each word has a unique integer value from a vocabulary object which is constructed by analyzing all source training data. The most common words, like “the” and “and”, have small values, and rare words have large values. There are special numeric values for unknown words and padding.
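Here is a minimal sketch of a frequency-ordered vocabulary in plain Python. It is not the torchtext Vocab class. The special token names, their ids (0 and 1), the alphabetical tie-breaking, and the name build_vocab are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(tokens, specials=("<unk>", "<pad>")):
  # most frequent words get the smallest ids, after the specials;
  # ties are broken alphabetically so the result is repeatable
  counts = Counter(tokens)
  ordered = sorted(counts, key=lambda w: (-counts[w], w))
  itos = list(specials) + ordered              # id -> word
  stoi = {w: i for (i, w) in enumerate(itos)}  # word -> id
  return stoi, itos

tokens = "the cat and the dog and the bird".split()
stoi, itos = build_vocab(tokens)
# a word that is not in the vocabulary maps to the <unk> id
ids = [stoi.get(w, stoi["<unk>"]) for w in ["the", "and", "zebra"]]
print(itos)
print(ids)
```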

The documentation example data-loading code has some issues. The code works but it’s very messy, with functions, imports, and global-scope variables strewn about.

I implemented a PyTorch class named WikiTextDataset and a DataLoader to read in the Wikipedia text, tokenize it, build a vocabulary, store in chunked format, and serve up batches in the same format as the documentation example. I used a lot of the key tricks that were in the documentation code. The complete version of my data loading code is listed below.

The code is called like so:

class WikiTextDataset(T.utils.data.Dataset):
  def __init__(self, trn_tst_vld, chunks, seq_len): . . .
  def __len__(self): . . .
  def __getitem__(self, idx): . . .

train_ds = WikiTextDataset("train", seq_len=4, chunks=5)
train_ldr = T.utils.data.DataLoader(train_ds,
  batch_size=4, shuffle=False)  # confusing naming

for (bat_idx, batch) in enumerate(train_ldr):
  # print(batch[0])            # src inputs
  # print(batch[1].flatten())  # targets
  . . .

I tested my custom data-loading code by verifying that it served up the same data as the documentation code. The code is complicated and it took me over a day of work to get it to run correctly. And even so, I didn’t have time to thoroughly test the code, so there are probably a few edge cases that don’t work correctly.
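The kind of check involved can be sketched without PyTorch. Below, doc_style_batches slices the chunked grid directly, and dataset_style_batches mimics a Dataset whose item idx is (row idx, row idx+1), grouped in order by a loader with batch size seq_len. Both function names and the made-up grid of ids are hypothetical; this is an illustration of the comparison, not the actual test I ran.

```python
def doc_style_batches(grid, seq_len):
  # slice the grid directly, one batch per seq_len rows
  out = []
  for i in range(0, len(grid) - 1, seq_len):
    k = min(seq_len, len(grid) - 1 - i)
    src = grid[i:i+k]
    tgt = [v for row in grid[i+1:i+1+k] for v in row]
    out.append((src, tgt))
  return out

def dataset_style_batches(grid, seq_len):
  # item idx is (row idx, row idx+1); group seq_len items per
  # batch in order, then flatten the targets
  items = [(grid[i], grid[i+1]) for i in range(len(grid) - 1)]
  out = []
  for i in range(0, len(items), seq_len):
    group = items[i:i+seq_len]
    src = [s for (s, t) in group]
    tgt = [v for (s, t) in group for v in t]
    out.append((src, tgt))
  return out

# 14 rows x 5 columns of made-up token ids, laid out column by
# column like the chunked data
grid = [[c*14 + r for c in range(5)] for r in range(14)]
a = doc_style_batches(grid, seq_len=4)
b = dataset_style_batches(grid, seq_len=4)
print(len(a), len(b), a == b)
```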

The moral of this blog post is that neural Transformer architecture is very difficult to learn. Making things more difficult is the fact that serving up data to a Transformer system is a significant challenge by itself.

Two illustrations by artist Alexander Leydenfrost (1888-1961). He was born in Austria-Hungary and moved to the U.S. in 1923. Like many artists, Leydenfrost took work where he could find it. His style is very distinctive. The left image is from the September 1946 issue of Collier’s Magazine. Not much is known about the image on the right but it is believed to be an interior illustration in a magazine from the late 1940s or early 1950s. It’s my guess that the illustration was intended for the Collier’s article, but was cut because of space limitations.

# design a Dataset for Transformer documentation example

import io
import torch as T
import torchtext as tt

device = T.device("cpu")

class WikiTextDataset(T.utils.data.Dataset):

  def __init__(self, trn_tst_vld, chunks, seq_len):
    self.chunks = chunks 
    self.seq_len = seq_len 
    # chunks is num cols to break data into. called 
    # "batch_size" in the documentation example.
    # seq_len is number of words to process. called
    # "bptt" in the documentation example.

# --------------------------------------------------------

    url = "" \

    arch = tt.utils.download_from_url(url)
    (test_path, valid_path, train_path) = \
      tt.utils.extract_archive(arch)
    tok = tt.data.utils.get_tokenizer("basic_english")
    vocab = \
      tt.vocab.build_vocab_from_iterator(map(tok, \
      iter(io.open(train_path, encoding="utf8"))))

    if trn_tst_vld == "train":
      r_iter = iter(io.open(train_path, encoding="utf8"))
    elif trn_tst_vld == "test":
      r_iter = iter(io.open(test_path, encoding="utf8"))
    elif trn_tst_vld == "valid":
      r_iter = iter(io.open(valid_path, encoding="utf8"))

    all_data_lst = [T.tensor([vocab[token] for token \
      in tok(item)], dtype=T.int64) \
      for item in r_iter]
    all_data = T.cat(tuple(filter(lambda t: t.numel() > 0, \
      all_data_lst)))  # get rid of empty tensors

    # chunk all data into columns. weirdness.
    n = len(all_data) // self.chunks  # num words each col

    all_data = all_data[0:(n*self.chunks)]  # trim
    # all_data = all_data.narrow(0, 0, n * chunks)  # huh?
    self.all_data = all_data.view(self.chunks, -1).\
      t().contiguous().to(device)  # TRICKY !!

  def __len__(self):
    return len(self.all_data) - 1

  def __getitem__(self, idx):
    src = self.all_data[idx]    # all columns
    tgt = self.all_data[idx+1]  # start at next word
    return (src, tgt)           # flatten after fetching

# --------------------------------------------------------

def main():
  print("\nBegin data load and batching experiment")
  print("\nUsing custom code with seq (bptt) = 4, \
chunks (batch_size) = 5")

  train_ds = WikiTextDataset("train", seq_len=4, chunks=5)
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=4, shuffle=False)  # confusing

  ct = 0
  for (bat_idx, batch) in enumerate(train_ldr):
    if ct == 0:
      print("\nFirst set of inputs: ")
      print(batch[0])  # src inputs
      print("\nFirst set of targets: ")
      print(batch[1].flatten())  # targets
      # input()
    ct += 1

  print("\nProcessed " + str(ct) + " batches ")
  print("\nLast set of inputs: ")
  print(batch[0])  # batch still holds the final batch
  print("\nLast set of targets")
  print(batch[1].flatten())

  print("\nExperiment done \n")

if __name__ == "__main__":
  main()