Why Isn’t batch_first the Default Geometry for PyTorch LSTM Modules?

I’ve been working for many weeks on dissecting PyTorch LSTM modules. An LSTM module is a very complex object that can be used to analyze natural language. The classic example is movie review sentiment analysis.

I’m never happy unless I completely understand a software module. By completely, I mean well enough to implement the module in question from scratch, using Notepad and the relevant programming language (Python in the case of a PyTorch LSTM).

After a lot of experimentation, I was satisfied that I understood how a PyTorch LSTM deals with a single sequence input. What do I mean by a single sequence? One of the problems with understanding LSTMs is that the vocabulary is very inconsistent, and in many cases, including official documentation, the vocabulary is blatantly incorrect.

In my mind, an LSTM batch is a collection of sentences, a sentence is a collection of words, and a word is made of several numeric values (called a word embedding). Almost all of the very few examples, and the PyTorch documentation, use terms like “input” which can have roughly a dozen different meanings, “hidden” which usually means “output”, “output” which really means “all outputs”, and so on. Anyway, the point is I prefer the terms “batch”, “sentence”, “word”, and “values” (embedding values).

The problem I was examining was how to batch together two or more sentences so that a PyTorch LSTM can understand them. Put another way, what is the geometry of a PyTorch LSTM batch?

My first experiment was to set up 2 separate sentences. Each sentence has 4 words. And each word is represented by 3 embedding values. Therefore, each sentence has 12 numeric values:

# ex:        the                  movie                was                  good
sent1 = [[0.01, 0.02, 0.03], [0.04, 0.05, 0.06], [0.07, 0.08, 0.09], [0.10, 0.11, 0.12]]
sent2 = [[0.13, 0.14, 0.15], etc., [0.22, 0.23, 0.24]]

After setting up the 2 sentences, I fed them in turn to an LSTM module and displayed the two outputs. OK.
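Here is a minimal sketch of that step. The hidden size of 2, the seed, and the reshape to (4, 1, 3) are my choices for illustration, and the sent2 values just continue the 0.13 through 0.24 pattern above:

import torch as T

T.manual_seed(1)
lstm = T.nn.LSTM(input_size=3, hidden_size=2)  # batch_first=False by default

sent1 = T.tensor([[0.01, 0.02, 0.03], [0.04, 0.05, 0.06],
                  [0.07, 0.08, 0.09], [0.10, 0.11, 0.12]])
sent2 = T.tensor([[0.13, 0.14, 0.15], [0.16, 0.17, 0.18],
                  [0.19, 0.20, 0.21], [0.22, 0.23, 0.24]])

# default geometry is (seq_len, batch, values), so one sentence is (4, 1, 3)
out1, (h1, c1) = lstm(sent1.reshape(4, 1, 3))
out2, (h2, c2) = lstm(sent2.reshape(4, 1, 3))
print(out1)  # outputs for all 4 words of sentence 1, shape (4, 1, 2)
print(out2)  # outputs for all 4 words of sentence 2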

Next, I placed the values for the two sentences in a batch. My first attempt was to use the intuitive approach (spoiler: it didn’t work):

batch = [[0.01, 0.02, . . . 0.12],  # 1st sentence? (no)
         [0.13, 0.14, . . . 0.24]]  # 2nd sentence? (no)

Then I reset the LSTM object and fed the batch to the module and . . . got completely different results.
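Here is a sketch of that attempt, reusing the lstm, sent1, and sent2 objects from the snippet above; stacking the two sentences along the first dimension is my reading of the intuitive layout:

bad_batch = T.stack([sent1, sent2], dim=0)  # shape (2, 4, 3): sentence index first
out_bad, _ = lstm(bad_batch)
print(out_bad.shape)  # (2, 4, 2) -- does not line up with out1 and out2 from before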

After much experimentation, I figured out the correct geometry for an LSTM batch of two or more sentences, but it is completely unintuitive. By default, the first dimension indexes word position and the second dimension indexes the sentence, so the words of the two sentences have to be interleaved:

batch = [[0.01, 0.02, 0.03],  # 1st word of 1st sentence
         [0.13, 0.14, 0.15],  # 1st word of 2nd sentence
         [0.04, 0.05, 0.06],  # 2nd word of 1st sentence
         [0.16, 0.17, 0.18],  # 2nd word of 2nd sentence
         etc. ]
Ugh. Just ugh. But lurking in the back of my memory was the recollection of a mysterious LSTM parameter named batch_first, which defaults to False. I’d never seen it used and the documentation description was utterly unhelpful (something like “put the batch first”). On a hunch I created the LSTM object with batch_first=True and voila! The intuitive batch geometry now worked (meaning it gave the same outputs as feeding the sentences individually).
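Here is a sketch of the batch_first=True version, once more reusing the earlier objects. Copying the state_dict is my way of giving the new module the same weights so the outputs can be compared; it is not something the documentation requires:

lstm_bf = T.nn.LSTM(input_size=3, hidden_size=2, batch_first=True)
lstm_bf.load_state_dict(lstm.state_dict())  # same weights as before, for comparison

batch_sent_first = T.stack([sent1, sent2], dim=0)  # shape (2, 4, 3): the intuitive layout
out_bf, (h_bf, c_bf) = lstm_bf(batch_sent_first)
print(T.allclose(out_bf[0], out1.squeeze(1)))  # True -- matches sentence 1 fed alone
print(T.allclose(out_bf[1], out2.squeeze(1)))  # True -- matches sentence 2 fed alone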

Well, that was fun. Actually, I’m not being sarcastic. I get a nice sense of satisfaction in figuring things like this out.



“My psychiatrist told me I was crazy and I said I wanted a second opinion. He said, OK, you’re ugly too.” – Rodney Dangerfield. “I have never let my schooling interfere with my education.” – Mark Twain. “Some people cause happiness wherever they go; others whenever they go.” – Oscar Wilde. “Even if you are on the right track, you’ll get run over if you just sit there.” – Will Rogers. “Sometimes the road less traveled is less traveled for a good reason.” – Jerry Seinfeld.


1 Response to Why Isn’t batch_first the Default Geometry for PyTorch LSTM Modules?

  1. Thorsten Kleppe says:

    Thank you for letting us be a part of your LSTM journey.

    The LSTM is over 20 years old; Sepp Hochreiter wrote a paper about it in German.
    In an interview he described how hard it was to get the idea of the LSTM accepted, and today it's one of the best ideas in ML.
    He is a really cool guy, I think. 🙂
    http://www.bioinf.jku.at/people/hochreiter/

    btw, do you know “Two Minute Papers”?

    What a time to be alive ^^
