One of my personality weaknesses is that when a technical problem gets stuck in my head, I’m incapable of letting it go. Literally. I can think of several problems that stuck in my brain for several years until I finally solved them.
Time series regression (predicting the next number in a sequence) using an LSTM neural network (a very complex NN that has a memory) is one of these problems. This weekend I made a step forward in fully understanding LSTM time series regression. In particular, I figured out one reason why these problems are so difficult.
My usual example is the Airline Dataset. There are 144 values, which represent the total number of international airline passengers each month, from January 1949 through December 1960. The data looks like:
1.12, 1.18, 1.32, 1.29, 1.21, 1.35, . . . 4.32
The values shown are in units of 100,000, so the first value means 112,000 passengers. In many of my attempts to predict the next number in the sequence, I thought I saw a phenomenon where all my predictions were off by one month. For example, suppose each input is a window of four consecutive values. I saw things like:
Input                    Actual   Predicted
=========================================
1.12 1.18 1.32 1.29      1.21     1.28
1.18 1.32 1.29 1.21      1.35     1.20
1.32 1.29 1.21 1.35      1.48     1.34
etc.
In other words, the model was predicting not the next value, but rather the last value of the current input. I figured I had just made some sort of indexing mistake, because shifting the predicted values up by one position makes the predictions very accurate. But I was wrong about that.
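To make the rolling window setup concrete, here is a minimal sketch in plain Python. It is not the code used for the experiments; the function name make_windows and the window size of four are illustrative assumptions, and the series is just the first seven values quoted above.

```python
def make_windows(series, window_size):
    """Return (input_window, target) pairs for one-step-ahead prediction."""
    pairs = []
    for i in range(len(series) - window_size):
        window = series[i:i + window_size]   # window_size consecutive values
        target = series[i + window_size]     # the value right after the window
        pairs.append((window, target))
    return pairs


# First seven values of the airline data, as quoted in the post.
series = [1.12, 1.18, 1.32, 1.29, 1.21, 1.35, 1.48]
pairs = make_windows(series, window_size=4)

for window, target in pairs:
    print(window, "->", target)
```

Notice that each target reappears as the last value of the next window. That overlap is what makes "predict the last input value" such an easy trap for a model to fall into.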
I now believe this effect is a fundamental problem with LSTM time series regression. An LSTM takes as its input both the new input and part of its previous output. If the LSTM is ineffective, it could be relying on just the previous output. Because of the way rolling window data is set up, where each target becomes the last value of the next input window, that would give the bad results above. (Note: my explanation here isn't fully correct; a complete explanation would take a couple of pages.)
Put another way, time series regression problems with an LSTM appear to be extremely prone to a form of over-fitting related to rolling window data.
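One way to check for this effect is to compare the predictions against two references: the actual next values, and the last value of each input window. The sketch below does that with the three example rows from the table earlier in this post. The helper name rmse and the simple "closer to the last input than to the actual value" test are illustrative assumptions, not a standard diagnostic.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# (input window, actual next value, model prediction), from the table in the post.
rows = [
    ([1.12, 1.18, 1.32, 1.29], 1.21, 1.28),
    ([1.18, 1.32, 1.29, 1.21], 1.35, 1.20),
    ([1.32, 1.29, 1.21, 1.35], 1.48, 1.34),
]

actual = [r[1] for r in rows]
predicted = [r[2] for r in rows]
last_input = [r[0][-1] for r in rows]  # what a "repeat the last value" model would output

err_vs_actual = rmse(predicted, actual)
err_vs_last_input = rmse(predicted, last_input)

# Assumed rule of thumb: if the predictions track the last input value more
# closely than they track the true next value, the model is probably echoing.
looks_like_echo = err_vs_last_input < err_vs_actual

print("RMSE vs actual:    ", round(err_vs_actual, 4))
print("RMSE vs last input:", round(err_vs_last_input, 4))
print("Echoing the input? ", looks_like_echo)
```

On these three rows the predictions sit much closer to the last input value than to the actual next value, which is the off-by-one pattern described above.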
I coded up yet-another-example using the CNTK library. By adding a dropout layer I was able to lessen the effect of this over-fitting. There’s still much more to understand in my search for truth. My current working hypothesis is that LSTMs for time series regression can work well for modeling the structure of a dataset (which can be used for anomaly detection), or for predicting a very short time ahead (perhaps one or two time steps), but not for extrapolating several time steps ahead.
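The sketch below illustrates why extrapolating several steps ahead is fragile. It assumes the recursive approach, where each prediction is fed back in as the newest input value. That is one common way to forecast multiple steps, not necessarily the only one. The names forecast_recursive and echo_last_value are hypothetical, and echo_last_value is a stand-in for a model that has collapsed to repeating its last input.

```python
def forecast_recursive(predict_one_step, window, n_steps):
    """Forecast n_steps ahead by feeding each prediction back into the window."""
    window = list(window)  # copy so the caller's list is not modified
    forecasts = []
    for _ in range(n_steps):
        next_value = predict_one_step(window)
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # slide the window forward by one
    return forecasts


def echo_last_value(window):
    """Stand-in for a degenerate model that just repeats the last input value."""
    return window[-1]


forecasts = forecast_recursive(echo_last_value, [1.32, 1.29, 1.21, 1.35], n_steps=3)
print(forecasts)  # [1.35, 1.35, 1.35]
```

A model that echoes its input can look tolerable one step ahead, but under recursive forecasting it produces a flat line, so any such weakness is exposed as soon as the horizon grows.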