## Time Series Regression Using a PyTorch LSTM Network

Implementing a neural prediction model for a time series regression (TSR) problem is very difficult. I decided to explore creating a TSR model using a PyTorch LSTM network. For most natural language processing problems, LSTMs have been almost entirely replaced by Transformer networks. But LSTMs can work quite well for sequence-to-value problems when the sequences are not too long. For example, I get good results with LSTMs on sentiment analysis when the input sentences are 30 words or fewer.

I found a few examples of TSR with an LSTM on the Internet, but all of them had either conceptual or technical errors. The problem I tackled was the well-known Airline Passenger data set. There are 144 lines of data, one per month from January 1949 through December 1960. Each data value is the total airline passenger count for the month, in thousands. The raw data looks like:

```
"1949-01",112
"1949-02",118
. . .
"1960-12",432
```

If you divide the data with a sequence length of 4, conceptually you use four consecutive values to predict the next value. LSTMs were designed for natural language processing, not TSR. So, the first thing you need to know is how to map an NLP problem to a TSR problem.

Figure: The time series regression using PyTorch LSTM demo program. To create this graph, I printed output values, copied them from the command shell, dropped the values into Excel, and manually created the graph.
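The sequence-length-4 idea can be sketched with plain NumPy. This is an illustrative sketch only: `make_windows` is a hypothetical helper name, and the series is a toy one, not the airline data.

```python
import numpy as np

def make_windows(series, seq_len):
    """Split a 1-D series into (inputs, targets) using a sliding window.

    Hypothetical helper: each input row holds seq_len consecutive values
    and the matching target is the value that immediately follows them.
    Assumes len(series) > seq_len.
    """
    series = np.asarray(series, dtype=np.float32)
    n_items = len(series) - seq_len
    xs = np.stack([series[i:i + seq_len] for i in range(n_items)])
    ys = series[seq_len:]
    return xs, ys

# toy series (illustrative values, not the airline data)
toy = [1, 2, 3, 4, 5, 6, 7]
xs, ys = make_windows(toy, seq_len=4)
print(xs.tolist())  # [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 6.0]]
print(ys.tolist())  # [5.0, 6.0, 7.0]
```

A series with L values and a sequence length of 4 yields L - 4 input/target pairs, because the last four values have no following value to predict.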

Suppose you are doing NLP sentiment analysis for movie reviews. Your data might be like:

```
"A truly great film", 1
"It was a waste of time", 0
"I highly recommend this movie", 1
"Worst movie ever", 0
. . .
```

Each word is part of a sequence, so the four data items shown have seq_len values of 4, 6, 5, 3. Each word has to be converted into an embedding vector. Suppose the embed_dim is 3; then each word is converted into a vector with 3 cells. Finally, in NLP you are usually dealing with huge datasets, so you almost always place data items together into batches for training. To recap, in NLP you have a seq_len (number of words in a sentence), an embed_dim (number of numeric values that represent each word), and a batch size (number of items grouped together for training).

When you create a PyTorch LSTM you must feed it a minimum of two parameters: input_size and hidden_size. When you call the LSTM object to compute output, by default (batch_first=False) you must feed it a 3-D tensor with shape (seq_len, batch, input_size). For NLP, seq_len is the number of words in a sentence; batch is the number of sentences grouped together for training; input_size is the embed_dim (the number of numeric values used for each word).
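The NLP shapes can be checked with a few lines of code. This is a sketch under assumed sizes: the vocabulary size, the random word indices, and the hidden size of 10 are all made up for illustration.

```python
import torch as T

T.manual_seed(0)
seq_len, bat_sz, embed_dim, hidden = 5, 2, 3, 10  # illustrative sizes
vocab_size = 20                                   # hypothetical vocabulary size

embed = T.nn.Embedding(vocab_size, embed_dim)
lstm = T.nn.LSTM(input_size=embed_dim, hidden_size=hidden)

# word indices laid out as (seq_len, batch): each column is one sentence
word_ids = T.randint(0, vocab_size, (seq_len, bat_sz))
x = embed(word_ids)              # (seq_len, batch, embed_dim)
all_outs, (h_n, c_n) = lstm(x)   # output plus final hidden and cell state

print(tuple(x.shape))         # (5, 2, 3)
print(tuple(all_outs.shape))  # (5, 2, 10)
print(tuple(h_n.shape))       # (1, 2, 10)
print(tuple(c_n.shape))       # (1, 2, 10)
```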

With a TSR problem, the seq_len is the number of numeric values used to predict the next value in the series; I recommend that the batch size should usually be 1 (I’ll explain shortly); the input_size is 1 for a single-variable series like this one because each item in the sequence is just one value (no embedding).
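For the TSR case, the same shape rule looks like the sketch below. It uses the four normalized values that the demo program uses for extrapolation; the hidden size of 10 matches the demo network, and the untrained LSTM here is for shape checking only.

```python
import torch as T

T.manual_seed(0)
lstm = T.nn.LSTM(input_size=1, hidden_size=10)  # untrained, shapes only

v = T.tensor([5.08, 4.61, 3.90, 4.32], dtype=T.float32)  # one sequence of 4 values
x = v.reshape(4, 1, 1)          # (seq_len=4, batch=1, input_size=1)

all_outs, (h_n, c_n) = lstm(x)
last = all_outs[-1]             # output at the final time step

print(tuple(all_outs.shape))    # (4, 1, 10)
print(tuple(last.shape))        # (1, 10)
# for a single-layer, one-directional LSTM the last output equals the final hidden state
print(T.allclose(last, h_n[0])) # True
```

The last line is why a network can feed either `all_outs[-1]` or the final hidden state into the following Linear layer.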

When using an LSTM, suppose the batch size is 10, meaning 10 sentences are processed together. The PyTorch documentation and resources on the Internet are very poor when it comes to explaining when the hidden state and cell state are reset to 0. Most sources say the 10 sentences in a batch are processed independently and the state is automatically reset to 0 after each batch. However, there is tremendous confusion with regard to when and how a PyTorch LSTM state is reset. For NLP, sometimes you want to retain state between sentences (if the sentences are related in some way) but sometimes you don’t want to retain state. As best I can determine, the state is automatically reset to 0 on each call, and if you want to maintain state between batches, you must do so explicitly by capturing the final hidden and cell states (which are returned alongside the LSTM output) and feeding them in as the initial state on the next call.
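The reset behavior can be checked directly. The PyTorch documentation states that when no initial state is passed to an LSTM, both the hidden state and the cell state default to zeros. The sketch below, which assumes a single-layer LSTM with arbitrary sizes and random input, compares a call with no state, a call with explicit zeros, and a call that carries state over.

```python
import torch as T

T.manual_seed(0)
lstm = T.nn.LSTM(input_size=1, hidden_size=10)
x = T.rand(4, 1, 1)   # (seq_len, batch, input_size), arbitrary values

with T.no_grad():
    out_a, (h_a, c_a) = lstm(x)              # no state passed: starts from zeros
    out_b, _ = lstm(x)                       # second call also starts from zeros
    zeros = T.zeros(1, 1, 10)                # (num_layers, batch, hidden_size)
    out_z, _ = lstm(x, (zeros, zeros))       # explicit zero state
    out_c, _ = lstm(x, (h_a, c_a))           # state carried over explicitly

same_when_reset = T.allclose(out_a, out_b)
same_as_zeros = T.allclose(out_a, out_z)
differs_when_carried = not T.allclose(out_a, out_c)
print(same_when_reset, same_as_zeros, differs_when_carried)
```

If state should persist across training batches, the carried tensors usually need a `.detach()` before reuse so that backpropagation does not reach into earlier batches.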

For both NLP and TSR, all batches fed to the LSTM should be the same size. If your code batches up sequences you need to make sure the last batch is the same size as all other batches served up. If you use a batch size of 1, then you won’t have this problem. And for some reason, for TSR problems, when I use a batch size of 1, I usually get better results than when I use a larger batch size. I don’t understand why this is so — it could be just a coincidence. The main purpose of batching up input sequences is to speed up training, so I only use a batch size greater than 1 when I have very large amounts of training data. Most, but not all, of the TSR problem scenarios I encounter have relatively small data set sizes.
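One way to guarantee equal-sized batches is the DataLoader's `drop_last` parameter, which discards a final short batch. This is a sketch: `ToyDataset` is a hypothetical stand-in filled with random values, sized at 116 items to mimic 120 rows with a sequence length of 4.

```python
import torch as T

class ToyDataset(T.utils.data.Dataset):
    # hypothetical stand-in for a TSR Dataset; values are random
    def __init__(self, n_items, seq_len):
        self.x = T.rand(n_items, seq_len)
        self.y = T.rand(n_items)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return {'seq_in': self.x[idx], 'target': self.y[idx]}

ds = ToyDataset(116, 4)

keep_ldr = T.utils.data.DataLoader(ds, batch_size=10, shuffle=False)
drop_ldr = T.utils.data.DataLoader(ds, batch_size=10, shuffle=False,
                                   drop_last=True)

sizes_keep = [b['seq_in'].shape[0] for b in keep_ldr]
sizes_drop = [b['seq_in'].shape[0] for b in drop_ldr]
print(sizes_keep[-1], len(sizes_keep))  # 6 12  (last batch is short)
print(sizes_drop[-1], len(sizes_drop))  # 10 11 (short batch discarded)
```

The trade-off is that the dropped items are not seen in that epoch; with shuffling enabled, different items are dropped each time.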

Note: When I use a batch size that is greater than 1, if I randomly shuffle the order of the training items (by setting shuffle=True in the DataLoader) as usual, my results are bad, and get increasingly worse as the batch size increases. But if I set shuffle=False with batch size greater than 1, I get good results. This suggests that there is some connection between the sequences in a batch during training. But if shuffle=False, the training data will be processed in the exact same order in each training epoch, which I suspect is bad; my hunch is that the LSTM network will just memorize the entire training data rather than create a general predictive model.
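One possible contributor to this kind of batch-size sensitivity (an assumption, not something verified against the demo's results) is how a batch of shape [bat_sz, seq_len] gets converted to the (seq_len, bat_sz, 1) shape the LSTM expects. A plain `reshape` reinterprets the values in memory order and does not swap axes, so with more than one sequence in the batch the values of different sequences get mixed. The sketch below uses two toy sequences to show the difference.

```python
import torch as T

# two toy sequences of length 4, batched the way a DataLoader would: [bat_sz, 4]
X = T.tensor([[1., 2., 3., 4.],
              [10., 20., 30., 40.]])

by_reshape = X.reshape(4, 2, 1)        # reinterprets memory order only
by_transpose = X.t().reshape(4, 2, 1)  # swaps axes first, then adds input_size dim

# what the LSTM would see as "sequence 0" in each case
print(by_reshape[:, 0, 0].tolist())    # [1.0, 3.0, 10.0, 30.0]
print(by_transpose[:, 0, 0].tolist())  # [1.0, 2.0, 3.0, 4.0]
print(by_transpose[:, 1, 0].tolist())  # [10.0, 20.0, 30.0, 40.0]
```

With a batch size of 1 the two conversions give identical tensors, which would be consistent with a batch size of 1 behaving well. With shuffle=False, neighboring sequences overlap heavily, so mixed-up values would still resemble a plausible sequence.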

The only real way to figure out what is going on is to dissect the PyTorch LSTM source code. However, I know from previous experience that such an investigation would likely take at least a full day — and probably longer — and I can’t spare that much time right now. Maybe some day I’ll investigate further, but for now, this unexpected behavior is a major reason to use a batch size of 1.

So anyway, I coded up a demo. It is really complicated. If you are using this blog post as a resource to create your own PyTorch LSTM for time series regression, even if you are very familiar with PyTorch, you will likely need to spend at least two full days dissecting the demo code to see exactly what’s happening. Note: My LSTM network uses an intermediate Linear layer between the LSTM layer and the output Linear layer.

Coding up a Dataset class to serve up training data is a lot of work. Dealing with the shapes of the various tensors was a huge problem and took me many hours to figure out. Writing an accuracy() function was quite tricky. My point is that you can create a PyTorch LSTM for time series regression. Be prepared to put in a significant amount of time and effort, but eventually you’ll see the light.

Figure: Lighthouse art by the famous and semi-famous. Left: “The Jetty at Le Havre” (1868) by Claude Monet. Center: “The Lighthouse at Two Lights” (1929) by Edward Hopper. Right: Cover art for the Hardy Boys “The Secret Warning” (1966) by Rudy Nappi.

```
# airline_lstm.py
# PyTorch 1.7.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10

import numpy as np
import torch as T
device = T.device("cpu")  # apply to Tensor or Module

# -----------------------------------------------------------

# bat_size = 1 strongly recommended
# seq_len = 4 should divide into 120 evenly

class AirlineDataset(T.utils.data.Dataset):
  # serve up data for training
  # "1949-01",112
  # . .
  # "1958-12",337
  # train = 120 rows, test = 24 rows

  def __init__(self, src_file, seq_len):
    # read only column 1 (the passenger count); usecols=[1] is assumed here
    all_data = np.loadtxt(src_file, usecols=[1],
      delimiter=",", skiprows=0, dtype=np.float32)
    all_data /= 100.0        # normalize
    L = len(all_data)        # 120 for train, 24 for test
    num_items = L - seq_len  # 120 - 4 = 116 input sequences

    tmp_x = []; tmp_y = []
    for i in range(num_items):
      seq = all_data[i:i+seq_len]   # fetch seq_len (4) values
      nxt = all_data[i+seq_len]     # next value is the target
      tmp_x.append(seq)
      tmp_y.append(nxt)

    # go through np.array: building a tensor from a list of arrays is slow
    self.x_data = T.tensor(np.array(tmp_x), dtype=T.float32).to(device)
    self.y_data = T.tensor(np.array(tmp_y), dtype=T.float32).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    seq_in = self.x_data[idx]   # shape [seq_len]
    target = self.y_data[idx]   # scalar tensor
    sample = { 'seq_in' : seq_in, 'target' : target }
    return sample

# -----------------------------------------------------------

class MyLSTM(T.nn.Module):
  def __init__(self):
    super(MyLSTM, self).__init__()
    self.lstm = T.nn.LSTM(1, 10)   # input_size = 1, hidden (state) size = 10
    self.linear1 = T.nn.Linear(10, 5)  # intermediate layer
    self.linear2 = T.nn.Linear(5, 1)   # output layer

  def forward(self, x):
    # x must have shape (seq_len, bat_sz, input_size)
    # seq_len = 4 (values in one sequence)
    # batch = ?, input_size = 1 (analogous to embed dim)
    (all_outs, (final_oupt, final_state)) = self.lstm(x)

    # print(final_oupt.shape)    # shape [1,bat_sz,10]
    # print(all_outs[-1].shape)  # shape [bat_sz,10]
    # input()

    oupt = self.linear1(all_outs[-1])  # use output at the last time step
    oupt = self.linear2(oupt)    # shape [bat_sz,1]
    return oupt

# -----------------------------------------------------------

def accuracy(model, ds, pct):
  # assumes model.eval()
  # percent correct within pct of true value
  # process 1 item at a time
  n_correct = 0; n_wrong = 0

  for i in range(len(ds)):
    X = ds[i]['seq_in']        # shape [4]
    X = X.reshape(4,1,1)       # seq_len, bat_sz = 1, input_size
    Y = ds[i]['target']        # scalar tensor
    with T.no_grad():          # no gradients needed for evaluation
      oupt = model(X)          # shape [1,1]
    oupt = oupt.reshape(1)

    delta = np.abs(oupt.item() - Y.item())
    allowable = np.abs(pct * Y.item())
    if delta < allowable:
      n_correct += 1
    else:
      n_wrong += 1

    # print values to make a graph
    # print("actual  predicted")
    # print("%0.2f %0.4f" % (Y.item(), oupt.item()))

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------

def main():
  # 0. get started
  print("\nBegin PyTorch LSTM airline time series \n")
  T.manual_seed(0)
  np.random.seed(0)

  # 1. create Dataset and DataLoader objects
  print("Creating Airline Dataset objects ")
  train_file = ".\\Data\\airline_train.txt"
  train_ds = AirlineDataset(train_file, 4)  # seq_len = 4

  test_file = ".\\Data\\airline_test.txt"
  test_ds = AirlineDataset(test_file, 4)    # all 24 rows

  bat_sz = 1   # batch size for TSR is tricky
  train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=bat_sz, shuffle=True)
  # use shuffle=False if bat_sz != 1 but risk of over-fitting

  # 2. create network
  lstm_tsr = MyLSTM().to(device)

  # 3. train model
  max_epochs = 100
  ep_log_interval = 10
  lrn_rate = 0.01

  loss_func = T.nn.MSELoss()
  optimizer = T.optim.SGD(lstm_tsr.parameters(), lr=lrn_rate)

  print("\nbat_size = %3d " % bat_sz)
  print("loss = " + str(loss_func))
  print("optimizer = SGD")
  print("max_epochs = %3d " % max_epochs)
  print("lrn_rate = %0.3f " % lrn_rate)

  print("\nStarting training")
  lstm_tsr = lstm_tsr.train()
  for epoch in range(0, max_epochs):
    epoch_loss = 0  # for one full epoch

    for (batch_idx, batch) in enumerate(train_ldr):
      X = batch['seq_in']        # shape [bat_sz,4]
      # transpose before reshaping so each sequence stays intact
      # (identical to a plain reshape when bat_sz = 1)
      X = X.t().reshape(4,-1,1)  # seq_len, bat_sz, "input_len"
      Y = batch['target']        # shape [bat_sz] via Dataset
      Y = Y.reshape(-1,1)        # shape [bat_sz, 1]

      optimizer.zero_grad()   # clear gradients from the previous step
      oupt = lstm_tsr(X)      # oupt has shape [bat_sz,1]

      loss_val = loss_func(oupt, Y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()
      optimizer.step()

    if epoch % ep_log_interval == 0:
      print("epoch = %4d   loss = %0.4f" % (epoch, epoch_loss))

  print("Done ")

  # 4. evaluate model accuracy
  print("\nComputing model accuracy (within 15%)")
  lstm_tsr = lstm_tsr.eval()
  acc_train = accuracy(lstm_tsr, train_ds, 0.15)  # item-by-item
  print("Accuracy on training data = %0.4f" % acc_train)

  acc_test = accuracy(lstm_tsr, test_ds, 0.15)  # item-by-item
  print("Accuracy on test data = %0.4f" % acc_test)

  # 5. extrapolate 6 months ahead via bootstrapping
  print("\nExtrapolating 6 months past all known data ")
  v = T.tensor([5.08, 4.61, 3.90, 4.32],
    dtype=T.float32).to(device)
  for i in range(6):
    X = v.reshape(4,1,1)  # one at a time