Fetching a Sequence of Items

I’m working on a time series regression problem. I’m experimenting with generating training sequences on the fly instead of using redundant data storage.

Suppose you have 9 items: 10, 11, 12, 13, 14, 15, 16, 17, 18. If you use a window of size 3, you could create a training data file like:

|curr 10 11 12 |next 13
|curr 11 12 13 |next 14
|curr 12 13 14 |next 15
|curr 13 14 15 |next 16
|curr 14 15 16 |next 17
|curr 15 16 17 |next 18

You’d have to randomly select lines from this data file so that training wouldn’t oscillate. This approach works fine, but it stores a lot of duplicate data — most source values appear in several different lines.
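For reference, here’s a minimal sketch of how such a redundant training file could be generated. The output file name and the exact pipe-delimited layout are just assumptions based on the example lines above.

# make_windowed_file.py
# sketch only: writes a redundant windowed training file like the
# example above (file name and format are assumptions)

import numpy as np

all_data = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18],
  dtype=np.float32)
window = 3  # size of the "curr" window

with open(".\\WindowedData.txt", "w") as f:
  for i in range(len(all_data) - window):
    curr = " ".join(str(int(v)) for v in all_data[i : i+window])
    nxt = str(int(all_data[i + window]))
    f.write("|curr " + curr + " |next " + nxt + "\n")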

An alternative approach is to store in memory an array that holds each value, then generate a sequence on the fly. In pseudo-code:

store entire sequence into an array
loop
  select a random start_index (not too close to the end)
  extract values from start_index to start_index + window_size
  use the sequence for training
end-loop

Pretty easy, but as is often the case, the devil is in the details. I coded up a demo that selects 10 random sequences of three values each.

# fetch_sequences.py

import numpy as np

np.random.seed(2)

# read all data into an array
data_file = ".\\SourceData.txt"  
all_data = np.loadtxt(data_file, dtype=np.float32, skiprows=0, 
  usecols=[0])

N = len(all_data)  # 9 items
print("\nSize of source data = " + str(N) + "\n")

seq_len = 3  # window size

for i in range(10):
  # randint's high bound is exclusive, so start_idx ranges 0 .. N - seq_len
  start_idx = np.random.randint(0, N - seq_len + 1)
  seq = all_data[start_idx : start_idx+seq_len]
  # it's [start : end) exclusive
  print(seq)

print("\nEnd demo \n")

Now I’m ready to tackle my time series regression problem without storing redundant data.
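For the actual regression problem I’ll need a predictor window plus a target value, like the |curr / |next format above. Here’s a minimal sketch of how the on-the-fly idea could produce (curr, next) pairs — the only change is reserving one extra slot after the window for the target.

# fetch_pairs.py
# sketch only: generate (curr, next) pairs on the fly, mirroring the
# |curr / |next file format without storing duplicate data

import numpy as np

all_data = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18],
  dtype=np.float32)
seq_len = 3
rnd = np.random.RandomState(2)

for i in range(4):
  N = len(all_data)
  # reserve one slot after the window for the target value
  start_idx = rnd.randint(0, N - seq_len)  # high bound is exclusive
  curr = all_data[start_idx : start_idx + seq_len]  # predictors
  nxt = all_data[start_idx + seq_len]               # target
  print(curr, nxt)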



I love old magic posters. I have a few original Chang and Fak-Hong lithographs but they’re somewhere in storage.
