I’m working on a time series regression problem. I’m experimenting with generating training sequences on the fly instead of using redundant data storage.
Suppose you have 9 items: 10, 11, 12, 13, 14, 15, 16, 17, 18. If you use a window of size 3, you could create a training data file like:
|curr 10 11 12 |next 13
|curr 11 12 13 |next 14
|curr 12 13 14 |next 15
|curr 13 14 15 |next 16
|curr 14 15 16 |next 17
|curr 15 16 17 |next 18
You’d have to randomly select lines from this data so that training wouldn’t oscillate. This approach is fine, but there’s a lot of duplicate data: the 9 source values turn into 24 stored values, and an interior value such as 13 appears four times.
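For reference, a file like the one above could be produced with a few lines of Python. This is only a minimal sketch: the function name make_window_lines is hypothetical, and the values are held in a plain list rather than read from a file.

```python
def make_window_lines(values, window):
    """Return '|curr ... |next ...' lines for every full window in values."""
    lines = []
    # the last usable start leaves room for the window plus one target value
    for start in range(len(values) - window):
        curr = " ".join(str(v) for v in values[start:start + window])
        nxt = values[start + window]
        lines.append(f"|curr {curr} |next {nxt}")
    return lines

items = list(range(10, 19))  # 10, 11, ..., 18
for line in make_window_lines(items, 3):
    print(line)
```

Every value except those near the ends gets written several times, which is the redundancy the on-the-fly approach avoids.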
An alternative approach is to store in memory an array that holds each value, then generate a sequence on the fly. In pseudo-code:
store entire sequence into an array
loop
  select a random start_index (not too close to the end)
  extract window_size values beginning at start_index
  use the sequence for training
end-loop
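The pseudo-code can be sketched in Python roughly as follows. I'm assuming the regression target is the value that comes right after the window, as in the file example, so "not too close to the end" means leaving room for the window plus one target value. The function name sample_window and the in-memory test data are hypothetical.

```python
import numpy as np

def sample_window(data, window, rng):
    """Pick one random window and the value that follows it."""
    # the upper bound is exclusive, so the largest start is
    # len(data) - window - 1, which leaves one value for the target
    start = rng.integers(0, len(data) - window)
    x = data[start:start + window]
    y = data[start + window]
    return x, y

data = np.arange(10, 19, dtype=np.float32)  # 10.0, 11.0, ..., 18.0
rng = np.random.default_rng(0)
for _ in range(5):
    x, y = sample_window(data, 3, rng)
    print(x, "->", y)
```

Because the start index is drawn at random on each call, no shuffled copy of the data is needed.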
Pretty easy but as is often the case, the devil is in the details. I coded up a demo that selects 10 random sequences of three values each.
# fetch_sequences.py

import numpy as np

np.random.seed(2)

# read all data into an array
data_file = ".\\SourceData.txt"
all_data = np.loadtxt(data_file, dtype=np.float32, skiprows=0, usecols=)

N = len(all_data)  # 9 items
print("\nSize of source data = " + str(N) + "\n")

seq_len = 3  # window size
for i in range(10):
  start_idx = np.random.randint(0, N - seq_len + 1)
  seq = all_data[start_idx : start_idx+seq_len]  # it's [start : end) exclusive
  print(seq)

print("\nEnd demo \n")
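If training uses mini-batches, several windows can be pulled at once with NumPy index arithmetic instead of a Python loop. The sketch below is one possible way to do it, not part of the demo: the name sample_batch is hypothetical, the data is generated in memory, and the target is assumed to be the value immediately after each window.

```python
import numpy as np

def sample_batch(data, window, batch_size, rng):
    """Return (X, y): batch_size random windows and the value after each."""
    # exclusive upper bound keeps one value after every window for the target
    starts = rng.integers(0, len(data) - window, size=batch_size)
    # broadcasting builds a (batch_size, window) matrix of indices
    idx = starts[:, None] + np.arange(window)
    return data[idx], data[starts + window]

data = np.arange(10, 19, dtype=np.float32)
rng = np.random.default_rng(1)
X, y = sample_batch(data, 3, 4, rng)
print(X)
print(y)
```

Each row of X is one window and the matching entry of y is its next value, so the pair can be passed straight to a training step.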
Now I’m ready to tackle my time series regression problem without storing redundant data.