I wrote an article titled “How To: Create a Streaming Data Loader for PyTorch” in the April edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/04/01/pytorch-streaming.aspx.
When using the PyTorch neural network library to create a machine learning prediction model, you must prepare the training data and write code to serve up the data in batches. In situations where the training data is too large to fit into machine memory, one approach is to write a data loader that streams the data using an internal memory buffer. The article explains how to create a streaming data loader for large training data files.
In situations where all of the training data will fit into machine memory, the most common approach is to define a problem-specific Dataset class and use a built-in DataLoader object.
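The standard in-memory approach can be sketched as follows. This is a minimal illustration, not the article's exact code; the dataset name and the two-predictor data layout are hypothetical.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DemoDataset(Dataset):
    # Hypothetical problem-specific Dataset: all data already in memory.
    def __init__(self, data):
        # data: list of (predictor_list, target) pairs
        self.x = torch.tensor([d[0] for d in data], dtype=torch.float32)
        self.y = torch.tensor([d[1] for d in data], dtype=torch.int64)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# tiny made-up data: four items, two predictors each
data = [([0.1, 0.2], 0), ([0.3, 0.4], 1),
        ([0.5, 0.6], 0), ([0.7, 0.8], 1)]
ds = DemoDataset(data)
dl = DataLoader(ds, batch_size=2, shuffle=True)  # shuffle works here
for bx, by in dl:
    print(bx.shape, by.shape)  # each batch: [2, 2] predictors, [2] targets
```

Because the Dataset supports random indexing via __getitem__, the DataLoader can shuffle the order of items on every pass.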
If the training data is too large to fit into memory, a crude approach is to physically divide the training data into smaller files. This approach is quite common but is messy to implement and difficult to manage. In many situations with very large training data files, a better approach is to write a streaming data loader that reads data into a memory buffer, serves data from the buffer, and reloads the buffer from the file when needed.
Note that in addition to the Dataset class, PyTorch has an IterableDataset class. However, when an IterableDataset object is fed to a DataLoader object, the shuffle parameter is not available, because a pure stream cannot be indexed randomly. This makes IterableDataset poorly suited for serving training data, where shuffling on each epoch is usually important.
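The shuffle limitation can be demonstrated directly. In current PyTorch versions, constructing a DataLoader around an IterableDataset with shuffle=True raises a ValueError (the dataset class below is a made-up stand-in for a real stream):

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamDataset(IterableDataset):
    # Stand-in for a dataset streamed from a file: iteration only,
    # no random access via __getitem__.
    def __iter__(self):
        return iter(range(6))

# shuffle=False (the default) works fine with an IterableDataset
plain_batches = list(DataLoader(StreamDataset(), batch_size=2))

# shuffle=True is rejected at DataLoader construction time
try:
    DataLoader(StreamDataset(), batch_size=2, shuffle=True)
    shuffle_rejected = False
except ValueError:
    shuffle_rejected = True
print(shuffle_rejected)  # True: shuffle is not available
```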
In pseudo-code, the algorithm is:

  if buffer is empty then
    reload the buffer from file
  if buffer is ready then
    fetch a batch from buffer and return it
  else
    # buffer not ready means EOF was reached
    reload buffer for next pass through file
    signal no next batch using StopIteration
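The pseudo-code above can be sketched in plain Python, with raw text lines standing in for tensors so the buffering logic is easy to follow. This is a simplified illustration, not the article's actual loader, and the class and parameter names are my own:

```python
import random

class StreamingLoader:
    """Serves shuffled batches from a file via an internal memory buffer."""
    def __init__(self, path, batch_size, buffer_size, seed=0):
        self.fin = open(path, "r")
        self.batch_size = batch_size
        self.buffer_size = buffer_size
        self.rnd = random.Random(seed)
        self.buffer = []  # holds up to buffer_size lines

    def reload_buffer(self):
        # Fill the buffer from the current file position and shuffle it.
        # Returns False when too few lines remain to form a full batch.
        self.buffer = []
        for _ in range(self.buffer_size):
            line = self.fin.readline()
            if line == "":  # end of file
                break
            self.buffer.append(line.strip())
        self.rnd.shuffle(self.buffer)
        return len(self.buffer) >= self.batch_size

    def __iter__(self):
        return self

    def __next__(self):
        if len(self.buffer) < self.batch_size:  # buffer empty / exhausted
            if not self.reload_buffer():
                # reached EOF: rewind for the next pass, signal no next batch
                self.fin.seek(0)
                self.buffer = []
                raise StopIteration
        batch = self.buffer[:self.batch_size]
        self.buffer = self.buffer[self.batch_size:]
        return batch

# Demo: a made-up 10-line file, buffer of 4 lines, batches of 2 items
import os, tempfile
path = os.path.join(tempfile.gettempdir(), "stream_demo.txt")
with open(path, "w") as f:
    f.write("".join(f"{i}\n" for i in range(10)))
batches = list(StreamingLoader(path, batch_size=2, buffer_size=4))
print(len(batches))  # 5 batches, shuffled within each buffer-load
```

Note that this scheme shuffles items only within each buffer-load, not across the entire file, which is the usual trade-off a streaming loader makes.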
There are some indications that current brute force approaches for training machine learning systems with huge data files are becoming unsustainable. It’s estimated that training the GPT-3 language model cost approximately $4.6 million in processing time. And it’s not unusual for the computing cost of training even a moderately sized machine learning model to exceed $10,000. There are many research efforts under way to find ways to train machine learning models more efficiently with smaller datasets.
In Greek mythology, Naiads are a type of water nymph that live in streams. Left: Detail from “Hylas and the Nymphs” by John William Waterhouse (1849-1917). Center: Detail from “Naiad” by contemporary artist Chuck Sperry (b. 1962). Right: “Water Nymph from the Goldfish Pond” by Franz Hein (1892-1976).