How To: Create a Streaming Data Loader for PyTorch

I wrote an article titled “How To: Create a Streaming Data Loader for PyTorch” in the April edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/04/01/pytorch-streaming.aspx.

When using the PyTorch neural network library to create a machine learning prediction model, you must prepare the training data and write code to serve up the data in batches. In situations where the training data is too large to fit into machine memory, one approach is to write a data loader that streams the data using an internal memory buffer. The article explains how to create a streaming data loader for large training data files.

In situations where all of the training data will fit into machine memory, the most common approach is to define a problem-specific Dataset class and use a built-in DataLoader object.
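As a point of reference, here is a minimal sketch of that common pattern. The dataset name and the toy feature/label values are purely illustrative, not from the article:

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical problem-specific Dataset: all data fits in memory,
# each item is a (features, label) pair.
class DemoDataset(Dataset):
    def __init__(self, data, labels):
        self.data = torch.tensor(data, dtype=torch.float32)
        self.labels = torch.tensor(labels, dtype=torch.int64)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

ds = DemoDataset([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
                 [0, 1, 0, 1])
# The built-in DataLoader handles batching and per-epoch shuffling.
dl = DataLoader(ds, batch_size=2, shuffle=True)
for xb, yb in dl:
    print(xb.shape, yb.shape)  # each batch: [2, 2] features, [2] labels
```

Because everything is in memory, DataLoader can shuffle freely each epoch, which is exactly what is lost when the data no longer fits.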

If training data is too large to fit into memory, a crude approach is to physically divide the training data into smaller files. This approach is quite common but is messy to implement and difficult to manage. In many situations with very large training data files, a better approach is to write a streaming data loader that reads data into a memory buffer, serves data from the buffer, and reloads the buffer from file when needed.

Note that in addition to the Dataset class, PyTorch has an IterableDataset class. However, when an IterableDataset object is fed to a DataLoader object, the shuffle parameter is not available. This makes IterableDataset poorly suited for serving training data, where shuffling each epoch is important.

In pseudo-code, the algorithm is:

if buffer is empty then
  reload the buffer from file

if buffer is ready then
  fetch a batch from buffer and return it

else (buffer not ready means EOF was reached)
  reload buffer for the next pass through the file
  signal no next batch by raising StopIteration
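The pseudo-code above can be sketched in plain Python. This is an illustrative sketch, not the article's exact code: the class and parameter names (StreamingLoader, buff_size, batch_size) are made up, the items are plain text lines, and a real PyTorch version would convert each batch to tensors before returning it. Any partial batch at end-of-file is dropped, like DataLoader's drop_last option.

```python
import os
import tempfile

class StreamingLoader:
    """Serve fixed-size batches of lines through an in-memory buffer."""
    def __init__(self, fn, buff_size, batch_size):
        self.fin = open(fn, "r")
        self.buff_size = buff_size      # max lines held in memory at once
        self.batch_size = batch_size
        self.buffer = []
        self.ptr = 0                    # index of next unserved item

    def __iter__(self):
        return self

    def reload_buffer(self):
        # Refill buffer from the current file position; returns lines read.
        self.buffer = []
        self.ptr = 0
        for _ in range(self.buff_size):
            line = self.fin.readline()
            if line == "":              # hit EOF before buffer was full
                break
            self.buffer.append(line.strip())
        return len(self.buffer)

    def __next__(self):
        if self.ptr >= len(self.buffer):        # buffer is empty / used up
            n = self.reload_buffer()
            if n < self.batch_size:             # buffer not ready: EOF
                self.fin.seek(0)                # rewind for the next pass
                self.reload_buffer()
                raise StopIteration             # signal no next batch
        batch = self.buffer[self.ptr : self.ptr + self.batch_size]
        self.ptr += self.batch_size
        return batch

# Demo: a tiny 10-line file, buffer of 4 lines, batches of 2 items.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
for i in range(10):
    tmp.write(f"item_{i}\n")
tmp.close()

ldr = StreamingLoader(tmp.name, buff_size=4, batch_size=2)
batches = [b for b in ldr]
print(len(batches))   # 5 batches of 2 items each
os.unlink(tmp.name)
```

Because the loader rewinds and refills the buffer before raising StopIteration, the same object can be iterated again for the next training epoch.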

There are some indications that current brute force approaches for training machine learning systems with huge data files are becoming unsustainable. It’s estimated that training the GPT-3 language model cost approximately $4.6 million in processing time. And it’s not unusual for the computing cost of training even a moderately sized machine learning model to exceed $10,000. There are many research efforts under way to find ways to train machine learning models more efficiently with smaller datasets.



In Greek mythology, Naiads are a type of water nymph that live in streams. Left: Detail from “Hylas and The Nymphs” by John William Waterhouse (1849-1917). Center: Detail from “Naiad” by contemporary artist Chuck Sperry (b. 1962). Right: “Water Nymph from the Goldfish Pond” by Franz Hein (1892-1976).

This entry was posted in PyTorch.
