Limiting the Size of a PyTorch Dataset / DataLoader

When developing a deep neural model, you normally start by working with a relatively small subset of your data, which saves a huge amount of time. The most common way to read and serve up training and test data in PyTorch is with a Dataset object and a DataLoader object. Unfortunately, neither object has a built-in way to adjust the size of the underlying data.

One approach is to use some sort of utility to create subset files. For example, the MNIST images dataset has 60,000 training and 10,000 test images. You could use a utility program to make a 1,000-item set for training and a 100-item set for testing to get your model up and running, then a 5,000-item and a 500-item set for tuning parameters, and then finally use the full 60,000-item and 10,000-item datasets when you're fully ready to train.
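A minimal sketch of such a utility is shown below. The file names are mine, and it assumes the data has already been converted to a plain text file with one data item per line:

# make_subset.py -- sketch of a utility to create smaller data files
# assumes the source data is a text file with one data item per line

def make_subset(src_file, dest_file, num_items):
  # copy the first num_items lines of src_file into dest_file
  with open(src_file, "r") as fin, open(dest_file, "w") as fout:
    for (i, line) in enumerate(fin):
      if i >= num_items: break  # copied enough items
      fout.write(line)

make_subset("mnist_train.txt", "mnist_train_1000.txt", 1000)  # 1,000 training items
make_subset("mnist_test.txt", "mnist_test_100.txt", 100)      # 100 test items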



Top: Limiting the PyTorch Dataset / DataLoader using the limit-at-load technique. Bottom: The early-exit while training technique.


If you are using a PyTorch Dataset / DataLoader and you want to programmatically adjust the size of your underlying data, there are two realistic options. First, if your Dataset object is program-defined, as opposed to black-box code written by someone else, you can limit the amount of data read into the Dataset's data storage. Second, you can read all the data, track the number of lines of data processed during training, and early-exit when you reach some limit.

I coded up two demo programs on a simple 9-item dataset to illustrate both techniques. In both cases I restrict the data to just 6 of the 9 items. The key lines of the limit-at-load technique are:

class IrisDataset(T.utils.data.Dataset):
  def __init__(self, src_file, root_dir=None,
    num_rows=None, transform=None):
    self.data = np.loadtxt(src_file, usecols=range(0,5),
      max_rows=num_rows, delimiter=",",
      skiprows=0, dtype=np.float32)
    . . . etc.

  iris_ds = IrisDataset(".\\Data\\iris_subset_mod.txt",
    num_rows=6)  # read 6

  train_ldr = T.utils.data.DataLoader(iris_ds, batch_size=2,
    shuffle=False, drop_last=False)  # load 6

  for epoch in range(0,2):  # 2 epochs
    print("epoch = " + str(epoch))
    for (batch_idx, batch) in enumerate(train_ldr): 
      print("  bat idx = " + str(batch_idx))
      . . . etc.
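For completeness, here is a minimal self-contained sketch of a limit-at-load Dataset. It assumes each line of the file holds four comma-delimited predictor values followed by a single numeric class label, and it omits the root_dir and transform parameters of the demo version:

import numpy as np
import torch as T

class IrisDataset(T.utils.data.Dataset):
  # minimal sketch -- reads at most num_rows rows of the comma-delimited file
  def __init__(self, src_file, num_rows=None):
    self.data = np.loadtxt(src_file, usecols=range(0,5),
      max_rows=num_rows, delimiter=",", dtype=np.float32)

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    preds = T.tensor(self.data[idx, 0:4])               # four predictor values
    lbl = T.tensor(self.data[idx, 4], dtype=T.int64)    # class label
    return (preds, lbl)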

The key lines of the early-exit technique are:

  iris_ds = IrisDataset(".\\Data\\iris_subset_mod.txt")  # read all

  train_ldr = T.utils.data.DataLoader(iris_ds, batch_size=2,
    shuffle=False, drop_last=False)  # load all

  for epoch in range(0,2):  # 2 epochs
    print("epoch = " + str(epoch))
    num_lines_read = 0
    for (batch_idx, batch) in enumerate(train_ldr): 
      if num_lines_read == 6: break  # early exit
      num_lines_read += 2  # batch size
      . . . etc.
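A slightly more defensive variant, sketched below, counts the actual number of items in each batch, so the exit logic doesn't depend on the limit being an exact multiple of the batch size. It assumes the Dataset returns (predictors, label) tuples, as in the sketch earlier:

  max_lines = 6  # hypothetical limit on items used per epoch
  for epoch in range(0, 2):  # 2 epochs
    num_lines_read = 0
    for (batch_idx, batch) in enumerate(train_ldr):
      if num_lines_read >= max_lines: break  # early exit
      (preds, lbls) = batch
      num_lines_read += len(preds)  # actual batch size -- handles a short last batch
      # . . . training code here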

There are some fundamental differences between the two techniques. The limit-at-load technique fetches only the first num_rows rows of data and uses just those rows. The early-exit technique fetches all rows but uses only a subset of them on each epoch, and if the shuffle=True argument is set on the DataLoader, that subset is different on each epoch.
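To make the early-exit technique draw a different subset of rows on each epoch, the only change needed (shown as a sketch using the same demo objects) is to create the DataLoader with shuffle=True:

  train_ldr = T.utils.data.DataLoader(iris_ds, batch_size=2,
    shuffle=True, drop_last=False)  # serves items in a new random order each epoch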

So, there’s no big moral to this technical story except that perhaps working with PyTorch, or any other deep neural library, requires a lot, repeat a lot, of patience and attention to detail.


Shrinking the size of a PyTorch Dataset is easier than shrinking people. Left: “The Incredible Shrinking Man” (1957). Center: “Dr. Cyclops” (1940). Right: “Attack of the Puppet People” (1958).
