It’s difficult to explain what this blog post is all about, so bear with me. When training a PyTorch neural network, you must iterate through your training data, grouping items into batches so they can be fed to the network. Conceptually this is easy, but in practice batching and serving up data is surprisingly complicated.
In the early days of PyTorch (roughly 20 months ago), the most common approach was to code this plumbing from scratch, and that approach is still viable. But since then, using the Dataset and DataLoader objects from the torch.utils.data module has become the standard approach.
Note: this code does not reproduce the same order of data on different runs because I didn’t place T.manual_seed(1) and np.random.seed(1) statements at the beginning of the program.
A DataLoader object uses a Dataset object. The Dataset object fetches the raw training data into memory, and the DataLoader object serves up batches of predictor input tensors along with the associated labels to predict. The Dataset class is just skeleton interface code: you have to implement the functionality yourself, and there are many dozens of design alternatives. (The DataLoader class is good to go as-is in most cases.)
Instantiating and using a Dataset and a DataLoader looks like:
train_ds = PeopleDataset(train_file, num_rows=8)  # make a Dataset

train_ldr = T.utils.data.DataLoader(train_ds,
  batch_size=bat_size, shuffle=True)  # make a DataLoader

for (batch_idx, batch) in enumerate(train_ldr):  # loop thru data
  X = batch['predictors']
  Y = batch['labels']
  # feed X and Y to neural network
  . . .
OK, I’m finally getting close to describing the problem topic of this blog post. When implementing the Dataset interface, you must code an __init__() method and a __getitem__() method. The __init__() method typically loads data into memory as NumPy data from a text file. The __getitem__() method selects a batch of data from the in-memory data.
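To make this concrete, here is a minimal implementation sketch. The file layout is a hypothetical assumption: a tab-delimited text file with six predictor columns followed by one class-label column (the actual demo data isn't shown in this post).

```python
import numpy as np
import torch as T

class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # load all data into memory as a NumPy matrix
    # (assumed layout: cols [0..5] = predictors, col [6] = label)
    all_data = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0, 7), delimiter="\t",
      dtype=np.float32, ndmin=2)
    self.x_data = all_data[:, 0:6]  # predictors, NumPy
    self.y_data = all_data[:, 6]    # labels, NumPy

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    # convert NumPy data to tensors on the fly, item by item
    preds = T.tensor(self.x_data[idx], dtype=T.float32)
    lbl = T.tensor(self.y_data[idx], dtype=T.int64)
    return {'predictors': preds, 'labels': lbl}
```

The returned dictionary is what the training loop unpacks as batch['predictors'] and batch['labels']; the default DataLoader collate function stacks the per-item tensors into batch tensors automatically.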
At some point, if the predictors and class labels are in the same file, you must separate the predictors from the labels. The data also has to be converted to PyTorch tensors. One of the dozens of design decisions, and the topic of this post, is when to convert the data to tensors. There are three main alternatives:
1.) Inside the __init__() method, you can read the data into memory as a NumPy matrix and then convert all the data, in bulk, to a tensor matrix.
2.) In __init__() you can leave the data as a NumPy matrix, then inside __getitem__() convert the NumPy data to tensors on the fly, batch by batch.
3.) Inside __init__() and __getitem__() you can leave the data as NumPy matrices, then when the DataLoader serves up a batch of items, convert them to tensors at that point.
Curiously (to me anyway), all the examples I found in the PyTorch documentation, and all the examples I found on the Internet, use alternative #2 above. This didn’t seem to make sense to me. That approach seems inefficient because you repeatedly convert the same NumPy data to PyTorch tensors. In most training scenarios, it makes sense to just convert all NumPy data to PyTorch tensor data once. Two scenarios that would be exceptions are 1.) different batches have to be processed differently and that processing is easier with NumPy data than with PyTorch tensor data, and 2.) your data is huge and all of it won’t fit into tensor memory at one time.
I set out to implement alternative #1 to see if it was possible, or if there was some issue that I wasn’t aware of that would prevent it. To cut to the chase, alternative #1 — converting NumPy data to PyTorch tensor data once, in bulk — works just fine.
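A sketch of alternative #1 follows. The conversion happens once, in bulk, inside __init__(), and __getitem__() just indexes into the pre-built tensors. As before, the tab-delimited six-predictors-plus-label file layout is a hypothetical assumption.

```python
import numpy as np
import torch as T

class PeopleDataset(T.utils.data.Dataset):
  def __init__(self, src_file, num_rows=None):
    # read into memory as NumPy, then convert everything
    # to tensors once, in bulk (alternative #1)
    all_data = np.loadtxt(src_file, max_rows=num_rows,
      usecols=range(0, 7), delimiter="\t",
      dtype=np.float32, ndmin=2)
    self.x_data = T.tensor(all_data[:, 0:6], dtype=T.float32)
    self.y_data = T.tensor(all_data[:, 6], dtype=T.int64)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    # no conversion here -- just index the stored tensors
    return {'predictors': self.x_data[idx],
            'labels': self.y_data[idx]}
```

A DataLoader consumes this version exactly as it would the on-the-fly version, so the training loop code doesn’t change at all.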
Implementing code to serve up batches of training items is very time consuming and tedious. But it’s a necessary part of creating a neural network model using PyTorch or any other deep neural network library.
Not good fun, but interesting.
Clothes designers face many subjective design decisions. Here are three interesting designs that use a peacock feather theme.