Multi-Class Classification Using PyTorch: Preparing Data

I wrote an article titled “Multi-Class Classification Using PyTorch: Preparing Data” in the December 2020 edition of Microsoft Visual Studio Magazine.

The article is the first in a series of four where I present a complete end-to-end demo. The problem with most demos is that to keep the size reasonable, important things are left out, such as how to prepare the data, how to compute prediction accuracy, and how to save checkpoints so that you can recover if the training machine fails during training. By writing four articles, I can cover all the important parts of a multi-class classification system.

The goal of a multi-class classification problem is to predict a value that can be one of three or more possible discrete values, such as “red,” “yellow” or “green” for a traffic signal. My example program predicts a college student’s major (“finance,” “geology” or “history”) from their sex, number of units completed, home state and score on an admission test. The data is synthetic and looks like:

M 39.5 oklahoma 512 geology
F 27.5 nebraska 286 history
M 22.0 maryland 335 finance
F 50.0 nebraska 565 geology
. . .
M 59.5 oklahoma 694 history

The sex values were encoded as “M” = -1 and “F” = +1. The units-completed values were normalized by dividing by 100. The student home state values were one-hot encoded as “maryland” = (1, 0, 0), “nebraska” = (0, 1, 0), “oklahoma” = (0, 0, 1). The test scores were normalized by dividing by 1000. The dependent values-to-predict, student majors, were ordinal encoded as “finance” = 0, “geology” = 1, “history” = 2.
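
The encoding rules above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the article: the function name encode_line and the lookup-table names are hypothetical, and the sketch assumes each raw line has exactly five whitespace-separated fields.

```python
# Hypothetical lookup tables that mirror the encoding described in the text.
SEX_CODES = {"M": -1.0, "F": 1.0}
STATE_CODES = {
    "maryland": (1.0, 0.0, 0.0),
    "nebraska": (0.0, 1.0, 0.0),
    "oklahoma": (0.0, 0.0, 1.0),
}
MAJOR_CODES = {"finance": 0, "geology": 1, "history": 2}

def encode_line(line):
    """Turn one raw line into (six predictor values, integer class label)."""
    sex, units, state, score, major = line.split()
    features = [
        SEX_CODES[sex],          # M = -1, F = +1
        float(units) / 100.0,    # units completed, divided by 100
        *STATE_CODES[state],     # one-hot home state (three values)
        float(score) / 1000.0,   # admission test score, divided by 1000
    ]
    return features, MAJOR_CODES[major]

raw_lines = [
    "M 39.5 oklahoma 512 geology",
    "F 27.5 nebraska 286 history",
    "M 22.0 maryland 335 finance",
]
for raw in raw_lines:
    print(encode_line(raw))
```

Each encoded item has six predictor values (one for sex, one for units, three for state, one for test score) and one integer class label.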

In the early days of PyTorch, the most common approach was to write completely custom code. You can still write one-off code for loading data, but now the most common approach is to implement a Dataset and DataLoader. Briefly, a Dataset object loads all training or test data into memory, and a DataLoader object serves up the data in batches. The bulk of my article describes how to implement a Dataset and DataLoader for the synthetic student data, and how to write a small program to test these objects.
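
To make the Dataset and DataLoader idea concrete, here is a minimal sketch. It is not the implementation from the article: the class name StudentDataset is hypothetical, and the sketch assumes the data has already been encoded and is passed in as a list of rows (six predictors followed by the class label) rather than read from a file.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class StudentDataset(Dataset):
    """Hypothetical Dataset that holds already-encoded student rows in memory."""

    def __init__(self, rows):
        # Each row: six predictor values followed by an integer class label.
        xs = [row[:6] for row in rows]
        ys = [row[6] for row in rows]
        self.x_data = torch.tensor(xs, dtype=torch.float32)
        # Class labels are stored as int64, the type CrossEntropyLoss expects.
        self.y_data = torch.tensor(ys, dtype=torch.int64)

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        return self.x_data[idx], self.y_data[idx]

# Four encoded items taken from the sample data shown earlier.
rows = [
    [-1, 0.395, 0, 0, 1, 0.512, 1],
    [1, 0.275, 0, 1, 0, 0.286, 2],
    [-1, 0.220, 1, 0, 0, 0.335, 0],
    [1, 0.500, 0, 1, 0, 0.565, 1],
]

ds = StudentDataset(rows)
# shuffle=False keeps the order fixed, which is convenient when testing.
loader = DataLoader(ds, batch_size=2, shuffle=False)

for batch_idx, (x, y) in enumerate(loader):
    print(batch_idx, tuple(x.shape), y.tolist())
```

During training you would normally set shuffle=True so that the items are served in a different order on each pass through the data.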

Some of my colleagues squawk when they have to prepare data. But preparing data for use by a neural network can be very interesting if you look at it in the right way.

Tinted sunglasses can be very nice or very ugly. It all depends on how you look at it.
