Should You Normalize and Encode Data Before Train-Test Splitting, or After Splitting?

In theory, it’s better to split neural network data into training and test datasets first, and then normalize and encode each dataset using parameters computed from the training data only. In practice, there are advantages to normalizing and encoding all the data first, and then splitting the data. I usually normalize and encode first and then split.

Suppose you have 100 source data items where each data item represents a person:

33  male    68,000.00  sales  moderate
27  female  52,000.00  admin  liberal
41  male    77,000.00  tech   conservative
. . .

Your goal is to create a neural network to predict the political leaning (conservative, moderate, liberal) of a person based on age, sex, income, and job type. At some point you need to encode the categorical predictors (sex and job type), and you should normalize the numeric predictors (age and income).

Additionally, you probably want to split the 100-item source data into an 80-item set for training the neural network and a 20-item set for testing and model evaluation.

The guiding theoretical principle is that you should split the source data into training and test sets before you do anything else, then pretend the test data doesn’t exist. You use the test data only as the very last step, and then the model prediction accuracy on the test data is a rough estimate of how well the model will do on new, previously unseen data.

So, according to theory, it’s a no-brainer: split the data first, then normalize and encode the training data only (remember, the test data doesn’t exist conceptually), and then train the model. After training, normalize and encode the test data using the same parameters computed from the training data (such as min and max) so that it’s compatible with the trained model, and finally use the test data to evaluate the trained model.
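The split-first sequence can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names (split_rows, fit_min_max, apply_min_max) are hypothetical, min-max scaling is used because the paragraph above mentions min and max, and only the first three (age, income) rows come from the sample data shown earlier.

```python
import random

# (age, income) rows. The first three are from the sample data above;
# the last two are made up for illustration.
rows = [(33, 68000.0), (27, 52000.0), (41, 77000.0), (58, 91000.0), (22, 39000.0)]

def split_rows(items, n_train, seed=0):
    """Shuffle a copy of the items and return (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def fit_min_max(train, col):
    """Compute the min and max of one column from the training rows only."""
    values = [r[col] for r in train]
    return min(values), max(values)

def apply_min_max(value, lo, hi):
    """Scale a value with previously fitted parameters.

    A test value can land outside [0, 1] if it is smaller or larger
    than anything seen in the training data.
    """
    return (value - lo) / (hi - lo)

# 1. Split first.
train, test = split_rows(rows, n_train=4)

# 2. Fit the normalization parameters on the training data only.
age_lo, age_hi = fit_min_max(train, col=0)

# 3. Apply the same training parameters to both sets.
train_ages = [apply_min_max(r[0], age_lo, age_hi) for r in train]
test_ages = [apply_min_max(r[0], age_lo, age_hi) for r in test]
```

The key point is step 3: the test ages are scaled with the training min and max, not with values recomputed from the test set.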

But there are advantages to normalizing and encoding first, and then splitting.

If you normalize and encode all the source data first, and then split the data, both the training and test data have additional information compared to the split-first approach because the normalization and encoding process uses all 100 items rather than just the 80 training items.

If you think about it very carefully, you’ll realize that the theoretically-endorsed approach of splitting first will (probably) give you a slightly better final estimate of model accuracy, but if you normalize and encode first and then split you will (probably) get a slightly better prediction model because the training data contains ever so slightly more information.

In practical terms, normalizing and encoding the source data first and then splitting is quite a bit easier than splitting first and then normalizing and encoding two datasets. And if you normalize and encode first you won’t run into a situation where you can’t encode a categorical test predictor because it didn’t appear in the training data. For example, suppose that after splitting first, the training data has only three job types: sales, admin, and tech. They would be encoded as (1 0 0), (0 1 0), and (0 0 1), so the job-type predictor leads to three input nodes in the neural network. Now suppose the test data, by sheer bad luck of the split, has a job type, like exec, that wasn’t in the training data. How can you encode exec? You can’t. Note: The counterargument is that you should always ensure that test data is representative of all data so this scenario should never be allowed.
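The job-type scenario is easy to reproduce. Here is a minimal sketch, assuming a hand-rolled one-hot encoder; the function names fit_one_hot and one_hot are hypothetical, and raising an error for an unseen category is just one way to surface the problem.

```python
def fit_one_hot(values):
    """Map each distinct category to an index, in order of first appearance."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index

def one_hot(value, index):
    """Return a one-hot list; raise ValueError for a category not seen during fitting."""
    if value not in index:
        raise ValueError(f"unknown category: {value!r}")
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec

# Job types that happened to land in the training data after the split.
train_jobs = ["sales", "admin", "tech", "sales"]
job_index = fit_one_hot(train_jobs)

print(one_hot("admin", job_index))   # [0, 1, 0]

# A test item with a job type the training data never contained.
try:
    one_hot("exec", job_index)
except ValueError as err:
    print(err)                        # unknown category: 'exec'
```

With three fitted categories there are only three slots, so exec has nowhere to go. If the encoder had been fitted on all the source data before splitting, exec would have had its own slot.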

The bottom line is that whether you should normalize and encode predictors before splitting into train-test or after splitting into train-test isn’t clear-cut. In most situations the tiny theoretical advantage you get by splitting first and then normalizing and encoding isn’t worth the extra effort required. The final estimate of model accuracy is very fuzzy no matter how you split and normalize-encode.

My usual approach is to normalize and encode all source data first. I normalize using the divide-by-constant approach and I encode using the one-hot technique. Then I split the normalized and encoded data into a training set and a test set. Then I train a model using the training data. And then I evaluate the model using the test data.
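As a rough sketch of that workflow in plain Python: the divisors (100 for age, 100,000 for income), the category orderings, and the helper names are all assumptions made for illustration, and the last two data rows are made up.

```python
import random

# The first three rows are the sample data above; the last two are made up.
raw = [
    (33, "male", 68000.00, "sales", "moderate"),
    (27, "female", 52000.00, "admin", "liberal"),
    (41, "male", 77000.00, "tech", "conservative"),
    (52, "female", 91000.00, "tech", "moderate"),
    (24, "male", 45000.00, "sales", "liberal"),
]

SEXES = ["male", "female"]          # assumed category order
JOBS = ["sales", "admin", "tech"]   # assumed category order
AGE_DIVISOR = 100.0                 # assumed constant
INCOME_DIVISOR = 100_000.0          # assumed constant

def one_hot(value, categories):
    """Return a one-hot list of floats for value within a fixed category list."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode_row(row):
    """Normalize the numeric predictors and one-hot encode the categorical ones."""
    age, sex, income, job, label = row
    features = (
        [age / AGE_DIVISOR]
        + one_hot(sex, SEXES)
        + [income / INCOME_DIVISOR]
        + one_hot(job, JOBS)
    )
    return features, label

# 1. Normalize and encode all the source data first.
encoded = [encode_row(r) for r in raw]

# 2. Then shuffle and split 80-20 into training and test sets.
random.Random(1).shuffle(encoded)
n_train = int(0.8 * len(encoded))
train, test = encoded[:n_train], encoded[n_train:]
```

One thing worth noting about this sketch: because the divisors are fixed constants rather than values computed from the data, the normalized numbers come out the same whether you normalize before or after splitting.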

Left: The 1970 Chevrolet Camaro had a split front bumper. Center: The 1963 Chevrolet Corvette had a split rear window. Right: The 1958 General Motors Firebird III concept car had a split windshield.

1 Response to Should You Normalize and Encode Data Before Train-Test Splitting, or After Splitting?

  1. Thorsten Kleppe says:

    I’m struggling a bit, because I don’t understand why the “split first, normalize after” method should give a slightly better result.

    A fast approach could be:
    Load your data from file, skip your desired test data, and store its indices while you train the model.
    After training, repeat the process, this time skipping the data that was used in training and taking the unseen data for the test.

    A huge follow-up topic is unbalanced data with skewed distributions. What is the best practice for tackling that? The split has a substantial effect on these problems.

    My best idea was to first train on the distribution of the data, and then train the model. This improved the predictions in my trials.
