I wrote an article titled “Neural Regression Using PyTorch: Defining a Network” in the February 2021 edition of the online Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/articles/2021/02/11/pytorch-define.aspx.
The article is the second in a series of four articles where I explain how to create a neural regression model.
The goal of a regression problem is to predict a single numeric value. There are several classical statistics techniques for regression problems. Neural regression solves a regression problem using a neural network.
The recurring problem over the series of articles is to predict the price of a house based on four predictor variables: its area in square feet, air conditioning (yes or no), style (“art_deco,” “bungalow,” “colonial”) and local school (“johnson,” “kennedy,” “lincoln”).
The demo program presented in the article begins by creating Dataset and DataLoader objects which have been designed to work with the house data. Next, the demo creates an 8-(10-10)-1 deep neural network. The demo prepares for training by setting up a loss function (mean squared error), an optimizer (Adam) and parameters for training (learning rate and max epochs).
The demo trains the neural network for 500 epochs in batches of 10 items. An epoch is one complete pass through the training data. The training data has 200 items, so one training epoch consists of processing 20 batches of 10 training items each.
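The batch arithmetic above can be checked with a few lines of plain Python. This is only an illustrative sketch: the variable names are made up for the example, and the index list stands in for the 200 training items.

```python
import math

n_items = 200      # training items, as stated in the text
batch_size = 10    # items per batch
max_epochs = 500   # complete passes through the training data

# Number of batches needed to cover the training data once.
batches_per_epoch = math.ceil(n_items / batch_size)

# Each batch triggers one weight update, so this is the total update count.
total_updates = batches_per_epoch * max_epochs

# Slicing a list of item indices into consecutive batches.
indices = list(range(n_items))
batches = [indices[i:i + batch_size] for i in range(0, n_items, batch_size)]

print(batches_per_epoch, total_updates, len(batches[0]))
```

In a real PyTorch program the DataLoader object does the batching, usually with the item order shuffled on each epoch.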
During training, the demo computes and displays a measure of the current error (also called loss) every 50 epochs. Because error slowly decreases, it appears that training is succeeding. Behind the scenes, the demo program saves checkpoint information after every 50 epochs so that if the training machine crashes, training can be resumed without having to start from the beginning.
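A rough sketch of this kind of training loop is shown below. It is not the demo program. The data is synthetic, the learning rate of 0.005 is an assumed value, the network is a stand-in built with nn.Sequential, the run is shortened to 100 epochs, and checkpoints are kept in an in-memory list rather than written to disk.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(1)

# Synthetic stand-in data (NOT the house data): 200 items, 8 predictors, 1 target.
x_all = torch.rand(200, 8)
y_all = x_all.sum(dim=1, keepdim=True) / 8.0

# Stand-in 8-(10-10)-1 network; ReLU hidden activation is an assumption.
net = nn.Sequential(
    nn.Linear(8, 10), nn.ReLU(),
    nn.Linear(10, 10), nn.ReLU(),
    nn.Linear(10, 1),
)

loss_func = nn.MSELoss()                                  # mean squared error
optimizer = torch.optim.Adam(net.parameters(), lr=0.005)  # assumed learning rate

max_epochs = 100    # shortened from 500 to keep the sketch fast
batch_size = 10
report_every = 50
checkpoints = []    # in practice each entry would go to disk with torch.save

for epoch in range(max_epochs):
    epoch_loss = 0.0
    perm = torch.randperm(x_all.shape[0])  # visit items in a new order each epoch
    for start in range(0, x_all.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        optimizer.zero_grad()
        loss = loss_func(net(x_all[idx]), y_all[idx])
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()          # sum of batch losses for this epoch

    if epoch % report_every == 0:
        print(f"epoch = {epoch:4d}   loss = {epoch_loss:0.4f}")
        # Snapshot everything needed to resume: epoch, weights, optimizer state.
        checkpoints.append({
            "epoch": epoch,
            "net_state": copy.deepcopy(net.state_dict()),
            "optimizer_state": copy.deepcopy(optimizer.state_dict()),
        })
```

To resume after a crash, a program would load the most recent checkpoint, call load_state_dict() on both the network and the optimizer, and continue the loop from the saved epoch.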
After training the network, the demo program computes the prediction accuracy of the model based on whether or not the predicted house price is within 10 percent of the true house price. The accuracy on the training data is 93.00 percent (186 out of 200 correct) and the accuracy on the test data is 92.50 percent (37 out of 40 correct). Because the two accuracy values are similar, it is likely that model overfitting has not occurred.
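The accuracy metric described above can be sketched in plain Python. The function name and the sample values are made up for illustration. Whether the demo treats a prediction exactly on the 10 percent boundary as correct is not stated, so the strict comparison here is an assumption.

```python
def accuracy(predicted, actual, pct_close=0.10):
    """Fraction of predictions within pct_close of the true value."""
    n_correct = 0
    for p, a in zip(predicted, actual):
        # A prediction counts as correct if it is closer than 10% of the true price.
        if abs(p - a) < abs(pct_close * a):
            n_correct += 1
    return n_correct / len(actual)


# Made-up normalized prices (price / 1,000,000), for illustration only.
predicted = [0.49, 0.60, 0.30, 0.82]
actual = [0.50, 0.50, 0.31, 0.80]
acc = accuracy(predicted, actual)
print(f"accuracy = {acc:0.4f}")

# The figures quoted in the text follow from simple division.
train_acc = 186 / 200
test_acc = 37 / 40
```

In the made-up sample, only the second prediction (0.60 against a true value of 0.50) misses the 10 percent window.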
Next, the demo uses the trained model to make a prediction on a new, previously unseen house. The raw input is (air conditioning = “no”, square feet area = 2300, style = “colonial”, school = “kennedy”). The raw input is normalized and encoded as (air conditioning = -1, area = 0.2300, style = 0,0,1, school = 0,1,0). The computed output price is 0.49104896, which is equivalent to $491,048.96 because the raw house prices were all normalized by dividing by 1,000,000.
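The encoding of that one house can be sketched in plain Python. The function names are hypothetical, and three details are inferred rather than stated: that “yes” maps to +1, that area is divided by 10,000, and that the one-hot positions follow the order in which the categories are listed.

```python
STYLES = ("art_deco", "bungalow", "colonial")
SCHOOLS = ("johnson", "kennedy", "lincoln")


def one_hot(value, categories):
    """Return a list with 1.0 at the position of value and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_house(air_conditioning, area_sq_ft, style, school):
    """Turn raw predictor values into the eight numeric network inputs."""
    ac = 1.0 if air_conditioning == "yes" else -1.0  # "yes" -> +1 is assumed
    area = area_sq_ft / 10_000                       # divisor inferred from 2300 -> 0.2300
    return [ac, area] + one_hot(style, STYLES) + one_hot(school, SCHOOLS)


def decode_price(output):
    """Convert a normalized network output back to dollars."""
    return output * 1_000_000


x = encode_house("no", 2300, "colonial", "kennedy")
price = decode_price(0.49104896)
print(x)
print(f"${price:,.2f}")
```

The encoded list has eight values, which is why the network has eight input nodes.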
The demo program concludes by saving the trained model using the state dictionary approach. This is the most common of three standard techniques.
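A minimal sketch of the state dictionary approach follows. The network here is an untrained stand-in, and an in-memory buffer is used so that the example does not write to disk. A real program would normally pass a file path to torch.save() and torch.load() instead.

```python
import io
import torch
import torch.nn as nn


def make_net():
    """Stand-in 8-(10-10)-1 network; the ReLU activation is an assumption."""
    return nn.Sequential(
        nn.Linear(8, 10), nn.ReLU(),
        nn.Linear(10, 10), nn.ReLU(),
        nn.Linear(10, 1),
    )


trained = make_net()  # stands in for a network that has already been trained

# Save only the weights and biases (the state dictionary), not the class itself.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)

# Loading requires an instance with the same architecture to receive the values.
buffer.seek(0)
restored = make_net()
restored.load_state_dict(torch.load(buffer))
restored.eval()

x = torch.tensor([[-1.0, 0.2300, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]])
with torch.no_grad():
    same = torch.equal(trained(x), restored(x))
print(same)
```

Because only parameter values are stored, the code that defines the network must be available wherever the saved model is loaded.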
The first step when designing a PyTorch neural network class for a regression problem is to determine its architecture. Neural architecture design includes the number of input and output nodes, the number of hidden layers and the number of nodes in each hidden layer, the activation functions for the hidden and output layers, and the initialization algorithms for the hidden and output layer nodes.
The number of input nodes is determined by the number of predictor values (after normalization and encoding), eight in the case of the house data. For most regression problems, there is just one output node, which holds the numeric value to predict. It is possible for a neural regression system to predict two or more numeric values, in which case it has two or more output nodes, but these problems are quite rare.
The demo network uses two hidden layers, each with 10 nodes, resulting in an 8-(10-10)-1 network. The number of hidden layers and the number of nodes in each layer are hyperparameters. Their values must be determined by trial and error guided by experience. The term “AutoML” is sometimes used for any system that programmatically, to some extent, tries to determine good hyperparameter values.
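An 8-(10-10)-1 architecture might be expressed as a PyTorch class along the following lines. This is a sketch, not the article's code. The layer names, the ReLU hidden activation, and the Xavier uniform weight initialization with zero biases are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    """Regression network: 8 inputs, two hidden layers of 10 nodes, 1 output."""

    def __init__(self, n_in=8, n_hid=10, n_out=1):
        super().__init__()
        self.hid1 = nn.Linear(n_in, n_hid)
        self.hid2 = nn.Linear(n_hid, n_hid)
        self.oupt = nn.Linear(n_hid, n_out)
        # Assumed initialization: Xavier uniform weights, zero biases.
        for layer in (self.hid1, self.hid2, self.oupt):
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        z = torch.relu(self.hid1(x))  # assumed hidden activation
        z = torch.relu(self.hid2(z))
        return self.oupt(z)           # no output activation, so any value can be predicted


net = Net()
batch = torch.zeros(10, 8)  # one batch of 10 items with 8 predictor values each
output = net(batch)

# (8*10 + 10) + (10*10 + 10) + (10*1 + 1) weights and biases
n_params = sum(p.numel() for p in net.parameters())
print(output.shape, n_params)
```

Passing the layer sizes as constructor arguments makes it easy to try other hidden layer sizes during the trial-and-error search for good hyperparameter values.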
More hidden layers and more hidden nodes are not always better. The Universal Approximation Theorem (sometimes called the Cybenko Theorem) says, loosely, that a neural network with just one hidden layer can approximate any continuous function as closely as desired, provided the hidden layer has enough nodes. One informal consequence is that for any neural architecture with multiple hidden layers, there is a roughly equivalent architecture that has just one hidden layer. For example, as a rough rule of thumb, a neural network that has two hidden layers with 5 nodes each is comparable to a network that has one hidden layer with 25 nodes.