Boston Housing Dataset Regression Using PyTorch

The Boston Housing dataset is a standard benchmark for regression algorithms. The goal of the Boston Housing problem is to predict the median price of a house in one of 506 towns near Boston. There are 13 predictor variables — average number of rooms in houses in town, tax rate, crime rate, percent of Black people in town, and so on.

The Boston Housing dataset was one of the first problems I looked at when PyTorch was new (version 0.4) and still rough around the edges. I decided I’d revisit the problem and use all the knowledge I had gained over the past 3 years.

First I prepared the Boston housing dataset by normalizing the numeric predictors (using order-magnitude normalization rather than min-max or z-score normalization) and encoding the single Boolean predictor. See https://jamesmccaffrey.wordpress.com/2021/08/18/preparing-the-boston-housing-dataset-for-pytorch/.
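The idea behind order-magnitude normalization can be sketched in a few lines. This is a rough sketch, not the exact code from the linked preparation post, and the helper name is mine: roughly, each column is divided by a power of 10 large enough to map all its values into [-1, +1].

```python
import numpy as np

def order_mag_normalize(col):
  # divide by the smallest power of 10 that maps
  # every value in the column into [-1, +1]
  max_abs = np.max(np.abs(col))
  k = np.ceil(np.log10(max_abs)) if max_abs > 0 else 0
  return col / (10.0 ** k)

# e.g. [2.5, 710.0, 65.2] -> divided by 1000
print(order_mag_normalize(np.array([2.5, 710.0, 65.2])))
```

One nice property is that, unlike z-score normalization, the normalized values are easy to interpret: you can recover the raw value by inspection.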

Compared to my early versions, I made four main changes in my new version. First, I used order-magnitude normalization to prepare the dataset, instead of min-max or z-score normalization. Second, I used a PyTorch Dataset and DataLoader to serve up training data instead of custom code. Third, I placed the training code into a program-defined train() function instead of placing all the code in main(). Fourth, I used Adam optimization instead of basic SGD optimization. I also made several minor changes, such as using relu() activation rather than tanh() activation on the hidden layers, and explicit Glorot initialization rather than default implicit initialization on the hidden layer weights.
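As an aside, the explicit Glorot initialization can also be expressed using the apply() mechanism, which is equivalent to the layer-by-layer init calls in the demo code below. The Sequential model here is just a stand-in for the demo's Net class:

```python
import torch as T

def init_weights(m):
  # explicit Glorot (Xavier) uniform init for every Linear layer,
  # with biases set to zero
  if isinstance(m, T.nn.Linear):
    T.nn.init.xavier_uniform_(m.weight)
    T.nn.init.zeros_(m.bias)

net = T.nn.Sequential(T.nn.Linear(13, 10), T.nn.ReLU(),
  T.nn.Linear(10, 10), T.nn.ReLU(), T.nn.Linear(10, 1))
net.apply(init_weights)  # recursively applies to each sub-module
```

The apply() pattern scales better when a network has many layers, at the cost of being slightly less obvious to beginners.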

There are a few new techniques I didn’t use to keep the size of my demo reasonable — saving fully reproducible training checkpoints, efficient batch accuracy computation, and so on.
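For completeness, here is a rough sketch of what a fully reproducible training checkpoint might look like: model weights, optimizer state, epoch number, and RNG state saved together. The function names are mine and aren't part of the demo:

```python
import torch as T

def save_checkpoint(fn, epoch, model, optimizer):
  # bundle everything needed to resume training exactly
  chkpt = {
    'epoch': epoch,
    'model_state': model.state_dict(),
    'optim_state': optimizer.state_dict(),
    'torch_rng': T.random.get_rng_state(),
  }
  T.save(chkpt, fn)

def load_checkpoint(fn, model, optimizer):
  chkpt = T.load(fn)
  model.load_state_dict(chkpt['model_state'])
  optimizer.load_state_dict(chkpt['optim_state'])
  T.random.set_rng_state(chkpt['torch_rng'])
  return chkpt['epoch']
```

In a real training loop you'd call save_checkpoint() every so many epochs, and also save the NumPy RNG state if your code uses it.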

I was satisfied with the new version of regression for the Boston Housing dataset. It would take many pages to explain all the details of the demo code, so instead I'll just say: if you're interested, take the demo code below, get it running, and examine it carefully.



Three unusual houses from movies I like. Left: The “Swiss Family Robinson” (1960) treehouse at Disneyland. I worked on that attraction when I was a college student. Its name was changed to “Tarzan’s Treehouse” at some point. Center: A real-life house based on “The Flintstones” (1994). Right: Bilbo Baggins’ house from “The Lord of the Rings: The Fellowship of the Ring” (2001).


Code below. Long.

# boston.py
# Boston Area House Price dataset regression
# PyTorch 1.9.0-CPU Anaconda3-2020.02  Python 3.7.6
# Windows 10

import numpy as np
import torch as T

device = T.device("cpu")

# -----------------------------------------------------------

class BostonDataset(T.utils.data.Dataset):
  # features are in cols [0,12], median price in [13]

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(0,14),
      delimiter="\t", comments="#", dtype=np.float32)
    self.x_data = T.tensor(all_xy[:,0:13]).to(device) 
    self.y_data = T.tensor(all_xy[:,13].reshape(-1,1)).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    preds = self.x_data[idx]
    price = self.y_data[idx] 
    sample = { 'predictors' : preds, 'price' : price }
    return sample

# -----------------------------------------------------------

class Net(T.nn.Module):
  def __init__(self):
    super(Net, self).__init__()
    self.hid1 = T.nn.Linear(13, 10)  # 13-(10-10)-1
    self.hid2 = T.nn.Linear(10, 10)
    self.oupt = T.nn.Linear(10, 1)

    T.nn.init.xavier_uniform_(self.hid1.weight)  # glorot
    T.nn.init.zeros_(self.hid1.bias)
    T.nn.init.xavier_uniform_(self.hid2.weight)
    T.nn.init.zeros_(self.hid2.bias)
    T.nn.init.xavier_uniform_(self.oupt.weight)
    T.nn.init.zeros_(self.oupt.bias)

  def forward(self, x):
    z = T.relu(self.hid1(x))  # relu, per the changes listed above
    z = T.relu(self.hid2(z))
    z = self.oupt(z)  # no activation, aka Identity()
    return z

# -----------------------------------------------------------


def train(model, ds, bs, lr, me, le):
  train_ldr = T.utils.data.DataLoader(ds,
    batch_size=bs, shuffle=True)
  loss_func = T.nn.MSELoss()
  optimizer = T.optim.Adam(model.parameters(), lr=lr)

  for epoch in range(0, me):
    epoch_loss = 0  # for one full epoch

    for (b_idx, batch) in enumerate(train_ldr):
      X = batch['predictors']
      y = batch['price']

      optimizer.zero_grad()
      oupt = model(X)
      loss_val = loss_func(oupt, y)  # a tensor
      epoch_loss += loss_val.item()  # accumulate
      loss_val.backward()  # compute gradients
      optimizer.step()     # update weights

    if epoch % le == 0:
      print("epoch = %4d  loss = %0.4f" % (epoch, epoch_loss)) 

# -----------------------------------------------------------

def accuracy(model, ds, pct):
  n_correct = 0; n_wrong = 0

  for i in range(len(ds)):    # one item at a time
    # (X, Y) = ds[i]            # (predictors, target)
    X = ds[i]["predictors"]
    y = ds[i]["price"]
    with T.no_grad():
      oupt = model(X)         # computed price

    abs_delta = np.abs(oupt.item() - y.item())
    max_allow = np.abs(pct * y.item())
    if abs_delta < max_allow:
      n_correct += 1; correct = True
    else:
      n_wrong += 1; correct = False

    if i % 100 == 0:
      # print("-----------------------------")
      print("i = %4d " % i, end="")
      print("predicted = %0.4f " % oupt.item(), end="")
      print("actual = %0.4f " % y.item(), end="")
      print("delta = %0.4f " % abs_delta, end="")
      print("max_allow = %0.4f " % max_allow, end="")

      if correct == True:
        print("correct")
      else:
        print("wrong")

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  return acc

# -----------------------------------------------------------


def main():
  # 0. get started
  print("\nBoston Housing Dataset regression using PyTorch ")
  np.random.seed(1)
  T.manual_seed(1) 

  # 1. create Dataset object
  print("Creating Boston train Dataset ")

  train_file = ".\\Data\\boston_all_om_normed.txt"
  train_ds = BostonDataset(train_file)

  # 2. create model
  print("\nCreating 13-(10-10)-1 DNN regression model ")
  net = Net().to(device)
  net.train()    # set mode

  # 3. train model
  batch_size = 4
  learn_rate = 0.01
  max_epochs = 1000
  log_every = 100

  print("\nSetting batch size = %d " % batch_size)
  print("Optimizer = Adam ")
  print("learn_rate = %0.3f " % learn_rate)

  print("\nStarting training ")
  train(net, train_ds, batch_size, learn_rate,
    max_epochs, log_every)
  print("Done ")

  # 4. compute accuracy
  print("\nComputing model accuracy ")
  acc = accuracy(net, train_ds, 0.15)
  print("\nModel accuracy (within 0.15) \
train data = %0.4f " % acc)

  # 5. TODO: save trained model

  # 6. TODO: use trained model to make a prediction

  print("\nEnd demo ")

if __name__=="__main__":
  main()
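For reference, the two TODO items in main() might look something like this. The save path and the input predictor values are hypothetical placeholders, and the Sequential model stands in for the demo's Net class:

```python
import numpy as np
import torch as T

device = T.device("cpu")
net = T.nn.Sequential(T.nn.Linear(13, 10), T.nn.ReLU(),
  T.nn.Linear(10, 10), T.nn.ReLU(), T.nn.Linear(10, 1))

# 5. save trained model (weights only)
fn = ".\\Models\\boston_model.pt"   # hypothetical path
# T.save(net.state_dict(), fn)

# 6. use trained model to make a prediction
net.eval()  # set mode
x = np.array([[0.000273, 0.00, 0.0707, -1, 0.0469, 0.6421,
  0.789, 0.49671, 0.02, 0.242, 0.178, 0.39690, 0.0914]],
  dtype=np.float32)  # hypothetical normalized predictor values
with T.no_grad():
  pred = net(T.tensor(x, dtype=T.float32).to(device))
print("predicted median price = %0.4f " % pred.item())
```

Saving just the state_dict() (rather than the whole model object) is the usual approach, because loading then only requires the Net class definition.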

1 Response to Boston Housing Dataset Regression Using PyTorch

  1. Thorsten Kleppe says:

    You’ve improved slightly, but I’d bet you could pulverize the result if you wanted to; that would be interesting.

    It would also be really cool if this demo also worked online. Then everyone could really learn to understand your work. For your projects without PyTorch this is possible, which I like to do on the fly. With the following code you could even read the dataset online.

    import requests
    link = "https://raw.githubusercontent.com/eric-bunch/boston_housing/master/boston.csv"
    f = requests.get(link)
    print(f.text)

    Just copy and run e.g. here:
    https://www.codabrainy.com/en/python-compiler/
