Don’t Blindly Trust All Machine Learning Datasets

I ran across a machine learning example that used the California Housing dataset. I didn’t know much about that dataset so I did a little exploration. I loaded and examined the dataset using the scikit-learn library:

from sklearn.datasets import fetch_california_housing

bunch = fetch_california_housing(as_frame=True)
info = bunch.DESCR  # text description of the dataset
desc = bunch.frame.describe()  # summary statistics for each column

The info text is, in part:

  :Number of Instances: 20640
  :Number of Attributes: 8 numeric, predictive and the target
  :Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude
    - MedHouseVal   median house value in block (div 100,000)
  :Missing Attribute Values: None

So, essentially, the idea is to predict the median house value (in 1990) in a California “block group” — a census area.

OK, but the summary statistics in desc showed these min and max values:

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  MedHouseVal
min      0.49         1      0.84       0.33           3      0.69         0.14
max     15.00        52    141.90      34.06       35682   1243.33         5.00
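
Here is a minimal sketch of how those two rows can be pulled out of a describe() summary. The helper name min_max_summary is my own invention, not part of pandas or scikit-learn, and it assumes the data is already in a pandas DataFrame.

```python
import pandas as pd


def min_max_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return just the min and max rows of describe() for the numeric columns.

    Hypothetical helper for illustration; not a pandas or scikit-learn API.
    """
    return frame.describe().loc[["min", "max"]]


# Usage with the real data (the dataset is downloaded on the first call):
#   from sklearn.datasets import fetch_california_housing
#   bunch = fetch_california_housing(as_frame=True)
#   print(min_max_summary(bunch.frame))
```

Looking at only the extremes is a quick first sanity check; it will not catch every problem, but it makes values like a four-digit average occupancy hard to miss.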

The data is wacky. For example, the maximum average occupancy in one census block group is 1,243.33 people per house. And one block group has an average of 141.90 rooms per house. What?

Prisons have a big average occupancy per “house”.

Well, I spent some time diving into the data. I sorted the data by average occupancy per house and found the associated latitude and longitude of the California block group that has an average of 1,243.33 people per house. It turned out to be at lat/lon (38.32, -121.98), which . . . drum roll please . . . contains the Solano State Prison in Vacaville, about 55 miles northeast of San Francisco. Mystery solved.
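
The sort-and-look-up step is short in pandas. This is a sketch of the idea rather than the exact code I ran; top_by_column is a hypothetical helper name, and it assumes the frame has the Latitude and Longitude columns listed in the dataset description.

```python
import pandas as pd


def top_by_column(frame: pd.DataFrame, column: str, n: int = 5) -> pd.DataFrame:
    """Return the n rows with the largest values in `column`,
    keeping only that column plus the block group coordinates.

    Hypothetical helper for illustration; assumes Latitude and
    Longitude columns exist in the frame.
    """
    cols = [column, "Latitude", "Longitude"]
    return frame.sort_values(column, ascending=False).head(n)[cols]


# Usage with the real data (the dataset is downloaded on the first call):
#   from sklearn.datasets import fetch_california_housing
#   bunch = fetch_california_housing(as_frame=True)
#   print(top_by_column(bunch.frame, "AveOccup"))
```

The coordinates of the top rows can then be pasted into any online map to see what is actually at that location.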

The conclusion is that the California Housing dataset is not usable as-is for machine learning purposes. The dataset requires many hours of preparation.
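
One small piece of that preparation might be dropping block groups whose per-household averages are implausible for ordinary housing. The sketch below is only an illustration: the function name and the cutoff values are assumptions I made up for the example, not recommended thresholds, and real cleanup would need a lot more thought.

```python
import pandas as pd

# Made-up upper limits, for illustration only. Sensible cutoffs depend
# on the problem and need to be chosen by looking at the data.
ASSUMED_LIMITS = {"AveOccup": 10.0, "AveRooms": 20.0, "AveBedrms": 5.0}


def drop_implausible_rows(frame: pd.DataFrame, limits=None) -> pd.DataFrame:
    """Keep only rows whose values are at or below every given upper limit.

    Hypothetical helper; columns named in `limits` but missing from
    the frame are ignored.
    """
    if limits is None:
        limits = ASSUMED_LIMITS
    keep = pd.Series(True, index=frame.index)
    for column, upper in limits.items():
        if column in frame.columns:
            keep &= frame[column] <= upper
    return frame[keep]


# Usage with the real data (the dataset is downloaded on the first call):
#   from sklearn.datasets import fetch_california_housing
#   bunch = fetch_california_housing(as_frame=True)
#   cleaned = drop_implausible_rows(bunch.frame)
#   print(len(bunch.frame), "rows before,", len(cleaned), "rows after")
```

Comparing the row counts before and after shows how much data a given set of cutoffs throws away, which is worth checking before training anything.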

The moral is don’t blindly trust machine learning datasets.

I’ve never been in prison but it doesn’t sound like it’d be much fun. A chain gang is a group of prisoners chained together to do things such as road work. Chain gangs were introduced in the late 1860s but were mostly phased out by the mid-1950s.

On any given day, there are about 2 million people in U.S. prisons. Incarceration rates vary greatly. According to the U.S. Bureau of Justice Statistics, less than 0.03% of Asian males are jailed at some point in their lives but over 30% of Black males are, and in some census blocks in Baltimore, over 90% are jailed eventually. But I don’t blindly trust the data; the real numbers could be better or even worse.

Left: A chain gang from the early 20th century.

Right: The comedy movie “Pardon Us” (1931) featured Stan Laurel and Oliver Hardy. A very funny movie (to me anyway — funny is in the eye of the beholder).
