I ran across a machine learning example that used the California Housing dataset. I didn’t know much about that dataset, so I did a little exploration. I loaded and examined the dataset using the scikit-learn library:
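    from sklearn.datasets import fetch_california_housing

    # as_frame=True returns pandas objects; bunch.frame holds the features plus the target
    bunch = fetch_california_housing(as_frame=True)
    print(bunch.DESCR)             # text description of the dataset
    bunch.frame.info()             # prints the column summary itself and returns None
    desc = bunch.frame.describe()  # count, mean, std, min, quartiles, max per column
    print(desc)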
The dataset description (bunch.DESCR) is, in part:
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive and the target
:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude
    - MedHouseVal   median house value in block (div 100,000)
:Missing Attribute Values: None
So, essentially, the idea is to predict the median house value (in 1990) in a California “block group” — a census area.
OK, but the more detailed information showed these max and min values:
       MedInc   Age  AveRooms  AveBedrms    Pop  AveOccup  MedVal
min      0.49     1      0.84       0.33      3      0.69    0.14
max     15.00    52    141.90      34.06  35682   1243.33    5.00
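A compact way to pull just the min and max rows out of a DataFrame is sketched below. This is a minimal sketch rather than the code behind the numbers above: `min_max_summary` is a hypothetical helper name, and the toy frame contains made-up rows (apart from the two extreme values quoted in this post) so the snippet runs without downloading anything.

```python
import pandas as pd


def min_max_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a two-row table holding each column's minimum and maximum."""
    return frame.agg(["min", "max"])


# Toy frame: 141.9 and 1243.33 are the extremes quoted in the post,
# every other value is made up purely for illustration.
toy = pd.DataFrame({
    "AveRooms": [5.0, 6.2, 141.9],
    "AveOccup": [2.5, 1243.33, 3.1],
})

summary = min_max_summary(toy)
print(summary)
```

With the real data, the same call would be `min_max_summary(bunch.frame)`, using the `bunch` object loaded earlier.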
The data is wacky. For example, the maximum average occupancy in one census block group is 1,243.33 people per house. And one block group has an average of 141.90 rooms per house. What?
Prisons have a big average occupancy per “house”.
Well, I spent some time diving into the data. I sorted the data by average occupancy per house and found the associated latitude and longitude of the California block group that has an average of 1243.33 people per house. It turned out to be at lat/lon (38.32, -121.98), which . . . drum roll please . . . contains the Solano State Prison in Vacaville, about 55 miles northeast of San Francisco. Mystery solved.
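The sort-and-look-up step can be sketched roughly like this. It is an illustration under stated assumptions, not the exact code used for this post: `top_rows` is a hypothetical helper, and the toy frame has one row with the values quoted above plus two made-up rows, so it runs without fetching the dataset.

```python
import pandas as pd


def top_rows(frame: pd.DataFrame, column: str, n: int = 3) -> pd.DataFrame:
    """Return the n rows with the largest values in `column`, plus their coordinates."""
    cols = [column, "Latitude", "Longitude"]
    return frame.sort_values(column, ascending=False).head(n)[cols]


# Toy rows: the middle row uses the values quoted in the post,
# the other two are made up for illustration only.
toy = pd.DataFrame({
    "AveOccup": [2.8, 1243.33, 3.4],
    "Latitude": [34.05, 38.32, 37.77],
    "Longitude": [-118.24, -121.98, -122.42],
})

worst = top_rows(toy, "AveOccup", n=1)
print(worst)
```

On the real data, `top_rows(bunch.frame, "AveOccup")` should list the 1243.33 block group first, and the latitude and longitude can then be pasted into any map site.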
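The conclusion is that the California Housing dataset is not usable as-is for machine learning purposes. Block groups dominated by prisons or other institutions produce per-house averages that have nothing to do with ordinary homes, and finding and handling them means the dataset requires many hours of preparation.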
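One small piece of that preparation might look like the sketch below, which drops block groups whose per-house averages are implausible for ordinary housing. The helper name `drop_implausible` and both cutoffs (10 people and 20 rooms per house) are assumptions chosen only for illustration; sensible thresholds would need to come from actually inspecting the data. The toy rows are made up except for the two extreme values quoted in this post.

```python
import pandas as pd


def drop_implausible(frame: pd.DataFrame,
                     max_occup: float = 10.0,
                     max_rooms: float = 20.0) -> pd.DataFrame:
    """Keep only rows at or below both cutoffs (the cutoffs are illustrative assumptions)."""
    keep = (frame["AveOccup"] <= max_occup) & (frame["AveRooms"] <= max_rooms)
    return frame[keep].reset_index(drop=True)


# Toy rows: 141.9 rooms and 1243.33 occupants are the extremes quoted in the post,
# the remaining values are made up.
toy = pd.DataFrame({
    "AveRooms": [5.0, 141.9, 6.2],
    "AveOccup": [2.5, 3.0, 1243.33],
})

cleaned = drop_implausible(toy)
print(cleaned)
```

Simply deleting rows is only one option. Flagging them for manual review, as was done for the prison block group, is slower but avoids throwing away legitimate data.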
The moral is don’t blindly trust machine learning datasets.
I’ve never been in prison but it doesn’t sound like it’d be much fun. A chain gang is a group of prisoners chained together to do things such as road work. Chain gangs were introduced in the late 1860s but were mostly phased out by the mid-1950s.
On any given day, there are about 2 million people in U.S. prisons. Incarceration rates vary greatly. According to the U.S. Bureau of Justice Statistics, less than 0.03% of Asian males are jailed at some point in their lives, but over 30% of Black males are, and in some census blocks in Baltimore, over 90% are jailed eventually. But I don’t blindly trust the data; it could be better or even worse.
Left: A chain gang from the early 20th century.
Right: The comedy movie “Pardon Us” (1931) featured Stan Laurel and Oliver Hardy. A very funny movie (to me anyway — funny is in the eye of the beholder).