Min-Max Data Normalization with Python

I coded up a Python function that does min-max normalization on data stored in an array-of-arrays style matrix. I’ll explain why I did this shortly.

Suppose you have height and weight data for a group of people. For example:

65.0, 220.0 
73.0, 160.0
. . .

Heights are in inches, like 65, and weights are in pounds, like 220. In many machine learning situations, you want to normalize the data: scale it so that the values in different columns have roughly the same magnitude, and large values (like the weights) don’t overwhelm smaller values (like the heights).
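To make the "overwhelm" idea concrete, here is a quick sketch using the two sample rows. I'm assuming Euclidean distance as the measure of how far apart two people are, which is my choice for illustration. The variable names are made up for the demo.

```python
import math

# The two sample rows from above: (height in inches, weight in pounds).
a = (65.0, 220.0)
b = (73.0, 160.0)

# Squared difference contributed by each column.
height_term = (a[0] - b[0]) ** 2   # (-8.0)^2 = 64.0
weight_term = (a[1] - b[1]) ** 2   # (60.0)^2 = 3600.0

# Euclidean distance (an assumption for this demo) and the fraction
# of the squared distance that comes from the weight column alone.
distance = math.sqrt(height_term + weight_term)
weight_share = weight_term / (height_term + weight_term)

print(f"distance = {distance:.2f}")
print(f"weight share of squared distance = {weight_share:.1%}")
```

On the raw data, almost all of the distance comes from the weight column, so the heights barely matter.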

There are several normalization algorithms. One of the simplest is min-max normalization. For each column, each value x is replaced by (x - min) / (max - min), where min is the smallest value in the column and max is the largest value in the column.

After min-max normalization, all values will be between 0.0 and 1.0 (where 0.0 corresponds to the smallest raw value, 1.0 to the largest).
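Here is a minimal sketch of the idea, assuming the matrix is a plain Python list of lists with no missing values. The function name min_max_normalize, the third data row, and the choice to map a constant column to 0.0 (to avoid dividing by zero) are my own assumptions for the demo.

```python
def min_max_normalize(matrix):
    """Return a new list-of-lists with each column scaled to [0.0, 1.0]."""
    if not matrix:
        return []
    n_cols = len(matrix[0])

    # Smallest and largest value in each column.
    mins = [min(row[j] for row in matrix) for j in range(n_cols)]
    maxs = [max(row[j] for row in matrix) for j in range(n_cols)]

    result = []
    for row in matrix:
        new_row = []
        for j in range(n_cols):
            span = maxs[j] - mins[j]
            if span == 0:
                # Assumption: a constant column becomes all 0.0
                # rather than raising a division-by-zero error.
                new_row.append(0.0)
            else:
                new_row.append((row[j] - mins[j]) / span)
        result.append(new_row)
    return result


# First two rows are the sample data; the third row is hypothetical.
data = [[65.0, 220.0],
        [73.0, 160.0],
        [69.0, 190.0]]

normalized = min_max_normalize(data)
for row in normalized:
    print(row)
```

The function builds a new matrix instead of changing the data in place, so the raw values stay available for comparison.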

OK, so why am I doing this? My ultimate goal is to do k-means clustering using the CNTK code library. My strategy is to first code k-means using plain Python, and then refactor the code to CNTK. k-means works with distances between data items, so if you don’t normalize the data first, the columns with large values dominate the result. Ergo, I need to write normalization code.

“New York Eve” – Max Zorn. Made entirely from packing tape.
