Encoding Data for Machine Learning using Excel

In many machine learning situations, the most time-consuming and annoying part of the process is getting data ready. A common task is to encode categorical data. For example, suppose you have a Color variable that can be one of three colors (red, white, blue), and the raw data encodes those three colors as 0, 1, or 2. If you are working with a neural network, you’ll want to use 1-of-(N-1) encoding so that 0 = (1, 0), 1 = (0, 1), and 2 = (-1, -1).

If there’s a lot of data (say more than 200 items) then the best approach is usually to write a utility program to do all the work. But for quick and dirty jobs, Excel is a great option.

Suppose raw data is:

age	color	foo
23	0	0
34	1	1
48	2	0
50	1	1
42	2	1
36	0	0

The first thing I’d do is take care of the purely numeric columns, like Age. For example, I’d use min-max normalization where each value v in the column becomes (v – min) / (max – min). Easy.

Now for the Color column, first I’d replace the 0, 1, 2 with A, B, C (to get rid of 0s and 1s). Next, I’d replace:

A with 1 0
B with 0 1
C with -1 -1

Then in Excel, I’d use the Text to Columns feature to expand the single cells to multiple columns. Neat! This technique also works for 1-of-N encoding where A would be replaced by 1 0 0 and so on.

For the Foo column, which is binary, I’d replace 0 with -1 and leave the 1 values alone.

Moral of the story: Excel is pretty awesome.

Artist Tatsuo Horiuchi creates paintings using only Excel. Crazy Awesome.

This entry was posted in Machine Learning. Bookmark the permalink.