In many machine learning situations, the most time-consuming and annoying part of the process is getting data ready. A common task is to encode categorical data. For example, suppose you have a Color variable that can be one of three colors (red, white, blue), and the raw data encodes those three colors as 0, 1, or 2. If you are working with a neural network, you’ll want to use 1-of-(N-1) encoding so that 0 = (1, 0), 1 = (0, 1), and 2 = (-1, -1).
If there’s a lot of data (say more than 200 items) then the best approach is usually to write a utility program to do all the work. But for quick and dirty jobs, Excel is a great option.
Suppose raw data is:
age color foo 23 0 0 34 1 1 48 2 0 50 1 1 42 2 1 36 0 0
The first thing I’d do is take care of the purely numeric columns, like Age. For example, I’d use min-max normalization where each value v in the column becomes (v – min) / (max – min). Easy.
Now for the Color column, first I’d replace the 0, 1, 2 with A, B, C (to get rid of 0s and 1s). Next, I’d replace:
A with 1 0
B with 0 1
C with -1 -1
Then in Excel, I’d use the Text to Columns feature to expand the single cells to multiple columns. Neat! This technique also works for 1-of-N encoding where A would be replaced by 1 0 0 and so on.
For the Foo column, which is binary, I’d replace 0 with -1 and leave the 1 values alone.
Moral of the story: Excel is pretty awesome.
Artist Tatsuo Horiuchi creates paintings using only Excel. Crazy Awesome.