Encoding a Categorical Dependent Variable using R

The R language is widely used by data scientists. A common task in machine learning is to encode categorical data as numbers. For example, a classic ML problem is to predict the species of an iris flower, where the possible values are “setosa”, “versicolor”, and “virginica”, from the flower’s sepal length, sepal width, petal length, and petal width. The dependent values can be encoded as (1,0,0) for setosa, (0,1,0) for versicolor, and (0,0,1) for virginica. This scheme is called 1-of-N (or one-hot) encoding.


In many cases, when encoding is necessary, R will automatically encode data behind the scenes. But sometimes you need to encode data explicitly. One way to encode a categorical variable in the R language is to use the model.matrix function. To demonstrate it, I first found Fisher’s Iris Data set, then copied it and pasted it into Notepad. The entire data set has 150 items, so to keep the demo simple I deleted all but 6 rows (two per species) and added a header line. I saved the file as a comma-delimited text file named IrisData.txt.
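The file listing itself is not shown here. As a rough sketch, it can be reconstructed from the encoded output and the column names used in the model.matrix call later in this post. The exact layout of the original file is an assumption; only the values, the header names, and the species labels come from the post.

```r
# Reconstructed demo file (assumed layout): numeric values are taken from the
# encoded output in this post, header names from the model.matrix formula,
# and species labels from the encoded column names.
irisLines <- c(
  "SepLen,SepWid,PetLen,PetWid,Species",
  "5.1,3.5,1.4,0.2,Iris-setosa",
  "4.9,3.0,1.4,0.2,Iris-setosa",
  "7.0,3.2,4.7,1.4,Iris-versicolor",
  "6.4,3.2,4.5,1.5,Iris-versicolor",
  "6.3,3.3,6.0,2.5,Iris-virginica",
  "5.8,2.7,5.1,1.9,Iris-virginica"
)

# Write the lines to IrisData.txt in the current working directory.
writeLines(irisLines, "IrisData.txt")
```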


In R, I loaded the contents of the file into a data frame:

rawIrisData <- read.table("IrisData.txt", header=TRUE, sep=",")
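Before encoding, it can help to check how R read each column. The sketch below is self-contained: it reads the same six rows from an inline string (an assumed copy of the file contents) instead of from disk. In R 4.0 and later, read.table leaves text columns as character strings by default, while older versions converted them to factors. Either way, model.matrix treats the column as categorical, so the explicit factor conversion here is optional.

```r
# Inline copy of the six demo rows (assumed file layout), so no file is needed.
irisText <- "SepLen,SepWid,PetLen,PetWid,Species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica"

rawIrisData <- read.table(text = irisText, header = TRUE, sep = ",")

# Show the column types as R read them.
str(rawIrisData)

# Optional: make the categorical column an explicit factor and list its levels.
rawIrisData$Species <- factor(rawIrisData$Species)
levels(rawIrisData$Species)
```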

Then I encoded the data like so:

encodedIrisData <- model.matrix(~0 + SepLen + SepWid +
 PetLen + PetWid + Species, rawIrisData)

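The 0 in the formula tells R not to add a leading intercept column of 1s. Without the 0, R would add that column and would also drop one of the three species columns, treating the first species as a baseline. That form is needed for regression-style analyses but is not a 1-of-N encoding. With the 0, the result is a 1-of-N encoded matrix suitable for a neural network analysis: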

5.1    3.5    1.4    0.2   1  0  0
4.9    3.0    1.4    0.2   1  0  0
7.0    3.2    4.7    1.4   0  1  0
6.4    3.2    4.5    1.5   0  1  0
6.3    3.3    6.0    2.5   0  0  1
5.8    2.7    5.1    1.9   0  0  1

The result above is shown with the header line removed. The first four columns keep their original names, and the last three column headers for the new encoded data are SpeciesIris-setosa, SpeciesIris-versicolor, and SpeciesIris-virginica.
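To see what the 0 in the formula changes, and to get at the encoded columns by name, a minimal sketch follows. It rebuilds the six demo rows inline so it runs on its own. The variable names noIntercept, withIntercept, speciesCols, xValues, and yValues are my own choices for illustration, and splitting the matrix into predictors and targets is just one plausible next step for a neural network.

```r
# Same six rows as the demo, built inline so the sketch is self-contained.
rawIrisData <- data.frame(
  SepLen = c(5.1, 4.9, 7.0, 6.4, 6.3, 5.8),
  SepWid = c(3.5, 3.0, 3.2, 3.2, 3.3, 2.7),
  PetLen = c(1.4, 1.4, 4.7, 4.5, 6.0, 5.1),
  PetWid = c(0.2, 0.2, 1.4, 1.5, 2.5, 1.9),
  Species = factor(rep(c("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
                       each = 2))
)

# With the 0: no intercept column, one indicator column per species.
noIntercept <- model.matrix(~0 + SepLen + SepWid + PetLen + PetWid + Species,
                            rawIrisData)

# Without the 0: an "(Intercept)" column of 1s is added and the first
# species level is dropped as the baseline (R's default contrasts).
withIntercept <- model.matrix(~SepLen + SepWid + PetLen + PetWid + Species,
                              rawIrisData)

colnames(noIntercept)    # four predictors plus three Species columns
colnames(withIntercept)  # "(Intercept)", four predictors, two Species columns

# Split into predictors and 1-of-N targets, dropping row and column names.
speciesCols <- grep("^Species", colnames(noIntercept))
xValues <- unname(noIntercept[, -speciesCols])
yValues <- unname(noIntercept[, speciesCols])
```

Each row of yValues contains a single 1, which marks the species for that row.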
