Recently, I was looking at the problem of normalizing data in an R language data frame (if you’re not familiar with R you can loosely think of a data frame as a matrix with column headers). By normalizing data, I mean mapping numeric values so they all have roughly the same magnitude.

There are several different kinds of data normalization. I often use “z-score normalization” (also called Gaussian), and “min-max normalization”.
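For reference, min-max normalization maps each value x to (x - min) / (max - min), so every value lands in [0, 1]. A minimal R sketch (the function name minmax_norm is mine, not a built-in):

```r
# min-max normalize a numeric vector so values fall in [0, 1]
minmax_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

minmax_norm(c(2, 4, 6, 10))  # smallest value maps to 0, largest to 1
```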

As is often the case with R, when I searched online, I found many different approaches, notably the built-in scale() function. But I wanted complete control over exactly how I normalize. Here’s an R function that does z-score normalization on the specified columns of a data frame:

normalize <- function(df, cols) {
  result <- df  # make a copy of the input data frame
  for (j in cols) {  # each specified col
    m <- mean(df[,j])    # column mean
    std <- sd(df[,j])    # column (sample) sd
    for (i in 1:nrow(result)) {  # each row of cur col
      result[i,j] <- (result[i,j] - m) / std
    }
  }
  return(result)
}

The function has an outer loop over each column (index j) and an inner loop over each row of the current column (index i). The explicit inner loop can be replaced by a call to the built-in sapply() function, which applies a function to each element of a vector:

result[,j] <- sapply(result[,j], function(x) (x - m) / std)
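In fact, because R arithmetic is vectorized, both the inner loop and the sapply() call can be dropped entirely. One such variation, equivalent to the loop version above (normalize2 is my name for it, and it assumes the specified columns are numeric):

```r
# fully vectorized z-score normalization; same result as the loop version
normalize2 <- function(df, cols) {
  result <- df
  for (j in cols) {
    # subtract the column mean and divide by the sample sd, whole column at once
    result[,j] <- (result[,j] - mean(df[,j])) / sd(df[,j])
  }
  return(result)
}

df <- data.frame(age = c(25, 30, 35), income = c(40, 50, 90))
normalize2(df, c(1, 2))  # each normalized column has mean 0, sd 1
```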

There are many other possible code variations, which is characteristic of R. I wrote a little test script to call the function:

mydf <- read.table("AgeIncomeGpaOccupation.txt",
  header=T, sep=",")
cat("Original data frame:\n")
print(mydf)
cols <- c(1,2,3)
mydf.normal <- normalize(mydf, cols)
cat("\nNormalized data frame:\n")
print(mydf.normal, digits=3)

Writing a custom normalize() function gives me total control. For example, standard z-score normalization uses the sample standard deviation, but with a custom approach I can use the population standard deviation, or average absolute deviation, or any other measure of dispersion.
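To illustrate, a population-standard-deviation variation might look like this. The names pop_sd and normalize_pop are mine; the key point is that R's built-in sd() divides by n-1, while the population version divides by n:

```r
# population standard deviation: divide by n rather than n-1
pop_sd <- function(x) {
  sqrt(sum((x - mean(x))^2) / length(x))
}

# z-score normalization using the population sd instead of sd()
normalize_pop <- function(df, cols) {
  result <- df
  for (j in cols) {
    result[,j] <- (result[,j] - mean(df[,j])) / pop_sd(df[,j])
  }
  return(result)
}
```

The only change from the original function is the dispersion measure, which is the point: with a custom function, swapping in a different denominator is a one-line edit.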
