In software engineering, a basic knowledge of statistics is often useful. When I was a university professor some years ago, I was always careful to explain to my students the difference between correlation and covariance. As usual, a concrete example is the best way to illustrate the idea. Both correlation and covariance are measures of how closely related two variables are. For example, suppose you have two variables, X and Y, and four pairs of data values:

X  Y
-----
2  5
0  5
2  9
8  9

There are actually several different types of coefficients of correlation, but the most common is usually called Pearson's product-moment correlation coefficient, and is usually given by the symbol r. This is what most students encounter in an introductory statistics class, usually when studying linear regression. One of many equivalent equations for r is r = Σ(Xi – Xm)(Yi – Ym) / (sqrt(Σ(Xi – Xm)^2) * sqrt(Σ(Yi – Ym)^2)), where Xi means each X value and Xm is the mean of all the X values. For the data above, if you compute r you get 0.6667.
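The equation for r above can be translated into code almost directly. Here is a minimal sketch in Python; the function name pearson_r is my own, not anything from a standard library:

```python
import math

# A from-scratch sketch of Pearson's product-moment correlation
# coefficient. Computes r = sum of deviation products divided by the
# product of the square roots of the summed squared deviations.
def pearson_r(xs, ys):
    n = len(xs)
    xm = sum(xs) / n  # mean of the X values
    ym = sum(ys) / n  # mean of the Y values
    num = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xm) ** 2 for x in xs)) * \
          math.sqrt(sum((y - ym) ** 2 for y in ys))
    return num / den

xs = [2.0, 0.0, 2.0, 8.0]
ys = [5.0, 5.0, 9.0, 9.0]
print(round(pearson_r(xs, ys), 4))  # 0.6667
```

Running this on the four data pairs above reproduces the value 0.6667.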

The coefficient of covariance does not have a standard symbol. One of several equations for the coefficient of covariance is cov = Σ(Xi – Xm)(Yi – Ym) / n, where n is the number of data pairs. An equivalent shortcut form is cov = Σ(XiYi)/n – Xm*Ym. For the data above, if you compute cov you get 4.0000.
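The covariance computation is even simpler than the one for r. A minimal sketch (the function name is mine, chosen for clarity):

```python
# A from-scratch sketch of the coefficient of covariance:
# the average product of the deviations of X and Y from their means.
def coefficient_of_covariance(xs, ys):
    n = len(xs)
    xm = sum(xs) / n  # mean of the X values
    ym = sum(ys) / n  # mean of the Y values
    return sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / n

xs = [2.0, 0.0, 2.0, 8.0]
ys = [5.0, 5.0, 9.0, 9.0]
print(coefficient_of_covariance(xs, ys))  # 4.0
```

Note that this version divides by n, the number of data pairs; some textbooks divide by n – 1 instead (the sample covariance), which would give a different value.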

These two statistics are closely related. In fact, the coefficient of correlation is just the covariance divided by the product of the standard deviations of X and Y. But when should you use which statistic? In general, the coefficient of correlation is a better choice in most situations because its value is always normalized to the range [-1.0, +1.0], while the coefficient of covariance has no fixed limits and depends on the scale of the data.
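One exact relationship between the two statistics can be verified numerically: dividing the covariance by the product of the (population) standard deviations of X and Y recovers r. A sketch, with function names of my own choosing:

```python
import math

# Verify that r = cov / (sd_x * sd_y), where cov is the population
# covariance and sd is the population standard deviation.
def population_cov(xs, ys):
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    return sum((x - xm) * (y - ym) for x, y in zip(xs, ys)) / n

def population_sd(vs):
    n = len(vs)
    m = sum(vs) / n
    return math.sqrt(sum((v - m) ** 2 for v in vs) / n)

xs = [2.0, 0.0, 2.0, 8.0]
ys = [5.0, 5.0, 9.0, 9.0]
r = population_cov(xs, ys) / (population_sd(xs) * population_sd(ys))
print(round(r, 4))  # 0.6667
```

For the data above, cov = 4.0, sd of X = 3.0, and sd of Y = 2.0, so r = 4.0 / 6.0 = 0.6667, matching the value computed directly.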