Factor analysis is a classical statistics technique that isn’t used much in machine learning, but it can be quite valuable. As is often the case with statistics and ML, it’s a bit tricky to explain what factor analysis is without going into a huge amount of detail.
Briefly, if you have a dataset that has many variables, factor analysis can tell you whether groups of those variables are actually driven by a smaller number of hidden (latent) variables. The idea is best explained by an example.
Suppose you ask a bunch of people to rate 8 movies on a scale of 1 (bad) to 5 (excellent). The movies are The Fifth Element, Forbidden Planet, Dark City, Galaxy Quest, The Hangover, Meet the Parents, Ben Hur, and Gladiator.
The raw data might look like:
This means person 01 gives The Fifth Element a rating of 5, gives Forbidden Planet a rating of 4, and so on.
In this example, I’ve deliberately set up the problem so that there are three latent variables that explain the data – science fiction, comedy, and historical.
To do factor analysis in R, you can use the somewhat unfortunately named “factanal” function. If a data frame named dd holds the numeric part of the data, then you could call
fact3 = factanal(dd, factors=3)
to see how well three latent variables fit the data. The results for my dummy data are:
                Factor1 Factor2 Factor3
TheFifthElement   0.757          -0.355
ForbiddenPlanet   0.977  -0.134
TheHangover               0.940  -0.177
MeetTheParents   -0.166   0.800  -0.218
BenHur           -0.204  -0.254   0.915
Gladiator        -0.224  -0.498   0.654
GalaxyQuest       0.585   0.606  -0.435
DarkCity          0.785          -0.240
Notice that “Factor1” (which is science fiction) has high loadings for The Fifth Element, Forbidden Planet, and Dark City, and a moderate loading for Galaxy Quest. “Factor2” captures the comedies The Hangover and Meet the Parents, plus Galaxy Quest, which is both science fiction and comedy. And “Factor3” captures the historical movies Ben Hur and Gladiator.
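If you don’t have ratings of your own, here is a minimal sketch for trying the call on simulated data. Everything about the simulation is my assumption for illustration: the number of raters, the noise level, the weights, and the names raw, to_rating, and histor. These are not the dummy ratings used above, so the loadings will differ from the ones shown.

```r
# Simulated stand-in for the movie ratings (hypothetical data).
set.seed(42)
n <- 300  # number of made-up raters

# One latent taste score per person for each genre
scifi  <- rnorm(n)
comedy <- rnorm(n)
histor <- rnorm(n)

# Turn a latent score plus noise into an integer rating from 1 to 5
to_rating <- function(x) {
  pmin(5, pmax(1, round(3 + x + rnorm(length(x), sd = 0.5))))
}

raw <- data.frame(
  Person          = seq_len(n),
  TheFifthElement = to_rating(scifi),
  ForbiddenPlanet = to_rating(scifi),
  DarkCity        = to_rating(scifi),
  GalaxyQuest     = to_rating(0.7 * scifi + 0.7 * comedy - 0.4 * histor),
  TheHangover     = to_rating(comedy),
  MeetTheParents  = to_rating(comedy),
  BenHur          = to_rating(histor),
  Gladiator       = to_rating(histor)
)

# Keep only the numeric part (drop the person ID column)
dd <- raw[, -1]

fact3 <- factanal(dd, factors = 3)

# Loadings smaller than the cutoff in absolute value print as blanks
print(fact3$loadings, cutoff = 0.1)
```

The order of the factors in the output depends on the data, so the science fiction factor will not necessarily come out as Factor1.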
The SS Loadings output gives you a rough idea of how important each factor is:
            Factor1 Factor2 Factor3
SS loadings   2.607   2.222   1.724
An SS loading is the sum of the squared loadings for one factor. As a rule of thumb, if an SS loading is greater than 1.0, the factor is relevant. If I ran the analysis with four factors, the SS loading for Factor4 would likely be less than 1.0, showing that it’s not important.
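To make the SS loadings concrete, here is a small sketch that recomputes them from the loadings printed above. Two assumptions of mine: loadings that print as blanks are treated as 0 (R hides values smaller than 0.1 in absolute value by default), and the blanks sit in the columns shown in the matrix below. The names L, ss, and relevant are just for illustration.

```r
# Loadings copied from the output above; blank entries entered as 0 (assumption)
L <- matrix(c(
   0.757,  0.000, -0.355,   # TheFifthElement
   0.977, -0.134,  0.000,   # ForbiddenPlanet
   0.000,  0.940, -0.177,   # TheHangover
  -0.166,  0.800, -0.218,   # MeetTheParents
  -0.204, -0.254,  0.915,   # BenHur
  -0.224, -0.498,  0.654,   # Gladiator
   0.585,  0.606, -0.435,   # GalaxyQuest
   0.785,  0.000, -0.240    # DarkCity
), ncol = 3, byrow = TRUE,
dimnames = list(NULL, c("Factor1", "Factor2", "Factor3")))

# SS loading = sum of squared loadings in each factor's column
ss <- colSums(L^2)
print(round(ss, 3))

# Rule of thumb from the text: keep factors whose SS loading exceeds 1.0
relevant <- ss > 1.0
print(relevant)
```

The sums come out slightly below the printed 2.607, 2.222, and 1.724 because the blanked loadings are counted as 0 here.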
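The chi-square test checks the hypothesis that three factors are sufficient to explain the correlations in the data. The p-value is not the probability that the model is correct; it is the probability of seeing a misfit at least this large if three factors really were sufficient. A small p-value (below 0.05, say) is evidence that more factors are needed, so larger values are better. In my demo, the p-value is 0.1970, so the three-factor model is not rejected.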
Factor analysis is typically used with lots of variables. If you have v variables and f factors, then (v-f)^2 must be greater than v+f so that the model has positive degrees of freedom. For the 8 movies, that allows at most 4 factors.
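The constraint is easy to check with a few lines of R. This sketch uses the usual degrees-of-freedom formula for maximum likelihood factor analysis, ((v-f)^2 - (v+f)) / 2, which is positive exactly when (v-f)^2 is greater than v+f. The function name fa_dof is my own, not part of R.

```r
# Degrees of freedom for a factor model with v variables and f factors
fa_dof <- function(v, f) {
  ((v - f)^2 - (v + f)) / 2
}

# The movie example: 8 variables, 3 factors
fa_dof(8, 3)    # 7

# Try 1 through 5 factors for 8 variables
fa_dof(8, 1:5)  # 20 13 7 2 -2
```

With 8 variables, 5 factors gives negative degrees of freedom, so 4 is the most you can ask for.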