## Factor Analysis

Factor analysis is a classical statistical technique that isn’t used much in machine learning, but it can be quite valuable. As is often the case with statistics and ML, it’s tricky to explain what factor analysis is without going into a huge amount of detail.

Briefly, if you have a dataset that has many variables, factor analysis can tell you whether groups of those variables are driven by a smaller number of hidden (latent) variables. The idea is best explained by an example.

Suppose you ask a bunch of people to rate 8 movies on a scale of 1 (bad) to 5 (excellent). The movies are The Fifth Element, Forbidden Planet, Dark City, Galaxy Quest, The Hangover, Meet the Parents, Ben Hur, and Gladiator.

The raw data might look like:

```
P01,5,4,2,3,1,2,4,5
P02,2,1,5,5,1,1,4,3
...
```

This means person 01 gives The Fifth Element a rating of 5, gives Forbidden Planet a rating of 4, and so on.
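As a minimal sketch of getting data in this shape into R, assuming comma-separated rows with a person ID in the first field: the column names and the `ratings` variable are illustrative choices, and only the two sample rows shown above are included.

```r
# Only the two sample rows shown above; a real file would hold many more.
raw <- "P01,5,4,2,3,1,2,4,5
P02,2,1,5,5,1,1,4,3"

# Illustrative column names, in the order the movies are listed.
movies <- c("TheFifthElement", "ForbiddenPlanet", "DarkCity", "GalaxyQuest",
            "TheHangover", "MeetTheParents", "BenHur", "Gladiator")

# read.csv(text = ...) parses a string; pass a file path instead for real data.
ratings <- read.csv(text = raw, header = FALSE,
                    col.names = c("Person", movies))

# Keep only the numeric rating columns for the analysis.
dd <- ratings[, movies]
print(dd)
```

Dropping the ID column matters because factor analysis works on the numeric ratings only.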

In this example, I’ve deliberately set up the problem so that there are three latent variables that explain the data – science fiction, comedy, and historical.
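One hypothetical way to build dummy data with this structure is to draw three latent scores per person and make each rating depend on the relevant score (or scores) plus noise. This is a sketch under assumed weights and noise levels, not the procedure used for the data in this post; the names `scifi`, `comedy`, `historical`, `noise`, and `to_rating` are illustrative.

```r
set.seed(1)   # fixed seed so the sketch is reproducible
n <- 500      # number of simulated people (arbitrary)

# One latent "taste" score per person for each genre.
scifi      <- rnorm(n)
comedy     <- rnorm(n)
historical <- rnorm(n)

# Map a continuous score onto the 1-5 rating scale.
to_rating <- function(x) pmin(5, pmax(1, round(3 + x)))
# Fresh per-movie noise on every call.
noise     <- function() rnorm(n, sd = 0.5)

dd <- data.frame(
  TheFifthElement = to_rating(scifi + noise()),
  ForbiddenPlanet = to_rating(scifi + noise()),
  DarkCity        = to_rating(scifi + noise()),
  GalaxyQuest     = to_rating(0.7 * scifi + 0.7 * comedy + noise()),  # two genres
  TheHangover     = to_rating(comedy + noise()),
  MeetTheParents  = to_rating(comedy + noise()),
  BenHur          = to_rating(historical + noise()),
  Gladiator       = to_rating(historical + noise())
)

# Movies that share a latent variable should correlate with each other.
print(round(cor(dd), 2))
```

The correlation matrix should show blocks of related movies, which is the pattern factor analysis tries to explain with a few latent variables.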

To do factor analysis in R, you can use the somewhat unfortunately named “factanal” function. If a data frame named dd holds the numeric part of the data, then you could call

```
fact3 = factanal(dd, factors=3)
```

to see how well three latent variables fit the data. The results for my dummy data are:

```
                Factor1 Factor2 Factor3
TheFifthElement  0.757          -0.355
ForbiddenPlanet  0.977  -0.134
TheHangover              0.940  -0.177
MeetTheParents  -0.166   0.800  -0.218
BenHur          -0.204  -0.254   0.915
GalaxyQuest      0.585   0.606  -0.435
DarkCity         0.785          -0.240
```

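The factanal function doesn’t name the factors; the labels are my interpretation. Notice that “Factor1” (science fiction) loads heavily on The Fifth Element, Forbidden Planet, and Dark City, and moderately on Galaxy Quest. “Factor2” captures the comedies The Hangover and Meet the Parents, and also Galaxy Quest, which is both science fiction and comedy. And “Factor3” captures the historical movies Ben Hur and Gladiator.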
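To make that reading of the table mechanical, the sketch below copies the printed loadings into a matrix and picks, for each movie, the factor with the largest absolute loading. Blank cells are entered as `NA` because R’s print method for loadings hides values below a cutoff (0.1 by default), so their exact values are unknown here. The 0.5 threshold for flagging cross-loadings is an arbitrary assumption, the names `loads`, `dominant`, and `cross` are illustrative, and only the seven movies listed in the table are included.

```r
# Loadings copied from the printed table; NA marks a blank (suppressed) cell.
loads <- matrix(c(
   0.757,     NA, -0.355,
   0.977, -0.134,     NA,
      NA,  0.940, -0.177,
  -0.166,  0.800, -0.218,
  -0.204, -0.254,  0.915,
   0.585,  0.606, -0.435,
   0.785,     NA, -0.240),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    c("TheFifthElement", "ForbiddenPlanet", "TheHangover", "MeetTheParents",
      "BenHur", "GalaxyQuest", "DarkCity"),
    c("Factor1", "Factor2", "Factor3")))

# For each movie, the factor with the largest absolute loading
# (which.max skips NA values).
dominant <- colnames(loads)[apply(abs(loads), 1, which.max)]
names(dominant) <- rownames(loads)
print(dominant)

# Movies that load at least 0.5 (assumed threshold) on more than one factor.
cross <- names(which(rowSums(abs(loads) >= 0.5, na.rm = TRUE) > 1))
print(cross)
```

With these numbers, Galaxy Quest is the only movie flagged as loading on two factors.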

The SS loadings output (the sum of squared loadings for each factor) gives you a rough idea of how important each factor is:

```
               Factor1 Factor2 Factor3