Latent Dirichlet allocation (LDA) is a machine learning technique that is most often used to analyze the topics in a set of documents. The problem scenario is best explained by a concrete example. Suppose you have 100 documents, where each document is a one-page news story. First you select the number of topics, k. Suppose you set k = 3, and, unknown to you, these three latent topics are “sports”, “politics”, and “business”.
Next, you analyze the documents, using word frequencies. Suppose that the words (“score”, “win”, “record”, “team”) map mostly to the “sports” topic, the words (“democrat”, “republican”, “law”, “bill”) map mostly to the “politics” topic, and the words (“profits”, “sales”, “revenue”, “tax”) map mostly to the “business” topic. But notice that a word can correspond to more than one topic. For example, the word “loss” could be associated with sports (a team had a loss in a game), business (profits and losses), or politics (a candidate suffered a loss in an election).
Next, you can use the word-topic mapping information to analyze each of the 100 documents. Your results might be something like Document #1 is 85% sports, 5% politics, and 10% business. Document #2 is 1% sports, 48% politics, and 51% business. And so on. Note that I’ve greatly simplified this explanation at the expense of some technical accuracy.
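The scenario above can be sketched in code. This is a minimal, hypothetical example using scikit-learn’s LatentDirichletAllocation on a made-up three-document corpus (not the 100-document scenario), just to show the shape of the per-document topic percentages:

```python
# Minimal sketch of LDA topic extraction with scikit-learn.
# The tiny corpus and k = 3 are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team had a big win and set a new score record",
    "the democrat and republican debate a new law and bill",
    "quarterly profits and sales revenue rose despite the tax",
]

# LDA works on raw word counts, not tf-idf weights
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape: (3 docs, 3 topics)

# Each row is a per-document topic distribution that sums to 1,
# analogous to "Document #1 is 85% sports, 5% politics, 10% business"
for i, dist in enumerate(doc_topic):
    print(f"Document #{i + 1}:", [round(p, 2) for p in dist])
```

With only three short documents the learned topics are essentially arbitrary, but the output format — one probability per topic, per document — matches the description above.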
The mathematics behind latent Dirichlet allocation is based on the Dirichlet probability distribution, which is a fascinating topic in its own right. I tend to think of the Dirichlet probability distribution as an extension of the Beta distribution. But that’s not useful information for most people. I often use a Beta distribution in my work (directly or indirectly), so Beta is a good mental point of reference for me.
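The Beta-to-Dirichlet relationship can be checked numerically. A small sketch using NumPy, where the parameter values (2 and 5) are arbitrary choices for illustration: a two-parameter Dirichlet is just a Beta distribution, and a k-parameter Dirichlet gives a point on a k-way probability simplex, like a document’s mix over k topics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dirichlet(2, 5) draws pairs (p, 1 - p); the first component
# is distributed as Beta(2, 5)
dirichlet_draws = rng.dirichlet([2.0, 5.0], size=100_000)
beta_draws = rng.beta(2.0, 5.0, size=100_000)

# Both sample means should be near 2 / (2 + 5) ≈ 0.286
print(round(dirichlet_draws[:, 0].mean(), 3))
print(round(beta_draws.mean(), 3))

# A 3-parameter Dirichlet draw is a 3-way probability vector,
# e.g. a document's mix over k = 3 topics; components sum to 1
topic_mix = rng.dirichlet([1.0, 1.0, 1.0])
print(topic_mix, round(topic_mix.sum(), 3))
```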
Latent Dirichlet allocation is an unsupervised machine learning technique: the analysis is based only on word frequencies, so raw, unlabeled documents can be analyzed. One way to think about latent Dirichlet allocation is that it resembles clustering, but with probabilistic assignments: each document gets a probability for each topic rather than a single hard cluster label.
The term Dirichlet is capitalized because it’s named after Johann Peter Gustav Lejeune Dirichlet, a German mathematician who lived in the first half of the 1800s. It’s not exactly known how to pronounce “Dirichlet” because the surname was coined by his grandfather. The “ch” can be pronounced like an “sh” sound, or a hard “k” sound. And the ending “et” can be pronounced in French fashion as “lay” or as “let” with a hard “t” sound.
Latent Dirichlet allocation was first explained in a 2003 research paper, but like most techniques, the key ideas were published earlier. In machine learning, the acronym LDA is ambiguous because it can also stand for linear discriminant analysis, which is a completely different technique. (Linear discriminant analysis is a relatively crude classification technique based on analysis of variance.) So I sometimes mildly scold my colleagues if they use LDA in a presentation without defining which LDA they’re referring to.