In the February 2013 issue of MSDN Magazine, I wrote an article titled “Naive Bayes Classification with C#”. See http://msdn.microsoft.com/en-us/magazine/jj891056.aspx. Naive Bayes classification is one of the most fundamental techniques in machine learning — sort of a “Hello World” of machine learning. (I’d say two other “Hello World” machine learning techniques are logistic regression classification and k-means clustering.)
The idea of classification is to predict which category a data item belongs to. For example, does a patient have cancer (yes or no), based on the results of medical tests? Or, how risky is a loan application (low, medium, or high), based on an applicant’s financial data?
There are many different algorithms that perform classification, each with pros and cons. Naive Bayes is most often used when the category to predict has just two possible values (like the cancer example above). Naive Bayes works by analyzing a set of training data where each item’s category is known, and then deriving a probability-based equation that can be used to predict the category of a new data item whose category is unknown.
To demonstrate Naive Bayes classification, in the MSDN Magazine article I created an artificial example where the category to predict is the sex (male or female) of a person, and the predictor variables are job type (construction, administrative, and so on), hand dominance (left-handed or right-handed), and height (72 inches, 64 inches, and so on).
The “naive” in Naive Bayes comes from the fact that the technique assumes the predictor variables (job type, hand dominance, height) are all mathematically independent of each other. This assumption is often not true in practice, but in spite of this, Naive Bayes often works quite well.
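The mechanics can be sketched in a few lines of code. The article itself uses C#, but here is a minimal illustrative sketch in Python instead, using made-up training tuples patterned on the job-type / hand-dominance / height example (the data values and function names below are my own, not taken from the article). Note how the score for each class is a product of per-feature likelihoods — that product is exactly the independence assumption described above:

```python
from collections import Counter

# Tiny made-up training set: (job, dominance, height bucket) -> sex.
# These rows are illustrative only, not data from the article.
train = [
    ("construction", "right", "tall",   "male"),
    ("construction", "left",  "tall",   "male"),
    ("admin",        "right", "short",  "female"),
    ("admin",        "right", "medium", "female"),
    ("admin",        "left",  "short",  "female"),
    ("construction", "right", "medium", "male"),
]

def predict(item, data, classes=("male", "female"), k=1):
    """Naive Bayes with add-k (Laplace) smoothing.
    Assumes the predictor variables are conditionally independent."""
    class_counts = Counter(label for *_, label in data)
    scores = {}
    for c in classes:
        # Prior P(c): fraction of training items in class c.
        score = class_counts[c] / len(data)
        rows = [row for row in data if row[-1] == c]
        for i, value in enumerate(item):
            # Smoothed likelihood P(feature_i = value | c).
            match = sum(1 for row in rows if row[i] == value)
            distinct = len({row[i] for row in data})
            score *= (match + k) / (len(rows) + k * distinct)
        scores[c] = score
    # Normalize so the class scores sum to 1.
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

probs = predict(("construction", "left", "medium"), train)
print(max(probs, key=probs.get))  # -> male
```

A left-handed construction worker of medium height scores higher for “male” here simply because “construction” appears only in male training rows; the smoothing constant k keeps an unseen feature value from zeroing out an entire class.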