“Nearest Centroid Classification for Numeric Data Using C#” in Visual Studio Magazine

I wrote an article titled “Nearest Centroid Classification for Numeric Data Using C#” in the June 2024 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/Articles/2024/06/17/nearest-centroid-classification.aspx.

The goal of a machine learning classification system is to predict the value of a discrete variable. For example, you might want to predict the political leaning (conservative, moderate, or liberal) of a person based on their sex, age, and so on.

There are many machine learning classification techniques. Common techniques include logistic regression, neural network classification, naive Bayes classification, decision tree classification, k-nearest neighbors classification, and several more.

My article presents a complete end-to-end demo of a technique called nearest centroid classification. Briefly, in nearest centroid classification, the vector centroids (also called means or averages) in the training data are computed for each of the classes to predict. To classify a data item, the distance between the item and each centroid is computed. The predicted class is the class associated with the nearest centroid.
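
The two steps described above (compute one centroid per class, then pick the class with the nearest centroid) can be sketched in a few lines. This is a minimal Python illustration, not the C# code from the article. The function names `compute_centroids` and `predict`, the use of Euclidean distance, and the tiny two-feature dataset are all assumptions made for the example.

```python
import math

def compute_centroids(train_x, train_y):
    """Return {class_label: centroid}, where each centroid is the
    per-feature mean of the training vectors with that label."""
    sums = {}
    counts = {}
    for x, label in zip(train_x, train_y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        for j, value in enumerate(x):
            sums[label][j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]]
            for label in sums}

def predict(centroids, x):
    """Return the label whose centroid is closest to x.
    Euclidean distance is an assumption of this sketch."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

# Hypothetical toy data: two classes, two numeric predictors.
train_x = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
train_y = [0, 0, 1, 1]

centroids = compute_centroids(train_x, train_y)
print(centroids)                         # {0: [1.5, 1.5], 1: [8.5, 8.5]}
print(predict(centroids, [2.0, 2.0]))    # 0
print(predict(centroids, [7.0, 9.0]))    # 1
```

Because the trained model is nothing more than one mean vector per class, you can read the centroids directly to see what a typical member of each class looks like.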

Four advantages of nearest centroid classification (NCC) are that NCC is easy to implement, NCC can work with very small datasets, NCC is highly interpretable, and NCC works for both binary classification and multi-class classification. Two disadvantages of NCC are that basic NCC works only with strictly numeric predictor variables (although there are new techniques to modify NCC to work with mixed numeric and categorical predictors), and most importantly, NCC is among the least powerful classification techniques because it doesn’t take interactions between predictor variables into account.

The goal of the demo problem is to predict the species of a penguin (0 = Adelie, 1 = Chinstrap, 2 = Gentoo) from its bill length, bill width, flipper length, and body mass. The raw data looks like:

2, 50.0, 16.3, 230.0, 5700.0
0, 39.1, 18.7, 181.0, 3750.0
1, 38.8, 17.2, 180.0, 3800.0
2, 39.3, 20.6, 190.0, 3650.0
0, 39.2, 19.6, 195.0, 4675.0
. . .
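
As a rough illustration of the layout, the sketch below splits rows like these into class labels and predictor vectors. It is written in Python rather than the article's C#. The function name `parse_rows` is hypothetical, and the sketch assumes the format shown above: comma-delimited, with the species label in the first column.

```python
def parse_rows(lines):
    """Split comma-delimited rows into (labels, features).
    Assumes the class label is the first field and the remaining
    fields are numeric predictors, as in the rows shown above."""
    labels, features = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("."):
            continue  # skip blank lines and the ". . ." elision marker
        fields = [f.strip() for f in line.split(",")]
        labels.append(int(fields[0]))
        features.append([float(f) for f in fields[1:]])
    return labels, features

raw = [
    "2, 50.0, 16.3, 230.0, 5700.0",
    "0, 39.1, 18.7, 181.0, 3750.0",
    "1, 38.8, 17.2, 180.0, 3800.0",
    "2, 39.3, 20.6, 190.0, 3650.0",
    "0, 39.2, 19.6, 195.0, 4675.0",
    ". . .",
]

labels, features = parse_rows(raw)
print(labels)        # [2, 0, 1, 2, 0]
print(features[0])   # [50.0, 16.3, 230.0, 5700.0]
```

Note that the columns are on very different scales: body mass is in the thousands while bill width is under 25. With an unscaled Euclidean distance, the large-magnitude column would dominate, so scaling the predictors before computing centroids is a common step. This post doesn't spell out the article's preprocessing, so treat that as a general observation rather than a description of the demo.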

The trained model scores 0.9333 accuracy (28 out of 30 correct) on the training data and 1.0000 accuracy (10 out of 10) on the test data. Such high accuracy is rare for a nearest centroid classification model. The high accuracy indicates that the species of a penguin is almost completely determined by one or two of the predictor variables. As it turns out, class 0 = Adelie is unambiguously identified by a low value for bill length and a high value for bill width. Class 1 = Chinstrap is identified by a high value for bill length and a low value for body mass. Class 2 = Gentoo is identified by a high value for flipper length and a high value for body mass.
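
The accuracy figures are simply the fraction of correct predictions. The Python sketch below reproduces the arithmetic. The helper name `accuracy` and the placeholder label lists are made up for the illustration (they are not the demo's actual predictions); only the counts, 28 of 30 and 10 of 10, come from the results above.

```python
def accuracy(predicted, actual):
    """Fraction of items where the predicted label equals the actual label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Placeholder labels arranged so that 28 of 30 predictions match.
train_actual = [0] * 30
train_predicted = [0] * 28 + [1] * 2

# Placeholder labels arranged so that 10 of 10 predictions match.
test_actual = [2] * 10
test_predicted = [2] * 10

print(f"{accuracy(train_predicted, train_actual):.4f}")  # 0.9333
print(f"{accuracy(test_predicted, test_actual):.4f}")    # 1.0000
```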

Nearest centroid classification is a good way to establish a baseline prediction model result. In most scenarios, models created by more powerful techniques, such as neural network classifiers and decision tree classifiers, should have better predictive accuracy. That said, there are some situations, such as predicting species in the Penguin Dataset, where nearest centroid classification is surprisingly powerful.



Left: I like the original “Godzilla” movie (Japanese version 1954, American version 1956) a lot. I classify it with a solid A grade — one of the best and most influential science fiction movies of all time. However, after the first Godzilla movie, I’m not really a fan of any of the other 37 and counting movies.

Center: “Invasion of Astro-Monster” (1965) is bonkers. Xiliens from Planet X come to Earth and ask to borrow Godzilla and Rodan to use to defeat the three-headed King Ghidorah monster who is attacking their planet. But the Xiliens actually plan to enslave Earth. I classify this as a C grade compared to other Godzilla movies.

Right: I watched “Godzilla Minus One” (2023) recently. It received good reviews from critics and audiences, but I wasn’t too impressed. At least this entry in the series had something of a story, unlike most of the other Godzilla movies that are essentially non-stop scenes of fighting monsters. I classify this movie with a B- grade overall.

