“Clustering Mixed Categorical and Numeric Data Using k-Means with C#” in Visual Studio Magazine

I wrote an article titled “Clustering Mixed Categorical and Numeric Data Using k-Means with C#” in the May 2024 edition of Microsoft Visual Studio Magazine. See https://visualstudiomagazine.com/Articles/2024/05/15/clustering-mixed-categorical-and-numeric-data.aspx.

Data clustering is the process of grouping data items so that similar items are in the same group (cluster). For strictly numeric data, the k-means clustering technique is relatively simple and effective, and it is the most commonly used technique. For strictly non-numeric (categorical) data, there are somewhat more complicated techniques that use entropy, Bayesian probability, or category utility. But clustering mixed categorical and numeric data is very tricky.

The article presents a technique for clustering mixed categorical and numeric data using standard k-means clustering implemented using the C# language. Briefly, the source mixed data is preprocessed so that all the numeric and categorical values are between 0.0 and 1.0 and therefore k-means clustering can be applied without any modification.

I present a complete demo program to explain the idea. The synthetic demo source data looks like:

F  short   24  arkansas  29500  liberal
M  tall    39  delaware  51200  moderate
F  short   63  colorado  75800  conservative
M  medium  36  illinois  44500  moderate
F  short   27  colorado  28600  liberal
. . .

The raw data is normalized and encoded and looks like:

0.5, 0.25, 0.12, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333
0.0, 0.75, 0.42, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.90, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000
0.0, 0.50, 0.36, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000
0.5, 0.25, 0.18, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333
. . .

The sex column of binary categorical values is encoded as M = 0.0 and F = 0.5 (zero / zero-point-five encoding).

The height ordinal categorical column is encoded using equal-interval encoding as short = 0.25, medium = 0.50, tall = 0.75.
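A minimal sketch of equal-interval encoding, written in Python rather than the article's C#. The k / (n + 1) rule and the function name are assumptions; the rule is used here only because it reproduces the three height values shown in the encoded data.

```python
# Equal-interval encoding for an ordinal column (illustrative sketch).
# Assumption: the k-th of n ordered levels maps to k / (n + 1),
# which gives 0.25, 0.50, 0.75 for three levels.

def equal_interval_encode(ordered_levels):
    """Return a dict mapping each level (lowest first) to k / (n + 1)."""
    n = len(ordered_levels)
    return {level: (k + 1) / (n + 1) for k, level in enumerate(ordered_levels)}

height_codes = equal_interval_encode(["short", "medium", "tall"])
print(height_codes)
```

Because the levels are passed in order, the encoded values keep the ordering short < medium < tall, so short is farther from tall than from medium.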

The numeric age column is min-max normalized.

The state categorical column is encoded using one-over-n-hot encoding as Arkansas = (0.25 0 0 0), Colorado = (0 0.25 0 0), Delaware = (0 0 0.25 0), Illinois = (0 0 0 0.25).
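One-over-n-hot encoding can be sketched in a few lines. This is a Python illustration, not the article's C# code, and the function name is hypothetical. The category order is taken from the listing above.

```python
# One-over-n-hot encoding for a non-ordinal categorical column (sketch).
# Like one-hot encoding, but the single non-zero slot holds 1/n instead of 1.

def one_over_n_hot(value, categories):
    """Return a list of len(categories) with 1/n at the slot for value."""
    n = len(categories)
    vec = [0.0] * n
    vec[categories.index(value)] = 1.0 / n
    return vec

states = ["arkansas", "colorado", "delaware", "illinois"]
print(one_over_n_hot("colorado", states))  # [0.0, 0.25, 0.0, 0.0]
```

The same function covers the political leaning column: with three categories, the non-zero slot holds 1/3, which is shown as 0.3333 in the encoded data.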

The numeric income column is min-max normalized.
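Min-max normalization for the two numeric columns can be sketched as below, in Python rather than the article's C#. The column minimums and maximums (age 18 and 68, income 20,300 and 81,800) are assumptions: they are not stated in this post, and were chosen because they reproduce the encoded values shown for the first five rows.

```python
# Min-max normalization: x' = (x - min) / (max - min), giving values in [0, 1].

def min_max_normalize(x, lo, hi):
    """Scale x into [0, 1] given the column minimum lo and maximum hi."""
    return (x - lo) / (hi - lo)

# Assumed column extremes (inferred from the displayed encoded data).
AGE_MIN, AGE_MAX = 18, 68
INCOME_MIN, INCOME_MAX = 20300, 81800

ages = [24, 39, 63, 36, 27]
incomes = [29500, 51200, 75800, 44500, 28600]

ages_norm = [round(min_max_normalize(a, AGE_MIN, AGE_MAX), 2) for a in ages]
incomes_norm = [round(min_max_normalize(v, INCOME_MIN, INCOME_MAX), 4)
                for v in incomes]

print(ages_norm)
print(incomes_norm)
```

In practice the minimum and maximum would be computed from the full column rather than hard-coded.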

The political leaning categorical column is encoded using one-over-n-hot encoding as conservative = (0.3333 0 0), moderate = (0 0.3333 0), liberal = (0 0 0.3333).

Because all encoded and normalized values are numeric between 0 and 1, standard k-means clustering can be applied.
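To illustrate, here is a bare-bones k-means sketch in Python applied to the five encoded rows shown earlier. It is not the article's C# demo. Setting k = 2 and using the first k rows as the initial means are simplifying assumptions made to keep the example short and deterministic; real implementations typically use random or k-means++ initialization.

```python
# Minimal k-means on already-encoded data (illustrative sketch only).

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(data, k, max_iter=100):
    """Return (labels, means). Initial means are the first k rows (assumption)."""
    means = [list(row) for row in data[:k]]
    labels = [-1] * len(data)
    for _ in range(max_iter):
        # Assign each item to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, means[c]))
            for row in data
        ]
        if new_labels == labels:  # no assignment changed, so stop
            break
        labels = new_labels
        # Recompute each mean from its current members.
        for c in range(k):
            members = [row for row, lab in zip(data, labels) if lab == c]
            if members:  # keep the old mean if a cluster is empty
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means

encoded = [
    [0.5, 0.25, 0.12, 0.25, 0.00, 0.00, 0.00, 0.1496, 0.0000, 0.0000, 0.3333],
    [0.0, 0.75, 0.42, 0.00, 0.00, 0.25, 0.00, 0.5024, 0.0000, 0.3333, 0.0000],
    [0.5, 0.25, 0.90, 0.00, 0.25, 0.00, 0.00, 0.9024, 0.3333, 0.0000, 0.0000],
    [0.0, 0.50, 0.36, 0.00, 0.00, 0.00, 0.25, 0.3935, 0.0000, 0.3333, 0.0000],
    [0.5, 0.25, 0.18, 0.00, 0.25, 0.00, 0.00, 0.1350, 0.0000, 0.0000, 0.3333],
]

labels, means = k_means(encoded, k=2)
print(labels)
```

With these five rows, the two young, short, liberal items (the first and fifth) end up in one cluster and the other three items in the other.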

An alternative for clustering mixed categorical and numeric data is to use an older technique called k-prototypes clustering. One disadvantage of k-prototypes clustering is that, unlike k-means clustering, it’s not widely implemented in machine learning code libraries. Another disadvantage is that it’s relatively crude because its dissimilarity function simply counts the number of mismatched ordinal-encoded categorical values between two items. In a set of limited experiments, the clustering technique presented in the article matched or beat k-prototypes clustering results on all datasets.
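The mismatch-counting idea can be sketched as follows. This is a Python illustration of the usual k-prototypes dissimilarity (squared Euclidean distance on the numeric values plus a weight, gamma, times the number of categorical mismatches), not code from the article. The function name, the gamma default of 1.0, and the use of raw strings in place of ordinal-encoded integers are assumptions.

```python
# k-prototypes style dissimilarity between two mixed items (sketch).

def k_prototypes_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Squared numeric distance plus gamma times the count of mismatches."""
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    mismatches = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric_part + gamma * mismatches

# Three hypothetical items that differ only in height.
numeric = [0.12, 0.1496]
short_item = ["F", "short", "arkansas", "liberal"]
medium_item = ["F", "medium", "arkansas", "liberal"]
tall_item = ["F", "tall", "arkansas", "liberal"]

d_short_medium = k_prototypes_dissimilarity(numeric, short_item, numeric, medium_item)
d_short_tall = k_prototypes_dissimilarity(numeric, short_item, numeric, tall_item)
print(d_short_medium, d_short_tall)  # 1.0 1.0
```

Both pairs get the same dissimilarity because a mismatch is a mismatch, regardless of how far apart the ordinal levels are. With equal-interval encoding, short and tall differ by 0.50 while short and medium differ by only 0.25.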

I’m a fan of early science fiction movies. An odd clustering of such films contains mixed Italian and American casts. Typically the films were shot in Italy and had a mostly Italian cast, but with one or two American actors to appeal to the U.S. market. I watched many of these movies when I was young and they were fun to look at but they, without any exception, gave me a headache because the plots never, ever, made any sense.

Left: “Space-Men” aka “Assignment: Outer Space” (1960) tells the story of a space crew that must stop a runaway spaceship that threatens to incinerate the Earth. Featured American Rik Van Nutter (yes, that was his real name), who is best known for playing Felix Leiter in the 1965 Bond film “Thunderball”.

Center: “Planet of the Vampires” aka “Terror in Space” (1965). Two crews crash land on a forbidding planet and several crew members are killed. The disembodied aliens who live on the planet possess the bodies of the dead crew members and start hunting down the members who survived the crash. This movie was clearly a strong influence on “Alien” (1979). The movie featured American Barry Sullivan who had a very long and successful acting career.

Right: “War of the Planets” aka “I Diafanoidi Vengono da Marte” / “The Diaphanoids Come From Mars” (1966) is the second of four films made in 1966 and 1967 that featured mostly the same actors, same sets, same costumes, same props, and similar titles, making them difficult to distinguish. The four films are “Wild, Wild Planet” (1966), “War of the Planets” (1966), “War Between the Planets” (1966), and “Snow Devils” (1967). In “War of the Planets”, a space crew battles the Diaphanoids, who are energy beings. Featured Italian American actor Tony Russel, who was born in Italy but worked mostly in the U.S.