I’m investigating the possibility of writing code for k-means clustering using the CNTK library. CNTK was designed for creating deep neural networks, but it also has low-level functions that, in principle, will allow me to write code for clustering.
The idea is that a k-means clustering system written with CNTK could take advantage of features such as GPU processing and the ability to handle large datasets that won’t fit entirely into memory.
Rather than diving into CNTK directly, my strategy is to first write k-means clustering code using plain Python. Once that code is up and running, I can refactor it to use CNTK. (CNTK is written in C++, but you call it from Python.)
So, I spent most of a recent lunch break coding up a plain Python clustering demo. I’ve written code like this many times, but even so, it took my entire lunch break to get the k-means code running correctly.
My demo has 20 data items. Each item represents a person’s height in inches, weight in pounds, high school GPA, and annual income. Even though I created the dummy data so that there’s a clear grouping into three clusters, the grouping isn’t obvious from the raw data. After clustering, you can easily see there are three distinct groups of people.
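To give a rough idea of what a plain Python k-means demo along these lines can look like, here is a minimal sketch. It is not the actual demo code. The data generator, the min-max normalization step, the choice of random data items as initial means, and all the function names (make_dummy_data, normalize, sq_dist, kmeans) are assumptions made up for illustration, and the 20 items are synthetic values, not the data from my demo.

```python
import random

def make_dummy_data(seed=0):
    # Hypothetical data, invented for this sketch: 20 people, each with
    # (height in inches, weight in pounds, GPA, annual income in dollars),
    # drawn around three made-up group centers.
    rnd = random.Random(seed)
    centers = [(62.0, 120.0, 3.6, 40000.0),
               (68.0, 160.0, 3.0, 60000.0),
               (74.0, 210.0, 2.4, 80000.0)]
    spreads = (1.0, 5.0, 0.1, 2000.0)
    data = []
    for i in range(20):
        c = centers[i % 3]
        data.append([rnd.gauss(m, s) for m, s in zip(c, spreads)])
    rnd.shuffle(data)  # hide the grouping in the raw ordering
    return data

def normalize(data):
    # Assumed preprocessing step: min-max scale each column to [0, 1]
    # so that income does not dominate the distance calculation.
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) for v, l, h in zip(row, lo, hi)]
            for row in data]

def sq_dist(a, b):
    # Squared Euclidean distance between two items.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, seed=0, max_iter=100):
    # Standard iterate-until-stable k-means (Lloyd's algorithm).
    # Initial means are k randomly chosen data items (an assumption).
    rnd = random.Random(seed)
    means = [list(p) for p in rnd.sample(data, k)]
    assign = [-1] * len(data)
    for _ in range(max_iter):
        # Assignment step: each item goes to its closest mean.
        new_assign = [min(range(k), key=lambda j: sq_dist(p, means[j]))
                      for p in data]
        if new_assign == assign:
            break  # no item changed cluster, so we are done
        assign = new_assign
        # Update step: each mean becomes the average of its members.
        for j in range(k):
            members = [p for p, a in zip(data, assign) if a == j]
            if members:  # keep the old mean if a cluster goes empty
                means[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, means

data = normalize(make_dummy_data())
assign, means = kmeans(data, k=3)
for j in range(3):
    print("cluster", j, "size", assign.count(j))
```

Keep in mind that k-means with random initial means can settle on a poor clustering, so a more careful version would typically run several times with different seeds and keep the result with the smallest total within-cluster distance.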
OK, Python version of k-means clustering — check. Next step is to start refactoring the code into CNTK. This will take some time — many lunch breaks.