I was reading a research paper this morning, and the paper used the DBSCAN (“density-based spatial clustering of applications with noise”) clustering algorithm. DBSCAN is somewhat similar to k-means clustering; both work only with strictly numeric data.
In k-means you must specify the number of clusters. DBSCAN doesn’t require you to specify the number of clusters, but for DBSCAN you must specify an epsilon value (how close is “close”) and the minimum number of neighboring points needed for a point to count as a core point. These two DBSCAN parameters implicitly determine the number of clusters.
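To see how epsilon and the minimum-points parameter implicitly control the cluster count, here’s a small sketch. The one-dimensional data values are made up for illustration and are not the data from the demo:

```python
# illustrate how eps and min_samples control the cluster count
# (tiny made-up 1D data, not the demo data)
import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# with a small epsilon, the two tight groups form separate clusters
tight = DBSCAN(eps=0.5, min_samples=2).fit(pts)
print(tight.labels_)  # two clusters: [0 0 0 1 1 1]

# with a very large epsilon, everything merges into one cluster
loose = DBSCAN(eps=10.0, min_samples=2).fit(pts)
print(loose.labels_)  # one cluster: [0 0 0 0 0 0]
```

Notice that the number of clusters was never specified directly; it fell out of the two parameter values.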
I hadn’t used DBSCAN in a long time, so I coded up a quick demo to refresh my memory. Implementing DBSCAN from scratch isn’t too difficult (I’ve done so using the C# language), but the scikit-learn Python library has a built-in implementation that’s simple and easy to use. So my demo was based on the scikit documentation example.
I set up 20 items of dummy data. I used 2D data so that I could graph the results. As with most clustering algorithms, the source data must be normalized so that large-magnitude items don’t swamp small-magnitude items.
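The post doesn’t show the normalization step. One common approach is z-score normalization, which scikit provides via StandardScaler. This is just a sketch; the raw values here are made up:

```python
# z-score normalization sketch with made-up raw data;
# StandardScaler gives each column mean 0 and std dev 1
import numpy as np
from sklearn.preprocessing import StandardScaler

raw = np.array([[25.0, 58000.0],
                [47.0, 92000.0],
                [33.0, 61000.0],
                [51.0, 75000.0]])

scaled = StandardScaler().fit_transform(raw)
print(scaled.mean(axis=0))  # approximately [0, 0]
print(scaled.std(axis=0))   # approximately [1, 1]
```

After scaling, the age column and the income column contribute comparably to the Euclidean distances that DBSCAN computes.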
The clustering function assigns a cluster ID label to each data item: 0, 1, 2, etc. Items that don’t get assigned to a cluster get a label of -1 to indicate they are “noise”.
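A quick sketch of the noise label in action, using tiny made-up 1D data with one deliberate outlier:

```python
# sketch: an isolated point gets the noise label -1
# (tiny made-up 1D data)
import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([[0.0], [0.1], [0.2], [10.0]])  # 10.0 is an outlier
labels = DBSCAN(eps=0.5, min_samples=2).fit(pts).labels_
print(labels)                   # [ 0  0  0 -1]

n_noise = np.sum(labels == -1)  # count of noise items
print(n_noise)                  # 1
```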
I used the documentation code to create a graph of the clustering results. The five red dots are cluster 0, the six green dots are cluster 1, and the four blue dots are cluster 2. Noise items are colored black.
Data items that belong to clusters are either “core points” or non-core points (sometimes called border points). In the graph, the large dots are core points and the smaller dots are non-core points.
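The distinction can be pulled out of the fitted object via the core_sample_indices_ attribute. A sketch with made-up 1D data, where only the middle point has enough neighbors to be a core point:

```python
# sketch: distinguishing core points from border (non-core) points
# (tiny made-up 1D data)
import numpy as np
from sklearn.cluster import DBSCAN

pts = np.array([[0.0], [0.4], [0.8]])
clustering = DBSCAN(eps=0.5, min_samples=3).fit(pts)

print(clustering.labels_)               # all three points land in cluster 0
print(clustering.core_sample_indices_)  # only the middle point is core

core_mask = np.zeros(len(pts), dtype=bool)
core_mask[clustering.core_sample_indices_] = True
print(core_mask)  # the endpoints are border points, not core
```

The endpoints each have only two neighbors (counting themselves) within epsilon, so they fall short of min_samples=3; they join the cluster only because they are within epsilon of the core middle point.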
# dbscan_cluster.py

import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

print("\nBegin clustering demo with DBSCAN ")

X = np.array([
  [ 0.325, -0.595], [ 0.507,  1.619], [ 0.817,  1.895],
  [ 1.147,  0.764], [-1.285, -0.95 ], [-1.237, -0.532],
  [ 1.108,  1.248], [-0.847, -0.722], [ 0.124, -1.346],
  [ 0.910, -0.227], [ 0.310, -0.756], [-1.384, -0.715],
  [ 0.736,  1.15 ], [ 0.511, -0.517], [-1.081, -0.91 ],
  [ 0.416,  1.252], [-2.349, -0.42 ], [-0.559, -1.161],
  [ 0.806,  1.054], [ 1.023, -0.133]])

print("\nX data: ")
print(X)

print("\nPerforming clustering ")
clustering = DBSCAN(eps=0.5, min_samples=4).fit(X)
print("Done ")

print("\nComputed cluster labels: ")
print(clustering.labels_)

print("\nIndices of core points: ")
print(clustering.core_sample_indices_)

n_clusters = np.max(clustering.labels_) + 1
counts = np.zeros(n_clusters+1, dtype=np.int64)  # last slot holds noise count
for i in range(len(X)):
  lbl = clustering.labels_[i]
  if lbl == -1:
    counts[n_clusters] += 1
  else:
    counts[lbl] += 1
print("\nCluster counts, noise count: ")
print(counts)

print("\nDisplaying clustering: ")
core_samples_mask = np.zeros_like(clustering.labels_, \
  dtype=bool)
core_samples_mask[clustering.core_sample_indices_] = True
unique_labels = set(clustering.labels_)
colors = [plt.cm.hsv(each) \
  for each in np.linspace(0, 1, \
  len(unique_labels))]
for k, col in zip(unique_labels, colors):
  if k == -1:
    col = [0, 0, 0, 1]  # noise = black
  class_member_mask = (clustering.labels_ == k)

  xy = X[class_member_mask & core_samples_mask]  # large dots
  plt.plot(xy[:, 0], xy[:, 1], 'o',
    markerfacecolor=tuple(col),
    markeredgecolor='k', markersize=14)

  xy = X[class_member_mask & ~core_samples_mask]  # small dots
  plt.plot(xy[:, 0], xy[:, 1], 'o',
    markerfacecolor=tuple(col),
    markeredgecolor='k', markersize=6)
plt.show()

print("\nEnd demo ")