Example of Spectral Clustering Using the scikit Library

Ah, where to begin. Bottom line: spectral clustering is a machine learning technique that is great in theory but just isn’t practical or useful in most real-world scenarios.

The idea of clustering is to group data points together so that similar data points are in the same group. By far the most common algorithm for clustering is the k-means (also called Lloyd’s) algorithm. It’s simple and effective. Note: k-means++ is just k-means with clever initialization.
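For context, here is a minimal k-means sketch using the scikit KMeans class. The six-point data matrix is made up purely for illustration, and init='k-means++' is just the library default made explicit:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.5, 0.5],
              [0.8, 0.9], [0.9, 0.8], [0.9, 0.9]])
# k-means++ is plain k-means with clever initialization
# of the starting centroids
km = KMeans(n_clusters=2, init='k-means++', n_init=10,
  random_state=0).fit(X)
print(km.labels_)           # cluster ID for each data point
print(km.cluster_centers_)  # the two centroids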

But researchers do research, and the spectral clustering technique was introduced in 1995 (although the ideas involved had been around since the 1970s). Spectral clustering is intended to cluster data that has unusual geometry. The standard example is data that forms two concentric circles when graphed.
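For reference, scikit has a built-in make_circles() function in the sklearn.datasets module that generates exactly this kind of data (the demo below uses a hand-rolled equivalent). The parameter values here are just illustrative:

from sklearn.datasets import make_circles

# 100 points on two concentric circles; the inner circle has
# radius = factor * outer radius, with Gaussian noise added
X, y = make_circles(n_samples=100, factor=0.4,
  noise=0.06, random_state=0)
# y is 0 for outer-circle points, 1 for inner-circle points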

Briefly, spectral clustering starts by creating a graph from the source data, typically by using the k-nearest neighbors algorithm. The matrix that defines the graph is then analyzed using eigenvalue decomposition, and the results of that decomposition are clustered using the k-means algorithm. Note: There are dozens of variations of spectral clustering.
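To make the pipeline concrete, here is a from-scratch sketch of one common variant: a symmetrized k-NN connectivity graph, the normalized graph Laplacian, and k-means applied to the bottom eigenvectors. The spectral_sketch() name and parameter values are mine, and real implementations differ in many details:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import KMeans

def spectral_sketch(X, k=2, n_neighbors=5):
  # 1. build a k-nearest-neighbors graph and symmetrize it
  A = kneighbors_graph(X, n_neighbors=n_neighbors).toarray()
  A = 0.5 * (A + A.T)
  # 2. normalized Laplacian L = I - D^(-1/2) A D^(-1/2)
  d = A.sum(axis=1)
  D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
  L = np.eye(len(X)) - D_inv_sqrt @ A @ D_inv_sqrt
  # 3. eigenvectors for the k smallest eigenvalues
  _, vecs = np.linalg.eigh(L)  # eigenvalues in ascending order
  U = vecs[:, 0:k]
  # 4. cluster the rows of U using ordinary k-means
  return KMeans(n_clusters=k, n_init=10,
    random_state=0).fit(U).labels_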

I put together a demo of spectral clustering using the scikit library. The main problem from a practical point of view is that spectral clustering has too many parameters:

SpectralClustering(n_clusters=8, *,
  eigen_solver=None,
  n_components=None,
  random_state=None,
  n_init=10,
  gamma=1.0,
  affinity='rbf',
  n_neighbors=10,
  eigen_tol='auto',
  assign_labels='kmeans',
  degree=3,
  coef0=1,
  kernel_params=None,
  n_jobs=None,
  verbose=False)

The resulting clustering is highly sensitive to the parameters used. In a graph of the demo data, the data sort of looks like two concentric circles, and an artist would cluster the inner data points together and the outer data points together. But is that science or is it art?
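One way to see the sensitivity is to vary a single parameter and compare the labelings. This sketch varies gamma for the RBF affinity; the gamma values are arbitrary, and the data comes from the make_circles() generator mentioned earlier:

from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

X, _ = make_circles(n_samples=100, factor=0.4,
  noise=0.06, random_state=0)
for g in [0.1, 1.0, 10.0, 100.0]:
  labels = SpectralClustering(n_clusters=2, affinity='rbf',
    gamma=g, random_state=0).fit(X).labels_
  # different gamma values can give very different clusterings
  print("gamma =", g, " first 10 labels =", labels[0:10])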

If you think about how spectral clustering works, when the k-nearest neighbors algorithm is used to create the data graph, that step is where the clustering is actually happening.
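A quick sketch that illustrates the point: build just the k-NN connectivity graph and count its connected components, with no eigen-analysis at all. For well-separated circles, the components frequently match the two rings already (the parameter values here are illustrative):

from sklearn.datasets import make_circles
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

X, _ = make_circles(n_samples=100, factor=0.4,
  noise=0.06, random_state=0)
A = kneighbors_graph(X, n_neighbors=4)  # sparse connectivity
n_comp, comp_labels = connected_components(A, directed=False)
print("num connected components = %d" % n_comp)
print(comp_labels)  # often already separates the two circles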

I have followed clustering research for many years. In my opinion, spectral clustering is somewhat of an example of a research solution in search of a practical problem, at least for real-world data science scenarios. That statement is a bit of an exaggeration, but in all the practical engineering situations I’ve been in, k-means works better than spectral clustering when you take all factors into account.



Three travel posters by Danish illustrator Mads Berg. The similar style makes them easy to cluster together from an art point of view.


Demo code.

# spectral_cluster_scikit.py

# Anaconda3-2020.02  Python 3.7.6
# scikit 0.22.1
# Windows 10/11 

import numpy as np
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt

# ---------------------------------------------------------

def my_make_circles(n_samples=100, factor=0.8,
  noise=None, seed=1):

  rnd = np.random.RandomState(seed)
  n_samples_out = n_samples // 2
  n_samples_in = n_samples - n_samples_out

  lin_out = np.linspace(0, 2 * np.pi, n_samples_out,
    endpoint=False)
  lin_in = np.linspace(0, 2 * np.pi, n_samples_in,
    endpoint=False)
  outer_circ_x = np.cos(lin_out)
  outer_circ_y = np.sin(lin_out)
  inner_circ_x = np.cos(lin_in) * factor
  inner_circ_y = np.sin(lin_in) * factor

  X = np.vstack(
    [np.append(outer_circ_x, inner_circ_x),
     np.append(outer_circ_y, inner_circ_y)]).T
  y = np.hstack(
    [np.zeros(n_samples_out, dtype=np.int64),
     np.ones(n_samples_in, dtype=np.int64)])

  # add noise
  if noise is not None:
    X += rnd.normal(loc=0.0, scale=noise, size=X.shape)
  
  return X, y

# ---------------------------------------------------------

def main():
  print("\nBegin spectral clustering demo ")

  data, labels = my_make_circles(n_samples=20, 
    factor=0.40, noise=0.06, seed=0)

  print("\ndata = ")
  print(data)
  print("\nlabels = ")
  print(labels)

  plt.scatter(data[:,0], data[:,1])
  plt.show()

  # from sklearn.cluster import KMeans
  # print("\nClustering using basic k-means ")
  # clustering = KMeans(n_clusters=2,
  #   random_state=0).fit(data)
  # print("Result clustering: ")
  # print(clustering.labels_)

  print("\nClustering using spectral k-NN(5) ")
  clustering = SpectralClustering(n_clusters=2,
    affinity='nearest_neighbors',
    n_neighbors=4,
    assign_labels='kmeans',
    random_state=0).fit(data)
  print("Result clustering: ")
  print(clustering.labels_)

  print("\nClustering using spectral RBF ")
  clustering = SpectralClustering(n_clusters=2,
    affinity='rbf',
    assign_labels='kmeans',
    random_state=0).fit(data)
  print("Result clustering: ")
  print(clustering.labels_)
  
  print("\nEnd demo ")

if __name__ == "__main__":
  main()