I was looking at spectral clustering with the scikit-learn library. Standard k-means clustering doesn’t work well for data that has weird geometry. A standard example is data that when graphed looks like two concentric circles. Spectral clustering connects data into a virtual graph which allows it to deal with weird geometry.

I wanted to run some experiments with spectral clustering and so I needed some data to work with. The scikit library has a make_circles() function that is intended for clustering experiments. I don’t like dependencies, so I figured I’d go to the scikit source code and fetch the code for the make_circles() function. How complicated could it be?

Agh! I quickly discovered that the scikit make_circles() function was very complicated. It had dependencies which had dependencies which had yet more dependencies. There were at least 400 lines of code.

I was engineering-annoyed and decided to pull just the essential code and implement a simplified version. All I needed was approximately 15 lines of code:

    import numpy as np

    def my_make_circles(n_samples=100, factor=0.8, noise=None, seed=1):
        rnd = np.random.RandomState(seed)
        n_samples_out = n_samples // 2
        n_samples_in = n_samples - n_samples_out

        # evenly spaced angles around each circle
        lin_out = np.linspace(0, 2 * np.pi, n_samples_out, endpoint=False)
        lin_in = np.linspace(0, 2 * np.pi, n_samples_in, endpoint=False)
        outer_circ_x = np.cos(lin_out)
        outer_circ_y = np.sin(lin_out)
        inner_circ_x = np.cos(lin_in) * factor
        inner_circ_y = np.sin(lin_in) * factor

        # outer-circle rows first, then inner-circle rows
        X = np.vstack(
            [np.append(outer_circ_x, inner_circ_x),
             np.append(outer_circ_y, inner_circ_y)]).T
        y = np.hstack(
            [np.zeros(n_samples_out, dtype=np.int64),
             np.ones(n_samples_in, dtype=np.int64)])

        # add noise
        if noise is not None:
            X += rnd.normal(loc=0.0, scale=noise, size=X.shape)

        return X, y

I wrote a demo. The key calling statement is:

    data, labels = my_make_circles(n_samples=20, factor=0.5,
                                   noise=0.08, seed=0)

The return value is a tuple of two NumPy arrays, which I call “data” and “labels”. The “data” array has 20 rows and two columns: an x coordinate and a y coordinate. The first 10 rows are the outer circle and the last 10 rows are the inner circle. The “labels” array has 10 zeros (outer circle) followed by 10 ones (inner circle).
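A quick way to confirm that layout is to check the shapes and labels directly. This is a minimal sketch that repeats the simplified function in condensed form so the snippet runs on its own; the shortened local variable names (n_out, t_out and so on) are mine, not from the scikit source.

```python
import numpy as np

def my_make_circles(n_samples=100, factor=0.8, noise=None, seed=1):
    # condensed copy of the simplified function from this post
    rnd = np.random.RandomState(seed)
    n_out = n_samples // 2
    n_in = n_samples - n_out
    t_out = np.linspace(0, 2 * np.pi, n_out, endpoint=False)
    t_in = np.linspace(0, 2 * np.pi, n_in, endpoint=False)
    X = np.vstack([
        np.append(np.cos(t_out), np.cos(t_in) * factor),
        np.append(np.sin(t_out), np.sin(t_in) * factor)]).T
    y = np.hstack([np.zeros(n_out, dtype=np.int64),
                   np.ones(n_in, dtype=np.int64)])
    if noise is not None:
        X += rnd.normal(loc=0.0, scale=noise, size=X.shape)
    return X, y

data, labels = my_make_circles(n_samples=20, factor=0.5,
                               noise=0.08, seed=0)
print(data.shape)    # (20, 2)
print(labels.shape)  # (20,)
print(labels)        # [0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1]

# select each circle by label instead of by hard-coded row ranges
outer = data[labels == 0]
inner = data[labels == 1]
print(outer.shape, inner.shape)  # (10, 2) (10, 2)
```

Selecting rows by label rather than by slice indices keeps working if n_samples is odd, in which case the inner circle gets the one extra point.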

The factor parameter should be between 0.0 and 1.0 and is the radius of the inner circle relative to the outer circle, which has radius 1.0. Smaller values make a smaller inner circle. The noise parameter is the standard deviation of the zero-mean Normal (Gaussian) noise that is added to every x and y coordinate. A value of None (the default) or 0.0 gives perfect circles. Larger values give a more ragged circular shape.
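The effect of the two parameters can be seen by measuring how far each point is from the origin. The sketch below is standalone rather than a call to my_make_circles(); the ring() helper is a hypothetical name I made up for illustration, and it builds a single circle the same way the function does.

```python
import numpy as np

rnd = np.random.RandomState(0)
theta = np.linspace(0, 2 * np.pi, 1000, endpoint=False)

def ring(factor, noise):
    # hypothetical helper: one circle of radius `factor`,
    # optionally perturbed by zero-mean Gaussian noise
    pts = np.vstack([np.cos(theta), np.sin(theta)]).T * factor
    if noise is not None:
        pts = pts + rnd.normal(loc=0.0, scale=noise, size=pts.shape)
    return np.sqrt((pts ** 2).sum(axis=1))  # distance from origin

r_clean = ring(factor=0.5, noise=None)
r_noisy = ring(factor=0.5, noise=0.08)

print(np.allclose(r_clean, 0.5))      # True: every point is exactly on the circle
print(r_noisy.mean(), r_noisy.std())  # mean stays near 0.5, spread is roughly the noise value
```

Without noise every radius equals factor. With noise the points scatter around that radius, and the amount of scatter grows with the noise value.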

This example points out that library functions are often much larger than a from-scratch implementation of the same idea. Because a library function can be used in so many ways, it needs a lot of parameters to handle the different scenarios, a lot of error-checking code, and a lot of extra code to make it work with all the other functions in the library.

Good fun.

*Three interesting portraits with a circular theme. Left: By famous art nouveau illustrator Alphonse Mucha (1860-1939). Center: By contemporary artist Manuel Nunez. Right: By contemporary artist Karol Bak.*

Demo code:

    # spectral_cluster_scikit.py

    import numpy as np
    import matplotlib.pyplot as plt

    # ---------------------------------------------------------

    def my_make_circles(n_samples=100, factor=0.8, noise=None, seed=1):
        rnd = np.random.RandomState(seed)
        n_samples_out = n_samples // 2
        n_samples_in = n_samples - n_samples_out

        # evenly spaced angles around each circle
        lin_out = np.linspace(0, 2 * np.pi, n_samples_out, endpoint=False)
        lin_in = np.linspace(0, 2 * np.pi, n_samples_in, endpoint=False)
        outer_circ_x = np.cos(lin_out)
        outer_circ_y = np.sin(lin_out)
        inner_circ_x = np.cos(lin_in) * factor
        inner_circ_y = np.sin(lin_in) * factor

        # outer-circle rows first, then inner-circle rows
        X = np.vstack(
            [np.append(outer_circ_x, inner_circ_x),
             np.append(outer_circ_y, inner_circ_y)]).T
        y = np.hstack(
            [np.zeros(n_samples_out, dtype=np.int64),
             np.ones(n_samples_in, dtype=np.int64)])

        # add noise
        if noise is not None:
            X += rnd.normal(loc=0.0, scale=noise, size=X.shape)

        return X, y

    # ---------------------------------------------------------

    def main():
        print("\nBegin simplified make_circles() demo ")

        data, labels = my_make_circles(n_samples=20, factor=0.5,
                                       noise=0.08, seed=0)
        print("\ndata = ")
        print(data)
        print("\nlabels = ")
        print(labels)

        # first 10 rows are the outer circle, last 10 the inner circle
        outer = data[0:10, :]
        inner = data[10:20, :]
        plt.scatter(outer[:, 0], outer[:, 1])
        plt.scatter(inner[:, 0], inner[:, 1])
        plt.show()

        print("\nEnd ")

    if __name__ == "__main__":
        main()
