K-means clustering is an algorithm to quickly group a large quantity of data. It’s used in variety of ways, from statistical analysis to improving usability of user interfaces. If you read Google News, you’re probably familiar with the way they group similar news items together. When I first saw that, I thought there must be some serious language processing and semantics behind that – that they somehow extract meaning from the articles and group them together accordingly.
Turns out that it’s a little easier than that. Article abstracts are split into words, and for each article, a multidimensional vector consisting of these words is constructed. These vectors are then put into n-dimensional space, in which number of dimensions corresponds to the total number of different words in all articles analyzed. Then a clustering algorithm is run on the articles in this space. It yields groups of correlated articles based on their word vectors.
One clustering algorithm to do that is k-means. It works like that (I cite Manning’s excellent book “Algorithms of Intelligent Web” under fair use):
The k-means algorithm randomly picks k points that represent the initial centroids of the candidate clusters. Subsequently the distances between these centroids and each point of the set are calculated, and each point is assigned to the cluster with the minimum distance between the cluster centroid and the point. As a result of these assignments, the locations of the centroids for each cluster have now changed, so we reevaluate the new centroids until their locations stop changing. This particular algorithm for k-means is attributed to E.W. Forgy and to S.P. Lloyd, and has the following advantages:
- It works well with many metrics.
- It’s easy to derive versions of the algorithm that are executed in parallel—when
- the data are divided into, say, N sets and each separate data set is clustered, in
- parallel, on N different computational units.
- It’s insensitive with respect to data ordering.
At this point you may wonder, what happens if the algorithm doesn’t stop? Don’t worry! It’s guaranteed that the iterations will stop in a finite number of steps. In practice, the algorithm converges quickly (that’s the mathematical jargon).
Of course you need to use Apache Mahout, preferably in combination with Solr, to do that on text in production environment. But it’s also possible to experiment in Processing, albeit for this experiment I w0n’t use text, but points in space. You can see this in the Processing.js applet below. It fills 3D space with random points, adds cluster nodes and then optimizes their positions so that every point belongs to a cluster, coloring the points belonging to each cluster in same color. Resulting configuration is in fact a Voronoi diagram in 3D, which is related to Delaunay triangulation.
Click anywhere on screen to restart, controls to refresh/stop, or select numbers of clusters and points. I suggest downloading the original sketch here, it runs much faster.