Tagged: k-means

wiki_popes

Grouping entries in Slovenian Wikipedia by contributors

Share Button

Some time ago, I helped Miha Mazzini extract some data from Slovenian Wikipedia. For that, I needed to write a comprehensive parser, extracting not only titles and text, but also number of overall and per-contributor revisions, along with contributor usernames.

So, for each entry, I got a list of contributing accounts and number of edits that were performed by that account. I wondered: how are the areas of expertise distributed among all those contributors? Are some of them specialized mainly for science, some others for politics, and so on? And, perhaps more interestingly, where do these areas overlap? Do we have experts for sport, who also happen to curate political entries?

To find out, I extracted all entries with more than 25 edits, vectorized them by contributing accounts, and ran t-SNE to cluster them spatially and prepare the visualization. When t-SNE layout was complete, k-means clustering was run on the x,y coordinates only to be able to distinguish those areas by color. It must be said that these colored groups don’t always coincide with semantic grouping, so take it with a grain of salt. It’s there mainly to make the map look better and to improve legibility. Font size is in proportion with number of revisions that an entry had so far.

Here’s the entire map as one big 9000 x 6000 image. Click the image to display, then zoom into it with mouse. You can find cropped clusters and some commentary below.

Wikipedia entries by contributing authors
Wikipedia entries by contributing authors – entire map (click for enlargement)

Turns out the areas of expertise are pretty well delineated.

Let’s look at some clusters. Here we have some geographic Wikipedia entries, mainly countries and some historical persons. That sounds logical – editing an entry about a great ruler probably causes one to contribute to an entry about his or her country.

wiki_countries

Here are some famous Slovenian people, mostly writers, intermixed with some towns. In the lower left quadrant, there are alo some Slovenian politicians. It sounds funny that the late Communist ruler Josip Broz Tito is so close to Janez Janša, who is the current leader of right-wing opposition. It appears that there is a number of people who edit both entries. I wonder why. Here’s an article by Miles Mathis about editing of Wikipedia. I don’t know what to think of it, but I surely read a lot about autocratic rule of (English) Wikipedia editors to give it some credence. I don’t know about Slovenian version, and I don’t want to speculate, but this is as good an opportunity as any to start thinking about it.

wiki_history

Here is a cluster of lists. It seems that there exists an entire group of people who curate them, regardlessly of their content.

wiki_lists

Here’s a funny cluster dealing mostly with public transit in Slovenia. It almost seems that there are some bureaucrats in the government that edit these entries on taxpayer dime. I could probably find that out, if I traced the IPs in the edit logs. If someone hires me as an Internet detective, I might do that, but I made these pictures for fun.

wiki_lpp

This is an interesting cluster. It appears that many same people edit entries about Euroviviion Contest, parliamentary elections, World cup in basketball and World cups in skiing, along with two new parties in Slovenian parliament: ZAAB, which is the remainder of majority party in the last parliamentary term, and SMC, which is a new majority party. Both parties, along with Pozitivna Slovenija (former majority party)  were founded hastily right before elections, and won them by a big landslide. I wonder how would political analysts comment on their (speculatively) members’ love for sports contests. wiki_politicians

Here we have many religious personalities, mostly many popes.wiki_popes

A grouping of entries about Slovenian popular music.

wiki_popevke

Some entries about historical scientists and natural sciences.

wiki_science

More geographical entries, along with some entries about Slovenian highways.

wiki_traffic

Here’s the center of the map. It would follow logic that entries with many non-specialized contributors are drawn towards it. It’s generally more chaotic that the outskirts, but here are great many contemporary and historical art personalities grouped together..

wiki_writers

Another snapshot from the center, mostly consisting of entries about worker’s rights and things related to work.

wiki_workers

So here it is. For the technically minded – everything was done in JavaScript with Andrej Karpathy‘s tsnejs library, clusterfck for k-means, and d3 for drawing.

I also made an inverted map, on which the contributors were shown, grouped by the entries they made revisions to. It’s not so interesting for general public, but if someone wants to see it, it’s available by request.

K-means clustering with Processing.js

Share Button

K-means clustering is an algorithm to quickly group a large quantity of data. It’s used in variety of ways, from statistical analysis to improving usability of user interfaces. If you read Google News, you’re probably familiar with the way they group similar news items together. When I first saw that, I thought there must be some serious language processing and semantics behind that – that they somehow extract meaning from the articles and group them together accordingly.

Turns out that it’s a little easier than that. Article abstracts are split into words, and for each article, a multidimensional vector consisting of these words is constructed.  These vectors are then put into n-dimensional space, in which number of dimensions corresponds to the total number of different words in all articles analyzed. Then a clustering algorithm is run on the articles in this space. It yields groups of correlated articles based on their word vectors.

One clustering algorithm to do that is k-means. It works like that (I cite Manning’s excellent book “Algorithms of Intelligent Web” under fair use):

The k-means algorithm randomly picks k points that represent the initial centroids of the candidate clusters. Subsequently the distances between these centroids and each point of the set are calculated, and each point is assigned to the cluster with the minimum distance between the cluster centroid and the point. As a result of these assignments, the locations of the centroids for each cluster have now changed, so we reevaluate the new centroids until their locations stop changing. This particular algorithm for k-means is attributed to E.W. Forgy and to S.P. Lloyd, and has the following advantages:

  • It works well with many metrics.
  • It’s easy to derive versions of the algorithm that are executed in parallel—when
  • the data are divided into, say, N sets and each separate data set is clustered, in
  • parallel, on N different computational units.
  • It’s insensitive with respect to data ordering.

At this point you may wonder, what happens if the algorithm doesn’t stop? Don’t worry! It’s guaranteed that the iterations will stop in a finite number of steps. In practice, the algorithm converges quickly (that’s the mathematical jargon).

Of course you need to use Apache Mahout, preferably in combination with Solr, to do that on text in production environment. But it’s also possible to experiment in Processing, albeit for this experiment I w0n’t use text, but points in space. You can see this in the Processing.js applet below. It fills 3D space with random points, adds cluster nodes and then optimizes their positions so that every point belongs to a cluster, coloring the points belonging to each cluster in same color. Resulting configuration is in fact a Voronoi diagram in 3D, which is related to Delaunay triangulation.

Click anywhere on screen to restart, controls to refresh/stop, or select numbers of clusters and points. I suggest downloading the original sketch here, it runs much faster.

Clusters:
    Points:
      
   
width=”800″ height=”600″>Your browser does not support the canvas tag.