Some time ago, I helped Miha Mazzini extract some data from Slovenian Wikipedia. For that, I needed to write a comprehensive parser that extracted not only titles and text, but also the number of overall and per-contributor revisions, along with contributor usernames.
So, for each entry, I got a list of contributing accounts and the number of edits performed by each account. I wondered: how are the areas of expertise distributed among all those contributors? Do some of them specialize mainly in science, others in politics, and so on? And, perhaps more interestingly, where do these areas overlap? Do we have sports experts who also happen to curate political entries?
To find out, I extracted all entries with more than 25 edits, vectorized them by contributing accounts, and ran t-SNE to lay them out spatially and prepare the visualization. Once the t-SNE layout was complete, I ran k-means clustering on the x,y coordinates alone, just to be able to distinguish those areas by color. It must be said that these colored groups don’t always coincide with semantic grouping, so take them with a grain of salt. They’re there mainly to make the map look better and to improve legibility. Font size is proportional to the number of revisions an entry has had so far.
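For readers who want to see what “vectorized by contributing accounts” could look like, here is a minimal sketch. It is not the original code: the input shape, the function names and the choice to use raw edit counts as vector values are all assumptions made for illustration.

```javascript
// Hypothetical input: one record per entry, with per-contributor edit counts.
const entries = [
  { title: "Entry A", edits: { alice: 20, bob: 10 } },
  { title: "Entry B", edits: { bob: 3 } },
  { title: "Entry C", edits: { alice: 5, carol: 25 } },
];

// Total number of edits an entry received across all contributors.
function totalEdits(entry) {
  return Object.values(entry.edits).reduce((sum, n) => sum + n, 0);
}

// Keep entries with more than `minEdits` edits and turn each one into a
// vector with one dimension per contributor (value = that account's edits).
function vectorize(allEntries, minEdits) {
  const kept = allEntries.filter((e) => totalEdits(e) > minEdits);
  const contributors = [
    ...new Set(kept.flatMap((e) => Object.keys(e.edits))),
  ].sort();
  const vectors = kept.map((e) =>
    contributors.map((name) => e.edits[name] || 0)
  );
  return { titles: kept.map((e) => e.title), contributors, vectors };
}

const result = vectorize(entries, 25);
console.log(result.contributors, result.vectors);
```

The resulting `vectors` array is the kind of input a t-SNE implementation expects: one row per entry, one column per contributing account.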
Here’s the entire map as one big 9000 x 6000 image. Click the image to open it, then zoom in with the mouse. You can find cropped clusters and some commentary below.
Turns out the areas of expertise are pretty well delineated.
Let’s look at some clusters. Here we have some geographic Wikipedia entries, mainly countries and some historical persons. That sounds logical – editing an entry about a great ruler probably causes one to contribute to an entry about his or her country.
Here are some famous Slovenian people, mostly writers, intermixed with some towns. In the lower left quadrant, there are also some Slovenian politicians. It sounds funny that the late Communist ruler Josip Broz Tito is so close to Janez Janša, who is the current leader of the right-wing opposition. It appears that there are a number of people who edit both entries. I wonder why. Here’s an article by Miles Mathis about the editing of Wikipedia. I don’t know what to think of it, but I have certainly read enough about the autocratic rule of (English) Wikipedia editors to give it some credence. I don’t know about the Slovenian version, and I don’t want to speculate, but this is as good an opportunity as any to start thinking about it.
Here is a cluster of lists. It seems that there is an entire group of people who curate them, regardless of their content.
Here’s a funny cluster dealing mostly with public transit in Slovenia. It almost seems that there are some bureaucrats in the government who edit these entries on the taxpayer’s dime. I could probably find that out if I traced the IPs in the edit logs. If someone hires me as an Internet detective, I might do that, but I made these pictures for fun.
This is an interesting cluster. It appears that many of the same people edit entries about the Eurovision Song Contest, parliamentary elections, the basketball World Cup and the skiing World Cups, along with two new parties in the Slovenian parliament: ZAAB, which is the remainder of the majority party from the last parliamentary term, and SMC, which is the new majority party. Both parties, along with Pozitivna Slovenija (a former majority party), were founded hastily right before elections and won them by a landslide. I wonder how political analysts would comment on their members’ (speculative) love for sports contests.
Here we have many religious personalities, mostly popes.
A grouping of entries about Slovenian popular music.
Some entries about historical scientists and natural sciences.
More geographical entries, along with some entries about Slovenian highways.
Here’s the center of the map. It stands to reason that entries with many non-specialized contributors are drawn towards it. It’s generally more chaotic than the outskirts, but here a great many contemporary and historical art personalities are grouped together.
Another snapshot from the center, mostly consisting of entries about workers’ rights and things related to work.
So here it is. For the technically minded: everything was done in JavaScript, with Andrej Karpathy’s tsnejs library for t-SNE, clusterfck for k-means, and d3 for drawing.
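To illustrate the coloring step, here is a hand-written k-means sketch that works on 2D layout coordinates. It deliberately does not use clusterfck’s API. The function name `kmeans2d`, the sample points and the deterministic initialisation (the first k points as starting centroids) are assumptions chosen to keep the example reproducible.

```javascript
// Plain k-means on [x, y] points, e.g. the coordinates of a finished layout.
function kmeans2d(points, k, iterations = 50) {
  // Assumption: start from the first k points instead of random picks.
  let centroids = points.slice(0, k).map(([x, y]) => [x, y]);
  let labels = new Array(points.length).fill(0);

  for (let iter = 0; iter < iterations; iter++) {
    // Assignment step: each point joins its nearest centroid.
    labels = points.map(([x, y]) => {
      let best = 0;
      let bestDist = Infinity;
      centroids.forEach(([cx, cy], i) => {
        const d = (x - cx) ** 2 + (y - cy) ** 2;
        if (d < bestDist) {
          bestDist = d;
          best = i;
        }
      });
      return best;
    });

    // Update step: move each centroid to the mean of its members.
    centroids = centroids.map((old, i) => {
      const members = points.filter((_, j) => labels[j] === i);
      if (members.length === 0) return old; // keep empty clusters in place
      const sx = members.reduce((s, p) => s + p[0], 0);
      const sy = members.reduce((s, p) => s + p[1], 0);
      return [sx / members.length, sy / members.length];
    });
  }
  return { labels, centroids };
}

// Hypothetical layout coordinates with two obvious groups.
const layout = [[0, 0], [10, 10], [0, 1], [1, 0], [10, 11], [11, 10]];
const { labels, centroids } = kmeans2d(layout, 2);
console.log(labels, centroids);
```

Each label can then be mapped to a color. Because the clustering only sees positions on the map, the colors follow spatial proximity rather than meaning, which is why the groups do not always match the semantic ones.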
I also made an inverted map, which shows the contributors grouped by the entries they revised. It’s not so interesting for the general public, but if someone wants to see it, it’s available on request.
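Inverting the data for such a map is a small transformation. The sketch below assumes a hypothetical input shape (one record per entry, with per-contributor edit counts) and is not the original code.

```javascript
// Hypothetical input: entries with per-contributor edit counts.
const entryEdits = [
  { title: "Entry A", edits: { alice: 20, bob: 10 } },
  { title: "Entry C", edits: { alice: 5, carol: 25 } },
];

// Flip entry -> contributors into contributor -> entries.
// Maps are used so that unusual usernames cannot clash with object keys.
function invertByContributor(allEntries) {
  const byContributor = new Map();
  for (const { title, edits } of allEntries) {
    for (const [name, count] of Object.entries(edits)) {
      if (!byContributor.has(name)) byContributor.set(name, new Map());
      byContributor.get(name).set(title, count);
    }
  }
  return byContributor;
}

const inverted = invertByContributor(entryEdits);
console.log([...inverted.keys()]);
```

Each contributor can then be vectorized by entry, in the same way entries were vectorized by contributor, and fed through the same layout steps.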