Similarities between representatives in Slovenian parliament


The title should actually be “An exploration of dimensionality reduction techniques on a voting dataset from the Slovenian parliament”.

I’ve long been putting off a proper and comprehensive study of various machine learning techniques, especially those related to neural networks. I feel I made a few baby steps towards that goal with this research, which is actually a write-up of a project I did for a local newspaper in collaboration with the excellent designer Aljaž Vindiš (follow him on Twitter).

The dataset comes from another project that I’m collaborating on with Transparency International Slovenia and Institut Jožef Stefan. Zakonodajni monitor is a platform for inspecting the legislative process and for following the activity of parliamentary representatives, intended mostly for journalists and researchers. Among other things, it contains records of every vote cast in parliamentary sessions by every representative, which are then used for various statistics and visualizations. It also has an API for public access to that data, although I have the data in a local database too, which makes it somewhat easier for me to explore.

This project is an attempt to visualize relationships between representatives and parties in two-dimensional space, or on a line, to better understand the dynamics of power in Slovenian politics. It’s a part of my ongoing collaboration with the Dnevnik newspaper on data analysis and visualization. Since the project was not supposed to be interactive from the start, one important constraint was that the results should be fit for the paper version of the newspaper.

Dataset

Each representative has a great many properties in the database, among them a vote vector containing a record of her or his votes so far. A “yea” vote is 1, “nay” is -1, and an abstention, for whatever reason, is 0. At the time of this project, there had been slightly fewer than 650 votes in this parliamentary term, so the input data for each representative was a vector with approximately 650 dimensions. Our objective was to construct a one- or two-dimensional visualization that would hopefully confirm our existing knowledge about alliances between parties and individuals in the parliament and, if possible, reveal new and interesting information.
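A minimal sketch of this encoding, written in Python rather than the JavaScript used for the project. The function name, the ballot labels and the vote identifiers are assumptions made up for illustration; only the 1 / -1 / 0 coding comes from the text above.

```python
# Assumed ballot labels; anything that is not a yea or a nay is coded as 0.
VOTE_VALUES = {"yea": 1, "nay": -1}


def encode_votes(ballots, vote_ids):
    """Return one component per vote, in the order given by vote_ids.

    ballots maps a vote id to a ballot label. A missing record or any
    label other than "yea"/"nay" becomes 0.
    """
    return [VOTE_VALUES.get(ballots.get(vote_id), 0) for vote_id in vote_ids]


# Hypothetical example with four votes instead of roughly 650.
vote_ids = ["v1", "v2", "v3", "v4"]
ballots = {"v1": "yea", "v2": "nay", "v4": "abstain"}
vector = encode_votes(ballots, vote_ids)
print(vector)  # [1, -1, 0, 0]
```

Stacking one such vector per representative gives the matrix that the reduction techniques below work on.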

To effectively communicate this information, we had to employ some dimensionality reduction techniques, of which we tried three:

  • PCA (principal component analysis),
  • autoencoder,
  • t-SNE

In the end, we decided on t-SNE because it’s fast and convenient, but the other two methods, with the exception of PCA in two dimensions, gave very similar results.

What is “dimensionality reduction”, you might ask? It’s a set of techniques for making sense of complex data. A shadow is a simple, natural reduction technique, because it’s a projection from three-dimensional space into two dimensions. To continue the analogy, if you want to recognize a person from their shadow, the position of the sun matters a great deal. For example, the sun directly over a person’s head doesn’t give us much information about the person’s shape. It’s necessary to find a proper angle.

These techniques have much to do with properly positioning the “sun” in relation to the data, so as to retain the maximum possible amount of information in the projection. Of course, if you project from 650 dimensions into one, a lot of information is lost. Also, in many cases it’s not immediately clear what the axes of the projection actually mean. Read on, I’ll try to elaborate below.

Autoencoder

We started with an autoencoder. An autoencoder is a form of artificial neural network that is often used for dimensionality reduction. It is a multi-layer network that essentially tries to teach itself the identity function; that is to say, it’s trained to generalize patterns in the data by compressing the knowledge in some way and then recreating it. We used an autoencoder with 650 inputs, two layers of 100 neurons each, then a bottleneck layer with only two neurons, followed by an inverted structure acting as a decoder. When training was complete, every representative’s vector was propagated through the network again, and the activations of the two bottleneck neurons were saved as a coordinate pair. These were then plotted on a 2D canvas, resulting in the image shown below.

poslanci_autoencoder_v3

Legend for clarification:

  • brown (SMC), blue (DeSUS) and red (SD) are the left-leaning parties of the position (the governing coalition), which holds a heavy majority in the parliament. Much could be said about their leftism, but let’s leave it at that.
  • violet (SDS) and green (NSi) are the right-wing opposition parties. They are vehemently anti-communist (SDS) and Catholic-conservative (NSi).
  • rose is the opposition party ZAAB, which is the party of former prime minister Alenka Bratušek. It leans to the left.
  • grey is the opposition party ZL, which is Slovenia’s version of Syriza.

 

The dataset was relatively small, so the autoencoder was implemented in JavaScript with Karpathy’s excellent convnet.js library. Training took two hours on an i7 machine with 16GB RAM.
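
To make the shape of the network concrete, here is a small Python sketch (not the convnet.js code used in the project) that lists the layer widths of the mirrored encoder and decoder described above and counts the trainable parameters. It assumes plain fully connected layers with one bias per neuron, which may or may not match the original setup exactly; both function names are made up for illustration.

```python
def mirrored_layers(n_inputs, hidden, bottleneck):
    """Layer widths for an autoencoder whose decoder mirrors its encoder."""
    encoder = [n_inputs] + list(hidden) + [bottleneck]
    # The decoder repeats the encoder widths in reverse, skipping the bottleneck.
    return encoder + encoder[-2::-1]


def parameter_count(layers):
    """Weights plus biases, assuming fully connected layers (an assumption)."""
    return sum(a * b + b for a, b in zip(layers, layers[1:]))


layers = mirrored_layers(650, [100, 100], 2)
print(layers)                   # [650, 100, 100, 2, 100, 100, 650]
print(parameter_count(layers))  # 151452
```

The two values read off the width-2 layer in the middle are what ends up as a representative’s coordinates on the plot.
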
As a small branch of this project, we also tried to arrange the representatives on an ideological spectrum. For this, a similar neural network was used, but we first trained it with the most left- and right-leaning representatives to obtain the extremes, then fed the others through it and plotted the regression scores in one dimension. This arrangement is somewhat different from the final one.

poslanci_by_ideology

 

Principal component analysis

Next was an attempt to validate our results with PCA. Principal component analysis is, to quote, a “technique used to emphasize variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualize”. It’s essentially a method for projecting data from a multidimensional space to a lower-dimensional (say 2D) one while retaining as much information as possible. The first axis is chosen so that the variance along it is maximized, which maximizes the information it carries; the others follow in a similar fashion, with the constraint that each must be orthogonal to the previous ones.
We ran PCA for one- and two-dimensional solutions, with the results shown in the images below.

poslanci_pca_2d

 

Here’s the one-dimensional variant:

poslanci_by_ideology_pca
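
The one-dimensional case can be sketched in a few lines of plain Python. This is not the code used for the images above; it is a minimal illustration that finds the first principal axis by power iteration on mean-centered data, and the toy vote matrix is invented.

```python
import math
import random


def first_principal_component(rows, iterations=100, seed=0):
    """Return (axis, scores): the direction of greatest variance and the
    projection of every mean-centered row onto it."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    def project(axis):
        return [sum(x * a for x, a in zip(row, axis)) for row in centered]

    rng = random.Random(seed)
    axis = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iterations):
        scores = project(axis)
        # Multiply by X^T X (proportional to the covariance matrix).
        new_axis = [sum(scores[i] * centered[i][j] for i in range(n))
                    for j in range(d)]
        norm = math.sqrt(sum(v * v for v in new_axis))
        if norm == 0:
            break
        axis = [v / norm for v in new_axis]
    return axis, project(axis)


# Invented toy data: two blocs that disagree on the first three votes.
votes = [
    [1, 1, -1, 1],
    [1, 1, -1, 0],
    [-1, -1, 1, 1],
    [-1, -1, 1, 0],
]
axis, scores = first_principal_component(votes)
print([round(s, 3) for s in scores])
```

The two blocs land on opposite sides of zero. The sign of the axis is arbitrary, so which bloc ends up on the left is not meaningful.
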

t-SNE

Finally, we used the t-SNE algorithm with one- and two-dimensional solutions. t-SNE (or “t-distributed stochastic neighbor embedding”) is another technique for dimensionality reduction, well-suited for visualizing complex datasets in 2D or 3D. You’ll mostly see it in articles dealing with classification of complex data, for instance images and words, where you can see nice plots of similarly themed images or words with similar meaning clustered together. Here we used it on our voting data, and the results were quite good. First we tried a 2D visualization. It’s roughly similar to the one derived from the autoencoder.

Dot sizes correspond to voting attendance. You can see that the representatives with lower attendance are drawn to the center. Also, note that the violet group (SDS party), which is the true and fervent opposition, is relatively close to those with lower attendance. This is simply because the opposition frequently employs obstruction as a parliamentary tactic, or its members are simply not there for other reasons.

The neutral control point is the azure rectangle in the center. It’s simply a hypothetical rep who always abstained.

poslanci_tsne_v6

See the voting records for the opposition (yellow is absent, red is a vote against, blue a vote in favor):

controls_opposition

And here are records for some ministers:

controls_missing

Compare these with the position:

controls_position

Having partly confirmed the validity of the results and identified possible artefacts, we moved on. What we had realized so far was that absences introduce errors in position, and that these errors tend to draw absent representatives towards the center. This can confuse the arrangement to the point where some people might wonder: what is this clearly positional rep doing close to the opposition? Is (s)he leaning towards them in voting? No, this is simply an artefact that absences introduce into the positioning due to the way these methods work.

Then we decided that we’d maybe like a simpler visualization, one that is more suitable for a paper medium. So we ran t-SNE again in one dimension, then we used a “beeswarm” layout to sketch things out. The beeswarm is essentially a one-dimensional layout, in which some clustered elements are pushed onto a plane to avoid overcrowding on the single axis.
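
A naive beeswarm can be sketched as follows. This is only an illustration of the idea, not the layout code used for the newspaper: dots keep their position on the main axis and are pushed sideways in fixed steps until they no longer overlap. The dot diameter and the input values are assumptions.

```python
import math


def candidate_offsets(diameter):
    """Yield 0, +d, -d, +2d, -2d, ... as offsets on the second axis."""
    yield 0.0
    step = 1
    while True:
        yield step * diameter
        yield -step * diameter
        step += 1


def beeswarm(values, diameter=1.0):
    """Return an (x, y) pair per value; x is the value itself, y is the
    smallest offset at which the dot touches no previously placed dot."""
    placed = []
    positions = [None] * len(values)
    for index in sorted(range(len(values)), key=lambda i: values[i]):
        x = values[index]
        for y in candidate_offsets(diameter):
            if all(math.hypot(x - px, y - py) >= diameter for px, py in placed):
                placed.append((x, y))
                positions[index] = (x, y)
                break
    return positions


positions = beeswarm([0.0, 0.0, 0.0, 5.0])
print(positions)  # [(0.0, 0.0), (0.0, 1.0), (0.0, -1.0), (5.0, 0.0)]
```

Three dots sharing the same coordinate fan out above and below the axis, while the isolated one stays on it.
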

bees_v1
Finally, and mostly for aesthetic reasons, we converted that into a hex-binned layout. The number of hexagons corresponds to the number of dots above, but voting attendance is encoded in opacity, and party affiliation is represented by color. Here is a sketch:

poslanci_hex_opacity

Here’s a closer view of the opposition:

poslanci_hex_close

As a final step, we removed everyone who was present at fewer than 200 voting sessions, and also added three control points:

  • neutral: a hypothetical representative who always abstains,
  • all yea: a hypothetical representative who always votes in favor of the proposition,
  • all nay: a hypothetical representative who always votes against the proposition
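
The filtering and the three control points can be sketched like this. Since the vote vector codes every non-vote as 0, the sketch counts non-zero components as “present”, which is an assumption; the real attendance data may well distinguish an abstention from an absence. The function name and the toy data are made up.

```python
def with_control_points(vectors, min_present=200):
    """Drop representatives with too few recorded votes and append the
    neutral, all-yea and all-nay control points."""
    n_votes = len(next(iter(vectors.values())))
    kept = {
        name: vec
        for name, vec in vectors.items()
        # Assumption: a non-zero component means the representative voted.
        if sum(1 for v in vec if v != 0) >= min_present
    }
    kept["neutral"] = [0] * n_votes
    kept["all yea"] = [1] * n_votes
    kept["all nay"] = [-1] * n_votes
    return kept


# Toy example with three votes and a threshold of two instead of 200.
toy = {
    "rep A": [1, -1, 1],
    "rep B": [0, 0, 1],   # voted only once, so this one is dropped
}
result = with_control_points(toy, min_present=2)
print(sorted(result))  # ['all nay', 'all yea', 'neutral', 'rep A']
```
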

The neutral control point neatly bisects the space between the position and the opposition, not counting the mostly absent representatives from the position. The other two would be relevant if all propositions came from the position: the all-yea point would then sit at the extreme pro-government end. In reality, many acts are proposed by the opposition, so these two points are just not relevant. In the image below, the control points are the azure hexagons. Neutral is the leftmost one.

poslanci_hex2
And here is a finished version, expertly done by a pro designer:

hex_dnevnik

Added bonus: visualization of t-SNE in 3D:

poslanci_3d

Closeup on opposition:

poslanci_3d_2

Interpretation

So, what does this visualization really show? I’d like to say that since the acts subject to voting are mostly put forward by the governing coalition, it’s an arrangement of representatives on a continuum of support for government policy. But that is simply not so, as many acts are proposed by the opposition. It’s more that the arrangement depends on an individual’s position in relation to the majority’s vote, which might or might not reflect support for the government.

This often coincides with the arrangement on the ideological spectrum, but it’s not the same. You might wonder what the cluster of weakly colored representatives in the right-middle is. These are mostly ministers who cast a few votes at the beginning of the parliamentary term, but then left to become members of the actual government. They were still members at one point in time, so we included them in our research, but we could easily have dropped them, since they don’t really figure in day-to-day parliamentary work.

Most of the errors and counterintuitive positioning are due to gaps in representatives’ voting records. These methods compare voting records component-wise, so if, for example, we have two members of the same party who substituted for each other (one was there when the other wasn’t, as is the case with the ministers), we can only compare their available records with everyone else’s, but not with each other’s.
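
A tiny invented example shows the effect. It uses a plain dot product as the component-wise comparison, which is a simplification: the actual methods work with distances and probabilities, but they see the same zeros.

```python
def dot(a, b):
    """+1 for every vote where both agree, -1 where they disagree,
    0 where at least one of them has no recorded vote."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical records over six votes.
party_colleague = [1, -1, 1, 1, -1, 1]
minister = [1, -1, 1, 0, 0, 0]      # left for the government after vote 3
replacement = [0, 0, 0, 1, -1, 1]   # took the seat from vote 4 on

print(dot(minister, party_colleague))     # 3
print(dot(replacement, party_colleague))  # 3
print(dot(minister, replacement))         # 0
```

Both the minister and the replacement agree with their colleague on every vote they took part in, yet compared directly they look exactly like the always-abstaining neutral point looks to everyone.
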

t-SNE was also done in JavaScript, with another one of Karpathy’s libraries (tsnejs).

Here’s a final look at the data: a hierarchical clustering of all the representatives, including those mostly absent.

hierarchical

Original datasets and code are available by request. My mail address is on the About page.

Analysis of traffic violations in Slovenia between beginning of 2012 and end of 2014


This is my first attempt to use open data for data visualization in web presentation and for a mobile app. The idea was to cross-pollinate promotion, but it didn’t go so well – more on this later.

The analysis is published on a separate URL due to heavy use of JavaScript, which complicates things in WordPress. Click the link above or the big image of a parking ticket to read it.

Parking ticket

According to data provided by the state police, the highway authority and local traffic wardens, slightly fewer than a million traffic violations occurred between the start of 2012 and September 2014. Given that there are 1,300,000 registered vehicles and 1,400,000 active driving licenses in the country, this is a lot. The vast majority of them are parking and toll tickets.

In the main article, there are a lot of images and charts. For example, I analyzed data for major towns in Slovenia to get the streets with the highest number of issued traffic tickets. Here’s an example for Ljubljana:

Streets with parking tickets in Ljubljana – click to read article

I had temporal data for each issued ticket, so I could also show on which streets you are more likely to be ticketed in the morning, midday or evening. On the image below, morning is blue, midday is yellow, and evening is red.

Tickets issued by hour – click for main article

This is, however, only the beginning. Here are questions I tried to answer:

  • Are traffic wardens and traffic police just another type of tax collectors for the state and counties?
  • Do traffic wardens really issue more tickets now than in the past, or is that just my perception?
  • Which zones in bigger towns are especially risky, should you forget to pay for parking?
  • Are traffic wardens more active in specific time intervals?
  • Do the police set speed traps at the locations with the most traffic accidents? What about DUI checks?
  • How does temperature influence the number of issued traffic tickets?
  • Does the moon influence the number of issued traffic tickets? If so, which types?
  • Where and when are drivers most at risk of encountering other drunk drivers?
  • Where does the highway authority check for toll, and when to hit the road if one does not want to pay it?
  • How can we drive safer using open data?

Be sure to read the main article to see all the visualizations and interactive maps. There are also videos, for example this one, showing how the ticketing territory expanded through time in Ljubljana:

Parkirne kazni v Ljubljani 2012 – 2014 from Marko O’Hara on Vimeo.

Some other highlights:

The big finding was a sharp increase in the number of parking tickets issued in Ljubljana by the end of 2013, which coincides with the publication of the debt that the county had run into:

Increase of parking tickets issued in Ljubljana

There’s an interactive map showing the quadrants with most DUI tickets and their distribution by day of week and month in year:

DUI distribution

Mobile app for Android

Mobile app for Android – start screen
Mobile app for Android – map

I also wrote an Android mobile app (get it on Google Play if you are interested) that locates the user and shows locations of violations of selected type on the map, as well as a threat assessment, should she want to break the law. Here’s the description on Google Play:

The app helps the user find out where and when traffic tickets were issued in Slovenia, thus facilitating safer driving.
The ticket database is limited to the territory of the Republic of Slovenia.

Choose between these issued citations to show in app:
– parking
– speeding
– driving while using a cellphone
– ignoring safety belt laws
– unpaid toll
– DUI
and traffic accidents.

The app will locate you, fetch data about traffic citations issued in your vicinity, and show them on the map. To see citations that were issued somewhere else, click on the map. Also available is a summary of the threat level, derived from statistical data collected by government agencies.

Locating the user and showing dots on a map wasn’t really a challenge, but I wanted to show a realistic threat assessment, based on location and time. To do that, I wrote an API method that calculates the number of tickets issued on the same day of the week in the same hour interval and then draws a simple gauge.

Let’s say, for example, that you find yourself in the center of Ljubljana on Monday at noon, don’t have the money for parking fee, and you really only want to take a box to a friend who lives there. You’ll be gone for ten minutes only, so should you risk not paying the parking fee?

The app finds out the total number of tickets issued on Mondays in the three-hour period between noon and 3 PM, then graphically shows the threat level along with some distributions, something like this:

Threat assessment
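
The counting behind such a gauge can be sketched as follows. This is not the app’s actual API method: the function names are invented, the location filter is left out, and scaling the count against the busiest slot is an assumption about how a gauge value might be derived. Only the “same weekday, same three-hour interval” rule comes from the description above.

```python
from collections import Counter
from datetime import datetime


def slot(moment, width=3):
    """(weekday, start hour of the interval), e.g. Monday 13:20 -> (0, 12)."""
    return (moment.weekday(), (moment.hour // width) * width)


def tickets_in_slot(ticket_times, moment, width=3):
    """Number of past tickets issued in the same weekday/interval slot."""
    return sum(1 for t in ticket_times if slot(t, width) == slot(moment, width))


def threat_level(ticket_times, moment, width=3):
    """Assumed gauge value: the slot's count relative to the busiest slot."""
    counts = Counter(slot(t, width) for t in ticket_times)
    if not counts:
        return 0.0
    return counts[slot(moment, width)] / max(counts.values())


# Invented ticket timestamps near one location.
tickets = [
    datetime(2014, 1, 6, 12, 30),   # Monday
    datetime(2014, 1, 13, 14, 59),  # Monday
    datetime(2014, 1, 13, 15, 0),   # Monday, but already the next interval
    datetime(2014, 1, 7, 12, 30),   # Tuesday
    datetime(2014, 1, 6, 9, 0),     # Monday morning
]
monday_noon = datetime(2014, 1, 20, 12, 0)
print(tickets_in_slot(tickets, monday_noon))  # 2
print(threat_level(tickets, monday_noon))     # 1.0
```
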

It works pretty well, and I use it sometimes, although I admit that its use cases may be marginal for the majority of the population. It does get ten new installs a day, although I don’t know how long this trend will continue.

I did send out press releases and mounted a moderate campaign on Twitter (here’s the app’s account), but it amounted to precious little. Maybe the timing was bad – I launched it during the Christmas holidays, when Internet usage is low. Or this type of app just isn’t so interesting.

I’m currently working on an analysis of parking tickets for New York City; maybe that will be more interesting. There were, after all, more than nine million tickets issued there, and the data is much richer.

Stay tuned!

A project for Transparency International Slovenija – visualization of lobbying contacts between state officials and lobbyists


On the basis of the previous post, Transparency International Slovenia asked me to collaborate on some projects. This is one of them, and it was launched today on a separate site: kdovpliva.si (English: whoinfluences.si).

It’s an attempt to visualize several networks of lobbyists, their companies, politicians and state institutions. Perhaps the most interesting part is the network of lobbying contacts, which was constructed with data containing around 700 reported contacts between 2011 and late 2014.

As you may imagine, not every lobbying contact is reported. For those that are, records are kept at the Komisija za preprečevanje korupcije (Commission for the Prevention of Corruption, a state institution). Transparency International Slovenia obtained those records as PDF files, since the institution refused to provide them in a machine-readable format. They enlisted a few volunteers to copy and paste the information into spreadsheets, then handed those to me to visualize.

You can see the results below. Click here or the image to open the site in a new window. It’s in Slovenian. For methodology, continue reading below the image.

App screenshot – lobbying contacts

 

Network construction

The meaning of every network is determined by the nature of its nodes and connections. Here, we have four node types:

  • lobbyists
  • those who were lobbied – state officials
  • organizations on whose behalf lobbying was performed
  • state institutions at which the abovementioned officials work

A lobbying contact is initiated by a company or an organization, which employs a lobbyist to do the work. The lobbyist then contacts sufficiently influential state officials who work at the appropriate state institution.

So an organization is connected to the lobbyist with a weight of 2, the lobbyist to a state official with a weight of 1, and the state official to her institution with a weight of 2. The weights signify the approximate loyalty between these entities. We presupposed that lobbyists are more loyal to their clients than they are to the state officials, with whom they must be in a promiscuous relationship. Furthermore, the state officials are also supposed to be more loyal to their employers than to the lobbyists, although this is a daring supposition. But let’s say they are, or at least that they should be.
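
As a rough sketch, one reported contact can be turned into three weighted edges like this. The record field names and the placeholder entities are assumptions; the real spreadsheets may be laid out differently, and the sketch does not show how repeated contacts between the same pair were handled.

```python
# Edge weights as described above.
WEIGHT_EMPLOYMENT = 2  # organization-lobbyist and official-institution
WEIGHT_CONTACT = 1     # lobbyist-official


def build_edges(contacts):
    """Map (source, target) pairs to a weight for every reported contact."""
    edges = {}
    for contact in contacts:
        edges[(contact["organization"], contact["lobbyist"])] = WEIGHT_EMPLOYMENT
        edges[(contact["lobbyist"], contact["official"])] = WEIGHT_CONTACT
        edges[(contact["official"], contact["institution"])] = WEIGHT_EMPLOYMENT
    return edges


# Placeholder records, not real data.
contacts = [
    {"organization": "Org A", "lobbyist": "Lobbyist 1",
     "official": "Official X", "institution": "Institution I"},
    {"organization": "Org B", "lobbyist": "Lobbyist 1",
     "official": "Official Y", "institution": "Institution I"},
]
edges = build_edges(contacts)
for (source, target), weight in sorted(edges.items()):
    print(source, "->", target, weight)
```

In a force-directed layout, the heavier edges pull lobbyists towards their clients and officials towards their institutions, which is what lets groups form around the seats of power.
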

After some processing, the network emerged. Immediately apparent are the interest groups, centered around seats of power. Here’s an image of the pharmaceutical lobby. It’s centered on the Public Agency for Pharmaceuticals and Medicine. Main actors of influence are companies such as Merck, Novartis, Eli Lilly, Aventis, etc.

Pharmaceutical lobby

A click on the agency node brings up a panel with some details, such as a list of companies (font size indicates the frequency of contact), lobbying purposes and a timeline of lobbying contacts. Here we can see that Novartis and Krka were the most active companies, and that they lobbied on pricing and to limit potential competition from producers of generic drugs.

You can explore the network by yourself to see the other interest groups.

Who lobbied the drug agency?

 

Some advice from Information Commissioner

Unfortunately, we had to omit lobbyists’ names for reasons of supposed privacy. The Information Commissioner strongly advised us not to display them on the basis of some EU ruling. I’m not an expert in EU law, and perhaps there are good reasons for this. On the other hand, there may not be. I fail to see why this information would not be in the public interest, since these decisions have an impact on a significant number of taxpayers, if not all of them.

Anyway, we have the names. After all, we had to use them to connect the network. They are present in raw data, just not displayed.

We are probably going to continue developing this project, as new information comes to light and new rulings regarding privacy are issued.

Stay tuned!

Voting and attendance in Slovenian Parliament from 2004 to current term


In Slovenia, we have a love/hate relationship with our politicians. We hate them, because at almost every single step they make, they let us know they are corrupt and they can easily get away with it. But in each new election new faces appear, promptly get elected and are hailed as saviors, who will finally clean the Augean stables of greed and corruption that has been accumulating for too long.

Most emotions are reserved for those in the front row, mainly government members. Members of parliament are somehow exempted, as they are not so widely known. Somehow, they are not monitored properly, at least in my book. There is a site that contains session records per member and per session, but it’s not widely known. It was an inspiration for this attempt to present members’ activity in an easily understandable and graphic way for current term and a few terms in the past.

See the interactive version:        Slo                             Eng

Interest groups

The main idea was to group the parliamentary members by similarity of their voting record. Most parliamentary members are bound by strict voting discipline, imposed by the parties they belong to. This way the parties can guarantee that one act or another will pass and become law. But is this really so? I tried to use a simple machine learning technique to answer that question. First I collected all the voting results from a parliamentary term and sorted them in chronological order, then applied the technique (k-means clustering, for the technically minded). The number of groups was set to ten, but I could increase it to see smaller groups – maybe factions inside parties, or cross-party interest groups.
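
For readers who want to see the mechanics, here is a bare-bones k-means in plain Python. It is a sketch rather than the code behind the charts: the farthest-point initialization is my choice to keep the example deterministic, and the six voting records are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, max_iterations=50):
    """Return a cluster index for every vector."""
    # Assumed initialization: start from the first vector, then repeatedly
    # add the vector farthest from all centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Invented voting records (1 = yea, -1 = nay, 0 = no vote).
records = [
    [1, 1, 1, -1],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [-1, -1, -1, 1],
    [-1, -1, -1, -1],
    [-1, 0, -1, 1],
]
groups = k_means(records, k=2)
print(groups)  # [0, 0, 0, 1, 1, 1]
```

With real data the interesting part is varying k and checking whether the groups follow party lines or cut across them.
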

Below you can see an example of two groups from recent term.

Here is the first:

And here is another:

group1

It’s apparent that groups do not contain representatives from one party only, and the visual representation imparts a feel for the differences in voting. As I mentioned above, I arbitrarily constructed ten groups, but a serious researcher would play and tinker with the number, as every clustering technique is an exploratory process and must be iterated upon for best results. It’s interesting that the results also show other parliamentary tactics. This one below could be interpreted as obstruction, or simply passivity or indifference. So what is it? To ask this question is to answer it, I guess.

To put it in context, this is a group of left-wing opposition representatives during a period when they were in a heavy minority.

Indifference or obstruction?

In contrast, this is the right-wing voting machine that prevailed:

A disciplined voting machine

The contrast between these two groups is so dramatic that it would be funny, if these were funny affairs. While the opposition was idling away, the majority voted into existence law after law that, together, still influence the lives of the Slovenian citizenry. In the interactive version (English) you can explore what the votes were about by simply moving the mouse over the horizontal stripes.

See the interactive version:        Slo                             Eng

Attendance record

Session attendance is another telling indicator of a particular representative’s zeal in upholding democracy and serving the interests of his constituency. It’s already apparent from the charts above, but I still constructed a separate graphic for it. It’s sorted by presence and more easily readable.

It has to be noted that some representatives were excused from voting sessions for various periods of time. Among them are those who became ministers and those who replaced them in the parliamentary seat, not being there before.

Here’s an example from the recent term. At the bottom, you can see two blocks with alternating presence. That’s because there were two governments. When the first one fell, the ministers returned to their seats; those who had originally replaced them returned to the party’s roster; new ministers were sworn in and abandoned their seats; and new replacements came from the opposite camp.

attendanceEN

See the interactive version:        Slo                             Eng

Yes-men and rebels

Another interesting statistic is the representatives with the most yea or nay votes. I don’t really know how to interpret this, but I did it nevertheless. One could say that in terms with only one government, members of the ruling majority with the most yea votes are those who unquestioningly toe the party line. Conversely, those with the most nay votes are the most fervent members of the opposition. In terms with two governments, this is a little less clear-cut: one would have to separate the timelines and run the statistics on subperiods for each government. I didn’t do this, but a serious researcher would. I made this report to let them know that they are being monitored, but it’s the task of an investigative journalist to delve into the data and interpret it in a meaningful way. I don’t have time for this, and I don’t really know the particulars of daily politics here well enough to be able to do that.

But I’m offering the database to anyone who would like to do that. Send me a mail for details, I’ll gladly oblige.

Here are a few simple pie charts that illustrate what I just wrote:

Yes men and rebels

See the interactive version:        Slo                             Eng

Unity index

While programming, it struck me that I could calculate a synthetic measure that would show the unity in the parliament. The reasoning goes: if the vote was unanimous, the parliament as a whole was united in the cause at hand. But if half of the representatives voted yea, and the other half nay, the parliament was divided. So I constructed a timeline of all voting sessions and colored every session according to this measure. Blue for a unanimous vote, red for an evenly split vote, and violet hues as nuances of disharmony.

Additionally, the bar heights indicate the presence ratio. Lower heights obviously mean lower presence.
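
One way to turn that reasoning into numbers is sketched below. The exact formula behind the charts is not given here, so the margin-based index and the linear red-to-blue blend are assumptions; they merely reproduce the two stated endpoints (unanimous is blue, an even split is red).

```python
def unity_index(yea, nay):
    """1.0 for a unanimous vote, 0.0 for an even split (assumed formula)."""
    cast = yea + nay
    if cast == 0:
        return None  # nobody voted, the index is undefined
    return abs(yea - nay) / cast


def unity_color(unity):
    """Blend linearly from red (unity 0) to blue (unity 1) as an RGB triple."""
    return (round(255 * (1 - unity)), 0, round(255 * unity))


def presence_ratio(present, seats):
    """Bar height: share of seats whose holders were present."""
    return present / seats


print(unity_index(60, 0), unity_color(unity_index(60, 0)))    # 1.0 (0, 0, 255)
print(unity_index(30, 30), unity_color(unity_index(30, 30)))  # 0.0 (255, 0, 0)
print(unity_index(45, 15))                                    # 0.5
print(presence_ratio(72, 90))
```

Everything in between comes out as a shade of violet.
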

In some terms, the presence falls toward the end, and the proportion of red bars increases. This means that the representatives lost heart and abandoned their posts, and those who stayed quarreled bitterly.

Here are these graphics for various terms. They are stretched to the same length. Perhaps a more correct, but less visually appealing, approach would be not to stretch them, so that the length of each particular term would be apparent.

indexEN
IV (2004 – 2008) – PM Janez Janša
indexEN
V (2008 – 2011) – PM Borut Pahor – ended prematurely
indexEN
VI (2011 – 2014) PM Janez Janša, PM Alenka Bratušek – ended prematurely
indexEN
VII (present) – probable PM Miro Cerar

See the interactive version:        Slo                             Eng

Session timelines and voting networks

The drive behind this section was to find out whether attendance falls as a session progresses into the small hours. I found that not to be so, which is encouraging in a way. These charts at least show which sessions were bitterly contested, and which were almost unanimous. You can see examples of both behaviors in the graphic below.

sessions

Going one step further, I constructed a separate network for each session, in which a representative is connected to a proposition if he or she voted for it, and otherwise not.
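
In code, such a network is just a list of representative-to-proposition edges built from the yea votes. The record layout, the names and the ballot labels below are assumptions made for the sake of a small example.

```python
from collections import defaultdict


def session_network(votes):
    """Keep one (representative, proposition) edge per yea vote."""
    return sorted({(rep, prop) for rep, prop, ballot in votes if ballot == "yea"})


def supporters(edges):
    """Group representatives by the proposition they are connected to."""
    groups = defaultdict(list)
    for rep, prop in edges:
        groups[prop].append(rep)
    return dict(groups)


# Invented votes from a single session.
votes = [
    ("rep A", "proposition 1", "yea"),
    ("rep B", "proposition 1", "yea"),
    ("rep C", "proposition 1", "nay"),
    ("rep C", "proposition 2", "yea"),
    ("rep A", "proposition 2", "absent"),
]
edges = session_network(votes)
print(supporters(edges))
# {'proposition 1': ['rep A', 'rep B'], 'proposition 2': ['rep C']}
```
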

Networks are a little bit messy, and people tend to not understand them well. This network below shows three groups of representatives (you can zoom in and out in the interactive version). They are grouped close to the propositions they voted for. So this is another opportunity to find out the interest groups on the micro level, for each proposition. Some propositions don’t have a name, just a date. That’s not my fault, but the parliament’s, as they didn’t bother to publish it on the web.

network

See the interactive version:        Slo                             Eng

Seating order

Finally, here are some heatmaps for various variables, mapped onto seating orders. The first is partitioned according to the representatives’ party. Sorry, no legend here. You can mouse over in the interactive version to show details.

The second is an attendance heatmap. Green is full attendance, red is total absence, and there’s a linear color scale between them. This one provides an at-a-glance overview of the attendance of entire party blocks.

The next two are yea and nay heatmaps, so you can see which party blocks mostly voted yea, and which nay. They are normalized to their local maxima for visual appeal, but a more correct approach would be not to normalize them, so that it would be apparent that a nay vote is much less frequent than a yea. Why, I have no idea, but I imagine there must be a lot of technical votes, for example establishing presence and so on.
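
The linear scale and the effect of the normalization can be illustrated with a short sketch. The function names and the seat counts are invented; the sketch compares scaling each map to its own maximum with scaling both to a shared maximum, which is one way to read the “more correct approach” mentioned above.

```python
def attendance_color(attendance):
    """Linear scale from red (0.0, total absence) to green (1.0, full
    attendance), returned as an RGB triple."""
    a = min(max(attendance, 0.0), 1.0)
    return (round(255 * (1 - a)), round(255 * a), 0)


def normalize(counts, peak=None):
    """Scale counts to 0..1 against their own maximum, or against a given
    shared maximum."""
    if peak is None:
        peak = max(counts.values())
    return {seat: (count / peak if peak else 0.0) for seat, count in counts.items()}


# Invented yea and nay counts for two seats.
yea = {"seat 1": 400, "seat 2": 100}
nay = {"seat 1": 10, "seat 2": 40}

print(attendance_color(1.0), attendance_color(0.0))  # (0, 255, 0) (255, 0, 0)
print(normalize(nay))            # local maximum: seat 2 looks as intense as a top yea voter
print(normalize(nay, peak=400))  # shared maximum: nay votes are visibly rare
```
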

seatsEN

These seating orders are approximate, as I couldn’t get them for past terms from the parliament. They asserted that they didn’t have them, and claimed they don’t even have the current one, even if it’s published on their own website. There were more lies, but I won’t go into that here. They are, after all, in power, and I’m just a blogger.

Why they should engage in such behaviour is beyond me. Maybe they think that the information is theirs and should be kept from the public.

Again, if anyone needs the MongoDB database, drop me a note. My email address is on the About page.

See the interactive version:        Slo                             Eng

Discovering and visualizing songs with similar trends on the British Top 40 Charts from 1990 to 2014


I often wondered what the average lifetime of a pop song on the charts is. If one follows music, it becomes intuitively apparent that there are in fact several types of hits. Some stay on the charts for many weeks, and others barely make it, then immediately slip out.

So I set about discovering groups of songs with similar trends, as they moved on the weekly British Top 40 Chart from 1990 to 2014. A total of 1284 different songs appeared on the charts in that period. After a series of experiments, I arbitrarily settled on 100 groups. Position data for each song was collected across the weeks, then the songs were grouped using k-means clustering.

The result is part interactive, part static visualization, consisting of an exploratory chart and 100 small charts showing each separate group.

Check it out here! Or click the image below.

Song trends over time in a typical group

 

To group the songs, the data was first scraped from www.officialcharts.com, then arranged in a format suitable for k-means clustering. The visualization was constructed with d3.
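
K-means needs every song to be a numeric vector of the same length, so some arrangement step like the following is required. The details here are assumptions, not the format actually used: trajectories are aligned to a song’s first week on the chart, cut or padded to a fixed number of weeks, and weeks outside the Top 40 get the placeholder position 41.

```python
OFF_CHART = 41  # assumed placeholder for a week outside the Top 40


def trajectory_vector(weekly_positions, length=10):
    """Turn a song's chart positions, starting from its first week on the
    chart, into a fixed-length vector by truncating or padding."""
    vector = list(weekly_positions[:length])
    vector += [OFF_CHART] * (length - len(vector))
    return vector


# Invented trajectories: a short-lived hit and a long runner.
short_hit = trajectory_vector([38, 40], length=5)
long_runner = trajectory_vector([12, 5, 1, 1, 2, 4, 9], length=5)
print(short_hit)    # [38, 40, 41, 41, 41]
print(long_runner)  # [12, 5, 1, 1, 2]
```

Vectors like these can then be fed to k-means, with each resulting group holding songs whose weekly positions follow a similar curve.
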

And here are some of the small multiples.

Some of the 100 different groups. Click image for more.