On the basis of a previous post, Transparency International Slovenia asked me to collaborate on some projects. This is one of them, and it was launched today on a separate site: kdovpliva.si (English: whoinfluences.si).
It’s an attempt to visualize several networks of lobbyists, their companies, politicians and state institutions. Perhaps the most interesting part is the network of lobbying contacts, which was constructed with data containing around 700 reported contacts between 2011 and late 2014.
As you may imagine, not every lobbying contact is reported. For those that are, records are kept at the Komisija za preprečevanje korupcije (Commission for the Prevention of Corruption, a state institution). Transparency International Slovenia obtained those records as PDF files, since the institution refused to provide them in a machine-readable format. They hired a few volunteers to copy and paste the information into spreadsheets, then handed the spreadsheets to me to visualize.
You can see the results below. Click here or the image to open the site in a new window. It’s in Slovenian. For methodology, continue reading below the image.
The meaning of every network is determined by the nature of its nodes and connections. Here, we have four node types:
lobbyists
those who were lobbied – state officials
organizations on whose behalf lobbying was performed
state institutions at which the abovementioned officials work
A lobbying contact is initiated by a company or an organization, which employs a lobbyist to do the work. The lobbyist then contacts state officials of sufficient influence, who work at the appropriate state institution.
So an organization is connected to the lobbyist with a weight of 2, the lobbyist to a state official with a weight of 1, and the state official to her institution with a weight of 2. The weights signify the approximate loyalty between these entities. We presupposed that lobbyists are more loyal to their clients than they are to the state officials, with whom they must be in a promiscuous relationship. Furthermore, the state officials are also supposed to be more loyal to their employers than to the lobbyists, although this is a daring supposition. But let’s say they are, or at least that they should be.
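For the technically minded, here is a minimal Python sketch of how such a weighted edge list could be built from contact records. The field names and the sample records are made up for illustration; this is not the code or the data behind the site.

```python
from collections import defaultdict

# Weights as described above: 2 for the "loyal" links, 1 for the lobbying contact.
WEIGHT_ORG_LOBBYIST = 2
WEIGHT_LOBBYIST_OFFICIAL = 1
WEIGHT_OFFICIAL_INSTITUTION = 2


def build_edges(contacts):
    """Return ({(source, target): weight}, {(source, target): number of contacts})."""
    edges = {}
    counts = defaultdict(int)
    for c in contacts:
        links = [
            (c["organization"], c["lobbyist"], WEIGHT_ORG_LOBBYIST),
            (c["lobbyist"], c["official"], WEIGHT_LOBBYIST_OFFICIAL),
            (c["official"], c["institution"], WEIGHT_OFFICIAL_INSTITUTION),
        ]
        for source, target, weight in links:
            edges[(source, target)] = weight
            counts[(source, target)] += 1
    return edges, dict(counts)


# Hypothetical records, one per reported contact.
contacts = [
    {"organization": "Org A", "lobbyist": "Lobbyist 1",
     "official": "Official X", "institution": "Agency P"},
    {"organization": "Org B", "lobbyist": "Lobbyist 1",
     "official": "Official X", "institution": "Agency P"},
]
edges, counts = build_edges(contacts)
```

The contact counts are kept separately from the weights, so something like the frequency of contact can still be shown without changing the loyalty weights.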
After some processing, the network emerged. Immediately apparent are the interest groups, centered around seats of power. Here’s an image of the pharmaceutical lobby. It’s centered on the Public Agency for Pharmaceuticals and Medicine. Main actors of influence are companies such as Merck, Novartis, Eli Lilly, Aventis, etc.
A click on the agency node brings up a panel with some details, such as a list of companies (font size indicates the frequency of contact), lobbying purposes and a timeline of lobbying contacts. Here we can see that Novartis and Krka were the most active companies, and that they lobbied on pricing and to limit potential competition from producers of generic drugs.
You can explore the network by yourself to see the other interest groups.
Some advice from the Information Commissioner
Unfortunately, we had to omit lobbyists’ names for reasons of supposed privacy. The Information Commissioner strongly advised us not to display them on the basis of some EU ruling. I’m not an expert in EU law, and perhaps there are good reasons for this. On the other hand, there may not be. I fail to see why this information would not be in public interest, since these decisions have an impact on a significant number of taxpayers, if not all of them.
Anyway, we have the names. After all, we had to use them to connect the network. They are present in raw data, just not displayed.
We’re probably going to continue developing this project, as new information comes to light and new rulings regarding privacy are issued.
In Slovenia, we have a love/hate relationship with our politicians. We hate them, because at almost every single step they make, they let us know they are corrupt and they can easily get away with it. But in each new election new faces appear, promptly get elected and are hailed as saviors, who will finally clean the Augean stables of greed and corruption that has been accumulating for too long.
Most emotions are reserved for those in the front row, mainly government members. Members of parliament are somehow exempted, as they are not so widely known. Somehow, they are not monitored properly, at least in my book. There is a site that contains session records per member and per session, but it’s not widely known. It was an inspiration for this attempt to present members’ activity in an easily understandable and graphic way for the current term and a few past terms.
The main idea was to group the parliamentary members by similarity of their voting record. Most parliamentary members are bound by strict voting discipline, imposed by the parties they belong to. This way the parties can guarantee that one act or another will pass and become law. But is this really so? I tried to use a simple machine learning technique to answer that question. First I collected all the voting results from a parliamentary term and sorted them in chronological order, then applied the technique (k-means clustering, for the technologically minded). The number of groups was set to ten, but I could increase it to see smaller groups – maybe factions inside parties, or cross-party interest groups.
Below you can see an example of two groups from the recent term.
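To make the technique a bit more concrete, here is a small, self-contained k-means sketch in plain Python. It is an illustration only, not the code I used: the encoding of votes (1 for yea, -1 for nay, 0 for absent or abstained), the member names and the deterministic farthest-point initialization are all assumptions made for this example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50):
    """Plain k-means; returns the cluster index of each vector."""
    # Deterministic farthest-point initialization (a simplification for this sketch).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each member joins the nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [v for v, label in zip(vectors, new_labels) if label == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


# Hypothetical voting records, one column per vote in chronological order.
votes = {
    "Member A": [1, 1, -1, 1],
    "Member B": [1, 1, -1, 0],
    "Member C": [-1, -1, 1, -1],
    "Member D": [-1, 0, 1, -1],
}
names = list(votes)
labels = kmeans([votes[name] for name in names], k=2)
groups = dict(zip(names, labels))
```

With real data, k would be the number of groups (ten in my case), and rerunning with different values of k is the tinkering part.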
Here is the first:
And here another:
It’s apparent that groups do not contain representatives from one party only, and the visual representation imparts a feel for the differences in voting. As I mentioned above, I arbitrarily constructed ten groups, but a serious researcher would play and tinker with the number, as every clustering technique is an exploratory process and must be iterated upon for best results. It’s interesting that the results also show other parliamentary tactics. This one below could be interpreted as obstruction, or simply passivity or indifference. So what is it? To ask this question is to answer it, I guess.
To put it in context, this is a group of left-wing opposition representatives during a period when they were in a heavy minority.
In contrast, this is the right-wing voting machine that prevailed:
The contrast between these two groups is so dramatic that it would be funny, if these were funny affairs. While the opposition was idling away, the majority voted into existence law after law that, together, still influence the lives of the Slovenian citizenry. In the interactive version (English) you can explore what the votes were about by simply moving the mouse over the horizontal stripes.
Session attendance is another telling indicator of a particular representative’s zeal in upholding democracy and fulfilling the interests of his constituency. It’s already apparent from the charts above, but I still constructed a separate graphic for that. It’s sorted by presence and more easily readable.
It has to be noted that some representatives were excused from voting sessions for various periods of time. Among them are those who became ministers and those who replaced them in the parliamentary seat, not being there before.
Here’s an example from the recent term. At the bottom, you can see two blocks with alternating presence. That’s because there were two governments. When the first one fell, the ministers returned to their seats; those who originally replaced them, returned to the party’s roster; new ministers were sworn in and abandoned their seats; and new replacements came from opposite camp.
Another interesting statistic: representatives with the most yea or nay votes. I don’t really know how to interpret this, but I did it nevertheless. One could say that in terms with only one government, members of the ruling majority with the most yea votes are those who unquestioningly toe the party line. Conversely, those with the most nay votes are the most fervent members of the opposition. In terms with two governments, this is a little less clear-cut: one would have to separate the timelines and run the statistics on subperiods for each government. I didn’t do this, but a serious researcher would. I made this report to let them know that they are being monitored, but it’s the task of an investigative journalist to delve into the data and interpret it in a meaningful way. I don’t have time for this, and I don’t know the particulars of daily politics here well enough to be able to do that.
But I’m offering the database to anyone who would like to do that. Send me a mail for details, I’ll gladly oblige.
Here are a few simple pie charts that illustrate what I just wrote:
While programming, it struck me that I could calculate a synthetic measure that would show the unity in the parliament. The reasoning goes: if the vote was unanimous, the parliament as a whole was united in cause at hand. But if half of representatives voted yea, and the other half nay, the parliament was divided. So I constructed a timeline of all voting sessions and colored every session according to this measure. Blue for unanimous vote, red for evenly split vote, and violet hues as nuances of disharmony.
Additionally, the bar heights indicate the presence ratio. Lower heights obviously mean lower presence.
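A measure like this could be sketched in a few lines of Python. The exact formula below is an assumption for illustration (absolute difference between yea and nay, divided by the votes cast), as is the linear red-to-blue blend; the function names are hypothetical.

```python
def unity(yea, nay):
    """1.0 for a unanimous vote, 0.0 for an evenly split one."""
    total = yea + nay
    if total == 0:
        return 1.0  # nobody voted; treated as "no disagreement" in this sketch
    return abs(yea - nay) / total


def unity_color(u):
    """Blend linearly from red (u = 0) to blue (u = 1); values in between are violet."""
    red = round(255 * (1 - u))
    blue = round(255 * u)
    return (red, 0, blue)


def presence_ratio(present, seats):
    """Share of representatives present; this drives the bar height."""
    return present / seats
```

For example, unity(80, 0) is 1.0 and maps to pure blue, while unity(40, 40) is 0.0 and maps to pure red.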
In some terms, the presence falls toward the end, and the proportion of red bars increases. This means that the representatives lost heart and abandoned their posts, and those who stayed quarreled bitterly.
Here are these graphics for various terms. They are stretched to the same length. Perhaps a more correct, but less visually appealing, approach would be not to stretch them, so the length of each term would be apparent.
The drive behind this section was to find out whether the attendance is falling, as the session progresses into small hours. I found that not to be so, which is encouraging in a way. These charts at least show which sessions were bitterly contested, and which were almost unanimous. You can see examples of both behaviors in the graphic below.
Going one step further, I constructed a separate network for each session in a way that if a representative voted for a proposition, he or she is connected with it, otherwise no.
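As a rough illustration, such a two-mode network boils down to a set of representative-proposition edges. The tuple layout, the "yea"/"nay" labels and the sample votes below are assumptions for this sketch, not my actual data model.

```python
from collections import defaultdict


def session_network(votes):
    """votes: iterable of (representative, proposition, vote) tuples.
    An edge exists only where the representative voted for the proposition."""
    return {(rep, prop) for rep, prop, vote in votes if vote == "yea"}


def supporters(edges):
    """Group representatives by the propositions they are connected to."""
    groups = defaultdict(set)
    for rep, prop in edges:
        groups[prop].add(rep)
    return dict(groups)


# Hypothetical votes from one session.
votes = [
    ("Rep A", "Proposition 1", "yea"),
    ("Rep B", "Proposition 1", "nay"),
    ("Rep B", "Proposition 2", "yea"),
    ("Rep C", "Proposition 2", "yea"),
]
edges = session_network(votes)
groups = supporters(edges)
```

A layout algorithm then pulls connected nodes together, which is why representatives end up close to the propositions they supported.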
Networks are a little bit messy, and people tend to not understand them well. This network below shows three groups of representatives (you can zoom in and out in the interactive version). They are grouped close to the propositions they voted for. So this is another opportunity to find out the interest groups on the micro level, for each proposition. Some propositions don’t have a name, just a date. That’s not my fault, but the parliament’s, as they didn’t bother to publish it on the web.
Finally, here are some heatmaps for various variables, mapped on to seating orders. The first is partitioned according to representatives’ party. Sorry, no legend here. You can mouse over in the interactive version to show details.
The second is attendance heatmap. Green is full attendance, red is total absence, and there’s a linear color scale between them. This one provides at-a-glance overview of attendance of entire party blocks.
The next two are yea and nay heatmaps, so you can see which party blocks mostly voted yea, and which nay. They are normalized to their local maxima for visual appeal, but a more correct approach would be to not normalize them, so it would be apparent that a nay vote is much less frequent than a yea. I have no idea why, but I imagine there must be a lot of technical votes, for example to establish presence and so on.
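Here is a small Python sketch of the two ideas above: the linear red-to-green attendance scale, and the difference between normalizing each heatmap to its own maximum and normalizing both to a shared one. The party names and counts are invented, and the function names are hypothetical.

```python
def attendance_color(ratio):
    """Linear scale from red (total absence, 0.0) to green (full attendance, 1.0)."""
    ratio = max(0.0, min(1.0, ratio))
    return (round(255 * (1 - ratio)), round(255 * ratio), 0)


def normalize(counts, peak=None):
    """Scale counts to 0..1, by default relative to their own (local) maximum."""
    if peak is None:
        peak = max(counts.values())
    return {key: (value / peak if peak else 0.0) for key, value in counts.items()}


# Invented vote counts per party block.
yea = {"Party A": 400, "Party B": 200}
nay = {"Party A": 20, "Party B": 40}

local_nay = normalize(nay)  # Party B reaches 1.0: looks as intense as the yea map
shared_peak = max(max(yea.values()), max(nay.values()))
shared_nay = normalize(nay, peak=shared_peak)  # Party B is only 0.1: nays are rare
```

The local version makes both maps equally vivid; the shared version keeps them comparable, at the price of a very pale nay map.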
These seating orders are approximate, as I couldn’t get them for past terms from the parliament. They asserted that they didn’t have them, and claimed they don’t even have the current one, even if it’s published on their own website. There were more lies, but I won’t go into that here. They are, after all, in power, and I’m just a blogger.
Why they should engage in such behaviour is beyond me. Maybe they think that the information is theirs and should be kept from the public.
Again, if anyone needs the MongoDB database, drop me a note. My email address is on the About page.
I often wondered what the average lifetime of a pop song on the charts is. If one follows music, it becomes intuitively apparent that there are in fact several types of hits. Some stay on the charts for many weeks, and others barely make it, then immediately slip out.
So I set about discovering groups of songs with similar trends, as they moved on the weekly British Top 40 chart from 1990 to 2014. A total of 1284 different songs appeared on the charts in that period. After a series of experiments, I settled on 100 groups, a somewhat arbitrary number. Position data for each song was collected across the weeks, then the songs were grouped using k-means clustering.
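Before any clustering can happen, each song needs a vector of equal length. Here is one possible way to build such vectors in Python; it is a sketch under my own assumptions, not necessarily how the data was prepared. In particular, aligning every song to its first week on the chart and using 41 as a stand-in for "not on the chart" are choices made for this example.

```python
OFF_CHART = 41  # assumed placeholder for "not in the Top 40 this week"


def trajectories(weekly_charts, length):
    """weekly_charts: list of weekly charts, each a list of songs from position 1 down.
    Returns {song: [position in its 1st week on the chart, 2nd week, ...]},
    padded with OFF_CHART or truncated to `length` weeks."""
    positions = {}
    for week, chart in enumerate(weekly_charts):
        for rank, song in enumerate(chart, start=1):
            positions.setdefault(song, {})[week] = rank

    result = {}
    for song, by_week in positions.items():
        first = min(by_week)  # align every song to its own first chart week
        result[song] = [
            by_week.get(first + offset, OFF_CHART) for offset in range(length)
        ]
    return result


# Tiny invented chart with two positions instead of forty.
weekly_charts = [
    ["Song A", "Song B"],
    ["Song B", "Song A"],
    ["Song B", "Song C"],
]
vectors = trajectories(weekly_charts, length=3)
```

Vectors like these can then be fed to any k-means implementation, with the number of clusters set to 100.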
The result is part interactive, part static visualization, consisting of an exploratory chart and 100 small charts showing each separate group.
I’ve been following comments under Slovenian web news items for quite some time. The commenters there are well known for their animosity towards anyone who disagrees with their political worldview. Reading the comment section usually means immersing yourself in verbal filth, depravity and all imaginable kinds of hate speech.
Some time ago I wrote software for scraping comments off these websites and have been storing them in a database ever since. They are useful for a number of things. For example, I’ve had some success with stylometry (identifying commenters by their writing style, even when they post under a different name), but this is a matter for another post. I also helped SAZU compile a list of new slang words for the new Dictionary of Slovene Language.
So here’s a lighter project for people who don’t read these comments and don’t want to. If you want to see it all, just click the image below and behold the auto-generated stream for a minute or two.
Note that these are not real comments. The text is generated from two Markov chains, which have been initialized with texts of left-wing and right-wing commenters. The comments used are approximately a year old, lest someone accuse me of participating in an election campaign of some kind. The web page simply generates a few sentences from one, then from the other, and so it continues ad nauseam infinitum.
I think it’s a fitting commentary on the Slovenian mentality. Slovenian-speaking visitors will notice that, even if the texts are probabilistically computer-generated, there’s still ample hurling of insults based on the outcome of the last World War. There’s quite a lot of that.
Also, even though both sides pack serious vitriol, the right wingers use more classic hate speech, and they write comparatively worse.
Technically, it was a breeze to make. First I pulled entire corpora of selected commenters from the database in text form, then I used RiTA, a generative text tool, for initializing the two models and generating sentences. The code is very short, most of it has to do with displaying and scrolling.
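For readers curious about the principle, here is a minimal word-level Markov chain in plain Python. It is only a sketch of the general idea and does not use or imitate RiTA’s API; the function names, the order of the chain and the toy corpora are all assumptions for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, length, rng):
    """Random walk through the chain, producing at most `length` words."""
    state = rng.choice(list(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            break  # dead end: this word run only occurs at the end of the corpus
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


def stream(chain_a, chain_b, turns, length, rng):
    """Alternate between the two models, one generated snippet per turn."""
    for turn in range(turns):
        chain = chain_a if turn % 2 == 0 else chain_b
        yield generate(chain, length, rng)


# Toy corpora standing in for the two groups of commenters.
left = build_chain("a b a b a b")
right = build_chain("x y x y x y")
snippets = list(stream(left, right, turns=4, length=4, rng=random.Random(0)))
```

With real corpora, frequent word sequences get picked more often simply because they appear more often in the follower lists.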
After a year of publishing data visualizations and learning many things in the process, I think I can share a thing or two about best publishing practices that I’m so far aware of.
Here’s the Google Analytics Audience Overview for virostatiq.com in little more than a year of operation. Big spikes all come from social media. The biggest spike happened when some bigger players linked to one of my posts. Lulls in activity are because of my holidays.
There are two more charts from my site further down. One shows number of visits from social media, the other top referring sites.
In other words, you made a data visualization, and now you’d like to share it with the right people so they can appreciate the findings or the execution. Where to publish it?
Thinking about it, there are several groups one should target. Some effort is required to distribute your work appropriately.
Your target groups
Data visualization enthusiasts
People who enjoy a well executed visualization, and may care about the actual content just in terms of whether it’s well and fairly presented or not. Be aware that data visualization is a geekish thing.
There are several such communities:
Dataisbeautiful on Reddit – a subreddit dedicated to sharing data visualizations. Infographics are not accepted there. You may want to publish the link at a time that maximizes visibility for North American visitors. Subscription is required before posting.
A word of caution – you shouldn’t use Reddit exclusively for self-promotion. Posting just links to your own work will sooner or later be considered spamming, especially if you cross-post the same link to other subreddits, so tread lightly, and do engage in discussion whenever possible. The FAQ says that it’s barely acceptable to post one link to your site per five links to others. Conspicuous violation of these rules will get you banned, or worse yet, shadowbanned, which means you won’t even know your posts don’t appear on Reddit.
Visualizing.org – a very nice website allowing users to submit various kinds of visualizations. They also run competitions from time to time with awards up to $5,000. If your visualization gets featured, that can substantially add to your traffic, as they also have an active Twitter account and regularly publish what they think is best. If you make it to that list, your reach may expand significantly.
They also partner with academic organizations and big companies, so this is definitely the site to publish on.
Visual.ly – another website like Visualizing.org, but they also have a business. They will help you execute a project for a fee, but their galleries are well visited. After submitting a visualization, an editor reviews it and either approves it or not. I’ve had several rejected, mainly because they were wrongly categorized, or were in a gray zone. For example, I submitted a blog post with an interactive embedded map, but it was rejected. Then I made a separate page with the map, resubmitted it and got approved. Sometimes, they will mark a well-executed visualization as a Staff Pick, which can lead to more attention from the users.
Another thing with this site is that they may have really high-profile visitors, as in journalists who write for mainstream media. In fact, one of my most visited posts – My heart rate during latest episode of Game of Thrones – got spotted there by a Popular Science journalist, who published it in an article, which attracted more journalists from other media, and they published separate articles on their own, all linking to the original.
Another well-received post was a map of building ages in my city (a Staff Pick) , which attracted attention of Wired’s science editor, who was putting together a Wired Map blog post about such maps. After a few mails back and forth, they featured my map on Wired site (see Media page for a list of all such articles).
Visualoop – I’m not sure how to reliably submit your work to this site, but they do have a contact form at the bottom. They included me two or three times in their weekly reviews. They probably spotted what I published on other sites. I recommend following their monthly dataviz calendar of events; there are a lot of conferences and hackathons there.
Various blogs and Twitter accounts
Some specialize in infographics, not making a distinction between that and a data visualization, others post just map-related stuff, and some are just run by geeks who enjoy something novel or cool. Search for them on Google.
As for Twitter accounts, search for tags: #dataviz, #ddj, #data-journalism and such, then follow frequent posters and institutions. Also try to find accounts of data journalists and professionals, and of the technologies you used in your work, for example Sigma.js. Follow, retweet, etc … if even one of them retweets your link, you can see a hundred times more exposure than usual. Also, try to patiently build and cultivate your online community. This is an area in which I’m lacking, as I’m more work oriented. Read articles such as these.
Data visualization competitions
It’s good to apply for as many of these as you can, even if you have to pay a symbolic fee. If nothing else, you may get longlisted, and your link will be displayed on a prestigious page, thus exposing your work to more interested people.
I also recommend subscribing to the DashingD3 newsletter: go to DashingD3.com, there’s a sign-up form at the bottom. They also have another D3 newsletter for D3 freelance opportunities.
Journalists interested in data visualization
Data-driven journalism is an emerging trend. Most big publishing houses create prestigious visualizations that garner a lot of online interest. Guardian, Bloomberg, Washington Post and New York Times come to mind first, but there are many more, so there are naturally many journalists who are looking for a scoop in this area. You can find these people on Twitter, but there’s also a newsletter which some of them read. Subscribe to http://okfn.org/, and send your posts to firstname.lastname@example.org.
Join the LinkedIn’s group Data visualization. Actually, I got the idea for this post from a past discussion there. It has a ton of mostly business oriented posts and resources. You can also post your creations there. There’s also a lot of people there who might need a service you provide. More on this in the section about monetizing your work below.
Academia and government
In my experience, this is an organic thing. If your visualizations have an educational value and you regularly post them, academics will notice and contact you. I was once contacted by an official from Canada’s health department who used my findings from this post in his presentation at an international conference.
People interested in content and findings of your data visualization
This is the trickiest part, and possibly the most rewarding. There’s a lot of trial and error and improvisation involved here, but try posting on online forums and other communities, mailing to editors at news organizations, Facebook groups and such. I once posted a link to My heart rate during latest episode of Game of Thrones to a Game of Thrones forum and got overwhelming response.
Be careful though, as there’s a thin line that separates rightful enthusiasm from obnoxious spamming. In an ideal world, you would be an active member of these communities before you posted your link there.
I mentioned social networking above, but I feel this topic requires a separate treatment. Examine chart above to get an idea of relative importance of these media.
Make sure you add sharing links to all your visualizations to make it easy for visitors to share them. They probably won’t use the buttons, but some of them will be reminded of the possibility of sharing, and will do it their own way.
Facebook – try to join various data visualization groups and post there. Here are some: Gephi, Urban Data Visualization, Infographics and Data Visualization. Be aware that some of these groups are private. Also, post on your wall (obviously) and walls of organizations or pages that publish content that relates to your work, but carefully.
Another strategy is to ask friends who might be opinion makers to post links to your work. I have such a friend, and when he does it, it makes a huge difference, as in a factor of ten, because the link reaches other opinion makers, who repost it.
Google+ – consider creating a page with your efforts, and link it with your blog or site as per instructions. That will bolster your search results on Google and give you another avenue for showcasing your work and promotion. For example, here’s my page.
Twitter – I mentioned Twitter strategy above, so again, read articles such as these.
LinkedIn – join groups and post your work. This is a good place to develop business leads. Complete your profile and publish link to it on your blog.
Pinterest – create account, pin static images of your work to a panel with appropriate name, for example “data visualizations”.
Tumblr – consider creating a separate blog there and repost everything.
Digg – submit all your links. Data visualizations frequently appear on the Digg front page, although I haven’t made it there yet, so I can’t give a firsthand account of what kind of traffic you can expect.
GitHub – if you’re an accomplished programmer, clean your code and commit it there. I’m not, so I don’t. But it surely helps, especially if you manage to put together a library that others will use.
This is an area in which I don’t excel, but it’s a game which can potentially make a huge difference. Be sure to optimize your code and include meta tags. If your work is ajaxed, read the Google guidelines for indexing such sites.
For more information turn to bloggers who make money out of their sites, there are tons of super useful resources out there. I’m a one man band, so it’s hard for me to keep current on all this in addition to technology and content.
If visualizing data is more of a hobby than your primary work area, this article about reinventing yourself might boost your courage. In any case, don’t expect an avalanche of business opportunities and money from your hobby. Some might materialize though. So far I had the pleasure to do three projects for a small fee, and there’s another one in the works. A relatively well-known social network from Seattle contacted me to make a map. They saw my gallery over at Visualizing.org and proposed some business, and I accepted. Needless to say, anything made for real production must be super tight, so there was a learning opportunity.
Some friends suggested that I display ads on my site. I won’t – firstly because I don’t believe that data vis enthusiasts would click on any, and secondly because I don’t have enough traffic to warrant inclusion. It would just be silly. The most I did was to enroll in the Amazon Associates program and place some links in posts to see what kind of revenue we’d be talking about. It’s of no consequence, but I might continue to do that, if only for the information value of the books advertised.
Half a year after starting this blog, I won an award on Memefest Friendly Competition about Food Democracy. I went to Australia on their budget. That’s pretty cool. Now I consider my blog and my hobby as a potential vehicle to enrich my life in such unexpected ways.
Optimize for mobile! There are times when half of visitors on my site have mobile devices. So make your visualizations responsive, and be careful with user interface so that you catch touch events.
Do not cut corners. A week more of programming can make the difference between a featured visualization and a mediocre one that’s going to get buried under other submissions in a day.
Content is king. Ever heard this phrase? I did, but I had trouble understanding it. It means that a mediocre, but tight work on a superhot topic can be a hundred times more interesting than a perfectly executed job on uninteresting data.
Top referring sites to virostatiq.com
Here’s one last chart to sum up. It’s self-explanatory and gives a little more perspective on the topic at hand.