I’ve been following comments under Slovenian web news items for quite some time. The commenters there are well known for their animosity towards anyone who disagrees with their political worldview. Reading the comment section usually means immersing yourself in verbal filth, depravity and all imaginable kinds of hate speech.
Some time ago I wrote software for scraping comments off these websites and have been storing them in a database ever since. They are useful for a number of things. For example, I’ve had some success with stylometry (identifying commenters by their writing style, even when they post under different names), but this is a matter for another post. I also helped SAZU compile a list of new slang words for the new Dictionary of Slovene Language.
So here’s a lighter project for people who don’t read these comments and don’t want to. If you want to see it all, just click the image below and behold the auto-generated stream for a minute or two.
Note that these are not real comments. The text is generated from two Markov chains, which have been initialized with texts of left-wing and right-wing commenters. The comments used are approximately a year old, lest someone accuse me of participating in an election campaign of some kind. The web page simply generates a few sentences from one, then from the other, and so it continues ad nauseam, ad infinitum.
I think it’s a fitting commentary on the Slovenian mentality. Slovenian-speaking visitors will notice that, even though the texts are probabilistically computer-generated, there’s still ample hurling of insults based on the outcome of the last World War. There’s quite a lot of that.
Also, even though both sides pack serious vitriol, the right wingers use more classic hate speech, and they write comparatively worse.
Technically, it was a breeze to make. First I pulled the entire corpora of selected commenters from the database in text form, then I used RiTA, a generative text tool, to initialize the two models and generate sentences. The code is very short, and most of it has to do with displaying and scrolling.
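For the curious, here is a rough idea of what happens under the hood. This is not the RiTA code I used, just a minimal Python sketch of a word-level Markov chain, assuming each corpus is one plain-text string. The function names, the chain order of 2 and the placeholder corpora are all made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, rng, max_words=30):
    """Random-walk the chain from a randomly chosen starting state."""
    state = rng.choice(list(chain.keys()))
    out = list(state)
    while len(out) < max_words:
        followers = chain.get(state)
        if not followers:  # dead end: this state only occurred at the end of the corpus
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


def alternate(chains, rng, turns=4):
    """Yield (label, text) pairs, taking turns between the given chains."""
    for turn in range(turns):
        label, chain = chains[turn % len(chains)]
        yield label, generate(chain, rng)


if __name__ == "__main__":
    # Placeholder corpora; the real ones were pulled from the comment database.
    left = build_chain("you are wrong and you are always wrong and you know it")
    right = build_chain("we are right and we are always right and we know it")
    for label, text in alternate([("left", left), ("right", right)], random.Random()):
        print(label, ":", text)
```

A higher order gives more coherent but less surprising text, since the generator then copies longer runs straight from the corpus.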
After a year of publishing data visualizations and learning many things in the process, I think I can share a thing or two about best publishing practices that I’m so far aware of.
Here’s the Google Analytics Audience Overview for virostatiq.com over a little more than a year of operation. Big spikes all come from social media. The biggest spike happened when some bigger players linked to one of my posts. Lulls in activity coincide with my holidays.
There are two more charts from my site further down. One shows the number of visits from social media, the other the top referring sites.
In other words, you made a data visualization, and now you’d like to share it with the right people so they can appreciate the findings or the execution. Where to publish it?
Thinking about it, there are several groups one should target. Some effort is required to distribute your work appropriately.
Your target groups
Data visualization enthusiasts
People who enjoy a well executed visualization, and may care about the actual content just in terms of whether it’s well and fairly presented or not. Be aware that data visualization is a geekish thing.
There are several such communities:
Dataisbeautiful on Reddit – a subreddit dedicated to sharing data visualizations. Infographics are not accepted there. You may want to publish the link at a time that maximizes visibility for North American visitors. Subscription is required before posting.
A word of caution – you shouldn’t use Reddit exclusively for self-promotion. Posting just links to your work will be considered spamming sooner or later, especially if you cross-post the same link to other subreddits, so tread lightly, and do engage in discussion whenever possible. The FAQ says that it’s barely acceptable to post one link to your site per five links to others. Conspicuous violation of these rules will get you banned, or worse yet, shadowbanned, which means you won’t even know your posts don’t appear on Reddit.
Visualizing.org – a very nice website allowing users to submit various kinds of visualizations. They also run competitions from time to time with awards up to $5,000. If your visualization gets featured, that can substantially add to your traffic, as they also have an active Twitter account and regularly publish what they think is best. If you make it to that list, your reach may expand significantly.
They also partner with academic organizations and big companies, so this is definitely the site to publish on.
Visual.ly – another website like Visualizing.org, but they also have a business. They will help you execute a project for a fee, but their galleries are well visited. After submitting a visualization, an editor reviews it and either approves it or not. I’ve had several rejected, mainly because they were wrongly categorized, or were in a gray zone. For example, I submitted a blog post with an interactive embedded map, but it was rejected. Then I made a separate page with the map, resubmitted it and got approved. Sometimes, they will mark a well-executed visualization as a Staff Pick, which can lead to more attention from the users.
Another thing with this site is that they may have really high-profile visitors, as in journalists who write for mainstream media. In fact, one of my most visited posts – My heart rate during latest episode of Game of Thrones – got spotted there by a Popular Science journalist, who published it in an article, which attracted more journalists from other media, and they published separate articles on their own, all linking to the original.
Another well-received post was a map of building ages in my city (a Staff Pick), which attracted the attention of Wired’s science editor, who was putting together a Wired Map blog post about such maps. After a few mails back and forth, they featured my map on the Wired site (see Media page for a list of all such articles).
Visualoop – I’m not sure how to reliably submit your work to this site, but they do have a contact form at the bottom. They included me two or three times in their weekly reviews. They probably spotted what I published on other sites. I recommend following their monthly dataviz calendar of events, as there are a lot of conferences and hackathons there.
Various blogs and Twitter accounts
Some specialize in infographics, not making a distinction between that and a data visualization, others post just map-related stuff, and some are just run by geeks who enjoy something novel or cool. Search for them on Google.
As for Twitter accounts, search for tags: #dataviz, #ddj, #data-journalism and such, then follow frequent posters and institutions. Also try to find accounts of data journalists and professionals, and of the technologies you used in your work, for example Sigma.js. Follow, retweet, and so on. If even one of them retweets your link, you can see a hundred times more exposure than usual. Also, try to patiently build and cultivate your online community. This is an area in which I am lacking, as I’m more work oriented. Read articles such as these.
Data visualization competitions
It’s good to apply for as many of these as you can, even if you have to pay a symbolic fee. If nothing else, you may get longlisted, and your link will be displayed on a prestigious page, thus exposing your work to more interested people.
I also recommend subscribing to the DashingD3 newsletter. Go to DashingD3.com, where there’s a sign-up at the bottom. They also have another newsletter for D3 freelance opportunities.
Journalists interested in data visualization
Data-driven journalism is an emerging trend. Most big publishing houses create prestigious visualizations that garner a lot of online interest. Guardian, Bloomberg, Washington Post and New York Times come to mind first, but there are many more, so there are naturally many journalists who are looking for a scoop in this area. You can find these people on Twitter, but there’s also a newsletter which some of them read. Subscribe to http://okfn.org/, and send your posts to firstname.lastname@example.org.
Join the LinkedIn group Data visualization. Actually, I got the idea for this post from a past discussion there. It has a ton of mostly business oriented posts and resources. You can also post your creations there. There are also a lot of people there who might need a service you provide. More on this in the section about monetizing your work below.
Academia and government
In my experience, this is an organic thing. If your visualizations have educational value and you post them regularly, academics will notice and contact you. I was once contacted by an official from Canada’s health department who used my findings from this post in his presentation at an international conference.
People interested in content and findings of your data visualization
This is the trickiest part, and possibly the most rewarding. There’s a lot of trial and error and improvisation involved here, but try posting on online forums, Facebook groups and other communities, and mailing editors at news organizations. I once posted a link to My heart rate during latest episode of Game of Thrones to a Game of Thrones forum and got an overwhelming response.
Be careful though, as there’s a thin line that separates rightful enthusiasm from obnoxious spamming. In an ideal world, you would be an active member of these communities before you posted your link there.
I mentioned social networking above, but I feel this topic requires a separate treatment. Examine the chart above to get an idea of the relative importance of these media.
Make sure you add sharing links to all your visualizations to make it easy for visitors to share them. They probably won’t use the buttons, but some of them will be reminded of the possibility of sharing, and will do it their own way.
Facebook – try to join various data visualization groups and post there. Here are some: Gephi, Urban Data Visualization, Infographics and Data Visualization. Be aware that some of these groups are private. Also, post on your wall (obviously) and walls of organizations or pages that publish content that relates to your work, but carefully.
Another strategy is to ask friends who might be opinion makers to post links to your works. I have such a friend, and when he does it, it makes a huge difference, something like a factor of ten, because the link reaches other opinion makers, who then repost it.
Google+ – consider creating a page with your efforts, and link it with your blog or site as per instructions. That will bolster your search results on Google and give you another avenue for showcasing your work and promotion. For example, here’s my page.
Twitter – I mentioned Twitter strategy above, so again, read articles such as these.
LinkedIn – join groups and post your work. This is a good place to develop business leads. Complete your profile and publish link to it on your blog.
Pinterest – create an account and pin static images of your work to a board with an appropriate name, for example “data visualizations”.
Tumblr – consider creating a separate blog there and repost everything.
Digg – submit all your links. Data visualizations frequently appear on the Digg front page, although I haven’t made it there yet, so I can’t give a firsthand account of what kind of traffic you can expect.
GitHub – if you’re an accomplished programmer, clean your code and commit it there. I’m not, so I don’t. But it surely helps, especially if you manage to put together a library that others will use.
This is an area in which I don’t excel, but it’s a game which can potentially make a huge difference. Be sure to optimize your code and insert meta tags. If your work is ajaxed, read the Google guidelines for indexing such sites.
For more information, turn to bloggers who make money out of their sites. There are tons of super useful resources out there. I’m a one man band, so it’s hard for me to keep current on all this in addition to technology and content.
If visualizing data is more of a hobby than your primary work area, this article about reinventing yourself might boost your courage. In any case, don’t expect an avalanche of business opportunities and money from your hobby. Some might materialize though. So far I’ve had the pleasure of doing three projects for a small fee, and there’s another one in the works. A relatively well-known social network from Seattle contacted me to make a map. They saw my gallery over at Visualizing.org and proposed some business, and I accepted. Needless to say, anything made for real production must be super tight, so there was a learning opportunity.
Some friends suggested that I display ads on my site. I won’t – firstly because I don’t believe that data vis enthusiasts would click on any, and secondly because I don’t have enough traffic to warrant inclusion. It would just be silly. The most I did was to enroll in the Amazon Associates program and place some links in posts to see what kind of revenue we’d be talking about. It’s of no consequence, but I might continue to do that, if only for the information value of the books advertised.
Half a year after starting this blog, I won an award at the Memefest Friendly Competition about Food Democracy. I went to Australia on their budget. That’s pretty cool. Now I consider my blog and my hobby as a potential vehicle to enrich my life in such unexpected ways.
Optimize for mobile! There are times when half of the visitors on my site are on mobile devices. So make your visualizations responsive, and be careful with the user interface so that you catch touch events.
Do not cut corners. A week more of programming can make the difference between a featured visualization and a mediocre one that’s going to get buried under other submissions in a day.
Content is king. Ever heard this phrase? I did, but I had trouble understanding it. It means that a mediocre, but tight work on a superhot topic can be a hundred times more interesting than a perfectly executed job on uninteresting data.
Top referring sites to virostatiq.com
Here’s one last chart to sum up. It’s self-explanatory and gives a little more perspective on the topic at hand.
This is an attempt at visualizing different conspiracy theories. The visualization tries to show the interconnectedness of actors, organizations and concepts in each one, so a network graph was chosen as a mode of presentation. The presented theories are: The Antigravity Drive, Chemtrails, The Cabal (American deep state from JFK assassination to 9/11), The Illuminati/New World Order, and the most recent, the Malaysia Airlines Flight MH370 disappearance. In a way, it’s a progression from the previous network visualization about the PRISM scandal, which was once also considered a conspiracy theory.
I chose this topic because those theories have always attracted me as alternative explanations of things that I couldn’t understand in the official versions of events. That is not to say that I necessarily believe in any of them. For example, I’d be hard pressed to believe in the Moon Landing Hoax theory, which I first included here because of the relative ease of gathering source material, but later discarded because of its relatively low value. The Flight MH370 theory has extremely low credibility too, and I wonder what I’ll think when this post is a year old.
A conspiracy, according to Wikipedia, “… may also refer to a group of people who make an agreement to form a partnership in which each member becomes the agent or partner of every other member and engage in planning or agreeing to commit some act.” This is a pretty broad definition. It can apply to a government, a company, or any group of people who are trying to further an agenda, be it good or bad for their natural or social environment. But anything labelled as a conspiracy almost always has an evil association, for example “A civil conspiracy or collusion is an agreement between two or more parties to deprive a third party of legal rights or deceive a third party to obtain an illegal objective.” (Wikipedia – civil conspiracy), or “In criminal law, a conspiracy is an agreement between two or more persons to commit a crime at some time in the future.” (Wikipedia – criminal conspiracy).
A conspiracy theory is therefore an attempt at explaining a real or imagined conspiracy. In this sense, even official stories of various incidents are conspiracy theories, unless they are well founded in evidence and irrefutable facts. In a free society, a kind of market of conspiracy theories then forms, in which those with better means, but also more vested interests, compete for the public’s attention with other bodies of citizenry, whose interests and aims can differ significantly. For example, a government can execute a false flag attack, as the Nazis did in Poland at the beginning of WW2, and spin a theory that the other party did it, in order to go to war and grab land. The public may then be motivated to concoct a variety of counter theories with various motives – simply seeking the truth, overthrowing the government by exposing the lies it tells, furthering some commercial agenda, for example selling books, or purely personal paranoid agendas, which serve no one but the authors and their need to sustain their delusions.
Let me briefly explain the theories I used in this visualization. The first two are quite believable.
The Cabal: the story of American deep state and events from JFK assassination to 9/11 attacks
A story of how the Nazi regime allegedly developed a form of anti gravity propulsion in total secrecy, made possible by a strictly compartmentalized environment, imposed on the German war production efforts by the SS. The technology was then seized by the US military and other allies after the war and developed further in utmost secrecy. The first such machines ever seen were the so-called foo fighters. These balls of light, sighted and documented by various US Air Force pilots, flew in parallel with bombers and fighter planes, and frequently executed seemingly impossible air maneuvers. Also mentioned are a mythical machine, The Glocke (The Bell), which ran on red mercury and was responsible for the deaths of several scientists due to the extreme radiation it produced, and the discoveries of Viktor Schauberger. His implosion engine, which drew heavily on vortex physics, was allegedly successful, and produced two flying prototypes. The US military immediately grabbed and classified much of this work, and it remains secret to this day. It’s said to be employed in the B-2 bomber and various flying craft sighted around Area 51 in Nevada. The story also goes on to mention modern experiments in anti gravity physics, notably those performed by Evgeniy Podkletnov, who allegedly succeeded in reducing gravity over a spinning superconducting electromagnet by two percent.
How a handful of secret societies dominate the world. The plot allegedly has its roots in The Bavarian Illuminati society, started in the eighteenth century by Adam Weishaupt. They were eradicated, but some claim they survived in a covert form, forging an alliance with international bankers. Most big world events since then were supposedly planned in advance, among them the advent of Communism, Nazism and Zionism, and both World Wars, with a third one planned too. Says Pike: “The Third World War must be fomented by taking advantage of the differences caused by the “agentur” of the “Illuminati” between the political Zionists and the leaders of Islamic World. The war must be conducted in such a way that Islam (the Moslem Arabic World) and political Zionism (the State of Israel) mutually destroy each other. Meanwhile the other nations, once more divided on this issue will be constrained to fight to the point of complete physical, moral, spiritual and economical exhaustion…We shall unleash the Nihilists and the atheists, and we shall provoke a formidable social cataclysm which in all its horror will show clearly to the nations the effect of absolute atheism, origin of savagery and of the most bloody turmoil.”
In recent times, the organizations said to further Illuminati goals are the Council on Foreign Relations, the Trilateral Commission and the Bilderbergers. Here are some books: The Illuminati: Facts & Fiction by Mark Dice, and The Illuminati original by Adam Weishaupt.
Malaysia Airlines Flight MH370 disappearance
A recent theory about the whereabouts of the missing plane. On board, there seemed to be an awful lot of technical personnel involved in developing military hardware. They supposedly worked for a company named Freescale Semiconductor, which was in a patent dispute with the Rothschild family. According to the story, Israeli agents and elements of the US military hijacked the plane and secretly flew it to the Diego Garcia military base in the Indian Ocean to debrief the experts and possibly use the plane in another 9/11-style attack in the future.
Construction and visualization of networks
A few words for the technologically minded. The networks were constructed by text-mining the source material, isolating known entities in sentences by means of massive dictionaries, connecting them in subnetworks (each sentence – one subnetwork), and finally adding them to the master network for that topic. Only sentence-length subnetworks were constructed. Connecting entities across whole paragraphs might surface more relationships, but it would yield an overly convoluted master network, so I stayed with sentences for clarity.
The dictionaries were automatically generated from source texts, then edited. Many synonyms had to be added, since my dictionary generating technique relies more on brute force than on semantic aspects of text. Again, the connections are not semantic, which means that if there was a sentence “The Illuminati are NOT connected with the CFR”, Illuminati and CFR would still be connected. Here I’m relying on the power of statistics: in the majority of sentences, the entities that appear together really are connected. For the minority in which they are not, the bonds between them are too weak to influence the big picture.
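To make the sentence-level approach concrete, here is a minimal Python sketch of it. It is not the project’s actual code. The tiny dictionary, the function names and the naive sentence splitter are assumptions for illustration. Every pair of dictionary entities found in the same sentence gets its edge weight increased by one, regardless of what the sentence actually says.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical dictionary: lowercase surface form -> canonical entity name.
# The real dictionaries were far larger and included many synonyms.
ENTITY_DICT = {
    "illuminati": "Illuminati",
    "cfr": "CFR",
    "bilderbergers": "Bilderbergers",
}


def entities_in(sentence, dictionary):
    """Return the set of canonical entities whose surface forms occur in the sentence."""
    lowered = sentence.lower()
    found = set()
    for surface, canonical in dictionary.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            found.add(canonical)
    return found


def build_network(text, dictionary):
    """Add one sentence-sized subnetwork at a time into a weighted edge list."""
    edges = Counter()
    # Naive splitter: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found = sorted(entities_in(sentence, dictionary))
        for pair in combinations(found, 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    sample = ("The Illuminati work with the CFR. "
              "The Illuminati are NOT connected with the CFR. "
              "The Bilderbergers met.")
    print(build_network(sample, ENTITY_DICT))
```

Note that the negated sentence still strengthens the Illuminati–CFR edge, which is exactly the non-semantic behaviour described above.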
I did try to process volumes of texts with a natural language processing framework, namely Apache OpenNLP, but got frustrated with the amount of work that would be needed for this little hobby project. I’d need to train the classifiers to extract named entities, which is no small feat, and I’d probably not use them again. To gain some insight into the types of connections between these entities, I tried parsing the sentences into parse trees and then extracting relationships, but parsing tech is not very accurate. It would probably do, again relying on the power of statistics, but the sheer number of relationship types would add little to the visual value of the graphs, so I decided that I’d do this with a simpler project first. The logic I wrote is still in the project source code, so if anyone is interested, mail me (About page) and I’ll send it your way. Same goes for the graph files and the categorized dictionaries.
Finally, the topic networks were exported as subgraphs, so that every node in the network is represented by a subgraph. These subgraphs are added into – or removed from – the master graph by the client. The networks in the browser are managed by sigma.js. Preliminary analysis was done in Gephi. I recommend Network Graph Analysis and Visualization with Gephi by Ken Cherven.
Additionally, geographic entities were extracted for each node. These are represented on a small map at the bottom of the screen. The map is managed by d3.js.
Interacting with visualization
There are two modes – reading the story or exploring on your own. Switch between them by clicking a button at the top right of the graph. While you read the story, the graph will change in real time as you scroll the text down. If you choose to explore, you can click on terms, and their subgraphs will be interactively added to the master graph.
Clicking on a graph node will expand it (load its associated nodes and display them, if previously not loaded), or delete it, if it was already loaded, at the same time showing the text from which its existence was text-mined.
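The expand-or-delete behaviour can be sketched in a few lines. The real thing runs in the browser on top of sigma.js. What follows is only a Python illustration of the bookkeeping, with a made-up class name and the assumption that a node stays visible as long as some still-expanded subgraph contains it.

```python
class MasterGraph:
    """Tracks which per-node subgraphs are currently merged into the visible graph."""

    def __init__(self, subgraphs):
        # subgraphs: node -> set of neighbouring nodes (one exported subgraph per node)
        self.subgraphs = subgraphs
        self.expanded = set()

    def toggle(self, node):
        """Expand the node's subgraph if it is not loaded, otherwise remove it."""
        if node in self.expanded:
            self.expanded.remove(node)
        else:
            self.expanded.add(node)

    def edges(self):
        """Undirected edges contributed by all currently expanded subgraphs."""
        return {
            frozenset((node, neighbour))
            for node in self.expanded
            for neighbour in self.subgraphs.get(node, ())
        }

    def nodes(self):
        """Every node that appears in at least one visible edge."""
        return {n for edge in self.edges() for n in edge}
```

For example, with subgraphs `{"A": {"B", "C"}, "B": {"A", "D"}}`, toggling A and then B shows all four nodes, and toggling A again drops only C, because A is still part of B’s subgraph.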
There’s no way for the user to control the map. It’s there for informative and decorative purposes.
An article appeared that attempted to expose the questionable practices of some Slovenian entrepreneurs. The scheme goes like this: establish a company, perform some work, bleed it dry, then establish a new one and move all workers into it, at the same time avoiding paying benefits and a sizable portion of salaries. When the new company has served its purpose, establish another one, and so on, as far as it goes. These companies are frequently registered at the same address.
The article says that there are as many as 120 companies registered in one residential building. But because of a weakness in the law, state inspectors can’t put an end to such practices.
I wanted to see these addresses on the map, so here’s an attempt. For every address with more than five companies, there’s a dot, with color and radius proportional to the number of companies registered there. The biggest dots represent business buildings, in which predominantly legitimate businesses reside. My data sources didn’t allow for filtering out just residential buildings.
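The filtering and sizing step is simple enough to sketch. This is a Python illustration, not the code behind the map. The function name, the input format (one address string per registered company) and the scale factor are assumptions.

```python
from collections import Counter


def address_markers(company_addresses, threshold=5, radius_per_company=0.5):
    """Return one marker per address that hosts more than `threshold` companies.

    company_addresses: iterable with one address string per registered company.
    The radius grows linearly with the number of companies at the address.
    """
    counts = Counter(company_addresses)
    return [
        {"address": address, "companies": n, "radius": n * radius_per_company}
        for address, n in counts.most_common()  # largest addresses first
        if n > threshold
    ]


if __name__ == "__main__":
    # Made-up addresses, for illustration only.
    addresses = ["Street X 1"] * 6 + ["Street Y 2"] * 5 + ["Street Z 3"] * 120
    for marker in address_markers(addresses):
        print(marker)
```

A linear radius makes the area grow with the square of the count, so a square-root scale may be a fairer choice if the biggest dots swamp the map.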
You can see the standalone map here. (In Slovene.)
Clicking on a marker displays a popup with a list of companies, sorted by date of establishment – youngest first. There’s also a chart of predominant business categories at that address. The categories that the article mentions as most prone to the scheme in question are Construction and Retail. So even if this map can’t really show the locations of these questionable companies, it can maybe help with their discovery. If there’s a big dot with predominantly these categories, there’s a certain possibility that some of these fraudulent companies are there.
Most addresses shown here of course don’t have anything to do with any illegal activity.
The Global Gender Gap Index examines the gap between men and women in four fundamental categories (subindexes): Economic Participation and Opportunity, Educational Attainment, Health and Survival and Political Empowerment. Table 1 displays all four of these subindexes and the 14 different indicators that compose them, along with the sources of data used for each.
I thought it would be nice to try to visualize the data and make it as interactive as I could, and learn d3.js in the process. I actually tried to mobilize all the data in the report, which one can see in graphical form by clicking on countries on the world map, or selecting the categories in the dropdown.
There are several categories:
In addition to that, I calculated the differences between 2013 and previous years. These maps are also accessible through the dropdown menu, or simply by scrolling up and down.
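The year-over-year differences amount to a subtraction per country. Here is a small Python sketch of that step, assuming the scores are available as a nested dictionary. The function name, the data layout and the numbers are placeholders, not values from the report.

```python
def score_changes(scores, latest=2013):
    """For each country, return {earlier_year: latest_score - earlier_score}.

    scores: country -> {year: index score}. Countries without a score
    for the latest year are skipped.
    """
    changes = {}
    for country, by_year in scores.items():
        if latest not in by_year:
            continue
        changes[country] = {
            year: round(by_year[latest] - score, 4)
            for year, score in sorted(by_year.items())
            if year != latest
        }
    return changes


if __name__ == "__main__":
    # Placeholder numbers, not taken from the report.
    sample = {"Country A": {2006: 0.70, 2012: 0.74, 2013: 0.75}}
    print(score_changes(sample))
```

A positive value means the gap has narrowed since that year, a negative one that it has widened.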
Economic Participation and Opportunity
This subindex is captured through three concepts: the participation gap, the remuneration gap and the advancement gap. The participation gap is captured using the difference in labour force participation rates. The remuneration gap is captured through a hard data indicator (ratio of estimated female-to-male earned income) and a qualitative variable calculated through the World Economic Forum’s Executive Opinion Survey (wage equality for similar work). Finally, the gap between the advancement of women and men is captured through two hard data statistics (the ratio of women to men among legislators, senior officials and managers, and the ratio of women to men among technical and professional workers).
Educational Attainment
In this subindex, the gap between women’s and men’s current access to education is captured through ratios of women to men in primary-, secondary- and tertiary-level education. A longer-term view of the country’s ability to educate women and men in equal numbers is captured through the ratio of the female literacy rate to the male literacy rate.
Health and Survival
This subindex provides an overview of the differences between women’s and men’s health. To do this, we use two indicators. The first is the sex ratio at birth, which aims specifically to capture the phenomenon of “missing women” prevalent in many countries with a strong son preference. Second, we use the gap between women’s and men’s healthy life expectancy, calculated by the World Health Organization. This measure provides an estimate of the number of years that women and men can expect to live in good health by taking into account the years lost to violence, disease, malnutrition or other relevant factors.
Political Empowerment
This subindex measures the gap between men and women at the highest level of political decision-making, through the ratio of women to men in minister-level positions and the ratio of women to men in parliamentary positions. In addition, we include the ratio of women to men in terms of years in executive office (prime minister or president) for the last 50 years. A clear drawback in this category is the absence of any indicators capturing differences between the participation of women and men at local levels of government. Should such data become available at a global level in future years, they will be considered for inclusion in the Global Gender Gap Index.
Out of the 110 countries that have been involved every year since 2006, 95 (86%) have improved their performance over the last four years, while 15 (14%) have shown widening gaps. Ten countries have closed the gap on both the Health and Survival and Educational Attainment subindexes. No country has closed the economic participation gap or the political empowerment gap. On the Economic Participation and Opportunity subindex, the highest-ranking country (Norway) has closed over 84% of its gender gap, while the lowest ranking country (Syria) has closed only 25% of its economic gender gap. There is similar variation in the Political Empowerment subindex. The highest-ranking country (Iceland) has closed almost 75% of its gender gap whereas the two lowest-ranking countries (Brunei Darussalam and Qatar) have closed none of the political empowerment gap according to this measure.