
Mapping parking infractions in Manhattan, NYC, by car make


This is a technical explanation of the procedure used to map parking infractions in Manhattan for every available car make. To see the interactive visualization, click here, or click the image below. Otherwise, read on.

Heatmaps for Audi and Bentley

Last year I published an Android app to help Slovenian drivers avoid areas frequently inspected by parking wardens. It works by geolocating the user and plotting issued parking tickets in the vicinity, with a breakdown by month, time of day and temperature on another screen. It was not a huge hit, but it did reasonably well for such a small country and no marketing budget.

I was thinking of making a version for New York City, but then abandoned the project. These visualizations are all that remains of it.

I started by downloading the data from the New York Open Data repository. It’s here. The data is relatively rich, but it’s not geocoded. As luck would have it, Mapbox had just rolled out a batch geocoder at the time, and it was free with no quotas. So I quickly sent around 100,000 addresses through it and saved the results in a database for later use. The processed result is now available on the Downloads page in the form of JSON files, one per car make.
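
For the curious, here is a minimal sketch of the geocoding step against Mapbox’s standard v5 geocoding endpoint; the batch geocoder worked on the same principle, and the token below is a placeholder:

```python
# Rough sketch of the geocoding step (placeholder token); each address
# is resolved to a (longitude, latitude) pair via Mapbox's v5 endpoint.
import urllib.parse

import requests

TOKEN = "pk.your_mapbox_token"  # placeholder
URL = "https://api.mapbox.com/geocoding/v5/mapbox.places/{}.json"

def geocode(address: str):
    """Return (longitude, latitude) for an address, or None if no match."""
    resp = requests.get(URL.format(urllib.parse.quote(address)),
                        params={"access_token": TOKEN, "limit": 1})
    features = resp.json().get("features", [])
    return tuple(features[0]["center"]) if features else None
```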

The actual drawing procedure was easier than I expected. I downloaded street data from the New York GIS Clearinghouse and edited out everything but Manhattan with QGIS.

First I tried a promising matrix approach, but I was unable to rotate the heatmap so that it made sense. Here’s an example for Audi:

Matrix – Audi

As you can see, it is a heatmap, but it doesn’t look very good.

So I wrote a Python script that went through all street segments and awarded a segment a point for every infraction closer than 100 meters. Then I just used matplotlib to draw all the street segments, coloring them relative to the maximum segment value.
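
A minimal sketch of that scoring and drawing step, assuming shapely geometries in a meter-based projection (the file names and colormap are placeholders):

```python
# Minimal sketch of the segment-scoring and drawing idea; file names and
# colormap are placeholders, and coordinates are assumed to be in meters.
import json

import matplotlib.pyplot as plt
from shapely.geometry import LineString, Point

RADIUS_M = 100  # an infraction counts if it is within 100 m of a segment

with open("manhattan_segments.json") as f:   # hypothetical input file
    segments = [LineString(c) for c in json.load(f)]
with open("audi_infractions.json") as f:     # hypothetical input file
    infractions = [Point(xy) for xy in json.load(f)]

# Award each segment a point per infraction within RADIUS_M.
scores = [sum(1 for pt in infractions if seg.distance(pt) <= RADIUS_M)
          for seg in segments]

# Draw every segment, colored relative to the maximum segment value.
vmax = max(scores) or 1
cmap = plt.cm.inferno
fig, ax = plt.subplots(figsize=(6, 12))
for seg, score in zip(segments, scores):
    xs, ys = seg.xy
    ax.plot(xs, ys, color=cmap(score / vmax), linewidth=0.5)
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("audi_heatmap.png", dpi=200)
```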

A result for Audi now looks like this:

Audi

All that remained was drawing the required images for the animated GIFs, one for every hour of the day for every car make. This was done with minimal modifications to the original script (I learned Python multithreading in the process). The resulting images were then converted to animated GIFs with ImageMagick.
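
Roughly, the batch rendering and GIF assembly looked like this (a sketch with placeholder names; draw_map stands for a per-hour variant of the drawing routine above):

```python
# Hypothetical batch-rendering sketch: render one frame per hour per car
# make in a thread pool, then assemble each GIF with ImageMagick's convert.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

CAR_MAKES = ["audi", "bentley"]  # placeholder list of makes

def draw_map(make, hour, out):
    """Placeholder for a per-hour variant of the drawing routine above."""
    ...

def render(job):
    make, hour = job
    draw_map(make, hour, out=f"frames/{make}_{hour:02d}.png")

jobs = [(make, hour) for make in CAR_MAKES for hour in range(24)]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(render, jobs))

# Stitch the 24 hourly frames of each make into an animated GIF.
for make in CAR_MAKES:
    frames = sorted(glob.glob(f"frames/{make}_*.png"))
    subprocess.run(["convert", "-delay", "40", "-loop", "0",
                    *frames, f"gifs/{make}.gif"], check=True)
```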

The whole procedure took approximately 12 hours of calculating and rendering time on an i7-6700 with 32 GB RAM. I could probably have shaved several hours off that, but I just let it run overnight.

See the interactive version here, and tell me what you think in the comments, if you feel like it.


Densities of different tree species in Ljubljana on minimaps


This is a rework of the visualization I did for the Dnevnik newspaper. The Ljubljana city government was generous enough to give us a database with the species and location of every tree within city limits. I thought it would be nice to render every species on its own map, so that the distributions can be compared.

Instead of just drawing a point where each tree is, I calculated the distance from each building to all the trees and incremented the building’s “score” for every tree within 150 meters. Then I colored the buildings according to the score – the darker the green, the more trees in the vicinity.
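
A minimal sketch of the scoring idea, again assuming shapely geometries in a meter-based projection:

```python
# Minimal sketch of the building-scoring idea; geometry names are
# placeholders, coordinates are assumed to be in meters.
from shapely.geometry import Point, Polygon

RADIUS_M = 150  # a tree counts toward a building within this distance

def tree_score(building: Polygon, trees: list[Point]) -> int:
    """Count the trees of one species within RADIUS_M of a building."""
    return sum(1 for tree in trees if building.distance(tree) <= RADIUS_M)

# Toy example: a square building and two trees, one of them in range.
building = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
print(tree_score(building, [Point(50, 50), Point(500, 0)]))  # -> 1
```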

See the detailed version by clicking here or the image below.

Tree densities: example minimaps

The aforementioned article depicted the areas with a higher potential for causing allergic reactions due to the specific tree species growing there, but it also has a detailed map with every building colored in proportion to its distance from the trees in its vicinity.

See the article by clicking here or the image.

Dnevnik article screenshot


Maps of various name suffix densities in USA and Slovenia


This post and these maps were inspired by Moritz Stefaner’s -ach, -ingen, -zell. I firmly believe in giving credit where it is due, so there it is.

That said, I embarked on a similar adventure, first for Slovenia. The etymology of Slovenian towns and other populated places may differ a little from the German one, so I was naturally curious what it would look like on a map. I had several geo files for Slovenia around, and also a comprehensive list of all populated places with coordinates, making this a relatively short endeavour.

In addition to common suffixes, I also extracted common prefixes. This is because many Slovenian place names begin with “Gornja” (Upper) or “Velika” (Great), so I wanted to see if there are meaningful spatial distributions of these names. It turns out that there are.

For example, this one. By columns: “gornja” (a variant of “upper”) vs. “dolnja” (a variant of “lower”); “zgornja” and “dolnja” (another pair of variations on the same dichotomy); and “velika” vs. “mala” (“great” and “small”). It’s apparent that places with these prefixes have characteristic spatial distributions. Why, I don’t know. Dialects of Slovenian vary wildly, to the point that some of them are virtually incomprehensible to me.

To see the interactive version with more maps, click here, or click the image. Switch between prefixes and suffixes using links in the upper left square.

Distribution of places with some common prefixes

Having written the code and downloaded the geonames.org database, it was just a matter of changing a few things to produce a similar map of a similar distribution for the USA. I colored it a little differently, but it’s basically the same thing.

Again, click here or the image for the interactive version. Note that you can click the little link above each map to display the list of place names.

Density of places with specific suffixes in the USA

Then a friend and coworker of mine said he had always wondered about the distribution of U.S. towns with names borrowed from European places. That would effectively show the distribution of immigration in the early history of the USA, with the exception of Spanish names, which for historical reasons cluster near the Mexican border, plus some random noise from string matching.

Here’s an image. Click here or map for interactive version.

Borrowed names distribution in US
Check out the maps! Some technical details: the maps were drawn with d3, and the hexagons were produced with the hex-binning plugin.

Name matching was not a big challenge, but I did want to find unique suffixes. So I wrote some code to first isolate the most frequently occurring seven-character suffixes, then gradually shorten them until a big drop-off occurred – say, more than 50 places. That way I prevented near-duplicates from being included – for example “-ville”, “-ille” and “-lle”, which have approximately the same distribution, so only one of them gets a place on the map.
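
In sketch form, with the drop-off threshold and frequency cutoff as assumptions:

```python
# Hedged sketch of the suffix-collapsing idea: starting from a frequent
# seven-character ending, shorten one character at a time and stop just
# before a shorter form suddenly matches many more places (the drop-off).
from collections import Counter

def suffix_counts(names, k):
    """How many names end in each k-character suffix."""
    return Counter(n[-k:].lower() for n in names if len(n) >= k)

def collapse(names, suffix, dropoff=50):
    """Shorten `suffix` while the shorter form adds few new places."""
    best = suffix
    while len(best) > 2:
        shorter = best[1:]
        gained = (suffix_counts(names, len(shorter))[shorter]
                  - suffix_counts(names, len(best))[best])
        if gained > dropoff:
            break  # the shorter form is a different, broader family
        best = shorter
    return best

def frequent_suffixes(names, min_count=30):
    """Collapse every frequent 7-char suffix, then deduplicate, so of
    '-ville', '-ille', '-lle' only one representative survives."""
    return {collapse(names, s)
            for s, c in suffix_counts(names, 7).items() if c >= min_count}
```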

The biggest challenge was in fact generating a hex grid within the borders and then fitting the data into it. That’s the reason the pages need some time to load. I brute-forced it by generating points inside the bounding box and checking with turf.js whether each one lies within the polygon in question, then setting all hexagon lengths to zero, and finally filling them with real data.
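
The pages do this with turf.js in the browser; the same brute-force idea looks like this in Python with shapely (a sketch, not the actual client code):

```python
# Brute-force hex-grid generation: walk a pointy-top hex grid over the
# bounding box and keep only the centers inside the border polygon.
import math
from shapely.geometry import Point, Polygon

def hex_centers_inside(border: Polygon, radius: float):
    """Return hex-grid center points that fall inside the border."""
    minx, miny, maxx, maxy = border.bounds
    dx = radius * math.sqrt(3)  # horizontal spacing between centers
    dy = radius * 1.5           # vertical spacing between rows
    centers, row, y = [], 0, miny
    while y <= maxy:
        x = minx + (dx / 2 if row % 2 else 0)  # offset every other row
        while x <= maxx:
            if border.contains(Point(x, y)):
                centers.append((x, y))
            x += dx
        y += dy
        row += 1
    return centers
```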

Hope you enjoy the maps!


Could computers vote instead of parliamentary representatives?


Haha, what a funny question. Of course they can’t. How can one teach a computer all the intricacies of lawmaking process, and trust it well enough to let it vote? This must surely be a recipe for disaster.

Yet, as I realized in my previous research, the parties mostly demand ruthless discipline from their parliamentary representatives at voting time, simply to be able to govern at all in Slovenia’s multiparty democracy, where there is never an absolute winner. This leads to coalition governments, where every vote counts towards a majority.

That means that in a polarized parliament, one could theoretically predict a representative’s vote by examining the votes cast by all the other representatives. If an opposition party proposes to vote on an act, it’s very likely that members of the government bloc will uniformly, or at least predominantly, vote against it, and vice versa. There are a few exceptions to that rule, namely some profoundly ethical decisions, on which the majority parties will let their members vote according to conscience. But those are few and far between.

Fun with neural networks

I decided to test this by modeling some representatives as neural networks and training the networks on voting sessions and their outcomes from the beginning of the parliamentary term.

The model for each representative was fed the votes of every other rep as input, with his or her own vote as the desired output. This was repeated over all hundred training sessions until the model converged (loss fell below 0.05).

It was then shown voting sessions it hadn’t seen yet and tasked with predicting the outcomes.

The results are shown in images below. For each representative, the image contains:

  • name and party,
  • training vector (the votes he/she cast in the first 100 voting sessions – red for “against”, blue for “in favor”, yellow for absence for whatever reason),
  • actual votes (400 votes the network hasn’t seen and was trying to predict),
  • predicted votes (how the neural network thought the representative would vote), and
  • difference indicator (with red rectangles for wrong prediction, green rectangles for correct prediction, and yellow rectangles for absence)

I didn’t bother too much with statistics to see who was the most predictable, nor did I try to predict voting for every rep.

In short, those with a mainly green bottom strip were the most predictable.

Government coalition

DeSUS

predict_kopmajer, predict_hrsak_ivan, predict_jenkojana, predict_karl_erjavec

SMC

predict_blazic_srecko, predict_brglez_miran, predict_dekleva_erika, predict_kolesa_anita, predict_lay_franc, predict_zorman_branko, predict_vervega_vesna

SD

predict_mursic_bojana, predict_nemec_matjaz

Opposition

NSi

predict_novak_ljudmila, predict_tonin_matej

IMNS

predict_horvat_jozef, predict_goncz_laszlo

SDS

predict_jansa_janez, predict_bah_zibert_anja, predict_cus_andrej, predict_breznik_franc, predict_godec_jelka, predict_mahnic_zan, predict_podkrajsek_bojan, predict_pojbic_marijan, predict_sircelj_andrej

ZL

predict_vatovec_matej, predict_trcek_franc, predict_hanzek_matjaz, predict_kordis_miha

ZAAB

predict_bratusek_alenka

predict_mesec_luka, predict_moderndorfer_jani, predict_vilfan_peter

A cursory examination of results yields several realizations:

  • even in the best predictions with the lowest error rates, the model doesn’t predict absences well, especially for representatives with a low incidence of absence in the training data. This is intuitively understandable on two levels: first, it’s hard for the network to generalize something it never observed, and second, absences can happen on a human whim, which is unreachable for a mathematical model. For representatives of opposition parties, who frequently engage in obstruction as a legitimate tactic, the model fares a little better.
  • the model best predicts the voting behavior of majority party (SMC) members.
  • the model utterly fails to predict anything for representatives who were absent during the training period (duh).

So, could we substitute the actual representatives with simple neural networks? Not with this methodology. The problem is that we need everyone else’s votes in the same session to predict the modeled rep’s vote, so at prediction time we already have their actual vote. We have no way of inferring votes from scratch, or from previous votes alone.

We could, in theory, try to predict each rep’s vote independently of the others by training the network on the texts of the proposed acts. I speculate that a deeper network could correlate vectorized keywords in the training texts with voting outcomes, and would then be able to predict voting for each rep based on previously unseen texts. Maybe I’ll do that when I get the texts and learn a bit more. It’s still the ANN 101 period for me.

I used a simple multilayer perceptron with 98 inputs (there have been 99 representatives in this term, counting also current ministers and substitutes), a hidden layer of 60 neurons, and a softmax classifier at the end.
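
In sketch form, with scikit-learn standing in for convnetjs and random stand-in data where the real voting matrices would go:

```python
# Sketch of the described architecture (98 inputs, one hidden layer of
# 60 neurons, softmax over the vote classes); scikit-learn stands in
# for convnetjs, and the random arrays stand in for real voting data.
import numpy as np
from sklearn.neural_network import MLPClassifier

N_INPUTS = 98   # votes of every other representative
N_CLASSES = 3   # against / in favor / absent

rng = np.random.default_rng(0)
X_train = rng.integers(0, N_CLASSES, size=(100, N_INPUTS))  # stand-in
y_train = rng.integers(0, N_CLASSES, size=100)              # stand-in

# One hidden layer of 60 neurons; multiclass output uses softmax.
model = MLPClassifier(hidden_layer_sizes=(60,), max_iter=2000)
model.fit(X_train, y_train)

X_unseen = rng.integers(0, N_CLASSES, size=(400, N_INPUTS))  # stand-in
predicted = model.predict(X_unseen)  # predicted vote per session
```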

As usual, I used Karpathy’s convnetjs for modeling and d3 for visualization. The dataset comes from Zakonodajni monitor.

Discovering and visualizing songs with similar trends on the British Top 40 Charts from 1990 to 2014


I’ve often wondered what the average lifetime of a pop song on the charts is. If you follow music, it becomes intuitively apparent that there are in fact several types of hits: some stay on the charts for many weeks, while others barely make it and then immediately slip out.

So I set about discovering groups of songs with similar trends as they moved on the weekly British Top 40 chart from 1990 to 2014. A total of 1284 different songs appeared on the charts in that period. Position data for each song was collected across the weeks, and the songs were then grouped using k-means clustering; after a series of experiments, I arbitrarily settled on 100 groups.
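
A minimal sketch of the grouping step (the window length and the off-chart filler value are assumptions):

```python
# Sketch of the grouping step: each song becomes a vector of weekly
# chart positions, and k-means finds 100 groups of similar trajectories.
import numpy as np
from sklearn.cluster import KMeans

N_SONGS, N_WEEKS = 1284, 20   # the 20-week window is an assumption
OFF_CHART = 41                # filler for weeks a song wasn't charting

rng = np.random.default_rng(0)
# Stand-in random positions where the scraped chart data would go.
positions = rng.integers(1, OFF_CHART + 1,
                         size=(N_SONGS, N_WEEKS)).astype(float)

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0)
labels = kmeans.fit_predict(positions)  # group index for every song
```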

The result is part interactive, part static visualization, consisting of an exploratory chart and 100 small charts showing each separate group.

Check it out here! Or click the image below.

Song trends over time in a typical group


To group the songs, the data was first scraped from www.officialcharts.com and then arranged in a format suitable for k-means clustering. The visualization was constructed with d3.

And here are some of the small multiples.

Some of the 100 different groups. Click image for more.