An article appeared that tried to expose the questionable practices of some Slovenian entrepreneurs. The scheme goes like this: establish a company, perform some work, bleed it dry, then establish a new one and move all the workers into it, all the while avoiding paying benefits and a sizable portion of salaries. When the new company has served its purpose, establish another one, and so on, for as long as it works. These companies are frequently registered at the same address.
The article says that there are as many as 120 companies registered in one residential building. But because of a weakness in the law, state inspectors can’t put an end to such practices.
I wanted to see these addresses on the map, so here’s an attempt. For every address with more than five companies, there’s a dot, with color and radius proportional to the number of companies registered there. The biggest dots represent business buildings, in which predominantly legitimate businesses reside. My data sources didn’t allow for filtering out just the residential buildings.
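The grouping behind the dots can be sketched roughly as follows. This is a minimal illustration, not the code used for the map: the record layout (dictionaries with "name" and "address" keys), the function names and the pixels_per_company scale are all assumptions made for the example.

```python
from collections import defaultdict

def group_by_address(companies, threshold=5):
    """Return {address: [company, ...]} for addresses with more than `threshold` companies."""
    by_address = defaultdict(list)
    for company in companies:
        by_address[company["address"]].append(company)
    return {addr: items for addr, items in by_address.items() if len(items) > threshold}

def dot_radius(count, pixels_per_company=0.5):
    """Radius grows linearly with the number of companies (the scale is an assumed value)."""
    return count * pixels_per_company

# Hypothetical sample data: six companies at one address, one at another.
sample = [{"name": "Company %d" % i, "address": "Address A"} for i in range(6)]
sample.append({"name": "Lone company", "address": "Address B"})

for address, items in group_by_address(sample).items():
    print(address, len(items), dot_radius(len(items)))
```

Only "Address A" survives the filter here, because it is the only address with more than five companies.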
You can see the standalone map here. (In Slovene.)
Clicking on a marker displays a popup with a list of companies, sorted by date of establishment, youngest first. There’s also a chart of the predominant business categories at that address. The categories that the article mentions as most prone to the scheme in question are Construction and Retail. So even if this map can’t really show the locations of these questionable companies, it can maybe help with their discovery. If there’s a big dot with predominantly these categories, there’s a certain possibility that some of these fraudulent companies are there.
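For illustration, the popup contents could be prepared along these lines. The field names ("established", "category") and the sample records are invented for this sketch; the real data source may be organized differently.

```python
from collections import Counter
from datetime import date

def popup_contents(companies):
    """Return (companies sorted youngest first, [(category, count), ...] most common first)."""
    newest_first = sorted(companies, key=lambda c: c["established"], reverse=True)
    category_counts = Counter(c["category"] for c in companies).most_common()
    return newest_first, category_counts

# Hypothetical companies registered at one address.
at_address = [
    {"name": "Old", "established": date(2005, 1, 1), "category": "Retail"},
    {"name": "New", "established": date(2013, 6, 1), "category": "Construction"},
    {"name": "Mid", "established": date(2010, 3, 1), "category": "Construction"},
]

listing, categories = popup_contents(at_address)
print([c["name"] for c in listing])
print(categories)
```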
Most addresses shown here of course don’t have anything to do with any illegal activity.
The Global Gender Gap Index examines the gap between men and women in four fundamental categories (subindexes): Economic Participation and Opportunity, Educational Attainment, Health and Survival and Political Empowerment. Table 1 displays all four of these subindexes and the 14 different indicators that compose them, along with the sources of data used for each.
I thought it would be nice to try to visualize the data and make it as interactive as I could, and learn d3.js in the process. I actually tried to mobilize all the data in the report, which one can see in graphical form by clicking on countries on the world map, or by selecting the categories in the dropdown.
There are several categories:
In addition to that, I calculated the differences between 2013 and previous years. These maps are also accessible through the dropdown menu, or simply by scrolling up and down.
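The year-to-year differences amount to a simple subtraction per country. Here is a small sketch, assuming the scores are held in a dictionary keyed by year and then by country. The layout, the function name and the numbers are made up for the example and are not values from the report.

```python
def score_changes(scores_by_year, latest=2013):
    """Return {year: {country: latest_score - score_in_that_year}} for every earlier year."""
    latest_scores = scores_by_year[latest]
    changes = {}
    for year, scores in scores_by_year.items():
        if year == latest:
            continue
        changes[year] = {
            country: round(latest_scores[country] - score, 4)
            for country, score in scores.items()
            if country in latest_scores  # skip countries missing from the latest year
        }
    return changes

# Invented scores, for illustration only.
example = {
    2012: {"Country A": 0.70, "Country B": 0.60},
    2013: {"Country A": 0.75, "Country B": 0.58},
}
print(score_changes(example))
```

A positive value means the gap narrowed between that year and 2013, a negative one that it widened.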
Economic Participation and Opportunity
This subindex is captured through three concepts: the participation gap, the remuneration gap and the advancement gap. The participation gap is captured using the difference in labour force participation rates. The remuneration gap is captured through a hard data indicator (ratio of estimated female-to-male earned income) and a qualitative variable calculated through the World Economic Forum’s Executive Opinion Survey (wage equality for similar work). Finally, the gap between the advancement of women and men is captured through two hard data statistics (the ratio of women to men among legislators, senior officials and managers, and the ratio of women to men among technical and professional workers).
Educational Attainment
In this subindex, the gap between women’s and men’s current access to education is captured through ratios of women to men in primary-, secondary- and tertiary-level education. A longer-term view of the country’s ability to educate women and men in equal numbers is captured through the ratio of the female literacy rate to the male literacy rate.
Health and Survival
This subindex provides an overview of the differences between women’s and men’s health. To do this, we use two indicators. The first is the sex ratio at birth, which aims specifically to capture the phenomenon of “missing women” prevalent in many countries with a strong son preference. Second, we use the gap between women’s and men’s healthy life expectancy, calculated by the World Health Organization. This measure provides an estimate of the number of years that women and men can expect to live in good health by taking into account the years lost to violence, disease, malnutrition or other relevant factors.
Political Empowerment
This subindex measures the gap between men and women at the highest level of political decision-making, through the ratio of women to men in minister-level positions and the ratio of women to men in parliamentary positions. In addition, we include the ratio of women to men in terms of years in executive office (prime minister or president) for the last 50 years. A clear drawback in this category is the absence of any indicators capturing differences between the participation of women and men at local levels of government. Should such data become available at a global level in future years, they will be considered for inclusion in the Global Gender Gap Index.
Out of the 110 countries that have been involved every year since 2006, 95 (86%) have improved their performance over the last four years, while 15 (14%) have shown widening gaps. Ten countries have closed the gap on both the Health and Survival and Educational Attainment subindexes. No country has closed the economic participation gap or the political empowerment gap. On the Economic Participation and Opportunity subindex, the highest-ranking country (Norway) has closed over 84% of its gender gap, while the lowest ranking country (Syria) has closed only 25% of its economic gender gap. There is similar variation in the Political Empowerment subindex. The highest-ranking country (Iceland) has closed almost 75% of its gender gap whereas the two lowest-ranking countries (Brunei Darussalam and Qatar) have closed none of the political empowerment gap according to this measure.
This topic is apparently interesting enough that it warrants its own discussion on Quora. People there rely on the keen observational powers of the human mind, but for this article, I tried to group the flags algorithmically.
I plotted the results on the map below. Countries with the same color have similar flags. The brighter the color, the bigger the group of countries with similar flags.
Here are some flag groups. To see them all, click the image above.
How I grouped the flags
I used a machine learning algorithm called k-means clustering. It’s really a rudimentary exercise, but the results are good enough to publish on this wee blog.
The algorithm accepts the units to be grouped as vectors, so I had to vectorize the images first, that is to say, convert each of them into a long string of numbers. Each image was partitioned into a grid, then the average color value for each cell was computed. The grid was 24 x 24 cells big. I found that enough for simple flags. These color values were converted into HSB color space and experimentally weighted, then copied into a vector. These vectors were fed into the k-means algorithm with the requested number of clusters set to 120 (there are 240 different flags). You can see the results in the viewer.
The number of clusters was set experimentally, and the clustering is not perfect. For example, the Canadian flag is grouped with some very unlikely lookalikes.
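The vectorization step might look something like the sketch below. It is written from the description above, not taken from the original code. The image is assumed to be a plain list of rows of (r, g, b) tuples in the 0-255 range and at least as large as the grid, and the weights are placeholders, since the experimentally chosen values are not given. Python's colorsys module calls the HSB color space HSV.

```python
import colorsys

def flag_to_vector(pixels, grid=24, weights=(1.0, 1.0, 1.0)):
    """Average each grid cell in RGB, convert the average to HSB/HSV, weight it, flatten."""
    height, width = len(pixels), len(pixels[0])
    vector = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # Mean of each channel, scaled to the 0..1 range colorsys expects.
            r, g, b = (sum(p[i] for p in cell) / (255.0 * len(cell)) for i in range(3))
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            vector.extend([h * weights[0], s * weights[1], v * weights[2]])
    return vector

# Tiny made-up "flag": left half red, right half blue, reduced to a 2 x 2 grid.
red, blue = (255, 0, 0), (0, 0, 255)
tiny_flag = [[red, red, blue, blue] for _ in range(4)]
print(flag_to_vector(tiny_flag, grid=2))
```

With a 24 x 24 grid and three values per cell, each flag becomes a vector of 1,728 numbers.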
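For readers unfamiliar with k-means, here is a bare-bones version in plain Python. It is only a sketch of the general algorithm: the original post does not say which implementation was used, and the function name, the fixed iteration count and the random seeding are choices made for this example.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Return one cluster label per vector, using plain k-means with squared Euclidean distance."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels

# Made-up two-dimensional points forming two obvious groups.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(kmeans(points, k=2))
```

The result depends on the starting centroids and on k, which is one reason the number of clusters has to be tuned by hand and why odd groupings can appear.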
I was wondering if the presence of faces in video content was an indicator of anything, and if so, of what. So I decided to scan episodes of a popular TV series and analyze them, second by second, for the number of faces in video frames, and then compare the charts of various episodes. Here is the result of this research.
I decided to analyze House Of Cards, partly because it’s a great series, but also because it’s character focused, so there are many scenes with a lot of people. I built an interactive viewer, which lets you see which faces were recognized at a particular point in time in Episode 3, which contains a variety of scenes with many people in them.
To pull this off, I used the OpenCV computer vision library, which is quite capable of recognizing faces. As the computer watches TV, this tool scans every frame for faces, and, if it finds any, reports the relevant rectangles, so they can be drawn or extracted and saved.
Here’s a screenshot of a scene in a church. It’s immediately apparent that the tool does not do such a good job, for many faces remain unrecognized. Still, many are recognized.
Recognized faces in the church scene, Episode 3
In this frame below, more faces are recognized.
There are also many false positives. The computer sometimes thinks that something is a face where it most certainly is not, as in the picture below. If one looks carefully, one can sometimes see something face-like in these rectangles.
To construct the viewer, I extracted individual faces from frames so I could display them on the page. They are of various sizes and look like this:
To construct the charts, I just counted the faces in each second, then displayed the time series for each episode.
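The scanning and counting could be sketched as below. Several things here are assumptions rather than facts about the original code: the use of OpenCV's bundled frontal-face Haar cascade, the detectMultiScale parameters, and the choice to summarize each second by the largest per-frame count. The detection function needs the third-party opencv-python package and a video file; the counting function is plain Python.

```python
def detect_face_counts(video_path):
    """Yield the number of faces detected in each frame of the video (requires opencv-python)."""
    import cv2  # imported here so the counting helper below works without OpenCV

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:  # end of video or read error
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Each detection is an (x, y, width, height) rectangle.
        rectangles = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        yield len(rectangles)
    capture.release()

def faces_per_second(frame_counts, fps=23):
    """Collapse per-frame counts into one number per second: the maximum seen in that second."""
    counts = list(frame_counts)
    return [max(counts[i:i + fps]) for i in range(0, len(counts), fps)]

# Made-up per-frame counts for two "seconds" of three frames each.
print(faces_per_second([0, 1, 2, 0, 0, 3], fps=3))
```

With a real file, the two pieces would be chained as faces_per_second(detect_face_counts("episode.mp4")), where the file name is a placeholder.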
This is the final chart. It’s a series of timelines that show how many faces were recognized per second. Why are some lines orange, and some yellow?
As the scanning of video frames progressed, some faces were recognized in only one frame of an entire second (there are 23 frames in each second). Other faces were recognized in more frames, and others in yet more. I thought this would be a good indicator of face detection reliability, but that’s not so. If it tells us anything, it’s how steady the camera was in that section.
My inspiration was small multiples, a visualization technique which allows for easier comparison of several datasets from the same domain. Wikipedia says:
A small multiple (sometimes called trellis chart, lattice chart, grid chart, or panel chart) is a series or grid of small similar graphics or charts, allowing them to be easily compared. The term was popularized by Edward Tufte.
According to Tufte (Envisioning Information, p. 67):
At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.
As always, if anyone is interested in the code, mail me. My address is on the About page.
A few months ago, while researching the business hours of various categories of establishments in Slovenia, I thought it would be nice to draw a map with a graphical representation of the density of open establishments. I decided on a heatmap style, although I later discovered that my chosen implementation had some drawbacks.
Getting the data
Data with the business hours of commercial establishments is traditionally not open for many reasons, two of them being that (1) this information can be commercially exploited, and (2) opening hours can be subject to frequent changes, which demands considerable effort from the database owner if the database is to stay current and reliable.
First I toyed with the idea of crawling the entire directory of odpiralnicasi.com. Then I actually thought about making a version for London, Amsterdam or San Francisco with Yelp data, for which I would have to crawl an entire Yelp city directory, a task I’m not sure would have succeeded. Yelp would probably have blocked my IP before I could harvest a significant portion of what interested me.
So I decided I would use the Najdi.si maps business directory. Disclosure: I work there, so I have access to the database with various business data, which is being kept current.
For every company, I took only the name, geo coordinates, business hours and business category, then I constructed the animated maps. Before I delve into that, here’s a short video of economic activity in Slovenia in the course of a typical Monday.
The animated chart you see on the bottom shows the number of active establishments in various economic categories, such as Restaurants and catering, Industry, Shopping, etc. The full list is:
blue: Computers and IT,
red: Restaurants and catering,
green: Home and garden,
yellow: Beauty and health,
pink: General business,
orange: Free time,
magenta: Culture and schooling
Rendering the maps and constructing the visualization
Rendering one frame in one city at a specific time is just a matter of setting the appropriate latitude, longitude and zoom level on the map, selecting the desired time and plotting on the map all establishments that are open at that time. I used Processing to do that, and for the heat map part I used this excellent example by Philipp Seifried. As a finishing touch, I made the maps switch between day and night styles at appropriate times.
To render the entire video, I had to write a parallel rendering queue, lest the rendering of a single video take an eternity. The Eclipse project is available by email request.
To complicate things a bit, I decided to include up to four different places on the same map, so the viewer could compare opening hours in Ljubljana in different economic categories, or see how different cities woke up and went to sleep at different times.
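Selecting the establishments that are open at a given moment could be done roughly like this. The original was written in Processing. This Python sketch assumes a made-up record layout in which each establishment carries a list of (opens, closes) pairs for the day, and it treats a closing time earlier than the opening time as running past midnight.

```python
from datetime import time

def is_open(hours, moment):
    """True if `moment` falls inside any (opens, closes) interval; closes is exclusive."""
    for opens, closes in hours:
        if opens <= closes:
            if opens <= moment < closes:
                return True
        elif moment >= opens or moment < closes:  # interval wraps past midnight
            return True
    return False

def open_establishments(establishments, moment):
    """Return the establishments that should be plotted on the frame for `moment`."""
    return [e for e in establishments if is_open(e["hours"], moment)]

# Hypothetical establishments with invented hours.
places = [
    {"name": "Shop", "hours": [(time(8, 0), time(16, 0))]},
    {"name": "Bar", "hours": [(time(18, 0), time(2, 0))]},
]

print([p["name"] for p in open_establishments(places, time(12, 0))])
print([p["name"] for p in open_establishments(places, time(1, 0))])
```

Each returned establishment would then be drawn at its coordinates on the map for that frame.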
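The idea of a parallel rendering queue can be illustrated with Python's standard concurrent.futures module. This is not the Eclipse project mentioned above, only a sketch of the same idea: render_frame is a hypothetical stand-in that returns a file name instead of drawing anything, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def render_frame(minute_of_day):
    """Hypothetical stand-in for the real renderer: returns the name of the frame's file."""
    return "frame_%04d.png" % minute_of_day

def render_all(minutes, workers=4, render=render_frame):
    """Render all frames through a pool of workers, keeping the results in frame order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands frames to idle workers but yields results in input order.
        return list(pool.map(render, minutes))

print(render_all(range(0, 30, 10)))
```

For rendering that is heavy on the CPU, a process pool (ProcessPoolExecutor from the same module) would probably be the better fit in Python; threads are used here to keep the example simple.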
A typical frame looks like this:
Here’s an example for different economic activities in Ljubljana:
I mostly did this to be able to visually compare levels of business activity in Ljubljana. The heatmap technique I employed here turned out to be somewhat unreliable for video purposes, because it colors the dots relative to the highest concentration in the current frame. But the concentration and the absolute numbers of active businesses change from frame to frame, so it can seem that there’s more activity at night than during the day.
Even so, it’s clear that restaurants, bars and clubs are still pretty much open when other activity starts to die down.
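The heatmap problem is easy to show with numbers. In the sketch below, each frame is a list of made-up density values, and the two functions are illustrative only: one scales every frame by its own maximum, which is the behaviour that misleads in a video, and the other scales all frames by the maximum across the whole video, which is one possible fix.

```python
def normalize_per_frame(frames):
    """Scale every frame by its own peak, so each frame's hottest spot is always 1.0."""
    return [[v / max(frame) if max(frame) else 0.0 for v in frame] for frame in frames]

def normalize_globally(frames):
    """Scale every frame by the peak over all frames, so frames stay comparable."""
    peak = max(max(frame) for frame in frames)
    return [[v / peak if peak else 0.0 for v in frame] for frame in frames]

# Invented densities for two map cells: a busy daytime frame and a quiet night frame.
day, night = [50, 100], [2, 4]

print(normalize_per_frame([day, night]))
print(normalize_globally([day, night]))
```

With per-frame scaling, the night frame looks exactly as hot as the day frame, even though it has a twenty-fifth of the activity. With global scaling, the night frame fades, as it should.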
This is Ljubljana at noon, again:
top left: General business,
top right: Restaurants and catering,
bottom left: Industry,
bottom right: Beauty and health
The big spot in the northeast is the mall region, where an untold number of businesses operate in ten or more big malls. Business concentration there dwarfs everything else in the city, except maybe in the industrial category.
Below is Ljubljana at eight o’clock in the evening. Pretty much everything has closed down except for eating and drinking, and maybe the cinema in the mall.
Below: Ljubljana at ten o’clock in the evening. Some businesses don’t close down at all. I double checked the primary data source and it’s true. There are cleaning services that stay open during the night, etc.
I’m relatively satisfied with the results, except for the heatmap issue. I may correct that if I get the data for a bigger city.