This topic is apparently interesting enough to warrant its own discussion on Quora. People there rely on the keen observational powers of the human mind, but for this article I tried to group the flags algorithmically.
I plotted the results on the map below. Countries with the same color have similar flags. The brighter the color, the bigger the group of countries with similar flags.
Here are some flag groups. To see them all, click the image above.
How I grouped the flags
I used a machine learning algorithm called k-means clustering. It’s really a rudimentary exercise, but the results are good enough to publish on this wee blog.
The algorithm accepts the units to be grouped as vectors, so I had to vectorize the images first, that is to say, convert them into long strings of numbers. Each image was partitioned into a grid, then the average color value for each cell was computed. The grid was 24 x 24 cells, which I found sufficient for simple flags. These color values were converted into the HSB color space and experimentally weighted, then copied into a vector. The vectors were fed into the k-means algorithm with the requested number of clusters set to 120 (there are 240 different flags). You can see the results in the viewer.
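The vectorization step could look roughly like the sketch below. It is an illustration under assumptions, not the code behind the map: the function name flag_to_vector, the list-of-rows pixel format and the default weights of 1.0 are made up for the example, since the experimentally chosen weights are not given here. Python's colorsys module calls the HSB color space HSV.

```python
import colorsys

def flag_to_vector(pixels, grid=24, weights=(1.0, 1.0, 1.0)):
    """Turn an image into a flat feature vector.

    pixels  -- rows of (r, g, b) tuples with components in 0..255
    grid    -- number of cells per side (24 in the article)
    weights -- hypothetical multipliers for hue, saturation, brightness
    """
    height, width = len(pixels), len(pixels[0])
    vector = []
    for gy in range(grid):
        # Pixel rows covered by this grid row (always at least one).
        y0 = gy * height // grid
        y1 = max((gy + 1) * height // grid, y0 + 1)
        for gx in range(grid):
            x0 = gx * width // grid
            x1 = max((gx + 1) * width // grid, x0 + 1)
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # Average each RGB channel over the cell, scaled to 0..1.
            r, g, b = (sum(p[i] for p in cell) / (255.0 * len(cell))
                       for i in range(3))
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            vector.extend((h * weights[0], s * weights[1], v * weights[2]))
    return vector
```

With the default 24 x 24 grid every flag becomes a vector of 24 * 24 * 3 = 1728 numbers, whatever the size of the source image.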
The number of clusters was set experimentally, and the clustering is not perfect. For example, the Canadian flag is grouped with some very unlikely lookalikes.
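For readers who have not met k-means before, here is a bare-bones sketch of the standard algorithm (random initial centroids, then alternating assignment and update steps). It is a textbook version for illustration, not necessarily the implementation used for the flags; the function name kmeans and its parameters are assumptions.

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Group vectors into k clusters; returns one cluster label per vector."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Since k has to be chosen up front, trying several values and eyeballing the groups, as described above, is a common way to settle on one.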
I was wondering whether the presence of faces in video content is an indicator of anything, and if so, of what. So I decided to scan episodes of a popular TV series and analyze them, second by second, for the number of faces in video frames, and then compare the charts of various episodes. Here is the result of this research.
I decided to analyze House Of Cards, partly because it’s a great series, but also because it’s character focused, so there are many scenes with a lot of people. I built an interactive viewer that lets you see which faces were recognized at a particular point in time in Episode 3, which contains a variety of scenes with many people in them.
To pull this off, I used the OpenCV computer vision library, which is quite capable of recognizing faces. As the computer watches TV, this tool scans every frame for faces and, if it finds any, reports the relevant rectangles so they can be drawn or extracted and saved.
Here’s a screenshot of a scene in church. It’s immediately apparent that the tool does not do such a good job, for many faces remain unrecognized. Still, many are recognized.
Recognized faces in the church scene, Episode 3
In this frame below, more faces are recognized.
There are also many false positives. The computer sometimes thinks that something is a face when it most certainly is not, as in the picture below. If one looks carefully, one can sometimes see something face-like in these rectangles.
To construct the viewer, I extracted individual faces from frames so I could display them on the page. They are of various sizes and look like this:
To construct the charts, I just counted the faces in each second, then displayed the time series for each episode.
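The counting step might be sketched as follows. Two things here are assumptions: the frame rate of 24 frames per second, and the choice to take the largest per-frame count as the value for a second (the exact aggregation is not spelled out above). The function name faces_per_second is made up for the example.

```python
from collections import defaultdict

def faces_per_second(detections, fps=24):
    """Collapse per-frame face counts into one value per second.

    detections -- iterable of (frame_index, faces_found_in_frame) pairs
    fps        -- assumed frame rate of the video
    """
    per_second = defaultdict(list)
    for frame_index, count in detections:
        per_second[frame_index // fps].append(count)
    # Assumption: a second is represented by its busiest frame.
    return {sec: max(counts) for sec, counts in sorted(per_second.items())}
```

The resulting dictionary, keyed by second, is the kind of time series that can be plotted once per episode.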
This is the final chart. It’s a series of timelines that show how many faces were recognized per second. Why are some lines orange, and some yellow?
As the scanning of video frames progressed, some faces were recognized in only one frame in an entire second (there are 23 frames in a second). Other faces were recognized in more frames, and others in yet more. I thought this would be a good indicator of face detection reliability, but that’s not so. If it tells us anything, it’s how steady the camera was in that section.
My inspiration was small multiples, a visualization technique which allows for easier comparison of several datasets from the same domain. Wikipedia says:
A small multiple (sometimes called trellis chart, lattice chart, grid chart, or panel chart) is a series or grid of small similar graphics or charts, allowing them to be easily compared. The term was popularized by Edward Tufte.
According to Tufte (Envisioning Information, p. 67):
At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.
As always, if anyone is interested in the code, mail me. My address is on the About page.
A few months ago, while researching the business hours of various categories of establishments in Slovenia, I thought it would be nice to visualize the density of open establishments on a map. I decided on a heatmap style, although I later discovered that my chosen implementation had some drawbacks.
Getting the data
Data with business hours of commercial establishments is traditionally not open for many reasons, two of them being that (1) this information can be commercially exploited, and (2) opening hours can change frequently, which taxes the database owner with considerable effort if the database is to stay current and reliable.
First I toyed with the idea of crawling the entire directory of odpiralnicasi.com. Then I thought about making a version for London, Amsterdam or San Francisco with Yelp data, for which I would have to crawl an entire Yelp city directory, a task I’m not sure would succeed. Yelp would probably block my IP before I could harvest a significant portion of what interested me.
So I decided I would use the Najdi.si maps business directory. Disclosure: I work there, so I have access to the database with various business data, which is being kept current.
For every company, I took only the name, geo coordinates, business hours and business category, then I constructed the animated maps. Before I delve into that, here is a short video of economic activity in Slovenia over the course of a typical Monday.
The animated chart you see at the bottom shows the number of active establishments in various economic categories, such as Restaurants and catering, Industry, Shopping, etc. The full list is:
blue: Computers and IT,
red: Restaurants and catering,
green: Home and garden,
yellow: Beauty and health,
pink: General business,
orange: Free time,
magenta: Culture and schooling
Rendering the maps and constructing the visualization
Rendering one frame of one city at a specific time is just a matter of setting the appropriate latitude, longitude and zoom level on the map, selecting the desired time, and plotting all establishments that are open at that time. I used Processing to do that, and for the heat map part I used this excellent example by Philipp Seifried. As a finishing touch, I made the maps switch between day and night styles at appropriate times.
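The "which establishments are open right now" filter is the heart of each frame. The original was written in Processing; the Python sketch below only illustrates the idea. The record layout (a dict with a "hours" list of opening and closing times) and the function names are assumptions for the example, not the structure of the real database.

```python
from datetime import time

def is_open(hours, now):
    """hours: list of (opens, closes) datetime.time pairs for one day."""
    for opens, closes in hours:
        if opens <= closes:
            if opens <= now < closes:
                return True
        elif now >= opens or now < closes:
            # The interval wraps past midnight, e.g. a bar open 18:00 to 02:00.
            return True
    return False

def open_establishments(establishments, now):
    """Return only the records that should be plotted for this frame."""
    return [e for e in establishments if is_open(e["hours"], now)]

# Hypothetical sample records; coordinates are roughly central Ljubljana.
places = [
    {"name": "Bakery", "lat": 46.05, "lon": 14.51,
     "hours": [(time(6, 0), time(14, 0))]},
    {"name": "Bar", "lat": 46.06, "lon": 14.50,
     "hours": [(time(18, 0), time(2, 0))]},
]
```

Calling open_establishments(places, time(20, 0)) keeps only the bar, and each remaining record contributes one point to the heatmap.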
To render an entire video, I had to write a parallel rendering queue, otherwise rendering a single video would have taken an eternity. The Eclipse project is available by email request.
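The real queue lives in the Eclipse (Java) project mentioned above. Purely as an illustration of the idea, here is a minimal Python sketch using the standard concurrent.futures module; render_frame is a hypothetical stand-in that only returns a file name instead of drawing anything.

```python
from concurrent.futures import ThreadPoolExecutor

def render_frame(frame_number):
    """Hypothetical stand-in for the expensive per-frame rendering."""
    return "frame-%05d.png" % frame_number

def render_all(frame_numbers, workers=4):
    # map() hands the frames to the worker pool and yields the
    # results in the same order as the input.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_frame, frame_numbers))
```

Frames are independent of each other, which is what makes this kind of work easy to spread over several workers. For CPU-heavy rendering in Python, a process pool would likely be the better choice than threads.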
To complicate things a bit I decided to include up to four different places on the same map, so the viewer could compare opening hours in Ljubljana in different economic categories, or see how different cities woke up and went to sleep at different times.
A typical frame looks like this:
Here’s an example for different economic activities in Ljubljana:
I mostly did this to be able to visually compare levels of business activity in Ljubljana. Unfortunately, the heatmap technique I employed here turned out to be somewhat unreliable for video purposes, because it colors the dots relative to the highest concentration in each frame. But the concentration and the absolute numbers of active businesses change from frame to frame, so it can seem that there’s more activity at night than during the day.
Even so, it’s clear that restaurants, bars and clubs are still pretty much open when other activity starts to die down.
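The effect is easy to reproduce with made-up numbers. In the sketch below, each frame is a list of density values; the function names and the sample data are assumptions for illustration. Scaling every frame by its own maximum makes a quiet night frame look as hot as a busy day frame, while scaling by one maximum for the whole video keeps frames comparable.

```python
def normalize_per_frame(frames):
    """Scale each frame by its own maximum (the problematic behavior)."""
    return [[v / max(frame) if max(frame) else 0.0 for v in frame]
            for frame in frames]

def normalize_global(frames):
    """Scale every frame by the maximum over the whole video."""
    peak = max(max(frame) for frame in frames)
    return [[v / peak if peak else 0.0 for v in frame] for frame in frames]

# Hypothetical densities: a busy daytime frame and a quiet night frame.
frames = [[100, 50], [4, 2]]
```

With these numbers the per-frame version gives the night frame a hottest spot of 1.0, identical to the day frame, while the global version gives it 0.04.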
This is Ljubljana at noon, again:
top left: General business,
top right: Restaurants and catering,
bottom left: Industry,
bottom right: Beauty and health
The big spot in the northeast is the mall region, where an untold number of businesses operate in ten or more big malls. Business concentration there dwarfs everything else in the city, except maybe in the industrial category.
Below is Ljubljana at eight o’clock in the evening. Pretty much everything has closed down except for eating and drinking, and maybe the movie theater in the mall.
Below: Ljubljana at ten o’clock in the evening. Some businesses don’t close down at all. I double checked the primary data source and it’s true. There are cleaning services that stay open during the night, etc.
I’m relatively satisfied with the results, except for the heatmap issue. I may correct that if I get the data for a bigger city.
Breaking Bad is a great TV show and I really recommend it. Even Anthony Hopkins wrote a much publicized fan letter to the crew and the main actor. I watched it avidly until the episode with the fly. Then I took a pause that somehow extended itself up until the finale.
After that, all my information came from the media and from my girlfriend, who still watched it on a regular basis. So these measurements were taken by a person who isn’t much biased by emotional involvement with the onscreen characters.
What do brainwaves measure, and what do the levels mean? Here’s a quote from Wikipedia:
delta: adult slow-wave sleep, in babies, has been found during some continuous-attention tasks.
theta: young children, drowsiness or arousal in older children and adults, idling, associated with inhibition of elicited responses (has been found to spike in situations where a person is actively trying to repress a response or action).
alpha: relaxed/reflecting, closing the eyes, also associated with inhibition control, seemingly with the purpose of timing inhibitory activity in different locations across the brain.
beta: alert, active, busy, or anxious thinking, active concentration.
gamma: displays during cross-modal sensory processing (perception that combines two different senses, such as sound and sight), also is shown during short-term memory matching of recognized objects, sounds, or tactile sensations.
There’s also mu, but the Mindwave doesn’t measure it.
Here’s the EEG graph overlaid on the frames. The EEG values have been averaged per shown frame.
The colors are:
red: low alpha,
orange: high alpha,
pink: low beta,
light blue: high beta,
green: Attention (synthetic NeuroSky value).
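The per-frame averaging mentioned above can be sketched like this. It assumes the readings are available as (timestamp in seconds, value) pairs and that each shown frame covers a fixed span of the episode; the function name average_per_bucket and the bucket length are assumptions for the example.

```python
def average_per_bucket(samples, bucket_seconds):
    """Average readings over fixed-length time buckets.

    samples        -- (timestamp_seconds, value) pairs
    bucket_seconds -- how much of the episode one shown frame covers
    Returns {bucket_index: mean value}.
    """
    sums, counts = {}, {}
    for t, value in samples:
        bucket = int(t // bucket_seconds)
        sums[bucket] = sums.get(bucket, 0.0) + value
        counts[bucket] = counts.get(bucket, 0) + 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}
```

The same function works for any of the measured bands, one call per band.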
To measure the brainwaves, I used the NeuroSky Mindwave. It’s a convenient and portable personal EEG. It’s a little limited, and one has to learn how to use it properly, but it has a professional quality DSP chip that it uses to calculate two levels the company calls “Attention” and “Meditation”. It also outputs standard alpha, beta, gamma, theta and delta waves.
It looks like this:
By “limited” I mean that it samples brainwave data only twice a second. So whatever is happening in your brain now, you can measure only after half a second in the worst case.
This is the “attention” chart during the episode:
Here is the video with onscreen readings. It’s just another way of presenting the same data as in the picture above, except that more brainwave frequencies are shown.
I hope I’m not in copyright violation for that video. It’s essentially unwatchable story-wise.
I’m not totally satisfied with the images and video produced here, but I’m not watching the episode again. I must also admit that I can’t really interpret the charts and video. Attention is self-explanatory, and elevated beta levels also mean increased attention, but do high alpha values mean that I was falling asleep? I was pretty alert while watching.
There’s also the possibility of interference. The EEG is essentially a very sensitive voltmeter that measures minute potential changes. Twitching facial muscles, blinking, yawning and so on all interfere with the readings. I did look at my second monitor quite a few times to check if the data was being written to a file, so maybe some spikes come from that. All in all, I don’t think there are any spoilers here.
There has recently been a flurry of activity by self-made mappers on the net that major media have noticed. It seems that the proliferation of tools such as the excellent TileMill helps make creating custom maps a relatively painless, yet still laborious, process.
In my experience, a major hurdle in this process is getting good data. Governments and corporations around the globe have made acquiring the goods easier, but the quality frequently leaves one wanting. More about this particular dataset later.
This map is my attempt to visualize real estate prices in Slovenia. Buildings are colored according to the most expensive unit they contain, except in some cases where data is bad. More below.
This dataset is provided by GURS, a government institution. I used it before, to make the map of structure ages in Ljubljana. It comes in a variety of formats, such as SHP (geometry) and text (building properties) files, which were clearly dumped from database tables.
It has some severe problems. For example, some bigger and more expensive buildings contain many units, but these units all hold the same value regardless of their useful area. To make matters more complicated, other multiunit buildings hold different values for the units they contain. They are, in other words, recorded correctly. Then there are building compounds, like the nuclear power plant in Krško, in which every building clearly holds the exorbitant value of the entire compound. Some other buildings have a price of zero, and so on.
All of this doesn’t even start to address the quality of the valuation the government inspectors performed. In the opinion of many property owners, the values are too low. There’s a new round of valuation coming, in which the values are reportedly bound to drop by a further five to twenty percent, if I remember correctly. It will be interesting to make another map with the valuation differences some day.
Massaging the data
This means that the above map is my interpretation of the dataset beyond the visualization itself. In calculating values for visualization, there were several decisions I made:
For multiunit buildings, I calculated the cost per square meter for every unit, then colored the building according to the most expensive unit. This was necessary because some buildings contain many communal areas, garages and parking lots, which are all independently valued. I first tried a simple average, but apartment buildings with many parking boxes and garages then came out deceptively low. I wanted the map to be more apartment-oriented, so this decision was necessary to make it reflect the market more accurately.
For incorrectly recorded buildings whose units all hold the same (high) value, I took the price of one unit and divided it by the sum of the unit areas. I could have done this for one unit only, but which one? There’s no easy answer. The average seemed the way to go.
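The two rules above can be summed up in a short sketch. The function name building_price_per_m2 and the input format are assumptions, and so is the test for a badly recorded building (more than one unit, all with exactly the same value). The real dataset may need a more careful check.

```python
def building_price_per_m2(units):
    """Price per square meter used to color one building.

    units -- list of (value_eur, area_m2) pairs, one per unit
    Returns None when no unit has a usable value and area.
    """
    priced = [(v, a) for v, a in units if v > 0 and a > 0]
    if not priced:
        return None  # zero-valued or empty records: nothing to color
    distinct_values = {v for v, _ in priced}
    if len(priced) > 1 and len(distinct_values) == 1:
        # Assumption: identical values on every unit mean the figure
        # is really the whole building's value, so spread it over
        # the total area.
        return priced[0][0] / sum(a for _, a in priced)
    # Otherwise color by the most expensive unit per square meter.
    return max(v / a for v, a in priced)
```

For example, an apartment worth 100,000 EUR on 50 m2 together with a 5,000 EUR garage on 25 m2 gives 2,000 EUR/m2, because the apartment dominates. Two units both recorded at 300,000 EUR with 50 m2 and 100 m2 are treated as one 300,000 EUR building of 150 m2, which also gives 2,000 EUR/m2.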
I also made a list of the most expensive buildings by their total Euro value. Individual unit values were summed, except in the cases described in the second bullet point above, where I simply took the price of one unit. The list is accessible as a separate vector layer under the “Most expensive buildings” menu item.
Turns out the most expensive buildings are mostly power plants, which is not surprising. In Ljubljana, two of the most expensive buildings were completed recently. Well, the Stožice stadium was not really completed. I don’t know whether it was paid for or not – this is a discourse best suited for political tabloids. See the gallery:
It’s also hardly surprising that the capital and the coast are the areas with the most expensive real estate. The state of the city of Maribor is sad to see, though, at least in comparison to Ljubljana.
I suggest taking the tour in the map itself, where I go into a little more depth for some towns and cities. Also, be sure to click “Most expensive buildings”, then hover the mouse pointer over the highlighted buildings to get an idea of their total cost and price per square meter, which in many cases diverge dramatically.
Here are two charts showing price/m2 distribution at different intervals in time.
This one is an all-time chart. Most buildings are valued low, since all ages were taken into account.
This one shows the period between 2008 and now, in other words, since the crisis struck. Nevertheless, more expensive buildings seem to prevail. No wonder, since they are new. But that probably also means that there’s more apartment building construction relative to countryside development. I’m not really a real estate expert, so if anyone has a suggestion, comment away.
I also have to thank the kind people at GURS for providing me with data. They know it’s flawed somewhat, but all in all it’s not so bad.
As I’ve noted before, this map is the result of my interpretation of government data. I’m in no way responsible for any misunderstandings arising from this map. If you want to see the actual valuation of your building or building unit, please consult GURS or use their web application to find out.