Grouping countries according to flag similarity


This topic is apparently interesting enough to warrant its own discussion on Quora. People there rely on the keen observational powers of the human mind, but for this article I grouped the flags algorithmically.

I plotted the results on the map below. Countries with the same color have similar flags. The brighter the color, the bigger the group of countries with similar flags.

Launch the interactive viewer to explore the groupings.

Countries by flag similarity

Here are some flag groups. To see them all, click the image above.


How I grouped the flags

I used a machine learning algorithm called k-means clustering. It’s really a rudimentary exercise, but the results are good enough to publish on this wee blog.

The algorithm accepts the units to be grouped as vectors, so I had to vectorize the images first, that is, convert each one into a long sequence of numbers. Each image was partitioned into a grid, and the average color value for each cell was computed. The grid was 24 × 24 cells, which I found sufficient for simple flags. These color values were converted into HSB color space and experimentally weighted, then copied into a vector. The vectors were fed into the k-means algorithm with the requested number of clusters set to 120 (there are 240 different flags). You can see the results in the viewer.
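The pipeline above can be sketched roughly as follows. The 24 × 24 grid matches the post; the per-channel weights, the synthetic test "flags", and the use of scikit-learn are my own illustrative assumptions:

```python
import colorsys

import numpy as np
from sklearn.cluster import KMeans

def flag_to_vector(img, grid=24, weights=(2.0, 1.0, 1.0)):
    """Partition an RGB image into a grid x grid array of cells, average
    each cell's color, convert to HSB (HSV) and apply per-channel weights.
    The weights here are placeholders, not the post's actual values."""
    h, w, _ = img.shape
    vec = []
    for gy in range(grid):
        for gx in range(grid):
            cell = img[gy * h // grid:(gy + 1) * h // grid,
                       gx * w // grid:(gx + 1) * w // grid]
            r, g, b = cell.reshape(-1, 3).mean(axis=0) / 255.0
            hsv = colorsys.rgb_to_hsv(r, g, b)
            vec.extend(c * wgt for c, wgt in zip(hsv, weights))
    return np.array(vec)

# Two synthetic "flags": one mostly red, one mostly blue.
red = np.zeros((48, 72, 3), dtype=np.uint8); red[..., 0] = 220
blue = np.zeros((48, 72, 3), dtype=np.uint8); blue[..., 2] = 220

vectors = np.array([flag_to_vector(red), flag_to_vector(blue), flag_to_vector(red)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
```

With real flags, 240 such vectors would go in and 120 clusters would be requested.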

The number of clusters was chosen experimentally, and the clustering is not perfect. For example, the Canadian flag is grouped with some very unlikely lookalikes.

See also my other post on k-means clustering, K-means clustering with Processing.js.

 


Presence of faces in House of Cards TV Series by episode


I was wondering if the presence of faces in video content was an indicator of anything, and if so, of what. So I decided to scan episodes of a popular TV series, analyze them second by second for the number of faces in each video frame, and then compare the charts of various episodes. Here is the result of this research.

I decided to analyze House of Cards, partly because it’s a great series, but also because it’s character-focused, so there are many scenes with a lot of people. I built an interactive viewer that lets you see which faces were recognized at a particular point in time in Episode 3, which contains a variety of scenes with many people in them.

Launch the viewer, or continue reading for a short description of the technology.

Launch the House of Cards Face Recognition Interactive Viewer

Technology

To pull this off, I used the OpenCV computer vision library, which has a good capability to recognize faces. As the computer watches TV, this tool scans every frame for faces, and, if it finds any, communicates the relevant rectangles, so they can be drawn or extracted and saved.

Here’s a screenshot of a scene in church. It’s immediately apparent that the tool does not do such a good job: many faces remain unrecognized. Still, many are.


Recognized faces in the church scene, Episode 3

In this frame below, more faces are recognized.


There are also many false positives. The computer sometimes thinks that something is a face where it most certainly is not, as in the picture below. If one looks carefully, one can sometimes see something face-like in these rectangles.


To construct the viewer, I extracted individual faces from frames so I could display them on the page. They are of various sizes and look like this:


To construct the charts, I just counted the faces in each second, then displayed the time series for each episode.
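One plausible way to collapse per-frame face counts into a per-second series; the post doesn’t specify exactly how frames within a second were combined, so the max-per-second rule used here is an assumption (the 23 frames per second figure comes from the post):

```python
def faces_per_second(frame_counts, fps=23):
    """Collapse a list of per-frame face counts into one value per second.
    Each second gets the maximum count seen in any of its frames; summing
    or averaging would be equally plausible readings of the post."""
    series = []
    for start in range(0, len(frame_counts), fps):
        series.append(max(frame_counts[start:start + fps]))
    return series

# Three seconds of synthetic detections: an empty second, a brief
# two-face moment, then a crowd scene.
counts = [0] * 23 + [2, 2, 1] + [0] * 20 + [5] * 23
```

`faces_per_second(counts)` then yields one chart point per second of video.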

Results

This is the final chart. It’s a series of timelines that show how many faces were recognized per second. Why are some lines orange, and some yellow?

As the frame scanning progressed, some faces were recognized in only one frame of an entire second (there are 23 of them), others in a few more frames, and still others in yet more. I thought this would be a good indicator of face detection reliability, but that’s not so. If it tells anything, it’s how steady the camera was in that section.

House of Cards face recognition charts by episode

My inspiration was small multiples, a visualization technique which allows for easier comparison of several datasets from the same domain. Wikipedia says:

A small multiple (sometimes called trellis chart, lattice chart, grid chart, or panel chart) is a series or grid of small similar graphics or charts, allowing them to be easily compared. The term was popularized by Edward Tufte.

According to Tufte (Envisioning Information, p. 67):

At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.

 

As always, if anyone is interested in the code, mail me. My address is on the About page.

My brainwaves during the final episode of Breaking Bad


This is a follow-up to the first self-quantification post here, my heart rate during the latest episode of Game of Thrones. See also Graphs of recognized faces per second in House of Cards episodes. This time I thought it’d be fun to measure my brainwaves while watching a critical episode of another TV show.

Breaking Bad is a great TV show, and I really recommend it. Even Anthony Hopkins wrote a much-publicized fan letter to the crew and the main actor. I watched it avidly until the episode with the fly. Then I took a pause that somehow extended itself up until the finale.

After that, all my information came from the media and from my girlfriend, who still watched it regularly. So these measurements were taken by a person without much bias in the sense of emotional involvement with the onscreen characters.

What do brainwaves measure, and what do the levels mean? Here’s a quote from Wikipedia:

  • delta: adult slow-wave sleep; in babies; has been found during some continuous-attention tasks.
  • theta: young children, drowsiness or arousal in older children and adults, idling, associated with inhibition of elicited responses (has been found to spike in situations where a person is actively trying to repress a response or action).
  • alpha: relaxed/reflecting, closing the eyes, also associated with inhibition control, seemingly with the purpose of timing inhibitory activity in different locations across the brain.
  • beta: alert, active, busy, or anxious thinking, active concentration.
  • gamma: displays during cross-modal sensory processing (perception that combines two different senses, such as sound and sight), also is shown during short-term memory matching of recognized objects, sounds, or tactile sensations.

There’s also mu, but the Mindwave doesn’t measure it.

Here’s the EEG graph overlaid on the frames. The EEG values have been averaged per shown frame.
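The per-frame averaging can be sketched like this; the window size (how many raw EEG samples fall into one shown frame) is an assumption, since the post only says values were averaged per frame:

```python
def average_per_window(samples, window):
    """Average consecutive EEG samples in groups of `window`, producing
    one value per shown video frame. The window size depends on the
    playback speed and is illustrative here."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]
```

Applied per band (low alpha, high alpha, and so on), this yields the overlaid series in the chart.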

The colors are:

  • red: low alpha,
  • orange: high alpha,
  • pink: low beta,
  • light blue: high beta,
  • green: Attention (a synthetic NeuroSky value).

Breaking Bad final episode EEG chart

To measure the brainwaves, I used the NeuroSky Mindwave. It’s a convenient and portable personal EEG. It’s a little limited, and one has to learn how to use it properly, but it has a professional-quality DSP chip that it uses to calculate two levels the company calls “Attention” and “Meditation”. It also outputs standard alpha, beta, gamma, theta and delta waves.

It looks like this:

Neurosky Mindwave

By “limited” I mean that it samples brainwave data only twice a second. So whatever is happening in your brain now, you can measure it only after half a second in the worst case.

This is the “attention” chart during the episode:

Breaking Bad final episode EEG chart (attention)

Here is the video with onscreen readings. It’s just another way of presenting the same data as in the picture above, except more brainwave frequencies are shown.

Breaking bad final episode fast forward with EEG readings from Marko O’Hara on Vimeo.

I hope I’m not in copyright violation for that video. It’s essentially unwatchable story-wise.

I’m not totally satisfied with the images and video produced here, but I’m not watching the episode again. I must also admit that I can’t really interpret the charts and video. Attention is self-explanatory, and elevated beta levels also mean increased attention, but do high alpha values mean that I was falling asleep? I was pretty alert while watching.

There’s also the possibility of interference. The EEG is essentially a very sensitive voltmeter that measures minute potential changes. Twitching facial muscles, blinking, yawning, etc. all interfere with the readings. I did look at my second monitor quite a few times to check if the data was being written to a file, so maybe some spikes come from that. All in all, I don’t think there are any spoilers here.

Here are some more charts:

 

 

Slovenian real estate prices mapped


There has recently been a flurry of activity by self-made mappers on the net, and the major media have noticed. It seems that the proliferation of tools such as the excellent TileMill helps make custom maps a relatively painless, yet still laborious, process.

In my experience, a major hurdle in this process is getting good data. Governments and corporations around the globe have made acquiring the goods easier, but the quality frequently leaves one wanting. More about this particular dataset later.

This map is my attempt to visualize real estate prices in Slovenia. Buildings are colored according to the most expensive unit they contain, except in some cases where the data is bad. More below.

See the map!

A map of real estate prices in Slovenia.


About the dataset

This dataset is provided by GURS, a government institution. I used it before to make the map of structure ages in Ljubljana. It comes in a variety of formats, such as SHP (geometry) and text (building properties) files, which were clearly dumped from database tables.

It has some severe problems. For example, some bigger and more expensive buildings contain many units, but all of these units hold the same value regardless of their useful area. To make matters more complicated, other multiunit buildings don’t hold the same value for the units they contain; they are, in other words, recorded correctly. Then there are building compounds, like the nuclear power plant in Krško, in which every building clearly holds the exorbitant value of the entire compound. Some other buildings have a price of zero, and so on.

All of this doesn’t even start to address the quality of the valuation the government inspectors performed. In the opinion of many property owners, the values are too low. There’s a new round of valuation coming, in which the values are reportedly bound to drop by a further five to twenty percent, if I remember correctly. It will be interesting to make another map with the valuation differences some day.

Massaging the data

This means that the above map is my interpretation of the dataset, beyond the visualization itself. In calculating the values for visualization, I made several decisions:

  • For multiunit buildings, I calculated the cost per square meter for every unit, then colored the building with the color value of the most expensive unit. This was necessary because some buildings contain many communal areas, garages and parking lots, which are all independently valued. I first tried a simple average value, but apartment buildings with many parking boxes and garages were then valued deceptively low. I tried to make the map more apartment-oriented, so this decision was necessary to make it reflect the market more accurately.
  • For incorrectly recorded buildings whose units all hold the same (high) value, I took the price of one unit and divided it by the sum of the unit areas. I could have done this with one unit’s area only, but which one? There’s no easy answer. The sum seemed the way to go.
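The two rules above can be sketched as a single function; the data layout (value, area pairs) and the equality test for spotting incorrectly recorded buildings are my own simplifications of the post's description:

```python
def building_color_value(units):
    """units: list of (value_eur, area_m2) tuples for one building.

    Correctly recorded buildings are colored by the most expensive
    unit's price per square meter. Incorrectly recorded ones, where
    every unit holds the same inflated value, get one unit's value
    divided by the total area of all units."""
    values = [v for v, _ in units]
    if len(units) > 1 and len(set(values)) == 1:
        return values[0] / sum(a for _, a in units)
    return max(v / a for v, a in units)
```

For example, a building whose two units both claim €300,000 is treated as mis-recorded and priced over its combined 150 m².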

I also made a list of the most expensive buildings by their total Euro value. Individual unit values were summed, except in the cases described in the second bullet point above, where I simply took the price of one unit. It’s accessible as a separate vector layer under the “Most expensive buildings” menu item.

Findings

It turns out that the most expensive buildings are mostly power plants, which is not surprising. In Ljubljana, two of the most expensive buildings were completed recently. Well, the Stožice stadium was not really completed, and I don’t know whether it was paid for or not – that is a discourse best suited for political tabloids. See the gallery:

It’s also hardly surprising that the capital and the coast are the areas with the most expensive real estate. The state of the city of Maribor is sad to see, though, at least in comparison to Ljubljana.

I suggest taking the tour in the map itself, where I go into a little more depth for some towns and cities. Also, be sure to click “Most expensive buildings”, then hover the mouse pointer over the highlighted buildings to get an idea of their total cost and price per square meter, which in many cases diverge dramatically.

Here are two charts showing the price/m² distribution over different intervals in time.
This one is an all-time chart. Most buildings are valued low, since all ages were taken into account.

This one shows the period between 2008 and now – in other words, since the crisis struck. Nevertheless, more expensive buildings seem to prevail. No wonder, since they are new. But that probably also means there’s more apartment-building construction relative to countryside development. I’m not really a real estate expert, so if anyone has a suggestion, comment away.


Credits

The inspiration for the tour was this excellent visualization by the Pulitzer Center.

I also have to thank the kind people at GURS for providing me with data. They know it’s flawed somewhat, but all in all it’s not so bad.

Disclaimer

As I’ve noted before, this map is a result of my interpretation of government data. I’m in no way responsible for any misunderstandings arising from this map. If you want to see the actual valuation of your building or building unit, please consult GURS or use their web application.

See also

Structure ages map in Ljubljana.

Interactive timeline of the PRISM scandal



Purpose of this visualization

This is an interactive timeline of events in the PRISM scandal, chronicled by selected media in online news articles and giving a summarized view of events as they unfolded. It’s intended as a parody of NSA software for tracking people and analyzing their metadata. It consists of these parts:

  • the chronological order of articles, visualized as a timeline,
  • a network of people, places and organizations that appear in the articles,
  • geographic information that the articles refer to, and
  • a bar graph showing wordcounts of interesting words, associated with the main theme.

In the bottom part there is a timeline displaying the published articles in chronological order. Articles are accessible by clicking on a title and then using the Go to article link in the popup bubble.

In the center background there is a rotating globe displaying the major cities referred to by the articles. Labels next to the cities contain the titles of all articles visible in the timeline view that refer to that specific city.

The center foreground contains a network showing the interconnectedness of the various entities recognized in the article text. Entities appearing in the same paragraph are connected. The network is additive, which means that the more frequently entities appear in the same paragraphs, the stronger the bond between them.
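The additive rule can be sketched as follows; the entity names and the fixed weight increment of 1 per shared paragraph are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def add_paragraph(weights, entities):
    """Additively strengthen the bond between every pair of entities
    co-occurring in one paragraph. The edge weight ends up being the
    number of paragraphs the pair shared."""
    for a, b in combinations(sorted(set(entities)), 2):
        weights[(a, b)] += 1

weights = defaultdict(int)
add_paragraph(weights, ["Snowden", "NSA"])
add_paragraph(weights, ["Snowden", "NSA", "Greenwald"])
```

After these two paragraphs, the Snowden–NSA bond is twice as strong as either bond involving Greenwald.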

In the right corner there is a small bar graph containing the frequencies of selected words, giving an idea of Snowden’s options at the time. It’s just a word count of the shown words in the visible articles, not a semantic analysis.

As the timeline is moved, new articles appear and the network is updated with new data, giving a quick overview of who was involved in the discussion, how frequently, and who was related to them.

Hovering the mouse over a network node shows only the portion of the network that the node is directly connected to, making this useful for detailed exploration of relationships between those entities.

Launch the PRISM Scandal visualization! Use Chrome if possible. It won’t work in IE.

PRISM Scandal Visualization Window

Interacting with the visualization

Use the mouse to drag the timeline left or right, or rotate the middle wheel for the same effect. For quicker navigation, use the Quick jump menu. The subnetworks will load and unload automatically, and the whole network will try to stabilize so that it accurately reflects the frequencies of terms and the bond strengths between them. If it doesn’t stabilize well, click the Reorder network button or double-click in the middle of the timeline.

It’s possible to zoom the network in and out to get a better idea of shown names and connections. To do that, position the mouse pointer over the network area and drag or zoom with the wheel. Node size corresponds to term frequency in visible articles, while the bond thickness corresponds to its weight, that is to say frequency of said bond.

Click the article titles in the timeline to display more information and links. To change the publisher, click the Publisher menu and select a desired one. This will load new set of events into the timeline. To automatically move the timeline, use the Play button.

Data sources

News articles containing the keywords Snowden, Prism, NSA, Wikileaks and Julian Assange were scraped from selected media and stored locally for processing. The articles themselves are linked from the timeline; their content, apart from the titles, is not accessible in this visualization due to copyright issues. A geographical database with city names and corresponding latitudes and longitudes was obtained as a free download at GeoNames. The media and/or publishing houses were selected to give a balanced set of worldviews. These are, in alphabetical order:

Processing the articles

First, a dictionary of all capitalized word sequences and their permutations was constructed by processing all the articles in the database. This is essentially a dictionary of all the people, states, cities and organizations appearing in the whole database. Then the title and body text of each article were scanned for these dictionary entries and city names, and an article abstract was constructed, consisting of the title, the publishing date, a link to the article, a word count of selected words, a subnetwork of connected entities, and a list of cities along with latitudes and longitudes.
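A rough stand-in for the capitalized-sequence extraction; the real pipeline also handled permutations and presumably filtered out sentence-initial words, which this sketch does not:

```python
import re

def capitalized_sequences(text):
    """Extract runs of consecutive capitalized words (including all-caps
    tokens like NSA). Note this naive version also matches ordinary
    sentence-initial words, a false-positive source the real dictionary
    would need to handle."""
    return re.findall(r"[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*", text)
```

Scanning every article with such a function yields candidate entities for the dictionary.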

Constructing the network

Subgraphs in an article

The article subnetwork was constructed so that entities in the same sentence (the connections within one paragraph are shown in the picture) are connected with a set weight. Nodes not connected to any other nodes are dropped at this point, since their inclusion would lead to a largely unconnected network, which is visually unappealing and cumbersome to navigate.

All web scraping and text processing was done locally in Java; around 10,000 articles were processed at the latest count. See the picture below.

 

(Screenshot: the Solr admin console used during processing.)

There is no live server database that this visualization queries. The entity dictionaries are here (names) and here (selected words).

Constructing the visualization

The visualization was constructed entirely in HTML5 and JavaScript. Four major libraries were used:

  • Sigma.js for displaying the networks. The latest version does not contain some key functionality for dynamically and additively loading and unloading subgraphs into the main graph, so the source code was updated with the required methods. A separate article on that topic is upcoming.
  • Three.js for rotating Earth and all geographically-related work.
  • Simile Timeline for the timeline.
  • Flot for the bar graph.

If anyone is interested in Java code for web scraping / networking / constructing timeline input files, drop me a note. My email is on the About page.
