The new world news aggregator apdejt.si

A new portal has launched with fresh news from most of the world's sources, translated into Slovenian as it arrives. It is meant to fill the information gap in Slovenian media, which, because of limited time, staff and/or interest, cannot cover every topic reported in the world press.

News is collected every hour from around 10,000 sources worldwide. It is divided into three groups:

The alternative media and blogs include many sources from opposing political and ideological poles. Some publish news well ahead of the mainstream media, and some also publish disinformation. Many a mainstream outlet suffers from the same problem. Even so, I think these sources are an interesting enough segment of information to deserve a separate vertical. Readers should, of course, verify the information independently.

The list of sources is available on the RSS sources page.

The news is translated and enriched by a language model. The translations sometimes stumble, but the great majority are accurate. Besides being translated, the articles, or rather semantic groups of articles, are enriched with categories, topics, entities (e.g. names of people and organizations) and the countries where the news takes place.

This makes it possible to browse the news by:

In addition, the raw RSS sources can be searched by keyword:

With a careful choice of keywords, you can analyze how a given topic appears over time.

A visualization of the network of countries, topics, categories and entities is also available:

Every hour, trends are computed from the distribution of topics and entities. These are the topics and entities that are on the rise:

The site is updated continuously. Go and explore it!

Kontekst.io is out: a search engine for similar expressions and synonyms in contemporary Slovenian

What is kontekst.io?

Kontekst.io is a search engine for words and phrases used in a context similar to that of the search term. The results include not only synonyms but also antonyms, hypernyms and semantically related expressions that may not fit any of these categories, yet together give a fairly clear picture of the search term.

The search engine is built on a language model that groups words used in similar contexts. The words travnik (meadow) and pašnik (pasture) are almost synonyms because, statistically, the distribution of their neighboring words is similar. The same holds for, say, toplo (warm) and hladno (cold) (e.g. “Bilo mi je toplo” and “Bilo mi je hladno”, that is, “I was warm” and “I was cold”), and for many other uses of words and phrases. That is why the search results are often varied and do not contain synonyms only.
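The idea that words with similar neighbors end up close together can be illustrated with a toy sketch. This is not the model behind kontekst.io: it simply counts neighboring words and compares the counts with cosine similarity. The four-sentence corpus, the function names and the window size are all made up for illustration.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=1):
    """Count, for every word, the words that appear within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(neighbours)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(word, vectors, top_n=3):
    """Rank all other words by how similar their neighbor counts are to `word`'s."""
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up toy corpus; the real model was trained on far more text.
corpus = [
    "bilo mi je toplo",
    "bilo mi je hladno",
    "krave so na pašniku",
    "ovce so na travniku",
]
vectors = context_vectors(corpus, window=1)
print(most_similar("toplo", vectors))
print(most_similar("travniku", vectors))
```

In this tiny corpus toplo and hladno share exactly the same neighbor, so the antonym comes out as the closest match, which is the effect described above.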

How to use the search engine, and what it is good for

In total the search engine contains just under 600,000 expressions, including two- and three-word phrases.

The search engine can be used for various purposes.

It can explain, through similar expressions, the meaning of rare words that the media have revived, for example zavržno (reprehensible).

You can look up related people, e.g. Živadinov, Trump, Miha Mazzini, Luka Dončić, … or find their nicknames, provided they are well known enough, e.g. for Karl Erjavec or for “Serpentinšek”.

You can find related products, such as medicines similar to Lekadol. The same goes for brands in cosmetics, the car industry, …

Since a good share of the processed texts were written in slang, you can find related slang expressions, e.g. slang for “zakva” (“why”). Much of the slang text comes from web forums, where moderators may censor inappropriate words.

You can even look up related media outlets or companies. A well-known Slovenian tabloid thus turns up in rather different company than the large media houses.

You can also search for geographical terms and check, for example, which cities are similar to London, or perhaps to Ljubljana.

You can even find numerous bad habits, as well as the chemicals responsible for them.

What vocabulary does kontekst.io cover?

The language model underlying the search engine was trained on roughly twenty gigabytes of Slovenian text obtained from various sources. These include:

  • books (thanks go to the publishers Beletrina and Eno),
  • news published in online media,
  • comments on that news,
  • posts on numerous Slovenian web forums,
  • reference corpora provided by Slovenian research institutions, chiefly the Jožef Stefan Institute,
  • translation corpora freely available on the OPUS website,
  • Slovenian subtitles,
  • cooking recipes.

Many of the sources contain slang and colloquial expressions, and there is also plenty of scientific terminology and brand names.

The mathematical language model used by the search engine

More on the Wikipedia page about the word2vec algorithm (in English). The language model can be used for many purposes, of which a search engine like this is actually the most banal.

The next post will give an exposé of these uses, and of what they suggest about the Slovenian language, the mentality of Slovenians and the culture of expression in online media.

State projects and county budget visualizations – a collaboration with Transparency International Slovenia

These two projects are the result of a recent collaboration with Transparency International Slovenia. The datasets were provided by the state, and I was asked to develop visualizations that would structure the information in an accessible way. Members of the Jožef Stefan Institute also helped a great deal.

State project browser

The first project is a browser of all projects initiated by state institutions from 1991 on. The idea was to let users discover where the money goes in their county, and for what purposes. The dataset and visualization allow exploration by various categories, as well as by time.

The dataset also contains projects that are still in the planning phase and won’t be completed until 2025. With this tool, citizens can hopefully inspect the planned expenditures for roads, water sources, and other categories of infrastructure, culture and other fields of development, and compare them with their own expectations.

It allows browsing and filtering of projects by statistical region and county, and it displays a timeline of all projects, which is basically an expandable version of a Gantt chart.

To see the interactive project website, click here, or click the image below.

State projects app

The original data is provided on the project’s “About” page.

County budget browser

The second project is a straightforward visualization of county budgets. The budgets are displayed as dynamic, zoomable hierarchical (“sunburst”) diagrams. The two diagrams react to each other, allowing a side-by-side comparison of the budgets of two user-selected counties.

The visualization lets users delve into the expenses and income of all Slovenian counties on separate tabs.

To see the interactive project website, click here, or click the image below.

County budgets app

 

Technology and design

The data cleanup and preparation were done with a few Python scripts. The sunburst diagram accepts hierarchical data in a tree format, so this turned into an interesting exercise in converting a tabular dataset into a nested dictionary of arbitrary depth.
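A minimal sketch of that kind of conversion might look like the following. It is not the original script: the column layout, the category labels and the amounts are invented, and the name/children/value keys are only an assumption about the tree shape the diagram consumes.

```python
def rows_to_tree(rows, root_name="budget"):
    """Turn rows of (level_1, level_2, ..., amount) into a nested name/children tree.

    Empty levels (None or "") are skipped, so branches can have different depths.
    """
    root = {"name": root_name, "children": []}
    for *path, amount in rows:
        node = root
        for label in path:
            if not label:
                continue
            children = node.setdefault("children", [])
            # Reuse the child with this label if it already exists.
            child = next((c for c in children if c["name"] == label), None)
            if child is None:
                child = {"name": label}
                children.append(child)
            node = child
        # Amounts are accumulated on the deepest node reached.
        node["value"] = node.get("value", 0) + amount
    return root


# Hypothetical rows; the real budget tables have different columns and labels.
rows = [
    ("Infrastructure", "Roads", "Maintenance", 120.0),
    ("Infrastructure", "Roads", "New construction", 80.0),
    ("Infrastructure", "Water", None, 50.0),
    ("Culture", None, None, 30.0),
]
tree = rows_to_tree(rows)
```

Looking up children by scanning a list is slow for large tables, but it keeps the output in the same shape the tree is serialized in, which makes the sketch easier to follow.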

The visualizations were done in d3, which is really an indispensable tool for any serious work in online visualization.

Both projects were given a minimalist yet expert design by Tomaž Plahuta (Bitnik, Eno).

Check out the projects and let me know your opinion in the comments!

Malofiej24 Award 2016 for Best Map in printed media

This is just a short recap of the project that won the Miguel Urabayen Award for Best Map in printed media and a gold medal for a feature article at Malofiej24. The full list of awarded projects is available on their website; our project is listed first, and then again under the Features / Reportajes heading. My colleagues – Aljaž Vesel, Ajda Bevc, Aljaž Vindiš and the graphics editor Samo Ačko – got two more awards, and I congratulate them sincerely. Read more about the award here. The dnevnik.si article about the awards is here (in Slovenian).

The project was my first collaboration with the Dnevnik newspaper for its Objektivno feature section, which mainly features data visualizations. It was done in a somewhat ad-hoc fashion, for lack of anything else to do. I realized I had been scraping the site that published the list of towed cars, where owners could check whether their car had been towed if it suddenly disappeared from a public parking space in Ljubljana. The list doesn’t exist anymore, but it used to be on this page. It contained the car make and model, the registration plate number, the location from which the car was towed, and a date and time stamp. We decided to put it all on the map and analyze it a bit to see where the luxury makes are towed most.

Here’s the map printout from the newspaper. Click it for the PDF, or click this link.

dnevnik-spiders-net

It’s in Slovenian, so for English speakers:

  • street segment thickness shows the number of cars towed (legend top left)
  • color shows the ratio between better and ordinary car makes. We decided arbitrarily what counts as “better”, but generally treated more expensive makes, such as Audi and Mercedes-Benz, as better. Yellow is for a uniform distribution, red for slightly more better cars, blue for mostly better cars, and black for exclusively better cars. Circles mark areas where mostly better cars were towed, which usually happens in the center and around the new sports stadium.
  • the bottom left has some statistics, as well as the list of car makes we used.
  • the bottom right has some cutouts of the map's trouble spots, with commentary.

One wonders whether owners of better cars are more prone to getting parking tickets than owners of ordinary cars. I believe they are, and the sad reason must be an inflated sense of self-importance, which translates into these people being convinced that the law doesn’t apply to them and leaving their shiny cars parked in inappropriate places. There’s another side to the story: the underpaid traffic wardens, who are all too happy to make a point by immediately calling the tow truck and ignoring the owners’ pleas, even when the owners arrive before the towing starts. So there is a social undertone to this project, and I’m happy if the jury members realized this as they deliberated.

The whole project was done on the Mapbox platform, except for street geocoding and geometry, which come from my privately curated database, derived from a public dataset that is in turn managed by this public agency. Many thanks to the Mapbox team for the turf.js library, which I used in node.js to annotate the geometry with the counts and calculate the ratios. The resulting GeoJSON file was then imported into Mapbox Studio, styled by the gifted designer Aljaž Vindiš, and prepared for print.
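The arithmetic behind the annotation is simple enough to sketch. The project itself used turf.js in node.js; the Python below is only an illustration of the counting step, with made-up street names and records. It reports the share of better makes among all towed cars rather than a better-to-ordinary ratio, an assumption made here so that streets with no ordinary cars don't cause a division by zero.

```python
from collections import defaultdict

# Only the two makes named in the post; the full list is in the newspaper printout.
BETTER_MAKES = {"Audi", "Mercedes-Benz"}


def street_stats(tow_records):
    """Count towed cars per street and compute the share of 'better' makes."""
    totals = defaultdict(int)
    better = defaultdict(int)
    for street, make in tow_records:
        totals[street] += 1
        if make in BETTER_MAKES:
            better[street] += 1
    return {
        street: {"towed": count, "better_share": better[street] / count}
        for street, count in totals.items()
    }


# Invented sample records of (street, car make).
records = [
    ("Street A", "Audi"),
    ("Street A", "Fiat"),
    ("Street A", "Mercedes-Benz"),
    ("Street B", "Fiat"),
]
stats = street_stats(records)
print(stats)
```

The towed count would drive the segment thickness and the share would drive the color, once each street is matched to its geometry.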

Some time ago, I released a much more comprehensive project with many visualizations of traffic infractions in Slovenia. It took me months to make, but failed to attract any significant traffic or have an impact in the public sphere.

The raw development version is still on my server; see it here. I’ve forgotten what I meant by the coloring, but I guess it’s the car make ratio.

The whole thing took us around two days to make. After that, we collaborated on a number of interesting projects, but sadly, as is inevitable in life, the merry group disbanded and left the newspaper for greener pastures. I’m looking forward to collaborating with any of them again.

gold-award

Image courtesy of Matjaž Erker.