Here you can find some interesting datasets and models that I enriched or generated in the course of various projects. They are free to download and experiment with.
Word2vec models
Survivalist – English, bigram, 300 dimensions, minimum word count: 20, window: 7, 20 iterations, generated from a corpus of tens of millions of words from posts on the Survivalist boards forum, type: Gensim. Dashes removed. Suitable for researching weapons and the survivalist mindset.
Drugs – English, bigram, 300 dimensions, minimum word count: 20, window: 7, 20 iterations, generated from a corpus of tens of millions of words from posts on www.drugs-forum.com and bluelight.org, type: Gensim. Dashes removed. Good for researching drugs and the mindset of drug users.
Lemmatized GigaFida – Slovenian, unigram, lemmatized words, 300 dimensions, minimum word count: 20, generated from the GigaFida (mini) corpus, 1 GB of text, type: Gensim. Good for general Slovenian NLP.
Comments and articles in Slovenian media – Slovenian, unigram, 300 dimensions, minimum word count: 50, generated from articles and comments scraped from 10 major Slovenian web publishers, 2013 – 2014. Good for investigating slang and journalistic language.
General Slovenian texts – Slovenian, bigram, 300 dimensions. Good for general Slovenian NLP research.
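For the models listed as type: Gensim, a minimal loading sketch might look like the one below. It rests on several assumptions that the descriptions above do not confirm: that the files were saved with Gensim's `Word2Vec.save` (if they turn out to be bare vectors, `KeyedVectors.load` would be the call to try instead), that the words of a bigram are joined with an underscore, and that tokens are lowercase with dashes deleted rather than replaced by spaces. The helper names `to_model_token` and `nearest_terms` are hypothetical.

```python
import re


def to_model_token(phrase: str) -> str:
    """Approximate the models' preprocessing for a query phrase.

    Assumptions (not confirmed by the model descriptions): lowercase
    text, dashes deleted, and the words of a bigram joined with "_".
    """
    cleaned = phrase.lower().replace("-", "")
    words = re.findall(r"\w+", cleaned)
    return "_".join(words)


def nearest_terms(model_path: str, phrase: str, topn: int = 10):
    """Return the topn nearest neighbours of a phrase, or [] if it is unknown."""
    # Imported lazily so the helper above works without Gensim installed.
    from gensim.models import Word2Vec

    # Assumes a full Word2Vec model saved with Word2Vec.save().
    model = Word2Vec.load(model_path)
    token = to_model_token(phrase)
    if token not in model.wv:  # skip words outside the vocabulary
        return []
    return model.wv.most_similar(token, topn=topn)
```

Calling `nearest_terms("survivalist.model", "bug-out bag")`, where `survivalist.model` stands in for wherever the download was saved, would return a list of `(term, cosine similarity)` pairs, or an empty list if the assumed token form is not in the vocabulary.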
Miscellaneous data
Parking tickets in New York City in 2014 – JSON files by car make and hour. One file per car make, each containing a JSON array of geolocated parking violations by hour. Manhattan and Queens only. Original dataset here.
CSV table with news items in Slovenian media 2015 – 2019: data from RSS feeds. Table contents: publication date, title, publisher, summary, category, URL, image URL. More than half a million rows.
Spanish sentiment training dataset. Collected from various sources, such as scraping comments from hotel sites and classifying the English part of parallel corpora from OPUS (Wikipedia, subtitles, Tatoeba, EuroParl, …), then filtering by score. Not every row is fully reliable, but a classifier trains successfully on the data. There are around 500,000 examples.
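As a quick first look at the CSV table of Slovenian news items listed above, one could count rows per publisher using only the standard library. This is a sketch under assumptions: the file is taken to have a header row, and the column name `publisher` and the file name `news.csv` are guesses, since the listing gives the table contents but not the exact header labels.

```python
import csv
from collections import Counter
from typing import Iterable


def count_by_column(lines: Iterable[str], column: str) -> Counter:
    """Count rows per distinct value of one column in a CSV with a header row."""
    reader = csv.DictReader(lines)
    counts = Counter()
    for row in reader:
        # Missing columns and empty cells are skipped rather than counted.
        value = (row.get(column) or "").strip()
        if value:
            counts[value] += 1
    return counts


# Example usage (file and column names are assumptions):
# with open("news.csv", newline="", encoding="utf-8") as f:
#     print(count_by_column(f, "publisher").most_common(10))
```

Passing the open file straight to the function keeps memory use flat, which matters for a table with more than half a million rows.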