Downloads

Here you can find some interesting datasets and models that I enriched or generated in the course of various projects. They are free to download and experiment with.

Word2vec models

Survivalist – English, bigram, 300 dimensions, minimum word count: 20, window: 7, 20 iterations, generated from a corpus of tens of millions of words in posts on the Survivalist Boards forum, type: Gensim. Dashes removed. Suitable for researching weapons and the survivalist mindset.

Drugs – English, bigram, 300 dimensions, minimum word count: 20, window: 7, 20 iterations, generated from a corpus of tens of millions of words in posts on www.drugs-forum.com and bluelight.org, type: Gensim. Dashes removed. Good for researching drugs and the mindset of drug users.

Lemmatized GigaFida – Slovenian, unigram, lemmatized words, 300 dimensions, minimum word count: 20, generated from the GigaFida (mini) corpus, 1 GB of text, type: Gensim. Good for general Slovenian NLP.

Comments and articles in Slovenian media – Slovenian, unigram, 300 dimensions, minimum word count: 50, generated from articles and comments scraped from 10 major Slovenian web publishers, 2013–2014. Good for investigating slang and journalistic language.

General Slovenian texts – Slovenian, bigram, 300 dimensions. Good for general Slovenian NLP research.
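
As a rough sketch of how one of the Gensim-type models above might be loaded and queried, assuming it was saved in Gensim's native format and that the installed Gensim version can read it: the file name in the usage comment is hypothetical, and the underscore-joined bigram token is an assumption based on Gensim's default phrase delimiter, not something stated on this page.

```python
def load_vectors(path):
    """Load a model saved in Gensim's native format and return its word vectors."""
    # Imported lazily so the helper below can be used without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec.load(path)
    return model.wv


def neighbours(vectors, word, topn=10):
    """Return the topn most similar tokens, or an empty list for unknown words."""
    # Accept either a full Word2Vec model or its KeyedVectors.
    kv = getattr(vectors, "wv", vectors)
    if word not in kv:
        # The models were trained with a minimum word count,
        # so rare words are simply absent from the vocabulary.
        return []
    return kv.most_similar(word, topn=topn)


# Example usage (hypothetical file name; bigram tokens are assumed
# to be joined with "_", e.g. "water_filter"):
#
#   kv = load_vectors("survivalist.model")
#   for token, score in neighbours(kv, "water_filter", topn=5):
#       print(token, round(score, 3))
```

Checking membership first avoids the KeyError that most_similar raises for out-of-vocabulary words, which is likely to matter with the minimum word counts listed above.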

 

Miscellaneous data

Parking tickets in New York City in 2014 – JSON files by car make and hour. One file per car make, containing a JSON array of geolocated parking violations by hour. Manhattan and Queens only. Original dataset here.
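
The field names inside these JSON files are not documented here, so a reasonable first step is to inspect the structure of one file before writing any analysis. The sketch below uses only the Python standard library and assumes nothing about the schema beyond the top-level value being valid JSON; the function names and the file name in the usage comment are hypothetical.

```python
import json


def summarize(data):
    """Return a small description of a decoded JSON value."""
    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["length"] = len(data)
    if isinstance(data, list) and data:
        first = data[0]
        summary["first_item_type"] = type(first).__name__
        if isinstance(first, dict):
            # Sorted key names hint at which fields hold hour and coordinates.
            summary["first_item_keys"] = sorted(first)
    return summary


def summarize_file(path):
    """Decode one JSON file and describe its top-level structure."""
    with open(path, encoding="utf-8") as fh:
        return summarize(json.load(fh))


# Example usage (hypothetical file name):
#
#   print(summarize_file("TOYOTA.json"))
```

Once the keys are known, grouping violations by hour or plotting the coordinates is a matter of iterating over the array.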