This page contains links to lists of public datasets used in different NLP tasks. I try to keep the lists updated as I find new and interesting datasets.
I’ve been keeping track of public and free datasets for semantic relationship extraction. This GitHub repository contains annotated datasets that can be used to train supervised models to perform semantic relationship extraction.
The datasets are organized into three different groups:
Named-Entity Recognition datasets, organised by language; some also cover specific domains:
Lexicons and Dictionaries
Several lexicons I gathered for different NLP tasks, including lists of names, acronyms and their extensions, stop-words, overlaps between names and toponyms, etc.:
NomesLex-PT: a lexicon of Portuguese person names made up of 2,027 first names and 8,019 surnames.
names-surnames-NL-UK-IT-PT-ES.zip: a list of names and surnames for Dutch, English, Portuguese and Spanish.
publico-cargos.txt: a list of Portuguese noun quantifiers, i.e., words that occur before a proper noun, gathered from the online newspaper publico.pt.
publico-acronyms.txt: a list of acronyms and their possible extensions, extracted from a collection of Portuguese news gathered from the online newspaper publico.pt.
wikipedia-acronyms.txt: a list of acronyms and their possible extensions, extracted from the English Wikipedia.
PT-stopwords.txt: a collection of stop-words for Portuguese.
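
As a rough illustration of how one of the plain-text lists above might be used, here is a minimal Python sketch. It assumes the file holds one entry per line in UTF-8, which is a guess about the format rather than something stated on this page. The helper names `load_word_list` and `remove_stopwords` are hypothetical and only serve the example.

```python
def load_word_list(path, encoding="utf-8"):
    """Read a text file into a set of lowercase entries.

    Assumption: one entry per line; blank lines are ignored.
    """
    entries = set()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            entry = line.strip()
            if entry:
                entries.add(entry.lower())
    return entries


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not a stop-word."""
    return [token for token in tokens if token.lower() not in stopwords]
```

With a local copy of the stop-word list, a call such as `remove_stopwords(tokens, load_word_list("PT-stopwords.txt"))` would drop the listed words from a tokenised sentence. A set is used so that each membership check stays cheap even for large lexicons. If the actual files use another layout (for example tab-separated acronym and extension pairs), the parsing step would need to be adapted.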