This page contains links to lists of public datasets used in different NLP tasks. I try to keep the lists updated as I find new and interesting datasets.

Relationship Extraction

I’ve been keeping track of free, public datasets for semantic relationship extraction. This GitHub repository contains annotated datasets that can be used to train supervised models for semantic relationship extraction.

The datasets are organized into three different groups:

Named-Entity Recognition

Named-Entity Recognition datasets organised by language; some also target specific domains:

Lexicons and Dictionaries

Several lexicons I gathered for different NLP tasks, including lists of names, acronyms and their extensions, stop-words, overlaps between names and toponyms, etc.:

  • NomesLex-PT a lexicon of Portuguese person names made up of 2,027 first names and 8,019 surnames.

  • a list of names and surnames for Dutch, English, Portuguese and Spanish.

  • publico-cargos.txt a list of Portuguese noun quantifiers, i.e., words that occur before a proper noun, gathered from the on-line newspaper

  • publico-acronyms.txt a list of acronyms and their possible extensions, extracted from a collection of Portuguese news gathered from the on-line newspaper

  • wikipedia-acronyms.txt a list of acronyms and their possible extensions, extracted from the English Wikipedia.

  • PT-stopwords.txt a collection of stop-words for Portuguese.
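
As a rough illustration of how a stop-word list such as PT-stopwords.txt might be used, here is a minimal Python sketch. It assumes the file holds one word per line in UTF-8, which I have not verified here; the helper names `load_stopwords` and `remove_stopwords` are hypothetical, and the inline sample only stands in for the real file.

```python
from typing import Iterable, List, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase stop-word set, assuming one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep only the tokens whose lowercase form is not a stop-word."""
    return [t for t in tokens if t.lower() not in stopwords]


# Tiny inline sample standing in for the real file (assumed format).
sample_lines = ["de\n", "a\n", "o\n", "que\n", "\n"]
stopwords = load_stopwords(sample_lines)

tokens = "o presidente de Portugal".split()
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

With the real file, the same helper could be fed an open file handle, e.g. `with open("PT-stopwords.txt", encoding="utf-8") as f: stopwords = load_stopwords(f)`, again assuming that encoding and layout.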