Software

BREDS ‑ Bootstrapping of Relationship Extractors with Distributional Semantics

A Python package based on results from my Ph.D. thesis. BREDS extracts relationships between named entities without labelled data, relying instead on an initial set of seeds, i.e. pairs of named entities that exemplify the relationship type to be extracted. The algorithm uses the seeds to learn extraction patterns and expands the initial seed set using distributional semantics, generalising the relationship while limiting semantic drift.
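To give a rough feel for the bootstrapping idea, here is a minimal toy sketch. It is not the BREDS implementation or its API: the 2-d "embeddings", the sentences, the threshold and the `bootstrap` function are all made up for illustration, and it leaves out any scoring of patterns or extracted pairs.

```python
import math

# Toy 2-d "embeddings" for context words (made-up values, illustration only).
EMBEDDINGS = {
    "headquartered": (0.9, 0.1),
    "based": (0.8, 0.2),
    "criticised": (0.1, 0.9),
}

# (entity 1, context word between the entities, entity 2); hypothetical data.
SENTENCES = [
    ("Google", "headquartered", "Mountain View"),
    ("Nokia", "based", "Espoo"),
    ("Apple", "criticised", "Samsung"),
]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bootstrap(seeds, sentences, threshold=0.9, iterations=2):
    """Expand a seed set of entity pairs (hypothetical helper)."""
    seeds = set(seeds)
    for _ in range(iterations):
        # 1. Contexts in which known seed pairs occur act as patterns.
        patterns = [EMBEDDINGS[ctx] for e1, ctx, e2 in sentences
                    if (e1, e2) in seeds]
        # 2. Accept a pair when its context is close enough to some pattern;
        #    the similarity threshold is what keeps unrelated contexts out.
        for e1, ctx, e2 in sentences:
            if any(cosine(EMBEDDINGS[ctx], p) >= threshold for p in patterns):
                seeds.add((e1, e2))
    return seeds


print(sorted(bootstrap({("Google", "Mountain View")}, SENTENCES)))
# → [('Google', 'Mountain View'), ('Nokia', 'Espoo')]
```

Starting from a single seed, the pair connected by "based" is accepted because its context vector is close to the seed's context, while the pair connected by "criticised" stays out.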

GitHub   PyPI Package


nervaluate ‑ NER Evaluation Considering Partial Matching

An open-source software package to evaluate named-entity recognition systems considering partial entity matching. It started as a blog post I wrote on the subject, which attracted the interest of several people and grew into a Python package that is currently maintained by me and other contributors.
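As a rough illustration of what partial matching means, here is a minimal sketch in plain Python. It does not use nervaluate's API or reproduce its scoring scheme: the `Span` type, the `match_type` and `score` helpers, and the 0.5 credit for partial matches are assumptions made for this example.

```python
from typing import NamedTuple


class Span(NamedTuple):
    label: str
    start: int  # index of the first token, inclusive
    end: int    # index of the last token, inclusive


def match_type(gold: Span, pred: Span) -> str:
    """Classify a predicted span against a gold span (hypothetical helper)."""
    if gold.label != pred.label:
        return "none"
    if (gold.start, gold.end) == (pred.start, pred.end):
        return "exact"
    # Two spans overlap when each one starts before the other ends.
    if pred.start <= gold.end and gold.start <= pred.end:
        return "partial"
    return "none"


def score(gold, pred, partial_credit=0.5):
    """Precision and recall, with reduced credit for partial matches."""
    credit = 0.0
    remaining = list(gold)
    for p in pred:
        candidates = [g for g in remaining if match_type(g, p) != "none"]
        if not candidates:
            continue
        # Prefer an exact match; otherwise take a partial overlap.
        best = max(candidates, key=lambda g: match_type(g, p) == "exact")
        credit += 1.0 if match_type(best, p) == "exact" else partial_credit
        remaining.remove(best)  # each gold span is matched at most once
    precision = credit / len(pred) if pred else 0.0
    recall = credit / len(gold) if gold else 0.0
    return precision, recall


gold = [Span("PER", 0, 1), Span("LOC", 5, 6)]
pred = [Span("PER", 0, 1), Span("LOC", 5, 5)]
print(score(gold, pred))  # → (0.75, 0.75)
```

A strict evaluation would count the second prediction as an error; here it receives half credit because it overlaps the gold span and has the right label.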

GitHub   PyPI Package


snowball-extractor ‑ Extracting Relations from Large Plain-Text Collections

A Python package implementing Snowball, a bootstrapping approach for extracting relations from large plain-text collections.

GitHub   PyPI Package


Politiquices.PT ‑ Support and Opposition Relationships in Portuguese Political News Headlines

I analysed thousands of archived news headlines, identifying those that report supportive or opposing relationships between political actors, and linked the political personalities to their identifiers on Wikidata. The result is a semantic graph, politiquices.pt, which allows answering questions involving political personalities and parties. The project was awarded 2nd place in the “Arquivo.pt Awards 2021”.

Web   GitHub




Datasets

Relationship Extraction

I’ve been keeping track of public and free datasets for semantic relationship extraction. The datasets are organised into three different groups:


Named-Entity Recognition

Named-entity recognition datasets are organised by language; some also target specific domains:


Lexicons and Dictionaries

Several lexicons I gathered for different NLP tasks, including lists of names, acronyms and their expansions, stop-words, overlap of names and toponyms, etc.:

  • NomesLex-PT a lexicon of Portuguese person names made up of 2,027 first names and 8,019 surnames.

  • names-surnames-NL-UK-IT-PT-ES.zip a list of names and surnames for Dutch, English, Italian, Portuguese and Spanish.

  • publico-cargos.txt a list of Portuguese noun quantifiers, i.e., words that occur before a proper noun, gathered from the on-line newspaper publico.pt.

  • publico-acronyms.txt a list of acronyms and their possible expansions, extracted from a collection of Portuguese news articles gathered from the on-line newspaper publico.pt.

  • wikipedia-acronyms.txt a list of acronyms and their possible expansions, extracted from the English Wikipedia.

  • PT-stopwords.txt a collection of stop-words for Portuguese.