In this post I will present one of the first proposed Open Information Extraction systems, which is very simple and effective, relying only on part-of-speech tags. I also implement it and apply it to Portuguese news articles.

But first, a short introduction. Information Extraction, in an NLP context, deals with the problem of extracting structured information from text. For a computer, text is just a sequence of bytes with no semantic meaning whatsoever.

One possible approach to extracting structured information from text is to extract semantic triples, that is, named-entities (e.g., persons, locations, organisations) and the semantic relationships between them.

Semantic relationships are typically represented by triples in the form <e1, rel, e2>, where e1 and e2 are noun-phrases/entities of a relationship, and rel is the type of relationship relating the two noun phrases.

For instance, given the following sentence:

"The linguist Noam Chomsky was born in East Oak Lane neighbourhood of Philadelphia."

we first identify the named-entities. In this case, three different named-entities are recognized: Noam Chomsky, East Oak Lane and Philadelphia.

Then, two different relationships can be extracted between the recognized named-entities:

<Noam Chomsky, place-of-birth, East Oak Lane>

<East Oak Lane, part-of, Philadelphia>
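
A triple like this maps naturally onto a small record type. The snippet below is only an illustrative sketch: the `Triple` class and its field names are my own naming for the <e1, rel, e2> form, not part of any library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A semantic relationship in the form <e1, rel, e2>."""
    e1: str   # first argument (entity / noun phrase)
    rel: str  # relationship type or relational phrase
    e2: str   # second argument (entity / noun phrase)


sentence = ("The linguist Noam Chomsky was born in East Oak Lane "
            "neighbourhood of Philadelphia.")

birth = Triple("Noam Chomsky", "place-of-birth", "East Oak Lane")

# Both arguments must be mentioned in the sentence the triple came from.
assert birth.e1 in sentence and birth.e2 in sentence
print("<%s, %s, %s>" % birth)
```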

Typically (or at least until 2007), most systems aimed at extracting relationships whose types were known a priori (e.g., place-of-birth, part-of), with each target relationship either learned from training data or defined by hand-made rules.

Another approach is Open Information Extraction (OIE) (Banko et al., 2007; Etzioni et al., 2008), which is suited to cases where the target relations are unknown and the textual data is heterogeneous.

OIE techniques extract all possible relationship types from a given collection of documents. OIE can be divided into two main categories: data-based and rule-based.

Rule-based OIE relies on hand-crafted patterns over PoS-tagged text or on rules operating on dependency parse trees. Data-based OIE generates patterns from training data represented as dependency trees or PoS-tagged text.

The figure below shows a taxonomy of different methods to perform relationship extraction. If you are interested you can also check Chapter 2 of my PhD thesis for a review of work in relationship extraction.

### ReVerb: Identifying Relations for Open Information Extraction

ReVerb (Fader et al., 2011) extracts relationships based on a simple constraint: every relational phrase, i.e., the sequence of words connecting two entities, must be either:

• a verb (e.g., invented),
• a verb followed immediately by a preposition (e.g., located in),
• a verb followed by nouns, adjectives, or adverbs ending in a preposition (e.g., has atomic weight of)

This corresponds to the PoS-tag pattern shown in the figure below.

If there are multiple possible matches for a single verb, the longest possible match is chosen. If the pattern matches multiple adjacent sequences, ReVerb merges them into a single relation phrase.

During extraction, the system first looks for a matching relational phrase and then for the arguments (e1, e2) of the relationship, thus avoiding confusing a noun in the relational phrase for an argument.
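
To make the constraint concrete, here is a minimal sketch of it using only Python's `re` module. It is not ReVerb's actual implementation: I am assuming Universal PoS tags, mapping each tag to a single letter (the `TAG_TO_CHAR` table and the `relational_phrases` helper are my own names), and relying on greedy regex matching to approximate the longest-match rule. The outer `+` is what merges adjacent matches into one relational phrase.

```python
import re

# One letter per Universal PoS tag that the pattern cares about;
# every other tag becomes "x" and can never be part of a match.
TAG_TO_CHAR = {
    "VERB": "V", "PART": "T", "ADV": "R", "ADP": "P",
    "NOUN": "N", "PROPN": "N", "ADJ": "J", "PRON": "O", "DET": "D",
}

# V = verb particle? adverb?
# W = noun | adjective | adverb | pronoun | determiner
# P = preposition | particle (which covers the infinitive marker)
# One or more adjacent (V | V P | V W* P) units form a single phrase.
REL_PHRASE = re.compile(r"(?:VT?R?(?:[NJROD]*[PT])?)+")


def relational_phrases(tagged_tokens):
    """Return the relational phrases found in a list of (word, tag) pairs."""
    tag_string = "".join(TAG_TO_CHAR.get(tag, "x") for _, tag in tagged_tokens)
    phrases = []
    for match in REL_PHRASE.finditer(tag_string):
        # Each character stands for one token, so offsets are token indices.
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


# Hand-tagged toy sentence (the tags are an assumption, not tagger output).
sentence = [
    ("Hydrogen", "NOUN"), ("has", "VERB"), ("atomic", "ADJ"),
    ("weight", "NOUN"), ("of", "ADP"), ("1.008", "NUM"),
]
print(relational_phrases(sentence))
```

Working on a string of tag letters keeps the pattern readable, at the cost of collapsing some tag distinctions (for example, NOUN and PROPN share one letter).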

### Extracting semantic relationships from Portuguese

For English, there are several software packages ready to use out-of-the-box for simple NLP tasks such as part-of-speech tagging, dependency parsing, and named-entity recognition.

But for Portuguese, and many other languages, that’s not the case. I was therefore curious whether I could quickly and easily adapt ReVerb to Portuguese and extract triples from a collection of documents.

I also made a small change to the original ReVerb algorithm: instead of looking for noun phrases, I first tagged all the named-entities (i.e., PER, LOC, ORG) in the document collection, and then looked for relational phrases that match the PoS-based pattern and connect two named-entities in a relationship.

I used three main resources for running a quick experiment:

I) Polyglot, an NLP library supporting several languages, including Portuguese, which performs:

• sentence boundary detection

• part-of-speech tagging

• named-entity recognition (NER)

I had some trouble installing PyICU, which Polyglot requires; this post on Stack Overflow helped me.

Polyglot tries to automatically detect the language in which a text is written and apply the correct model (e.g., for PoS-tagging or NER). Because of foreign names, it sometimes detects the language as English, or some other language, rather than Portuguese. You can, however, force the language of a text and override the language detection mechanism.

II) I used RegexpParser from NLTK to encode ReVerb’s regular expression adapted to Portuguese. The PoS tags given by Polyglot follow the Universal Part-of-Speech tag set, which has 17 unique tags.

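# `word` and `preposition` are not shown in this post; the two values below are assumptions.
verb = "<ADV>*<AUX>*<VERB><PART>*<ADV>*"
word = "<NOUN|ADJ|ADV|DET|ADP>"
preposition = "<ADP>"

rel_pattern = "( %s (%s* (%s)+ )? )+ " % (verb, word, preposition)
grammar_long = '''REL_PHRASE: {%s}''' % rel_pattern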


III) CHAVE, a collection of Portuguese news articles distributed by Linguateca and available free of charge for research purposes.

You just need to give an email address to receive the credentials needed to download the collection. It contains the complete published editions from 1994 and 1995 of two popular newspapers, PUBLICO from Portugal and Folha de São Paulo from Brazil. The figure below is a statistical description of the dataset.

I used Polyglot to identify named-entities and perform part-of-speech tagging, then I looked for ReVerb patterns between every pair of named-entities that were no more than 8 tokens apart. I extracted triples from PUBLICO news articles from the following categories:

          ['Nacional', 'Mundo', 'Economia', 'Sociedade', 'Cultura']
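
The pairing step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the experiment: the `candidate_pairs` helper, the (start, end, label) span format, and the choice to measure distance as the number of tokens between the two entities are all assumptions of mine.

```python
from itertools import combinations

MAX_GAP = 8  # maximum number of tokens allowed between the two entities


def candidate_pairs(tokens, entities, max_gap=MAX_GAP):
    """Yield (e1_text, tokens_between, e2_text) for entity pairs that are close enough.

    `entities` is a list of (start, end, label) token spans, end exclusive,
    sorted by position and non-overlapping.
    """
    for (s1, e1, _), (s2, e2, _) in combinations(entities, 2):
        gap = s2 - e1  # number of tokens strictly between the two entities
        if 0 < gap <= max_gap:
            yield " ".join(tokens[s1:e1]), tokens[e1:s2], " ".join(tokens[s2:e2])


# Toy input: a tokenised sentence with two hand-annotated entity spans.
tokens = ["Indonésia", "ocupou", "o", "território", "de", "Timor"]
entities = [(0, 1, "LOC"), (5, 6, "LOC")]

for e1, between, e2 in candidate_pairs(tokens, entities):
    print(e1, between, e2)
```

The tokens between each pair are what then gets checked against the relational-phrase pattern.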


This took around 2 hours to run on a MacBook Pro and extracted around 70 000 unique relational triples. I then started to explore the extracted triples by simply using grep to find triples referring to specific entities.
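
The same kind of lookup can be done in Python instead of grep. The sketch below assumes the triples are stored one per line as three tab-separated fields; both that format and the `triples_mentioning` helper are assumptions made for illustration.

```python
def triples_mentioning(lines, entity):
    """Return (e1, rel, e2) tuples from tab-separated lines where `entity` is an argument."""
    needle = entity.lower()
    hits = []
    for line in lines:
        # Drop empty fields so that repeated tabs used for alignment are tolerated.
        fields = [field.strip() for field in line.split("\t") if field.strip()]
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        e1, rel, e2 = fields
        if needle in e1.lower() or needle in e2.lower():
            hits.append((e1, rel, e2))
    return hits


# Two lines in the assumed tab-separated format.
lines = [
    "Indonésia\tocupou o território de\tTimor",
    "Rod Stewart\tdá um concerto em\tLisboa",
]
print(triples_mentioning(lines, "timor"))
```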

There are interesting facts in the semantic triples, mostly, of course, related to events in 1994 and 1995. For instance:

When Madredeus started to gain international fame and tour outside of Portugal:

Teresa Salgueiro	cantar composições do	        Pedro Ayres Magalhães
Madredeus	        vendem tanto como	        Paul McCartney
Teresa Salgueiro        frequentou aulas de canto com	Cortês Medina


When the East Timor conflict took on larger political proportions:

Bill Clinton	  falou no problema dos direitos humanos em	Timor-Leste
Bill Clinton      levante a questão de                          Timor-Leste
Bill Clinton      também não deixará de mencionar a questão de  Timor

Governo da Indonésia	não poupa ataques	                    Portugal
Coliseu do Porto        haverá um espectáculo de solidariedade com  Timor
Danielle Mitterrand     participar numa conferência sobre           Timor-Leste
Durão Barroso           lança apelo por                             Timor

Galvão de Melo          não poupou elogios à presença indonésia em  Timor
Governo russo           reconheceu                  Timor-Leste
Indonésia               explora petróleo do         Timor
Indonésia               ocupou o território de      Timor
Indonésia	        acusa                       Portugal

Alvalade	montar o espectáculo da banda de                David Gilmour
Alvalade	ouvir a música dos	                        Pink Floyd
Damásio	        prometeu roubar os concertos ao seu rival de	Alvalade
Rod Stewart	dá um concerto em	                        Lisboa
Van Halen	fazerem a primeira parte do concerto dos	Bon Jovi


The typical messages of accusation and support in politics:

António Guterres    responde às críticas de	    Luís Filipe Menezes
Manuel Monteiro	    acusou ontem	            António Guterres
Narciso Miranda	    dedicou a vitória a             António Guterres
PCP                 não poupou críticas a	    Guterres
PS                  teceu fortes críticas ao	    PSD
PSD                 criticou o governo de	    Guterres
Sampaio             elogiando a capacidade de	    Guterres
Sampaio	            entra para o longo abraço a	    Guterres


And the investments and the buying and selling of shares in economics:

António Champalimaud	aumentar o capital da               Mundial Confiança
Sonae Investimentos	lançou sobre as acções da	    Interlog SGPS
British Aerospace	queria vender a totalidade das acções da    Rover
Carlsberg	        entrou no capital da	                    Unicer
Continente	        lançou sobre a totalidade do capital da	    Modelo
Microsoft		adquiriu todas as acções da	            Altamira
Portucel		não acompanhará o aumento de capital da	    Inapa
Renault		        conduziu a uma forte subida das acções da   Volvo
Santander	        entra efectivamente no capital do	    BCI


But there are also uninformative extractions, which occur mainly due to two problems:

• errors in the NER component of Polyglot
• ReVerb limitations, which are also pointed out in the original paper

## Named-Entity Recognition Errors

One of the problems with the extraction has to do with Polyglot failing to correctly identify the named-entities. Polyglot often detects wrong named-entities, for instance:

!       I-ORG
'       I-PER
)       I-ORG
. . ) ? I-PER
. . . ' I-ORG
. . . . I-PER
. . . . . . . . . . .   I-PER
17.3.95 I-PER
22.06.94        I-LOC
2C      I-ORG
4.10.1952       I-LOC
52n Street      I-ORG
72nd West       I-ORG
? . . . I-PER


or sometimes the named-entities are incomplete, for instance, given the phrase:

“O Fundo BPI América, gerido pela Douro Fundos, grupo Banco Português de Investimento, …“

The following triple is extracted, with the incomplete entity “Douro” tagged as LOCATION:

                BPI América        gerido pela     Douro
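
Some of the spurious entities listed above (punctuation runs, dates) could be discarded with a cheap sanity check before pairing. This is not something I did in the experiment. It is only a sketch of one possible heuristic, and both the rules and the `looks_like_entity` name are assumptions. It does nothing for mislabelled or truncated entities such as “52n Street” or “Douro”.

```python
import re

# Assumed heuristics: keep a mention only if it contains a run of at least
# two letters and is not a dotted date such as 17.3.95.
HAS_WORD = re.compile(r"[^\W\d_]{2,}")
DATE_LIKE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{2,4}")


def looks_like_entity(mention):
    """Return False for mentions that are clearly not named-entities."""
    mention = mention.strip()
    if DATE_LIKE.fullmatch(mention):
        return False
    return HAS_WORD.search(mention) is not None


for mention in ["!", ". . . .", "17.3.95", "2C", "52n Street", "Durão Barroso"]:
    print(repr(mention), looks_like_entity(mention))
```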


## Shallow Analysis of the Sentence

ReVerb operates only at a shallow sentence level: it does not take into consideration syntactic dependencies among words or groups of words and, as a consequence, it only extracts binary relationships. Given this sentence:

“Fontes diplomáticas citadas pela AFP referiram que Washington sempre encarou com desagrado qualquer anúncio de redução de efectivos da ONU na Bósnia antes da formação de uma nova força multinacional”

the following triple is extracted:

                     AFP     referiram       Washington


and “Washington” is wrongly tagged as a PERSON.

This is a complex sentence, and there is not a simple binary relationship that expresses all the information in the sentence.

## Future Work

This was a quick experiment on how to perform open information extraction in Portuguese using only open and out-of-the-box tools. Two things that need to be improved are:

• Go beyond part-of-speech tags, e.g., look for syntactic dependencies
• Improve the named-entity recognition

It’s easy to solve this for English, but for Portuguese it’s more complicated. For morphological and syntactic information for Portuguese, there is SyntaxNet, but I still have to find a way to make it work in batch or inside my own Python code.

For Named-Entity Recognition, as far as I know, there is no library other than Polyglot that is ready to use out-of-the-box. There are datasets available, which can be transformed and used as training data. I will try to address this issue in the future, and hopefully make a post about it.

The full code and the extracted triples are available on my GitHub:

https://github.com/davidsbatista/information-extraction-PT