In May 2016 Google released SyntaxNet, a syntactic parser that outperformed previously proposed approaches.

In this post I will show you how to access SyntaxNet’s syntactic dependencies and other morphological information from Python; specifically, how to load SyntaxNet’s output into NLTK structures such as DependencyGraph and Tree.

In this example I will use the Portuguese model, but as you will see, this can easily be adapted to any language, provided you already have a pretrained model.

## Setup

First you need to install SyntaxNet:

https://github.com/tensorflow/models/tree/master/syntaxnet


Then you need to download a pretrained model from the list of available models:

http://download.tensorflow.org/models/parsey_universal/<language>.zip


As the authors show in the tutorial, after installing SyntaxNet and downloading a pretrained model, you can parse a sentence with the following command:

MODEL_DIRECTORY=/where/you/unzipped/the/model/files
cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh \
  $MODEL_DIRECTORY > output.conll


Now I will show you how to parse a file with one sentence per line and use the result in Python with NLTK.

cat sentences.txt

Quase 900 funcionários do Departamento de Estado assinaram memorando \
que critica Trump.
Meo, Nos e Vodafone arriscam-se a ter de baixar preços a milhões \
de clientes.


First we load all the sentences into a list and join them into a single string, separated by the newline character ‘\n’.
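A minimal sketch of this loading step, written for Python 3. The helper names `read_sentences` and `join_sentences` are my own, chosen for illustration; skipping blank lines is an assumption about how the input file should be cleaned.

```python
def read_sentences(lines):
    """Return the non-empty lines of an iterable (e.g. an open file), stripped."""
    return [line.strip() for line in lines if line.strip()]


def join_sentences(sentences):
    """Join the sentences into one string, one sentence per line."""
    return "\n".join(sentences)


# Typical use with the file shown above (path is the one from this post):
#
#     with open("sentences.txt", encoding="utf-8") as handle:
#         all_sentences = join_sentences(read_sentences(handle))
```

Because `read_sentences` accepts any iterable of lines, it works the same way on an open file handle or on a plain list of strings.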

Then we use Python’s subprocess module to call SyntaxNet, feed it the loaded sentences, and capture the parsed sentences from stdout.
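One way to make this call could look like the sketch below (Python 3). The function name `run_parser` and its parameters are hypothetical; it simply pipes text into any command and returns what the command writes to stdout. Discarding stderr is an assumption on my part, made because only stdout is needed for the following steps.

```python
import subprocess


def run_parser(text, command, cwd=None):
    """Send `text` to `command` on stdin and return its stdout as a string.

    `command` is a list such as [script, model_directory]; `cwd` is the
    directory the command should be run from (both supplied by the caller).
    """
    completed = subprocess.run(
        command,
        input=text.encode("utf-8"),   # sentences go in through stdin
        stdout=subprocess.PIPE,       # parsed output is captured
        stderr=subprocess.DEVNULL,    # assumption: log output is not needed
        cwd=cwd,
        check=True,                   # raise CalledProcessError on failure
    )
    return completed.stdout.decode("utf-8")
```

With the command from the tutorial, a call might be `run_parser(all_sentences, ["syntaxnet/models/parsey_universal/parse.sh", "/where/you/unzipped/the/model/files"], cwd=...)`, where `cwd` is whichever directory you normally run `parse.sh` from.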

We then process the captured stdout to extract, for each token, its dependencies and other morphological information. Each token is represented by a list holding all its syntactic and morphological information, and a list of such lists makes up a sentence.
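The sketch below shows one way to build that list-of-lists structure in Python 3. It assumes the parser output is in CoNLL format: one token per line, tab-separated fields, and a blank line between sentences. The name `parse_conll` and the sample rows in the usage note are illustrative, not actual SyntaxNet output.

```python
def parse_conll(output):
    """Split CoNLL text into sentences, each a list of token field lists."""
    sentences = []
    current = []
    for line in output.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences
```

For example, `parse_conll("1\tUma\t_\tDET\t_\t_\t2\tdet\t_\t_\n")` returns a single sentence with one token made of ten fields.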

We then join the fields of each word/token into a string separated by the tab character ‘\t’, with each word/token on its own line.
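As a small sketch of this step (Python 3), the hypothetical helper `sentence_to_conll` below turns one sentence, given as a list of token field lists, back into a tab-separated string with one token per line.

```python
def sentence_to_conll(tokens):
    """Join each token's fields with tabs and put one token on each line."""
    return "".join("\t".join(fields) + "\n" for fields in tokens)
```

Each token keeps its fields in their original order, so the resulting string has the same columns as the parser output.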

We then pass this string to NLTK’s DependencyGraph, from which we can list all the dependency triples or print the tree in ASCII.
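Here is a hedged sketch of this last step using Python 3 and NLTK. The three rows in `conll` are hand-written for illustration and are not SyntaxNet output; they follow the 10-column layout that `DependencyGraph` accepts, with the root token's relation labelled `ROOT`, which I assume matches the parser's output.

```python
from nltk.parse.dependencygraph import DependencyGraph

# Hand-written example rows (illustrative only): ID, form, lemma, coarse tag,
# fine tag, features, head, relation, and two unused columns.
conll = (
    "1\tUma\tuma\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tgalinha\tgalinha\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tcanta\tcantar\tVERB\t_\t_\t0\tROOT\t_\t_\n"
)

# top_relation_label tells NLTK which relation marks the root token.
graph = DependencyGraph(conll, top_relation_label="ROOT")

# Each triple is ((head word, head tag), relation, (dependent word, dependent tag)).
triples = list(graph.triples())
for triple in triples:
    print(triple)

# tree() returns an nltk Tree, which can draw itself as ASCII art.
graph.tree().pretty_print()
```

Under Python 3 the triples print without the `u'...'` prefixes that appear in the Python 2 output shown in this post.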

For the first sentence we have the following triples and tree:

((u'assinaram', u'VERB'), u'nsubj', (u'funcion\xe1rios', u'NOUN'))
((u'funcion\xe1rios', u'NOUN'), u'nummod', (u'900', u'NUM'))
((u'funcion\xe1rios', u'NOUN'), u'name', (u'Departamento', u'PROPN'))
((u'assinaram', u'VERB'), u'ccomp', (u'memorando', u'VERB'))
((u'memorando', u'VERB'), u'ccomp', (u'critica', u'VERB'))
((u'critica', u'VERB'), u'mark', (u'que', u'SCONJ'))
((u'critica', u'VERB'), u'dobj', (u'Trump.', u'PROPN'))

assinaram
___________|_____________
funcionários               memorando
________|___________              |
|        |           |       ______|_______
Quase      do          de    que           Trump.


And for the second sentence:

((u'pode', u'VERB'), u'nsubj', (u'galinha', u'NOUN'))
((u'galinha', u'NOUN'), u'det', (u'Uma', u'DET'))
((u'pode', u'VERB'), u'dobj', (u'ovos', u'NOUN'))
((u'ovos', u'NOUN'), u'nummod', (u'250', u'NUM'))
((u'ovos', u'NOUN'), u'nmod', (u'ano.', u'NOUN'))