While working on some projects of mine, I came to a point where I needed pre-trained word embeddings for Portuguese. I could have trained some myself on some corpora, but I did not want to spend time cleaning the data and running the training. Instead, I searched the web for collections of word vectors for Portuguese. Here’s a compiled list of what I found.

NILC-Embeddings (2017)

A very comprehensive evaluation of different methods and parameters for generating word embeddings for both the Brazilian and European variants of Portuguese. In total, there are 31 word embedding models based on FastText, GloVe, Wang2Vec and Word2Vec, evaluated intrinsically on syntactic and semantic analogies and extrinsically on POS tagging and sentence semantic similarity tasks.
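To make the idea of an intrinsic analogy evaluation concrete, here is a minimal sketch of the usual vector-arithmetic test ("a is to b as c is to ?"). It is not the NILC evaluation code; the function names and the tiny 2-dimensional vectors are invented purely for illustration, and a real evaluation would run over a full analogy dataset with real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Target point: vec(b) - vec(a) + vec(c)
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors, made up for this example (not taken from any real model)
toy_vectors = {
    "homem": [1.0, 0.0],
    "mulher": [1.0, 1.0],
    "rei": [3.0, 0.1],
    "rainha": [3.0, 1.1],
    "mesa": [-1.0, 0.2],
}

answer = solve_analogy(toy_vectors, "homem", "mulher", "rei")
print(answer)
```

An analogy counts as correct when the returned word matches the expected one, and the accuracy over all analogies is the intrinsic score.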

LX-DSemVectors (2018)

The authors apply the Skip-Gram model to a dataset composed mostly of European Portuguese newspapers. I would say that if you want embeddings for the news domain in European Portuguese, this is probably a very good choice.

Facebook fastText (2018)

This is the famous collection published by Facebook Research, containing word embeddings trained on Wikipedia and Common Crawl data. It covers Portuguese among a total of 157 languages.
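If you only need a handful of vectors, you may not need any library at all. The text-format vector files use the word2vec-style layout: a header line with the vocabulary size and the dimensionality, followed by one word and its values per line. Below is a minimal sketch of a reader for that layout. The function name `load_vec` and the `limit` parameter are my own, and the sample data is made up so that the snippet runs without downloading anything.

```python
import io

def load_vec(handle, limit=None):
    """Parse word2vec-style text: a 'count dim' header,
    then one 'word v1 ... vdim' entry per line."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break  # stop early; the full files are large
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip lines that do not match the header
        vectors[word] = [float(x) for x in values]
    return vectors, dim

# Made-up sample standing in for a real file
sample = io.StringIO(
    "3 4\n"
    "casa 0.1 0.2 0.3 0.4\n"
    "rua -0.5 0.0 0.5 1.0\n"
    "cidade 0.9 0.8 0.7 0.6\n"
)

vectors, dim = load_vec(sample, limit=2)
print(dim, list(vectors))
```

For a real file, pass an open file handle instead of the sample, for example `open(path, encoding="utf-8")` where `path` points to the downloaded text file. The `limit` argument is handy because the words are typically sorted by frequency, so the first entries are the most common words.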

Wikipedia2Vec (2018)

Unlike other word embedding tools, this software package learns embeddings of entities as well as words. The method jointly maps words and entities into the same continuous vector space. The authors provide such embeddings for 11 languages, including Portuguese.

NLPL word embeddings repository

The paper describes it as “a shared repository of large-text resources for creating word vectors, including pre-processed corpora and pre-trained vectors for a range of frameworks and configurations. This will facilitate reuse, rapid experimentation, and replicability of results”. The repository contains different types of embeddings for many languages, including embeddings based on the Portuguese CoNLL17 corpus.