The Attention Mechanism in Natural Language Processing - seq2seq

The Attention mechanism is now an established technique in many NLP tasks. I’ve heard about it often, but wanted to dig a bit deeper and understand the details. In this first blog post, since I plan to publish a few more posts on the subject of attention, I give an introduction by focusing on the first proposed attention mechanism, as applied to the task of neural machine translation.

Portuguese Word Embeddings

While working on some projects of mine I came to a point where I needed pre-trained word embeddings for Portuguese. I could have trained some myself on some corpora, but I did not want to spend time on cleaning the data and running the training, so instead I searched the web for collections of word vectors for Portuguese. Here’s a compiled list of what I’ve found.

KONVENS'19 - The German-focused NLP Conference

KONVENS is an annual conference which brings together the computer science and computational linguistics community working with the German language.

Language Models and Contextualised Word Embeddings

Since the work of Mikolov et al., 2013 was published and the software package word2vec was made publicly available, a new era in NLP began in which word embeddings, also referred to as word vectors, play a crucial role. Word embeddings can capture many different properties of a word and have become the de facto standard for replacing feature engineering in NLP tasks.

Named-Entity Recognition based on Neural Networks

Recently (i.e., at the time of this writing, from 2015~2016 onwards) new methods to perform sequence labelling tasks based on neural networks started to be proposed and published. In this blog post I will try to do a quick recap of some of these new methods, describing their architectures and pointing out what each technique brought that was new or different compared to previously known methods.

Evaluation Metrics, ROC-Curves and imbalanced datasets

I wrote this blog post with the intention to review and compare some evaluation metrics typically used for classification tasks, and how they should be used depending on the dataset. I also show how one can tune the probability threshold for particular metrics.
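The idea of tuning a probability threshold can be sketched as follows. This is a minimal illustration, assuming scikit-learn; the dataset is synthetic and the choice of F1 as the metric to optimise is just an example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# synthetic imbalanced dataset: roughly 10% positive examples
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# sweep over all candidate thresholds and pick the one maximising F1
prec, rec, thresholds = precision_recall_curve(y_te, probs)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # f1[:-1] aligns with thresholds

y_pred = (probs >= best_threshold).astype(int)
```

Instead of the default 0.5 cut-off, predictions are made with the threshold that scored best on the held-out data; on imbalanced datasets this can make a large difference.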

Named-Entity evaluation metrics based on entity-level

When you train a NER system the most typical evaluation method is to measure precision, recall and F1-score at the token level. These metrics are indeed useful to tune a NER system. But when the predicted named-entities are used in downstream tasks, it is more useful to evaluate with metrics at the full named-entity level. In this post I will go through some metrics that go beyond simple token-level performance.

Convolutional Neural Networks for Text Classification

Convolutional Neural Networks (ConvNets) have in recent years shown breakthrough results in some NLP tasks; one particular task is sentence classification, i.e., classifying short phrases (around 20~50 tokens) into a set of pre-defined categories. In this post I will explain how ConvNets can be applied to classifying short sentences and how to easily implement them in Keras.

Applying scikit-learn TfidfVectorizer on tokenized text

Sometimes your tokenization process is so complex that it cannot be captured by a simple regular expression that you can pass to the scikit-learn TfidfVectorizer. Instead you just want to pass a list of tokens, resulting from a tokenization process, to initialize a TfidfVectorizer object.

Hyperparameter optimization across multiple models in scikit-learn

From time to time I found myself bumping into a piece of code (written by someone else) to perform grid search across different models in scikit-learn, always adapting it to suit my needs and fixing it, since it contained some already deprecated calls. I finally decided to post it here on my blog, so I can quickly find it and also share it with whoever needs it.
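The basic shape of such a search is a loop over candidate models, each paired with its own parameter grid. This is a minimal sketch, not the code from the post itself; the models, grids, and dataset are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# one (estimator, parameter grid) pair per candidate model
search_space = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "rf": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

results = {}
for name, (model, grid) in search_space.items():
    gs = GridSearchCV(model, grid, cv=3)
    gs.fit(X, y)
    results[name] = (gs.best_score_, gs.best_params_)

# pick the model whose best cross-validated score is highest
best = max(results, key=lambda name: results[name][0])
```

Each `GridSearchCV` handles the cross-validation for its own model, and the outer loop simply keeps the winner, so adding another model is just one more entry in the dictionary.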

StanfordNER - training a new model and deploying a web service

Stanford NER is a named-entity recognizer based on linear-chain Conditional Random Field (CRF) sequence models. This post details some of the experiments I’ve done with it, using a corpus to train a Named-Entity Recognizer: the features I’ve explored (some undocumented), how to set up a web service exposing the trained model, and how to call it from a Python script.

Conditional Random Fields for Sequence Prediction

This is the third and (maybe) the last part of a series of posts about sequential supervised learning applied to NLP. In this post I will talk about Conditional Random Fields (CRF), explain the main motivation behind the proposal of this model, and make a final comparison between Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM) and CRF for sequence prediction.

Maximum Entropy Markov Models and Logistic Regression

This is the second part of a series of posts about sequential supervised learning applied to NLP. It can be seen as a follow-up to the previous post, where I tried to explain the relationship between HMM and Naive Bayes. In this post I will try to explain how to build a sequence classifier based on a Logistic Regression classifier, i.e., using a discriminative approach.

Hidden Markov Model and Naive Bayes relationship

This is the first post, of a series of posts, about sequential supervised learning applied to Natural Language Processing. In this first post I will write about the classical algorithm for sequence learning, the Hidden Markov Model (HMM), explain how it is related to the Naive Bayes model, and discuss its limitations.

Google's SyntaxNet - HTTP API for Portuguese

In a previous post I explained how to load the syntactic and morphological information given by SyntaxNet into NLTK structures by parsing the standard output. Although useful, this does not scale when one wants to process thousands of sentences, but I finally found a Docker image to set up SyntaxNet as a web service.

PyData Berlin 2017

The PyData Berlin conference took place on the first weekend of July, at the HTW. Three full days covering many interesting subjects including Natural Language Processing, Machine Learning, Data Visualization, etc. I was happy to have my talk proposal accepted, and had the opportunity to present work done during my PhD on Semantic Relationship extraction.

Open Information Extraction in Portuguese

In this post I will present one of the first proposed Open Information Extraction systems, which is very simple and effective, relying only on part-of-speech tags. I also implement it and apply it to Portuguese news articles.

Document Classification

Classifying a document into a pre-defined category is a common problem, for instance, classifying an email as spam or not spam. In this case there is an instance to be classified into one of two possible classes, i.e. binary classification.

Google's SyntaxNet in Python NLTK

In May 2016 Google released SyntaxNet, a syntactic parser whose performance beat previous proposed approaches.

First post :-)

Just the first post ever to test everything :-)