This is the first post in a series about sequential supervised learning applied to Natural Language Processing. In this first post I will write about the classical algorithm for sequence learning, the Hidden Markov Model (HMM), and explain how it’s related to the Naive Bayes model and what its limitations are.

The classical problem in Machine Learning is to learn a classifier that can distinguish between two or more classes, i.e., that can accurately predict a class for a new object given training examples of objects already classified.

Typical NLP examples are: classifying an email as spam or not spam, classifying a movie into genres, classifying a news article into topics, etc. However, there is another type of prediction problem which involves structure.

A classical example in NLP is part-of-speech tagging. In this scenario, each $x_{i}$ describes a word and each $y_{i}$ is the associated part-of-speech of the word (e.g.: noun, verb, adjective, etc.).

Another example is named-entity recognition, in which, again, each $x_{i}$ describes a word and $y_{i}$ is a semantic label associated with that word (e.g.: person, location, organization, event, etc.).

In both examples the data consist of sequences of $(x, y)$ pairs, and we want to model our learning problem based on that sequence:

$$ p(y_{1}, y_{2}, \ldots, y_{m} \mid x_{1}, x_{2}, \ldots, x_{m}) $$

In most problems these sequences can have a sequential correlation. That is, nearby $x$ and $y$ values are likely to be related to each other. For instance, in English, it’s common for the word "to" to be followed by a word whose part-of-speech tag is a verb.

Note that there are other machine learning problems which also involve sequences but are clearly different. For instance, in time-series, there is also a sequence, but we want to predict a value $y$ at point $t+1$, and we can use all the previously observed true values of $y$ to predict it. In sequential supervised learning we must predict all $y$ values in the sequence.

The Hidden Markov Model (HMM) was one of the first algorithms proposed to classify sequences. There are other sequence models, but I will start by explaining the HMM as a sequential extension to the Naive Bayes model.

Naive Bayes classifier

The Naive Bayes (NB) classifier is a generative model, which builds a model of each possible class based on the training examples for each class. Then, in prediction, given an observation, it computes the predictions for all classes and returns the class most likely to have generated the observation. That is, it tries to predict which class generated the new observed example.

In contrast, discriminative models, like logistic regression, try to learn which features from the training examples are most useful to discriminate between the different possible classes.

The Naive Bayes classifier returns the class that has the maximum posterior probability given the features:

$$ \hat{y} = \underset{y}{\arg\max} \ p(y \mid \vec{x}) $$

where $y$ is a class and $\vec{x}$ is a feature vector associated with an observation.

Bayes theorem in blue neon.
(taken from Wikipedia)

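The NB classifier is based on Bayes’ theorem. Applying the theorem to the equation above, we get:

$$ p(y \mid \vec{x}) = \frac{p(y) \cdot p(\vec{x} \mid y)}{p(\vec{x})} $$

When iterating over all classes for a given observation and calculating the probabilities above, the probability of the observation, i.e., the denominator, is always the same. It has no influence on which class scores highest, so we can simplify the formula:

$$ p(y \mid \vec{x}) \propto p(y) \cdot p(\vec{x} \mid y) $$

which, if we decompose the vector of features, is the same as:

$$ p(y \mid \vec{x}) \propto p(y) \cdot p(x_{1}, x_{2}, \ldots, x_{m} \mid y) $$

This is hard to compute, because it involves estimating the probability of every possible combination of features. We can relax this computation by applying the Naive Bayes assumption, which states that:

"each feature is conditionally independent of every other feature, given the class"

Formally, $p(x_{i} \mid y, x_{j}) = p(x_{i} \mid y)$ with $i \neq j$. The probabilities $p(x_{i} \mid y)$ are independent given the class $y$ and hence can be ‘naively’ multiplied:

$$ p(x_{1}, x_{2}, \ldots, x_{m} \mid y) = p(x_{1} \mid y) \cdot p(x_{2} \mid y) \cdots p(x_{m} \mid y) $$

Plugging this into our equation:

$$ p(y \mid \vec{x}) \propto p(y) \prod_{i=1}^{m} p(x_{i} \mid y) $$

we get the final Naive Bayes model, which, as a consequence of the assumption above, doesn’t capture dependencies between the input variables in $\vec{x}$.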


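Training in Naive Bayes is mainly done by counting features and classes. Note that the procedure described below needs to be done for every class $y_{i}$.

To calculate the prior, we simply count how many samples in the training data fall into each class $y_{i}$ and divide by the total number of samples:

$$ p(y_{i}) = \frac{N_{y_{i}}}{N} $$

where $N_{y_{i}}$ is the number of samples of class $y_{i}$ and $N$ is the total number of samples.

To calculate the likelihood estimate, we count the number of times feature $x_{j}$ appears among all features in all samples of class $y_{i}$:

$$ p(x_{j} \mid y_{i}) = \frac{\text{count}(x_{j}, y_{i})}{\sum_{x \in X} \text{count}(x, y_{i})} $$

where $X$ is the set of all features.

This will result in a big table of occurrences of features for all classes in the training data.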
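The counting procedure above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `train_naive_bayes`, the toy spam/ham samples, and the representation of a sample as a list of tokens are all made up for this example and are not part of the original post.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Estimate priors and likelihoods by counting.

    samples: list of (features, label) pairs, where features is a list of tokens.
    (Hypothetical helper, written only for illustration.)
    """
    class_counts = Counter()               # number of samples per class
    feature_counts = defaultdict(Counter)  # feature_counts[label][feature]

    for features, label in samples:
        class_counts[label] += 1
        feature_counts[label].update(features)

    total = sum(class_counts.values())
    # prior: samples in the class divided by the total number of samples
    priors = {label: n / total for label, n in class_counts.items()}

    # likelihood: count of a feature in the class divided by
    # the count of all features seen in that class
    likelihoods = {}
    for label, counts in feature_counts.items():
        denom = sum(counts.values())
        likelihoods[label] = {f: n / denom for f, n in counts.items()}
    return priors, likelihoods


# Toy training data, invented for this sketch.
toy_samples = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
]
nb_priors, nb_likelihoods = train_naive_bayes(toy_samples)
print(nb_priors)
print(nb_likelihoods["spam"])
```

The `nb_likelihoods` dictionary is the "big table" mentioned above: one row per class, one column per feature seen with that class.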


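When given a new sample to classify, and assuming that it contains features $x_{1}, x_{2}, \ldots, x_{k}$, we need to compute, for each class $y_{i}$:

$$ p(y_{i} \mid x_{1}, x_{2}, \ldots, x_{k}) $$

This is decomposed into:

$$ p(y_{i} \mid x_{1}, x_{2}, \ldots, x_{k}) \propto p(y_{i}) \cdot p(x_{1} \mid y_{i}) \cdot p(x_{2} \mid y_{i}) \cdots p(x_{k} \mid y_{i}) $$

Again, this is calculated for each class $y_{i}$, and we assign to the new observed sample the class that has the highest score.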
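As a rough sketch of this scoring step, the snippet below multiplies the prior by the per-feature likelihoods for every class and picks the best one. Two things here are my own assumptions and not part of the original post: the scores are summed in log-space to avoid numerical underflow, and features never seen with a class get a small `floor` probability instead of zero. The probability tables are hand-written toy values.

```python
import math


def classify(features, priors, likelihoods, floor=1e-9):
    """Return the highest-scoring class and the log-score of every class.

    `floor` is an illustrative stand-in for proper smoothing (assumption).
    """
    scores = {}
    for label, prior in priors.items():
        # log p(y) + sum_j log p(x_j | y)
        score = math.log(prior)
        for feature in features:
            score += math.log(likelihoods[label].get(feature, floor))
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores


# Hand-written toy tables, invented for this sketch.
toy_priors = {"spam": 0.5, "ham": 0.5}
toy_likelihoods = {
    "spam": {"win": 0.2, "money": 0.4, "now": 0.2, "cheap": 0.2},
    "ham": {"meeting": 1 / 6, "at": 1 / 3, "noon": 1 / 3, "lunch": 1 / 6},
}

predicted, class_scores = classify(["cheap", "money"], toy_priors, toy_likelihoods)
print(predicted, class_scores)
```

Working in log-space does not change which class wins, since the logarithm is monotonic; it only turns the product into a sum.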

From Naive Bayes to Hidden Markov Models

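The model presented before predicts a class for a set of features associated with an observation. To predict a class sequence $\vec{y} = (y_{1}, \ldots, y_{n})$ for a sequence of observations $\vec{x} = (x_{1}, \ldots, x_{n})$, a simple sequence model can be formulated as a product over single Naive Bayes models:

$$ p(\vec{y} \mid \vec{x}) \propto \prod_{i=1}^{n} p(y_{i}) \cdot p(x_{i} \mid y_{i}) $$

Two aspects about this model: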

  • there is only one feature at each sequence position, namely the identity of the respective observation, due to the assumption that each feature is generated independently, conditioned on the class $y_{i}$.

  • it doesn’t capture interactions between the observable variables $x_{i}$.

It is however reasonable to assume that there are dependencies between the labels $y_{i}$ at consecutive sequence positions. Remember the example above about part-of-speech tags?

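This is where the first-order Hidden Markov Model appears, introducing the Markov assumption:

"the probability of a particular state is dependent only on the previous state"

$$ p(\vec{y} \mid \vec{x}) \propto \prod_{i=1}^{n} p(y_{i} \mid y_{i-1}) \cdot p(x_{i} \mid y_{i}) $$

which, written in its more general form, summing over all label sequences, gives the probability of the observations:

$$ p(\vec{x}) = \sum_{\vec{y} \in Y} \prod_{i=1}^{n} p(y_{i} \mid y_{i-1}) \cdot p(x_{i} \mid y_{i}) $$

where $Y$ represents the set of all possible label sequences $\vec{y}$, and $y_{0}$ denotes a special start state.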
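To make the factorisation concrete, here is a small sketch that scores one label sequence together with its observations as a product of transition and emission probabilities. The function name, the two-tag toy model, and all the probability values are invented for illustration; the first label uses an initial distribution in place of a transition from a start state.

```python
def joint_probability(labels, words, initial, transition, emission):
    """Product over positions of p(y_i | y_{i-1}) * p(x_i | y_i).

    For the first position, initial[y_1] plays the role of p(y_1 | start).
    (Hypothetical helper with toy parameters, for illustration only.)
    """
    prob = 1.0
    previous = None
    for label, word in zip(labels, words):
        if previous is None:
            prob *= initial[label]
        else:
            prob *= transition[previous][label]
        # unseen (label, word) pairs get probability 0 in this sketch
        prob *= emission[label].get(word, 0.0)
        previous = label
    return prob


# Toy parameters, made up for this example.
toy_initial = {"NOUN": 0.8, "VERB": 0.2}
toy_transition = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
toy_emission = {
    "NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "VERB": {"bark": 0.7, "run": 0.3},
}

sentence = ["dogs", "bark"]
p_noun_verb = joint_probability(["NOUN", "VERB"], sentence, toy_initial, toy_transition, toy_emission)
p_noun_noun = joint_probability(["NOUN", "NOUN"], sentence, toy_initial, toy_transition, toy_emission)
print(p_noun_verb, p_noun_noun)
```

With these toy numbers the NOUN, VERB labelling scores higher than NOUN, NOUN, because the transition term rewards a verb following a noun. That is exactly the dependency the plain product of Naive Bayes models cannot express.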

Hidden Markov Model

A Hidden Markov Model (HMM) is a sequence classifier. As with other machine learning algorithms, it can be trained, i.e., given labeled sequences of observations it learns parameters, which are then used to assign a sequence of labels to a new sequence of observations. Let’s define an HMM framework containing the following components:

  • states (e.g., labels): $T = t_{1}, t_{2}, \ldots, t_{N}$
  • observations (e.g., words): $W = w_{1}, w_{2}, \ldots, w_{N}$
  • two special states: $t_{start}$ and $t_{end}$, which are not associated with any observation

and probabilities relating states and observations:

  • initial probability: an initial probability distribution over states
  • final probability: a final probability distribution over states
  • transition probability: a matrix $A$ with the probabilities of going from one state to another
  • emission probability: a matrix $B$ with the probabilities of an observation being generated from a state

A First-order Hidden Markov Model has the following assumptions:

  • Markov Assumption: the probability of a particular state is dependent only on the previous state. Formally: $P(t_{i} \mid t_{1}, \ldots, t_{i-1}) = P(t_{i} \mid t_{i-1})$

  • Output Independence: the probability of an output observation depends only on the state that produced the observation and not on any other states or any other observations. Formally: $P(w_{i} \mid t_{1}, \ldots, t_{N}, w_{1}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{N}) = P(w_{i} \mid t_{i})$

Notice how the output assumption is closely related to the Naive Bayes classifier presented before. The figure below makes it easier to understand the dependencies and the relationship with the Naive Bayes classifier:

Transitions and Emissions probabilities in the HMM.
(image adapted from CS6501 of the University of Virginia)

We can now define two problems which can be solved by an HMM. The first is learning the parameters from observation sequences and their associated states, that is, training. For instance, given the words of a sentence and the associated part-of-speech tags, one can learn the latent structure.

The other one is applying a trained HMM to an observation sequence, for instance, given a sentence, predicting each word’s part-of-speech tag using the latent structure learned by the HMM from the training data.

Learning: estimating transition and emission matrices

Given an observation sequence $W$ and the associated states $T$, how can we learn the HMM parameters, that is, the matrices $A$ and $B$?

In a supervised HMM scenario this is done by applying the Maximum Likelihood Estimation principle to compute the matrices.

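This is achieved by counting how many times each event occurs in the corpus and normalizing the counts to form proper probability distributions. We need to count 4 quantities which represent the counts of each event in the corpus:

Initial counts:
$$ C_{init}(c_{k}) = \sum_{m=1}^{M} \mathbf{1}(t_{1}^{m} = c_{k}) $$
(how often state $c_{k}$ is the initial state)

Transition counts:
$$ C_{trans}(c_{k}, c_{l}) = \sum_{m=1}^{M} \sum_{i=2}^{N} \mathbf{1}(t_{i}^{m} = c_{k} \wedge t_{i-1}^{m} = c_{l}) $$
(how often state $c_{l}$ transits to another state $c_{k}$)

Final counts:
$$ C_{final}(c_{k}) = \sum_{m=1}^{M} \mathbf{1}(t_{N}^{m} = c_{k}) $$
(how often state $c_{k}$ is the final state)

Emission counts:
$$ C_{emiss}(w_{j}, c_{k}) = \sum_{m=1}^{M} \sum_{i=1}^{N} \mathbf{1}(w_{i}^{m} = w_{j} \wedge t_{i}^{m} = c_{k}) $$
(how often state $c_{k}$ is associated with the observation/word $w_{j}$)

where $M$ is the number of training examples and $N$ the length of the sequence, $t_{i}^{m}$ and $w_{i}^{m}$ are the state and word at position $i$ of training example $m$, $c_{k}$ and $c_{l}$ range over the $K$ possible states, $w_{j}$ ranges over the $J$ words of the vocabulary, and $\mathbf{1}$ is an indicator function that has the value 1 when the particular event happens, and 0 otherwise. The equations scan the training corpus and count how often each event occurs.

All these 4 counts are then normalised in order to have proper probability distributions:

$$ P_{init}(c_{k} \mid \text{start}) = \frac{C_{init}(c_{k})}{\sum_{l=1}^{K} C_{init}(c_{l})} $$

$$ P_{trans}(c_{k} \mid c_{l}) = \frac{C_{trans}(c_{k}, c_{l})}{\sum_{p=1}^{K} C_{trans}(c_{p}, c_{l}) + C_{final}(c_{l})} $$

$$ P_{final}(\text{stop} \mid c_{l}) = \frac{C_{final}(c_{l})}{\sum_{k=1}^{K} C_{trans}(c_{k}, c_{l}) + C_{final}(c_{l})} $$

$$ P_{emiss}(w_{j} \mid c_{k}) = \frac{C_{emiss}(w_{j}, c_{k})}{\sum_{q=1}^{J} C_{emiss}(w_{q}, c_{k})} $$

These equations will produce the transition probability matrix $A$, with the probabilities of going from one label to another, and the emission probability matrix $B$, with the probabilities of an observation being generated from a state.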
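The counting and normalising procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `estimate_hmm`, the nested-dictionary representation of the matrices, and the three-sentence toy corpus are assumptions made for this example. Transition and final probabilities share one denominator, so that for each state the outgoing transitions plus the stopping probability sum to one.

```python
from collections import Counter, defaultdict


def estimate_hmm(tagged_sequences):
    """Maximum likelihood estimates from sequences of (word, state) pairs.

    (Hypothetical helper written for illustration.)
    """
    c_init, c_final = Counter(), Counter()
    c_trans = defaultdict(Counter)  # c_trans[previous_state][state]
    c_emiss = defaultdict(Counter)  # c_emiss[state][word]

    for sequence in tagged_sequences:
        if not sequence:
            continue
        states = [state for _, state in sequence]
        c_init[states[0]] += 1
        c_final[states[-1]] += 1
        for previous, current in zip(states, states[1:]):
            c_trans[previous][current] += 1
        for word, state in sequence:
            c_emiss[state][word] += 1

    n_sequences = sum(c_init.values())
    p_init = {s: n / n_sequences for s, n in c_init.items()}

    p_trans, p_final = {}, {}
    for state in c_emiss:  # every state seen in the corpus
        # each occurrence of a state is followed by a transition or ends a sequence
        denom = sum(c_trans[state].values()) + c_final[state]
        p_trans[state] = {nxt: n / denom for nxt, n in c_trans[state].items()}
        p_final[state] = c_final[state] / denom

    p_emiss = {}
    for state, words in c_emiss.items():
        denom = sum(words.values())
        p_emiss[state] = {w: n / denom for w, n in words.items()}
    return p_init, p_trans, p_final, p_emiss


# Toy tagged corpus, invented for this sketch.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
p_init, p_trans, p_final, p_emiss = estimate_hmm(toy_corpus)
print(p_init)
print(p_trans)
print(p_final)
print(p_emiss)
```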

Laplace smoothing

How will the model handle words not seen during training?

In the presence of an unseen word/observation, $P(w_{i} \mid t_{i}) = 0$, and as a consequence incorrect decisions will be made during the prediction process.

There is a technique to handle these situations called Laplace smoothing or additive smoothing. The idea is that every state will always have a small emission probability of producing an unseen word, denoted, for instance, by UNK. Every time the HMM encounters an unknown word it will use the value $P(\text{UNK} \mid t_{i})$ as the emission probability.
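One possible way to implement this idea is add-one (more generally add-alpha) smoothing over the emission counts, with an extra vocabulary slot reserved for UNK. The sketch below is an assumption about how this could be done, not the exact procedure used by any particular library; the function names, the `alpha` parameter, and the toy counts are invented for illustration.

```python
def smoothed_emissions(emission_counts, alpha=1.0, unk="UNK"):
    """Add-alpha smoothing of emission counts, with one extra slot for `unk`.

    emission_counts: dict mapping state -> dict mapping word -> count.
    (Hypothetical helper written for illustration.)
    """
    # vocabulary: every word seen with any state, plus the unknown-word token
    vocabulary = {w for counts in emission_counts.values() for w in counts} | {unk}
    probs = {}
    for state, counts in emission_counts.items():
        denom = sum(counts.values()) + alpha * len(vocabulary)
        probs[state] = {w: (counts.get(w, 0) + alpha) / denom for w in vocabulary}
    return probs


def emission_prob(probs, state, word, unk="UNK"):
    """Look up P(word | state), falling back to P(UNK | state) for unseen words."""
    return probs[state].get(word, probs[state][unk])


# Toy emission counts, made up for this example.
toy_counts = {
    "NOUN": {"dog": 2, "cat": 1},
    "VERB": {"barks": 1},
}
smoothed = smoothed_emissions(toy_counts)
p_unseen = emission_prob(smoothed, "NOUN", "zebra")
print(smoothed["NOUN"], p_unseen)
```

After smoothing, no state assigns probability zero to any word, so a single unseen word can no longer zero out the score of an entire state sequence.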

Decoding: finding the hidden state sequence for an observation

Given a trained HMM, i.e., the transition and emission matrices $A$ and $B$, and a new observation sequence $W = w_{1}, w_{2}, \ldots, w_{N}$, we want to find the sequence of states $T = t_{1}, t_{2}, \ldots, t_{N}$ that best explains it.

This can be achieved by using the Viterbi algorithm, which finds the best state assignment for the sequence as a whole. There is another algorithm, Posterior Decoding, which consists of picking the state with the highest posterior for each position in the sequence independently.


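The Viterbi algorithm is a dynamic programming algorithm for computing:

$$ \delta_{i}(t) = \underset{t_{1}, \ldots, t_{i-1}}{\max} \ P(t_{1}, \ldots, t_{i-1}, t_{i} = t, w_{1}, \ldots, w_{i}) $$

the score of the best path up to position $i$ ending in state $t$. The Viterbi algorithm tackles the equation above by using the Markov assumption and defining two functions, the recursive score of the best path:

$$ \delta_{i}(t) = \underset{t'}{\max} \ \delta_{i-1}(t') \cdot P(t \mid t') \cdot P(w_{i} \mid t) $$

and the most likely previous state for each state (stored as a back-trace):

$$ \psi_{i}(t) = \underset{t'}{\arg\max} \ \delta_{i-1}(t') \cdot P(t \mid t') \cdot P(w_{i} \mid t) $$

The Viterbi algorithm uses a representation of the HMM called a trellis, which unfolds all possible states for each position and makes the independence assumption explicit: each position only depends on the previous position.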

An unfilled trellis representation of an HMM.

Word emission and state transition probability matrices.

Using the Viterbi algorithm and the emission and transition probability matrices, one can fill in the trellis scores and effectively find the Viterbi path.

A filled trellis representation of an HMM.

The figures above were taken from a Viterbi algorithm example by Roger Levy for the Linguistics/CSE 256 class. You can find the full example here.
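Filling the trellis and following the back-trace can be sketched in Python as below. This is a minimal sketch under several assumptions that are mine and not the post’s: scores are kept in log-space to avoid underflow, zero probabilities are replaced by a tiny `floor` value, the final-state probabilities are ignored, and the two-tag toy model is invented for illustration.

```python
import math


def viterbi(words, states, initial, transition, emission, floor=1e-12):
    """Return the most likely state sequence and its log-score.

    delta[i][t] is the best log-score of any path ending in state t at
    position i; psi[i][t] is the previous state on that path.
    (Hypothetical helper written for illustration.)
    """
    if not words:
        return [], 0.0

    def log(p):
        return math.log(max(p, floor))  # floor avoids log(0)

    delta = [{t: log(initial.get(t, 0.0)) + log(emission[t].get(words[0], 0.0))
              for t in states}]
    psi = [{}]

    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in states:
            # best previous state for reaching t at position i
            best_prev = max(
                states,
                key=lambda p: delta[i - 1][p] + log(transition[p].get(t, 0.0)),
            )
            delta[i][t] = (delta[i - 1][best_prev]
                           + log(transition[best_prev].get(t, 0.0))
                           + log(emission[t].get(words[i], 0.0)))
            psi[i][t] = best_prev

    # pick the best final state, then follow the back-trace
    last = max(states, key=lambda t: delta[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(psi[i][path[-1]])
    path.reverse()
    return path, delta[-1][last]


# Toy model, made up for this example.
hmm_states = ["NOUN", "VERB"]
hmm_initial = {"NOUN": 0.8, "VERB": 0.2}
hmm_transition = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
hmm_emission = {
    "NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "VERB": {"bark": 0.7, "run": 0.3},
}

best_path, best_log_score = viterbi(
    ["dogs", "bark"], hmm_states, hmm_initial, hmm_transition, hmm_emission
)
print(best_path, math.exp(best_log_score))
```

Each position only looks at the scores of the previous position, so the cost grows linearly with the sentence length and quadratically with the number of states, instead of exponentially as it would when enumerating every possible state sequence.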

HMM Important Observations

  • The main idea of this post was to see the connection between the Naive Bayes classifier and the HMM as a sequence classifier

  • If we make the hidden state of the HMM fixed, we will have a Naive Bayes model.

  • There is only one feature at each word/observation in the sequence, namely the identity i.e., the value of the respective observation.

  • Each state $t_{i}$ depends only on its immediate predecessor $t_{i-1}$, that is, each state $t_{i}$ is independent of all its ancestors $t_{1}, \ldots, t_{i-2}$ given its previous state $t_{i-1}$.

  • Each observation variable $w_{i}$ depends only on the current state $t_{i}$.

Software Packages

  • seqlearn: a sequence classification library for Python which includes an implementation of Hidden Markov Models; it follows the sklearn API.

  • NLTK HMM: NLTK also contains a module which implements a Hidden Markov Models framework.

  • lxmls-toolkit: the Natural Language Processing Toolkit used in the Lisbon Machine Learning Summer School also contains an implementation of Hidden Markov Models.



There is also a very good lecture about Sequence Models, given by Noah Smith at LxMLS2016, mainly focusing on Hidden Markov Models and their applications, from sequence learning to language modeling.