When you train a NER system, the most typical evaluation method is to measure precision, recall and f1-score at the token level. These metrics are indeed useful to tune a NER system, but when using the predicted named-entities for downstream tasks, it is more useful to evaluate with metrics at the full named-entity level. In this post I will go through some metrics that go beyond simple token-level performance.

You can find the complete code associated with this blog post on this repository:

You can find more about Named-Entity Recognition here:


Comparing NER system output and golden standard

When comparing the golden standard annotations with the output of a NER system, different scenarios might occur:

I. System surface string and entity type match

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| in | O | in | O |
| New | B-LOC | New | B-LOC |
| York | I-LOC | York | I-LOC |
| . | O | . | O |

II. The system hypothesised an entity

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| an | O | an | O |
| Awful | O | Awful | B-ORG |
| Headache | O | Headache | I-ORG |
| in | O | in | O |

III. The system misses an entity

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| in | O | in | O |
| Palo | B-LOC | Palo | O |
| Alto | I-LOC | Alto | O |
| , | O | , | O |

Note that considering only these 3 scenarios, and discarding every other possible scenario, we have a simple classification evaluation that can be measured in terms of true positives, false positives and false negatives, from which we can subsequently compute precision, recall and f1-score for each named-entity type.
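As a concrete illustration of this strict, exact-match view, here is a minimal sketch (not the code from the repository used later in this post) that treats an entity as a (type, start, end) tuple and derives precision, recall and f1-score per entity type from true positives, false positives and false negatives:

from collections import defaultdict

def strict_scores(gold_entities, pred_entities):
    """Exact-match evaluation: an entity only counts as correct if both
    its boundaries and its type match the golden annotation exactly."""
    counts = defaultdict(lambda: {'tp': 0, 'fp': 0, 'fn': 0})
    gold, pred = set(gold_entities), set(pred_entities)
    for e_type, start, end in pred & gold:
        counts[e_type]['tp'] += 1      # scenario I: exact match
    for e_type, start, end in pred - gold:
        counts[e_type]['fp'] += 1      # predicted entity not in the golden annotations
    for e_type, start, end in gold - pred:
        counts[e_type]['fn'] += 1      # golden entity not found by the system
    scores = {}
    for e_type, c in counts.items():
        precision = c['tp'] / (c['tp'] + c['fp']) if c['tp'] + c['fp'] else 0.0
        recall = c['tp'] / (c['tp'] + c['fn']) if c['tp'] + c['fn'] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[e_type] = {'precision': precision, 'recall': recall, 'f1': f1}
    return scores

# toy example: ('LOC', 1, 2) is a LOC entity spanning tokens 1 to 2
print(strict_scores([('LOC', 1, 2), ('PER', 5, 6)],
                    [('LOC', 1, 2), ('ORG', 8, 9)]))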

But of course we are discarding partial matches, as well as other scenarios where the NER system gets the named-entity surface string correct but the type wrong, and we might also want to evaluate these scenarios, again at a full named-entity level.

IV. The system assigns the wrong entity type

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| I | O | I | O |
| live | O | live | O |
| in | O | in | O |
| Palo | B-LOC | Palo | B-ORG |
| Alto | I-LOC | Alto | I-ORG |
| , | O | , | O |

V. The system gets the boundaries of the surface string wrong

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| Unless | O | Unless | B-PER |
| Karl | B-PER | Karl | I-PER |
| Smith | I-PER | Smith | I-PER |
| resigns | O | resigns | O |

VI. The system gets the boundaries and entity type wrong

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| Unless | O | Unless | B-ORG |
| Karl | B-PER | Karl | I-ORG |
| Smith | I-PER | Smith | I-ORG |
| resigns | O | resigns | O |

How can we incorporate these scenarios into evaluation metrics?

Different Evaluation Schemas

Throughout the years, different NER forums proposed different evaluation metrics:

CoNLL: Conference on Computational Natural Language Learning

The Language-Independent Named Entity Recognition task introduced at CoNLL-2003 measures the performance of the systems in terms of precision, recall and f1-score, where:

“precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities in the corpus found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.”

So basically it only considers scenarios I, II and III; the other scenarios described above are not taken into account in the evaluation.

Automatic Content Extraction (ACE)

The ACE challenges use a more complex evaluation metric which includes a weighting schema. I will not go into detail here, and just point to the papers about it:

I kind of gave up on trying to understand the results and replicate the experiments and baselines from ACE, since the datasets and results are not open and free, so I guess this challenge's results and experiments will fade away with time.

Message Understanding Conference (MUC)

MUC introduced detailed metrics considering different categories of errors; these metrics can be defined in terms of comparing the response of a system against the golden annotation:

  • Correct (COR) : both are the same;
  • Incorrect (INC) : the output of a system and the golden annotation don’t match;
  • Partial (PAR) : system and the golden annotation are somewhat “similar” but not the same;
  • Missing (MIS) : a golden annotation is not captured by a system;
  • Spurious (SPU) : the system produces a response which doesn’t exist in the golden annotation;

These metrics already go beyond the simple strict classification and consider partial matching, for instance. They are also close to covering the scenarios defined at the beginning of this post; we just need a way to consider the differences between NER output and golden annotations along two axes: the surface string and the entity type.
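To make these categories concrete, here is a minimal sketch, with my own simplified Entity tuple and overlap test rather than the original MUC scorer, of how predictions could be bucketed into these categories based on boundary overlap and entity type:

from collections import namedtuple

Entity = namedtuple('Entity', 'e_type start_offset end_offset')

def overlaps(a, b):
    return a.start_offset <= b.end_offset and b.start_offset <= a.end_offset

def muc_categories(gold, pred):
    """Bucket predictions into MUC-style categories based on
    boundary overlap and entity type."""
    counts = {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 0}
    matched_gold = set()
    for p in pred:
        # golden annotations overlapping this prediction and not yet matched
        candidates = [g for g in gold if overlaps(g, p) and g not in matched_gold]
        if not candidates:
            counts['spurious'] += 1     # no corresponding golden annotation
            continue
        g = candidates[0]
        matched_gold.add(g)
        if (g.start_offset, g.end_offset) == (p.start_offset, p.end_offset):
            if g.e_type == p.e_type:
                counts['correct'] += 1      # same boundaries, same type
            else:
                counts['incorrect'] += 1    # same boundaries, wrong type
        else:
            counts['partial'] += 1          # boundaries only partially overlap
    counts['missed'] = len([g for g in gold if g not in matched_gold])
    return counts

gold = [Entity('PER', 37, 39), Entity('ORG', 45, 46)]
pred = [Entity('PER', 37, 39), Entity('LOC', 45, 46), Entity('MISC', 60, 61)]
print(muc_categories(gold, pred))
# {'correct': 1, 'incorrect': 1, 'partial': 0, 'missed': 0, 'spurious': 1}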

An implementation of the MUC evaluation metrics can be found here:

International Workshop on Semantic Evaluation (SemEval)

SemEval’13 introduced four different ways to measure precision/recall/f1-score results, based on the metrics defined by MUC:

  • Strict: exact boundary match over the surface string and exact entity type match;
  • Exact: exact boundary match over the surface string, regardless of the type;
  • Partial: partial boundary match over the surface string, regardless of the type;
  • Type: the entity type must match, and some overlap between the system-tagged entity and the gold annotation is required;

Each of these ways to measure performance accounts for correct, incorrect, partial, missed and spurious matches in different ways. Let's look in detail at how each of the scenarios described above is categorised under each evaluation schema; a small code sketch of this mapping follows the table.

| Scenario | Gold Entity Type | Gold Surface String | Predicted Entity Type | Predicted Surface String | Type | Partial | Exact | Strict |
|---|---|---|---|---|---|---|---|---|
| III | brand | TIKOSYN |  |  | MIS | MIS | MIS | MIS |
| II |  |  | brand | healthy | SPU | SPU | SPU | SPU |
| V | drug | warfarin | drug | of warfarin | COR | PAR | INC | INC |
| IV | drug | propranolol | brand | propranolol | INC | COR | COR | INC |
| I | drug | phenytoin | drug | phenytoin | COR | COR | COR | COR |
| I | drug | theophylline | drug | theophylline | COR | COR | COR | COR |
| VI | group | contraceptives | drug | oral contraceptives | INC | PAR | INC | INC |
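The mapping shown in the table can be written down as a small helper; this is just a sketch, and it assumes the golden and predicted entities already overlap (missed and spurious entities are handled separately):

def schema_outcomes(gold, pred):
    """Outcome of one overlapping gold/prediction pair under the four schemas.
    Both arguments are (entity_type, surface_string) pairs, e.g. ('drug', 'warfarin')."""
    same_type = gold[0] == pred[0]
    same_span = gold[1] == pred[1]
    return {
        'strict':  'COR' if same_type and same_span else 'INC',
        'exact':   'COR' if same_span else 'INC',
        'partial': 'COR' if same_span else 'PAR',
        'type':    'COR' if same_type else 'INC',
    }

print(schema_outcomes(('drug', 'warfarin'), ('drug', 'of warfarin')))
# {'strict': 'INC', 'exact': 'INC', 'partial': 'PAR', 'type': 'COR'}  -> scenario V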

Then precision/recall/f1-score are calculated for each evaluation schema. In order to do that, two more quantities need to be calculated:

Number of gold-standard annotations contributing to the final score

$$\text{POSSIBLE} (POS) = COR + INC + PAR + MIS = TP + FN $$

Number of annotations produced by the NER system:

$$\text{ACTUAL} (ACT) = COR + INC + PAR + SPU = TP + FP$$

Then we can compute precision/recall/f1-score, where, roughly speaking, precision is the percentage of named-entities found by the NER system that are correct, and recall is the percentage of the named-entities in the golden annotations that are retrieved by the NER system. This is computed in two different ways depending on whether we want an exact match (i.e., strict and exact) or a partial match (i.e., partial and type) scenario:

Exact Match (i.e., strict and exact)

$$\text{Precision} = \frac{COR}{ACT} = \frac{TP}{TP+FP}$$
$$\text{Recall} = \frac{COR}{POS} = \frac{TP}{TP+FN}$$

Partial Match (i.e., partial and type)

$$\text{Precision} = \frac{COR\ +\ 0.5\ \times\ PAR}{ACT}$$
$$\text{Recall} = \frac{COR\ +\ 0.5\ \times\ PAR}{POS}$$
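Expressed in code, a minimal sketch of these formulas, where the counts dict mirrors the MUC categories above:

def precision_recall(counts, partial_or_type=False):
    """Precision/recall/f1 for one evaluation schema from MUC-style counts."""
    correct, partial = counts['correct'], counts['partial']
    possible = correct + counts['incorrect'] + partial + counts['missed']    # POS = TP + FN
    actual = correct + counts['incorrect'] + partial + counts['spurious']    # ACT = TP + FP
    if partial_or_type:
        # partial matches count for half
        precision = (correct + 0.5 * partial) / actual if actual else 0.0
        recall = (correct + 0.5 * partial) / possible if possible else 0.0
    else:
        precision = correct / actual if actual else 0.0
        recall = correct / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1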

Putting it all together:

| Measure | Type | Partial | Exact | Strict |
|---|---|---|---|---|
| Correct | 3 | 3 | 3 | 2 |
| Incorrect | 2 | 0 | 2 | 3 |
| Partial | 0 | 2 | 0 | 0 |
| Missed | 1 | 1 | 1 | 1 |
| Spurious | 1 | 1 | 1 | 1 |
| Precision | 0.5 | 0.66 | 0.5 | 0.33 |
| Recall | 0.5 | 0.66 | 0.5 | 0.33 |
| F1 | 0.5 | 0.66 | 0.5 | 0.33 |
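Plugging the counts from the table into the helper sketched above reproduces the precision and recall values, shown here for the Partial and Strict schemas:

partial_counts = {'correct': 3, 'incorrect': 0, 'partial': 2, 'missed': 1, 'spurious': 1}
strict_counts = {'correct': 2, 'incorrect': 3, 'partial': 0, 'missed': 1, 'spurious': 1}

print(precision_recall(partial_counts, partial_or_type=True))   # (0.666..., 0.666..., 0.666...)
print(precision_recall(strict_counts))                          # (0.333..., 0.333..., 0.333...)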

Code

I did a small experiment using the sklearn-crfsuite wrapper around CRFsuite to train a NER system over the CoNLL 2002 Spanish data. Next, I evaluate the trained CRF over the test data and show the performance with the different metrics:

Note you can find the complete code for this blog post on this repository:

Example

import nltk
import sklearn_crfsuite

from copy import deepcopy
from collections import defaultdict

from sklearn_crfsuite import metrics

from ner_evaluation import collect_named_entities
from ner_evaluation import compute_metrics

Train a CRF on the CoNLL 2002 NER Spanish data

nltk.corpus.conll2002.fileids()
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

Feature Extraction

%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
CPU times: user 1.12 s, sys: 98.2 ms, total: 1.22 s
Wall time: 1.22 s

Training

%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
CPU times: user 34.1 s, sys: 197 ms, total: 34.3 s
Wall time: 34.4 s

Performance per label type per token

y_pred = crf.predict(X_test)
labels = list(crf.classes_)
labels.remove('O') # remove 'O' label from evaluation
sorted_labels = sorted(labels,key=lambda name: (name[1:], name[0])) # group B and I results
print(sklearn_crfsuite.metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=3))
             precision    recall  f1-score   support

      B-LOC      0.810     0.784     0.797      1084
      I-LOC      0.690     0.637     0.662       325
     B-MISC      0.731     0.569     0.640       339
     I-MISC      0.699     0.589     0.639       557
      B-ORG      0.807     0.832     0.820      1400
      I-ORG      0.852     0.786     0.818      1104
      B-PER      0.850     0.884     0.867       735
      I-PER      0.893     0.943     0.917       634

avg / total      0.809     0.787     0.796      6178

Performance over full named-entity

test_sents_labels = []
for sentence in test_sents:
    sentence = [token[2] for token in sentence]
    test_sents_labels.append(sentence)
index = 2
true = collect_named_entities(test_sents_labels[index])
pred = collect_named_entities(y_pred[index])
true
[Entity(e_type='MISC', start_offset=12, end_offset=12),
 Entity(e_type='LOC', start_offset=15, end_offset=15),
 Entity(e_type='PER', start_offset=37, end_offset=39),
 Entity(e_type='ORG', start_offset=45, end_offset=46)]
pred
[Entity(e_type='MISC', start_offset=12, end_offset=12),
 Entity(e_type='LOC', start_offset=15, end_offset=15),
 Entity(e_type='PER', start_offset=37, end_offset=39),
 Entity(e_type='LOC', start_offset=45, end_offset=46)]
compute_metrics(true, pred)
({'ent_type': {'actual': 4,
   'correct': 3,
   'incorrect': 1,
   'missed': 0,
   'partial': 0,
   'possible': 4,
   'precision': 0.75,
   'recall': 0.75,
   'spurius': 0},
  'strict': {'actual': 4,
   'correct': 3,
   'incorrect': 1,
   'missed': 0,
   'partial': 0,
   'possible': 4,
   'precision': 0.75,
   'recall': 0.75,
   'spurius': 0}},
 {'LOC': {'ent_type': {'correct': 1,
    'incorrect': 1,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 1,
    'incorrect': 1,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'MISC': {'ent_type': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'ORG': {'ent_type': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'PER': {'ent_type': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}}})
to_test = [2,4,12,14]
index = 2
true_named_entities_type = defaultdict(list)
pred_named_entities_type = defaultdict(list)

for true in collect_named_entities(test_sents_labels[index]):
    true_named_entities_type[true.e_type].append(true)

for pred in collect_named_entities(y_pred[index]):
    pred_named_entities_type[pred.e_type].append(pred)
true_named_entities_type
defaultdict(list,
            {'LOC': [Entity(e_type='LOC', start_offset=15, end_offset=15)],
             'MISC': [Entity(e_type='MISC', start_offset=12, end_offset=12)],
             'ORG': [Entity(e_type='ORG', start_offset=45, end_offset=46)],
             'PER': [Entity(e_type='PER', start_offset=37, end_offset=39)]})
pred_named_entities_type
defaultdict(list,
            {'LOC': [Entity(e_type='LOC', start_offset=15, end_offset=15),
              Entity(e_type='LOC', start_offset=45, end_offset=46)],
             'MISC': [Entity(e_type='MISC', start_offset=12, end_offset=12)],
             'PER': [Entity(e_type='PER', start_offset=37, end_offset=39)]})
true_named_entities_type['LOC']
[Entity(e_type='LOC', start_offset=15, end_offset=15)]
pred_named_entities_type['LOC']
[Entity(e_type='LOC', start_offset=15, end_offset=15),
 Entity(e_type='LOC', start_offset=45, end_offset=46)]
compute_metrics(true_named_entities_type['LOC'], pred_named_entities_type['LOC'])
({'ent_type': {'actual': 2,
   'correct': 1,
   'incorrect': 0,
   'missed': 0,
   'partial': 0,
   'possible': 1,
   'precision': 0.5,
   'recall': 1.0,
   'spurius': 1},
  'strict': {'actual': 2,
   'correct': 1,
   'incorrect': 0,
   'missed': 0,
   'partial': 0,
   'possible': 1,
   'precision': 0.5,
   'recall': 1.0,
   'spurius': 1}},
 {'LOC': {'ent_type': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 1},
   'strict': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 1}},
  'MISC': {'ent_type': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'ORG': {'ent_type': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'PER': {'ent_type': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}}})

Results over all messages

metrics_results = {'correct': 0, 'incorrect': 0, 'partial': 0,
                   'missed': 0, 'spurius': 0, 'possible': 0, 'actual': 0}

# overall results
results = {'strict': deepcopy(metrics_results),
           'ent_type': deepcopy(metrics_results)
           }

# results aggregated by entity type
evaluation_agg_entities_type = {e: deepcopy(results) for e in ['LOC','PER','ORG','MISC']}

for true_ents, pred_ents in zip(test_sents_labels, y_pred):    
    # compute results for one message
    tmp_results, tmp_agg_results = compute_metrics(collect_named_entities(true_ents),collect_named_entities(pred_ents))

    # aggregate overall results
    for eval_schema in results.keys():
        for metric in metrics_results.keys():
            results[eval_schema][metric] += tmp_results[eval_schema][metric]

    # aggregate results by entity type
    for e_type in ['LOC','PER','ORG','MISC']:
        for eval_schema in tmp_agg_results[e_type]:
            for metric in tmp_agg_results[e_type][eval_schema]:
                evaluation_agg_entities_type[e_type][eval_schema][metric] += tmp_agg_results[e_type][eval_schema][metric]

results
{'ent_type': {'actual': 3518,
  'correct': 2909,
  'incorrect': 564,
  'missed': 111,
  'partial': 0,
  'possible': 3584,
  'spurius': 45},
 'strict': {'actual': 3518,
  'correct': 2779,
  'incorrect': 694,
  'missed': 111,
  'partial': 0,
  'possible': 3584,
  'spurius': 45}}
evaluation_agg_entities_type
{'LOC': {'ent_type': {'actual': 0,
   'correct': 861,
   'incorrect': 180,
   'missed': 32,
   'partial': 0,
   'possible': 0,
   'spurius': 5},
  'strict': {'actual': 0,
   'correct': 840,
   'incorrect': 201,
   'missed': 32,
   'partial': 0,
   'possible': 0,
   'spurius': 5}},
 'MISC': {'ent_type': {'actual': 0,
   'correct': 211,
   'incorrect': 46,
   'missed': 33,
   'partial': 0,
   'possible': 0,
   'spurius': 7},
  'strict': {'actual': 0,
   'correct': 173,
   'incorrect': 84,
   'missed': 33,
   'partial': 0,
   'possible': 0,
   'spurius': 7}},
 'ORG': {'ent_type': {'actual': 0,
   'correct': 1181,
   'incorrect': 231,
   'missed': 34,
   'partial': 0,
   'possible': 0,
   'spurius': 31},
  'strict': {'actual': 0,
   'correct': 1120,
   'incorrect': 292,
   'missed': 34,
   'partial': 0,
   'possible': 0,
   'spurius': 31}},
 'PER': {'ent_type': {'actual': 0,
   'correct': 656,
   'incorrect': 107,
   'missed': 12,
   'partial': 0,
   'possible': 0,
   'spurius': 2},
  'strict': {'actual': 0,
   'correct': 646,
   'incorrect': 117,
   'missed': 12,
   'partial': 0,
   'possible': 0,
   'spurius': 2}}}
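Finally, the aggregated counts can be turned into overall precision and recall for each evaluation schema with the formulas described earlier. A minimal sketch, assuming the `results` dict produced by the aggregation loop above (note that partial is 0 here, so the exact-match and partial-match formulas coincide):

for eval_schema, counts in results.items():
    precision = counts['correct'] / counts['actual']
    recall = counts['correct'] / counts['possible']
    f1 = 2 * precision * recall / (precision + recall)
    print(eval_schema, round(precision, 3), round(recall, 3), round(f1, 3))
# ent_type: precision ~0.827, recall ~0.812
# strict:   precision ~0.790, recall ~0.775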

References