When you train a NER system, the most typical evaluation method is to measure precision, recall and f1-score at the token level. These metrics are indeed useful to tune a NER system, but when using the predicted named-entities for downstream tasks, it is more useful to evaluate with metrics at the full named-entity level. In this post I will go through some metrics that go beyond simple token-level performance.

You can find the complete code associated with this blog post on this repository:

You can find more about Named-Entity Recognition here:


Comparing NER system output and golden standard

When comparing the golden standard annotations with the output of a NER system, different scenarios might occur:

I. System surface string and entity type match

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| in | O | in | O |
| New | B-LOC | New | B-LOC |
| York | I-LOC | York | I-LOC |
| . | O | . | O |

II. The system hypothesised an entity

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| an | O | an | O |
| Awful | O | Awful | B-ORG |
| Headache | O | Headache | I-ORG |
| in | O | in | O |

III. The system misses an entity

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| in | O | in | O |
| Palo | B-LOC | Palo | O |
| Alto | I-LOC | Alto | O |
| , | O | , | O |

Note that considering only these 3 scenarios, and discarding every other possible scenario, we have a simple classification evaluation that can be measured in terms of true positives, false positives and false negatives, from which we can subsequently compute precision, recall and f1-score for each named-entity type.
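As a concrete illustration of this strict, exact-match view, here is a minimal sketch (not the code from the repository used later in this post) that treats an entity as a (type, start, end) tuple and derives precision, recall and f1-score per entity type from true positives, false positives and false negatives:

from collections import defaultdict

def strict_scores(gold_entities, pred_entities):
    """Exact-match evaluation: an entity only counts as correct if both
    its boundaries and its type match the golden annotation exactly."""
    counts = defaultdict(lambda: {'tp': 0, 'fp': 0, 'fn': 0})
    gold, pred = set(gold_entities), set(pred_entities)
    for e_type, start, end in pred & gold:
        counts[e_type]['tp'] += 1      # scenario I: exact match
    for e_type, start, end in pred - gold:
        counts[e_type]['fp'] += 1      # predicted entity not in the golden annotations
    for e_type, start, end in gold - pred:
        counts[e_type]['fn'] += 1      # golden entity not found by the system
    scores = {}
    for e_type, c in counts.items():
        precision = c['tp'] / (c['tp'] + c['fp']) if c['tp'] + c['fp'] else 0.0
        recall = c['tp'] / (c['tp'] + c['fn']) if c['tp'] + c['fn'] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[e_type] = {'precision': precision, 'recall': recall, 'f1': f1}
    return scores

# toy example: ('LOC', 1, 2) is a LOC entity spanning tokens 1 to 2
print(strict_scores([('LOC', 1, 2), ('PER', 5, 6)],
                    [('LOC', 1, 2), ('ORG', 8, 9)]))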

But of course we are discarding partial matches, as well as other scenarios where the NER system gets the named-entity surface string correct but the type wrong, and we might also want to evaluate these scenarios, again at a full named-entity level.

IV. The system assigns the wrong entity type

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| I | O | I | O |
| live | O | live | O |
| in | O | in | O |
| Palo | B-LOC | Palo | B-ORG |
| Alto | I-LOC | Alto | I-ORG |
| , | O | , | O |

V. The system gets the boundaries of the surface string wrong

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| Unless | O | Unless | B-PER |
| Karl | B-PER | Karl | I-PER |
| Smith | I-PER | Smith | I-PER |
| resigns | O | resigns | O |

VI. The system gets the boundaries and entity type wrong

| Gold Surface String | Gold Entity Type | Predicted Surface String | Predicted Entity Type |
|---|---|---|---|
| Unless | O | Unless | B-ORG |
| Karl | B-PER | Karl | I-ORG |
| Smith | I-PER | Smith | I-ORG |
| resigns | O | resigns | O |

How can we incorporate these scenarios into evaluation metrics?

Different Evaluation Schemas

Throughout the years, different NER forums proposed different evaluation metrics:

CoNLL: Conference on Computational Natural Language Learning

The Language-Independent Named Entity Recognition task introduced at CoNLL-2003 measures the performance of the systems in terms of precision, recall and f1-score, where:

“precision is the percentage of named entities found by the learning system that are correct. Recall is the percentage of named entities in the corpus found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.”

So basically it only considers scenarios I, II and III; the other scenarios described above are not taken into account in the evaluation.

Automatic Content Extraction (ACE)

The ACE challenges use a more complex evaluation metric which includes a weighting schema. I will not go into detail here, and just point to the papers about it:

I kind of gave up on trying to understand the results and replicate the experiments and baselines from ACE, since the datasets and results are not open and free, so I guess this challenge's results and experiments will fade away with time.

Message Understanding Conference (MUC)

MUC introduced detailed metrics considering different categories of errors; these metrics can be defined in terms of comparing the response of a system against the golden annotation:

  • Correct (COR) : both are the same;
  • Incorrect (INC) : the output of a system and the golden annotation don’t match;
  • Partial (PAR) : system and the golden annotation are somewhat “similar” but not the same;
  • Missing (MIS) : a golden annotation is not captured by a system;
  • Spurious (SPU) : the system produces a response which doesn’t exist in the golden annotation;

These metrics already go beyond the simple strict classification and consider partial matching, for instance. They are also close to covering the scenarios defined at the beginning of this post; we just need a way to consider the differences between NER output and golden annotations along two axes: the surface string and the entity type.
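To make these categories concrete, here is a minimal sketch, with my own simplified Entity tuple and overlap test rather than the original MUC scorer, of how predictions could be bucketed into these categories based on boundary overlap and entity type:

from collections import namedtuple

Entity = namedtuple('Entity', 'e_type start_offset end_offset')

def overlaps(a, b):
    return a.start_offset <= b.end_offset and b.start_offset <= a.end_offset

def muc_categories(gold, pred):
    """Bucket predictions into MUC-style categories based on
    boundary overlap and entity type."""
    counts = {'correct': 0, 'incorrect': 0, 'partial': 0, 'missed': 0, 'spurious': 0}
    matched_gold = set()
    for p in pred:
        # golden annotations overlapping this prediction and not yet matched
        candidates = [g for g in gold if overlaps(g, p) and g not in matched_gold]
        if not candidates:
            counts['spurious'] += 1     # no corresponding golden annotation
            continue
        g = candidates[0]
        matched_gold.add(g)
        if (g.start_offset, g.end_offset) == (p.start_offset, p.end_offset):
            if g.e_type == p.e_type:
                counts['correct'] += 1      # same boundaries, same type
            else:
                counts['incorrect'] += 1    # same boundaries, wrong type
        else:
            counts['partial'] += 1          # boundaries only partially overlap
    counts['missed'] = len([g for g in gold if g not in matched_gold])
    return counts

gold = [Entity('PER', 37, 39), Entity('ORG', 45, 46)]
pred = [Entity('PER', 37, 39), Entity('LOC', 45, 46), Entity('MISC', 60, 61)]
print(muc_categories(gold, pred))
# {'correct': 1, 'incorrect': 1, 'partial': 0, 'missed': 0, 'spurious': 1}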

An implementation of the MUC evaluation metrics can be found here:

International Workshop on Semantic Evaluation (SemEval)

SemEval’13 introduced four different ways to measure precision/recall/f1-score results, based on the metrics defined by MUC:

  • Strict: exact boundary match over the surface string and exact entity type match;
  • Exact: exact boundary match over the surface string, regardless of the type;
  • Partial: partial boundary match over the surface string, regardless of the type;
  • Type: the entity type must match, and some overlap between the system-tagged entity and the gold annotation is required;

Each of these ways to measure performance accounts for correct, incorrect, partial, missed and spurious matches in different ways. Let's look in detail at how each of the scenarios described above is categorised under each evaluation schema; a small code sketch of this mapping follows the table.

| Scenario | Gold Entity Type | Gold Surface String | Predicted Entity Type | Predicted Surface String | Type | Partial | Exact | Strict |
|---|---|---|---|---|---|---|---|---|
| III | brand | TIKOSYN |  |  | MIS | MIS | MIS | MIS |
| II |  |  | brand | healthy | SPU | SPU | SPU | SPU |
| V | drug | warfarin | drug | of warfarin | COR | PAR | INC | INC |
| IV | drug | propranolol | brand | propranolol | INC | COR | COR | INC |
| I | drug | phenytoin | drug | phenytoin | COR | COR | COR | COR |
| I | drug | theophylline | drug | theophylline | COR | COR | COR | COR |
| VI | group | contraceptives | drug | oral contraceptives | INC | PAR | INC | INC |
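The mapping shown in the table can be written down as a small helper; this is just a sketch, and it assumes the golden and predicted entities already overlap (missed and spurious entities are handled separately):

def schema_outcomes(gold, pred):
    """Outcome of one overlapping gold/prediction pair under the four schemas.
    Both arguments are (entity_type, surface_string) pairs, e.g. ('drug', 'warfarin')."""
    same_type = gold[0] == pred[0]
    same_span = gold[1] == pred[1]
    return {
        'strict':  'COR' if same_type and same_span else 'INC',
        'exact':   'COR' if same_span else 'INC',
        'partial': 'COR' if same_span else 'PAR',
        'type':    'COR' if same_type else 'INC',
    }

print(schema_outcomes(('drug', 'warfarin'), ('drug', 'of warfarin')))
# {'strict': 'INC', 'exact': 'INC', 'partial': 'PAR', 'type': 'COR'}  -> scenario V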

Then precision/recall/f1-score are calculated for each evaluation schema. In order to do that, two more quantities need to be calculated:

Number of gold-standard annotations contributing to the final score

$$\text{POSSIBLE} (POS) = COR + INC + PAR + MIS = TP + FN $$

Number of annotations produced by the NER system:

$$\text{ACTUAL} (ACT) = COR + INC + PAR + SPU = TP + FP$$

Then we can compute precision/recall/f1-score, where, roughly speaking, precision is the percentage of named-entities found by the NER system that are correct, and recall is the percentage of the named-entities in the golden annotations that are retrieved by the NER system. This is computed in two different ways depending on whether we want an exact match (i.e., strict and exact) or a partial match (i.e., partial and type) scenario:

Exact Match (i.e., strict and exact)

$$\text{Precision} = \frac{COR}{ACT} = \frac{TP}{TP+FP}$$
$$\text{Recall} = \frac{COR}{POS} = \frac{TP}{TP+FN}$$

Partial Match (i.e., partial and type)

$$\text{Precision} = \frac{COR\ +\ 0.5\ \times\ PAR}{ACT}$$
$$\text{Recall} = \frac{COR\ +\ 0.5\ \times\ PAR}{POS}$$
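Expressed in code, a minimal sketch of these formulas, where the counts dict mirrors the MUC categories above:

def precision_recall(counts, partial_or_type=False):
    """Precision/recall/f1 for one evaluation schema from MUC-style counts."""
    correct, partial = counts['correct'], counts['partial']
    possible = correct + counts['incorrect'] + partial + counts['missed']    # POS = TP + FN
    actual = correct + counts['incorrect'] + partial + counts['spurious']    # ACT = TP + FP
    if partial_or_type:
        # partial matches count for half
        precision = (correct + 0.5 * partial) / actual if actual else 0.0
        recall = (correct + 0.5 * partial) / possible if possible else 0.0
    else:
        precision = correct / actual if actual else 0.0
        recall = correct / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1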

Putting it all together:

| Measure | Type | Partial | Exact | Strict |
|---|---|---|---|---|
| Correct | 3 | 3 | 3 | 2 |
| Incorrect | 2 | 0 | 2 | 3 |
| Partial | 0 | 2 | 0 | 0 |
| Missed | 1 | 1 | 1 | 1 |
| Spurious | 1 | 1 | 1 | 1 |
| Precision | 0.5 | 0.66 | 0.5 | 0.33 |
| Recall | 0.5 | 0.66 | 0.5 | 0.33 |
| F1 | 0.5 | 0.66 | 0.5 | 0.33 |
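Plugging the counts from the table into the helper sketched above reproduces the precision and recall values, shown here for the Partial and Strict schemas:

partial_counts = {'correct': 3, 'incorrect': 0, 'partial': 2, 'missed': 1, 'spurious': 1}
strict_counts = {'correct': 2, 'incorrect': 3, 'partial': 0, 'missed': 1, 'spurious': 1}

print(precision_recall(partial_counts, partial_or_type=True))   # (0.666..., 0.666..., 0.666...)
print(precision_recall(strict_counts))                          # (0.333..., 0.333..., 0.333...)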

Code

I did a small experiment using the sklearn-crfsuite wrapper around CRFsuite to train a NER system over the CoNLL 2002 Spanish data. Next, I evaluate the trained CRF over the test data and show the performance with the different metrics:

Note you can find the complete code for this blog post on this repository:

Example

import nltk
import sklearn_crfsuite

from copy import deepcopy
from collections import defaultdict

from sklearn_crfsuite import metrics

from ner_evaluation import collect_named_entities
from ner_evaluation import compute_metrics

Train a CRF on the CoNLL 2002 NER Spanish data

nltk.corpus.conll2002.fileids()
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

Feature Extraction

%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
CPU times: user 1.12 s, sys: 98.2 ms, total: 1.22 s
Wall time: 1.22 s

Training

%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
CPU times: user 34.1 s, sys: 197 ms, total: 34.3 s
Wall time: 34.4 s

Performance per label type per token

y_pred = crf.predict(X_test)
labels = list(crf.classes_)
labels.remove('O') # remove 'O' label from evaluation
sorted_labels = sorted(labels,key=lambda name: (name[1:], name[0])) # group B and I results
print(sklearn_crfsuite.metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=3))
             precision    recall  f1-score   support

      B-LOC      0.810     0.784     0.797      1084
      I-LOC      0.690     0.637     0.662       325
     B-MISC      0.731     0.569     0.640       339
     I-MISC      0.699     0.589     0.639       557
      B-ORG      0.807     0.832     0.820      1400
      I-ORG      0.852     0.786     0.818      1104
      B-PER      0.850     0.884     0.867       735
      I-PER      0.893     0.943     0.917       634

avg / total      0.809     0.787     0.796      6178

Performance over full named-entity

test_sents_labels = []
for sentence in test_sents:
    sentence = [token[2] for token in sentence]
    test_sents_labels.append(sentence)
index = 2
true = collect_named_entities(test_sents_labels[index])
pred = collect_named_entities(y_pred[index])
true
[Entity(e_type='MISC', start_offset=12, end_offset=12),
 Entity(e_type='LOC', start_offset=15, end_offset=15),
 Entity(e_type='PER', start_offset=37, end_offset=39),
 Entity(e_type='ORG', start_offset=45, end_offset=46)]
pred
[Entity(e_type='MISC', start_offset=12, end_offset=12),
 Entity(e_type='LOC', start_offset=15, end_offset=15),
 Entity(e_type='PER', start_offset=37, end_offset=39),
 Entity(e_type='LOC', start_offset=45, end_offset=46)]
compute_metrics(true, pred)
({'ent_type': {'actual': 4,
   'correct': 3,
   'incorrect': 1,
   'missed': 0,
   'partial': 0,
   'possible': 4,
   'precision': 0.75,
   'recall': 0.75,
   'spurius': 0},
  'strict': {'actual': 4,
   'correct': 3,
   'incorrect': 1,
   'missed': 0,
   'partial': 0,
   'possible': 4,
   'precision': 0.75,
   'recall': 0.75,
   'spurius': 0}},
 {'LOC': {'ent_type': {'correct': 1,
    'incorrect': 1,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 1,
    'incorrect': 1,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'MISC': {'ent_type': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'ORG': {'ent_type': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'PER': {'ent_type': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}}})
to_test = [2,4,12,14]
index = 2
true_named_entities_type = defaultdict(list)
pred_named_entities_type = defaultdict(list)

for true in collect_named_entities(test_sents_labels[index]):
    true_named_entities_type[true.e_type].append(true)

for pred in collect_named_entities(y_pred[index]):
    pred_named_entities_type[pred.e_type].append(pred)
true_named_entities_type
defaultdict(list,
            {'LOC': [Entity(e_type='LOC', start_offset=15, end_offset=15)],
             'MISC': [Entity(e_type='MISC', start_offset=12, end_offset=12)],
             'ORG': [Entity(e_type='ORG', start_offset=45, end_offset=46)],
             'PER': [Entity(e_type='PER', start_offset=37, end_offset=39)]})
pred_named_entities_type
defaultdict(list,
            {'LOC': [Entity(e_type='LOC', start_offset=15, end_offset=15),
              Entity(e_type='LOC', start_offset=45, end_offset=46)],
             'MISC': [Entity(e_type='MISC', start_offset=12, end_offset=12)],
             'PER': [Entity(e_type='PER', start_offset=37, end_offset=39)]})
true_named_entities_type['LOC']
[Entity(e_type='LOC', start_offset=15, end_offset=15)]
pred_named_entities_type['LOC']
[Entity(e_type='LOC', start_offset=15, end_offset=15),
 Entity(e_type='LOC', start_offset=45, end_offset=46)]
compute_metrics(true_named_entities_type['LOC'], pred_named_entities_type['LOC'])
({'ent_type': {'actual': 2,
   'correct': 1,
   'incorrect': 0,
   'missed': 0,
   'partial': 0,
   'possible': 1,
   'precision': 0.5,
   'recall': 1.0,
   'spurius': 1},
  'strict': {'actual': 2,
   'correct': 1,
   'incorrect': 0,
   'missed': 0,
   'partial': 0,
   'possible': 1,
   'precision': 0.5,
   'recall': 1.0,
   'spurius': 1}},
 {'LOC': {'ent_type': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 1},
   'strict': {'correct': 1,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 1}},
  'MISC': {'ent_type': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'ORG': {'ent_type': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}},
  'PER': {'ent_type': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0},
   'strict': {'correct': 0,
    'incorrect': 0,
    'missed': 0,
    'partial': 0,
    'spurius': 0}}})

Results over all messages

metrics_results = {'correct': 0, 'incorrect': 0, 'partial': 0,
                   'missed': 0, 'spurius': 0, 'possible': 0, 'actual': 0}

# overall results
results = {'strict': deepcopy(metrics_results),
           'ent_type': deepcopy(metrics_results)
           }

# results aggregated by entity type
evaluation_agg_entities_type = {e: deepcopy(results) for e in ['LOC','PER','ORG','MISC']}

for true_ents, pred_ents in zip(test_sents_labels, y_pred):    
    # compute results for one message
    tmp_results, tmp_agg_results = compute_metrics(collect_named_entities(true_ents),collect_named_entities(pred_ents))

    # aggregate overall results
    for eval_schema in results.keys():
        for metric in metrics_results.keys():
            results[eval_schema][metric] += tmp_results[eval_schema][metric]

    # aggregate results by entity type
    for e_type in ['LOC','PER','ORG','MISC']:
        for eval_schema in tmp_agg_results[e_type]:
            for metric in tmp_agg_results[e_type][eval_schema]:
                evaluation_agg_entities_type[e_type][eval_schema][metric] += tmp_agg_results[e_type][eval_schema][metric]

results
{'ent_type': {'actual': 3518,
  'correct': 2909,
  'incorrect': 564,
  'missed': 111,
  'partial': 0,
  'possible': 3584,
  'spurius': 45},
 'strict': {'actual': 3518,
  'correct': 2779,
  'incorrect': 694,
  'missed': 111,
  'partial': 0,
  'possible': 3584,
  'spurius': 45}}
evaluation_agg_entities_type
{'LOC': {'ent_type': {'actual': 0,
   'correct': 861,
   'incorrect': 180,
   'missed': 32,
   'partial': 0,
   'possible': 0,
   'spurius': 5},
  'strict': {'actual': 0,
   'correct': 840,
   'incorrect': 201,
   'missed': 32,
   'partial': 0,
   'possible': 0,
   'spurius': 5}},
 'MISC': {'ent_type': {'actual': 0,
   'correct': 211,
   'incorrect': 46,
   'missed': 33,
   'partial': 0,
   'possible': 0,
   'spurius': 7},
  'strict': {'actual': 0,
   'correct': 173,
   'incorrect': 84,
   'missed': 33,
   'partial': 0,
   'possible': 0,
   'spurius': 7}},
 'ORG': {'ent_type': {'actual': 0,
   'correct': 1181,
   'incorrect': 231,
   'missed': 34,
   'partial': 0,
   'possible': 0,
   'spurius': 31},
  'strict': {'actual': 0,
   'correct': 1120,
   'incorrect': 292,
   'missed': 34,
   'partial': 0,
   'possible': 0,
   'spurius': 31}},
 'PER': {'ent_type': {'actual': 0,
   'correct': 656,
   'incorrect': 107,
   'missed': 12,
   'partial': 0,
   'possible': 0,
   'spurius': 2},
  'strict': {'actual': 0,
   'correct': 646,
   'incorrect': 117,
   'missed': 12,
   'partial': 0,
   'possible': 0,
   'spurius': 2}}}
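Finally, the aggregated counts can be turned into overall precision and recall for each evaluation schema with the formulas described earlier. A minimal sketch, assuming the `results` dict produced by the aggregation loop above (note that partial is 0 here, so the exact-match and partial-match formulas coincide):

for eval_schema, counts in results.items():
    precision = counts['correct'] / counts['actual']
    recall = counts['correct'] / counts['possible']
    f1 = 2 * precision * recall / (precision + recall)
    print(eval_schema, round(precision, 3), round(recall, 3), round(f1, 3))
# ent_type: precision ~0.827, recall ~0.812
# strict:   precision ~0.790, recall ~0.775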

References