Text summarisation condenses the information contained in a text into a shorter version that still captures the original meaning. This can be achieved by selecting representative sentences or phrases from the original text and combining them into a summary (extractive), or by generating a new summary of the main content, as a human would write it (abstractive). This post briefly reviews some recent work on text summarisation in NLP using both approaches.


Extractive approaches select representative phrases from the original text and combine them to make a summary. In essence, they score the sentences in a text by some measure of saliency and rank them. In their classic, unsupervised form they are language- and domain-agnostic and don’t require any training data, and because sentences are copied verbatim, the generated summaries are grammatically and factually correct.
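
To make the "score and rank" idea concrete, here is a minimal sketch of an extractive summariser. It is not one of the models reviewed below: it simply scores each sentence by the average document frequency of its words, and the function names (`split_sentences`, `extractive_summary`) and the naive regex sentence splitter are assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase alphanumeric tokens only.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def extractive_summary(text, k=2):
    """Return the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Saliency = average frequency of the sentence's words in the document.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:k])  # restore document order
    return [sentences[i] for i in chosen]


if __name__ == "__main__":
    doc = "Cats chase mice. Cats like cats. Dogs bark."
    print(extractive_summary(doc, k=1))
```

Because the selected sentences are copied verbatim, the output is always grammatical, but nothing guarantees that the chosen sentences read well together.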

However, the generated summary may miss important content or be longer than necessary, because whole sentences are selected along with their unneeded parts. It may also lose semantic quality or cohesion, with wrong connections between the selected sentences. Here are some older popular approaches, and some more recent ones:

2004 - TextRank

2011 - LexRank

2019 - HIBERT

Figure 1 - The encoding and training mechanism of the HIBERT model.

2019 - BERTSumExt

  • Text Summarization with Pretrained Encoders
  • An encoder creates sentence representations and a classifier predicts which sentences should be selected as summaries
  • Has a document-level encoder based on BERT to obtain sentence representations
  • The modification mainly consists of:
    • inserting a [CLS] token before each sentence (its output vector represents that sentence)
    • inserting a [SEP] token after each sentence (which marks the boundary between two sentences)
    • assigning alternating segment embeddings to consecutive sentences, so that neighbouring sentences can be told apart
  • Sentence-level contextual representations fed to a classifier for binary classification
Figure 2 - BertSum architecture.
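
The input modification described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: whitespace splitting stands in for BERT's WordPiece tokeniser, and `build_bertsum_input` is a hypothetical helper name, not a function from the PreSumm code.

```python
def build_bertsum_input(sentences):
    """Wrap every sentence in [CLS] ... [SEP] and alternate segment ids."""
    tokens, segment_ids, cls_positions = [], [], []
    for i, sentence in enumerate(sentences):
        words = sentence.lower().split()  # stand-in for WordPiece (assumption)
        cls_positions.append(len(tokens))  # index of this sentence's [CLS]
        piece = ["[CLS]"] + words + ["[SEP]"]
        tokens.extend(piece)
        # Alternate 0/1 so that neighbouring sentences get different segments.
        segment_ids.extend([i % 2] * len(piece))
    return tokens, segment_ids, cls_positions


if __name__ == "__main__":
    toks, segs, cls_pos = build_bertsum_input(["The cat sat.", "It purred."])
    print(toks)
    print(segs)
    print(cls_pos)
```

The encoder outputs at `cls_positions` are the sentence representations that the binary classifier would then score.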

2020 - MatchSum

  • Extractive Summarization as Text Matching
  • Candidate summaries are extracted from the original text and matched against the source document in a semantic space
  • Siamese-BERT architecture to compute the similarity between the source document and the candidate summary
  • Leverages the pre-trained BERT in a Siamese network structure to derive semantically meaningful text embeddings that can be compared using cosine-similarity
Figure 3 - MatchSum approach of matching candidate summaries and original text in the same semantic space.
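
The matching idea can be illustrated with a toy example. Note the loud assumption: the bag-of-words `embed` function below is a stand-in for the Siamese-BERT encoder, and enumerating every sentence combination is a simplification of how candidates are built. Only the "pick the candidate closest to the document by cosine similarity" step mirrors the description above.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text):
    # Toy bag-of-words vector (assumption); MatchSum uses Siamese-BERT here.
    return Counter(text.lower().replace(".", " ").split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_candidate(sentences, size=2):
    """Return the candidate summary most similar to the whole document."""
    doc_vec = embed(" ".join(sentences))
    candidates = [" ".join(c) for c in combinations(sentences, size)]
    return max(candidates, key=lambda c: cosine(embed(c), doc_vec))


if __name__ == "__main__":
    doc = ["The cat sat.", "The cat sat on the mat.", "Birds fly."]
    print(best_candidate(doc, size=1))
```

The point of scoring whole candidates rather than single sentences is that the best summary is not necessarily the union of the individually best sentences.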


Abstractive approaches generate a concise summary that captures the essence of the original text. This can shorten the summary by removing redundancy while keeping it expressive. However, it needs a precise syntactic and semantic representation of the text, and the output may contain factual errors when the model does not grasp the text’s context well enough.

Earlier approaches were template-based or rule-based. Seq2Seq (encoder-decoder) models based on vanilla RNNs/LSTMs, optionally augmented with an attention mechanism, were another popular approach that explored the generative capability of recurrent neural networks. More recently, the encoder-decoder mechanism based on the Transformer dominates most new approaches:

2019 - BERTSumAbs

  • Text Summarization with Pretrained Encoders
  • Adopts an encoder-decoder architecture, combining the same pre-trained BERT encoder and a randomly-initialised Transformer decoder
  • Training separates the optimisers of the encoder and the decoder in order to accommodate the fact that the encoder is pre-trained while the decoder must be trained from scratch
  • Proposes a two-stage fine-tuning approach that combines extractive and abstractive objectives: the encoder is first fine-tuned on the extractive summarisation task and then on the abstractive summarisation task
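
To illustrate what separating the optimisers can look like, here is a small sketch of two independent learning-rate schedules with warm-up, one for the pre-trained encoder and one for the decoder trained from scratch. The schedule shape (linear warm-up followed by inverse-square-root decay) and the default values are my recollection of the paper's setup and should be treated as assumptions, not as a reference implementation.

```python
def warmup_lr(step, base_lr, warmup):
    """Linear warm-up followed by inverse-square-root decay (step >= 1)."""
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)


def encoder_lr(step, base_lr=2e-3, warmup=20_000):
    # Assumed values: small rate and long warm-up so the pre-trained
    # encoder changes slowly at the start.
    return warmup_lr(step, base_lr, warmup)


def decoder_lr(step, base_lr=0.1, warmup=10_000):
    # Assumed values: larger rate and shorter warm-up for the decoder,
    # which is randomly initialised.
    return warmup_lr(step, base_lr, warmup)


if __name__ == "__main__":
    for step in (1_000, 10_000, 20_000, 100_000):
        print(step, encoder_lr(step), decoder_lr(step))
```

In a training loop, each schedule would drive its own optimiser over the corresponding parameter group, so the encoder is updated gently while the decoder catches up.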

2019 - T5 (Text-to-Text Transfer Transformer)

  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  • Pre-trained Transformer encoder-decoder unifying many tasks as the same text-to-text problem
  • Input of the encoder is a task description (e.g., “Summarize: ”) followed by task input (e.g., a sequence of tokens from an article), and the decoder predicts the task output (e.g., a sequence of tokens summarizing the input article)
  • T5 is trained to generate the target text conditioned on the input text, so every task is handled as a text-to-text problem
Figure 4 - Text-to-Text Transfer Transformer (T5) approach.
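
The text-to-text framing is easy to show without loading a model: every task becomes a pair of strings, and only the prefix tells the model what to do. The helper below is a hypothetical sketch; the `TASK_PREFIXES` mapping and the function name are assumptions made for illustration.

```python
# Task descriptions prepended to the encoder input (illustrative mapping).
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}


def to_text_to_text(task, source, target):
    """Build one (input text, target text) training pair for a given task."""
    clean_source = " ".join(source.split())  # collapse whitespace and newlines
    return TASK_PREFIXES[task] + clean_source, target


if __name__ == "__main__":
    pair = to_text_to_text(
        "summarize",
        "A long   article\nabout summarisation models.",
        "Article on summarisation.",
    )
    print(pair)
```

Since summarisation and translation share the same input and output format, one encoder-decoder and one training objective can serve all of them.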

2020 - BART

Figure 5 - BART training approach.

2020 - PEGASUS


  • PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
  • Specifically designed and pre-trained neural network for the automatic text summarisation task
  • Pre-training self-supervised objective - gap-sentence generation - for Transformer encoder-decoder models
  • Fine-tuned on 12 diverse summarisation datasets
  • Selects and masks whole sentences from documents, and concatenates these gap sentences into a pseudo-summary that the model learns to generate
Figure 6 - PEGASUS training approach.
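
Gap-sentence generation can be sketched as a data-preparation step. The version below is a simplification under stated assumptions: sentence importance is approximated by the fraction of a sentence's words that also occur in the rest of the document (a rough stand-in for the selection score used in the paper), and the `[MASK1]` string is just a placeholder for the mask token.

```python
import re

MASK = "[MASK1]"  # placeholder mask token (assumption)


def words(sentence):
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def gap_sentence_example(sentences, n_gaps=1):
    """Mask the n_gaps most 'important' sentences; return (source, target)."""

    def importance(i):
        rest = set().union(*(words(s) for j, s in enumerate(sentences) if j != i))
        own = words(sentences[i])
        # Share of this sentence's words found elsewhere in the document.
        return len(own & rest) / len(own) if own else 0.0

    ranked = sorted(range(len(sentences)), key=importance, reverse=True)
    gaps = sorted(ranked[:n_gaps])
    source = " ".join(MASK if i in gaps else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in gaps)  # the pseudo-summary
    return source, target


if __name__ == "__main__":
    doc = [
        "Pegasus masks sentences.",
        "The model predicts masked sentences.",
        "Bananas are yellow.",
    ]
    print(gap_sentence_example(doc, n_gaps=1))
```

The encoder sees `source` and the decoder is trained to produce `target`, which makes the pre-training task look a lot like summarisation itself.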

Comparative Summarisation

| Name | Method | Max. Input | Code | Pre-Trained models | Languages |
|---|---|---|---|---|---|
| TextRank (2004) | Extractive | - | gensim | - | - |
| LexRank (2011) | Extractive | - | lexrank | - | - |
| BertSum (2019) | Both | ? | PreSumm | see code | English |
| HIBERT (2019) | Extractive | ? | HIBERT | see code | English |
| T5 (2019) | Abstractive | 1024 - 16384 | T5x | huggingface.co | several |
| MatchSum (2020) | Extractive | ? | MatchSum | see code | English |
| BART (2020) | Abstractive | 1024 | fairseq | huggingface.co | several |
| PEGASUS (2020) | Abstractive | ? | pegasus | huggingface.co | several |

Evaluation


  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation) - Recall
    • compare produced summary against reference summary by overlapping n-grams
    • ROUGE-N: report with different n-grams size
    • ROUGE-L: instead of n-gram overlap measures the Longest Common Subsequence
  • BLEU (Bilingual Evaluation Understudy) - Precision
    • compare the produced summary against the reference summary
    • how many of the words (and/or n-grams) in the machine summary appear in the reference summary
  • Limitations
    • always need a reference summary
    • just measuring string overlaps
    • alternative is to have a human evaluation
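
The recall/precision distinction between the two metrics is easier to see in code. The sketch below implements plain n-gram recall (the core of ROUGE-N), n-gram precision (the core of BLEU) and an LCS-based recall (the core of ROUGE-L) on whitespace tokens. It is a simplified illustration: stemming, multiple references, the F-measure and BLEU's brevity penalty are all left out, and the function names are my own.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(reference, candidate, n):
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Clipped overlap: each n-gram counts at most as often as in both texts.
    return sum((ref & cand).values()), sum(ref.values()), sum(cand.values())


def rouge_n_recall(reference, candidate, n=1):
    hits, ref_total, _ = overlap(reference, candidate, n)
    return hits / ref_total if ref_total else 0.0


def ngram_precision(reference, candidate, n=1):
    hits, _, cand_total = overlap(reference, candidate, n)
    return hits / cand_total if cand_total else 0.0


def lcs_length(a, b):
    # Classic dynamic programme, keeping only the previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(rouge_n_recall(reference, "the cat sat"))   # recall is penalised
    print(ngram_precision(reference, "the cat sat"))  # precision is not
    print(rouge_l_recall(reference, "the cat lay on the mat"))
```

A very short candidate can reach perfect precision while covering only half of the reference, which is why the two views are usually reported together.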