David S. Batista

A Package for Machine Learning Evaluation Reporting

Sat, 16 Nov 2024 01:00:00 +0100

When working on machine learning projects, evaluating a model’s performance is a critical step. The ML-Report-Kit is a Python package that simplifies this process by automating the generation of evaluation metrics and reports. In this post, we’ll take a closer look at what ML-Report-Kit offers and how you can use it effectively.

Figure 1 - Precision Recall Curve and a Confusion Matrix.

Introduction

ML-Report-Kit is designed to help data scientists and machine learning practitioners create comprehensive evaluation reports for supervised learning models. It provides a straightforward way to generate various metrics and visualizations that can aid in understanding model performance.

To use ML-Report-Kit, you first need to install it. You can do this via pip:

pip install ml-report-kit

Once installed, you can easily create a report by following these steps:

from ml_report_kit import MLReport

report = MLReport(y_true, y_pred, y_pred_prob, class_names)
report.run(results_path="results")

This code will generate a report with various metrics and save it in the results folder. The report contains:

Classification Report: Detailed metrics for each class, including precision, recall, and F1-score.
Confusion Matrix: A visual representation of true vs. predicted classifications.
Precision-Recall Curves: Graphs that show the trade-off between precision and recall at different thresholds.
CSV Files: Data files containing detailed metric values for further analysis.
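To make the per-class numbers in the classification report concrete, here is a minimal pure-Python sketch of how precision, recall, and F1-score can be derived from a list of true and predicted labels. It is only an illustration: the function name per_class_metrics and the toy labels are made up here, and this is not how ML-Report-Kit computes its report internally.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred, class_names):
    """Return {class_name: (precision, recall, f1)} from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1  # correct prediction for this class
        else:
            fp[pred] += 1   # predicted class received a wrong sample
            fn[truth] += 1  # true class missed one of its samples

    metrics = {}
    for name in class_names:
        predicted_total = tp[name] + fp[name]
        actual_total = tp[name] + fn[name]
        precision = tp[name] / predicted_total if predicted_total else 0.0
        recall = tp[name] / actual_total if actual_total else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[name] = (precision, recall, f1)
    return metrics


# Toy labels, invented for this example only.
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

metrics = per_class_metrics(y_true, y_pred, ["spam", "ham"])
for name, (p, r, f1) in metrics.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case one of the two "spam" predictions is wrong and one of the two true "spam" samples is missed, so both precision and recall for "spam" are 0.5.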

Running ML-Report-Kit on cross-fold classification

This example demonstrates how to use ml-report-kit in a cross-fold classification scenario, generating reports for individual folds and for the entire dataset. We’ll use the 20 Newsgroups dataset, a popular text classification dataset, to illustrate the process.

Install the following packages

pip install ml-report-kit
pip install scikit-learn

Run the code

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

from ml_report_kit import MLReport

dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
k_folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
folds = {}

for fold_nr, (train_index, test_index) in enumerate(k_folds.split(dataset.data, dataset.target)):
    x_train, x_test = np.array(dataset.data)[train_index], np.array(dataset.data)[test_index]
    y_train, y_test = np.array(dataset.target)[train_index], np.array(dataset.target)[test_index]
    folds[fold_nr] = {"x_train": x_train, "x_test": x_test, "y_train": y_train, "y_test": y_test}

all_y_true_label = []
all_y_pred_label = []
all_y_pred_prob = []

for fold_nr in folds.keys():
    clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LogisticRegression(class_weight='balanced'))])
    clf.fit(folds[fold_nr]["x_train"], folds[fold_nr]["y_train"])
    y_pred = clf.predict(folds[fold_nr]["x_test"])
    y_pred_prob = clf.predict_proba(folds[fold_nr]["x_test"])
    y_true_label = [dataset.target_names[sample] for sample in folds[fold_nr]["y_test"]]
    y_pred_label = [dataset.target_names[sample] for sample in y_pred]
    
    # accumulate the results for all folds to generate a report for the entire dataset
    all_y_true_label.extend(y_true_label)
    all_y_pred_label.extend(y_pred_label)
    all_y_pred_prob.extend(list(y_pred_prob))
    
    # generate the report for the current fold
    report = MLReport(y_true_label, y_pred_label, y_pred_prob, dataset.target_names)
    report.run(results_path="results", fold_nr=fold_nr)

# generate the report for the entire dataset
ml_report = MLReport(all_y_true_label, all_y_pred_label, list(all_y_pred_prob), dataset.target_names, y_id=None)
ml_report.run(results_path="results", final_report=True)

This code will generate reports for each fold and the entire dataset, saving them in the results folder. The reports will include:

classification reports with precision, recall, and F1-score for each class
confusion matrices in both text and image formats
- confusion_matrix.png
- confusion_matrix.txt
the precision-recall curve for each fold and the entire dataset in both raw CSV values and image formats
- precision_recall_threshold_.csv
- precision_recall_threshold_.png
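As a rough idea of what a precision-recall-threshold table holds, the sketch below computes precision and recall for one class at a few probability thresholds. It is a pure-Python illustration under stated assumptions: the function name, the toy scores, and the chosen thresholds are hypothetical, and the actual column layout of the CSV files written by ML-Report-Kit is not described in this post.

```python
def precision_recall_at_thresholds(is_positive, scores, thresholds):
    """For each threshold, predict positive when score >= threshold."""
    rows = []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, is_positive) if p and y)
        fp = sum(1 for p, y in zip(predicted, is_positive) if p and not y)
        fn = sum(1 for p, y in zip(predicted, is_positive) if not p and y)
        # Convention used here: precision is 1.0 when nothing is predicted positive.
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append((t, precision, recall))
    return rows


# Toy data: whether each sample truly belongs to the class, and the
# probability the classifier assigned to that class (invented values).
is_positive = [True, True, False, True, False]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

rows = precision_recall_at_thresholds(is_positive, scores, [0.3, 0.5, 0.85])
for threshold, precision, recall in rows:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold trades recall for precision: at 0.3 every positive is found but one negative slips in, while at 0.85 only the most confident (and correct) prediction remains.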

Where to get ML-Report-Kit

Improving RAG Retrieval with Auto-Merging

Thu, 12 Sep 2024 00:00:00 +0200

For most RAG applications, where we first have to retrieve the most relevant context, we end up having to split up documents first, and index those smaller splits of documents. Reasons for this range from needing to retrieve only relevant sections of larger bits of documents to the simple fact that (although they’re improving massively) LLMs simply don’t have infinite context lengths.

NOTE: this text was originally posted on the Haystack Blog - I’m adding it to my personal blog for more awareness.

Auto-Merging is a retrieval technique that leverages a hierarchical document structure. When a document is too long, it is split into smaller documents or chunks, where we can think of the smaller documents as the children of the original document and the original document as the parent. This results in a hierarchical tree structure where each smaller document is a child of a previous larger document. The leaves of the tree are the documents which don’t have any children, and the root is the original document.

Auto-merging retrieval is a technique we can use if the parent document is likely to contain more of the relevant context about the information the user is after than a subset of its child documents. When a query is made, the retriever will normally return the top_k document chunks that are most relevant to the query. However, if the fraction of a parent document’s chunks that were retrieved reaches a certain threshold, the retriever returns the parent document instead of the individual chunks.
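The merging rule can be sketched in a few lines of plain Python. This is not the Haystack implementation: the function auto_merge, the document ids, and the use of "greater than or equal" for the threshold comparison are assumptions made for this illustration.

```python
from collections import defaultdict


def auto_merge(matched_leaves, parent_of, children_of, threshold):
    """Replace matched leaves by their parent when enough siblings matched."""
    by_parent = defaultdict(list)
    for leaf in matched_leaves:
        by_parent[parent_of[leaf]].append(leaf)

    merged = []
    for parent, leaves in by_parent.items():
        # Fraction of this parent's children that matched the query.
        ratio = len(leaves) / len(children_of[parent])
        if ratio >= threshold:
            merged.append(parent)   # return the parent instead of its leaves
        else:
            merged.extend(leaves)   # keep the individual leaves
    return merged


# Hypothetical hierarchy: two parents with four and two leaf chunks.
children_of = {"doc_A": ["a1", "a2", "a3", "a4"], "doc_B": ["b1", "b2"]}
parent_of = {leaf: parent for parent, leaves in children_of.items() for leaf in leaves}

few_matches = auto_merge(["a1", "a2", "b1"], parent_of, children_of, threshold=0.6)
many_matches = auto_merge(["a1", "a2", "a3", "b1"], parent_of, children_of, threshold=0.6)
print(few_matches)
print(many_matches)
```

With a threshold of 0.6, two out of four matched chunks are not enough to trigger a merge, while three out of four are, so only the second call returns doc_A in place of its chunks.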

Haystack Components

Haystack implements the Auto-Merging Retrieval with two components:

HierarchicalDocumentSplitter: splits a Document into multiple Document objects of different block sizes, building a hierarchical tree structure where each smaller block is a child of a previous larger block. The init method expects three parameters:
- block_sizes: Set of block sizes to split the document into. The blocks are split in descending order. So, block_sizes of {20, 5} would mean that each ‘parent’ split would be of length max 20, and each of its children would be of length max 5.
- split_overlap: The number of overlapping units for each split.
- split_by: The unit for splitting your documents.
AutoMergingRetriever: a retriever that leverages the hierarchical tree structure of documents, where the leaf nodes are indexed in a document store. During retrieval, if the number of matched leaf documents below the same parent is higher than a defined threshold, the retriever will return the parent document instead of the individual leaf documents. The init method expects two parameters:
- document_store: DocumentStore from which to retrieve the parent documents
- threshold: Threshold to decide whether the parent instead of the individual documents is returned

Introductory Example

Let’s see a simple example of how the AutoMergingRetriever works. In this example we will use a single document. We use the HierarchicalDocumentSplitter to split the document into chunks, represented by smaller documents, and capturing the hierarchical structure of the document.

    from haystack import Document
    from haystack.components.preprocessors import HierarchicalDocumentSplitter

    docs = [Document(content="The monarch of the wild blue yonder rises from the eastern side of the horizon.")]
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
    docs = splitter.run(docs)

We start by creating a document, and then we split it into smaller documents using the HierarchicalDocumentSplitter. We need to specify the block sizes that we want to split the document into. In this case, we are splitting the document into 10 and 3-word blocks - this means that the splitter will only have 2 levels, the first with a maximum of 10 words and the second a maximum of 3 words. There are no overlaps among the documents, and we also specify that we want to split the document by words. This results in 9 documents being created from the original document. The documents are split as follows:

`The monarch of the wild blue yonder rises from the eastern side of the horizon.` -- (root)
|
|
|
|--- `The monarch of the wild blue yonder rises from the`
|               |
|               |
|               |--- `The monarch of` -- (leaf)
|               |
|               |--- `the wild blue` -- (leaf)
|               |
|               |--- `yonder rises from` -- (leaf)
|               |
|               |--- `the` -- (leaf)
|
|
|--- `eastern side of the horizon.`
|               |
|               |
|               |--- `eastern side of` -- (leaf)
|               |
|               |--- `the horizon.` -- (leaf)

Note that the original document is always the root of the tree. We then have two levels of children, the first with a maximum block size of 10 words, and the second with a maximum block size of 3 words.
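The tree above can be reproduced with a small pure-Python sketch that splits by words without overlap. It only mimics the splitting idea; the helper names split_words and build_tree are made up for this example and are not part of Haystack.

```python
def split_words(text, block_size):
    """Split text into consecutive blocks of at most block_size words."""
    words = text.split()
    return [" ".join(words[i:i + block_size]) for i in range(0, len(words), block_size)]


def build_tree(text, block_sizes):
    """Return (level, text) pairs: the root first, then each level of blocks."""
    nodes = [(0, text)]
    current = [text]
    # Larger blocks come first, so each level is split from the one above it.
    for level, size in enumerate(sorted(block_sizes, reverse=True), start=1):
        next_level = []
        for parent in current:
            next_level.extend(split_words(parent, size))
        nodes.extend((level, block) for block in next_level)
        current = next_level
    return nodes


sentence = "The monarch of the wild blue yonder rises from the eastern side of the horizon."
nodes = build_tree(sentence, {10, 3})
for level, block in nodes:
    print("    " * level + block)
print(len(nodes))
```

The sentence has 15 words, so the 10-word level yields two blocks and the 3-word level yields four plus two blocks, which together with the root gives the 9 documents mentioned above.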

We now need to split these documents into two distinct document stores. During initialization, the AutoMergingRetriever requires the document store where the parent documents are indexed. At run time, it receives the leaf documents that matched a user query. It returns the parent document if the number of matched leaf documents below the same parent is higher than a defined threshold; otherwise, it returns the originally retrieved leaf documents.

Let’s see it in practice. We index the parent documents, by selecting the ones with a __level of 1.

    from haystack.document_stores.in_memory import InMemoryDocumentStore

    parent_docs_store = InMemoryDocumentStore()
    parent_docs = [doc for doc in docs["documents"] if doc.meta["__level"]==1]
    parent_docs_store.write_documents(parent_docs)

Let’s now initialize the AutoMergingRetriever with the parent document store and a threshold of 0.5, meaning that if at least 50% of the leaf documents below the same parent match the query, the retriever will return the parent instead of the leaf documents which matched the user query. If we query the document store with a single leaf document, the retriever will return the same leaf document.

    from haystack.components.retrievers import AutoMergingRetriever

    retriever = AutoMergingRetriever(document_store=parent_docs_store, threshold=0.5)
    retriever.run(matched_leaf_documents=[docs['documents'][4]])

If we now query the document store with two leaf documents, the retriever will return the parent document instead of the individual leaf documents, as the threshold of 0.5 is met.

    matched_leaf_documents = [docs['documents'][4], docs['documents'][5]]
    retriever.run(matched_leaf_documents=matched_leaf_documents)

This was a simple introductory example to show how the AutoMergingRetriever works and retrieves parent documents instead of individual leaf documents. Next, we will see a full example over a dataset of news articles.

Advanced Example

We will use the BBC news dataset to show how the AutoMergingRetriever works with a dataset containing multiple news articles. This dataset consists of 2,225 documents from the BBC corresponding to stories in five topical areas, collected between 2004 and 2005, and was part of the work by D. Greene and P. Cunningham, “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”, Proc. ICML 2006.

Reading the dataset

The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but we are going to use a version that was already preprocessed and stored in a single CSV file available at the following URL:

https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv

from typing import List
import csv
from haystack import Document

def read_documents(file: str) -> List[Document]:
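    # the CSV file is tab-separated; use a distinct name for the file handle
    with open(file, "r", newline="", encoding="utf-8") as csv_file:
        reader = csv.reader(csv_file, delimiter="\t")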
        next(reader, None)  # skip the headers
        documents = []
        for row in reader:
            category = row[0].strip()
            title = row[2].strip()
            text = row[3].strip()
            documents.append(Document(content=text, meta={"category": category, "title": title}))

    return documents

docs = read_documents("bbc-news-data.csv")
len(docs)
>> 2225

Indexing the documents

After reading the news articles and converting them into Haystack Document objects, let’s now index them. For the sake of simplicity, we will use the InMemoryDocumentStore as document store. We first apply the HierarchicalDocumentSplitter to the list of Documents, creating a hierarchical structure.

We will create two document stores, one for the parent documents, and one for the leaf documents. We will later see that an intermediate retriever matches the user query against the indexed leaf documents; this intermediate retriever is then connected to an AutoMergingRetriever, which decides when to return the parent instead of the matched leaf documents.

The function below receives the news articles as Documents and filters them by the meta field __level to differentiate between children and parent Documents, indexing them in their respective document stores, which are then both returned by the function.

from typing import Tuple

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

from haystack.components.preprocessors import HierarchicalDocumentSplitter

def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 5}, split_overlap=0, split_by="sentence")
    docs = splitter.run(documents)

    # store the leaf documents in one document store
    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.SKIP)

    # store the parent documents in another document store
    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.SKIP)

    return leaf_doc_store, parent_doc_store

Querying the documents

Now that we have our document stores, let’s construct a querying pipeline consisting of a BM25Retriever associated with the document store containing the leaf documents, and an AutoMergingRetriever associated with the parent documents and with a threshold of 0.6, meaning that if at least 60% of the leaf documents of a parent match the query, the parent is returned instead of each individual Document.

from haystack import Pipeline
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.retrievers import AutoMergingRetriever

def querying_pipeline(leaf_doc_store: InMemoryDocumentStore, parent_doc_store: InMemoryDocumentStore, threshold: float = 0.6):
    pipeline = Pipeline()
    bm25_retriever = InMemoryBM25Retriever(document_store=leaf_doc_store)
    auto_merge_retriever = AutoMergingRetriever(parent_doc_store, threshold=threshold)
    pipeline.add_component(instance=bm25_retriever, name="BM25Retriever")
    pipeline.add_component(instance=auto_merge_retriever, name="AutoMergingRetriever")
    pipeline.connect("BM25Retriever.documents", "AutoMergingRetriever.matched_leaf_documents")
    return pipeline

Putting it all together

docs = read_documents("bbc-news-data.csv")
leaf_doc_store, parent_doc_store = indexing(docs)
pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.6)

So, we can now run each function individually and have a querying pipeline that uses the AutoMergingRetriever. We can then use the pipeline to query the document store for articles related to cybersecurity. Let’s also make use of the pipeline parameter include_outputs_from to get the outputs from the BM25Retriever component as well.

result = pipeline.run(data={'query': 'phishing attacks spoof websites spam e-mails spyware'},  include_outputs_from={'BM25Retriever'})

The result will have two keys, one for each retriever component: AutoMergingRetriever, BM25Retriever.

Let’s see how many documents were retrieved by each component.

In [17]: len(result['AutoMergingRetriever']['documents'])
Out[17]: 7

In [18]: len(result['BM25Retriever']['documents'])
Out[18]: 10

As we can see, the AutoMergingRetriever retrieved 7 documents, while the BM25Retriever retrieved 10 documents. This is because the AutoMergingRetriever returned parent documents instead of individual leaf documents. Let’s compare the titles of the documents retrieved by the BM25Retriever and the AutoMergingRetriever.

In [13]: doc_titles = sorted([d.meta['title'] for d in result['BM25Retriever']['documents']])
In [14]: doc_titles
Out[14]:
['Bad e-mail habits sustains spam',
 'Bad e-mail habits sustains spam',
 'Cyber crime booms in 2004',
 'Cyber criminals step up the pace',
 'Cyber criminals step up the pace',
 'Junk e-mails on relentless rise',
 'More women turn to net security',
 'Security scares spark browser fix',
 'Spam e-mails tempt net shoppers',
 'Spam e-mails tempt net shoppers']

In [15]: doc_titles = sorted([d.meta['title'] for d in result['AutoMergingRetriever']['documents']])
In [16]: doc_titles
Out[16]:
['Bad e-mail habits sustains spam',
 'Cyber crime booms in 2004',
 'Cyber criminals step up the pace',
 'Junk e-mails on relentless rise',
 'More women turn to net security',
 'Security scares spark browser fix',
 'Spam e-mails tempt net shoppers']

Instead of returning individual leaf documents, the AutoMergingRetriever returned the parent documents for the articles:

“Bad e-mail habits sustains spam”,
“Cyber criminals step up the pace”,
“Spam e-mails tempt net shoppers”;

since at least 60% of the leaf documents of each of those documents matched the query.
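One quick way to spot these articles is to count how often each title appears in the BM25Retriever output shown above: a title that appears more than once had several of its chunks matched. The sketch below uses only the titles listed in this post; note that a repeated title is just a hint, since whether a merge actually happens depends on the threshold and on how many chunks the article has.

```python
from collections import Counter

# Titles of the 10 chunks returned by the BM25Retriever (copied from the output above).
bm25_titles = [
    "Bad e-mail habits sustains spam",
    "Bad e-mail habits sustains spam",
    "Cyber crime booms in 2004",
    "Cyber criminals step up the pace",
    "Cyber criminals step up the pace",
    "Junk e-mails on relentless rise",
    "More women turn to net security",
    "Security scares spark browser fix",
    "Spam e-mails tempt net shoppers",
    "Spam e-mails tempt net shoppers",
]

counts = Counter(bm25_titles)
# Articles for which more than one chunk matched the query.
repeated_titles = sorted(title for title, n in counts.items() if n > 1)

print(repeated_titles)
print(len(counts))  # number of distinct articles among the 10 chunks
```

The three repeated titles are the three articles listed above, and the 7 distinct titles match the 7 documents returned by the AutoMergingRetriever.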

Conclusion

In this tutorial, we saw how the AutoMergingRetriever works. One important aspect of the AutoMergingRetriever implementation in Haystack is that it requires the documents to be split using the HierarchicalDocumentSplitter. Another aspect to notice, as we saw, is that the AutoMergingRetriever should be used in conjunction with other base Retrievers, allowing for a more flexible retrieval system.

Benchmarking Haystack Pipelines for Optimal Performance

Mon, 24 Jun 2024 00:00:00 +0200

In this article, we will show you how to use Haystack to evaluate the performance of a RAG pipeline. Note that the code in this article is meant to be illustrative and may not run as is; if you want to run the code, please refer to the Python script.

NOTE: this text was originally posted on the Haystack Blog - I’m adding it to my personal blog for more awareness.

Introduction

This article will guide you through building a Retrieval-Augmented Generation (RAG) pipeline using Haystack, adjusting various parameters, and evaluating it with the ARAGOG dataset. The dataset consists of pairs of questions and answers, and our objective is to assess the RAG pipeline’s efficiency in retrieving the correct context and generating accurate answers. To do this, we will use three evaluation metrics: context relevance, faithfulness, and semantic answer similarity (SAS).

We did this experiment by relying on three different Haystack pipelines with different purposes: one pipeline for indexing, another for RAG, and one for evaluation. We describe each of these pipelines in detail and show how to combine them together to evaluate the RAG pipeline.

The article is organized as follows: we first describe the origin and authorship of the ARAGOG dataset, then we build the pipelines. We then demonstrate how to integrate everything, performing multiple runs over the dataset and adjusting parameters. These parameters were chosen based on feedback from our community, reflecting how users optimize their pipelines:

top_k: the maximum number of documents returned by the retriever. For this experiment, we tested our pipeline with top_k values of [1, 2, 3].
embedding_model: the model used to encode the documents and the question. For this example, we used these sentence-transformers models:
- all-MiniLM-L6-v2
- msmarco-distilroberta-base-v2
- all-mpnet-base-v2
chunk_size: the number of tokens in each segment of the input text that is embedded and indexed. For this experiment, we tested our pipeline with chunk_size values of [64, 128, 256].
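To get a feeling for the size of this search space, the following sketch enumerates all parameter combinations with itertools.product. The variable names are chosen for this illustration; the values are the ones listed above.

```python
from itertools import product

embedding_models = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/msmarco-distilroberta-base-v2",
    "sentence-transformers/all-mpnet-base-v2",
]
chunk_sizes = [64, 128, 256]
top_k_values = [1, 2, 3]

# Every (embedding_model, chunk_size, top_k) combination is one evaluation run.
grid = list(product(embedding_models, chunk_sizes, top_k_values))
print(len(grid))

# Indexing only depends on the embedding model and the chunk size,
# so the number of distinct indexing runs is smaller.
indexing_runs = {(model, chunk_size) for model, chunk_size, _ in grid}
print(len(indexing_runs))
```

This gives 27 evaluation runs in total, but only 9 distinct indexing runs, since top_k does not change what is stored in the document store.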

We end by discussing the results of the evaluation and sharing some lessons learned.

The “ARAGOG: Advanced RAG Output Grading” Dataset

The knowledge data, as well as the questions and answers, all stem from the ARAGOG: Advanced RAG Output Grading paper. The data is a subset of the AI ArXiv Dataset and consists of 423 selected research papers centered around the themes of Transformers and Large Language Models (LLMs).

The evaluation dataset comprises 107 question-answer pairs (QA) generated with the assistance of GPT-4. Each QA pair is validated and corrected by humans, ensuring that the evaluation is correct and accurately measures the RAG techniques’ performance in real-world applications.

Within the scope of this article, we only considered 16 papers, the ones from which the questions were drawn, instead of the 423 papers in the original dataset, to reduce the computational cost.

The Indexing Pipeline

The indexing pipeline is responsible for preprocessing and storing the documents in a DocumentStore. We will define a function that wraps a pipeline, takes the embedding model and the chunk size as parameters, and returns a DocumentStore for later use. The pipeline in the function first converts the PDF files into Documents, cleans them, splits them into chunks, and then embeds them using a SentenceTransformers model. The embeddings are then stored in an InMemoryDocumentStore. Learn more about creating an indexing pipeline in 📚 Tutorial: Preprocessing Different File Types.

For this example, we store the documents using the InMemoryDocumentStore, but you can use any other document store supported by Haystack. We split the documents by word, but you can split them by sentence or paragraph by changing the value of the split_by parameter in the DocumentSplitter component.

We need to pass the parameters embedding_model and chunk_size to this indexing pipeline function since we want to experiment with different indexing approaches.

The indexing pipeline function is defined as follows:

import os

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

def indexing(embedding_model: str, chunk_size: int):
    files_path = "datasets/ARAGOG/papers_for_questions"
    document_store = InMemoryDocumentStore()
    pipeline = Pipeline()
    pipeline.add_component("converter", PyPDFToDocument())
    pipeline.add_component("cleaner", DocumentCleaner())
    pipeline.add_component("splitter", DocumentSplitter(split_length=chunk_size))  # default splitting by word
    pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP))
    pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(embedding_model))
    pipeline.connect("converter", "cleaner")
    pipeline.connect("cleaner", "splitter")
    pipeline.connect("splitter", "embedder")
    pipeline.connect("embedder", "writer")
    pdf_files = [files_path+"/"+f_name for f_name in os.listdir(files_path)]
    pipeline.run({"converter": {"sources": pdf_files}})

    return document_store

The RAG Pipeline

We use a simple RAG pipeline composed of a retriever, a prompt builder, a language model, and an answer builder. First, we use the SentenceTransformersTextEmbedder to embed the query and an InMemoryEmbeddingRetriever to retrieve the top-k documents relevant to the query. We then rely on an LLM to generate an answer based on the context retrieved from the documents and the query question.

We used the OpenAI API through the OpenAIGenerator with the gpt-3.5-turbo model in our implementation. The PromptBuilder is responsible for building the prompt to be fed to the LLM, using a template that includes the context and the question. Finally, the AnswerBuilder is responsible for extracting the answer from the LLM output and returning it. Learn more about creating a RAG pipeline in 📚 Tutorial: Creating Your First QA Pipeline with Retrieval-Augmentation.

Note that we instruct the LLM to explicitly answer "None" when the context is empty. We do this to avoid the LLM answering the prompt with its own internal knowledge.

After creating the pipeline, we wrap it with a function to easily initialize it with different parameters. We expect a document_store, an embedding_model, and the top_k for this function.

The RAG pipeline is defined as follows:

from haystack import Pipeline
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers import InMemoryEmbeddingRetriever

def rag_pipeline(document_store, embedding_model, top_k=2):
    template = """
        You have to answer the following question based on the given context information only.
        If the context is empty or just a '\\n' answer with None, example: "None".

        Context:
        {% for document in documents %}
            {{ document.content }}
        {% endfor %}

        Question: {{ question }}
        Answer:
        """

    basic_rag = Pipeline()
    basic_rag.add_component("query_embedder", SentenceTransformersTextEmbedder(
        model=embedding_model, progress_bar=False
    ))
    basic_rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=top_k))
    basic_rag.add_component("prompt_builder", PromptBuilder(template=template))
    basic_rag.add_component("llm", OpenAIGenerator(model="gpt-3.5-turbo"))
    basic_rag.add_component("answer_builder", AnswerBuilder())

    basic_rag.connect("query_embedder", "retriever.query_embedding")
    basic_rag.connect("retriever", "prompt_builder.documents")
    basic_rag.connect("prompt_builder", "llm")
    basic_rag.connect("llm.replies", "answer_builder.replies")
    basic_rag.connect("llm.meta", "answer_builder.meta")
    basic_rag.connect("retriever", "answer_builder.documents")

    return basic_rag

The Evaluation Pipeline

We will also need an evaluation pipeline, which will be responsible for computing the scoring metrics to measure the performance of the RAG pipeline. You can learn how to build an evaluation pipeline in 📚 Tutorial: Evaluating RAG Pipelines. The evaluation pipeline will include three evaluators:

ContextRelevanceEvaluator will assess the relevancy of the retrieved context to answer the query question
FaithfulnessEvaluator evaluates whether the generated answer can be derived from the context
SASEvaluator compares the embedding of a generated answer against a ground-truth answer based on a common embedding model.
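To give an intuition for comparing two answer embeddings, the sketch below computes the cosine similarity between small vectors in plain Python. This is only an assumption about how such a comparison can be done: the vectors are toy values rather than real sentence embeddings, and the exact scoring used inside the SASEvaluator is not described in this article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
predicted_answer = [1.0, 0.0, 1.0]
ground_truth_answer = [1.0, 0.0, 1.0]
unrelated_answer = [0.0, 1.0, 0.0]

same = cosine_similarity(predicted_answer, ground_truth_answer)
different = cosine_similarity(predicted_answer, unrelated_answer)
print(round(same, 3), round(different, 3))
```

Vectors pointing in the same direction score close to 1, while orthogonal vectors score 0, which is why a higher similarity suggests that the generated answer is semantically closer to the ground truth.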

This new function returns the evaluation results and the inputs used to run the evaluation. This data is useful for later analysis and understanding the pipeline’s performance in more detail and granularity. We need to pass the questions and answers from the dataset to the function, plus the data generated by the RAG pipeline, i.e., retrieved_contexts, predicted_answers, and the embedding_model used for these results.

from haystack import Pipeline
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator, SASEvaluator

def evaluation(questions, answers, retrieved_contexts, predicted_answers, embedding_model):
    eval_pipeline = Pipeline()
    eval_pipeline.add_component("context_relevance", ContextRelevanceEvaluator(raise_on_failure=False))
    eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator(raise_on_failure=False))
    eval_pipeline.add_component("sas", SASEvaluator(model=embedding_model))

    eval_pipeline_results = eval_pipeline.run(
        {
            "context_relevance": {"questions": questions, "contexts": retrieved_contexts},
            "faithfulness": {"questions": questions, "contexts": retrieved_contexts, "predicted_answers": predicted_answers},
            "sas": {"predicted_answers": predicted_answers, "ground_truth_answers": answers},
        }
    )

    results = {
        "context_relevance": eval_pipeline_results['context_relevance'],
        "faithfulness": eval_pipeline_results['faithfulness'],
        "sas": eval_pipeline_results['sas']
    }

    inputs = {
        'questions': questions,
        'contexts': retrieved_contexts,
        'true_answers': answers,
        'predicted_answers': predicted_answers
    }

    return results, inputs

Putting it all together

We now have the building blocks to evaluate the RAG pipeline: indexing the knowledge data, generating answers using a RAG architecture, and evaluating the results. However, we still need a method to run the questions over our RAG pipeline and collect all the needed results to perform an evaluation. We will use a function that wraps up all the interactions with the RAG pipeline. It takes as parameters a document_store, the questions, an embedding_model and the top_k and returns the retrieved contexts and the predicted answers.

from openai import BadRequestError
from tqdm import tqdm

def run_rag(document_store, sample_questions, embedding_model, top_k):
    """
    A function to run the basic rag model on a set of sample questions and answers
    """

    rag = rag_pipeline(document_store=document_store, embedding_model=embedding_model, top_k=top_k)

    predicted_answers = []
    retrieved_contexts = []
    for q in tqdm(sample_questions):
        try:
            response = rag.run(
                data={"query_embedder": {"text": q}, "prompt_builder": {"question": q}, "answer_builder": {"query": q}})
            predicted_answers.append(response["answer_builder"]["answers"][0].data)
            retrieved_contexts.append([d.content for d in response['answer_builder']['answers'][0].documents])
        except BadRequestError as e:
            print(f"Error with question: {q}")
            print(e)
            predicted_answers.append("error")
            retrieved_contexts.append([])  # no contexts available for a failed question

    return retrieved_contexts, predicted_answers

Notice that we wrap the call to the RAG pipeline in a try-except block to handle any errors that may occur during the pipeline’s execution. This might happen, for instance, if the prompt is too big—due to large contexts—for the model to generate an answer, if there are network errors, or simply if the model cannot generate an answer for any other reason.

You can decide whether the LLM-based evaluators stop immediately when an error is found or whether they skip the evaluation for that particular sample and continue. See, for instance, the raise_on_failure parameter of the ContextRelevanceEvaluator.

Finally, we need to run all the questions in the dataset through the pipeline for each possible combination of the parameters top_k, embedding_model, and chunk_size. That’s handled by the next function.

Note that for indexing, we only vary the embedding_model and chunk_size, as the top_k parameter does not affect the indexing.

def parameter_tuning(out_path: str):

    base_path = "../datasets/ARAGOG/"

    with open(base_path + "eval_questions.json", "r") as f:
        data = json.load(f)
        questions = data["questions"]
        answers = data["ground_truths"]

    embedding_models = [
        "sentence-transformers/all-MiniLM-L6-v2",
        "sentence-transformers/msmarco-distilroberta-base-v2",
        "sentence-transformers/all-mpnet-base-v2"
    ]
    top_k_values = [1, 2, 3]
    chunk_sizes = [64, 128, 256]

    # create results directory
    out_path = Path(out_path)
    out_path.mkdir(exist_ok=True)

    for embedding_model in embedding_models:
        for chunk_size in chunk_sizes:
            print(f"Indexing documents with {embedding_model} model with a chunk_size={chunk_size}")
            doc_store = indexing(embedding_model, chunk_size)
            for top_k in top_k_values:
                name_params = f"{embedding_model.split('/')[-1]}__top_k:{top_k}__chunk_size:{chunk_size}"
                print(name_params)
                print("Running RAG pipeline")
                retrieved_contexts, predicted_answers = run_rag(doc_store, questions, embedding_model, top_k)
                print(f"Running evaluation")
                results, inputs = evaluation(questions, answers, retrieved_contexts, predicted_answers, embedding_model)
                eval_results = EvaluationRunResult(run_name=name_params, inputs=inputs, results=results)
                eval_results.score_report().to_csv(f"{out_path}/score_report_{name_params}.csv", index=False)
                eval_results.to_pandas().to_csv(f"{out_path}/detailed_{name_params}.csv", index=False)

This function stores the results as .csv files in the directory specified by the out_path parameter. For each parameter combination, two files are generated: one with the aggregated score report over all questions (e.g., score_report_all-MiniLM-L6-v2__top_k:3__chunk_size:128.csv) and another with the detailed results for each question (e.g., detailed_all-MiniLM-L6-v2__top_k:3__chunk_size:128.csv).

Note that we make use of the EvaluationRunResult to store the results and generate the score report and the detailed results in the .csv files.
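The file naming convention described above can be checked with a small round trip. This is only a sketch: result_file_name and parse_file_name are hypothetical helpers, not part of Haystack or of the evaluation script, and they simply mirror the f-string used in parameter_tuning.

```python
import re


def result_file_name(kind: str, embedding_model: str, top_k: int, chunk_size: int) -> str:
    """Build a file name following the convention used in parameter_tuning()."""
    model = embedding_model.split("/")[-1]  # drop the "sentence-transformers/" prefix
    return f"{kind}_{model}__top_k:{top_k}__chunk_size:{chunk_size}.csv"


def parse_file_name(f_name: str):
    """Recover (kind, model, top_k, chunk_size) from a result file name."""
    pattern = r"(score_report|detailed)_(.*?)__top_k:(\d+)__chunk_size:(\d+)\.csv"
    match = re.fullmatch(pattern, f_name)
    if match is None:
        raise ValueError(f"unexpected file name: {f_name}")
    kind, model, top_k, chunk_size = match.groups()
    return kind, model, int(top_k), int(chunk_size)


name = result_file_name("score_report", "sentence-transformers/all-MiniLM-L6-v2", 3, 128)
print(name)                   # score_report_all-MiniLM-L6-v2__top_k:3__chunk_size:128.csv
print(parse_file_name(name))  # ('score_report', 'all-MiniLM-L6-v2', 3, 128)
```

One practical caveat: the ':' character is not allowed in file names on Windows, so this naming scheme assumes a Unix-like file system.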

In the next section, we will show the evaluation results and discuss the insights gained from the experiment.

Results Analysis

You can run this notebook to visualize and analyze the results. All relevant .csv files can be found in the aragog_parameter_search_2024_06_12 folder.

To make the analysis of the results easier, we will load all the aggregated score reports from the different parameter combinations from multiple .csv files into a single DataFrame. For that, we use the following code to parse the file content:

import os
import re
import pandas as pd

def parse_results(f_name: str):
    pattern = r"score_report_(.*?)__top_k:(\d+)__chunk_size:(\d+)\.csv"
    match = re.search(pattern, f_name)
    if match:
        embeddings_model = match.group(1)
        top_k = int(match.group(2))
        chunk_size = int(match.group(3))
        return embeddings_model, top_k, chunk_size
    else:
        raise ValueError(f"Unexpected score report file name: {f_name}")

def read_scores(path: str):
    all_scores = []
    for root, dirs, files in os.walk(path):
        for f_name in files:
            if not f_name.startswith("score_report"):
                continue

            embeddings_model, top_k, chunk_size = parse_results(f_name)

            df = pd.read_csv(os.path.join(root, f_name))

            df.rename(columns={'Unnamed: 0': 'metric'}, inplace=True)
            df_transposed = df.T
            df_transposed.columns = df_transposed.iloc[0]
            df_transposed = df_transposed[1:]

            # Add new columns
            df_transposed['embeddings'] = embeddings_model
            df_transposed['top_k'] = top_k
            df_transposed['chunk_size'] = chunk_size

            all_scores.append(df_transposed)

    df = pd.concat(all_scores)
    df.reset_index(drop=True, inplace=True)
    df.rename_axis(None, axis=1, inplace=True)

    return df

We can then read the scores from the CSV files and analyze the results.

df = read_scores('aragog_results/')

We can now analyze the results in a single table:

context_relevance	faithfulness	sas	embeddings	top_k	chunk_size
0.834891	0.738318	0.524882	all-MiniLM-L6-v2	1	64
0.869485	0.895639	0.633806	all-MiniLM-L6-v2	2	64
0.933489	0.948598	0.65133	all-MiniLM-L6-v2	3	64
0.843447	0.831776	0.555873	all-MiniLM-L6-v2	1	128
0.912355	NaN	0.661135	all-MiniLM-L6-v2	2	128
0.94463	0.928349	0.659311	all-MiniLM-L6-v2	3	128
0.912991	0.827103	0.574832	all-MiniLM-L6-v2	1	256
0.951702	0.925456	0.642837	all-MiniLM-L6-v2	2	256
0.909638	0.932243	0.676347	all-MiniLM-L6-v2	3	256
0.791589	0.67757	0.480863	all-mpnet-base-v2	1	64
0.82648	0.866044	0.584507	all-mpnet-base-v2	2	64
0.901218	0.890654	0.611468	all-mpnet-base-v2	3	64
0.897715	0.845794	0.538579	all-mpnet-base-v2	1	128
0.916422	0.892523	0.609728	all-mpnet-base-v2	2	128
0.948038	NaN	0.643175	all-mpnet-base-v2	3	128
0.867887	0.834112	0.560079	all-mpnet-base-v2	1	256
0.946651	0.88785	0.639072	all-mpnet-base-v2	2	256
0.941952	0.91472	0.645992	all-mpnet-base-v2	3	256
0.909813	0.738318	0.530884	msmarco-distilroberta-base-v2	1	64
0.88004	0.929907	0.600428	msmarco-distilroberta-base-v2	2	64
0.918135	0.934579	0.67328	msmarco-distilroberta-base-v2	3	64
0.885314	0.869159	0.587424	msmarco-distilroberta-base-v2	1	128
0.953649	0.919003	0.664224	msmarco-distilroberta-base-v2	2	128
0.945016	0.936916	0.68591	msmarco-distilroberta-base-v2	3	128
0.949844	0.866822	0.613355	msmarco-distilroberta-base-v2	1	256
0.952544	0.893769	0.662694	msmarco-distilroberta-base-v2	2	256
0.964182	0.943925	0.62854	msmarco-distilroberta-base-v2	3	256

We can see some NaN values for the faithfulness scores, which are computed by an LLM-based evaluator. These were due to network errors when calling the OpenAI API.

Let’s now see which parameter configuration yielded the best Semantic Answer Similarity (sas) score:

df.sort_values(by=['sas'], ascending=[False])

context_relevance	faithfulness	sas	embeddings	top_k	chunk_size
0.945016	0.936916	0.68591	msmarco-distilroberta-base-v2	3	128
0.909638	0.932243	0.676347	all-MiniLM-L6-v2	3	256
0.918135	0.934579	0.67328	msmarco-distilroberta-base-v2	3	64
0.953649	0.919003	0.664224	msmarco-distilroberta-base-v2	2	128
0.952544	0.893769	0.662694	msmarco-distilroberta-base-v2	2	256
0.912355	NaN	0.661135	all-MiniLM-L6-v2	2	128
0.94463	0.928349	0.659311	all-MiniLM-L6-v2	3	128
0.933489	0.948598	0.65133	all-MiniLM-L6-v2	3	64
0.941952	0.91472	0.645992	all-mpnet-base-v2	3	256
0.948038	NaN	0.643175	all-mpnet-base-v2	3	128
0.951702	0.925456	0.642837	all-MiniLM-L6-v2	2	256
0.946651	0.88785	0.639072	all-mpnet-base-v2	2	256
0.869485	0.895639	0.633806	all-MiniLM-L6-v2	2	64
0.964182	0.943925	0.62854	msmarco-distilroberta-base-v2	3	256
0.949844	0.866822	0.613355	msmarco-distilroberta-base-v2	1	256
0.901218	0.890654	0.611468	all-mpnet-base-v2	3	64
0.916422	0.892523	0.609728	all-mpnet-base-v2	2	128
0.88004	0.929907	0.600428	msmarco-distilroberta-base-v2	2	64
0.885314	0.869159	0.587424	msmarco-distilroberta-base-v2	1	128
0.82648	0.866044	0.584507	all-mpnet-base-v2	2	64
0.912991	0.827103	0.574832	all-MiniLM-L6-v2	1	256
0.867887	0.834112	0.560079	all-mpnet-base-v2	1	256
0.843447	0.831776	0.555873	all-MiniLM-L6-v2	1	128
0.897715	0.845794	0.538579	all-mpnet-base-v2	1	128
0.909813	0.738318	0.530884	msmarco-distilroberta-base-v2	1	64
0.834891	0.738318	0.524882	all-MiniLM-L6-v2	1	64
0.791589	0.67757	0.480863	all-mpnet-base-v2	1	64

Focusing on the Semantic Answer Similarity:

The msmarco-distilroberta-base-v2 embeddings model with top_k=3 and chunk_size=128 yields the best results.
In this evaluation, retrieving documents with top_k=3 usually yields a higher semantic similarity score than top_k=1 or top_k=2.
It also seems that, regardless of top_k and chunk_size, the best semantic similarity scores come from the all-MiniLM-L6-v2 and msmarco-distilroberta-base-v2 embedding models.
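These observations can be cross-checked by averaging the sas column per top_k value and per embedding model. The sketch below uses only the standard library and hard-codes the scores from the table above; the variable names (SAS, mean_by_top_k, mean_by_model) are illustrative and not part of the original notebook.

```python
from collections import defaultdict

# sas scores copied from the results table: model -> [(top_k, chunk_size, sas), ...]
SAS = {
    "all-MiniLM-L6-v2": [
        (1, 64, 0.524882), (2, 64, 0.633806), (3, 64, 0.65133),
        (1, 128, 0.555873), (2, 128, 0.661135), (3, 128, 0.659311),
        (1, 256, 0.574832), (2, 256, 0.642837), (3, 256, 0.676347),
    ],
    "all-mpnet-base-v2": [
        (1, 64, 0.480863), (2, 64, 0.584507), (3, 64, 0.611468),
        (1, 128, 0.538579), (2, 128, 0.609728), (3, 128, 0.643175),
        (1, 256, 0.560079), (2, 256, 0.639072), (3, 256, 0.645992),
    ],
    "msmarco-distilroberta-base-v2": [
        (1, 64, 0.530884), (2, 64, 0.600428), (3, 64, 0.67328),
        (1, 128, 0.587424), (2, 128, 0.664224), (3, 128, 0.68591),
        (1, 256, 0.613355), (2, 256, 0.662694), (3, 256, 0.62854),
    ],
}


def mean(values):
    return sum(values) / len(values)


# Group the scores by top_k and by embedding model.
scores_by_top_k = defaultdict(list)
mean_by_model = {}
for model, rows in SAS.items():
    mean_by_model[model] = mean([sas for _, _, sas in rows])
    for top_k, _, sas in rows:
        scores_by_top_k[top_k].append(sas)
mean_by_top_k = {k: mean(v) for k, v in sorted(scores_by_top_k.items())}

for top_k, value in mean_by_top_k.items():
    print(f"top_k={top_k}: mean sas = {value:.3f}")
for model, value in sorted(mean_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: mean sas = {value:.3f}")
```

Averages hide the per-configuration exceptions, so they complement rather than replace the sorted table.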

Let’s inspect how the scores of each embedding model compare with each other in terms of Semantic Answer Similarity. For that, we will group the results by the embeddings column and plot the scores using box plots:

from matplotlib import pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
df.boxplot(column='sas', by='embeddings', ax=ax)

plt.xlabel("Embeddings Model")
plt.ylabel("Semantic Answer Similarity Values")
plt.title("Boxplots of Semantic Answer Similarity Values Aggregated by Embeddings")

plt.show()

The box-plots above show that:

The all-MiniLM-L6-v2 and the msmarco-distilroberta-base-v2 embedding models outperform the all-mpnet-base-v2
The msmarco-distilroberta-base-v2 scores have less variance, indicating that this model is more stable to top_k and chunk_size parameter variations than the other models
All three embedding models have an outlier corresponding to the highest-scoring and lowest-scoring parameter combination
Not surprisingly, all the lowest-score outliers correspond to top_k=1 and chunk_size=64
The highest-score outliers correspond to top_k=3 and a chunk_size of 128 or 256

Since we have the ground-truth answers, we focused on Semantic Answer Similarity, but let’s also look at the Faithfulness and Context Relevance scores for a few examples. For that, we need to load the detailed scores:

detailed_best_sas_df = pd.read_csv("results/aragog_results/detailed_all-MiniLM-L6-v2__top_k:3__chunk_size:128.csv")

def inspect(idx):
    print("Question: ")
    print(detailed_best_sas_df.loc[idx]['questions'])
    print("\nTrue Answer:")
    print(detailed_best_sas_df.loc[idx]['true_answers'])
    print()
    print("Generated Answer:")
    print(detailed_best_sas_df.loc[idx]['predicted_answers'])
    print()
    print(f"Context Relevance  : {detailed_best_sas_df.loc[idx]['context_relevance']}")
    print(f"Faithfulness       : {detailed_best_sas_df.loc[idx]['faithfulness']}")
    print(f"Semantic Similarity: {detailed_best_sas_df.loc[idx]['sas']}")

Let’s look at question 6:

inspect(6)

Question: 
How does BERT's performance on the GLUE benchmark compare to previous state-of-the-art models?

True Answer:
BERT achieved new state-of-the-art on the GLUE benchmark (80.5%), surpassing the previous best models.

Generated Answer:
BERT's performance on the GLUE benchmark significantly outperforms previous state-of-the-art models, achieving 4.5% and 7.0% respective average accuracy improvement over the prior state of the art.

Context Relevance  : 1.0
Faithfulness       : 1.0
Semantic Similarity: 0.9051246047019958

Contexts:
recent work in this area.
Since its release, GLUE has been used as a testbed and showcase by the developers of several
inﬂuential models, including GPT (Radford et al., 2018) and BERT (Devlin et al., 2019). As shown
in Figure 1, progress on GLUE since its release has been striking. On GLUE, GPT and BERT
achieved scores of 72.8 and 80.2 respectively, relative to 66.5 for an ELMo-based model (Peters
et al., 2018) and 63.7 for the strongest baseline with no multitask learning or pretraining above the
word level. Recent models (Liu et al., 2019d; Yang et al., 2019) have clearly surpassed estimates of
non-expert human performance on GLUE (Nangia and Bowman, 2019). The success of these models
on GLUE has been driven by ever-increasing model capacity, compute power, and data quantity, as
well as innovations in 
---------
56.0 75.1
BERT BASE 84.6/83.4 71.2 90.5 93.5 52.1 85.8 88.9 66.4 79.6
BERT LARGE 86.7/85.9 72.1 92.7 94.9 60.5 86.5 89.3 70.1 82.1
Table 1: GLUE Test results, scored by the evaluation server ( https://gluebenchmark.com/leaderboard ).
The number below each task denotes the number of training examples. The “Average” column is slightly different
than the ofﬁcial GLUE score, since we exclude the problematic WNLI set.8BERT and OpenAI GPT are single-
model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and
accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.
We use a batch size of 32 and ﬁne-tune for 3
epochs over the data for all GLUE tasks. For each
task, we selected the best ﬁne-tuning learning rate
(among 5e-5, 
---------
4e-5, 3e-5, and 2e-5) on the Dev set.
Additionally, for BERT LARGE we found that ﬁne-
tuning was sometimes unstable on small datasets,
so we ran several random restarts and selected the
best model on the Dev set. With random restarts,
we use the same pre-trained checkpoint but per-
form different ﬁne-tuning data shufﬂing and clas-
siﬁer layer initialization.9
Results are presented in Table 1. Both
BERT BASE and BERT LARGE outperform all sys-
tems on all tasks by a substantial margin, obtaining
4.5% and 7.0% respective average accuracy im-
provement over the prior state of the art. Note that
BERT BASE and OpenAI GPT are nearly identical
in terms of model architecture apart from the at-
tention masking. For the largest and most widely
reported GLUE task, MNLI, BERT obtains a 4.6%
absolute accuracy improvement. On the ofﬁcial
GLUE leaderboard10, BERT LARGE obtains a score
of 
---------

In this example, the context relevance and faithfulness scores are both 1.0. This indicates that the retrieved context is relevant to the question and that our RAG LLM relied on this context to generate an answer that is semantically similar to the correct (ground-truth) answer.

Let’s take a look at another example:

inspect(44)

Question: 
How should future language model benchmarks be structured to ensure a holistic assessment of models' capabilities and knowledge breadth?

True Answer:
Future benchmarks should integrate a broader spectrum of subjects and cognitive skills, emphasizing the inclusion of tasks that test models' ethical reasoning, understanding of human values, and ability to perform complex problem-solving, beyond the mere scale of data and parameters.

Generated Answer:
Future language model benchmarks should be structured to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings across a diverse set of subjects that humans learn. The benchmark should cover a wide range of subjects across STEM, humanities, social sciences, and more, ranging in difficulty from elementary to advanced professional levels. It should test both world knowledge and problem-solving ability, ensuring a holistic assessment of models' capabilities and knowledge breadth.

Context Relevance  : 0.6
Faithfulness       : 1.0
Semantic Similarity: 0.6483339071273804

Contexts:
learning model
usage should be developed for guiding users to learn ‘Dos’
and Dont’ in AI. Detailed policies could also be proposed
to list all user’s responsibilities before the model access.
C. Language Models Beyond ChatGPT
The examination of ethical implications associated with
language models necessitates a comprehensive examina-
tion of the broader challenges that arise within the domainof language models, in light of recent advancements in
the field of artificial intelligence. The last decade has seen
a rapid evolution of AI techniques, characterized by an
exponential increase in the size and complexity of AI
models, and a concomitant scale-up of model parameters.
The scaling laws that govern the development of language
models,asdocumentedinrecentliterature[84,85],suggest
thatwecanexpecttoencounterevenmoreexpansivemod-
els that incorporate multiple modalities in the near future.
Efforts to integrate multiple modalities into a single model
are driven by the ultimate goal of realizing the concept of
foundation models [86]. 
---------
language models are
at learning and applying knowledge from many domains.
To bridge the gap between the wide-ranging knowledge that models see during pretraining and the
existing measures of success, we introduce a new benchmark for assessing models across a diverse
set of subjects that humans learn. We design the benchmark to measure knowledge acquired during
pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the
benchmark more challenging and more similar to how we evaluate humans. The benchmark covers
57subjects across STEM, the humanities, the social sciences, and more. It ranges in difﬁculty from
an elementary level to an advanced professional level, and it tests both world knowledge and problem
solving ability. Subjects range from traditional areas, such as mathematics and history, to more
1arXiv:2009.03300v3 [cs.CY] 12 Jan 2021Published as a conference paper at 
---------
a
lack of access to the benefits of these models for people
who speak different languages and can lead to biased or
unfairpredictionsaboutthosegroups[14,15].Toovercome
this, it is crucial to ensure that the training data contains
a substantial proportion of diverse, high-quality corpora
from various languages and cultures.
b) Robustness: Another major ethical consideration
in the design and implementation of language models is
their robustness. Robustness refers to a model’s ability
to maintain its performance when given input that is
semantically or syntactically different from the input it
was trained on.
Semantic Perturbation: Semantic perturbation is a type
of input that can cause a language model to fail [40, 41].
This input has different syntax but is semantically similar
to the input used for training the model. To address this,
it is crucial to ensure that the training data is diverse and
representative of the population it will 
---------

It seems that for this question the retrieved context is only partially relevant (Context Relevance = 0.6), and only the second context was used to generate the answer.

Running your own experiments

If you want to run this experiment yourself, follow the Python code evaluation_aragog.py in the haystack-evaluation repository.

Start by cloning the repository

git clone https://github.com/deepset-ai/haystack-evaluation
cd haystack-evaluation
cd evaluations

Next, run the Python script:

usage: evaluation_aragog.py [-h] --output_dir OUTPUT_DIR [--sample SAMPLE]

You can specify the output directory to hold the results and the sample size, i.e.: how many questions to use for the evaluation.

Don’t forget to define your OpenAI API key in the environment variable OPENAI_API_KEY:

OPENAI_API_KEY=<your_key> python evaluation_aragog.py --output_dir experiment_a --sample 10

Execution Time and Costs

NOTE: all the numbers reported were obtained on a MacBook Pro (Apple M3 Pro, 36GB of RAM) with Haystack 2.2.1 and Python 3.9.

Indexing

The Indexing pipeline needs to consider the parameter combinations defined below:

3 different values for embedding_model
3 different chunk_size values

Therefore, indexing runs 9 times in total (3 x 3).

RAG Pipeline

The RAG pipeline needs to run 27 times, since the following parameters affect the retrieval process:

3 different values for embedding_model
3 different top_k values
3 different chunk_size values

This needs to run for each of the 107 questions, so in total the RAG pipeline runs 2,889 times (3 x 3 x 3 x 107) and produces 2,889 calls to the OpenAI API.

Evaluation Pipeline

The Evaluation pipeline also runs 27 times, since all parameter combinations need to be evaluated, and each run covers the 107 questions, for a total of 2,889 evaluated samples. Because it contains two evaluators that rely on an LLM through the OpenAI API (Faithfulness and ContextRelevance), it produces 5,778 (2 x 2,889) calls to the OpenAI API.
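The arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch that assumes exactly one generation call per question and one call per LLM-based evaluator per question, with no retries; the variable names are illustrative.

```python
from itertools import product

embedding_models = ["all-MiniLM-L6-v2", "msmarco-distilroberta-base-v2", "all-mpnet-base-v2"]
top_k_values = [1, 2, 3]
chunk_sizes = [64, 128, 256]
n_questions = 107
n_llm_evaluators = 2  # Faithfulness and ContextRelevance

# Indexing depends only on the embedding model and the chunk size.
index_runs = len(list(product(embedding_models, chunk_sizes)))
# Retrieval, generation and evaluation depend on all three parameters.
combinations = len(list(product(embedding_models, chunk_sizes, top_k_values)))

rag_calls = combinations * n_questions   # one generation call per question
eval_calls = rag_calls * n_llm_evaluators
total_calls = rag_calls + eval_calls

print(index_runs)    # 9
print(combinations)  # 27
print(rag_calls)     # 2889
print(eval_calls)    # 5778
print(total_calls)   # 8667
```

Multiplying total_calls by an estimated cost and latency per call gives a rough budget before launching the full grid.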

You can see the detailed running times for each parameter combination in the Benchmark Times Spreadsheet.

Pricing

For detailed pricing information, visit OpenAI Pricing 💸

Lessons Learned

In this article, we have shown how to use the Haystack Evaluators to find the combination of parameters that yields the best performance of our RAG pipeline, as opposed to using only the default parameters.

For this ARAGOG dataset, in particular, the best performance is achieved using the msmarco-distilroberta-base-v2 embeddings model instead of the default model (sentence-transformers/all-mpnet-base-v2), together with a top_k=3 and a chunk_size=128.

A few takeaways are worth keeping in mind:

When using an LLM through an external API, it is important to account for potential network errors or other issues. Ensure that during your experiments, running the questions through the RAG pipeline or evaluating the results doesn’t crash due to an error, for instance, by wrapping the call within a try/except code block.
Before starting your experiments, estimate the costs and time involved. If you plan to use an external LLM through an API, calculate approximately how many API calls you will need to run queries through your RAG pipeline and evaluate the results if you use LLM-based evaluators. This will help you understand the total costs and time required for your experiments.
Depending on your dataset size and running time, Python notebooks might not be the best approach to run your experiments; a Python script is probably a more reliable solution.
Beware of which parameters affect which components. For instance, for indexing, only the embedding_model and the chunk_size are important - this can reduce the number of experiments you need to carry out.
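As an illustration of the first point, the try/except idea can be extended with a simple retry and exponential backoff. The run_rag function shown earlier does not retry; this is one possible extension, written as a generic sketch in which call_with_retries and flaky_call are hypothetical names and the wrapped function stands in for any call to an external API.

```python
import time
from typing import Callable, Optional, Tuple, Type


def call_with_retries(fn: Callable[[], str], retries: int = 3, base_delay: float = 1.0,
                      errors: Tuple[Type[BaseException], ...] = (Exception,)) -> Optional[str]:
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again.

    Returns None if every attempt fails, so a long experiment keeps going.
    """
    for attempt in range(retries):
        try:
            return fn()
        except errors as exc:
            print(f"attempt {attempt + 1} failed: {exc}")
            if attempt < retries - 1:
                time.sleep(base_delay * 2 ** attempt)
    return None


# Hypothetical flaky call that succeeds on the third attempt.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network error")
    return "answer"


result = call_with_retries(flaky_call, retries=3, base_delay=0.01)
print(result)  # answer
```

In a real experiment you would probably restrict errors to the transient exception types of your client library, since retrying a request that is too large for the model will never succeed.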

Explore a variety of evaluation examples tailored to different use cases and datasets by visiting the haystack-evaluation repository on GitHub.

Extract Metadata from Queries to Improve Retrieval

Mon, 13 May 2024 00:00:00 +0200

In Retrieval-Augmented Generation (RAG) applications, the retrieval step, which provides relevant context to your large language model (LLM), is vital for generating high-quality responses. There are several ways to improve retrieval, and metadata filtering is one of the easiest. Metadata filtering, the approach of limiting the search space based on some concrete metadata, can noticeably improve the quality of the retrieved documents.

NOTE: this text was originally posted on the Haystack Blog - I’m adding it to my personal blog for more awareness.

Here are some advantages of using metadata filtering:

Relevance: Metadata filtering narrows down the information being retrieved. This ensures that the generated responses align with the specific query or topic.
Accuracy: Filtering based on metadata such as domain, source, date, or topic helps ensure that the information used for generation comes from accurate and trustworthy sources. This is particularly important for applications where accuracy is paramount. For instance, if you need information about a specific year, using the year as a metadata filter will retrieve only pertinent data.
Efficiency: Eliminating irrelevant or low-quality information boosts the efficiency of your RAG application, reduces the amount of processing needed, and speeds up retrieval response times.

You have two options for applying the metadata filter: you can either specify it directly when running the pipeline or extract it from the query itself. In this article, we’ll focus on extracting filters from a query to improve the quality of generated responses in RAG applications. Let’s get started.

Introduction to Metadata Filters

First things first, what is metadata? Metadata (or meta tag) is actually data about your data, used to categorize, sort, and filter information based on various attributes such as date, topic, source, or any other information that you find relevant. After incorporating meta information into your data, you can apply filters to queries used with Retrievers to limit the scope of your search based on this metadata and ensure that your answers come from a specific slice of your data.

Imagine that you have the following Documents in your document store:

documents = [
    Document(
        content="Some text about revenue increase",
        meta={"year": 2022, "company": "Nvidia", "name":"A"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2023, "company": "Nvidia", "name":"B"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2022, "company": "BMW", "name":"C"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2023, "company": "BMW", "name":"D"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2022, "company": "Mercedes", "name":"E"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2023, "company": "Mercedes", "name":"F"}),
]

When the query is “Causes of the revenue increase”, the retriever returns all documents, as they all contain some information about revenue. However, the metadata filter below ensures that any document returned by the retriever has a value of 2022 in the year metadata field and either BMW or Mercedes in the company metadata field. So, only the documents named “C” and “E” are retrieved.

pipeline.run(
    data={
        "retriever":{
            "query": "Causes of the revenue increase",
            "filters": {
                "operator": "AND",
                "conditions": [
                    {"field": "meta.year", "operator": "==", "value": 2022},
                    {"field": "meta.company", "operator": "in", "value": ["BMW", "Mercedes"]}
                ]
            }
        }
    }
)
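To see why only “C” and “E” survive, the filter semantics can be sketched in plain Python. This is a simplified illustration and not Haystack's implementation: the matches function is hypothetical and supports only AND/OR nodes with the == and in comparisons used in this article.

```python
from typing import Any, Dict, List

# Metadata of the six example documents (content omitted for brevity).
docs: List[Dict[str, Any]] = [
    {"name": "A", "year": 2022, "company": "Nvidia"},
    {"name": "B", "year": 2023, "company": "Nvidia"},
    {"name": "C", "year": 2022, "company": "BMW"},
    {"name": "D", "year": 2023, "company": "BMW"},
    {"name": "E", "year": 2022, "company": "Mercedes"},
    {"name": "F", "year": 2023, "company": "Mercedes"},
]

filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.year", "operator": "==", "value": 2022},
        {"field": "meta.company", "operator": "in", "value": ["BMW", "Mercedes"]},
    ],
}


def matches(meta: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Simplified filter check: AND/OR nodes over '==' and 'in' comparisons."""
    if "conditions" in flt:  # logical node
        results = [matches(meta, cond) for cond in flt["conditions"]]
        if flt["operator"] == "AND":
            return all(results)
        if flt["operator"] == "OR":
            return any(results)
        raise ValueError(f"unsupported logical operator: {flt['operator']}")
    # comparison leaf: strip the "meta." prefix to get the metadata key
    field = flt["field"]
    key = field[len("meta."):] if field.startswith("meta.") else field
    actual = meta.get(key)
    if flt["operator"] == "==":
        return actual == flt["value"]
    if flt["operator"] == "in":
        return actual in flt["value"]
    raise ValueError(f"unsupported comparison operator: {flt['operator']}")


selected = [doc["name"] for doc in docs if matches(doc, filters)]
print(selected)  # ['C', 'E']
```

Note that in this sketch the comparison is type-sensitive: the string "2022" would not equal the integer 2022, so the filter value should have the same type as the stored metadata.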

In this example, we pass the filter explicitly, but sometimes, the query itself might contain information that can be used as a metadata filter during the querying process. In this case, we need to preprocess the query to extract filters before we use it with a retriever.

Extracting Metadata Filters from a Query

In LLM-based applications, queries are written in natural language. From time to time, they include valuable hints that can be used as metadata filters to improve the retrieval. We can extract these hints, formulate them as metadata filters, and use them with the retriever alongside the query. For instance, when the query is “What was the revenue of Nvidia in 2022?”, we can extract 2022 as years and Nvidia as companies. Based on this information, the metadata filter to use with a retriever should look like this:

"filters": {
    "operator": "AND",
    "conditions": [
        {"field": "meta.years", "operator": "==", "value": "2022"},
        {"field": "meta.companies", "operator": "==", "value": "Nvidia"}
    ]
}

Thankfully, LLMs are highly capable of extracting structured information from unstructured text. Let’s see step-by-step how we can implement a custom component that uses an LLM to extract keywords, phrases, or entities from the query and formulate the metadata filter.

Implementing `QueryMetadataExtractor`

🧑‍🍳 You can find and run all the code in our cookbook Extracting Metadata Filter from a Query

We start by creating a custom component, QueryMetadataExtractor, which takes query and metadata_fields as inputs and outputs filters. This component encapsulates a generative pipeline, made up of PromptBuilder and OpenAIGenerator. The pipeline instructs the LLM to extract keywords, phrases, or entities from a given query which can then be used as metadata filters. In the prompt, we include instructions to ensure the output format is in JSON and provide metadata_fields along with the query to ensure the correct entities are extracted from the query.

Once the pipeline is initialized in the init method of the component, we post-process the LLM output in the run method. This step ensures the extracted metadata is correctly formatted to be used as a metadata filter.

import json
from typing import Dict, List

from haystack import Pipeline, component
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

@component()
class QueryMetadataExtractor:

    def __init__(self):
        prompt = """
        You are part of an information system that processes users queries.
        Given a user query you extract information from it that matches a given list of metadata fields.
        The information to be extracted from the query must match the semantics associated with the given metadata fields.
        The information that you extracted from the query will then be used as filters to narrow down the search space
        when querying an index.
        Just include the value of the extracted metadata without including the name of the metadata field.
        The extracted information in 'Extracted metadata' must be returned as a valid JSON structure.
        ###
        Example 1:
        Query: "What was the revenue of Nvidia in 2022?"
        Metadata fields: {"company", "year"}
        Extracted metadata fields: {"company": "nvidia", "year": 2022}
        ###
        Example 2:
        Query: "What were the most influential publications in 2023 regarding Alzheimer's disease?"
        Metadata fields: {"disease", "year"}
        Extracted metadata fields: {"disease": "Alzheimer", "year": 2023}
        ###
        Example 3:
        Query: "{{query}}"
        Metadata fields: "{{metadata_fields}}"
        Extracted metadata fields:
        """
        self.pipeline = Pipeline()
        self.pipeline.add_component(name="builder", instance=PromptBuilder(prompt))
        self.pipeline.add_component(name="llm", instance=OpenAIGenerator(model="gpt-3.5-turbo"))
        self.pipeline.connect("builder", "llm")

    @component.output_types(filters=Dict[str, str])
    def run(self, query: str, metadata_fields: List[str]):
        result = self.pipeline.run({'builder': {'query': query, 'metadata_fields': metadata_fields}})
        metadata = json.loads(result['llm']['replies'][0])

        # this can be done with specific data structures and in a more sophisticated way
        filters = []
        for key, value in metadata.items():
            field = f"meta.{key}"
            filters.append({"field": field, "operator": "==", "value": value})

        return {"filters": {"operator": "AND", "conditions": filters}}

First, let’s test the QueryMetadataExtractor in isolation, passing a query and a list of metadata fields.

extractor = QueryMetadataExtractor()

query = "What were the most influential publications in 2022 regarding Parkinson's disease?"
metadata_fields = {"disease", "year"}

result = extractor.run(query, metadata_fields)
print(result)

The result should look like this:

{'filters': {'operator': 'AND',
  'conditions': [
    {'field': 'meta.disease', 'operator': '==', 'value': 'Parkinson'},
    {'field': 'meta.year', 'operator': '==', 'value': 2022}
  ]}
}

Notice that the QueryMetadataExtractor has extracted the metadata fields from the query and returned them in a format that can be used as filters passed directly to a Retriever. By default, the QueryMetadataExtractor will use all metadata fields as conditions together with an AND operator.
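Since the filter is built from free-form LLM output, the post-processing step can fail when the reply is not valid JSON or contains unexpected keys. The sketch below shows one defensive variant of that step using only the standard library; reply_to_filters and its allowed_fields parameter are hypothetical and not part of the component above, and returning None is just one possible convention (for example, to fall back to unfiltered retrieval).

```python
import json
from typing import Any, Dict, List, Optional


def reply_to_filters(reply: str, allowed_fields: List[str]) -> Optional[Dict[str, Any]]:
    """Turn an LLM reply into an AND filter, or return None if it is unusable."""
    try:
        metadata = json.loads(reply.strip())
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    if not isinstance(metadata, dict):
        return None  # valid JSON, but not a key/value structure
    # Keep only the fields we asked for and skip empty values.
    conditions = [
        {"field": f"meta.{key}", "operator": "==", "value": value}
        for key, value in metadata.items()
        if key in allowed_fields and value is not None
    ]
    if not conditions:
        return None
    return {"operator": "AND", "conditions": conditions}


good = reply_to_filters('{"disease": "Parkinson", "year": 2022}', ["disease", "year"])
bad = reply_to_filters("Sorry, I could not extract anything.", ["disease", "year"])
print(good)
print(bad)  # None
```

Restricting the conditions to allowed_fields also protects against the model inventing a metadata field that does not exist in the document store.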

Using `QueryMetadataExtractor` in a Pipeline

Now, let’s plug the QueryMetadataExtractor into a Pipeline with a Retriever connected to a DocumentStore to see how it works in practice.

We start by creating an InMemoryDocumentStore and adding some documents to it. We include info about “year” and “disease” in the “meta” field of each document.

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

documents = [
    Document(
        content="some publication about Alzheimer prevention research done over 2023 patients study",
        meta={"year": 2022, "disease": "Alzheimer", "author": "Michael Butter"}),
    Document(
        content="some text about investigation and treatment of Alzheimer disease",
        meta={"year": 2023, "disease": "Alzheimer", "author": "John Bread"}),
    Document(
        content="A study on the effectiveness of new therapies for Parkinson's disease",
        meta={"year": 2022, "disease": "Parkinson", "author": "Alice Smith"}
    ),
    Document(
        content="An overview of the latest research on the genetics of Parkinson's disease and its implications for treatment",
        meta={"year": 2023, "disease": "Parkinson", "author": "David Jones"}
    )
]

document_store = InMemoryDocumentStore(bm25_algorithm="BM25Plus")
document_store.write_documents(documents=documents, policy=DuplicatePolicy.OVERWRITE)

We then create a pipeline consisting of the QueryMetadataExtractor and an InMemoryBM25Retriever connected to the InMemoryDocumentStore created above.

Learn about connecting components and creating pipelines in Docs: Creating Pipelines.

from haystack import Pipeline, Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retrieval_pipeline = Pipeline()
metadata_extractor = QueryMetadataExtractor()
retriever = InMemoryBM25Retriever(document_store=document_store)

retrieval_pipeline.add_component(instance=metadata_extractor, name="metadata_extractor")
retrieval_pipeline.add_component(instance=retriever, name="retriever")
retrieval_pipeline.connect("metadata_extractor.filters", "retriever.filters")

Now define a query and metadata fields and pass them to the pipeline:

query = "publications 2023 Alzheimer's disease"
metadata_fields = {"year", "author", "disease"}

retrieval_pipeline.run(data={"metadata_extractor": {"query": query, "metadata_fields": metadata_fields}, "retriever":{"query": query}})

This returns only the documents whose metadata fields satisfy year = 2023 and disease = Alzheimer:

{'documents': 
 [Document(
     id=e3b0bfd497a9f83397945583e77b293429eb5bdead5680cc8f58dd4337372aa3, 
     content: 'some text about investigation and treatment of Alzheimer disease', 
     meta: {'year': 2023, 'disease': 'Alzheimer', 'author': 'John Bread'}, 
     score: 2.772588722239781)]
     }

Conclusion

Metadata filtering stands out as a powerful technique for improving the relevance and accuracy of retrieved documents, thus enabling the generation of high-quality responses in RAG applications. Using the custom component QueryMetadataExtractor we implemented, we can extract filters from user queries and directly use them with Retrievers.

Incorporate HyDE into Haystack RAG pipelines

Wed, 28 Feb 2024 00:00:00 +0100

Hypothetical Document Embeddings (HyDE) is a technique proposed in the paper “Precise Zero-Shot Dense Retrieval without Relevance Labels”. It improves retrieval by generating “fake” hypothetical documents for a given query and then using the embeddings of those “fake” documents to retrieve similar documents from the same embedding space. In this article, we will see how to implement it and incorporate it into Haystack by creating a custom component.

NOTE: this text was originally posted on the Haystack Blog; I’m adding it to my personal blog for more awareness.

To learn more about how HyDE works, and where it’s useful, check out our guide on Hypothetical Document Embeddings (HyDE)

Build a Pipeline to Create Hypothetical Document Embeddings

First, let’s build a simple pipeline to generate these hypothetical documents. To do so, we will use the following Haystack components:

PromptBuilder and OpenAIGenerator to query an instruction-following language model and generate hypothetical documents.
SentenceTransformersDocumentEmbedder encodes the hypothetical documents into vector embeddings.
OutputAdapter to adapt the output of the Generator to be compatible with the input of the SentenceTransformersDocumentEmbedder, which expects List[Document]

To use the OpenAIGenerator, you need to set your OPENAI_API_KEY
export OPENAI_API_KEY="secret_string"

We first build a way to query an instruction-following language model to generate hypothetical documents.

from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.builders import PromptBuilder

generator = OpenAIGenerator(
	model="gpt-3.5-turbo",
	generation_kwargs={"n": 5, "temperature": 0.75, "max_tokens": 400},
)

template="""Given a question, generate a paragraph of text that answers the question.
	    Question: {{question}}
	    Paragraph:"""
prompt_builder = PromptBuilder(template=template)

This will output a list of 5 hypothetical documents, the same number the authors used for the experiments in the paper. We then use the SentenceTransformersDocumentEmbedder to encode these hypothetical documents into embeddings.

But, the SentenceTransformersDocumentEmbedder expects List[Document] objects as input, so we need to adapt the output of the OpenAIGenerator to be compatible with the input of the SentenceTransformersDocumentEmbedder. For this, we use an OutputAdapter with a custom filter:

from haystack import Document
from haystack.components.converters import OutputAdapter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from typing import List

adapter = OutputAdapter(
    template="{{answers | build_doc}}",
    output_type=List[Document],
    custom_filters={"build_doc": lambda data: [Document(content=d) for d in data]}
)

embedder = SentenceTransformersDocumentEmbedder(
	model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()

We can now create a custom component, HypotheticalDocumentEmbedder, that expects documents and returns a hypothetical_embedding, which is the average of the embeddings of the “hypothetical” (fake) documents.

from numpy import array, mean
from haystack import component

@component
class HypotheticalDocumentEmbedder:

    @component.output_types(hypothetical_embedding=List[float])
    def run(self, documents: List[Document]):
        stacked_embeddings = array([doc.embedding for doc in documents])
        avg_embeddings = mean(stacked_embeddings, axis=0)
        hyde_vector = avg_embeddings.reshape((1, len(avg_embeddings)))
        return {"hypothetical_embedding": hyde_vector[0].tolist()}

Now we can add all of these into a pipeline and generate hypothetical document embeddings.

from haystack import Pipeline

hyde = HypotheticalDocumentEmbedder()

pipeline = Pipeline()
pipeline.add_component(name="prompt_builder", instance=prompt_builder)
pipeline.add_component(name="generator", instance=generator)
pipeline.add_component(name="adapter", instance=adapter)
pipeline.add_component(name="embedder", instance=embedder)
pipeline.add_component(name="hyde", instance=hyde)

pipeline.connect("prompt_builder", "generator")
pipeline.connect("generator.replies", "adapter.answers")
pipeline.connect("adapter.output", "embedder.documents")
pipeline.connect("embedder.documents", "hyde.documents")

query = "What should I do if I have a fever?"
result = pipeline.run(data={"prompt_builder": {"question": query}})

Below is a graphical representation of the pipeline we created.

Build a Complete HyDE Component

Optionally, we could also create a HypotheticalDocumentEmbedder that encapsulates the entire logic that we saw above. This way, we would be able to use this single component for improved retrieval.

This component can do a few things:

Allow the user to pick the LLM which generates the hypothetical documents
Allow users to define how many documents should be created with nr_completions
Allow users to define the embedding model they want to use to generate the HyDE embeddings.

from haystack import Pipeline, component, Document, default_to_dict, default_from_dict
from haystack.components.converters import OutputAdapter
from haystack.components.embedders.sentence_transformers_document_embedder import SentenceTransformersDocumentEmbedder
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.builders import PromptBuilder

from typing import Dict, Any, List
from numpy import array, mean

from haystack.utils import Secret

@component
class HypotheticalDocumentEmbedder:

    def __init__(
        self,
        instruct_llm: str = "gpt-3.5-turbo",
        instruct_llm_api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
        nr_completions: int = 5,
        embedder_model: str = "sentence-transformers/all-MiniLM-L6-v2",
    ):
        self.instruct_llm = instruct_llm
        self.instruct_llm_api_key = instruct_llm_api_key
        self.nr_completions = nr_completions
        self.embedder_model = embedder_model
        self.generator = OpenAIGenerator(
            api_key=self.instruct_llm_api_key,
            model=self.instruct_llm,
            generation_kwargs={"n": self.nr_completions, "temperature": 0.75, "max_tokens": 400},
        )
        self.prompt_builder = PromptBuilder(
            template="""Given a question, generate a paragraph of text that answers the question.
            Question: {{question}}
            Paragraph:
            """
        )

        self.adapter = OutputAdapter(
            template="{{answers | build_doc}}",
            output_type=List[Document],
            custom_filters={"build_doc": lambda data: [Document(content=d) for d in data]},
        )

        self.embedder = SentenceTransformersDocumentEmbedder(model=embedder_model, progress_bar=False)
        self.embedder.warm_up()

        self.pipeline = Pipeline()
        self.pipeline.add_component(name="prompt_builder", instance=self.prompt_builder)
        self.pipeline.add_component(name="generator", instance=self.generator)
        self.pipeline.add_component(name="adapter", instance=self.adapter)
        self.pipeline.add_component(name="embedder", instance=self.embedder)
        self.pipeline.connect("prompt_builder", "generator")
        self.pipeline.connect("generator.replies", "adapter.answers")
        self.pipeline.connect("adapter.output", "embedder.documents")

    def to_dict(self) -> Dict[str, Any]:
        data = default_to_dict(
            self,
            instruct_llm=self.instruct_llm,
            instruct_llm_api_key=self.instruct_llm_api_key,
            nr_completions=self.nr_completions,
            embedder_model=self.embedder_model,
        )
        data["pipeline"] = self.pipeline.to_dict()
        return data

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "HypotheticalDocumentEmbedder":
        hyde_obj = default_from_dict(cls, data)
        hyde_obj.pipeline = Pipeline.from_dict(data["pipeline"])
        return hyde_obj

    @component.output_types(hypothetical_embedding=List[float])
    def run(self, query: str):
        result = self.pipeline.run(data={"prompt_builder": {"question": query}})
        # return a single query vector embedding representing the average of the hypothetical document embeddings
        stacked_embeddings = array([doc.embedding for doc in result["embedder"]["documents"]])
        avg_embeddings = mean(stacked_embeddings, axis=0)
        hyde_vector = avg_embeddings.reshape((1, len(avg_embeddings)))
        return {"hypothetical_embedding": hyde_vector[0].tolist()}

Using the `HypotheticalDocumentEmbedder` for Retrieval

As a final step, let’s see how we can use our new component in a retrieval pipeline. To start, we can create a document store that has some data in it.

from datasets import load_dataset, Dataset
from haystack import Pipeline, Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

embedder_model = "sentence-transformers/all-MiniLM-L6-v2"

def index_docs(data: Dataset):
	document_store = InMemoryDocumentStore()
	pipeline = Pipeline()
	
	pipeline.add_component("cleaner", DocumentCleaner())
	pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=10))
	pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(model=embedder_model))
	pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy="skip"))

	pipeline.connect("cleaner", "splitter")
	pipeline.connect("splitter", "embedder")
	pipeline.connect("embedder", "writer")
	pipeline.run({"cleaner": {"documents": [Document.from_dict(doc) for doc in data["train"]]}})

	return document_store
	
data = load_dataset("Tuana/game-of-thrones")
doc_store = index_docs(data)

Now that we’ve populated an InMemoryDocumentStore with some data, let’s see how we can use the HypotheticalDocumentEmbedder as a way to retrieve documents 👇

from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

def retriever_with_hyde(doc_store):
	hyde = HypotheticalDocumentEmbedder(instruct_llm="gpt-3.5-turbo", nr_completions=5)
	retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
	
	retrieval_pipeline = Pipeline()
	retrieval_pipeline.add_component(instance=hyde, name="query_embedder")
	retrieval_pipeline.add_component(instance=retriever, name="retriever")

	retrieval_pipeline.connect("query_embedder.hypothetical_embedding", "retriever.query_embedding")
	return retrieval_pipeline

retrieval_pipeline = retriever_with_hyde(doc_store)
query = "Who is Arya Stark?"
retrieval_pipeline.run(data={"query_embedder": {"query": query}, "retriever": {"top_k": 5}})

Semantic Drift in Machine Learning

Wed, 15 Nov 2023 01:00:00 +0100

Machine Learning models are static artefacts based on historical data, which start consuming “real-world” data when deployed into production. Real-world data might not reflect the historical training and test data. The progressive changes between the training data and the “real-world” data are called drift, and they can be one of the reasons why model accuracy decreases over time.

How can drift occur?

Sudden: A new concept occurs within a short time.

Gradual: A new concept gradually replaces an old one over a period of time.

Incremental: An old concept incrementally changes to a new concept.

Reoccurring: An old concept may reoccur after some time.

Drift detection in a nutshell

1) Create two datasets: reference and current/serving

2) Calculate statistical measures for the feature values in the reference (e.g.: distribution of values over features)

3) The reference dataset acts as the benchmark/baseline

4) Select “real-world” data - current/serving

5) Calculate the same statistical measures as for the reference

6) Compare them both using statistical tools (e.g.: distance metrics, hypothesis tests)

7) Depending on a threshold, decide whether or not drift occurred

NOTE: repeat steps 4 to 7 periodically
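The steps above can be sketched for a single numerical feature as follows. This is only an illustration under stated assumptions: the two-sample Kolmogorov-Smirnov statistic is one possible choice of comparison, the threshold of 0.15 is an arbitrary value picked for the example, and the function names and synthetic data are hypothetical.

```python
import random


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    i = j = 0
    max_gap = 0.0
    while i < len(ref) and j < len(cur):
        value = min(ref[i], cur[j])
        # Advance both pointers past every observation <= value.
        while i < len(ref) and ref[i] <= value:
            i += 1
        while j < len(cur) and cur[j] <= value:
            j += 1
        max_gap = max(max_gap, abs(i / len(ref) - j / len(cur)))
    return max_gap


DRIFT_THRESHOLD = 0.15  # assumed value; in practice it is tuned per feature


def drift_detected(reference, current, threshold=DRIFT_THRESHOLD):
    """Step 7: flag drift when the statistic exceeds the chosen threshold."""
    return ks_statistic(reference, current) > threshold


# Synthetic data standing in for one feature of the reference and serving datasets.
rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]
serving_stable = [rng.gauss(0.0, 1.0) for _ in range(1000)]
serving_shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]

print(drift_detected(reference, serving_stable))
print(drift_detected(reference, serving_shifted))
```

In a real setup the same check would run periodically on fresh serving data, once per monitored feature.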

Drift, Skew, Shift

• Data Drift: changes in the statistical properties of features over time, due to seasonality, trends or unexpected events.

• Concept Drift: changes in the statistical properties of the labels over time, e.g.: the mapping from features to labels learned in training remains static while it changes in the real world.

• Schema Skew: training and serving data do not conform to the same schema, e.g.: getting an integer when expecting a float, empty string vs None.

• Distribution Skew: divergence between training and serving datasets, e.g.: dataset shift caused by covariate/concept shift.

Covariate Shift

Marginal distribution of features $x$ is not the same during training and serving, but the conditional distribution remains unchanged.

Example: the number of predictions of relevant/non-relevant text samples is in line with the development test set, but the distribution of features differs between training and serving.

Concept Shift

The conditional distribution of labels given features is not the same during training and serving, but the marginal distribution of features remains unchanged.

Example: although the text samples being crawled did not change and the distribution of feature values is still the same, what determines whether a text sample is relevant or non-relevant has changed.

How to detect them?

Measuring Embeddings Drift

Average the embeddings in the current and reference datasets and compare them with some similarity/distance metric: Euclidean distance, cosine similarity.
Train a binary classifier to discriminate between the reference and current distributions. If the model can confidently identify which embeddings belong to which dataset, you can consider that the two datasets differ significantly.

Embeddings can be seen as a structured tabular dataset: rows are individual texts and columns are the components of each embedding.
Treat each component as a numerical “feature” and check for drift in its distribution between the reference and current datasets.
If many embedding components drift, you can consider that there is a meaningful change in the data.
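The first approach, averaging the embeddings of each dataset and comparing the two averages, can be sketched in a few lines of plain Python. The tiny two-dimensional vectors and the helper names are made up for illustration; real embeddings would come from the model being monitored.

```python
import math


def mean_vector(embeddings):
    """Component-wise average of a list of equally sized embedding vectors."""
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means more different."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy embeddings: rows are texts, columns are embedding components.
reference_embeddings = [[1.0, 0.0], [0.9, 0.1]]
current_embeddings = [[0.0, 1.0], [0.1, 0.9]]

distance = cosine_distance(mean_vector(reference_embeddings), mean_vector(current_embeddings))
print(round(distance, 3))
```

A distance close to 0 suggests the two datasets occupy a similar region of the embedding space; how large a distance counts as drift is a threshold that has to be chosen empirically.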

Sentence Transformer Fine-Tuning - SetFit

Mon, 23 Oct 2023 02:00:00 +0200

Sentence Transformer Fine-Tuning (SetFit) is a technique to mitigate the problem of having only a few annotated samples by fine-tuning a pre-trained sentence-transformers model on a small number of text pairs in a contrastive learning manner. The resulting model is then used to generate rich text embeddings, which are used to train a classification head, resulting in a final classifier fine-tuned to the specific dataset.

Figure 1 - SetFit two phases.

Contrastive Learning

The first step relies on a sentence-transformer model and adapts a contrastive training approach that is often used for image similarity detection (Koch et al., 2015).

The basic contrastive learning framework consists of selecting a data sample, called the anchor, a data point belonging to the same distribution as the anchor, called the positive sample, and another data point belonging to a different distribution, called the negative sample, as shown in Figure 1.

The model tries to minimize the distance between the anchor and positive sample and, at the same time, maximize the distance between the anchor and the negative samples. The distance function can be anything in the embedding space.

Figure 1 - Contrastive Learning from Vision AI (source).

Selecting Positive and Negative Triples

Given a dataset of $K$ labeled examples

\[D = \{(x_i, y_i)\}\]

where $x_i$ and $y_i$ are sentences and their class labels, respectively.

For each class label $c \in C$ in the dataset we need to generate a set of positive triples:

\[T_{p}^{c} = \{(x_{i}, x_{j}, 1)\}\]

where $x_{i}$ and $x_{j}$ are pairs of randomly chosen sentences from the same class $c$, i.e. $(y_{i} = y_{j} = c)$,

and also a set of negative triples:

\[T_{n}^{c} = \{(x_{i}, x_{j}, 0)\}\]

where $x_{i}$ and $x_{j}$ are randomly chosen sentences from different classes such that $(y_{i} = c, y_{j} \neq c)$.

Building the Contrastive Fine-tuning Dataset

The contrastive fine-tuning dataset $T$ is produced by concatenating the positive and negative triples across all class labels:

\[T = \{ (T_{p}^{0}, T_{n}^{0}), (T_{p}^{1}, T_{n}^{1}), \ldots, (T_{p}^{|C|}, T_{n}^{|C|}) \}\]

$\vert C \vert$ is the number of class labels

\[\vert T \vert = 2R \vert C \vert\]

is the number of pairs in $T$ and $R$ is a hyperparameter.
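As a rough sketch of this construction, the function below draws $R$ positive and $R$ negative triples per class, which yields $2R \vert C \vert$ triples in total. The function name, the toy sentences and the sampling details (for example, sampling with a fixed seed and allowing repeated pairs) are assumptions made for illustration, not the SetFit library's implementation.

```python
import random


def build_contrastive_triples(sentences, labels, R, seed=0):
    """Return (x_i, x_j, 1) and (x_i, x_j, 0) triples, R of each kind per class."""
    rng = random.Random(seed)
    triples = []
    for c in sorted(set(labels)):
        same = [x for x, y in zip(sentences, labels) if y == c]
        other = [x for x, y in zip(sentences, labels) if y != c]
        for _ in range(R):
            # Positive triple: two distinct sentences from class c (needs >= 2 examples).
            x_i, x_j = rng.sample(same, 2)
            triples.append((x_i, x_j, 1))
        for _ in range(R):
            # Negative triple: one sentence from class c, one from any other class.
            triples.append((rng.choice(same), rng.choice(other), 0))
    return triples


sentences = ["great film", "loved it", "a joy to watch",
             "terrible plot", "waste of time", "really boring"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
R = 2

T = build_contrastive_triples(sentences, labels, R)
print(len(T))  # 8, i.e. 2 * R * |C| with |C| = 2
```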

Fine-Tuning

The contrastive fine-tuning dataset is then used to fine-tune the pre-trained sentence-transformer model using a contrastive loss function. The contrastive loss function is designed to minimize the distance between the anchor and positive samples and maximize the distance between the anchor and negative samples.

Figure 2 - The new embedded latent space after siamese contrastive learning.

Training Classification Head

This step is a standard supervised learning task, where the fine-tuned sentence-transformer model is used to generate embeddings for the training data, and a classification head is trained on top of the embeddings to predict the class labels.

References

Sentence Transformers

Sun, 22 Oct 2023 02:00:00 +0200

The sentence-transformers approach proposed in Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks is an effective and efficient way to train a Transformer-based neural network so that it produces good embeddings for a sentence or paragraph. In this post I review the mechanism to train such embeddings presented in the paper.

Introduction

Measuring the similarity of a pair of short texts (e.g.: sentences or paragraphs) is a common NLP task. One can achieve this with BERT used as a cross-encoder: both sentences are concatenated with a separating token [SEP], this input is passed to the transformer network, and the target value is predicted, in this case, similar or not similar.

But the number of comparisons in this approach grows quadratically with the number of sentences. For instance, finding the pair with the highest similarity in a collection of $n$ sentences requires $n \cdot (n-1)/2$ inference computations with BERT.
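To get a feeling for how quickly this number grows, the formula can be evaluated for a few collection sizes. The helper name below is hypothetical and the sizes are arbitrary examples.

```python
def cross_encoder_comparisons(n: int) -> int:
    """Number of unordered sentence pairs that a cross-encoder has to score."""
    return n * (n - 1) // 2


for n in (100, 1_000, 10_000):
    print(n, cross_encoder_comparisons(n))
# 100 4950
# 1000 499500
# 10000 49995000
```

With sentence embeddings, each sentence is encoded only once ($n$ forward passes) and the pairwise comparison reduces to a cheap vector operation.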

Alternatively, one could also rely on BERT’s output layer, i.e. the embeddings, or use the output of the first token, i.e.: the [CLS] token, but as the authors show this often leads to worse results than just using static embeddings, like GloVe embeddings (Pennington et al., 2014).

To overcome this issue, and still use the contextual word embedding representations provided by BERT (or any other Transformer-based model), the authors use a pre-trained BERT/RoBERTa network and fine-tune it to yield useful sentence embeddings, meaning that semantically similar sentences are close in vector space.

Fine-Tuning BERT for Semantic (Dis)Similarity

The main architectural component of this approach is a Siamese Neural Network, a neural network containing two or more identical sub-networks, whose weights are updated equally across both sub-networks.

Figure 1 - The architecture to fine-tune sentence-transformers.

The training procedure of the network is the following. Each input sentence is fed into BERT, producing embeddings for each token of the sentence. To have a fixed-size output representation, the authors apply a pooling layer, exploring three strategies:

[CLS]-token
MEAN-strategy: computing the mean of all output vectors
MAX-strategy: computing a max-over-time of the output vectors
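The three pooling strategies can be illustrated on a toy list of token vectors. This sketch assumes the first vector is the [CLS] token and ignores padding and attention masks, which a real implementation has to take into account; the function names are chosen for the example.

```python
def cls_pooling(token_vectors):
    """Use the output vector of the first ([CLS]) token as the sentence embedding."""
    return token_vectors[0]


def mean_pooling(token_vectors):
    """Average all token output vectors, dimension by dimension."""
    return [sum(column) / len(token_vectors) for column in zip(*token_vectors)]


def max_pooling(token_vectors):
    """Max-over-time: take the maximum of each dimension across all tokens."""
    return [max(column) for column in zip(*token_vectors)]


# Three tokens with two-dimensional output vectors (toy values).
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

print(cls_pooling(tokens))   # [1.0, 4.0]
print(mean_pooling(tokens))  # [2.0, 2.0]
print(max_pooling(tokens))   # [3.0, 4.0]
```

Whatever the sentence length, each strategy returns a vector with the same number of dimensions as a single token vector.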

They experiment with three different objective functions, depending on the training data:

Classification:

\[o = \text{softmax}(W_t(u, v, |u - v|))\]

concatenating the sentence embeddings $u$ and $v$ with the element-wise difference $\vert u - v \vert$ and multiplying the result by the trainable weight $W_t \in \mathbb{R}^{3n \times k}$.

Regression:

\[\cos(u, v)\]

simply the cosine between the two sentence embeddings.

Triplet Objective Function:

\[\max(\lVert s_{a} - s_{p} \rVert - \lVert s_{a} - s_{n} \rVert + \epsilon, 0)\]

a baseline anchor $a$ input is compared to a positive $p$ input and a negative $n$ input. The distance from the baseline $a$ to the positive $p$ input is minimized, and the distance from the baseline $a$ to the negative $n$ is maximized. As far as I’ve understood this objective function is only used in one experiment with the Wikipedia Sections Distinction dataset.
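The triplet objective is easy to compute by hand for small vectors. The sketch below assumes the Euclidean distance as the norm and a margin $\epsilon = 1$; both are choices made for this example, and the toy vectors are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(||s_a - s_p|| - ||s_a - s_n|| + margin, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the anchor
far = [3.0, 4.0]    # distance 5 from the anchor

print(triplet_loss(anchor, positive=near, negative=far))  # 0.0
print(triplet_loss(anchor, positive=far, negative=near))  # 5.0
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so only triplets that violate this condition contribute to the gradient.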

Training Data

The objective function (classification vs. regression) depends on the training data. For the classification objective function, the authors used the Stanford Natural Language Inference (SNLI) Corpus, a collection of 570,000 sentence pairs annotated with the labels:

contradiction
entailment
neutral

They also used the Multi-Genre NLI Corpus, which contains 430,000 sentence pairs and covers a range of genres of spoken and written text.

For the regression objective function, the authors trained on the training set of the Semantic Textual Similarity (STS) benchmark dataset from SemEval.

Training: fine-tuning

In order to fine-tune BERT and RoBERTa, the authors used a Siamese Neural Network (SNN) strategy to update the weights such that the produced sentence embeddings are semantically meaningful.

An SNN can be used to find the similarity of two inputs by comparing their feature vectors, so these networks learn a similarity function that takes two inputs and outputs 1 if they belong to the same class and 0 otherwise.

It learns parameters such that, if the two sentences are similar:

\[\lVert f(x_1) - f(x_2) \rVert^2 \text{ is small}\]

and if the two sentences are dissimilar:

\[\lVert f(x_1) - f(x_2) \rVert^2 \text{ is large}\]

where $f(x)$ is the embedding of $x$.

Configuration parameters

batch-size: 16,
Adam optimizer with learning rate 2e-5,
linear learning rate warm-up over 10% of the training data,
MEAN as the default pooling strategy,
3-way softmax-classifier objective function for one epoch.

Evaluation

The authors evaluated their approach on several common Semantic Textual Similarity (STS) datasets, using cosine similarity to compare two sentence embeddings, as opposed to learning a regression function that maps two sentence embeddings to a similarity score. They also experimented with Manhattan and negative Euclidean distances as similarity measures, but the results remained roughly the same.

Figure 2 - sentence-transformers in inference mode.

To recapitulate, BERT and RoBERTa are fine-tuned using the training described above, and the resulting models are used to generate embeddings for sentences, whose similarity is measured by the cosine. The evaluation datasets were:

SemEval 2012-2016 - Semantic Textual Similarity (STS) datasets: 2012, 2013, 2014, 2015, 2016
SICK (Sentences Involving Compositional Knowledge)
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation

Figure 3 - Results from the experimental evaluation.

The authors also carry out experiments with a few other datasets, but I will refer the reader to the original paper for further details.

Ablation Study

The study explored different methods to concatenate the sentence embeddings for training the softmax classifier in the process of fine-tuning a BERT/RoBERTa transformer model.

According to the authors, the most important component is the element-wise difference $\vert u - v \vert$, which measures the distance between the dimensions of the two sentence embeddings, ensuring that similar pairs are closer and dissimilar pairs are further apart.

Figure 8 - Results from the ablation study.

Implementation and others

The sentence-transformers package gained popularity in the NLP community and can be used for multiple tasks such as semantic text similarity, semantic search, retrieve and re-rank, clustering and others; see the official SBERT webpage for several tutorials and API documentation.

One of the authors of the paper, Nils Reimers, has given several talks on ideas and approaches leveraging sentence-transformers; here are two I’ve found interesting:

References

Triplet Loss and Online Triplet Mining in TensorFlow from Olivier Moindrot blog

Generative AI with Large Language Models

Fri, 15 Sep 2023 02:00:00 +0200

I’ve completed the course and recommend it to anyone interested in delving into some of the intricacies of the Transformer architecture and Large Language Models. The course covered a wide range of topics, starting with a quick introduction to the Transformer architecture, moving on to fine-tuning LLMs, and also exploring chain-of-thought prompting and augmentation techniques to overcome knowledge limitations. This post contains my notes taken during the course and the concepts learned.

Week 1 - Introduction (slides)

Introduction to Transformers architecture

Going through use cases of generative AI with Large Language Models, with examples such as summarisation, translation or information retrieval, and also how those were achieved before Transformers came into play. There’s also an introduction to the Transformer architecture, which is the base component of Large Language Models, and an overview of the inference parameters that one can tune.

Generative AI project life-cycle

The course then introduces the Generative AI project life-cycle, which is followed up to the end of the course.

Figure 1 - Generative AI project life-cycle as presented in the course.

Prompt Engineering and Inference Parameters

In-Context Learning

no prompt engineering: just asking the model to predict the next sequence of words

  "What's the capital of Portugal?"

zero-shot: giving an instruction for a task

  "Classify this review: I loved this movie! Sentiment: "

one-shot: giving an instruction for a task with one example

  "Classify this review: I loved this movie! Sentiment: Positive"

  "Classify this review: I don't like this album! Sentiment: "

few-shot: giving an instruction for a task with a few examples (2~6)

  "Classify this review: I loved this movie! Sentiment: Positive"

  "Classify this review: I don't like this album! Sentiment: Negative"

  ...

  "Classify this review: I don't like this song! Sentiment: "

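The zero-, one- and few-shot prompts above differ only in how many solved examples precede the final, unsolved one, so they can be assembled with a small helper. The function name and the line-break separator between examples are assumptions of this sketch.

```python
def build_prompt(review, examples=()):
    """Prefix the unsolved review with zero or more (text, label) demonstrations."""
    lines = [f"Classify this review: {text} Sentiment: {label}" for text, label in examples]
    # The final line leaves the sentiment empty for the model to complete.
    lines.append(f"Classify this review: {review} Sentiment: ")
    return "\n".join(lines)


zero_shot = build_prompt("I loved this movie!")
one_shot = build_prompt(
    "I don't like this album!",
    examples=[("I loved this movie!", "Positive")],
)
few_shot = build_prompt(
    "I don't like this song!",
    examples=[
        ("I loved this movie!", "Positive"),
        ("I don't like this album!", "Negative"),
    ],
)

print(few_shot)
```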
Inference Parameters

Figure 2 - Parameters affecting how the model selects the next token to generate.

greedy: the word/token with the highest probability is selected.
random(-weighted) sampling: select a token using a random-weighted strategy across the probabilities of all tokens.
top-k: restrict the choice to the k most probable tokens, then select one of them using the random-weighted strategy over their probabilities

Figure 3 - top-k, with k=3

top-p: select an output using the random-weighted strategy among the top-ranked consecutive results by probability whose cumulative probability is <= p

Figure 4 - top-p, with p=0.30.

A higher temperature results in higher randomness. The temperature is applied directly within the softmax function and changes how the probabilities are computed; temperature = 1 leaves the softmax function at its default, meaning an unaltered probability distribution.

see the transformers.GenerationConfig class for the complete details
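The selection strategies and the effect of the temperature can be sketched in plain Python. This follows the descriptions in these notes (for top-p, tokens are kept while the cumulative probability stays at or below p, always keeping at least one); library implementations may differ in such details. All function names and the toy probabilities are assumptions of this example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; the temperature rescales the logits first."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def ranked_tokens(probs):
    """Token indices sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def greedy(probs):
    """Pick the single most probable token."""
    return ranked_tokens(probs)[0]


def top_k(probs, k):
    """Candidate set: the k most probable tokens."""
    return ranked_tokens(probs)[:k]


def top_p(probs, p):
    """Candidate set: top-ranked tokens whose cumulative probability stays <= p."""
    kept, cumulative = [], 0.0
    for index in ranked_tokens(probs):
        # Always keep the most probable token, then stop once p would be exceeded.
        if kept and cumulative + probs[index] > p:
            break
        kept.append(index)
        cumulative += probs[index]
    return kept


def weighted_sample(probs, candidates, rng):
    """Random-weighted choice among the candidate tokens."""
    return rng.choices(candidates, weights=[probs[i] for i in candidates], k=1)[0]


probs = [0.5, 0.25, 0.15, 0.1]
rng = random.Random(0)

print(greedy(probs))       # 0
print(top_k(probs, 3))     # [0, 1, 2]
print(top_p(probs, 0.8))   # [0, 1]
print(weighted_sample(probs, top_k(probs, 3), rng) in top_k(probs, 3))  # True
```

Calling `softmax` on the same logits with a temperature below 1 concentrates the probability mass on the top token, while a temperature above 1 flattens the distribution.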

The lab exercise consists of a dialogue summarisation task using the T5 model from Huggingface, exploring how in-context learning and inference parameters affect the output of the model.

Week 2: Fine-Tuning (slides)

Instruction Fine-Tuning

Instruction fine-tuning trains all of the model’s parameters using examples that demonstrate how it should respond to a specific instruction, e.g.:

[PROMPT]
[1.EXAMPLE TEXT]
[1.EXAMPLE COMPLETION]

[PROMPT]
[2.EXAMPLE TEXT]
[2.EXAMPLE COMPLETION]

...

[PROMPT]
[n.EXAMPLE TEXT]
[n.EXAMPLE COMPLETION]

All of the model’s weights are updated (full fine-tuning); it involves using many prompt-completion examples as the labeled training dataset to continue training the model by updating its weights
Compared to in-context learning, where one only provides prompt-completion examples during inference, here we do it during training
Adapting a foundation model through instruction fine-tuning requires prompt templates and datasets
Compare the LLM completion with the label, use cross-entropy to calculate the loss between the two token distributions, and use the loss to update the model weights using back-propagation
The instruction fine-tuning dataset can include multiple tasks
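The loss computation mentioned above can be illustrated without any deep learning framework: for every position of the completion, the cross-entropy is the negative log-probability that the model assigned to the label token. The three-token vocabulary, the probabilities and the function name are invented for this sketch; the back-propagation step itself is left to a framework.

```python
import math


def cross_entropy(predicted_distributions, label_token_ids):
    """Average negative log-probability assigned to the label token at each position."""
    losses = [
        -math.log(distribution[token_id])
        for distribution, token_id in zip(predicted_distributions, label_token_ids)
    ]
    return sum(losses) / len(losses)


# Model output for a two-token completion over a toy vocabulary of three tokens.
predicted = [
    [0.7, 0.2, 0.1],  # position 0: the label token (id 0) gets probability 0.7
    [0.1, 0.8, 0.1],  # position 1: the label token (id 1) gets probability 0.8
]
labels = [0, 1]

loss = cross_entropy(predicted, labels)
print(round(loss, 4))  # 0.2899
```

The closer the predicted distribution is to putting all its mass on the label tokens, the closer the loss gets to zero.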

Single-Task Fine-Tuning

An application may only need to perform a single task; one can fine-tune a pre-trained model to improve performance on that single task only
Often just 500-1,000 examples can result in good performance; however, this process may lead to a phenomenon called catastrophic forgetting
It happens because the full fine-tuning process modifies the weights of the original LLM
It leads to great performance on the single fine-tuning task, but it can degrade performance on other tasks

Multi-Task Fine-Tuning

Summarize the following text
[1.EXAMPLE TEXT]
[1.EXAMPLE COMPLETION]

Classify the following reviews
[2.EXAMPLE TEXT]
[2.EXAMPLE COMPLETION]

...

Extract the following named-entities
[n.EXAMPLE TEXT]
[n.EXAMPLE COMPLETION]

Requires a lot of data; one may need as many as 50,000-100,000 examples
Fine-tuned Language Net
- FLAN-T5 - fine-tune version of pre-trained T5 model
- Paper: Scaling Instruction-Finetuned Language Models

Model Evaluation

ROUGE

based on $n$-grams
\[\text{ROUGE-Recall} = \frac{\text{n-gram matches}}{\text{n-grams in reference}}\] \[\text{ROUGE-Precision} = \frac{\text{n-gram matches}}{\text{n-grams in output}}\]
ROUGE-L-score: based on the longest common subsequence (LCS) between generated output and reference
\[\text{ROUGE-L-Recall} = \frac{\text{LCS(gen,ref)}}{\text{unigrams in reference}}\] \[\text{ROUGE-L-Precision} = \frac{\text{LCS(gen,ref)}}{\text{unigrams in output}}\]
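A small worked example helps to see how these scores behave. The sketch below uses the usual convention that recall divides by the size of the reference and precision by the size of the generated output; it tokenizes on whitespace only, and the function names and example sentences are chosen for illustration rather than taken from a ROUGE library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(generated, reference, n=1):
    """Return (precision, recall) based on overlapping n-grams."""
    gen, ref = ngrams(generated.split(), n), ngrams(reference.split(), n)
    matches = sum((gen & ref).values())
    return matches / sum(gen.values()), matches / sum(ref.values())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, start=1):
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(generated, reference):
    """Return (precision, recall) based on the longest common subsequence."""
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(gen, ref)
    return lcs / len(gen), lcs / len(ref)


generated = "It is very cold outside"
reference = "It is cold outside"

print(rouge_n(generated, reference, n=1))  # (0.8, 1.0)
print(rouge_n(generated, reference, n=2))  # precision 2/4, recall 2/3
print(rouge_l(generated, reference))       # (0.8, 1.0)
```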

BLEU

focuses on precision: it computes the precision across different $n$-gram sizes and then averages them

Benchmarks

Parameter Efficient Fine-Tuning

During a full-fine tuning of LLMs every model weight is updated during supervised learning, this operation has memory requirements which can be 12-20x the model’s memory:

gradients, forward activations, and temporary memory for the training process.

There are Parameter Efficient Fine-Tuning (PEFT) techniques to train LLMs for specific tasks which don’t require training every weight in the model:

train only a small number of the existing layers
add new trainable layers to the LLM and train only those

Low-Rank Adaptation for Large Language Models (LoRA)

LoRA reduces fine-tuning parameters by freezing the original model’s weights and injecting smaller rank decomposition matrices that match the dimensions of the weights they modify.

During training, the original weights remain static while the two low-rank matrices are updated. For inference, multiplying the two low-rank matrices generates a matrix matching the frozen weights’ dimensions, which then is added to the original weights in the model.

LoRA allows using a single model for different tasks by switching out matrices trained for specific tasks. It avoids storing multiple large versions of the model by employing smaller matrices that can be added to and replace the original weights as needed for various tasks.

Researchers have found that applying LoRA only to the self-attention layers of the model is often enough to fine-tune for a task and achieve performance gains.

The Transformer architecture described in the Attention is All You Need paper, specifies that the transformer weights have dimensions of 512 x 64, meaning each weight matrix has 32,768 trainable parameters.

\[ W= \begin{bmatrix} \ddots & & & & \\ & \ddots & & & \\ & & \ddots & & \\ & & & \ddots & \\ & & & & \ddots\\ \end{bmatrix} \]

Dimensions \[ 512 \times 64 = 32,768 \text{ parameters} \]

Applying LoRA as a fine-tuning method with the $rank = 8$, we train two small rank decomposition matrices A and B, whose small dimension is 8:

\[ A= \begin{bmatrix} \ddots & & \\ & & & \\ & \ddots & \\ & & \ddots \\ & & & \\ \end{bmatrix} \]

\[ B= \begin{bmatrix} \ddots & & \\ & \ddots & \\ \end{bmatrix} \]

\[{512 \times 8 = 4,096 \text{ parameters}}\] \[{8 \times 64 = 512 \text{ parameters}}\] \[A \times B = \Delta W, \text{ with the same } 512 \times 64 \text{ dimensions as } W\]

By updating the weights of these new low-rank matrices instead of the original weights, we train 4,608 parameters instead of 32,768, resulting in an 86% reduction in the number of parameters to train.
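
The arithmetic can be checked with a few lines of Python. The second half is a toy illustration with made-up 3 x 2 weights and rank 1 (not values from the paper), showing that the product of the two small matrices has the same shape as the frozen weights and can be added to them:

```python
d_in, d_out, rank = 512, 64, 8  # dimensions from the example in the text

full_params = d_in * d_out                  # 512 x 64
lora_params = d_in * rank + rank * d_out    # (512 x 8) + (8 x 64)
reduction = 1 - lora_params / full_params

print(full_params, lora_params, f"{reduction:.0%}")


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy example: frozen 3 x 2 weights, trainable rank-1 matrices A (3 x 1) and B (1 x 2).
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A = [[0.1], [0.2], [0.3]]
B = [[1.0, -1.0]]

delta = matmul(A, B)  # same 3 x 2 shape as W
merged = [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
```

Swapping tasks then amounts to replacing `A` and `B` and recomputing `merged`, while `W` itself is never modified.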

Figure 5 - Switching LoRA generated weights for different tasks.

Advantages:

LoRA significantly reduces the number of trainable parameters, so this method of fine-tuning can often be performed on a single GPU.
The rank-decomposition matrices are small, so you can fine-tune a different set for each task and then switch them out at inference time by updating the weights.

Soft Prompts or Prompt Tuning

This technique adds trainable tokens to your prompt and leaves it up to the supervised learning process to determine their optimal values. The set of trainable tokens is called a soft prompt, and it gets prepended to the embedding vectors that represent the input text.

Figure 6 - Soft Prompts.

The soft prompt vectors have the same length as input embedding vectors, and usually somewhere between 20 and 100 virtual tokens can be sufficient for good performance.

Figure 7 - Soft Prompts.

The trainable tokens and the input flow normally through the model, which generates a prediction that is used to calculate a loss. The loss is back-propagated through the model to create gradients, but the original model weights are frozen and only the virtual token embeddings are updated, so that the model learns embeddings for those virtual tokens.
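
A minimal sketch of how the soft prompt is assembled, assuming a tiny embedding size and placeholder input embeddings chosen only for illustration. It shows the prepending step and how few parameters are trainable; it does not perform any training:

```python
import random

embedding_dim = 8         # hypothetical; real models use hundreds or thousands
num_virtual_tokens = 20   # lower end of the 20-100 range mentioned in the text

random.seed(0)

# Trainable soft prompt: one embedding vector per virtual token.
soft_prompt = [
    [random.gauss(0, 0.02) for _ in range(embedding_dim)]
    for _ in range(num_virtual_tokens)
]

# Frozen embeddings of the actual input tokens (placeholder values).
input_embeddings = [[0.0] * embedding_dim for _ in range(5)]

# The soft prompt is prepended to the input embeddings before the first layer.
model_input = soft_prompt + input_embeddings

# Only the soft prompt vectors receive gradient updates.
trainable_params = num_virtual_tokens * embedding_dim
print(len(model_input), trainable_params)
```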

Figure 8 - Switching Soft Prompts embeddings for different tasks.

As with the LoRA method, one can also train soft prompts for different tasks and store them, which takes far fewer resources, and then switch them at inference time to change the LLM’s task.

References

Week 3: Reinforcement Learning From Human Feedback (slides)

The goal of Reinforcement Learning From Human Feedback (RLHF) is to align the model with human values. This is accomplished using a type of machine learning where an agent learns to make decisions related to a specific goal by taking actions in an environment, with the objective of maximising the reward received for actions taken, i.e.: Reinforcement Learning; this is yet another method to fine-tune Large Language Models.

Fine-Tuning with RLHF

Figure 9 - Reinforcement Learning From Human Feedback overview.

Policy: the agent’s policy that guides the actions is the LLM.
Environment: the context window of the model, the space in which text can be entered via a prompt.
Actions: the act of generating text, this could be a single word, a sentence, or a longer form text, depending on the task specified by the user.
Action Space: the token vocabulary, meaning all the possible tokens that the model can choose from to generate the completion.
State: the state that the model considers before taking an action is the current context, i.e.: any text currently contained in the context window.
Objective: to generate text that is perceived as being aligned with the human preferences, i.e.: helpful, accurate, and non-toxic.
Reward: assigned based on how closely the completions align with the goal, i.e. human preferences.

Reward Model

To determine the reward a human can evaluate the completions of the model against some alignment metric, such as determining whether the generated text is toxic or non-toxic. This feedback can be represented as a scalar value, either a zero or a one.

The LLM weights are then updated iteratively to maximize the reward obtained from the human classifier, enabling the model to generate non-toxic completions.

However, obtaining human feedback can be time consuming and expensive. A scalable alternative is to use an additional model, known as the reward model, to classify the outputs of the LLM and evaluate the degree of alignment with human preferences.

Train a reward model to assess how well aligned the LLM output is with human preferences. Once trained, it’s used to update the weights of the LLM and train a new human-aligned version. Exactly how the weights get updated as the model completions are assessed depends on the algorithm used to optimize the policy.

Collect Data and Training a Reward Model

Figure 10 - Preparing labels for training.

select a model which has some capability for the task you are interested in
LLM + prompt dataset = produce a set of completions
collect human feedback from the produced completions

Figure 11 - Reward model losses based on prompt completion.

humans rank completions to prompts for a task
rankings are converted into pairwise comparisons for supervised learning
ranking gives more training data to train the reward model than, for instance, a thumbs up/down approach
use the reward model as a binary classifier
the reward model can itself be a language model, such as BERT
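
The conversion from a ranking to pairwise training examples can be sketched as follows. The function name and the convention that the list is ordered from most to least preferred are assumptions made for this illustration:

```python
from itertools import combinations


def ranking_to_pairs(ranked_completions):
    """Turn a ranking (most preferred first) into (preferred, rejected) pairs."""
    # combinations() keeps the input order, so the first element of each
    # pair is always ranked higher than the second.
    return list(combinations(ranked_completions, 2))


pairs = ranking_to_pairs(["completion A", "completion B", "completion C"])
for preferred, rejected in pairs:
    print(preferred, ">", rejected)
```

A ranking of n completions yields n(n-1)/2 pairs, which is why ranking produces more training data than a single thumbs up/down label per completion.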

Fine-Tuning With Reinforcement Learning

Reward Model

Figure 12 - Training a reward model overview I.

1 - pass prompt $P$ to an instruct LLM and get the output $X$

2 - pass the pair $(P, X)$ to the reward model and get the reward score

3 - pass the reward value to the RL algorithm to update the weights of the LLM

Figure 13 - Training a reward model overview II.

this is repeated, and the LLM should converge to a human-aligned LLM, with the reward improving after each iteration
stop when some defined threshold value for helpfulness is reached, or after a fixed number $n$ of steps

Reinforcement Learning Algorithm

Proximal Policy Optimization (PPO) makes updates to the LLM. The updates are small and within a bounded region, resulting in an updated LLM that is close to the previous version. The loss of this algorithm is made up of three different losses. The full details of this algorithm are complex and out of scope for my notes.

Figure 14 - PPO Loss.

Figure 15 - Value Loss.

Figure 16 - Policy Loss.

Figure 17 - Entropy Loss.

Reward Hacking

As the policy seeks to maximize rewards, it may result in the model generating exaggeratedly positive language or nonsensical text to achieve low toxicity scores. Such outputs (e.g.: most awesome, most incredible) are not particularly useful.

To prevent reward hacking, use the initial LLM as a benchmark, called the reference model. Its weights stay fixed during RLHF iterations. Each prompt is run through both models, generating responses. At this point, you can compare the two completions and calculate the Kullback-Leibler (KL) divergence to determine how much the updated model has diverged from the reference.

Figure 18 - How to avoid reward hacking.

KL divergence is computed for every token in the entire vocabulary of the LLM, which can reach tens or hundreds of thousands. After calculating the KL divergence between the models, it’s added to the reward calculation as a penalty. This penalizes the RL updated model for deviating too much from the reference LLM and producing distinct completions.
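
A minimal sketch of the penalty, assuming a toy three-token vocabulary, made-up distributions, and a hypothetical scaling coefficient `beta` (the notes do not specify how the penalty is weighted):

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_reward(reward, updated_dist, reference_dist, beta=0.1):
    """Subtract a KL penalty from the reward model score (beta is an assumption)."""
    return reward - beta * kl_divergence(updated_dist, reference_dist)


# Next-token distributions of the frozen reference model and the RL-updated model.
reference = [0.5, 0.3, 0.2]
updated = [0.7, 0.2, 0.1]

kl = kl_divergence(updated, reference)
print(round(kl, 4), round(penalized_reward(1.0, updated, reference), 4))
```

The further the updated model drifts from the reference distribution, the larger the KL term and the smaller the effective reward.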

NOTE: you can benefit from combining RLHF with PEFT. In this case, you only update the weights of a PEFT adapter, not the full weights of the LLM. This means that you can reuse the same underlying LLM for both the reference model and the PPO model, which you update with the trained PEFT parameters. This reduces the memory footprint during training by approximately half.

Scaling Human Feedback

Scaling reinforcement learning fine-tuning via reward models demands substantial human effort to create labeled datasets, involving numerous evaluators and significant resources.

This labor-intensive process becomes a bottleneck as model numbers and applications grow, making human input a limited resource.

Constitutional AI offers a strategy for scaling through model self-supervision, presenting a potential remedy to the limitations imposed by human involvement in creating labeled datasets for RLHF fine-tuning.

Figure 19 - Constitutional AI supervised learning I.

The process involves supervised learning, where red teaming prompts aim to detect potentially harmful responses. The model then evaluates its own harmful outputs based on constitutional principles, subsequently revising them to align with these rules.

Figure 20 - Constitutional AI supervised learning II.

Then we ask the model to write a new response that removes all of the harmful or illegal content. The model generates a new answer that puts the constitutional principles into practice. The original red team prompt, and this final constitutional response can then be used as training data. The model undergoes fine-tuning using pairs of red team prompts and the revised constitutional responses.

Figure 21 - Constitutional AI supervised learning III.

Figure 22 - Constitutional AI overview.

Check the paper: Constitutional AI: Harmlessness from AI Feedback

Large Language Models Distillation

The distillation process trains a second, smaller model to use during inference. In practice, distillation is not as effective for generative decoder models. It’s typically more effective for encoder only models.

Figure 23 - LLMs Distillation I.

Freeze the teacher model’s weights and use it to generate completions for your training data. At the same time, you generate completions for the training data using your student model.
The knowledge distillation between teacher and student model is achieved by minimizing a loss function called the distillation loss.
To calculate this loss, distillation uses the probability distribution over tokens that is produced by the teacher model’s softmax layer.

Now, the teacher model is already fine tuned on the training data. So the probability distribution likely closely matches the ground truth data and won’t have much variation in tokens.
Distillation applies a little trick: it adds a temperature parameter to the softmax function. A temperature greater than one increases the creativity of the language the model generates, because the probability distribution becomes broader and less strongly peaked.
This softer distribution provides you with a set of tokens that are similar to the ground truth tokens.
In the context of Distillation, the teacher model’s output is often referred to as soft labels and the student model’s predictions as soft predictions.
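
The effect of the temperature can be seen in a small sketch. The logits and the temperature value of 2.0 are made up for illustration:

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values flatten the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]

standard = softmax(logits)                   # temperature = 1
softened = softmax(logits, temperature=2.0)  # broader, less peaked

print([round(p, 3) for p in standard])
print([round(p, 3) for p in softened])
```

With the higher temperature, probability mass moves from the top token to the alternatives, which is what gives the student more information than a single correct token would.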

Figure 24 - LLMs Distillation II.

In parallel, you train the student model to generate the correct predictions based on your ground truth training data. Here, you don’t vary the temperature setting and instead use the standard softmax function.
Distillation refers to the student model outputs as the hard predictions and to the ground truth as the hard labels. The loss between these two is the student loss.
The combined distillation and student losses are used to update the weights of the student model via back propagation.
The key benefit of distillation methods is that the smaller student model can be used for inference in deployment instead of the teacher model.

Generative AI Project Lifecycle Cheat Sheet

Figure 25 - Generative AI Project Lifecycle.

References

Week 1
Week 2
Week 3
All the figures are taken from the slides of the course.

Support and Opposition Relationships in Political News Headlines

Thu, 14 Sep 2023 02:00:00 +0200

I was awarded 2nd place in the Arquivo.pt Awards 2021 for the Politiquices project. The project aimed at extracting supportive and opposing relationships between political personalities from news headlines archived by Arquivo.PT, and associating the personalities with their identifier on Wikidata, thus resulting in a semantic graph. I recently published the results of this project in Portuguese in Linguamatica v. 15 n. 1. The content of this blog post is the same as in the paper but translated to English.

The datasets resulting from this work are publicly available

@article{Soares Batista_2023, 
  title={Extracção de Relações de Apoio e Oposição em Títulos de Notícias de Política em Português}, 
  volume={15}, 
  url={https://linguamatica.com/index.php/linguamatica/article/view/386}, 
  DOI={10.21814/lm.15.1.386},
  number={1}, 
  journal = {Linguam{\'a}tica},
  author={Soares Batista, David}, 
  year={2023}, 
  month={Jul.}, 
  pages={91-101}
}

Introduction

News headlines related to politics or politicians often report interactions involving two or more political personalities. Many of these interactions correspond to relationships of support or opposition from one personality to another, for example:

“Marques Mendes criticises Rui Rio's strategy”
“Catarina Martins calls for the resignation of Governor Carlos Costa”
“Sócrates went to the grassroots to call for a vote for Soares”

Analysing a large number of these types of relationships over time allows for various studies, for example: finding out which are the major communities of support or opposition depending on the governments in power, or finding the major alliances and oppositions and their dynamics.

You can also explore an individual personality over time, for example by comparing the relationships of support or opposition before taking office in a particular public position with the relationships after taking office, or to see which relationships of support have suddenly emerged.

A database collating news stories expressing relationships of support or opposition between political personalities can be used to quickly assemble a collection of news stories containing or involving specific personalities and political parties, for example, to assist in an investigative journalism task.

Having an automatic method for extracting relationships and being able to apply it to a collection of data covering long periods of time would make it possible to realise the examples described above.

In this paper we present a method for extracting relationships of support or opposition between political personalities and describe the results of applying it to a news collection covering a period of around 25 years.

During the relationship extraction process, we linked the political personalities involved with their Wikidata identifier (Malyshev et al., 2018) ¹, thus enriching the relationship with information associated with the personality (e.g. political affiliation, public offices held, legislatures, family relationships, etc.).

All the relationships extracted are represented in the form of semantic triples following the Resource Description Framework (RDF) standard (Schreiber & Raimond, 2014) ². The political personalities involved, represented by their Wikidata identifier, are linked through a relationship of opposition or support represented by the news item that supports the relationship. This structure thus gives rise to a semantic graph, making it possible to formulate SPARQL queries (Prud’hommeaux et al., 2013) ³ involving the Wikidata information associated with each personality and the relationships extracted from the news headlines, for example:

List all the news items where personality X opposes personality Y
List the members of a given party who supported a specific personality
List the members of a particular party supported/opposed by members of another party
List personalities who are linked through a family relationship and an opposition/support relationship
List personalities who are part of the same government and are involved in an opposition/support relationship

The main contributions of this work are:

a semantic graph linking political personalities represented on Wikidata through an opposition or support relationship supported by a news item
an annotated dataset used to train classifiers for extracting sentiment-driven relations from news headlines, and also to link the personalities mentioned to Wikidata
a web interface to explore the semantic graph

This article is organised as follows:

in Section 2 we refer to related work
in Section 3 we describe the knowledge base used to support the linking of personalities to Wikidata
in Section 4 we describe the news sources used
in Section 5 we detail the annotated dataset
in Section 6 we present the supervised learning classifiers developed
in Section 7 we describe the RDF triple extraction process and the construction of the semantic graph
in Section 8 we summarise the conclusions of this work and present some ideas for future work

Related Work

Sentiment analysis, in the context of Natural Language Processing, has mostly been studied in content generated on social networks (Zimbra et al., 2018) ⁴ or in the evaluation of products or services (Pontiki et al., 2016) ⁵. In these areas, the author of the text and the target of the opinion are explicit. In the context of analysing political news, where there is often sentiment expressed between political actors in the form of support or opposition relationships (Balahur et al., 2009 ⁶, 2010 ⁷), sentiment analysis approaches to products or services do not apply, as the direction of the sentiment relationship has to be considered.

In this section we describe resources similar to those we have produced in this work, which we have made public, and approaches to the task of extracting targeted sentiment in political news text.

Resources and annotated datasets

Sarmento et al., (2009) ⁸ propose a method for the automatic creation of a corpus for the detection of positive or negative sentiment towards a political personality, and apply the method to comments on online newspaper reports. In this resource, the source of the sentiment is assumed to be the commentator.

Moreira et al., (2013) ⁹ provide an ontology describing political actors, their positions and affiliated political parties, using official sources of information and information gathered from the web to add alternative names to the personalities present in the ontology.

de Arruda et al. (2015) ¹⁰ created a corpus of political news in Brazilian Portuguese, annotating each paragraph with the sentiment according to two dimensions: the political actor referred to by the paragraph, and the sentiment of that reference: positive, negative or neutral. The origin of the sentiment is left open in this resource. Baraniak & Sydow, (2021) ¹¹ provide similar corpora, annotating the sentiment towards a political personality in online newspaper texts, for English and Polish.

Extracting Targeted Sentiment from News Text

Several authors have explored methods for extracting sentiment involving political actors. It should be noted that many of the works transform the task of detecting sentiment into a task of detecting a relationship between mentioned entities (Bassignana & Plank, 2022) ¹².

Some explore these relationships in an international political context, i.e.: the actors are nations mentioned in political news text, and some of these relationships implicitly have a positive or negative sentiment. O’Connor et al. (2013) ¹³ propose an unsupervised model based on topic models and linguistic patterns to identify relationships, in an open-ended way, describing conflicts between nations referenced in English news articles.

Han et al. (2019) ¹⁴ also propose an unsupervised model to generate relationship descriptors for pairs of nations mentioned in English news articles. The proposed model extends the work of Iyyer et al. (2016) ¹⁵ by integrating linguistic information (i.e.: verbal predicates and common and proper nouns) in order to identify the context of the relations.

Liang et al. (2019) ¹⁶ define the task of extracting blame relations for English texts: given an article $d$ and a set of entities $E$ present in the article, detect whether there is a blame relation $(s,t)$, where $s,t \in E$ and $s$ blames $t$ based on the article $d$; there are $\lvert{E}\rvert \cdot (\lvert{E}\rvert - 1)$ possible blame relations. To detect these relationships, the authors propose 3 models. The Entity Prior model extracts information about entities, trying to capture a prior about who is likely to blame whom without additional information. The Context model makes use of the context information of the sentence where two entities occur to determine the presence of a blame relationship. The Combined model combines the information from the two previous models into a single model. The authors applied this approach to a corpus with 998 news articles and about 3 entities per article, reporting a macro-average F₁ of 0.70 with the Combined model.
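
The count of candidate relations follows from enumerating all ordered pairs of distinct entities. A small illustration with three hypothetical entity names:

```python
from itertools import permutations

entities = ["A", "B", "C"]  # hypothetical entities found in one article

# Ordered (source, target) pairs: (s, t) and (t, s) are different candidates.
candidate_pairs = list(permutations(entities, 2))

print(len(candidate_pairs))  # |E| * (|E| - 1)
print(candidate_pairs)
```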

Park et al. (2021) ¹⁷ propose a structure of relations to detect sentiment and direction: given a sentence $s$ referring to two entities $p$ and $q$, detect which sentiment relation holds between $p$ and $q$ out of five possible ones: neutral, $p$ has a positive or negative opinion of $q$, or $q$ has a positive or negative opinion of $p$. In their work, the authors use multiple models by transforming the sentiment extraction task into sub-tasks that answer yes/no questions for each of the 5 possible sentiments, then combining the various results into a final result. This approach is applied to English in a corpus created by the authors containing sentences from news articles containing at least two entities. The pairs of entities are annotated with one of the 5 possible sentiments. The authors report a macro-average F₁ of 0.68.

Knowledge Base Construction

Given that the personalities involved in the relationships to be extracted are relevant political personalities, we started by building a knowledge base from Wikidata (Malyshev et al., 2018) ¹.

By making SPARQL queries to the public endpoint we collected the identifier of all:

people who are or have been affiliated with a Portuguese political party
Portuguese people born after 1935 whose profession is:
- judge
- economist
- lawyer
- civil servant
- politician
- businessman
- banker
people who hold or have held at least one office from a list of previously selected Portuguese public offices (e.g.: minister, party leader, ambassador, etc.)

In addition to the results of these queries, we manually selected some identifiers of personalities not covered by the SPARQL queries defined above, many of them from an international political context, but who interact with Portuguese personalities.

We also added all the identifiers of political parties to which the personalities collected are affiliated. This process resulted in a total of 1,757 personalities and 37 political parties. It should be noted that some of the parties included are now defunct and/or from an international context.

For each of the identifiers of the personalities and parties, we downloaded the corresponding page from Wikidata using another public endpoint. For each political figure we selected: their Wikidata identifier, their most common name and alternative names, i.e. combinations of first names and surnames.

Based on these three fields, we created an index in ElasticSearch (Gormley & Tong, 2015) ¹⁸ using its default configuration, not making use of any extra functionality such as $n$-gram parsers.

Data Sources

The main source of news was the Portuguese web archive (Gomes et al., 2013) ¹⁹. Using the public search API we collected archived pages, restricting the results to occurrences of names gathered in Section 3 and 45 .pt domains associated with various sources of information such as:

online newspapers
television websites
radio station websites
content aggregator portals

A second news source was the CHAVE collection (Santos & Rocha, 2004 ²⁰, 2001²¹), containing articles from the newspaper PÚBLICO published between 1994 and 1995. Finally, some articles not archived by arquivo.pt were also added, taken directly from the World, Politics and Society sections of the publico.pt website.

This process resulted in a collection of around 13.7 million article titles published between 1994 and 2022. Pre-processing was then applied in order to remove news items with: duplicate titles, titles with less than 4 words, and titles or URLs containing words that are part of a pre-defined list (e.g.: sports, celebrities, arts, cinema, etc.) that suggest a context other than politics. This pre-processing resulted in 1.3 million different titles, around 10 per cent of the data initially collected.
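
The three filters can be sketched as a single pass over the collected items. This is an illustration only: the function name, the example titles and URLs, the blocked words, and the use of simple substring matching are all assumptions, not the project's actual code:

```python
def filter_titles(items, blocked_words, min_words=4):
    """Keep (title, url) pairs that pass the three pre-processing filters."""
    seen = set()
    kept = []
    for title, url in items:
        normalised = " ".join(title.lower().split())
        if normalised in seen:
            continue  # duplicate title
        if len(normalised.split()) < min_words:
            continue  # title too short
        text = f"{normalised} {url.lower()}"
        if any(word in text for word in blocked_words):
            continue  # title or URL suggests a non-political context
        seen.add(normalised)
        kept.append((title, url))
    return kept


# Hypothetical examples; the URLs are placeholders.
items = [
    ("Marques Mendes critica estratégia de Rui Rio", "https://example.pt/politica/1"),
    ("Marques Mendes critica estratégia de Rui Rio", "https://example.pt/politica/2"),
    ("Costa responde", "https://example.pt/politica/3"),
    ("Treinador elogia a equipa após vitória", "https://example.pt/desporto/4"),
]

kept = filter_titles(items, blocked_words=["desporto", "cinema"])
print(kept)
```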

Dataset of Support and Opposition Political Relationships

In order to be able to train supervised learning classifiers to identify the relationships present in the news headlines, and to link the personalities with Wikidata, we manually annotated headlines with: the mentions of personalities, the identifiers in Wikidata and the relationship between the personalities mentioned.

We began by pre-processing all the headlines collected using the spaCy 3.0 software package ²², and the pt_core_news_lg-3.0.0 model to recognise named-entities of the PERSON type. For each recognised entity we tried to find its corresponding identifier in Wikidata by querying the index described in Section 3 and assuming that in the list of results the first is the correct identifier associated with the entity. We then selected the titles for annotation, including only titles referring to at least two personalities.

In the annotation process all the titles were loaded into the Argilla annotation tool, and using the graphical interface we selected titles to annotate.

For each title, we corrected the recognised entities and their Wikidata identifiers where necessary. We annotated the existing relationship: opposition or support, and its direction. When neither is the case, the relationship is noted as other.

Table 1 shows some examples of the annotated relationships. The annotation process was carried out by one single annotator. In the most ambiguous situations, for example, where the full information in the news text is needed to decide, the relationships have been annotated as other.

	Headline	Relationship
	Sá Fernandes accuses António Costa of defending corporate interests	Ent1-opposes-Ent2
	Joana Mortágua: statements by Cavaco are “a series of nonsense”	Ent1-opposes-Ent2
	Passos Coelho is accused of political immaturity by Santos Silva	Ent2-opposes-Ent1
	Durão Barroso supports Paulo Portas as an “excellent minister”	Ent1-supports-Ent2
	Armando Vara chosen by Guterres to coordinate local elections	Ent2-supports-Ent1
	Manuel Alegre receives support from Jorge Sampaio	Ent2-supports-Ent1
	Rui Tavares and Ana Drago elected in the LIVRE primaries	other
	Teresa Zambujo acknowledges Isaltino Morais’ victory	other
	CDS accuses Marcelo Rebelo de Sousa of jeopardising the relationship with Cavaco	other

Table 1: Examples of headlines and the corresponding manually annotated relationships.

This process resulted in a dataset containing 3,324 annotated titles. For each title we annotated only two personalities and the relationship between them, even if the titles contain references to more than two personalities.

Table 2 characterises the data in terms of number of relationships and direction. Most titles contain an opposition or other relationship, and the vast majority of relationships have a direction from the first to the second entity, Ent1 → Ent2.

Relationship	Ent1 → Ent2	Ent1 ← Ent2	Total
opposes	1,155	102	1,257
supports	717	44	761
other	-	-	1,306
Total	1,872	146	3,324

Table 2: Relationships by class and direction.

The ratio of oppositional relationships to supportive relationships is 1.6. This value is similar to the data for English provided by Park et al. (2021) ¹⁷, where this same ratio between the two classes is 1.8. In terms of class representativeness, aggregated by sentiment, the two datasets are also similar, with other being the most present class, followed by opposition and lastly support.

Figure 1 - Frequency distribution of occurrences of the personalities in the annotated titles.

Of the 6,648 mentions of names of political personalities annotated, 515 are distinct and have an identifier on Wikidata. A total of 129 distinct entities, identified by aggregating the string that mentions them in the title, are not associated with an identifier because they are not present in Wikidata.

Analysing the frequency of occurrence of each entity shows that a small number of entities are responsible for a large proportion of all entity occurrences in the annotated data. As shown in Figure 1, there is a small number of frequent entities and a long list of infrequent ones: specifically, 96 distinct personalities, i.e. 19% of the personalities, are responsible for 80% of the mentions of personalities in the data. In terms of the number of words contained in the titles, excluding words that are part of the entities, there is a median of 8 words with a maximum of 22 and a minimum of 1.

This annotated dataset is available online in JSON format, as illustrated below in Figure 2.

{"title": "Ana Gomes defende Durão Barroso",
   "label": "ent1_supports_ent2",
   "date": "2002-05-11 08:26:00",
   "url": "http://www.publico.pt/141932",
   "ent1": "Ana Gomes",
   "ent2": "Durão Barroso",
   "ent1_id": "Q2844986",
   "ent2_id": "Q15849"}

Figure 2: Example of one annotated sample in JSON.
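
A record in this format can be read with the standard `json` module. The sketch below assumes that the remaining labels follow the same `entX_relation_entY` pattern as the sample, plus a plain `other` label; the sample alone does not confirm this. The `parse_label` helper is hypothetical:

```python
import json

record = json.loads("""
{"title": "Ana Gomes defende Durão Barroso",
 "label": "ent1_supports_ent2",
 "ent1": "Ana Gomes",
 "ent2": "Durão Barroso"}
""")


def parse_label(label):
    """Split a label such as 'ent1_supports_ent2' into (source, relation, target)."""
    if label == "other":  # assumed label for the 'other' class
        return None
    source, relation, target = label.split("_")
    return source, relation, target


source, relation, target = parse_label(record["label"])
print(record[source], relation, record[target])
```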

Relationship Extraction Process

The process of extracting RDF triples from news headlines involves 4 sub-processes:

recognising entity mentions of type PERSON
linking entities with an identifier in Wikidata
classifying the type of relationship
classifying the direction of the relationship

Named-Entity Recognition

The recognition of entities mentioned is based on a hybrid method, combining rules with a supervised model.

Using the EntityRuler component of spaCy 3.0, we define a series of rules combining patterns based on the names of all the personalities from the knowledge base described in Section 3.

To detect entities of type PERSON this classifier first applies the rules and then the supervised model for Portuguese. In situations of disagreement between the two approaches, the entities marked with rules are prioritised. Table 3 shows the performance for the 3 approaches on the annotated dataset.

Approach	Precision	Recall	F₁
Rules	0.99	0.42	0.59
Model	0.97	0.91	0.94
Rules+Model	0.97	0.92	0.94

Table 3: Precision, Recall and F₁ for the NER component combining rules and a supervised model.

Entity Linking over Wikidata

The algorithm for associating personalities with identifiers on Wikidata has two phases. In the first phase, the algorithm only tries to use the title of the news item; if this process fails, it then tries to use possible references to the personalities in the text of the news item.

The algorithm first queries the knowledge base (KB) using the reference to the personality in the headline, thus generating a list of candidates for a given personality. If the list contains only one candidate and its Jaro (1989) ²³ similarity to the personality mentioned in the headline is at least 0.8, that candidate is selected. If there is more than one candidate, the algorithm keeps only those with a similarity of 1.0, and if exactly one remains, that candidate is selected. In any other case, no candidate is returned.

Algorithm 1 describes the procedure that uses only the headline.

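def title_only(ent, candidates):
    # Single candidate: accept it if it is similar enough to the mention.
    if len(candidates) == 1:
        if jaro(ent, candidates[0]) >= 0.8:
            return candidates[0]
    else:
        # Several candidates: keep only the exact matches (similarity 1.0).
        filtered = exact(ent, candidates)
        if len(filtered) == 1:
            return filtered[0]
    return None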
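
The `jaro` and `exact` helpers used in this procedure are not shown in the post. The sketch below is one possible implementation, written under the assumption that `jaro` is the standard Jaro similarity between two strings and that `exact` keeps the candidates whose similarity to the mention is 1.0; the original code may differ.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def exact(ent, candidates):
    """Keep only the candidates with a similarity of 1.0 to the mention."""
    return [c for c in candidates if jaro(ent, c) == 1.0]

print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```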

If no candidates are generated in the first phase or none are selected from the list of candidates, the algorithm tries to expand the entities mentioned in the headline based on the news text, exploiting a pattern: a personality mentioned in the headline by a short version of their name (e.g. just their surname) is usually referred to in the news text by a fuller name.

The algorithm identifies all the people mentioned in the news text, using the component described in Section 6.1, and selects only those that have at least one name in common with the name of the personality mentioned in the headline, thus generating an expanded entity, and assuming that it corresponds to the same entity mentioned in the headline.
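
The expansion step can be sketched as a simple token-overlap filter. The function name `expand_entity` and the example names are assumptions made for illustration; the list of persons would come from running the NER component over the article text.

```python
def expand_entity(mention, persons_in_text):
    """Return the persons from the article text that share at least one
    name token with the headline mention (case-insensitive)."""
    mention_tokens = {t.lower() for t in mention.split()}
    expanded = []
    for person in persons_in_text:
        person_tokens = {t.lower() for t in person.split()}
        if mention_tokens & person_tokens and person not in expanded:
            expanded.append(person)
    return expanded

persons = ["Aníbal Cavaco Silva", "Durão Barroso", "Cavaco Silva"]
print(expand_entity("Cavaco", persons))
# ['Aníbal Cavaco Silva', 'Cavaco Silva']
```

Note that with this simple rule, connectives such as "da" or "de" also count as shared tokens; a real implementation may want to ignore them.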

If the process results in only one expanded entity and there is a similarity of 1.0 with one of the candidates previously selected from the KB, that candidate is chosen. Otherwise, the expanded entity is used to query the KB and collect a new list of candidates. If there is only one candidate on this list and its similarity is at least 0.8 to the expanded entity, that candidate is chosen. If there is more than one candidate and only one has a similarity of 1.0 to the expanded entity, that one is chosen.

def article_text(expanded, candidates):
    if len(expanded) == 1:
        # Exact match against the candidates already retrieved from the KB.
        filtered = exact(expanded[0], candidates)
        if len(filtered) == 1:
            return filtered[0]

        # Otherwise query the KB again with the expanded entity.
        x_candidates = get_candidates(expanded[0])
        if len(x_candidates) == 1:
            if jaro(expanded[0], x_candidates[0]) >= 0.8:
                return x_candidates[0]

        filtered = exact(expanded[0], x_candidates)
        if len(filtered) == 1:
            return filtered[0]

    if len(expanded) > 1:
        # Several expanded entities: collect the exact matches for each one.
        filtered = []
        for e in expanded:
            filtered.extend(exact(e, candidates))
        if len(filtered) == 1:
            return filtered[0]

    return None

If the expansion process results in several expanded entities, we keep only the KB candidates that have a similarity of 1.0 to one of the expanded entities, and if there is only one, that candidate is chosen. In any other case not described here, no candidate is selected.

Algorithm 2 describes this procedure using the text of the news item.

The results of this approach on the annotated dataset are described in Table 4. The incorrect class corresponds to personalities that were linked to the wrong identifier in Wikidata; not disambiguated covers those for which the algorithm was unable to select a unique identifier among the candidates, or for which the KB did not return any results.

Table 4 reports two evaluations. The first column describes the results for the base algorithm, without mappings. The second column uses mappings that resolve the ambiguity a reference may have in terms of the personalities it represents. For example, in the annotated data, all mentions of Cavaco correspond to the personality Cavaco Silva, so the algorithm maps all references to Cavaco to Cavaco Silva. Similarly, all mentions of Marques Mendes correspond to the personality Luís Marques Mendes. Using these mappings reduces the number of entities for which the algorithm cannot find an identifier.

Classification	Base	Mappings
correct	5,059	5,136
incorrect	43	43
not disambiguated	246	169
Accuracy	0.93	0.96

Table 4: Accuracy results for the linking approach.
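
The mappings described above amount to a lookup applied to a mention before the knowledge base is queried. A minimal sketch, where the dictionary holds only the two examples given in the text and the names `MAPPINGS` and `normalise_mention` are illustrative:

```python
# Only the two mappings mentioned in the text; the full list is not shown.
MAPPINGS = {
    "Cavaco": "Cavaco Silva",
    "Marques Mendes": "Luís Marques Mendes",
}

def normalise_mention(mention, mappings=MAPPINGS):
    """Replace a known ambiguous mention by its canonical name;
    leave any other mention unchanged."""
    return mappings.get(mention, mention)

print(normalise_mention("Cavaco"))     # Cavaco Silva
print(normalise_mention("Ana Gomes"))  # Ana Gomes
```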

Relationship Type Classifier

We chose to break the relationship classification down into two tasks, classifying the type of the relationship and classifying its direction, rather than developing a single classifier that would have to distinguish between 5 highly unbalanced classes.

This section describes the classifier developed to detect the type of relationship present in a title, with 3 possible classes: opposes, supports and other. All the experiments were carried out with 4-fold cross-validation.

We evaluated different approaches for the supervised classification of the relationships present in the titles, namely:

an SVM classifier (Cortes & Vapnik, 1995) ²⁴ with a linear kernel
a recurrent neural network, an LSTM (Hochreiter & Schmidhuber, 1997) ²⁵
a Transformer neural network, DistilBERT (Sanh et al., 2019) ²⁶

For the SVM classifier we used as features an approach based on TF-IDF vectors (Salton & Buckley, 1988) ²⁷, pre-processing the title using a pattern in order to identify the relevant context, i.e. the context in the title that contains information describing the relationship:

<Ent1 X Ent2 context>

where X = {“says to”, “responds to”, “suggests to”, “says that”, “claims that”, “hopes that”, “argues that”, “considers that”, “suggests that”, “wonders if”, “considers”, “commands”}.

Whenever the pattern doesn’t hold, we use all the words in the title to build the vector, except for the names of the personalities.
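
The pre-processing step can be sketched with a regular expression. This is an illustration only: it uses the English renderings of the trigger expressions listed above (the original system works on Portuguese titles), and the function name `relevant_context` is an assumption.

```python
import re

# English renderings of the expressions in X; the real list is in Portuguese.
TRIGGERS = ["says to", "responds to", "suggests to", "says that",
            "claims that", "hopes that", "argues that", "considers that",
            "suggests that", "wonders if", "considers", "commands"]

def relevant_context(title, ent1, ent2):
    """Return the text after '<Ent1> X <Ent2>' when the pattern holds,
    otherwise all the words of the title except the entity names."""
    # Longest triggers first, so "considers that" is tried before "considers".
    alternatives = "|".join(
        re.escape(t) for t in sorted(TRIGGERS, key=len, reverse=True))
    pattern = (rf"^{re.escape(ent1)}\s+(?:{alternatives})\s+"
               rf"{re.escape(ent2)}\s+(.*)$")
    match = re.match(pattern, title)
    if match:
        return match.group(1)
    return " ".join(title.replace(ent1, " ").replace(ent2, " ").split())

print(relevant_context("José Lello says that Nogueira Leite wants to leave",
                       "José Lello", "Nogueira Leite"))  # wants to leave
print(relevant_context("Ana Gomes defende Durão Barroso",
                       "Ana Gomes", "Durão Barroso"))    # defende
```

The returned string is what would then be passed to the TF-IDF vectoriser.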

The LSTM recurrent neural network was used in a bidirectional architecture, i.e. two LSTM networks are used, both with a dimension of 128, one reading the title from the first to the last word and the other from the last to the first word, and the two final states of each LSTM are concatenated and passed to a linear layer. We used pre-trained embeddings for Portuguese based on the FastText method (skip-gram) of dimension 50 (Hartmann et al., 2017) ²⁸. The network was trained for 5 epochs with a batch size of 8.

The DistilBERT model was trained on the basis of a pre-trained model for Portuguese (Abdaoui et al., 2020) ²⁹ and then fine-tuned on the annotated dataset, i.e.: the weights of all the pre-trained layers were updated taking into account the task of classifying the relation. The network was trained for 5 epochs with a batch size of 8.

Relationship	Precision	Recall	F1
opposes	0.71	0.69	0.70
supports	0.65	0.69	0.67
other	0.69	0.69	0.69
macro avg.	0.69	0.69	0.69

a): SVM with a linear kernel.

Relationship	Precision	Recall	F1
opposes	0.75	0.64	0.69
supports	0.65	0.62	0.63
other	0.65	0.75	0.70
macro avg.	0.69	0.68	0.68

b): bi-directional LSTM with Portuguese embeddings.

Relationship	Precision	Recall	F1
opposes	0.74	0.76	0.75
supports	0.72	0.71	0.71
other	0.72	0.71	0.72
macro avg.	0.73	0.72	0.72

c): DistilBERT pre-trained on Portuguese corpora.

Table 5: Precision, Recall and F1 for the different classifiers, evaluated with 4-fold cross-validation.
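
For reference, the per-class scores and the macro average reported in these tables can be computed as in the sketch below. The function names and the toy labels are illustrative; the macro average is the unweighted mean of the per-class values.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} from gold and predicted labels."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores

def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))

# Toy example, not the real predictions.
y_true = ["opposes", "opposes", "supports", "other"]
y_pred = ["opposes", "supports", "supports", "other"]
scores = per_class_scores(y_true, y_pred, ["opposes", "supports", "other"])
print(scores["opposes"])
print(macro_average(scores))
```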

Table 5 describes the results for the various classifiers. There are no marked differences in performance between the 3 classifiers, although the approach using DistilBERT achieved the best results. When analysing the results, we noticed that there are relations that are difficult to classify correctly, particularly those containing idiomatic expressions, for example:

José Lello says that Nogueira Leite wants to “abifar uns tachos”
Louçã says that António Borges is Passos Coelho’s “talking cricket”

Other relationships are ambiguous and difficult to categorise without any other context than the one in the title. In the dataset we have made public, all the headlines contain a URL to the text of the news item.

The results obtained with the approaches described, for Portuguese data, are in line with the results previously reported on English data (Liang et al., 2019) ¹⁶ (Park et al., 2021) ¹⁷.

Relationship Direction Classifier

The direction classifier has 2 possible classes. As shown in Table 1, the dataset has a bias towards the Ent₁ → Ent₂ class, which represents 91.5% of the data. We therefore chose to develop a rule-based approach that detects only the Ent₁ ← Ent₂ class; whenever none of the rules match, the classifier assigns the Ent₁ → Ent₂ class.

We defined rules based on patterns built with morphological and syntactic information (Nivre et al., 2020) ³⁰ extracted from the title with spaCy, using the same model as described in Section 5. We extracted morpho-syntactic information from all the words, including information on conjugation for verbs: person and number. The patterns defined were as follows:

PASSIVE_VOICE: we look for the pattern <VERB><ADP>, a verb followed by a preposition. We check whether the passive voice is present and involves the personalities mentioned in the title: whether Ent₁ has a dependency on the verb of type acl, whether the verb has a dependency on Ent₁ of type nsubj:pass, or whether the verb has a dependency on Ent₂ of type obl:agent.
VERB_ENT2: detects the morphological pattern <PUNCT><VERB>Ent2<EOS>, a punctuation mark followed by a verb and ending with Ent2, where the verb must be conjugated in the 3rd person singular of the present tense and <EOS> represents the end of the title, meaning that Ent2 is the last word in the title text.
NOUN_ENT2: checks whether the pattern <ADJ>?<NOUN><ADJ>?<ADP>Ent2<EOS> is present in the title, i.e. a noun, optionally preceded or followed by an adjective, followed by a preposition and ending with Ent2, where the noun is restricted to a predefined list of nouns.

Table 6 shows some examples of news headlines and the rules that were applied to detect the Ent₁ ← Ent₂ class direction. The rules are applied sequentially, in the same order as described here. If none of the patterns are detected in the headline, the classifier assigns the Ent₁ → Ent₂ class.

Headline	Matched Rule
Marques Júnior elogiado por Cavaco Silva pela “integridade de carácter”	PASSIVE_VOICE
Passos Coelho é acusado de imaturidade política por Santos Silva	PASSIVE_VOICE
António Costa vive no “país das maravilhas” acusa Assunção Cristas	VERB_ENT2
Passos Coelho “insultou 500 mil portugueses”, acusa José Sócrates	VERB_ENT2
Maria Luís Albuquerque sob críticas de Luís Amado	NOUN_ENT2
André Ventura diz-se surpreendido com perda de apoio de Cristas	NOUN_ENT2

Table 6: Examples of titles in Portuguese and respective rules used to detect the direction of the relationship.
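
To make the rule cascade concrete, the sketch below implements only the VERB_ENT2 rule and the default fallback over tokens that have already been tagged. Everything here is an assumption for illustration: the `Token` class, the hand-written tags (in the original they come from spaCy) and the choice to represent each entity as a single token.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    pos: str          # Universal POS tag, e.g. "VERB", "PUNCT"
    morph: str = ""   # UD features, e.g. "Number=Sing|Person=3|Tense=Pres"

def verb_ent2(tokens, ent2):
    """<PUNCT><VERB>Ent2<EOS>: the title ends with a punctuation mark, a verb
    in the 3rd person singular of the present tense, and Ent2."""
    if len(tokens) < 3:
        return False
    punct, verb, last = tokens[-3:]
    feats = set(verb.morph.split("|"))
    return (punct.pos == "PUNCT"
            and verb.pos == "VERB"
            and {"Person=3", "Number=Sing", "Tense=Pres"} <= feats
            and last.text == ent2)

def classify_direction(tokens, ent2, rules=(verb_ent2,)):
    """Apply the rules in order; fall back to the majority class."""
    for rule in rules:
        if rule(tokens, ent2):
            return "ent1<-ent2"
    return "ent1->ent2"

# Hand-tagged version of a headline from the table above.
headline = [
    Token("Passos Coelho", "PROPN"), Token("“", "PUNCT"),
    Token("insultou", "VERB", "Number=Sing|Person=3|Tense=Past"),
    Token("500", "NUM"), Token("mil", "NUM"), Token("portugueses", "NOUN"),
    Token("”", "PUNCT"), Token(",", "PUNCT"),
    Token("acusa", "VERB", "Number=Sing|Person=3|Tense=Pres"),
    Token("José Sócrates", "PROPN"),
]
print(classify_direction(headline, "José Sócrates"))  # ent1<-ent2
```

The PASSIVE_VOICE and NOUN_ENT2 rules would be further functions with the same signature, placed before and after `verb_ent2` in the `rules` tuple.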

Table 7 shows the results of this classifier on the annotated dataset. The results show that the proposed method correctly classifies a large part of the Ent₁ ← Ent₂ relations, the only class for which rules have been developed, without harming the Ent₁ → Ent₂ class.

Direction	Precision	Recall	F1	#Headlines
Ent1 → Ent2	0.99	1.00	0.99	1,488
Ent1 ← Ent2	0.95	0.84	0.89	129
weighted avg.	0.98	0.98	0.98	1,517

Table 7: Precision, Recall and F1 for the relationship direction classifier.

Semantic Graph

The components described in the previous section form the process of extracting RDF triples from the collected news headlines.

The extraction process begins by recognising the personalities in the headline and linking them to each personality’s identifier in Wikidata. The extraction process continues only if both recognised personalities have been linked to an identifier in Wikidata; otherwise the headline is discarded. The type of relationship present in the title is detected with the DistilBERT model. If the relationship between the personalities in the headline is not classified as other, the classifier for the direction of the relationship is also applied to the headline; otherwise the headline is discarded.

For all the headlines considered, the final result is an RDF triple linking the personalities through a relationship of opposition or support, backed by a news item. The RDF triples generated are indexed in a SPARQL engine (Jena, 2015) ³¹ together with a Wikidata sub-graph described in Section 3.
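
The control flow of this process can be sketched as a function that chains the four components. The components are passed in as plain callables, which are placeholders for the real models; the function name, the direction labels and the requirement of exactly two recognised personalities are assumptions of this sketch.

```python
def extract_triple(headline, recognise, link, classify_type, classify_direction):
    """Chain the four components; return (subject_id, relation, object_id),
    or None when the headline is discarded."""
    persons = recognise(headline)
    if len(persons) != 2:
        return None
    ids = [link(p) for p in persons]
    if None in ids:
        return None  # both personalities must be linked to Wikidata
    relation = classify_type(headline)
    if relation == "other":
        return None
    if classify_direction(headline) == "ent1<-ent2":
        ids.reverse()
    return ids[0], relation, ids[1]

# Stub components standing in for the real models.
kb = {"Ana Gomes": "Q2844986", "Durão Barroso": "Q15849"}
triple = extract_triple(
    "Ana Gomes defende Durão Barroso",
    recognise=lambda h: ["Ana Gomes", "Durão Barroso"],
    link=kb.get,
    classify_type=lambda h: "supports",
    classify_direction=lambda h: "ent1->ent2",
)
print(triple)  # ('Q2844986', 'supports', 'Q15849')
```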

The graph generated has a total of 680 political personalities, 107 political parties and 10,361 news items covering a period of 25 years. It is available online in Terse RDF Triple Language format and can also be explored via a web interface.

Conclusions and Future Work

This work describes in detail the process of constructing a semantic graph from political news headlines.

Using SPARQL queries and referring to the various properties taken from Wikidata for each personality, it is possible to explore support and opposition relationships through aggregations by political parties, public offices, constitutional governments, constituent assemblies, among others, thus being able to formulate more complex queries, for example:

“Ministers of the XXII Constitutional Government who were opposed by PCP or BE personalities.”

The answer is the list of ministers and the articles that support the opposition relations coming from PCP or BE personalities.

One of the limitations of this work is that the headline alone often doesn’t contain enough information to determine what kind of relationship or sentiment exists from one personality to another; idiomatic expressions also make automatic classification difficult.

As future work we would like to explore the text of the news item in order to complement the headline and improve the detection of the relationship. Also based on the text of the headline, the relationships could be enriched by categorising them into topics, giving the relationship another dimension, a context for the sentiment of support or opposition.

Some headlines contain a mutual relationship, for example:

“Sócrates and Alegre exchange accusations over co-incineration”
“Pinto da Costa hits back at Pacheco Pereira’s criticisms”

These could be categorised as Ent₁↔Ent₂, indicating in this case that both personalities are accusing each other.

This work also leaves open the possibility of carrying out various studies based on the structure of the graph, for example: finding communities of support and opposition as a function of time and verifying the changes within these communities. Political triangles can also be studied: if two political personalities, X and Y, always accuse or defend a third personality Z, what is the typical relationship expected between X and Y?
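
As a starting point for the triangle analysis, the sketch below finds the pairs of personalities that hold the same stance towards a third one. It works on plain (source, relation, target) tuples rather than on the SPARQL endpoint, and the function name and toy triples are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def shared_stances(triples):
    """For each pair (X, Y), list the (relation, Z) they have in common,
    e.g. both X and Y oppose Z."""
    sources_by_stance = defaultdict(set)   # (Z, relation) -> {sources}
    for source, relation, target in triples:
        sources_by_stance[(target, relation)].add(source)
    pairs = defaultdict(list)
    for (target, relation), sources in sources_by_stance.items():
        for x, y in combinations(sorted(sources), 2):
            pairs[(x, y)].append((relation, target))
    return dict(pairs)

# Toy triples, not taken from the real graph.
toy_triples = [("X", "opposes", "Z"), ("Y", "opposes", "Z"),
               ("X", "supports", "W")]
print(shared_stances(toy_triples))  # {('X', 'Y'): [('opposes', 'Z')]}
```

The relationship between X and Y themselves could then be looked up in the same set of triples to see which pattern is typical.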

Acknowledgements

We would like to thank Nuno Feliciano for all his comments during the preparation of this work and the Arquivo.PT team for providing access to the archived data via an API and for considering this work for the Arquivo.PT 2021 awards. We also thank Edgar Felizardo and Tiago Cogumbreiro for their extensive revisions to the article, and reviewers Sérgio Nunes and José Paulo Leal for all their comments and corrections.

References

Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph
- Authors: Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior, Adrian Bielefeldt
- Conference: Proceedings of the 17th International Semantic Web Conference (ISWC’18)
- Year: 2018
- DOI: N/A
↩ ↩²
RDF 1.1 Primer W3C Working Group Note
- Authors: Guus Schreiber, Yves Raimond
- Year: 2014
- URL: RDF 1.1 Primer W3C Working Group Note
↩
SPARQL 1.1 Query Language
- Authors: Eric Prud’hommeaux, Steve Harris, Andy Seaborne
- Year: 2013
- URL: SPARQL 1.1 Query Language
↩
The State-of-the-Art in Twitter Sentiment Analysis: A Review and Benchmark Evaluation
- Authors: David Zimbra, Ahmed Abbasi, Daniel Zeng, Hsinchun Chen
- Year: 2018
- Issue Date: June 2018
- Journal: ACM Trans. Manage. Inf. Syst.
- Volume: 9
- Number: 2
- ISSN: 2158-656X
- DOI: 10.1145/3185045
- Month: August
- Article No: 5
- Number of Pages: 29
↩
SemEval-2016 Task 5: Aspect Based Sentiment Analysis
- Authors: Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Nuria Bel, Salud María Jiménez-Zafra, Gülşen Eryiğit
- Conference: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
- Year: 2016
- Month: June
- Pages: 19–30
- DOI: 10.18653/v1/S16-1002
↩
Opinion Mining on Newspaper Quotations
- Authors: Alexandra Balahur, Ralf Steinberger, Erik van der Goot, Bruno Pouliquen, Mijail Kabadjov
- Conference: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 03
- Year: 2009
- Pages: 523–526
- DOI: Link
↩
Sentiment Analysis in the News
- Authors: Alexandra Balahur, Ralf Steinberger, Mijail Kabadjov, Vanni Zavarella, Erik van der Goot, Matina Halkia, Bruno Pouliquen, Jenya Belyaeva
- Conference: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10)
- Year: 2010
- Month: May
- Pages: 655–662
- DOI: Link
↩
Automatic Creation of a Reference Corpus for Political Opinion Mining in User-Generated Content
- Authors: Luís Sarmento, Paula Carvalho, Mario J. Silva, Eugénio de Oliveira
- Conference: Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion
- Year: 2009
- DOI: 10.1145/1651461.1651468
- Pages: 29–36
↩
Tracking politics with POWER
- Authors: Silvio Moreira, David S. Batista, Paula Carvalho, Francisco M. Couto, Mario J. Silva
- Journal: Program: electronic library and information systems
- Volume: 47
- Number: 2
- Year: 2013
- DOI: Link
↩
An Annotated Corpus for Sentiment Analysis in Political News
- Authors: Gabriel Domingos de Arruda, Norton Trevisan Roman, Ana Maria Monteiro
- Conference: Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology
- Year: 2015
- DOI: 10.1145/2835988.2835995
- Pages: 101-110
↩
A dataset for Sentiment analysis of Entities in News headlines (SEN)
- Authors: Katarzyna Baraniak, Marcin Sydow
- Journal: Procedia Computer Science
- Volume: 192
- Year: 2021
- DOI: 10.1016/j.procs.2021.09.136
- Pages: 3627-3636
↩
What Do You Mean by Relation Extraction? A Survey on Datasets and Study on Scientific Relation Classification
- Authors: Elisa Bassignana, Barbara Plank
- Conference: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
- Year: 2022
- DOI: 10.18653/v1/2022.acl-srw.7
- Pages: 67-83
↩
Learning to Extract International Relations from Political Context
- Authors: Brendan O’Connor, Brandon M. Stewart, Noah A. Smith
- Conference: Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers)
- Year: 2013
- URL: PDF
↩
No Permanent Friends or Enemies: Tracking Relationships between Nations from News
- Authors: Xiaochuang Han, Eunsol Choi, Chenhao Tan
- Conference: Proceedings of the 2019 Conference of the North American Chapter of the ACL: Human Language Technologies, Volume 1 (Long and Short Papers)
- Year: 2019
- DOI: 10.18653/v1/N19-1167
↩
Feuding Families and Former Friends: Unsupervised Learning for Dynamic Fictional Relationships
- Authors: Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, Hal Daumé III
- Conference: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Year: 2016
- DOI: 10.18653/v1/N16-1180
- Pages: 1534-1544
↩
Who Blames Whom in a Crisis? Detecting Blame Ties from News Articles Using Neural Networks
- Authors: Shuailong Liang, Olivia Nicol, Yue Zhang
- Conference: Proceedings of the AAAI Conference on Artificial Intelligence
- Volume: 33
- Number: 01
- Year: 2019
- Pages: 655–662
- DOI: Link
↩ ↩²
Who Blames or Endorses Whom? Entity-to-Entity Directed Sentiment Extraction in News Text
- Authors: Kunwoo Park, Zhufeng Pan, Jungseock Joo
- Conference: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
- Year: 2021
- Month: August
- DOI: 10.18653/v1/2021.findings-acl.358
- Pages: 4091–4102
↩ ↩² ↩³
ElasticSearch: The Definitive Guide
- Authors: Clinton Gormley, Zachary Tong
- Year: 2015
- ISBN: 1449358543
- Publisher: O’Reilly Media, Inc.
- Edition: 1st
↩
Search the Past with the Portuguese Web Archive
- Authors: Daniel Gomes, David Cruz, João Miranda, Miguel Costa, Simão Fontes
- Conference: 22nd International World Wide Web Conference
- Year: 2013
- DOI: 10.1145/2487788.2487934
↩
Evaluating CETEMPúblico, a Free Resource for Portuguese
- Authors: Diana Santos, Paulo Rocha
- Conference: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
- Year: 2001
- URL: PDF
↩
CHAVE: Topics and Questions on the Portuguese Participation in CLEF
- Authors: Diana Santos, Paulo Rocha
- Conference: Working Notes for CLEF 2004 Workshop co-located with the 8th European Conference on Digital Libraries (ECDL 2004)
- Year: 2004
- URL: PDF
↩
spaCy: Industrial-strength Natural Language Processing in Python
- Authors: Matthew Honnibal, Ines Montani, Sofie Van Landeghem, Adriane Boyd
- Year: 2020
- Publisher: Zenodo
- DOI: 10.5281/zenodo.1212303
↩
Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida
- Author: Matthew A. Jaro
- Journal: Journal of the American Statistical Association
- Volume: 84
- Number: 406
- Year: 1989
- DOI: 10.1080/01621459.1989.10478785
↩
Support-Vector Networks
- Authors: Corinna Cortes, Vladimir Vapnik
- Journal: Machine Learning
- Volume: 20
- Number: 3
- Year: 1995
- Pages: 273-297
- DOI: 10.1007/BF00994018
↩
Long Short-Term Memory
- Authors: Sepp Hochreiter, Jürgen Schmidhuber
- Journal: Neural Comput.
- Volume: 9
- Number: 8
- Year: 1997
- DOI: 10.1162/neco.1997.9.8.1735
- Pages: 1735–1780
↩
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Authors: Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf
- Conference: 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS)
- Year: 2019
- DOI: N/A
- Pages: N/A
↩
Term-Weighting Approaches in Automatic Text Retrieval
- Authors: Gerard Salton, Chris Buckley
- Journal: Information Processing & Management
- Volume: 24
- Number: 5
- Year: 1988
- DOI: 10.1016/0306-4573(88)90021-0
↩
Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks
- Authors: Nathan Hartmann, Erick Fonseca, Christopher Shulby, Marcos Treviso, Jéssica Silva, Sandra Aluísio
- Conference: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology
- Year: 2017
- DOI: 10.18653/v1/W17-6615
- Pages: 122-131
↩
Load What You Need: Smaller Versions of Multilingual BERT
- Authors: Amine Abdaoui, Camille Pradel, Grégoire Sigel
- Conference: Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing
- Year: 2020
- DOI: 10.18653/v1/2020.sustainlp-1.16
- Pages: 119-123
↩
Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
- Authors: Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman
- Conference: Proceedings of the 12th Language Resources and Evaluation Conference
- Year: 2020
- DOI: 10.18653/v1/2020.lrec-1.497
- Pages: 4034-4043
- ISBN: 979-10-95546-34-4
↩
A Free and Open Source Java Framework for Building Semantic Web and Linked Data Applications
- Author: Apache Jena
- Year: 2015
- URL: Official Website
↩
