Evaluate Library for Hugging Face

Last Updated: 1 Apr, 2026

The Evaluate library is a tool designed to assess the performance of Hugging Face models using a wide range of evaluation metrics. It simplifies measuring model accuracy, precision, recall and other metrics across different tasks.

  • Provides ready‑to‑use evaluation metrics for various ML tasks
  • Supports benchmarking and comparison of model performance
  • Works seamlessly with Hugging Face datasets and transformers
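
Every metric in the library follows the same load/compute pattern. As a minimal sketch with toy labels (assuming evaluate and its metric dependencies, such as scikit-learn for the accuracy metric, are installed):

Python
import evaluate

# Load a metric by name; the metric script is fetched on first use
accuracy = evaluate.load("accuracy")

# Every metric exposes the same compute() interface
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # {'accuracy': 0.75}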

Techniques for Evaluation

1. ROUGE: Used for evaluating text summarization by comparing the generated summary with a reference summary

  • ROUGE-1: Measures how many individual words (unigrams) in the generated summary also appear in the reference summary
  • ROUGE-2: Measures how many word pairs (bigrams) match between the two summaries
  • ROUGE-L: Measures the longest common subsequence, i.e. how much of the sentence structure matches in the same order
  • ROUGE-Lsum: Applies ROUGE-L sentence by sentence to score the complete summary
</gr-replace>
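
A minimal sketch of these four variants on toy strings (assuming the rouge_score dependency is installed; the full pipeline on real data appears in the Implementation section below):

Python
import evaluate

rouge = evaluate.load("rouge")
result = rouge.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat lay on the mat"]
)
# Returns a dict with rouge1, rouge2, rougeL and rougeLsum scores
print(result)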

2. BLEU: Used for evaluating generated text by comparing it with reference text; it checks how many words and word sequences (n-grams) match the reference.
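
A minimal BLEU sketch on toy strings; note that the Evaluate BLEU metric lets each prediction be paired with several reference texts:

Python
import evaluate

bleu = evaluate.load("bleu")
result = bleu.compute(
    predictions=["the cat is on the mat"],
    references=[["the cat sat on the mat"]]  # list of references per prediction
)
print(result["bleu"])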

3. Accuracy: Measures how many predictions are correct out of the total predictions, giving the overall correctness of the model (see the combined sketch after this list)

4. Precision: Measures how many predicted positive results are actually correct

5. Recall: Measures how many actual positive results are correctly identified

6. F1 Score: Used for balancing precision and recall into a single value
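
A minimal combined sketch for these four classification metrics, using toy binary labels:

Python
import evaluate

predictions = [1, 0, 1, 1, 0]
references = [1, 0, 0, 1, 1]

# Each metric is loaded by name and computed through the same interface
for name in ["accuracy", "precision", "recall", "f1"]:
    metric = evaluate.load(name)
    print(name, metric.compute(predictions=predictions, references=references))

The library also provides evaluate.combine() to bundle several metrics into a single compute() call.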

Implementation

Let’s understand the implementation using a text summarization model from Hugging Face with a sample dataset.

Step 1: Importing the required libraries

  • transformers (pipeline) is used to load pre-trained models for tasks like text summarization
  • evaluate is used for computing evaluation metrics
  • pandas is used for loading the dataset from a CSV file
Python
from transformers import pipeline
import evaluate
import pandas as pd

Step 2: Loading the Dataset

Loading the dataset from a CSV file to use real text and summary data.

You can download the dataset from here.

Python
df = pd.read_csv("bbc_real_dataset.csv")
print(df.head())

Output:

Dataset preview showing text and summaries


Step 3: Preparing the Dataset

Selecting a small subset of the dataset and converting it into the required format.

Python
dataset = df.head(2).to_dict(orient="records")
print(dataset)

Output:

Prepared dataset in dictionary format


Step 4: Loading the Summarization Model

Loading a pre-trained model to generate summaries from the given text.

Python
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
print("\nModel loaded successfully")

Step 5: Generating Summaries

Generating summaries for the input text using the loaded model.

Python
predictions = []
for item in dataset:
    summary = summarizer(
        item["text"],   # article text
        max_length=40,
        min_length=15,
        do_sample=False
    )[0]['summary_text']
    
    predictions.append(summary)

# Show the generated summaries
print("\nGenerated Summaries:")
for i, summary in enumerate(predictions, start=1):
    print(f"{i}. {summary}\n")

Output:

Generated summaries by the model


Step 6: Preparing Reference Summaries

Extracting the actual summaries from the dataset for comparison.

Python
references = [item["summary"] for item in dataset]

print("Reference Summaries:")
for i, ref in enumerate(references, start=1):
    print(f"{i}. {ref}\n")

Output:

Actual Reference Summaries from the Dataset


Step 7: Loading the ROUGE metric

  • ROUGE measures similarity between generated and reference summaries
  • Higher score indicates better performance
Python
rouge = evaluate.load("rouge")
print("ROUGE metric loaded")

Step 8: Computing the ROUGE Score

Comparing the generated summaries with the reference summaries.

Python
result = rouge.compute(predictions=predictions, references=references)

Step 9: Displaying the Results

Python
print("ROUGE-1:", result['rouge1'])
print("ROUGE-2:", result['rouge2'])
print("ROUGE-L:", result['rougeL'])
print("ROUGE-Lsum:", result['rougeLsum'])

Output:

ROUGE scores for the generated summaries


The output shows how close the generated summary is to the actual summary.

  • ROUGE-1 (0.14) means only about 14% of the individual words in the generated summary overlap with the reference summary
  • ROUGE-2 (0.0) means no word pairs (bigrams) match, so phrases are not correctly captured
  • ROUGE-L (0.12) shows very low overlap in sentence structure (longest common subsequence)
  • ROUGE-Lsum (0.12) likewise indicates low similarity for the complete summary
  • Lower values indicate that the generated summary is not very accurate

We performed this evaluation to check how good the model's output is: comparing the generated summaries with the reference summaries shows quite low similarity, indicating that the model's summaries need improvement for this dataset.
