Trust Eval

Welcome to Trust Eval! 🌟

A comprehensive tool for evaluating the trustworthiness of inline-cited outputs generated by large language models (LLMs) within the Retrieval-Augmented Generation (RAG) framework. Our suite of metrics measures correctness, citation quality, and groundedness.

This is the official implementation of the metrics introduced in the paper "Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse" (accepted at ICLR '25).

Installation 🛠️

Prerequisites

  • OS: Linux
  • Python: Versions 3.10 – 3.12 (preferably 3.10.13)
  • GPU: Compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100)

Steps

  1. Set up a Python environment

    conda create -n trust_eval python=3.10.13
    conda activate trust_eval
  2. Install dependencies

    pip install trust_eval

    Note: vLLM is installed with CUDA 12.1. Please ensure your CUDA setup is compatible (a quick sanity check is shown after these steps).

  3. Set up NLTK

    import nltk
    nltk.download('punkt_tab')
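
If you want to confirm which CUDA build your PyTorch/vLLM environment targets before running anything heavy (see the note in step 2), a quick sanity check along these lines can help; the exact output depends on your install:

# Sanity check: print the CUDA version PyTorch was built against and
# whether a GPU is visible. vLLM is installed against CUDA 12.1, so a
# mismatched driver/toolkit is a common source of errors.
import torch

print("CUDA available:", torch.cuda.is_available())
print("PyTorch built with CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))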

Quickstart 🔥

Set up

Download eval_data from the Trust-Align Hugging Face repository and place it at the same level as the prompts folder. If you would like to use the default path configuration, do not rename the folders; if you do rename them, you will need to specify your own paths.

quickstart/
├── eval_data/
├── prompts/
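
If you prefer to fetch the data from Python rather than downloading it manually, a sketch like the one below should work with huggingface_hub; the repo_id is a placeholder, so substitute the actual Trust-Align dataset repository linked above:

# Sketch: download the evaluation data with huggingface_hub.
# NOTE: repo_id is a placeholder -- use the Trust-Align dataset repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<trust-align-dataset-repo>",  # placeholder
    repo_type="dataset",
    local_dir="eval_data",  # keep this name to use the default path configuration
)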

Quick look at the data

Here, we are working with ASQA, where the questions are long-form factoid QA. Each sample has three fields: question, answers, and docs. Below is one example from the dataset:

[ ...
    {   // The question asked.
        "question": "Who has the highest goals in world football?",

        // A list containing all correct (short) answers to the question, represented as arrays where each element contains variations of the answer. 
        "answers": [
            ["Daei", "Ali Daei"],                // Variations for Ali Daei
            ["Bican", "Josef Bican"],            // Variations for Josef Bican
            ["Sinclair", "Christine Sinclair"]   // Variations for Christine Sinclair
        ],

        // A list of 100 dictionaries where each dictionary contains one document.
        "docs": [
            {   
                // The title of the document being referenced.
                "title": "Argentina\u2013Brazil football rivalry",

                // A snippet of text from the document.
                "text": "\"Football Player of the Century\", ...",

                // A binary list where each element indicates if the respective answer was found in the document (1 for found, 0 for not found).
                "answers_found": [0,0,0],

                // A recall score calculated as the percentage of correct answers that the document entails.
                "rec_score": 0.0
            },
        ]
    },
...
]
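
To get a feel for the format, a short inspection script like the one below loads one sample and recomputes a document's recall from answers_found; the file name is an assumption, so point it at whichever ASQA file sits inside eval_data/:

# Sketch: inspect one ASQA sample and recompute document recall.
# NOTE: the file name is a placeholder -- use the ASQA file in eval_data/.
import json

with open("eval_data/asqa_eval.json") as f:  # placeholder file name
    data = json.load(f)

sample = data[0]
print(sample["question"])
print("gold answers:", sample["answers"])

doc = sample["docs"][0]
# Fraction of gold answers this document entails (compare with rec_score,
# which may be stored as a fraction or a percentage).
recomputed = sum(doc["answers_found"]) / len(doc["answers_found"])
print(doc["title"], "| rec_score:", doc["rec_score"], "| recomputed:", recomputed)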
    

Please refer to the datasets page for examples of ELI5 and QAMPARI samples.

Configuring yaml files

For generator-related configurations, three fields are mandatory: data_type, model, and max_length. data_type determines which benchmark dataset to evaluate on, model determines which model to evaluate, and max_length is the maximum context length of the model. We use Qwen2.5-3B-Instruct in this tutorial, but you can replace it with the path to your own model checkpoint to evaluate your model.

data_type: "asqa"
model: Qwen/Qwen2.5-3B-Instruct
max_length: 8192

For evaluation-related configurations, only data_type is mandatory.

data_type: "asqa"

Your directory should now look like this:

quickstart/
├── eval_data/
├── prompts/
├── generator_config.yaml
├── eval_config.yaml

Running evals

Now define your main script:

Generating Responses

from config import EvaluationConfig, ResponseGeneratorConfig
from evaluator import Evaluator
from logging_config import logger
from response_generator import ResponseGenerator

# Configure the response generator
generator_config = ResponseGeneratorConfig.from_yaml(yaml_path="generator_config.yaml")

# Generate and save responses
generator = ResponseGenerator(generator_config)
generator.generate_responses()
generator.save_responses()

Evaluating Responses

# Configure the evaluator
evaluation_config = EvaluationConfig.from_yaml(yaml_path="eval_config.yaml")

# Compute and save evaluation metrics
evaluator = Evaluator(evaluation_config)
evaluator.compute_metrics()
evaluator.save_results()

Save the script as example_usage.py. Your directory should now look like this:

quickstart/
├── eval_data/
├── prompts/
├── example_usage.py
├── generator_config.yaml
├── eval_config.yaml

Then run the script:

CUDA_VISIBLE_DEVICES=0,1 python example_usage.py

Note: Specify the GPUs you wish to run on in CUDA_VISIBLE_DEVICES. For reference, we are able to run up to 7B models on two A40s.

Sample output:

{ // refusal response: "I apologize, but I couldn't find an answer..."
    
    // Basic statistics
    "num_samples": 948,
    "answered_ratio": 50.0, // Ratio of (# answered qns / total # qns)
    "answered_num": 5, // # of qns where response is not refusal response
    "answerable_num": 7, // # of qns that ground truth answerable, given the documents
    "overlapped_num": 5, // # of qns that are both answered and answerable
    "regular_length": 46.6, // Average length of all responses
    "answered_length": 28.0, // Average length of non-refusal responses

    // Refusal groundedness metrics

    // # qns where (model refused to respond & is ground truth unanswerable) / # qns that are ground truth unanswerable
    "reject_rec": 100.0,

    // # qns where (model refused to respond & is ground truth unanswerable) / # qns where model refused to respond
    "reject_prec": 60.0,

    // F1 of reject_rec and reject_prec
    "reject_f1": 75.0,

    // # qns where (model responded & is ground truth answerable) / # qns that are ground truth answerable
    "answerable_rec": 71.42857142857143,

    // # qns where (model responded & is ground truth answerable) / # qns where model responded
    "answerable_prec": 100.0,

    // F1 of answerable_rec and answerable_prec
    "answerable_f1": 83.33333333333333,

    // Avg of reject_rec and answerable_rec
    "macro_avg": 85.71428571428572,

    // Avg of reject_f1 and answerable_f1
    "macro_f1": 79.16666666666666,

    // Response correctness metrics

    // Regardless of response type (refusal or answered), check whether the ground truth claims appear in the response.
    "regular_str_em": 41.666666666666664,

    // Only for qns with answered responses, check whether the ground truth claims appear in the response.
    "answered_str_em": 66.66666666666666,

    // Calculate EM for all qns that are answered and answerable, avg by # of answered questions (EM_alpha)
    "calib_answered_str_em": 100.0,

    // Calculate EM for all qns that are answered and answerable, avg by # of answerable questions (EM_beta)
    "calib_answerable_str_em": 71.42857142857143,

    // F1 of calib_answered_str_em and calib_answerable_str_em
    "calib_str_em_f1": 83.33333333333333,

    // EM score of qns that are answered and ground truth unanswerable, indicating use of parametric knowledge
    "parametric_str_em": 0.0,

    // Citation quality metrics

    // (Avg across all qns) Does the set of citations support statement s_i? 
    "regular_citation_rec": 28.333333333333332,

    // (Avg across all qns) Any redundant citations? (1) Does citation c_i,j fully support statement s_i? (2) Is the set of citations without c_i,j insufficient to support statement s_i? 
    "regular_citation_prec": 35.0,

    // F1 of regular_citation_rec and regular_citation_prec
    "regular_citation_f1": 31.315789473684212,

    // Citation recall, averaged across answered qns only
    "answered_citation_rec": 50.0,

    // Citation precision, averaged across answered qns only
    "answered_citation_prec": 60.0,

    // F1 of answered_citation_rec and answered_citation_prec
    "answered_citation_f1": 54.54545454545455,

    // Avg of macro_f1, calib_str_em_f1, and answered_citation_f1
    "trust_score": 72.34848484848486
}
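
As a sanity check on how the composite numbers are assembled, the F1 scores and trust_score above can be reproduced from the recall/precision pairs; for this ASQA run the calibrated correctness term is the str_em-based F1:

# Recompute the composite scores from the sample output above.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

reject_f1 = f1(60.0, 100.0)                       # 75.0
answerable_f1 = f1(100.0, 71.42857142857143)      # 83.33
macro_f1 = (reject_f1 + answerable_f1) / 2        # 79.17

calib_str_em_f1 = f1(100.0, 71.42857142857143)    # 83.33
answered_citation_f1 = f1(60.0, 50.0)             # 54.55

trust_score = (macro_f1 + calib_str_em_f1 + answered_citation_f1) / 3
print(round(trust_score, 2))                      # 72.35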

Please refer to the metrics page for explanations of the outputs when evaluating with ELI5 or QAMPARI.

The end

Congratulations! You have reached the end of the quickstart tutorial and you are now ready to benchmark your own RAG application (running the evaluations with custom data instead of benchmark data) or reproduce our experimental setup! 🥳

Contact 📬

For questions or feedback, reach out to Shang Hong ([email protected]).

Citation 📝

If you use this software in your research, please cite the Trust-Eval paper as follows.

@misc{song2024measuringenhancingtrustworthinessllms,
      title={Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse}, 
      author={Maojia Song and Shang Hong Sim and Rishabh Bhardwaj and Hai Leong Chieu and Navonil Majumder and Soujanya Poria},
      year={2024},
      eprint={2409.11242},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.11242}, 
}
