Welcome to Trust Eval! 🌟
A comprehensive tool for evaluating the trustworthiness of inline-cited outputs generated by large language models (LLMs) within the Retrieval-Augmented Generation (RAG) framework. Our suite of metrics measures correctness, citation quality, and groundedness.
This is the official implementation of the metrics introduced in the paper "Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse" (accepted at ICLR '25).
- OS: Linux
- Python: Versions 3.10 – 3.12 (preferably 3.10.13)
- GPU: Compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100)
- Set up a Python environment:

  conda create -n trust_eval python=3.10.13
  conda activate trust_eval
- Install dependencies:

  pip install trust_eval
Note: vLLM will be installed with CUDA 12.1. Please ensure your CUDA setup is compatible.
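Before moving on, you can quickly verify that your CUDA runtime and GPUs meet these requirements. This is an optional sanity check, a minimal sketch that assumes PyTorch is available (it is pulled in by vLLM); it is not part of the trust_eval API.

```python
# Optional sanity check: CUDA version and GPU compute capability.
import torch

# vLLM ships CUDA 12.1 wheels, so the CUDA version PyTorch reports should be 12.x-compatible.
print(f"PyTorch CUDA version: {torch.version.cuda}")

# Trust Eval expects GPUs with compute capability 7.0 or higher (V100, T4, RTX20xx, A100, L4, H100).
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (compute capability {major}.{minor})")
    assert (major, minor) >= (7, 0), f"GPU {i} is below compute capability 7.0"
```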
- Set up NLTK:

  import nltk
  nltk.download('punkt_tab')
Download eval_data from the Trust-Align Hugging Face repository and place it at the same level as the prompts folder. If you would like to use the default path configuration, do not rename the folders; if you rename them, you will need to specify your own paths.
quickstart/
├── eval_data/
├── prompts/
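If you prefer to fetch eval_data programmatically, a sketch along these lines works with the huggingface_hub client. The repo_id below is a placeholder (not confirmed here), so substitute the actual Trust-Align dataset id from Hugging Face:

```python
# Optional: download the evaluation data with huggingface_hub.
# The repo_id is a placeholder -- replace it with the actual Trust-Align dataset repository id.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="declare-lab/Trust-Align",  # placeholder id; check the Trust-Align page on Hugging Face
    repo_type="dataset",
    local_dir="eval_data",              # keep the default folder name so the default paths resolve
)
```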
Here, we are working with ASQA, where the questions are long-form factoid QA. Each sample has three fields: `question`, `answers`, and `docs`. Below is one example from the dataset:
[
  ...
  {
    // The question asked.
    "question": "Who has the highest goals in world football?",
    // A list containing all correct (short) answers to the question, represented as arrays
    // where each element contains variations of the answer.
    "answers": [
      ["Daei", "Ali Daei"],                // Variations for Ali Daei
      ["Bican", "Josef Bican"],            // Variations for Josef Bican
      ["Sinclair", "Christine Sinclair"]   // Variations for Christine Sinclair
    ],
    // A list of 100 dictionaries where each dictionary contains one document.
    "docs": [
      {
        // The title of the document being referenced.
        "title": "Argentina\u2013Brazil football rivalry",
        // A snippet of text from the document.
        "text": "\"Football Player of the Century\", ...",
        // A binary list where each element indicates whether the respective answer
        // was found in the document (1 for found, 0 for not found).
        "answers_found": [0, 0, 0],
        // A recall score calculated as the percentage of correct answers that the document entails.
        "rec_score": 0.0
      },
      ...
    ]
  },
  ...
]
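To sanity-check your copy of the data against this format, you can load a file and inspect the three fields directly. This is a minimal sketch; the filename below is a placeholder, so point it at the actual ASQA file inside eval_data/:

```python
import json

# Placeholder path -- use the actual ASQA file name inside eval_data/.
with open("eval_data/asqa_eval.json") as f:
    data = json.load(f)

sample = data[0]
print(sample["question"])
print(len(sample["answers"]), "gold answers")
print(len(sample["docs"]), "retrieved documents")  # expected: 100

# Count documents that contain at least one gold answer (answers_found is a 0/1 list).
supporting = sum(1 for doc in sample["docs"] if any(doc["answers_found"]))
print(f"{supporting} of {len(sample['docs'])} documents contain at least one gold answer")
```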
Please refer to the datasets page for examples of how ELI5 and QAMPARI samples are structured.
For generator-related configurations, three fields are mandatory: `data_type`, `model`, and `max_length`. `data_type` determines which benchmark dataset to evaluate on, `model` determines which model to evaluate, and `max_length` is the maximum context length of the model. We will be using `Qwen2.5-3B-Instruct` in this tutorial, but you can replace it with the path to your own model checkpoint to evaluate your model.
data_type: "asqa"
model: Qwen/Qwen2.5-3B-Instruct
max_length: 8192
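`max_length` should not exceed the context window of the model you are evaluating. If you are unsure what your checkpoint supports, one way to check (a sketch, assuming the transformers library is installed) is to read its Hugging Face config:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
# Most causal LM configs expose the context window as max_position_embeddings;
# choose a max_length no larger than this value.
print(config.max_position_embeddings)
```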
For evaluation-related configurations, only `data_type` is mandatory:
data_type: "asqa"
Your directory should now look like this:
quickstart/
├── eval_data/
├── prompts/
├── generator_config.yaml
├── eval_config.yaml
Now define your main script:
Generating Responses
from config import EvaluationConfig, ResponseGeneratorConfig
from evaluator import Evaluator
from logging_config import logger
from response_generator import ResponseGenerator
# Configure the response generator
generator_config = ResponseGeneratorConfig.from_yaml(yaml_path="generator_config.yaml")
# Generate and save responses
generator = ResponseGenerator(generator_config)
generator.generate_responses()
generator.save_responses()
Evaluating Responses
# Configure the evaluator
evaluation_config = EvaluationConfig.from_yaml(yaml_path="eval_config.yaml")
# Compute and save evaluation metrics
evaluator = Evaluator(evaluation_config)
evaluator.compute_metrics()
evaluator.save_results()
Your directory should look like this:
quickstart/
├── eval_data/
├── prompts/
├── example_usage.py
├── generator_config.yaml
├── eval_config.yaml
CUDA_VISIBLE_DEVICES=0,1 python example_usage.py
Note: Define the GPUs you wish to run on in `CUDA_VISIBLE_DEVICES`. For reference, we are able to run models of up to 7B parameters on two A40s.
Sample output:
{ // refusal response: "I apologize, but I couldn't find an answer..."
// Basic statistics
"num_samples": 948,
"answered_ratio": 50.0, // Ratio of (# answered qns / total # qns)
"answered_num": 5, // # of qns where response is not refusal response
"answerable_num": 7, // # of qns that ground truth answerable, given the documents
"overlapped_num": 5, // # of qns that are both answered and answerable
"regular_length": 46.6, // Average length of all responses
"answered_length": 28.0, // Average length of non-refusal responses
// Refusal groundedness metrics
// # qns where (model refused to respond & is ground truth unanswerable) / # qns is ground truth unanswerable
"reject_rec": 100.0,
// # qns where (model refused to respond & is ground truth unanswerable) / # qns where model refused to respond
"reject_prec": 60.0,
// F1 of reject_rec and reject_prec
"reject_f1": 75.0,
// # qns where (model respond & is ground truth answerable) / # qns is ground truth answerable
"answerable_rec": 71.42857142857143,
// # qns where (model respond & is ground truth answerable) / # qns where model responded
"answerable_prec": 100.0,
// F1 of answerable_rec and answerable_prec
"answerable_f1": 83.33333333333333,
// Avg of reject_rec and answerable_rec
"macro_avg": 85.71428571428572,
// Avg of reject_f1 and answerable_f1
"macro_f1": 79.16666666666666,
// Response correctness metrics
// Regardless of response type (refusal or answered), check if ground truth claim is in the response.
"regular_str_em": 41.666666666666664,
// Only for qns with answered responses, check if ground truth claim is in the response.
"answered_str_em": 66.66666666666666,
// Calculate EM for all qns that are answered and answerable, avg by # of answered questions (EM_alpha)
"calib_answered_str_em": 100.0,
// Calculate EM for all qns that are answered and answerable, avg by # of answerable questions (EM_beta)
"calib_answerable_str_em": 71.42857142857143,
// F1 of calib_answered_str_em and calib_answerable_str_em
"calib_str_em_f1": 83.33333333333333,
// EM score of qns that are answered and ground truth unanswerable, indicating use of parametric knowledge
"parametric_str_em": 0.0,
// Citation quality metrics
// (Avg across all qns) Does the set of citations support statement s_i?
"regular_citation_rec": 28.333333333333332,
// (Avg across all qns) Any redundant citations? (1) Does citation c_i,j fully support statement s_i? (2) Is the set of citations without c_i,j insufficient to support statement s_i?
"regular_citation_prec": 35.0,
// F1 of regular_citation_rec and regular_citation_prec
"regular_citation_f1": 31.315789473684212,
// (Avg across answered qns only)
"answered_citation_rec": 50.0,
// (Avg across answered qns only)
"answered_citation_prec": 60.0,
// F1 answered_citation_rec and answered_citation_prec
"answered_citation_f1": 54.54545454545455,
// Avg of (macro_f1, calib_str_em_f1, answered_citation_f1)
"trust_score": 72.34848484848486
}
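Most of the aggregate numbers above are simple combinations of the component metrics: each *_f1 is the harmonic mean (F1) of the corresponding recall and precision, the macro scores are plain averages, and trust_score averages macro_f1, calib_str_em_f1, and answered_citation_f1. The snippet below reproduces a few of the sample values above; it is only an illustration of how the scores relate, not the library's implementation:

```python
def f1(rec: float, prec: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * rec * prec / (rec + prec) if rec + prec else 0.0

print(f1(100.0, 60.0))                  # reject_f1     -> 75.0
print(f1(71.42857142857143, 100.0))     # answerable_f1 -> 83.33...
print((75.0 + 83.33333333333333) / 2)   # macro_f1      -> 79.16...
print((79.16666666666666                # trust_score   -> 72.34...
       + 83.33333333333333              #   calib_str_em_f1
       + 54.54545454545455) / 3)        #   answered_citation_f1
```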
Please refer to the metrics page for explanations of the outputs when evaluating with ELI5 or QAMPARI.
Congratulations! You have reached the end of the quickstart tutorial and you are now ready to benchmark your own RAG application (running the evaluations with custom data instead of benchmark data) or reproduce our experimental setup! 🥳
For questions or feedback, reach out to Shang Hong ([email protected]).
If you use this software in your research, please cite the Trust-Eval paper as below.
@misc{song2024measuringenhancingtrustworthinessllms,
title={Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse},
author={Maojia Song and Shang Hong Sim and Rishabh Bhardwaj and Hai Leong Chieu and Navonil Majumder and Soujanya Poria},
year={2024},
eprint={2409.11242},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.11242},
}