LRAGE (Legal Retrieval Augmented Generation Evaluation, pronounced as 'large') is an open-source toolkit designed to evaluate Large Language Models (LLMs) in a Retrieval-Augmented Generation (RAG) setting, specifically tailored for the legal domain.
LRAGE was developed to address the unique challenges that Legal AI researchers face, such as building and evaluating retrieval-augmented systems effectively. It seamlessly integrates datasets and tools to help researchers evaluate LLM performance on legal tasks without cumbersome engineering overhead.
You can check out the demo video here.
- Legal Domain Focused Evaluation: LRAGE is specifically developed for evaluating LLMs in a RAG setting with datasets and document collections from the legal domain, such as Pile-of-law and LegalBench.
- Pre-compiled indexes for the legal domain: Comes with pre-generated BM25 indices and embeddings for Pile-of-law, reducing the setup effort for researchers.
- Retriever & Reranker Integration: Easily integrate and evaluate different retrievers and rerankers. LRAGE modularizes the retrieval and reranking components, allowing for flexible experimentation.
- LLM-as-a-Judge: A feature where LLMs are used to evaluate the quality of LLM responses on an instance-by-instance basis, using customizable rubrics within the RAG setting.
- Graphical User Interface: A GUI demo for intuitive usage, making the tool accessible even for those who are not deeply familiar with command-line interfaces.
Extensions for RAG Evaluation from lm-evaluation-harness
graph TB
subgraph LRAGE ["LRAGE Extensions"]
API[lrage/api]
RT[Retriever Abstract Class]
RR[Reranker Abstract Class]
TR[Task with RAG]
LJ[LLM-as-Judge]
subgraph Implementations
PR[Pyserini Retriever]
RER[Rerankers Reranker]
end
API --> RT
API --> RR
API --> TR
RT --> PR
RR --> RER
TR --> |build_all_requests| FLOW[Document Flow]
FLOW --> | Retrieval| RT
FLOW --> | Reranking| RR
FLOW --> | LLM Response| LLM
TR --> |process_results| LJ
end
subgraph lm-evaluation-harness
LM[LM Abstract Class]
IMPL[HF/vLLM Implementations]
LM --> IMPL
IMPL --> LLM[LLM Interface]
end
style LRAGE fill:#e6f3ff,stroke:#4a90e2
style lm-evaluation-harness fill:#f5f5f5,stroke:#666
style Implementations fill:#f0f9ff,stroke:#4a90e2
- Addition of Retriever and Reranker abstract classes: LRAGE introduces retriever and reranker abstract classes in lrage/api/. These additions allow the request-building process in the api.task.Task class's build_all_requests() method to go through both retrieval and reranking steps, extending the evaluation pipeline for RAG.
- Extensible Retriever and Reranker implementations: While maintaining the same structure as lm-evaluation-harness, LRAGE allows for the flexible integration of different retriever and reranker implementations. Just as lm-evaluation-harness provides an abstract LM class with implementations for libraries like HuggingFace (hf) and vLLM, LRAGE provides pyserini_retriever (powered by Pyserini) in lrage/retrievers/ and rerankers_reranker (powered by rerankers) in lrage/rerankers/. This structure allows users to easily implement and integrate other retrievers or rerankers, such as those from LlamaIndex, by simply extending the abstract classes (see the sketch after this list).
- Integration of LLM-as-a-judge: LRAGE modifies ConfigurableTask.process_results to support 'LLM-Eval' metrics, enabling a more nuanced evaluation of RAG outputs by using language models as judges.
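To illustrate what extending these abstract classes might look like, here is a minimal sketch of a custom retriever that wraps a LlamaIndex index. The import path `lrage.api.retriever`, the base-class name `Retriever`, and the `retrieve()` signature are assumptions made for illustration (the README only states that the abstract classes live in `lrage/api/`); the LlamaIndex calls follow its public retriever API.

```python
# Minimal sketch of plugging a third-party retriever into LRAGE.
# ASSUMPTIONS: the module path `lrage.api.retriever`, the base-class name
# `Retriever`, and the `retrieve()` signature are illustrative guesses;
# consult lrage/api/ for the actual abstract interface.
from typing import List

from lrage.api.retriever import Retriever  # hypothetical import path


class LlamaIndexRetriever(Retriever):
    """Hypothetical adapter exposing a LlamaIndex index through LRAGE's retriever interface."""

    def __init__(self, index, top_k: int = 3):
        # `index` is any LlamaIndex index object (e.g., a VectorStoreIndex).
        self._retriever = index.as_retriever(similarity_top_k=top_k)

    def retrieve(self, query: str) -> List[str]:
        # Return the raw text of the top-k matching nodes.
        results = self._retriever.retrieve(query)
        return [r.node.get_content() for r in results]
```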
- JDK 21 is required to use the Pyserini retriever
conda config --add channels conda-forge
conda install openjdk=21
- Clone the repository:
git clone https://github.com/hoorangyee/LRAGE.git
cd LRAGE
- Install:
pip install -e .
To evaluate a model on a sample dataset using the RAG setting, follow these steps:
- Prepare your dataset in the supported format.
- Choose one of the following methods to run:
A. Run the evaluation script in CLI:
lrage \
--model hf \
--model_args pretrained=meta-llama/Llama-3.2-1B \
--tasks abercrombie_tiny \
--batch_size 8 \
--device cuda \
--retrieve_docs \
--top_k 3 \
--retriever pyserini \
--retriever_args retriever_type=bm25,bm25_index_path=msmarco-v1-passage \
--rerank \
--reranker rerankers \
--reranker_args reranker_type=colbert
B. Run the GUI:
cd LRAGE
./run_gui.sh
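Since LRAGE extends lm-evaluation-harness, a Python entry point analogous to `lm_eval.simple_evaluate` may also be available; the sketch below is a guess at what such a call could look like, and every keyword argument beyond the standard lm-evaluation-harness ones is an assumption mirroring the CLI flags rather than a documented API.

```python
# Hypothetical programmatic equivalent of the CLI call above.
# ASSUMPTIONS: `lrage.simple_evaluate` exists (mirroring `lm_eval.simple_evaluate`)
# and the retrieval/reranking keyword arguments map one-to-one onto the CLI flags.
import lrage

results = lrage.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B",
    tasks=["abercrombie_tiny"],
    batch_size=8,
    device="cuda",
    retrieve_docs=True,    # assumed kwarg, mirrors --retrieve_docs
    top_k=3,               # assumed kwarg, mirrors --top_k
    retriever="pyserini",  # assumed kwarg, mirrors --retriever
    retriever_args="retriever_type=bm25,bm25_index_path=msmarco-v1-passage",
)
print(results["results"])
```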
The basic usage follows the lm-evaluation-harness documentation.
Below is a detailed guide for using the LRAGE CLI.
- `--model`, `-m`: Name of the model to use (e.g., `hf` for HuggingFace models)
- `--tasks`, `-t`: Names of tasks to evaluate (comma-separated)
  - To see available tasks: `lrage --tasks list`
- `--model_args`, `-a`: Arguments for model configuration
  - Format: `key1=value1,key2=value2`
  - Example: `pretrained=meta-llama/Llama-3.1-8B,dtype=float32`
- `--device`: Device to use (e.g., `cuda`, `cuda:0`, `cpu`)
- `--batch_size`, `-b`: Batch size (`auto`, `auto:N`, or an integer)
- `--system_instruction`: System instruction for the prompt
- `--apply_chat_template`: Enable chat template (flag)
- `--retrieve_docs`: Enable document retrieval
- `--top_k`: Number of documents to retrieve per query (default: 3)
- `--retriever`: Type of retriever (e.g., `pyserini`)
- `--rerank`: Enable reranking
- `--reranker`: Type of reranker (e.g., `rerankers`)
| Retriever | Argument | Required | Description | Example |
|---|---|---|---|---|
| pyserini | `retriever_type` | Yes | Type of retriever to use | `bm25`, `sparse`, `dense`, `hybrid` |
| | `bm25_index_path` | Yes | Path to BM25 index or prebuilt index name | `msmarco-v1-passage` |
| | `encoder_path` | For sparse/dense/hybrid | Path to encoder or prebuilt encoder name | `castorini/tct_colbert-v2-hnp-msmarco` |
| | `encoder_type` | Optional | Type of encoder | `tct_colbert`, `dpr`, `auto` |
| | `faiss_index_path` | For dense/hybrid | Path to FAISS index or prebuilt index name | `msmarco-v1-passage.tct_colbert-v2-hnp` |
Note: FAISS indexes and sparse vector indexes (e.g., embeddings generated by SPLADE) do not store the original documents, so when using them a BM25 index is also required to look up the original document text.
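To see why the BM25 index doubles as the document store, the following plain-Pyserini snippet (independent of LRAGE) searches a prebuilt BM25 index and reads the raw documents back out of it; a FAISS or sparse-vector index stores only vectors, so it cannot perform this lookup on its own.

```python
# Illustration with plain Pyserini (not LRAGE-specific): a Lucene/BM25 index
# stores the raw documents, so retrieved hits can be resolved back to their text.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
hits = searcher.search("What constitutes trademark infringement?", k=3)

for hit in hits:
    raw_doc = searcher.doc(hit.docid).raw()  # original document stored in the BM25 index
    print(hit.docid, round(hit.score, 2), raw_doc[:80])
```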
Supported Prebuilt Resources:
Example Usage:
--retriever_args retriever_type=bm25,bm25_index_path=msmarco-v1-passage
| Reranker | Argument | Required | Description | Example |
|---|---|---|---|---|
| rerankers | `reranker_type` | Yes | Type of reranker to use | `colbert` |
| | `reranker_path` | Optional | Name of a specific reranker model | `gpt-4o` with `reranker_type=rankllm` |
Example Usage:
--reranker_args reranker_type=colbert
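For reference, the `rerankers` backend wraps the rerankers Python library, so a `reranker_type` such as `colbert` corresponds, outside of LRAGE, to roughly the following usage; the exact mapping of LRAGE's type names onto the library's model shorthands is an assumption based on the examples in this README.

```python
# Illustration with the `rerankers` library directly (not LRAGE-specific).
from rerankers import Reranker

ranker = Reranker("colbert")  # loads a default ColBERT-style reranking model

results = ranker.rank(
    query="What constitutes trademark infringement?",
    docs=[
        "Trademark infringement requires a likelihood of consumer confusion.",
        "A patent grants exclusive rights to an invention for a limited time.",
    ],
)
print(results.top_k(1))  # highest-scoring document first
```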
- `--judge_model`: Model for LLM-as-judge evaluation
- `--judge_model_args`: Configuration for the judge model
- `--judge_device`: Device for the judge model
- `--num_fewshot`, `-f`: Number of few-shot examples
- `--output_path`, `-o`: Path for saving results
- `--log_samples`, `-s`: Save model outputs and documents
- `--predict_only`, `-x`: Only generate predictions without evaluation
- Basic BM25 Evaluation:
lrage \
--model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B \
--tasks legalbench_tiny \
--batch_size 8 \
--device cuda \
--retrieve_docs \
--retriever pyserini \
--retriever_args retriever_type=bm25,bm25_index_path=msmarco-v1-passage
- Dense Retrieval with Reranking:
lrage \
--model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B \
--tasks legalbench_tiny \
--batch_size 8 \
--device cuda \
--retrieve_docs \
--top_k 3 \
--retriever pyserini \
--retriever_args \
retriever_type=dense,\
bm25_index_path=msmarco-v1-passage,\
faiss_index_path=msmarco-v1-passage.tct_colbert-v2-hnp,\
encoder_path=castorini/tct_colbert-v2-hnp-msmarco,\
encoder_type=tct_colbert \
--rerank \
--reranker rerankers \
--reranker_args reranker_type=colbert
- Evaluation with LLM-as-a-Judge:
Note: LLM-as-a-judge evaluation is only available for tasks that specifically use the 'LLM-Eval' metric in their configuration. Make sure your task is configured to use this metric before applying the judge model.
lrage \
--model hf \
--model_args pretrained=meta-llama/Llama-3.1-8B \
--tasks legalbench_tiny \
--judge_model openai-chat-completions \
--judge_model_args model=gpt-4o-mini \
--retrieve_docs \
--retriever pyserini \
--retriever_args retriever_type=bm25,bm25_index_path=msmarco-v1-passage
For now, you have three options:
- Use Pyserini's prebuilt indexes available out of the box
- Use our prebuilt Pile-of-law-mini indexes
- Create your own index by following Pyserini's indexing documentation (see the corpus-format sketch below)
Note: We will soon share the pre-compiled Pile-of-law BM25 index and a mini-index containing approximately 1/10 of the data. Additionally, we plan to provide pre-compiled indexes for other legal domain document collections that can be used in RAG settings.
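If you build your own index (the third option above), Pyserini's Lucene indexer expects a JsonCollection: a directory of JSONL files in which each line is a JSON object with "id" and "contents" fields. A minimal preparation sketch follows; the file and directory names and the toy corpus are illustrative.

```python
# Convert a document collection into Pyserini's JsonCollection format
# (one JSON object per line with "id" and "contents" fields).
import json
from pathlib import Path

corpus = {  # toy corpus for illustration
    "doc1": "First legal document ...",
    "doc2": "Second legal document ...",
}

out_dir = Path("corpus_jsonl")
out_dir.mkdir(exist_ok=True)
with open(out_dir / "docs.jsonl", "w", encoding="utf-8") as f:
    for doc_id, text in corpus.items():
        f.write(json.dumps({"id": doc_id, "contents": text}) + "\n")

# The directory can then be indexed with Pyserini's CLI, e.g.:
#   python -m pyserini.index.lucene --collection JsonCollection \
#       --input corpus_jsonl --index my_bm25_index \
#       --generator DefaultLuceneDocumentGenerator --threads 4 --storeRaw
```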
- Implement LLM-as-a-judge functionality
- Update pyserini_retriever to support Pyserini prebuilt index
- Develop a GUI Demo for easier access and visualization
- Document more detailed usage instructions
- Publish and share Pile-of-law chunks
- Publish and share Pile-of-law BM25 index
- Publish and share Pile-of-law embeddings
- Implement a simplified indexing feature in GUI
- Publish benchmark results obtained using LRAGE
Contributions and community engagement are welcome! We value your input in making this project better🤗.
@Misc{lrage,
  title = {LRAGE: Legal Retrieval Augmented Generation Evaluation Tool},
  author = {Minhu Park and Hongseok Oh and Wonseok Hwang},
  howpublished = {\url{https://github.com/hoorangyee/LRAGE}},
  year = {2024}
}