Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration

KID-22/Cocktail


🌟Paper (Findings of ACL 2024)🌟: http://arxiv.org/abs/2405.16546
🌟Datasets and Checkpoints🌟: https://huggingface.co/IR-Cocktail

Introduction

Cocktail is a comprehensive benchmark designed to evaluate Information Retrieval (IR) models amid the evolving landscape of AI-generated content (AIGC). In the era of Large Language Models (LLMs), the traditional IR corpus, previously composed solely of human-written texts, has expanded to include a significant proportion of LLM-generated content. Cocktail responds to this transformation by providing a robust framework for assessing the performance and bias of IR models on mixed corpora.

Features

  • Comprehensive Dataset Collection: Cocktail comprises 15 existing IR datasets in a standard format, diversified across a range of text retrieval tasks and domains, each enriched with an LLM-generated corpus using Llama2.

  • Up-to-Date Evaluation Dataset: Introducing Natural Question Up-To-Date (NQ-UTD), a dataset featuring queries derived from the latest events, specifically designed to test the responsiveness of LLM-based IR models to new information not included in their pre-training data.

  • Easy-to-use Evaluation Tool: Cocktail includes a user-friendly evaluation tool that simplifies the process of assessing various IR models on the benchmarked datasets. The tool is designed for adaptability, allowing new models and datasets to be integrated easily, so that researchers and developers can efficiently evaluate the performance and bias of their IR systems.

File Structure

.
├── dataset  # * dataset path
│   ├── climate-fever
│   ├── cqadupstack
│   ├── ...
│   ├── trec-covid
│   └── webis-touche2020
└── benchmark  # * evaluation benchmark
    ├── beir  # * required code from beir
    ├── evaluate  # * code for evaluation
    │   ├── rerank  # * code for re-rankers
    │   ├── retrieval  # * code for retrievers
    │   └── utils  # * code for different evaluation settings
    └── shell  # * scripts for quick evaluation

Quick Start

We provide detailed scripts for all the benchmarked models in the folder benchmark/shell. Taking neural retrieval models as an example, you can reproduce our results with the following script:

GPU=0
batch_size=128
for dataset in "msmarco" "dl19" "dl20" "trec-covid" "nfcorpus" "nq" "hotpotqa" "fiqa" "webis-touche2020" "cqadupstack" "dbpedia-entity" "scidocs" "fever" "climate-fever" "nq-utd"
do
    for model in "bert" "roberta" "tasb" "contriever" "dragon" "cocondenser" "ance" "retromae"
    do
        mkdir -p ./log/${dataset}/${model}/

        # human-written corpus only
        CUDA_VISIBLE_DEVICES=$GPU python evaluate/retrieval/${model}.py \
        --k_values 1 2 3 4 5 6 7 8 9 10 100 \
        --corpus_list human \
        --save_results=1 \
        --dataset=${dataset} \
        --batch_size=${batch_size} \
        > ./log/${dataset}/${model}/human.log 2>&1

        # LLM-generated corpus only
        CUDA_VISIBLE_DEVICES=$GPU python evaluate/retrieval/${model}.py \
        --k_values 1 2 3 4 5 6 7 8 9 10 100 \
        --corpus_list llama-2-7b-chat-tmp0.2 \
        --save_results=1 \
        --dataset=${dataset} \
        --batch_size=${batch_size} \
        > ./log/${dataset}/${model}/llama2.log 2>&1

        # mixed corpus (human-written + LLM-generated)
        CUDA_VISIBLE_DEVICES=$GPU python evaluate/retrieval/${model}.py \
        --k_values 1 2 3 4 5 6 7 8 9 10 100 \
        --corpus_list human llama-2-7b-chat-tmp0.2 \
        --save_results=1 \
        --dataset=${dataset} \
        --batch_size=${batch_size} \
        > ./log/${dataset}/${model}/human_llama2.log 2>&1

        # mixed corpus, with metrics computed for the corpora given in --target_list
        CUDA_VISIBLE_DEVICES=$GPU python evaluate/retrieval/${model}.py \
        --k_values 1 2 3 4 5 6 7 8 9 10 100 \
        --corpus_list human llama-2-7b-chat-tmp0.2 \
        --target_list human llama-2-7b-chat-tmp0.2 \
        --save_results=1 \
        --dataset=${dataset} \
        --batch_size=${batch_size} \
        > ./log/${dataset}/${model}/human+llama2.log 2>&1
    done
done

Our evaluation tool supports a variety of customized assessments, including the integration of corpora from different sources and the computation of metrics for specific target corpora. For further customization options, please refer to the code in the benchmark/evaluate folder.
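
As a rough illustration of what mixing corpora and scoring a single target corpus can look like, here is a minimal Python sketch. The function names, the `source:doc_id` id scheme, and the toy ranking are assumptions made for illustration only; they are not the API of the evaluate folder, and the metric shown (recall@k restricted to one source) is just one possible choice.

```python
from typing import Dict, List, Set


def merge_corpora(corpora: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Merge corpora from several sources into one, prefixing ids with the source name."""
    merged: Dict[str, str] = {}
    for source, docs in corpora.items():
        for doc_id, text in docs.items():
            merged[f"{source}:{doc_id}"] = text
    return merged


def recall_at_k(ranking: List[str], relevant_ids: Set[str], source: str, k: int) -> float:
    """Fraction of relevant documents from `source` that appear in the top-k of `ranking`."""
    targets = {f"{source}:{doc_id}" for doc_id in relevant_ids}
    if not targets:
        return 0.0
    hits = sum(1 for doc_id in ranking[:k] if doc_id in targets)
    return hits / len(targets)


# Toy data (hypothetical): each human-written document has an LLM-generated counterpart.
corpora = {
    "human": {"d1": "original text 1", "d2": "original text 2"},
    "llm": {"d1": "rewritten text 1", "d2": "rewritten text 2"},
}
merged = merge_corpora(corpora)

# A hypothetical ranking returned by some retriever over the mixed corpus.
ranking = ["llm:d1", "human:d2", "human:d1", "llm:d2"]
relevant = {"d1"}

human_recall = recall_at_k(ranking, relevant, "human", k=2)
llm_recall = recall_at_k(ranking, relevant, "llm", k=2)
print(human_recall, llm_recall)
```

In this toy ranking the LLM-generated version of the relevant document is placed above the human-written one, so the per-source scores at k=2 differ. Comparing such per-source scores is one simple way to look for source bias.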

Available Datasets

All 16 benchmarked datasets in Cocktail are listed in the following table and are available on HuggingFace (https://huggingface.co/IR-Cocktail).

| Dataset | Raw Website | Cocktail Download | Cocktail-Name | md5 for Processed Data | Domain | Relevancy | # Test Query | # Corpus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MS MARCO | Homepage | Homepage | msmarco | 985926f3e906fadf0dc6249f23ed850f | Misc. | Binary | 6,979 | 542,203 |
| DL19 | Homepage | Homepage | dl19 | d652af47ec0e844af43109c0acf50b74 | Misc. | Binary | 43 | 542,203 |
| DL20 | Homepage | Homepage | dl20 | 3afc48141dce3405ede2b6b937c65036 | Misc. | Binary | 54 | 542,203 |
| TREC-COVID | Homepage | Homepage | trec-covid | 1e1e2264b623d9cb7cb50df8141bd535 | Bio-Medical | 3-level | 50 | 128,585 |
| NFCorpus | Homepage | Homepage | nfcorpus | 695327760647984c5014d64b2fee8de0 | Bio-Medical | 3-level | 323 | 3,633 |
| NQ | Homepage | Homepage | nq | a10bfe33efdec54aafcc974ac989c338 | Wikipedia | Binary | 3,446 | 104,194 |
| HotpotQA | Homepage | Homepage | hotpotqa | 74467760fff8bf8fbdadd5094bf9dd7b | Wikipedia | Binary | 7,405 | 111,107 |
| FiQA-2018 | Homepage | Homepage | fiqa | 4e1e688539b0622630fb6e65d39d26fa | Finance | Binary | 648 | 57,450 |
| Touché-2020 | Homepage | Homepage | webis-touche2020 | d58ec465ccd567d8f75edb419b0faaed | Misc. | 3-level | 49 | 101,922 |
| CQADupStack | Homepage | Homepage | cqadupstack | d48d963bc72689c765f381f04fc26f8b | StackEx. | Binary | 1,563 | 39,962 |
| DBPedia | Homepage | Homepage | dbpedia-entity | 43292f4f1a1927e2e323a4a7fa165fc1 | Wikipedia | 3-level | 400 | 145,037 |
| SCIDOCS | Homepage | Homepage | scidocs | 4058c0915594ab34e9b2b67f885c595f | Scientific | Binary | 1,000 | 25,259 |
| FEVER | Homepage | Homepage | fever | 98b631887d8c38772463e9633c477c69 | Wikipedia | Binary | 6,666 | 114,529 |
| Climate-FEVER | Homepage | Homepage | climate-fever | 5734d6ac34f24f5da496b27e04ff991a | Wikipedia | Binary | 1,535 | 101,339 |
| SciFact | Homepage | Homepage | scifact | b5b8e24ccad98c9ca959061af14bf833 | Scientific | Binary | 300 | 5,183 |
| NQ-UTD | Homepage | Homepage | nq-utd | 2e12e66393829cd4be715718f99d2436 | Misc. | 3-level | 80 | 800 |

To verify a downloaded file, generate its MD5 hash in a terminal with md5sum filename.zip and compare the result with the value listed in the table.
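
On systems without md5sum, the same check can be done with Python's standard library. The sketch below is one possible way to do it; the helper names `md5_of_file` and `verify_download` and the archive name in the usage comment are illustrative assumptions, not part of the Cocktail tooling.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Usage (hypothetical file name; the expected hash is the nq-utd entry from the table):
# verify_download("nq-utd.zip", "2e12e66393829cd4be715718f99d2436")
```

A result of False means the file is corrupted, incomplete, or not the archive the table refers to, and it should be downloaded again.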

Checkpoints

We also provide some checkpoints trained with train_msmarco_v3.py in BEIR. Please see the following table:

| Model | PLM | Pooling Strategy | Download |
| --- | --- | --- | --- |
| bert-base-uncased-mean-v3-msmarco | bert-base-uncased | mean | Link |
| bert-base-uncased-cls-v3-msmarco | bert-base-uncased | cls | Link |
| bert-base-uncased-last-v3-msmarco | bert-base-uncased | last | Link |
| bert-base-uncased-max-v3-msmarco | bert-base-uncased | max | Link |
| bert-base-uncased-weightedmean-v3-msmarco | bert-base-uncased | weighted-mean | Link |
| bert-mini-mean-v3-msmarco | bert-mini | mean | Link |
| bert-small-mean-v3-msmarco | bert-small | mean | Link |
| bert-large-uncased-mean-v3-msmarco | bert-large-uncased | mean | Link |
| roberta-base-mean-v3-msmarco | roberta-base | mean | Link |
| robreta-base-cls-v3-msmarco | roberta-base | cls | Link |
| robreta-base-last-v3-msmarco | roberta-base | last | Link |
| robreta-base-max-v3-msmarco | roberta-base | max | Link |
| robreta-base-weightedmean-v3-msmarco | roberta-base | weighted-mean | Link |
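
For readers unfamiliar with the pooling labels in the table, the following pure-Python sketch shows what each strategy conventionally computes when per-token embeddings are collapsed into one text embedding. It is an illustration under assumptions, not the checkpoints' actual implementation: the `pool` function is hypothetical, padding tokens are assumed to be removed already, and the weighted-mean variant assumes position weights 1..n (later tokens weigh more), a common convention that may differ from what was used in training.

```python
from typing import List

Vector = List[float]


def pool(token_embeddings: List[Vector], strategy: str) -> Vector:
    """Collapse per-token embeddings (padding already removed) into a single vector."""
    if not token_embeddings:
        raise ValueError("token_embeddings must not be empty")
    n = len(token_embeddings)
    dims = range(len(token_embeddings[0]))
    if strategy == "cls":
        # Use the embedding of the first token ([CLS] in BERT-style models).
        return list(token_embeddings[0])
    if strategy == "last":
        # Use the embedding of the last non-padding token.
        return list(token_embeddings[-1])
    if strategy == "max":
        # Element-wise maximum over all tokens.
        return [max(tok[d] for tok in token_embeddings) for d in dims]
    if strategy == "mean":
        # Element-wise average over all tokens.
        return [sum(tok[d] for tok in token_embeddings) / n for d in dims]
    if strategy == "weighted-mean":
        # Assumed convention: token at position i (1-based) gets weight i.
        weights = range(1, n + 1)
        total = sum(weights)
        return [
            sum(w * tok[d] for w, tok in zip(weights, token_embeddings)) / total
            for d in dims
        ]
    raise ValueError(f"unknown pooling strategy: {strategy}")


# Three toy token embeddings of dimension 2.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
for name in ("cls", "last", "max", "mean", "weighted-mean"):
    print(name, pool(tokens, name))
```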

Reference

The Cocktail benchmark is built based on the following projects:

Citation

If you find our benchmark or work useful for your research, please cite our work.

@article{dai2024cocktail,
  title={Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration},
  author={Dai, Sunhao and Liu, Weihao and Zhou, Yuqi and Pang, Liang and Ruan, Rongju and Wang, Gang and Dong, Zhenhua and Xu, Jun and Wen, Ji-Rong},
  journal={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}

@article{dai2024neural,
  title={Neural Retrievers are Biased Towards LLM-Generated Content},
  author={Dai, Sunhao and Zhou, Yuqi and Pang, Liang and Liu, Weihao and Hu, Xiaolin and Liu, Yong and Zhang, Xiao and Wang, Gang and Xu, Jun},
  journal={Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  year={2024}
}

License

The proposed NQ-UTD dataset uses the MIT license. All data and code in this project can only be used for academic purposes.