This repository hosts the data and code of the paper: Analysing The Impact of Sequence Composition on Language Model Pre-Training
Download the SlimPajama corpus:

```bash
bash ./scripts/download_slimpajama.sh
```
Decompress SlimPajama and split it into subsets according to the meta-information of each document:

```bash
export PYTHONPATH="./"
python ./preprocessing/split_to_subsets.py
```
Create the pre-training corpus:

```bash
export PYTHONPATH="./"
python ./preprocessing/create_corpus.py
```

Each subset is split into several files, as defined by `SUBSET_SPLIT_NUMS` in `project_config.py`. Each file is saved as `./data/SlimPajama-150B/[subset_name]/[subset_name]_chunk[file_idx]_processed.jsonl`.
Construct MixChunk pre-training data:

```bash
python ./save_offline_dataset.py --packing_strategy=mixchunk
```

The resulting data is saved in `./data/offline_datasets/mixchunk`.
Construct UniChunk pre-training data:

```bash
python ./save_offline_dataset.py --packing_strategy=unichunk
```
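For intuition, chunk packing can be sketched as concatenating tokenised documents into one stream and cutting it into fixed-length sequences; the packing strategies differ mainly in which documents end up in the same stream. This is only a minimal illustration, not the repository's implementation:

```python
from typing import List

def pack_into_chunks(docs: List[List[int]], chunk_size: int) -> List[List[int]]:
    """Concatenate tokenised documents and split the stream into
    fixed-length chunks; documents may be cut at chunk boundaries."""
    stream = [tok for doc in docs for tok in doc]
    # Drop the trailing partial chunk, as is common in pre-training pipelines.
    n_chunks = len(stream) // chunk_size
    return [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]

# Toy example with three "documents" of token ids.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack_into_chunks(docs, 4))  # → [[1, 2, 3, 4], [5, 6, 7, 8]]
```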
BM25 retrieval is based on the Retriv library.

Build the index:

```bash
python build_bm25_index.py
```

This builds a BM25 index for each file independently. Each index is saved in `./data/bm25index/collections/[subset_name]_[file_idx]`.

The retrieval strategy is implemented in `retriv_bm25.py` and `retrieval_packing.py`.
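For intuition about the retrieval step, here is a minimal self-contained Okapi BM25 scorer. It only illustrates the scoring formula and is independent of the Retriv index the repository actually uses:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against a tokenised query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

docs = [["language", "model", "pretraining"],
        ["bm25", "retrieval", "index"],
        ["sequence", "composition"]]
print(bm25_scores(["retrieval", "index"], docs))  # the second document scores highest
```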
Construct BM25Chunk on a single host:

```bash
python ./save_offline_dataset.py --packing_strategy=bm25chunk
```

Or construct BM25Chunk for each file separately by running:

```bash
python ./save_offline_dataset.py \
    --packing_strategy=bm25chunk \
    --bm25chunk_onefile \
    --subset_name=RedPajamaWikipedia \
    --file_idx=0
```
This example constructs BM25Chunk for a single file; such per-file construction tasks can be distributed across different CPU cores and hosts. Each `subset_name` and its total number of split files are defined in `project_config.py`.
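For example, the per-file jobs can be enumerated and then dispatched to different workers. The subset names and file counts below are hypothetical stand-ins for the real values in `project_config.py`:

```python
# Hypothetical subset -> number-of-files mapping; the real values come from
# SUBSET_SPLIT_NUMS in project_config.py.
SUBSET_SPLIT_NUMS = {"RedPajamaWikipedia": 2, "RedPajamaBook": 3}

# Build one command per (subset, file) pair.
commands = [
    f"python ./save_offline_dataset.py --packing_strategy=bm25chunk "
    f"--bm25chunk_onefile --subset_name={subset} --file_idx={idx}"
    for subset, n_files in SUBSET_SPLIT_NUMS.items()
    for idx in range(n_files)
]
for cmd in commands:
    print(cmd)  # e.g. pipe these into GNU parallel or a job scheduler
```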
After constructing BM25Chunk for all files, combine the data by running:

```bash
python ./save_offline_dataset.py --packing_strategy=bm25chunk --combine_data
```
Use the pre-trained models released on Hugging Face.
Download the evaluation datasets:

```bash
python ./scripts/download_eval_data.py
```
Reading comprehension and retrieval-augmented generation:

```bash
cd ./evaluation
bash ./mrc.sh
```

Knowledge memorisation:

```bash
cd ./evaluation
bash ./cbqa.sh
```

In-context learning:

```bash
cd ./evaluation
bash ./icl.sh
```
Calculate the Zipf's coefficient of token frequency: `./analysis/burstiness.py`

Visualise the distraction proportion: `./analysis/distraction.py`
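As a sketch of the first analysis (not the repository's `burstiness.py`), Zipf's coefficient can be estimated as the negative slope of a least-squares fit of log frequency against log rank:

```python
import math
from collections import Counter

def zipf_coefficient(tokens):
    """Estimate Zipf's coefficient: the negative slope of the least-squares
    line fitting log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Toy corpus whose frequencies roughly follow Zipf's law (f ∝ 1/rank),
# so the estimated coefficient should be close to 1.
tokens = ["the"] * 100 + ["of"] * 50 + ["and"] * 33 + ["to"] * 25
print(zipf_coefficient(tokens))
```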
@inproceedings{zhao-etal-2024-analysing,
title = "Analysing The Impact of Sequence Composition on Language Model Pre-Training",
author = "Zhao, Yu and
Qu, Yuanbin and
Staniszewski, Konrad and
Tworkowski, Szymon and
Liu, Wei and
Mi{\l}o{\'s}, Piotr and
Wu, Yuxiang and
Minervini, Pasquale",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.427",
pages = "7897--7912",
}