The English and French passages for this project are drawn from Wikipedia snapshots of 2022-05-01 and 2022-04-20 respectively, downloaded from the Internet Archive to enable open-domain experiments. The raw dumps can be downloaded from the following URLs:
- https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2
- https://archive.org/download/frwiki-20220420/frwiki-20220420-pages-articles-multistream.xml.bz2
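If you prefer a scripted download, the short Python sketch below fetches both dumps with the standard library (urllib). The target directory is a placeholder, and the helper name download_dumps is ours, not part of the repo:
import urllib.request
from pathlib import Path

DUMP_URLS = [
    "https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2",
    "https://archive.org/download/frwiki-20220420/frwiki-20220420-pages-articles-multistream.xml.bz2",
]

def download_dumps(target_dir):
    """Download both Wikipedia dumps into target_dir, skipping files already present."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in DUMP_URLS:
        dest = out / url.rsplit("/", 1)[-1]
        if dest.exists():
            continue  # already downloaded
        print(f"Downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, str(dest))

# Example (placeholder path):
# download_dumps("/path/to/dir_containing_dumps")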
For the Wikipedia processing pipeline, we adopt the same processing used in the Dense Passage Retriever paper and in "Pre-processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering".
This document describes how to convert the downloaded Wikipedia XML dumps into 100-token passages stored in JSON Lines (jsonl) files.
For processing, we extract the Wikipedia articles into multiple JSON Lines files. The articles are then preprocessed, cleaned, and stored in a SQLite database, after which they are chunked into 100-token passages. The processing pipeline adopted here is the same as the one described in Section 4.1 of the Dense Passage Retriever paper.
The whole pipeline has been bundled into a single script, which you can run with the command below:
bash scripts/generate_process_dumps.sh /path/to/dir_containing_dumps
Alternatively, here is a step-by-step breakdown of the processing pipeline for English:
- Download the dump into a specified directory:
wget https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2 -P /path/to/dir
- Use WikiExtractor (bundled into this repo as a submodule) to extract the Wikipedia articles into multiple JSON Lines files:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor && git checkout e4abb4cbd01
python3 WikiExtractor.py /path/to/your/enwiki-20220501-pages-articles-multistream.xml.bz2 --filter_disambig_pages --json -o /path/to/output/directory -s
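To sanity-check this step, you can peek at a few extracted articles. The sketch below assumes WikiExtractor's usual --json output layout (files named wiki_* under lettered subdirectories, one JSON object per line with id, title, and text fields); the helper name peek_extracted is ours:
import json
from pathlib import Path

def peek_extracted(extracted_dir, n=3):
    """Print the id, title, and the start of the text for the first n extracted articles."""
    shown = 0
    for path in sorted(Path(extracted_dir).rglob("wiki_*")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                article = json.loads(line)
                print(article["id"], "|", article["title"], "|", article["text"][:80].replace("\n", " "))
                shown += 1
                if shown >= n:
                    return

# peek_extracted("/path/to/output/directory")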
- Store the extracted data in a SQLite database:
python3 process/retriever/build_db.py /path/to/preprocessed/data/dir /path/to/db/enwiki-20220501.db
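To verify the database was built, you can query it directly. The sketch below assumes a DrQA-style schema (a single documents table with id and text columns), which build_db.py scripts of this lineage typically create; adjust the table and column names if this repo's script stores additional fields:
import sqlite3

def count_documents(db_path):
    """Count rows in the (assumed) documents table and show one example document id."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        # Assumption: the script creates a table named "documents" with an "id" column.
        cur.execute("SELECT COUNT(*) FROM documents")
        print("documents:", cur.fetchone()[0])
        cur.execute("SELECT id FROM documents LIMIT 1")
        print("example id:", cur.fetchone()[0])
    finally:
        conn.close()

# count_documents("/path/to/db/enwiki-20220501.db")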
- Chunk the articles in the database into 100-token sequences/passages to improve answer extraction:
python3 preprocess/retriever/wikipedia_generate_context_tsv.py --db_path /path/to/db/enwiki-20220501.db --output_path_100w /path/to/tsv/enwiki-20220501.tsv --lang en
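For intuition, here is a minimal sketch of the 100-word windowing this step performs; as in Section 4.1 of the Dense Passage Retriever paper, each article is split into disjoint blocks of 100 words. This is an illustration using naive whitespace tokenization, not the script itself:
def chunk_into_passages(text, words_per_passage=100):
    """Split an article's text into consecutive, non-overlapping ~100-word passages."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

# passages = chunk_into_passages(article_text)  # article_text: full text of one article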
- (Optional) Shard the data into multiple jsonl files to make indexing easy:
from utils import shard_tsv_data

shard_tsv_data(tsv_file_path="/path/to/tsv/enwiki-20220501.tsv", output_file_path="/path/to/jsonl_shards", shard_size="1GB")
This produces multiple jsonl files of roughly 1GB each.
To view the processed files:
head -1 /path/to/jsonl_shards/docs-000.jsonl
Output:
{"docid":809223,"text":" The hockey rink's dimensions are × , ...", "title": "Bolshoy Ice Dome"}