The English and French passages for this project are drawn from Wikipedia snapshots of 2022-05-01 and 2022-04-20 respectively, downloaded from the Internet Archive to enable open-domain experiments. The raw dumps can be downloaded from the following URLs:
- https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2
- https://archive.org/download/frwiki-20220420/frwiki-20220420-pages-articles-multistream.xml.bz2
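If you prefer a scripted download, the short Python sketch below fetches both dumps with the standard library (urllib). The target directory is a placeholder, and the helper name download_dumps is ours, not part of the repo:
import urllib.request
from pathlib import Path

DUMP_URLS = [
    "https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2",
    "https://archive.org/download/frwiki-20220420/frwiki-20220420-pages-articles-multistream.xml.bz2",
]

def download_dumps(target_dir):
    """Download both Wikipedia dumps into target_dir, skipping files already present."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in DUMP_URLS:
        dest = out / url.rsplit("/", 1)[-1]
        if dest.exists():
            continue  # already downloaded
        print(f"Downloading {url} -> {dest}")
        urllib.request.urlretrieve(url, str(dest))

# Example (placeholder path):
# download_dumps("/path/to/dir_containing_dumps")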
For the Wikipedia processing pipeline, we adopt the same processing used in the Dense Passage Retriever paper and in "Pre-processing Matters! Improved Wikipedia Corpora for Open-Domain Question Answering".
This document describes how to convert the downloaded Wikipedia XML dumps into 100-token passages stored in JSON Lines (jsonl) files.
For processing, we extract the Wikipedia articles into multiple JSON Lines files. The articles are then preprocessed, cleaned, and stored in a SQLite database, after which they are chunked into 100-token passages. The processing pipeline adopted here is the same as the one described in Section 4.1 of the Dense Passage Retriever paper.
The whole pipeline has been bundled into a single script, which you can run with the command below:
bash scripts/generate_process_dumps.sh /path/to/dir_containing_dumps
Alternatively, here is a step-by-step breakdown of the processing pipeline for English:
- Download the dump into a specified directory:
wget https://archive.org/download/enwiki-20220501/enwiki-20220501-pages-articles-multistream.xml.bz2 -P /path/to/dir
- Use WikiExtractor (bundled into this repo as a submodule) to extract the Wikipedia articles into multiple JSON Lines files:
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor && git checkout e4abb4cbd01
python3 WikiExtractor.py /path/to/your/enwiki-20220501-pages-articles-multistream.xml.bz2 --filter_disambig_pages --json -o /path/to/output/directory -s
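To sanity-check this step, you can peek at a few extracted articles. The sketch below assumes WikiExtractor's usual --json output layout (files named wiki_* under lettered subdirectories, one JSON object per line with id, title, and text fields); the helper name peek_extracted is ours:
import json
from pathlib import Path

def peek_extracted(extracted_dir, n=3):
    """Print the id, title, and the start of the text for the first n extracted articles."""
    shown = 0
    for path in sorted(Path(extracted_dir).rglob("wiki_*")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                article = json.loads(line)
                print(article["id"], "|", article["title"], "|", article["text"][:80].replace("\n", " "))
                shown += 1
                if shown >= n:
                    return

# peek_extracted("/path/to/output/directory")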
- Store the extracted data in a SQLite database:
python3 process/retriever/build_db.py /path/to/preprocessed/data/dir /path/to/db/enwiki-20220501.db
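To verify the database was built, you can query it directly. The sketch below assumes a DrQA-style schema (a single documents table with id and text columns), which build_db.py scripts of this lineage typically create; adjust the table and column names if this repo's script stores additional fields:
import sqlite3

def count_documents(db_path):
    """Count rows in the (assumed) documents table and show one example document id."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        # Assumption: the script creates a table named "documents" with an "id" column.
        cur.execute("SELECT COUNT(*) FROM documents")
        print("documents:", cur.fetchone()[0])
        cur.execute("SELECT id FROM documents LIMIT 1")
        print("example id:", cur.fetchone()[0])
    finally:
        conn.close()

# count_documents("/path/to/db/enwiki-20220501.db")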
- Chunk the articles in the database into 100-token sequences/passages to improve answer extraction:
python3 preprocess/retriever/wikipedia_generate_context_tsv.py --db_path /path/to/db/enwiki-20220501.db --output_path_100w /path/to/tsv/enwiki-20220501.tsv --lang en
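For intuition, here is a minimal sketch of the 100-word windowing this step performs; as in Section 4.1 of the Dense Passage Retriever paper, each article is split into disjoint blocks of 100 words. This is an illustration using naive whitespace tokenization, not the script itself:
def chunk_into_passages(text, words_per_passage=100):
    """Split an article's text into consecutive, non-overlapping ~100-word passages."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

# passages = chunk_into_passages(article_text)  # article_text: full text of one article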
- (Optional) Shard the data into multiple jsonl files to make indexing easy:
from utils import shard_tsv_data

shard_tsv_data(tsv_file_path="/path/to/tsv/enwiki-20220501.tsv", output_file_path="/path/to/jsonl_shards", shard_size="1GB")
This produces multiple jsonl files of roughly 1GB each.
To view the processed files:
head -1 /path/to/jsonl_shards/docs-000.jsonl
Output:
{"docid":809223,"text":" The hockey rink's dimensions are × , ...", "title": "Bolshoy Ice Dome"}