This repo contains the code for reproducing the results in our paper and for using our models and tokenizers in your own tasks. Model checkpoints are available at: https://huggingface.co/thunlp/SubCharTokenization/tree/main. The Huggingface repo contains only the model checkpoints; the config and tokenizer files are in this repo and can be loaded locally.
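For quick local inspection, a downloaded checkpoint can be opened along the following lines; this is only a sketch: the checkpoint path mirrors the finetuning examples below, and the internal layout of the .pt file is not guaranteed, so inspect it before wiring it into your own code.

# Hedged sketch: open a downloaded checkpoint locally and inspect its contents.
# The path follows the finetuning examples below; the internal layout of the .pt file
# is an assumption here, so check the printed keys before relying on them.
import torch

state = torch.load("checkpoints/checkpoints_pinyin_zh_22675/ckpt_8804.pt", map_location="cpu")
if isinstance(state, dict):
    print(list(state.keys())[:10])  # e.g. a state_dict or a wrapper dict, depending on how it was saved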
Before training SubChar tokenizers, you first need to transliterate (i.e., encode) the Chinese characters. You can use the script data/convert_corpus.py to do so; note that you should specify in the script which transliteration method you want to use (e.g., Pinyin, Wubi).
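For intuition, Pinyin transliteration looks roughly like the sketch below; it uses the third-party pypinyin package purely as an illustration, and the actual mapping implemented in data/convert_corpus.py (including the Wubi option) may differ.

# Illustration only: map Chinese characters to tone-numbered Pinyin with pypinyin.
# This is not the repo's converter; data/convert_corpus.py defines the mappings actually used.
from pypinyin import lazy_pinyin, Style

text = "自然语言处理"
print(" ".join(lazy_pinyin(text, style=Style.TONE3)))  # "zi4 ran2 yu3 yan2 chu3 li3"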
We use the SentencePiece library to train the tokenizers. The script we used is tokenizers/train_sp.py, which also contains all the hyper-parameters we used. Notably, we used a vocab size of 22675 and a character coverage of 1.00 for training the SubChar tokenizers. The default subword tokenization algorithm is unigram LM; specify model_type="bpe" if you want to use the byte pair encoding implementation instead.
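As a minimal sketch of that training setup (the authoritative flags are in tokenizers/train_sp.py; the paths and model prefix below just reuse names that appear elsewhere in this README):

# Minimal SentencePiece training sketch with the hyper-parameters stated above.
# The input must be the transliterated corpus, one document per line (see the data section below).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="wubi_corpus/formatted/baidubaike_corpus.txt",
    model_prefix="wubi_zh_22675",
    vocab_size=22675,
    character_coverage=1.0,
    model_type="unigram",  # use model_type="bpe" for the byte pair encoding variant
)
# Writes wubi_zh_22675.model and wubi_zh_22675.vocab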
Our pretraining data comes from Baidu Baike and consists of 2.2 GB of raw text. You can download the raw text data from this link if you want to reproduce the results in the paper. Alternatively, you can use any other pretraining corpus; in that case, format the file by putting one document per line. The steps below assume the pretraining file is stored at wubi_corpus/formatted/baidubaike_corpus.txt (if not, substitute the directory in the data processing script).
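If your corpus is not yet in that format, a small helper along these lines can produce it (the raw_docs/ input directory is a placeholder, not something this repo provides):

# Hypothetical helper: collapse each raw document into a single line of the pretraining file.
# "raw_docs/" is a placeholder; adapt the input layout to wherever your raw text lives.
import glob

with open("wubi_corpus/formatted/baidubaike_corpus.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("raw_docs/*.txt")):  # assumes one raw document per file
        with open(path, encoding="utf-8") as f:
            doc = " ".join(line.strip() for line in f if line.strip())
        if doc:
            out.write(doc + "\n")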
Run bash data/create_datasets_from_start.sh. It performs two data processing steps: 1) data sharding, and 2) creating HDF5 files. Note that we follow a two-stage pretraining pipeline in which the first stage uses a max sequence length of 128 and the second stage a max sequence length of 512. You should also specify in that script the tokenizer vocab and model files to use for tokenizing the corpus, and make sure that you are using the correct tokenizer class in create_pretraining_data.py (lines 458-463) in the HDF5 creation step.
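Before creating the HDF5 files, it can help to sanity-check that the vocab/model pair you point the script at actually loads and tokenizes transliterated text, for example (the sample string is just illustrative tone-numbered Pinyin, not output of the repo's converter):

# Optional sanity check: make sure the SentencePiece model the data script will use loads correctly.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizers/pinyin_zh_22675.model")
print(sp.get_piece_size())  # expected to match the 22675-entry vocab
print(sp.encode("zi4 ran2 yu3 yan2 chu3 li3", out_type=str))  # sample transliterated text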
Once you have the processed data (the HDF5 files for the two stages of pretraining), run bash scripts/run_pretraining.sh to start pretraining. The default hyper-parameters in the script are the ones used in the paper; you can adjust them based on your own needs. Note that we used 8 A100 GPUs for pretraining, so you should adjust the batch sizes if you are using other GPUs.
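When scaling to a different GPU setup, a common rule of thumb (not something the script enforces) is to keep the effective batch size roughly constant by trading per-GPU batch size against gradient accumulation:

# Rule-of-thumb arithmetic, not repo-specific:
# effective batch size = per-GPU batch size * gradient accumulation steps * number of GPUs.
def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    return per_gpu_batch * grad_accum_steps * num_gpus

# Example: halving the GPU count while doubling gradient accumulation keeps it unchanged.
assert effective_batch_size(64, 1, 8) == effective_batch_size(64, 2, 4)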
You can run one of the following Python scripts for finetuning, depending on which task you want to finetune on. Note that different tasks/scripts may require different arguments.
run_glue.py: classification tasks such as TNews, IFlytek, OCNLI, etc.
run_multichoice_mrc.py: CHID
run_ner.py: CLUENER
run_{cmrc, drcd, c3}.py: CMRC, DRCD, or C3
For example, to finetune on TNews using the Pinyin tokenizer:
python3 run_glue.py \
--task_name=tnews \
--train_dir=datasets/tnews/split \
--dev_dir=datasets/tnews/split \
--test_dir=datasets/tnews/split \
--do_train --do_eval --do_test \
--init_checkpoint=checkpoints/checkpoints_pinyin_zh_22675/ckpt_8804.pt \
--output_dir=logs/tnews/pinyin/ckpt_8804 \
--tokenizer_type=CommonZh \
--vocab_file=tokenizers/pinyin_zh_22675.vocab \
--vocab_model_file=tokenizers/pinyin_zh_22675.model \
--config_file=configs/bert_config_vocab22675.json \
--epochs=6
As another example, to finetune on CMRC using the Wubi tokenizer:
python3 run_cmrc.py \
--data_dir=datasets/cmrc/split \
--init_checkpoint=checkpoints/checkpoints_wubi_zh_22675/ckpt_8804.pt \
--config_file=configs/bert_config_vocab22675.json \
--tokenizer_type=CommonZh \
--vocab_file=tokenizers/wubi_zh_22675.vocab \
--vocab_model_file=tokenizers/wubi_zh_22675.model \
--output_dir=logs/cmrc/wubi_twolevel/ckpt_8804 \
--do_train --do_test \
--two_level_embeddings \
--epochs=6
Please consider citing our work if you find this code or our paper helpful to your research.
@article{Si2021SubChar,
  author  = {Chenglei Si and Zhengyan Zhang and Yingfa Chen and Fanchao Qi and Xiaozhi Wang and Zhiyuan Liu and Yasheng Wang and Qun Liu and Maosong Sun},
  journal = {Transactions of the Association for Computational Linguistics},
  year    = {2023},
  title   = {{Sub-Character Tokenization for Chinese Pretrained Language Models}}
}