Pretraining

usage: pretrain.py [-h] [--dataset_path DATASET_PATH]
                   [--vocab_path VOCAB_PATH] [--spm_model_path SPM_MODEL_PATH]
                   [--pretrained_model_path PRETRAINED_MODEL_PATH]
                   --output_model_path OUTPUT_MODEL_PATH
                   [--config_path CONFIG_PATH] [--total_steps TOTAL_STEPS]
                   [--save_checkpoint_steps SAVE_CHECKPOINT_STEPS]
                   [--report_steps REPORT_STEPS]
                   [--accumulation_steps ACCUMULATION_STEPS]
                   [--batch_size BATCH_SIZE]
                   [--instances_buffer_size INSTANCES_BUFFER_SIZE]
                   [--dropout DROPOUT] [--seed SEED] [--embedding {bert,word}]
                   [--encoder {bert,lstm,gru,cnn,gatedcnn,attn,synt,rcnn,crnn,gpt,bilstm}]
                   [--bidirectional] [--target {bert,lm,cls,mlm,bilm}]
                   [--tie_weights] [--factorized_embedding_parameterization]
                   [--parameter_sharing] [--span_masking]
                   [--span_geo_prob SPAN_GEO_PROB]
                   [--span_max_length SPAN_MAX_LENGTH]
                   [--learning_rate LEARNING_RATE] [--warmup WARMUP]
                   [--beta1 BETA1] [--beta2 BETA2] [--fp16]
                   [--fp16_opt_level {O0,O1,O2,O3}] [--world_size WORLD_SIZE]
                   [--gpu_ranks GPU_RANKS [GPU_RANKS ...]]
                   [--master_ip MASTER_IP] [--backend {nccl,gloo}]

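Pretraining requires the encoder and the target of the model to be specified explicitly. UER-py provides the following encoder modules:

lstm: long short-term memory (LSTM)
gru: gated recurrent unit (GRU)
bilstm: bidirectional LSTM (this differs from --encoder lstm combined with --bidirectional; see this [issue](https://github.com/pytorch/pytorch/issues/4930) for details)
gatedcnn: gated convolutional neural network (GatedCNN)
bert: attention network with a fully visible mask (as used in BERT)
gpt: attention network with a forward-visible (causal) mask, where each position sees only itself and earlier positions (as used in GPT)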
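
The difference between the fully visible mask and the forward-visible mask can be illustrated with a small sketch. This is not UER-py code; the function names are hypothetical and the masks are plain nested lists, where 1 means "position i may attend to position j".

```python
def fully_visible_mask(seq_length):
    """BERT-style mask: every position may attend to every position."""
    return [[1] * seq_length for _ in range(seq_length)]


def causal_mask(seq_length):
    """GPT-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_length)]
            for i in range(seq_length)]


# Print both masks for a 4-token sequence, one row per query position.
for name, mask in (("fully visible", fully_visible_mask(4)),
                   ("causal", causal_mask(4))):
    print(name)
    for row in mask:
        print(" ", row)
```

The causal mask is lower triangular, which is what lets a GPT-style encoder be trained to predict the next token without seeing it.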

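The target must match the one used in the preprocessing stage. Users can try different combinations of encoder and target through --encoder and --target.
--config_path specifies the path to the configuration file, which sets the hyperparameters of the pretraining model. Commonly used configuration files are in the models folder. Users should choose the configuration file that matches the encoder in use.
--instances_buffer_size specifies the size of the in-memory instance buffer used during pretraining.
--tie_weights ties the word embedding weights to the softmax (output layer) weights.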
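
Weight tying, the idea behind --tie_weights, can be sketched in plain Python. This is an illustration of the general technique only, not the UER-py implementation: all names below are hypothetical, and the matrices are nested lists rather than tensors.

```python
import random


def make_embedding(vocab_size, hidden_size, seed=0):
    """Create a small random embedding matrix of shape [vocab_size][hidden_size]."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
            for _ in range(vocab_size)]


def embed(embedding, token_id):
    """Look up the vector of one token."""
    return embedding[token_id]


def output_logits(output_weights, hidden):
    """Score every vocabulary entry as a dot product with the hidden vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]


embedding = make_embedding(vocab_size=5, hidden_size=3)
# Tied weights: the output layer reuses the very same matrix object,
# so any update to the embedding is also an update to the output layer.
output_weights = embedding

hidden = embed(embedding, 2)
print(output_logits(output_weights, hidden))
```

Because both layers share one matrix, the model stores vocab_size × hidden_size fewer parameters than it would with a separate output layer.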

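There are two strategies for initializing parameters in pretraining: 1) random initialization; 2) loading a pretrained model.

Random initialization

Pretraining on a single machine with CPU: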

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --output_model_path models/output_model.bin \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

The pretraining input is specified by --dataset_path. Pretraining on a single machine with a single GPU (GPU ID 3):

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --output_model_path models/output_model.bin --gpu_ranks 3 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

Pretraining on a single machine with 8 GPUs:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

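--world_size specifies the total number of processes (equal to the number of GPUs) used for training.
--gpu_ranks assigns a unique ID to each process/GPU.
To use specific GPUs, set CUDA_VISIBLE_DEVICES to control which GPUs are visible to the program: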

CUDA_VISIBLE_DEVICES=1,2,3,5 python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                                                 --output_model_path models/output_model.bin --world_size 4 --gpu_ranks 0 1 2 3 \
                                                 --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

Because only 4 GPUs are used, --world_size is set to 4, and the IDs of these 4 processes/GPUs, given by --gpu_ranks, run from 0 to 3.
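
The relationship between CUDA_VISIBLE_DEVICES and --gpu_ranks in the single-machine case can be sketched as follows. CUDA renumbers the visible devices from 0, so rank 0 lands on the first listed GPU. The helper name is hypothetical, and the sketch assumes that each rank uses the visible device with the same index.

```python
def rank_to_physical_gpu(cuda_visible_devices, gpu_ranks):
    """Map each single-machine rank to the physical GPU ID it ends up on.

    cuda_visible_devices: value of the environment variable, e.g. "1,2,3,5"
    gpu_ranks: ranks passed to --gpu_ranks, e.g. [0, 1, 2, 3]
    """
    visible = [int(x) for x in cuda_visible_devices.split(",") if x.strip()]
    for rank in gpu_ranks:
        if rank < 0 or rank >= len(visible):
            raise ValueError(
                f"rank {rank} has no visible GPU (only {len(visible)} visible)")
    return {rank: visible[rank] for rank in gpu_ranks}


# The example above: 4 visible GPUs, ranks 0 to 3.
print(rank_to_physical_gpu("1,2,3,5", [0, 1, 2, 3]))
```

With the values from the example, rank 3 runs on physical GPU 5, which is why --gpu_ranks lists 0 to 3 rather than the physical IDs.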

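Pretraining on 2 machines with 8 GPUs each uses 16 processes in total. The script is launched on each of the two machines (Node-0 and Node-1) in turn. --master_ip is set to the ip:port of the machine whose --gpu_ranks contains 0. Launch example: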

Node-0 : python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                             --output_model_path models/output_model.bin --world_size 16 --gpu_ranks 0 1 2 3 4 5 6 7 \
                             --total_steps 100000 --save_checkpoint_steps 10000 --report_steps 100 \
                             --master_ip tcp://9.73.138.133:12345 \
                             --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
Node-1 : python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                             --output_model_path models/output_model.bin --world_size 16 --gpu_ranks 8 9 10 11 12 13 14 15 \
                             --total_steps 100000 \
                             --master_ip tcp://9.73.138.133:12345 \
                             --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

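The IP address of Node-0 is 9.73.138.133.
--total_steps specifies the number of training steps. It must be the same on both machines.
--save_checkpoint_steps specifies how many steps elapse between saves of the pretraining model. It only needs to be set on Node-0, because the model is saved only on Node-0.
--report_steps specifies how many steps elapse between progress reports. It only needs to be set on Node-0, because progress is printed only on Node-0.
The port given in --master_ip must not be one that is already in use by another program.
With randomly initialized parameters, pretraining generally needs a larger learning rate. We recommend --learning_rate 1e-4 (the default is 2e-5).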
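
One way to check the port before launching is to try binding to it on Node-0. The sketch below uses only the Python standard library; the function name is hypothetical, and the result is a point-in-time check, so another program could still claim the port afterwards.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use", or a permission error.
            return False
    return True


# 12345 is the port used in the --master_ip examples.
print("port 12345 free:", is_port_free(12345))
```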

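Loading a pretrained model

We recommend this strategy because it makes use of an existing pretrained model. The model to load is specified with --pretrained_model_path. Pretraining on a single machine with CPU, and on a single machine with a single GPU: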

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_zh_model.bin \
                    --output_model_path models/output_model.bin \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_zh_model.bin \
                    --output_model_path models/output_model.bin --gpu_ranks 3 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

Pretraining on a single machine with 8 GPUs:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt --pretrained_model_path models/google_zh_model.bin \
                    --output_model_path models/output_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

Pretraining on 2 machines with 8 GPUs each:

Node-0 : python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                             --pretrained_model_path models/google_zh_model.bin \
                             --output_model_path models/output_model.bin --world_size 16 --gpu_ranks 0 1 2 3 4 5 6 7 \
                             --master_ip tcp://9.73.138.133:12345 --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert  
Node-1 : python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                             --pretrained_model_path models/google_zh_model.bin \
                             --output_model_path models/output_model.bin --world_size 16 --gpu_ranks 8 9 10 11 12 13 14 15 \
                             --master_ip tcp://9.73.138.133:12345 --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

Pretraining on 3 machines with 8 GPUs each:

Node-0: python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                            --pretrained_model_path models/google_zh_model.bin \
                            --output_model_path models/output_model.bin --world_size 24 --gpu_ranks 0 1 2 3 4 5 6 7 \
                            --master_ip tcp://9.73.138.133:12345 --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
Node-1: python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                            --pretrained_model_path models/google_zh_model.bin \
                            --output_model_path models/output_model.bin --world_size 24 --gpu_ranks 8 9 10 11 12 13 14 15 \
                            --master_ip tcp://9.73.138.133:12345 --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
Node-2: python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                            --pretrained_model_path models/google_zh_model.bin \
                            --output_model_path models/output_model.bin --world_size 24 --gpu_ranks 16 17 18 19 20 21 22 23 \
                            --master_ip tcp://9.73.138.133:12345 --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
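
The --world_size and --gpu_ranks values in the multi-machine examples follow a regular pattern, which the sketch below reproduces. The helper is hypothetical and assumes every machine has the same number of GPUs.

```python
def node_gpu_ranks(num_nodes, gpus_per_node):
    """Return (world_size, ranks), where ranks[n] lists the ranks for Node-n."""
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    world_size = num_nodes * gpus_per_node
    ranks = [list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
             for n in range(num_nodes)]
    return world_size, ranks


# Reproduce the flags of the 3-machine example.
world_size, ranks = node_gpu_ranks(num_nodes=3, gpus_per_node=8)
for node, node_ranks in enumerate(ranks):
    rank_args = " ".join(str(r) for r in node_ranks)
    print(f"Node-{node}: --world_size {world_size} --gpu_ranks {rank_args}")
```

Rank 0 always falls on Node-0, which is the machine whose address goes into --master_ip.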

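Adjusting the size of the pretraining model

In general, larger models consume more compute but perform better. The configuration file of the pretraining model can be specified with --config_path at the pretraining stage.
For BERT (and RoBERTa), the project provides 4 configuration files in the models folder: bert_large_config.json, bert_base_config.json, bert_small_config.json, and bert_tiny_config.json, which correspond to pretraining models of different sizes. Chinese pretrained models are provided for all 4 configuration files; see the pretrained models page for details. bert_base_config.json is used by default.
For ALBERT, the project provides 4 configuration files: albert_base_config.json, albert_large_config.json, albert_xlarge_config.json, and albert_xxlarge_config.json.
Incremental pretraining from the Chinese BERT-large pretrained model: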

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/mixed_corpus_bert_large_model.bin --config_path models/bert_large_config.json \
                    --output_model_path models/output_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

Incremental pretraining from the Chinese BERT-small pretrained model:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/mixed_corpus_bert_small_model.bin --config_path models/bert_small_config.json \
                    --output_model_path models/output_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert

Incremental pretraining from the Chinese BERT-tiny pretrained model:

python3 pretrain.py --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/mixed_corpus_bert_tiny_model.bin --config_path models/bert_tiny_config.json \
                    --output_model_path models/output_model.bin --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
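
The configuration file names listed in this section follow one naming pattern, so the value for --config_path can be derived from the model family and size. The helper below is a hypothetical convenience, not part of UER-py, and it assumes the ALBERT files sit in the same models folder as the BERT ones.

```python
# Sizes for which this section names a configuration file.
CONFIG_SIZES = {
    "bert": ("tiny", "small", "base", "large"),
    "albert": ("base", "large", "xlarge", "xxlarge"),
}


def config_path(model, size="base", folder="models"):
    """Build the path of a configuration file such as models/bert_large_config.json."""
    if model not in CONFIG_SIZES:
        raise ValueError(f"unknown model family: {model}")
    if size not in CONFIG_SIZES[model]:
        raise ValueError(f"no {size} configuration listed for {model}")
    return f"{folder}/{model}_{size}_config.json"


print(config_path("bert", "large"))
print(config_path("albert", "xxlarge"))
```

The configuration file has to match the checkpoint passed to --pretrained_model_path, as in the examples above where the large model is paired with bert_large_config.json.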

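Word-based pretraining model

UER-py provides word-based pretrained models. wiki_bert_word_model.bin and its vocabulary wiki_word_vocab.txt can be downloaded from the model zoo.
The following example performs incremental pretraining starting from wiki_bert_word_model.bin.
Suppose the training corpus is news data from People's Daily. First, we perform word segmentation and obtain rmrb_seg_bert.txt, which is in BERT format with words separated by spaces. Then we build a vocabulary on the corpus: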

python3 scripts/build_vocab.py --corpus_path corpora/rmrb_seg_bert.txt --vocab_path models/rmrb_word_vocab.txt --tokenizer space --min_count 50
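
The effect of --tokenizer space and --min_count in the command above can be approximated with a few lines of Python. This is a rough sketch, not the logic of scripts/build_vocab.py: the real script may add special tokens, order entries differently, or treat the threshold as strict, and the function name here is hypothetical.

```python
from collections import Counter


def build_vocab(lines, min_count):
    """Count space-separated words and keep those seen at least min_count times."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    kept = [word for word, count in counts.items() if count >= min_count]
    # Most frequent first; ties broken alphabetically for a stable order.
    kept.sort(key=lambda word: (-counts[word], word))
    return kept


corpus = ["a b a", "a c b"]
print(build_vocab(corpus, min_count=2))
```

A higher --min_count gives a smaller vocabulary, at the cost of mapping more rare words to the unknown token.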

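Next, we adapt the pretrained model wiki_bert_word_model.bin. The word embedding layer and the output layer before softmax are adjusted according to the difference between the old vocabulary and the new vocabulary. Embeddings of new words are randomly initialized: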

python3 scripts/dynamic_vocab_adapter.py --old_model_path models/wiki_bert_word_model.bin --old_vocab_path models/wiki_word_vocab.txt \
                                         --new_vocab_path models/rmrb_word_vocab.txt --new_model_path models/rmrb_bert_word_model.bin
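
The adaptation step can be pictured as building a new embedding matrix row by row. The sketch below is an assumption about the general idea, not the code of scripts/dynamic_vocab_adapter.py: it supposes that words present in both vocabularies keep their old vectors, and it uses a Gaussian with a hypothetical scale of 0.02 for new words.

```python
import random


def adapt_embedding(old_vocab, old_embedding, new_vocab, seed=0, scale=0.02):
    """Return an embedding matrix whose rows are ordered like new_vocab."""
    rng = random.Random(seed)
    hidden_size = len(old_embedding[0])
    old_index = {word: i for i, word in enumerate(old_vocab)}
    new_embedding = []
    for word in new_vocab:
        if word in old_index:
            # Shared word: copy the vector learned with the old vocabulary.
            new_embedding.append(list(old_embedding[old_index[word]]))
        else:
            # New word: random initialization.
            new_embedding.append(
                [rng.gauss(0.0, scale) for _ in range(hidden_size)])
    return new_embedding


old_vocab = ["a", "b", "c"]
old_embedding = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(adapt_embedding(old_vocab, old_embedding, ["c", "x", "a"]))
```

The output layer before softmax has one row per vocabulary entry as well, so it would be rebuilt in the same way.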

Finally, we perform incremental pretraining on the adapted model rmrb_bert_word_model.bin:

python3 preprocess.py --corpus_path corpora/rmrb_seg_bert.txt --vocab_path models/rmrb_word_vocab.txt \
                      --dataset_path rmrb_word_dataset.pt --processes_num 8 \
                      --target bert --tokenizer space --dynamic_masking --seq_length 256

python3 pretrain.py --dataset_path rmrb_word_dataset.pt --vocab_path models/rmrb_word_vocab.txt \
                    --pretrained_model_path models/rmrb_bert_word_model.bin \
                    --output_model_path models/rmrb_bert_word_incremental_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 1000 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target bert
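
For the command above, the number of checkpoints and progress reports follows directly from --total_steps, --save_checkpoint_steps, and --report_steps. The sketch below assumes steps are counted from 1 and that a save or a report happens whenever the step number is a multiple of the interval. The function names are hypothetical.

```python
def checkpoint_steps(total_steps, save_checkpoint_steps):
    """List the steps at which a checkpoint would be saved."""
    if save_checkpoint_steps < 1:
        raise ValueError("save_checkpoint_steps must be positive")
    return list(range(save_checkpoint_steps, total_steps + 1,
                      save_checkpoint_steps))


def report_count(total_steps, report_steps):
    """Count how many progress reports would be printed."""
    if report_steps < 1:
        raise ValueError("report_steps must be positive")
    return total_steps // report_steps


print(checkpoint_steps(250000, 50000))
print(report_count(250000, 1000))
```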
