Skip to content

Latest commit

 

History

History
302 lines (218 loc) · 8.39 KB

CHEATSHEET.md

File metadata and controls

302 lines (218 loc) · 8.39 KB

Cheatsheet for training the 🇹🇷 BERT model from scratch

Step 0: Collect corpora

For the 🇹🇷 BERT model we collect ~ 35GB text from various sorces like OPUS, Wikipedia, Leipzig Corpora Collection or from the OSCAR corpus.

In a preprocessing step we use the Turkish NLTK model to perform sentence splitting on the corpus. After sentence splitting we remove all sentences that are shorter than 5 tokens.

Then we split the preprocessed training corpus into 1G shards using split -C 1G.

Vocab generation

We use the awesome 🤗 Tokenizers library to create a BERT-compatible vocab.

The vocab is created on the complete training corpus (not just a single shard).

Cased model

For the cased model we use the following snippet to generate the vocab:

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    clean_text=True, 
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False, 
)

trainer = tokenizer.train( 
    "tr_final",
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
    limit_alphabet=1000,
    wordpieces_prefix="##"
)

tokenizer.save("./", "cased")

Uncased model

For the uncased model we use the following snippet to generate the vocab:

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,  # We need to investigate that further (stripping helps?)
    lowercase=True,
)

trainer = tokenizer.train(
    "tr_final",
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
    limit_alphabet=1000,
    wordpieces_prefix="##"
)

tokenizer.save("./", "uncased")

BERT preprocessing

In this step, the create_pretraining_data.py script from the BERT repo is used to create the necessary input format (TFrecords) to train a model from scratch.

We need to clone the BERT repo first:

git clone https://github.com/google-research/bert.git

Cased model

We did split our huge training corpus into smaller shards (1G per shard):

split -C 1G tr_final tr-

Then we move all shards into a separate folder:

mkdir cased_shards
mv tr-* cased_shards

The preprocessing step for each step will approx. consume 50-60GB of RAM and will take 4-5 hours (depending on your machine). If you have a high memory machine, you can parallelize this step using awesome xargs magic ;)

You can set the number of parallel processes with:

export NUM_PROC=5

Then you can start the preprocessing with:

cd bert # go to the BERT repo

find ../cased_shards -type f | xargs -I% -P $NUM_PROC -n 1 \
python3 create_pretraining_data.py --input_file % --output_file %.tfrecord \
--vocab_file ../cased-vocab.txt --do_lower_case=False -max_seq_length=512 \
--max_predictions_per_seq=75 --masked_lm_prob=0.15 --random_seed=12345 \
--dupe_factor=5

So in this example we use 5 parallel processes. Furthermore, we use a sequence length of 512. You could start with a sequence length of 128, train the model for a few steps and then fine-tune the model with a sequence length of 512.

Uncased model

The steps for the uncased model are pretty much identical to the steps for the cased model.

However, we need to lowercase the training corpus first. In this example we use GNU AWK to lowercase the corpus. On Debian/Ubuntu please make sure that you've installed GNU AWK with:

sudo apt install gawk

Then the corpus can be lowercased with:

cat tr_final | gawk '{print tolower($0);}' > tr_final.lower

We split the lowercased corpus into 1G shards with:

split -C 1G tr_final.lower tr-

and move the shards into a separate folder:

mkdir uncased_shards
mv tr-* uncased_shards/

The number of parallel processes can be configured with:

export NUM_PROC=5

Then you can start the preprocessing with:

cd bert # go to the BERT repo

find ../uncased_shards -type f | xargs -I% -P $NUM_PROC -n 1 \
python3 create_pretraining_data.py --input_file % --output_file %.tfrecord \
--vocab_file ../uncased-vocab.txt --do_lower_case=True -max_seq_length=512 \
--max_predictions_per_seq=75 --masked_lm_prob=0.15 --random_seed=12345 \
--dupe_factor=5

Please make sure, that you use --do_lower_case=True and the lowercased vocab!

BERT Pretraining

Uploading TFRecords

The previously created TFRecords are copied into a separate folder:

mkdir cased_tfrecords uncased_tfrecords
mv cased_shards/*.tfrecord cased_tfrecords
mv uncased_shards/*.tfrecord uncased_tfrecords

Then this folder can be uploaded to a Google Storage Bucket using the gsutil command:

gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp -r cased_tfrecords gs://trbert
gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp -r uncased_tfrecords gs://trbert

Notice: You must create a Google Storage Bucket first. Please also make sure that the service user (e.g. service-<id>@cloud-tpu.iam.gserviceaccount.com) has "Storage Administrator" permissions in order to write files to the bucket.

TPU + VM instance

We use a v3-8 TPU from the Google's TensorFlow Research Cloud (TFRC). A TPU instance can be created with:

gcloud compute tpus create bert --zone=<zone> --accelerator-type=v3-8 \
--network=default --range=192.168.1.0/29 --version=1.15

Another TPU is created for the training the uncased model:

gcloud compute tpus create bert-2 --zone=<zone> --accelerator-type=v3-8 \
--network=default --range=192.168.2.0/29 --version=1.15

Please make sure, that you've set the correct --zone to avoid extra costs.

The following command is used to create a Google Cloud VM:

gcloud compute instances create bert --zone=<zone> --machine-type=n1-standard-2 \
--image-project=ml-images --image-family=tf-1-15 --scopes=cloud-platform

SSH into VM

Just ssh into the previously created VM and open a tmux session:

gcloud compute ssh bert

# First login: takes a bit of time...
tmux

Clone the BERT repository (first time) and go to the BERT repo:

git clone https://github.com/google-research/bert.git
cd bert

Config

The pretraining scripts needs a json-based configuration file with the correct vocab size. We just use the original BERT base configuration file from the Transformers library and adjusted the vocab size (32000 in our case):

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 32000
}

Store this configuration file to config.json in the bert repo folder.

Pretraining

Then the pretraining command can be run to train a BERT model from scratch:

python3 run_pretraining.py --input_file=gs://trbert/cased_tfrecords/*.tfrecord \
--output_dir=gs://trbert/bert-base-turkish-cased --bert_config_file=config.json \
--max_seq_length=512 --max_predictions_per_seq=75 --do_train=True \
--train_batch_size=128 --num_train_steps=3000000 --learning_rate=1e-4 \
--save_checkpoints_steps=100000 --keep_checkpoint_max=20 --use_tpu=True \
--tpu_name=bert --num_tpu_cores=8

To train the uncased model, just create a new tmux session window and run the pretraining command for the uncased model:

python3 run_pretraining.py --input_file=gs://trbert/uncased_tfrecords/*.tfrecord \
--output_dir=gs://trbert/bert-base-turkish-uncased --bert_config_file=config.json \
--max_seq_length=512 --max_predictions_per_seq=75 --do_train=True \
--train_batch_size=128 --num_train_steps=3000000 --learning_rate=1e-4 \
--save_checkpoints_steps=100000 --keep_checkpoint_max=20 --use_tpu=True \
--tpu_name=bert-2 --num_tpu_cores=8

This will train cased and uncased models for 3M steps. Checkpoints are saved after 100k steps. The last 20 checkpoints will be kept.

Notice: Due to a training command mistake, the uncased model was only trained for 2M steps.

Both cased and uncased models with a vocab size of 128k were trained for 2M steps.