Configure a model

Matthew Beech edited this page Jan 7, 2025 · 58 revisions

How to configure the training of a model.

The parameters for training a model are stored in the experiment folder in a file named 'config.yml'. The file uses the YAML format. Related settings are grouped together in sections.

Sections of a config file.

These are the sections of a config file.

data:
eval:
infer:
model:
params:
train:

It is not necessary to specify options for all of these sections for every training run. Only parameters that differ from their default values need to be specified. See Parameter Definitions for a full list of supported parameters and their definitions.

A minimal config.yml file looks like this:

data:
  corpus_pairs:
  - type: train,val,test
    src: src-text
    trg: trg-text
  share_vocab: false
  src_vocab_size: 24000
  trg_vocab_size: 32000
model: facebook/nllb-200-distilled-600M

This minimal config file gives the system the following instructions. Train a model to translate between the src and trg languages. Split the texts into three parts: one for training, one for validation and one for testing. Use the default sizes for the validation and test sets and all of the remaining data for training. Create separate vocab files for the source and target languages. Instruct sentencepiece to create a source vocab of 24000 tokens and a target vocab of 32000 tokens. Use facebook/nllb-200-distilled-600M as the base model and the defaults for all other settings, including the default early stopping conditions.

Another way to learn how to configure training is by examining the effective config file that is produced when an experiment is run.

Selection of books or chapters for training on Scripture data.

The parallel texts available for low-resource languages are translations of Scripture that are aligned by verse reference.

When aligned Scripture files are used as a corpus pair, it is possible to select parts of the data for training and testing without splitting the text files before training. The corpus_books config option serves this purpose. A similar option, test_books, specifies which books to include in the test set. A third option, filter_books in the terms section, specifies which books to include for key terms; it accepts the same syntax at the book level (chapters cannot be specified).

The example below shows the corpus_pairs section for restricting the entire model to only the data in the New Testament. The training, validation and test sets are all drawn only from that data.

  corpus_pairs:
  - type: train,test,val
    corpus_books: NT 
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_size: 250

The following example shows a corpus_pairs entry that uses the New Testament, Genesis and Psalms for the training and validation sets and restricts the test set to verses from the book of Exodus.

  corpus_pairs:
  - type: train,val,test
    corpus_books: NT,GEN,PSA
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_books: EXO
    test_size: 250
  seed: 111

When test_books is given without corpus_books, as in the example at the end of this section, the book of Exodus is reserved for the test set and the remaining books of the Bible are available for training and validation. The test_books parameter excludes the books listed there from the training and validation sets. So even though only 250 verses of Exodus are used for the test set, none of the remaining verses of Exodus are included in either the training or validation sets. The test_books parameter may therefore be used to restrict training to a smaller set of data without having to modify the data files.

No error is raised if you specify a test_size larger than the number of verses in the test_books. In that case all of the verses in the test_books will be used as the test set.

model: SILTransformerBase
data:
  corpus_pairs:
  - type: train,val,test
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_books: EXO
    test_size: 250
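The interaction between test_books and test_size described above can be sketched in a few lines of Python. This is only an illustration, not the silnlp implementation: the function name split_by_test_books and the (book, reference) tuples are hypothetical, and the use of random.Random is an assumption.

```python
import random


def split_by_test_books(verses, test_books, test_size=250, seed=111):
    """Hypothetical helper: verses is a list of (book, reference) tuples."""
    # Every verse from a test book is withheld from training and validation,
    # even if it is not picked for the test set.
    candidates = [v for v in verses if v[0] in test_books]
    train_val_pool = [v for v in verses if v[0] not in test_books]
    # A test_size larger than the available verses is not an error:
    # all candidate verses are used instead.
    k = min(test_size, len(candidates))
    test_set = random.Random(seed).sample(candidates, k)
    return train_val_pool, test_set


verses = [("GEN", f"1:{v}") for v in range(1, 6)] + [("EXO", f"1:{v}") for v in range(1, 4)]
pool, test_set = split_by_test_books(verses, {"EXO"}, test_size=250)
print(len(pool), len(test_set))
```

With only three Exodus verses available, the test set contains all three and the pool contains only Genesis verses.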

Alternative syntax for corpus_books, test_books, and filter_books to use chapter specification, book ranges, and subtraction.

In addition to using comma-separated lists to specify the books used for training and testing, it is also possible to specify data at the chapter level, with book ranges, and with subtraction. To do this, use a semicolon-separated list, where each section has one of the following formats:

  • A comma-separated list of chapters and chapter ranges for a specific book, e.g. MAT1,2,6-10. filter_books does not allow chapter specification.
  • A range of books, e.g. GEN-DEU
  • A single book or testament, e.g. MAT, OT
  • To subtract some data from the selection, use one of the above types preceded by -, e.g. -MAT1-4, -GEN-LEV. Sections are evaluated in the order that they appear, so make sure the selection being subtracted has already been added to the data set.

Examples:

GEN;EXO;LEV
OT;MAT-ROM;-ACT4-28
NT;-3JN

Using Multiple Sources

There are several ways to use more than one source in your experiment data. If you want to use different sources to get data from different parts of a text, you can define multiple corpus pairs. This is useful when a source has incomplete data, or when you want to use different sources for training vs evaluation and testing.

data:
  corpus_pairs:
  - type: train,val,test
    src: src-bible1
    trg: trg-bible
    corpus_books: GEN,EXO
    test_books: LEV
  - type: train,val,test
    src: src-bible2
    trg: trg-bible
    corpus_books: NUM,DEU
    test_books: JOS

If you instead want to use multiple sources but want to select data from the same portion of the texts, you can define a mixed-source corpus pair. This will equally and randomly choose verses from each text without overlap.

data:
  corpus_pairs:
  - mapping: mixed_src
    type: train,val,test
    src:
    - src-bible1
    - src-bible2
    trg: trg-bible
    corpus_books: GEN,EXO
    test_books: LEV

Additionally, the many_to_many mapping allows you to map multiple sources to multiple targets.

data:
  corpus_pairs:
  - mapping: many_to_many
    type: train,val,test
    src:
    - src-bible1
    - src-bible2
    trg:
    - trg-bible1
    - trg-bible2
    corpus_books: GEN,EXO
    test_books: LEV

A complete list of the possible abbreviations for the books of the Bible recognized by the code.

Abbreviations for Old Testament Books

GEN EXO LEV NUM DEU JOS JDG RUT 1SA 2SA 1KI 2KI 1CH 2CH EZR NEH EST JOB PSA PRO
ECC SNG ISA JER LAM EZK DAN HOS JOL AMO OBA JON MIC NAM HAB ZEP HAG ZEC MAL 

Abbreviations for New Testament Books

MAT MRK LUK JHN ACT ROM 1CO 2CO GAL EPH PHP COL 1TH 2TH 1TI 2TI TIT PHM HEB JAS 1PE 2PE 1JN 2JN 3JN JUD REV 

Abbreviations for Deuterocanonical Books

TOB JDT ESG WIS SIR BAR LJE S3Y SUS BEL 1MA 2MA 3MA 4MA 1ES 2ES MAN PS2 ODA PSS JSA JDB TBS SST DNT BLT 
3ES EZA 5EZ 6EZ INT CNC GLO TDX NDX DAG PS3 2BA LBA JUB ENO 1MQ 2MQ 3MQ REP 4BA LAO 

A note about the seed parameter.

The seed parameter is used as a seed for a random number generator. The benefit of setting it explicitly is that the same random selection of validation and test set verses is chosen from the available data. Other training runs that use the same seed and data therefore select the same sets, which makes it possible to compare the effect of changing other parameters against an identical test set. If the seed is not set explicitly, the contents of the training, validation and test sets will vary between one training run and the next.
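The effect of a fixed seed can be demonstrated with Python's standard random module. This is a sketch of the general principle only: pick_test_verses is a hypothetical helper, and the random number generator that silnlp actually uses is not stated here.

```python
import random


def pick_test_verses(verse_refs, test_size, seed):
    """Hypothetical helper: draw a reproducible random sample of verses."""
    # A generator created with the same seed always yields the same sample
    # when given the same data in the same order.
    return sorted(random.Random(seed).sample(verse_refs, test_size))


refs = list(range(1, 1001))
first_run = pick_test_verses(refs, 250, seed=111)
second_run = pick_test_verses(refs, 250, seed=111)
print(first_run == second_run)
```

Two runs with the same seed select the same 250 verses, so scores from both runs are measured against an identical test set.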

A note about YAML files.

YAML is designed to be easy to read. It is useful to know that there are various ways to specify a list. Inline lists are separated with commas, and square brackets are optional for a simple list. For a list that is too long for a single line, each item can be placed on a separate line preceded by a hyphen and a space.

These are three ways of indicating the same list:

    test_books: GEN,EXO,LEV,NUM,DEU

    test_books: [GEN,EXO,LEV,NUM,DEU]

    test_books:
    - GEN
    - EXO
    - LEV
    - NUM
    - DEU
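A YAML parser reads the first form as a single string and the other two as sequences, so code that consumes the config has to treat them alike. The sketch below shows one way this could be done; as_book_list is a hypothetical helper, not a function from silnlp.

```python
def as_book_list(value):
    """Hypothetical helper: accept a comma-separated string or a list of books."""
    if isinstance(value, str):
        # "GEN,EXO,LEV" arrives from the YAML parser as one string.
        return [item.strip() for item in value.split(",") if item.strip()]
    # "[GEN,EXO,LEV]" and the hyphenated form both arrive as a list.
    return [str(item).strip() for item in value]


print(as_book_list("GEN,EXO,LEV,NUM,DEU"))
print(as_book_list(["GEN", "EXO", "LEV", "NUM", "DEU"]))
```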

The hyphen and space - on the line after the corpus_pairs parameter indicate that these settings are part of a list. In the examples above only one corpus pair is specified. Here is an example of a complete config.yml file, the one we used to train our German to English parent model. There are three corpus pairs, one for each of the training, validation and test sets.

model: SILTransformerBaseAlignmentEnhanced
data:
  terms:
    dictionary: true
  corpus_pairs:
  - type: train
    src: de-WMT2020+Bibles
    trg: en-WMT2020+Bibles
  - type: val
    src: de-newstest2014_ende
    trg: en-newstest2014_ende
  - type: test
    src: de-newstest2017_ende
    trg: en-newstest2017_ende
  seed: 111
  share_vocab: false
  src_casing: lower
  src_vocab_size: 32000
  trg_casing: preserve
  trg_vocab_size: 32000
params:
  coverage_penalty: 0.1
  word_dropout: 0
train:
  keep_checkpoint_max: 5
  max_step: 1000000
  sample_buffer_size: 10000000
eval:      
  steps: 10000
  export_on_best: bleu
  early_stopping: null 
  export_format: checkpoint
  max_exports_to_keep: 100

Preprocessing.

The files required for training, validation, and testing will be tokenized using the tokenizer of the specified model and the outputs will be written to the experiment folder. These are named:

train.src.txt
train.src.detok.txt
train.trg.txt
train.trg.detok.txt
train.vref.txt
val.src.txt
val.src.detok.txt
val.trg.txt
val.trg.detok.txt
val.vref.txt
test.src.txt
test.src.detok.txt
test.trg.detok.txt
test.vref.txt

The seed in the config file is used in the selection of verses for each of the training splits, and this behavior is enabled by default for consistent experimentation.

The Effective Config file.

The effective config file is created as soon as the training begins. A good way to learn about all the default parameters is to compare a simple config file, like the minimal example above, to the effective config that it creates. Although there may be more than 100 parameters in the effective config file, they all have default values. Typically we've found very few areas where we can get better results by changing a default value. The defaults have been the subject of many experiments and are chosen by the OpenNMT project according to the results of the latest research.

Parameter Definitions

Definitions of every configurable experiment parameter and their default values. Information about Hugging Face parameters can be found here. Selected HF parameters are defined below for convenience, and default values are only given if they are explicitly defined in silnlp.

Data

  • add_new_lang_code=True: Add any language codes in lang_codes to the tokenizer if they do not already exist.
  • aligner="fast_align": Aligner to use.
  • corpus_pairs:
    • augment=[]: List of data augmentation methods and their arguments to apply to the data. See example below.
      augment:
      - subword:
        - encodings: 2
      
    • corpus_books=[]: Books to be included in the dataset. See Selection of books or chapters for training on Scripture data.
    • disjoint_test=True: Use the same test set across all source-target pairs in the corpus pair to ensure no overlap between any train and test sets.
    • disjoint_val=True: Use the same evaluation set across all source-target pairs in the corpus pair to ensure no overlap between any train and evaluation sets.
    • lexical=False: Whether data is made up of lexical items rather than sentences.
    • mapping="one_to_one": How to map sources to targets. Options are one_to_one, mixed_src, or many_to_many. See Using Multiple Sources.
    • score_threshold=0.0: If <1, it is the minimum alignment score sentence pairs must have to be included in the training data. If >=1, that number of training sentence pairs with the lowest alignment scores will be filtered out of the training data.
    • size=1.0: Size of training split. If size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
    • src: Required argument. List of sources. Sources can be a mix of strings and dictionaries. Passing a dictionary allows the user to configure the source file object. See the DataFile class for a list of the configurable properties. Targets (see trg) can also be defined in this way. See example below.
      src:
      - name: aaa-SRC_BT
        include_test: false
      - aaa-SRC
      
    • src_noise=[]: List of noise-adding methods and their arguments to apply to source sentences. See example below.
      src_noise:
      - dropout: .1
      - replacement: [.1, <blank>]
      - permutation: 2
      
    • tags=[]: Tags to prefix to each source sentence.
    • test_books=[]: Books to be included in the test set. See Selection of books or chapters for training on Scripture data.
    • test_size=250: Size of test split. If test_size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
    • trg: Required argument. List of targets. See src.
    • type="train,test,val": What the data in the corpus pair will be used for. Possible values are any combination of train, test, and val.
    • use_test_set_from="": Use the set of verses in the given experiment's test set for this experiment.
    • val_size=250: Size of evaluation split. If val_size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
  • lang_codes: Mapping of ISO language codes to their NLLB equivalents for each language included in the data. See example below.
    lang_codes:
      en: eng_Latn
      npi: npi_Deva
    
  • mirror=False: Add mirrored data to the dataset, where the source and target are flipped.
  • seed=111: Seed for random verse selection. See A note about the seed parameter.
  • share_vocab=False: Use the same vocab file for the source and target languages.
  • stats_max_size=100000: Maximum number of sentence pairs allowed for a stats file to be generated.
  • terms:
    • categories="PN": Which categories of key terms to include.
    • dictionary=False: Write dictionary with key terms.
    • include_glosses=True: Include glosses of key terms. Can also be set to the ISO language code of the gloss to include. The lang_codes parameter must include this ISO to NLLB mapping for accurate results.
    • train=True: Train on key terms data.
    • filter_books=[]: Which books of key terms to include. See Selection of books or chapters for training on Scripture data.
  • tokenize=True: Tokenize data.
  • tokenizer:
    • update_src=False: Update the tokenizer for the source language.
    • update_trg=False: Update the tokenizer for the target language.
    • trained_tokens=False: If True, train a new tokenizer on the source and/or target (specified by the update_src and update_trg parameters) to obtain trained tokens tailored to the source and/or target. All of the resulting tokens that are not present in the existing tokenizer are then added to the existing tokenizer. If False, only individual characters that are present in the source and/or target text and not present in the existing tokenizer will be added to the existing tokenizer, rather than trained tokens.
    • src_vocab_size: Only applicable if update_src and trained_tokens are True. This sets the vocab size for the new tokenizer for the source side. There is no default value, so it must be explicitly specified when update_src and trained_tokens are True.
    • trg_vocab_size: Only applicable if update_trg and trained_tokens are True. This sets the vocab size for the new tokenizer for the target side. There is no default value, so it must be explicitly specified when update_trg and trained_tokens are True.
    • share_vocab=False: Only applicable if update_src, update_trg, and trained_tokens are True. Rather than create new tokenizers for the source and target separately, use a single new tokenizer for both the source and target combined with a vocab size of src_vocab_size + trg_vocab_size.
    • init_unk=False: Initialize new token embeddings using the embedding for the unk token rather than using the model's default initialization behavior.
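The size, val_size and test_size parameters above share the same ratio-or-absolute rule. A minimal sketch of that rule follows. The function name resolve_split_size is hypothetical, and two details are assumptions rather than documented behavior: a float of exactly 1.0 is treated as a ratio (so the default size=1.0 means all of the data), and absolute sizes are capped at the amount of data available.

```python
def resolve_split_size(size, total):
    """Hypothetical helper: turn a size setting into a number of sentence pairs."""
    if isinstance(size, float) and 0 < size <= 1:
        # Assumption: a float up to and including 1.0 is a ratio of the total.
        return round(size * total)
    # Integers and values above 1 are absolute sizes, capped at what is available.
    return min(int(size), total)


print(resolve_split_size(250, 10000))
print(resolve_split_size(0.25, 1000))
print(resolve_split_size(1.0, 1000))
```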

Eval

HF Arguments: eval_accumulation_steps, eval_delay, eval_steps=1000, evaluation_strategy="steps", greater_is_better, include_inputs_for_metrics, load_best_model_at_end=True, per_device_eval_batch_size=16, predict_with_generate=True

  • eval_steps=1000: Number of update steps between two evaluations if evaluation_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
  • metric_for_best_model="bleu": Metric to use for evaluation during training. Supported values in silnlp are 'bleu', 'chrf3', 'chrf3+', and 'chrf3++'.

Other Arguments:

  • detokenize=True: Detokenize verses before computing metrics during evaluation/testing.
  • early_stopping:
    • min_improvement=0.2: How much the metric_for_best_model metric must improve for training to continue.
    • steps=4: The number of times in a row that an evaluation can improve by less than min_improvement before training is stopped.
  • multi_ref_eval=False: Evaluate outputs against multiple targets.
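The early_stopping settings above can be read as a simple counter over successive evaluations. The sketch below illustrates that reading; it is not the silnlp or Hugging Face implementation. The function name should_stop is hypothetical, and measuring improvement against the best score so far (rather than the previous score) is an assumption.

```python
def should_stop(scores, min_improvement=0.2, steps=4):
    """Hypothetical helper: scores are metric values from successive evaluations."""
    best = float("-inf")
    stalled = 0
    for score in scores:
        if score - best >= min_improvement:
            # A large enough gain over the best score resets the counter.
            stalled = 0
        else:
            stalled += 1
        best = max(best, score)
        if stalled >= steps:
            return True
    return False


print(should_stop([20.0, 25.0, 25.1, 25.1, 25.05, 25.15]))
print(should_stop([20.0, 25.0, 25.1, 25.1, 26.0, 26.1]))
```

In the first series the score stalls for four evaluations in a row, so training would stop. In the second, the jump to 26.0 resets the counter.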

Infer

  • infer_batch_size=16: Batch size for inference.
  • num_beams=2: Number of beams for beam search during translation.

Model

model: Required argument. Name of base model to be used. Defined at the top level of the config, i.e. at the same level as data, eval, etc.

Params

HF Arguments: adafactor, adam_beta1, adam_beta2, adam_epsilon, full_determinism, generation_max_length, generation_num_beams, label_smoothing_factor=0.2, learning_rate, lr_scheduler_type, max_grad_norm, optim="adamw_torch", warmup_ratio, warmup_steps=4000, weight_decay, attn_implementation="sdpa"

Other Arguments:

  • activation_dropout=0.0: Dropout rate for activation layers.
  • attention_dropout=0.1: Dropout rate for attention layers.
  • dropout=0.1: Dropout rate for all other layers.

Train

HF Arguments: gradient_accumulation_steps=4, gradient_checkpointing=True, "gradient_checkpointing_kwargs"={"use_reentrant": True}, group_by_length=True, log_level="info", logging_dir, logging_first_step, logging_nan_inf_filter, logging_steps, logging_strategy, max_steps=100000, num_train_epochs, output_dir=str(exp_dir / "run"), per_device_train_batch_size=16, save_on_each_node, save_steps=1000, save_strategy="steps", save_total_limit=2

  • gradient_accumulation_steps=4: Number of update steps to accumulate the gradients for before performing a backward/update pass.
  • gradient_checkpointing=True: Use gradient checkpointing to save memory at the expense of slower backward pass.
  • "gradient_checkpointing_kwargs"={"use_reentrant": True}: Use the reentrant implementation of gradient checkpointing. (If errors occur with gradient checkpointing and LoRA or some other method that freezes parameters/layers, try setting use_reentrant to False.)
  • logging_steps=500: Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
  • max_steps=100000: The total number of training steps to perform. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached. Overrides num_train_epochs. Set to -1 to instead use num_train_epochs.
  • num_train_epochs=3.0: Total number of training epochs to perform (if not an integer, will perform the decimal part percents of the last epoch before stopping training).
  • per_device_train_batch_size=16: The batch size per GPU core/CPU for training.
  • save_steps=1000: Number of update steps between two checkpoint saves if save_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
  • attn_implementation="sdpa": Sets the attention implementation for a model. Possible values are "sdpa", "eager", or "flash_attention_2". Note that "flash_attention_2" is not currently compatible with NLLB.

Other Arguments:

  • auto_grad_acc=False: Find and use the largest possible batch size and adjust the number of gradient accumulation steps accordingly to maintain an effective batch size of 64. The per_device_train_batch_size and gradient_accumulation_steps arguments are ignored while using this option.
  • delete_checkpoint_optimizer_state=True: Delete optimizer state from every saved checkpoint after training.
  • delete_checkpoint_tokenizer=True: Delete tokenizer from every saved checkpoint after training.
  • lora_config: Optional configuration for LoRA. See Common LoRA Parameters in PEFT.
    • alpha=32: Value for lora_alpha. "The alpha parameter for Lora scaling."
    • dropout=0.1: Value for lora_dropout. "The dropout probability for Lora layers."
    • modules_to_save: "List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint." Default value depends on the model being trained, but it normally includes "embed_tokens" and "lm_head".
    • r=4: "Lora attention dimension."
    • target_modules: "The names of the modules to apply Lora to." Default value depends on the model being trained, but it normally includes all linear layers.
  • max_source_length=200: Maximum length of a source segment. Segments longer than this value are truncated.
  • max_target_length=200: Maximum length of a target segment. Segments longer than this value are truncated.
  • use_lora=False: Train model using LoRA through the peft library. See here for more information.
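The relationship between batch size and gradient accumulation that auto_grad_acc maintains is simple arithmetic: the effective batch size is the per-device batch size multiplied by the number of accumulation steps. The sketch below assumes a single device and a batch size that divides 64 evenly; accumulation_steps is a hypothetical helper, and how silnlp finds the largest batch size that fits in memory is not shown.

```python
def accumulation_steps(per_device_batch_size, effective_batch_size=64):
    """Hypothetical helper: accumulation steps needed to reach the effective batch size."""
    if effective_batch_size % per_device_batch_size != 0:
        raise ValueError("batch size must divide the effective batch size evenly")
    return effective_batch_size // per_device_batch_size


# The defaults (batch size 16, 4 accumulation steps) already give 16 * 4 = 64.
print(accumulation_steps(16))
# A larger batch that fits in memory needs fewer accumulation steps.
print(accumulation_steps(32))
```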

Helpful Parameters for Development

The following are some parameters that can be useful to change when running experiments for the purpose of testing during development. This is mostly to reduce training time while still making sure each part of the process is run.

eval: 
  eval_steps
  per_device_eval_batch_size
infer: 
  infer_batch_size
params:
  warmup_steps
train:
  max_steps
  num_train_epochs
  per_device_train_batch_size
  save_steps
  • save_steps determines how often a model checkpoint is saved during training. For example, if you wanted to quickly get a model to run inference with, you could set both max_steps and save_steps to 100.

How to configure translation requests for a model.

Using the --translate option when running an experiment allows drafts to be created immediately following the training of a model. The configuration for each translation request must be specified in translate_config.yml in the experiment folder. The behavior of this process is identical to using the translate.py script, so the possible arguments for a configuration match the command line options of the script (with the exception of the clearml_queue and debug options). The format of translate_config.yml is a list of dictionaries, where each dictionary represents a translation request. See the example below, as well as the translate.py usage documentation for descriptions of the arguments.

translate:
- books: 1JN
- src_project: NASB
  trg_project: NNRV
  books: 1JN1-2;2JN
  • In this example, the first request will translate 1 John from the experiment's source project to the target language. The second request will translate the specified chapters in the NASB to the target language, filling in incomplete books with text from the NNRV.

Using the legacy learning rate configuration

Originally, the default configuration for training a model in silnlp used a small learning rate and a large maximum number of steps. Rather than training each model for the maximum number of steps, it used "early stopping" to detect when the model was adequately trained by comparing the model's evaluation scores over the course of training. The default configuration has since been updated to use a larger learning rate and a smaller maximum number of steps, and models are now always trained for the maximum number of steps. While models now train for far fewer steps (5k steps vs. 10-20k steps), the adjustments made to the learning rate and learning rate schedule allow the models to achieve equal performance compared to the previous setup in the majority of cases. However, there are still some situations, mainly more experimental ones, where the original configuration is better suited to the task. In that case, the original behavior can be restored by adding the fields below to their respective sections in the configuration file of an experiment. The current default values are also given for comparison.

Current training configuration (Oct 2024)

eval: 
  early_stopping: null
params:
  warmup_steps: 1000
  learning_rate: .0002
  lr_scheduler_type: cosine
train:
  max_steps: 5000

Previous training configuration:

eval: 
  early_stopping:
    min_improvement: 0.2
    steps: 4
params:
  warmup_steps: 4000
  learning_rate: .00005
  lr_scheduler_type: linear
train:
  max_steps: 100000
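To compare the two configurations numerically, the sketch below computes the learning rate at a given step using the usual linear-warmup formulas followed by either cosine or linear decay. It is a standalone approximation written for illustration, not the scheduler code that the trainer runs; the function name learning_rate is hypothetical.

```python
import math


def learning_rate(step, base_lr, warmup_steps, max_steps, scheduler="cosine"):
    """Hypothetical helper: learning rate at a training step, with linear warmup."""
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    if scheduler == "cosine":
        # Half a cosine wave from base_lr down to 0.
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    if scheduler == "linear":
        # Straight line from base_lr down to 0.
        return base_lr * max(0.0, 1.0 - progress)
    raise ValueError(f"unsupported scheduler: {scheduler}")


# Current configuration: peak of 0.0002 after 1000 steps, cosine decay to step 5000.
for step in (500, 1000, 3000, 5000):
    print("current", step, learning_rate(step, 0.0002, 1000, 5000, "cosine"))

# Previous configuration: peak of 0.00005 after 4000 steps, linear decay to step 100000.
for step in (2000, 4000, 52000, 100000):
    print("previous", step, learning_rate(step, 0.00005, 4000, 100000, "linear"))
```

Under these formulas the current configuration peaks at a learning rate four times higher than the previous one and reaches that peak after a quarter as many warmup steps.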