
NMT: Usage

mshannon-sil edited this page Jan 10, 2025 · 25 revisions

Setting up and running an experiment

This section describes the tools most commonly used to set up and run an experiment.

experiment

If no individual step is specified, the experiment tool runs the preprocess, train, and test tools in succession.

usage: python -m silnlp.nmt.experiment [-h] [--stats] [--force-align] [--disable-mixed-precision]
[--num-devices NUM_DEVICES] [--clearml-queue QUEUE] [--save-checkpoints]
[--preprocess] [--train] [--test] [--translate] [--score-by-book] [--mt-dir DIR] [--debug]
[--commit ID] [--scorers [scorer [scorer ...]]] [--multiple-translations]
experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --stats | Compute tokenization statistics | Compute tokenization statistics. |
| --force-align | Force recalculation of all alignment scores | Only relevant when using the --stats option. |
| --disable-mixed-precision | Disable mixed precision | Only use this option if your GPU doesn't support mixed precision. Mixed precision is considerably faster than full precision and has lower memory requirements, allowing you to train larger models. It has a negligible effect on the final model. More... |
| --num-devices NUM_DEVICES | Number of devices to train on | To split a model across multiple GPUs, use this option to set how many GPUs to use. Currently, only 1 or 2 devices are supported, and only for the NLLB model. If the GPUs are not automatically detected, you may need to set the environment variable CUDA_VISIBLE_DEVICES so that multiple GPUs are visible (e.g., if using --num-devices 2, set CUDA_VISIBLE_DEVICES=0,1). |
| --clearml-queue QUEUE | ClearML queue | Run remotely on a ClearML queue. Default: None (don't register with ClearML). The queue 'local' will run it locally and register it with ClearML. |
| --save-checkpoints | Save checkpoints to s3 bucket | Save checkpoints to s3 bucket. |
| --preprocess | Run the preprocess step | Run the preprocess step. |
| --train | Run the train step | Run the train step. |
| --test | Run the test step | Run the test step. |
| --translate | Create drafts | See here for more details. |
| --score-by-book | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. |
| --mt-dir DIR | The machine translation directory | Use an alternative machine translation directory for the location of the experiment. |
| --debug | Show debug information | Show information about the environment variables and arguments. |
| --commit ID | Commit ID | The silnlp git commit id with which to run a remote job. |
| --scorers [scorer [scorer ...]] | Set scorers | Specifies the list of scorers to be used on the predictions. Default is ['bleu', 'sentencebleu', 'chrf3', 'chrf3++', 'wer', 'ter', 'spbleu']. Additional options are 'chrf3+' and 'meteor'. |
| --multiple-translations | Produce multiple drafts | If the translate or test steps are being performed, produce multiple drafts of the input data or test data, respectively. When translating, the system produces multiple output files, one for each draft. When testing, a column in the output specifies the draft number (1, 2, etc.). See here for more details. |
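
As a rough illustration of how the options above combine, the following sketch assembles an argument list for the experiment tool. The helper name build_experiment_command and the experiment name MyProject/exp1 are hypothetical. The sketch places the experiment name before the options so that a multi-value option such as --scorers cannot absorb it; this assumes an argparse-style parser, which the usage text suggests but does not confirm.

```python
import shlex


def build_experiment_command(experiment, steps=(), clearml_queue=None, scorers=()):
    """Assemble an argument list for `python -m silnlp.nmt.experiment`.

    Hypothetical helper for illustration only; it is not part of silnlp.
    """
    # Positional argument first, so that --scorers (which takes a variable
    # number of values) cannot consume the experiment name.
    command = ["python", "-m", "silnlp.nmt.experiment", experiment]
    # With no step flags the tool runs preprocess, train, and test in turn.
    command += [f"--{step}" for step in steps]
    if clearml_queue is not None:
        command += ["--clearml-queue", clearml_queue]
    if scorers:
        command += ["--scorers", *scorers]
    return command


# "MyProject/exp1" is a placeholder experiment name.
command = build_experiment_command(
    "MyProject/exp1", steps=("preprocess", "train"), clearml_queue="local"
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run unchanged, which avoids shell quoting issues altogether.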

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

  • splitting the source and target files into the training, validation, and test data sets;
  • writing the train/validate/test data sets to files in the experiment subfolder;
  • adapting the tokenizer of the parent model to be used by this experiment;
  • generating tokenization statistics about the data.
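
The splitting step in the list above can be pictured with a minimal sketch. It is not silnlp's implementation: the actual split is controlled by the experiment configuration and may select data differently. The function name split_pairs and its parameters are assumptions made for illustration.

```python
import random


def split_pairs(pairs, val_size, test_size, seed=0):
    """Randomly split aligned (source, target) pairs into train/val/test.

    Illustrative only; the real preprocess tool may choose its test and
    validation data in other ways (for example, by book).
    """
    pairs = list(pairs)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    rng.shuffle(pairs)
    test = pairs[:test_size]
    val = pairs[test_size:test_size + val_size]
    train = pairs[test_size + val_size:]
    return train, val, test


# Ten toy sentence pairs: 5 go to training, 2 to validation, 3 to test.
toy_pairs = [(f"src sentence {i}", f"trg sentence {i}") for i in range(10)]
train, val, test = split_pairs(toy_pairs, val_size=2, test_size=3)
print(len(train), len(val), len(test))
```

Keeping source and target sentences together as pairs before shuffling is what preserves their alignment across the three sets.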

usage: python -m silnlp.nmt.preprocess [-h] [--stats] [--force-align] experiment

Arguments:

Argument Purpose Description
experiment Experiment name The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder.
--stats Compute tokenization statistics Compute tokenization statistics.
--force-align Force recalculation of all alignment scores Only relevant when using the --stats option.

train

The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.

usage: python -m silnlp.nmt.train [-h] [--disable-mixed-precision]
[--num-devices NUM_DEVICES]
experiments [experiments ...]

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiments | Experiment names | The names of the experiments to train. Each experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --disable-mixed-precision | Disable mixed precision | Only use this option if your GPU doesn't support mixed precision. Mixed precision is considerably faster than full precision and has lower memory requirements, allowing you to train larger models. It has a negligible effect on the final model. More... |
| --num-devices NUM_DEVICES | Number of devices to train on | To train a single model on multiple GPUs, use this option to set how many GPUs to use. Ensure that the environment variable CUDA_VISIBLE_DEVICES is also set so that multiple GPUs are visible (e.g., if using --num-devices 2, set CUDA_VISIBLE_DEVICES=0,1). |
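
The CUDA_VISIBLE_DEVICES setting described for --num-devices can be prepared programmatically. The sketch below is a hypothetical helper, not part of silnlp, and it assumes the GPUs to expose are the first ones, numbered contiguously from 0.

```python
import os


def training_environment(num_devices, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    Assumption: the devices to expose are indices 0 .. num_devices - 1.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(num_devices))
    return env


env = training_environment(2)
print(env["CUDA_VISIBLE_DEVICES"])  # prints 0,1
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the train tool with --num-devices 2.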

test

The test tool tests the neural model for an experiment. If no trained model exists in the experiment folder, the base model will be used.

usage: python -m silnlp.nmt.test [-h] [--checkpoint CHECKPOINT]
[--last] [--best] [--avg] [--ref-projects [project [project ...]]]
[--force-infer] [--scorers [scorer [scorer ...]]]
[--books BOOKS] [--by-book]
experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --checkpoint CHECKPOINT | Test specified checkpoint | Use the specified checkpoint (e.g., '--checkpoint 6000') to generate target language predictions from the test set. The specified checkpoint must be available in the run subfolder of the specified experiment. |
| --last | Test the last checkpoint | Use the last training checkpoint to generate target language predictions. |
| --best | Test the best checkpoint | Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the run > export subfolder of the specified experiment. |
| --avg | Test the averaged checkpoint | Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the run > avg subfolder of the specified experiment. An averaged checkpoint can be generated automatically during training using the train: average_last_checkpoints: <n> option, or manually after training using the average_checkpoints tool. |
| --ref-projects [project [project ...]] | Reference projects | The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions. |
| --force-infer | Force inferencing | If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only score the predictions when it is run again on that same checkpoint. This option can be used to force the tool to re-generate the target language predictions. |
| --scorers [scorer [scorer ...]] | Set scorers | Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'sentencebleu', 'chrf3', 'chrf3+', 'chrf3++', 'meteor', 'ter', 'wer', and 'spbleu'. |
| --books BOOKS | Books to score | Specifies one or more books/chapters to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s)/chapter(s). Books must be specified using the 3-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis) and follow the syntax found here. |
| --by-book | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the --books option, individual scores are provided for each of the specified books. |
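
To make the --by-book behaviour concrete, the sketch below groups predictions by book and reports one overall score plus one score per book. It is only an illustration: the verse reference format ("GEN 1:1"), the function names, and the exact-match scorer are all assumptions, and the stand-in scorer is not one of the scorers the test tool supports.

```python
from collections import defaultdict


def exact_match_rate(predictions, references):
    """Stand-in scorer: fraction of predictions identical to the reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)


def score_by_book(verse_refs, predictions, references, scorer=exact_match_rate):
    """Return an overall score ("ALL") plus one score per book."""
    grouped = defaultdict(lambda: ([], []))
    for verse_ref, prediction, reference in zip(verse_refs, predictions, references):
        book = verse_ref.split()[0]  # "GEN 1:1" -> "GEN" (assumed format)
        grouped[book][0].append(prediction)
        grouped[book][1].append(reference)
    scores = {"ALL": scorer(predictions, references)}
    for book, (book_predictions, book_references) in grouped.items():
        scores[book] = scorer(book_predictions, book_references)
    return scores


scores = score_by_book(
    ["GEN 1:1", "GEN 1:2", "EXO 1:1"],
    ["a", "b", "c"],
    ["a", "x", "c"],
)
print(scores)
```

Any corpus-level scorer with the same (predictions, references) signature could be swapped in for the stand-in.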

translate

The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:

  1. Using a trained model to translate the text in a file from the source language to a target language.
  2. Using a trained model to translate the text in a sequence of files into a target language.
  3. Using a trained model to translate a USFM-formatted book in a Paratext project into a target language.

The command line arguments for each of these scenarios are described below.

usage: python -m silnlp.nmt.translate [-h] [--checkpoint CHECKPOINT]
[--src SRC] [--trg TRG]
[--src-prefix SRC_PREFIX] [--trg-prefix TRG_PREFIX] [--start-seq START_SEQ] [--end-seq END_SEQ]
[--src-project SRC_PROJECT] [--trg-project TRG_PROJECT]
[--books BOOKS] [--src-iso LANG] [--trg-iso LANG]
[--include-inline-elements] [--stylesheet-field-update ACTION] [--multiple-translations]
[--clearml-queue QUEUE] [--debug] [--commit ID]
experiment

Text file

Using the combination of command line arguments described in this section, the translate command will translate the sentences in a text file from the source language to the target language, using the requested checkpoint from a trained model.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
| --checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the translation. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| --src SRC | Source file | Name of a text file with the source language sentences to be translated (one sentence per line). The translate tool looks for the file in the current working directory or, if a full/relative path is specified, in the specified folder. Each line in the specified source file is translated and written to the specified target file. |
| --trg TRG | Target file | Name of the text file where the translated sentences will be written (one per line). |
| --src-iso LANG | Source language ISO code | The ISO code for the source language. |
| --trg-iso LANG | Target language ISO code | The ISO code for the target language. |
| --multiple-translations | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to be changed when using this. Instead, a suffix corresponding to the draft number will be added to the output file name. For example, if you specified --trg output.txt, files named output.1.txt, output.2.txt, etc. will be created. See here for more details. |
| --clearml-queue QUEUE | ClearML queue | Run remotely on a ClearML queue. Default: None (don't register with ClearML). The queue 'local' will run it locally and register it with ClearML. |
| --debug | Show debug information | Show information about the environment variables and arguments. |
| --commit ID | Commit ID | The silnlp git commit id with which to run a remote job. |
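
The draft naming described for --multiple-translations (output.txt becoming output.1.txt, output.2.txt, and so on) can be reproduced with a short sketch, which is handy when a script needs to collect the drafts afterwards. The helper name draft_paths is hypothetical.

```python
from pathlib import Path


def draft_paths(target, num_drafts):
    """List the per-draft file names derived from a --trg value.

    The draft number is inserted before the extension, following the
    output.txt -> output.1.txt pattern described above.
    """
    target = Path(target)
    return [
        target.with_suffix(f".{draft}{target.suffix}")
        for draft in range(1, num_drafts + 1)
    ]


for path in draft_paths("output.txt", 3):
    print(path)
```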

Sequence of Text Files

Using the combination of command line arguments described in this section, the translate command will translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
| --checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the translation. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| --src-prefix SRC_PREFIX | Source file prefix (e.g., de-news2019-) | The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory. |
| --trg-prefix TRG_PREFIX | Target file prefix (e.g., en-news2019-) | The file name prefix for the target files. The translate tool will write the translated text to a series of files with this file name prefix; the translated files will be written to the current working directory. |
| --start-seq START_SEQ | Starting file sequence number | The first source language file to translate (e.g., '--start-seq 0'). The source files must use a 4-digit, zero-padded numbering sequence ('de-news2019-0000.txt', 'de-news2019-0001.txt', etc.). |
| --end-seq END_SEQ | Ending file sequence number | The final source language file sequence number to translate. |
| --src-iso LANG | Source language ISO code | The ISO code for the source language. |
| --trg-iso LANG | Target language ISO code | The ISO code for the target language. |
| --multiple-translations | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to be changed when using this. Instead, a suffix corresponding to the draft number will be added to the output file name. For example, if you specified --trg-prefix output_ and --end-seq 2, files named output_0000.1.txt, output_0000.2.txt, output_0001.1.txt, etc. will be created. See here for more details. |
| --clearml-queue QUEUE | ClearML queue | Run remotely on a ClearML queue. Default: None (don't register with ClearML). The queue 'local' will run it locally and register it with ClearML. |
| --debug | Show debug information | Show information about the environment variables and arguments. |
| --commit ID | Commit ID | The silnlp git commit id with which to run a remote job. |
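
The file naming convention for this scenario (a prefix followed by a 4-digit, zero-padded sequence number) can be sketched as follows. The helper name is hypothetical, and the sketch assumes that the ending sequence number is inclusive, which the description of --end-seq suggests but does not state outright.

```python
def sequence_file_names(prefix, start_seq, end_seq, num_drafts=None):
    """List the file names for a prefix and a sequence range.

    Assumption: end_seq is inclusive. When num_drafts is given, each file
    name gets a draft-number suffix, as with --multiple-translations.
    """
    names = []
    for seq in range(start_seq, end_seq + 1):
        stem = f"{prefix}{seq:04d}"  # 4-digit, zero-padded sequence number
        if num_drafts is None:
            names.append(f"{stem}.txt")
        else:
            names.extend(f"{stem}.{draft}.txt" for draft in range(1, num_drafts + 1))
    return names


# Source files the tool would look for, given --src-prefix de-news2019-
print(sequence_file_names("de-news2019-", 0, 2))
# Draft files written for --trg-prefix output_ with two drafts per file
print(sequence_file_names("output_", 0, 1, num_drafts=2))
```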

Paratext book (USFM file)

Using the combination of command line arguments described in this section, the translate command will translate a book from a Paratext project into the requested target language. The translated text is written into a USFM-formatted file with markup that closely follows the markup in the source book.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment whose model will be used for translation. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the translation. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| --src-project SRC_PROJECT | The source project to translate | The name of the source Paratext project. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder. |
| --trg-project TRG_PROJECT | Target project | The name of the target Paratext project that will fill in missing text for books that are not entirely translated. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder. |
| --books BOOKS | The books to translate | A list of the books/chapters in the source Paratext project to be translated. Book identifiers should follow the USFM 3.0 standard and the selections should follow the syntax found here. If multiple selections are being made, put the selections in quotes so that the semicolons are not misinterpreted. |
| --trg-iso LANG | Target language ISO code | The ISO code for the target language. |
| --include-inline-elements | Keep inline elements in USFM files | Keeps inline USFM elements such as footnotes and cross references. Default behavior is to remove these elements before translating. |
| --stylesheet-field-update ACTION | Handle USFM style conflicts | What to do with the OccursUnder and TextProperties fields of a project's custom stylesheet. Possible values are 'replace', 'merge' (default), and 'ignore'. |
| --multiple-translations | Produce multiple drafts | Produce a number of drafts equal to num_drafts in config.yml. The way that source and target files are specified does not need to be changed when using this. Instead, a suffix corresponding to the draft number will be added to the output file name. For example, if you specified --books JOL, then in the target project's run directory, files named 29JOL.1.SFM, 29JOL.2.SFM, etc. will be created. See here for more details. |
| --clearml-queue QUEUE | ClearML queue | Run remotely on a ClearML queue. Default: None (don't register with ClearML). The queue 'local' will run it locally and register it with ClearML. |
| --debug | Show debug information | Show information about the environment variables and arguments. |
| --commit ID | Commit ID | The silnlp git commit id with which to run a remote job. |
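
The advice to quote multiple --books selections exists because a POSIX shell treats an unquoted semicolon as a command separator. The sketch below shows one way to build a safely quoted command line with the standard library. The helper name, the experiment and project names, and the "GEN;EXO" selection are placeholders; the full selection syntax is defined elsewhere and is not reproduced here.

```python
import shlex


def build_translate_command(experiment, src_project, books, trg_iso, checkpoint):
    """Return a shell-safe command line for translating Paratext books.

    Hypothetical helper; `books` is a list of selections joined with
    semicolons, the separator mentioned in the --books description.
    """
    args = [
        "python", "-m", "silnlp.nmt.translate", experiment,
        "--checkpoint", checkpoint,
        "--src-project", src_project,
        "--books", ";".join(books),
        "--trg-iso", trg_iso,
    ]
    # shlex.join quotes any argument containing shell metacharacters,
    # so the semicolon-separated selection reaches the tool intact.
    return shlex.join(args)


print(build_translate_command("MyProject/exp1", "SRCPROJ", ["GEN", "EXO"], "fra", "last"))
```

Splitting the printed line with shlex.split recovers the original arguments, which confirms that the selection is passed as a single value.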

Assessing Data Suitability for Training a Model

analyze_project_pairs

Gets verse counts and computes alignment scores for pairs of biblical texts. Outputs the raw counts/scores and optionally summarizes the information in Excel files.

Configuration information: The script functions the same way as an experiment in that it operates within an experiment folder and uses a reduced version of an experiment's config.yml file. It only expects the "data" section of the config file to exist*. Within the data section, it only looks at the "aligner" and "corpus_pairs" fields. Within each corpus pair, it uses the "src", "trg", "mapping", "corpus_books", and "score_threshold" fields. See here for definitions and default values for each field.

*It will also optionally look at the "model" field to check if the model was trained on any data with the same script as the given data.
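
The reduced configuration described above can be pictured as the following structure, written here as a Python dictionary rather than YAML. Every value is a placeholder, and the helper ignored_fields is hypothetical; it simply lists entries that, according to the description above, the script would not look at.

```python
# Fields the description says are read within each corpus pair.
PAIR_FIELDS = {"src", "trg", "mapping", "corpus_books", "score_threshold"}

# Placeholder values only; see the field definitions for valid settings.
analysis_config = {
    "data": {
        "aligner": "<aligner>",
        "corpus_pairs": [
            {
                "src": "<source project>",
                "trg": "<target project>",
                "mapping": "<mapping>",
                "corpus_books": "<book selection>",
                "score_threshold": "<threshold>",
            }
        ],
    },
    "model": "<model>",  # optional, used only for the script check
}


def ignored_fields(config):
    """List config entries outside the fields named in the description."""
    ignored = [key for key in config if key not in ("data", "model")]
    data = config.get("data", {})
    ignored += [f"data.{key}" for key in data if key not in ("aligner", "corpus_pairs")]
    for index, pair in enumerate(data.get("corpus_pairs", [])):
        ignored += [
            f"data.corpus_pairs[{index}].{key}" for key in pair if key not in PAIR_FIELDS
        ]
    return ignored


print(ignored_fields(analysis_config))  # prints []
```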

usage: python -m silnlp.nmt.analyze_project_pairs [-h] [--create-summaries] [--recalculate]
[--deutero] [--clearml-queue QUEUE] experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment folder | The name of the subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder containing the config.yml file and where outputs will be written to. |
| --create-summaries | Create summary Excel files | Creates two files, one more general file containing verse counts and high level alignment stats, and another with a more in-depth breakdown of the alignment scores. |
| --recalculate | Force recalculation of all verse counts and alignment scores | Verse counts are cached globally but alignments will always be created from scratch the first time a given experiment is run and will be stored in the experiment folder. |
| --deutero | Include books from the Deuterocanon | A warning message will be printed for each text that has books from the Deuterocanon when this option is not used. |
| --clearml-queue QUEUE | ClearML queue | Run remotely on ClearML queue. Default: None - don't register with ClearML. The queue 'local' will run it locally and register it with ClearML. analyze_project_pairs is a CPU-intensive script that will not benefit from (and in fact will probably be slowed down by) a GPU-only queue. |

Analyzing experiment metadata

alphabet_similarity

Calculates alphabet similarity between text corpora in a multilingual data set.

usage: python -m silnlp.nmt.alphabet_similarity [-h] experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
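
The page does not say which similarity measure alphabet_similarity uses. As one plausible reading, the sketch below compares the sets of alphabetic characters in two corpora using Jaccard similarity; both the measure and the function names are assumptions, not the tool's actual method.

```python
def alphabet(lines):
    """Collect the set of alphabetic characters used in a corpus."""
    return {ch for line in lines for ch in line if ch.isalpha()}


def alphabet_similarity(lines_a, lines_b):
    """Jaccard similarity of two alphabets (an assumed measure)."""
    a, b = alphabet(lines_a), alphabet(lines_b)
    if not a and not b:
        return 1.0  # two empty alphabets are treated as identical
    return len(a & b) / len(a | b)


print(alphabet_similarity(["abc"], ["bcd"]))  # prints 0.5
```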

segment_length

Display a histogram of segment lengths in tokens.

usage: python -m silnlp.nmt.segment_length [-h] experiment filename

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| filename | Tokenized file in experiment folder | Tokenized file in experiment folder. |
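
A histogram of segment lengths can be approximated in a few lines. This sketch assumes that tokens in the tokenized file are separated by whitespace and that each line is one segment; the function name and the bin width are illustrative choices, not the tool's own.

```python
from collections import Counter


def segment_length_histogram(lines, bin_width=10):
    """Count segments per length bin, keyed by each bin's lower bound."""
    counts = Counter()
    for line in lines:
        num_tokens = len(line.split())  # whitespace-separated tokens (assumed)
        counts[(num_tokens // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


segments = ["a b c", "a b", "a b c d e f g h i j k"]
for lower, count in segment_length_histogram(segments, bin_width=5).items():
    print(f"{lower:>3}-{lower + 4:<3} {'#' * count}")
```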

vocab_overlap

Calculate the vocab overlap between two experiments.

usage: python -m silnlp.nmt.vocab_overlap [-h] exp1 exp2

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| exp1 | Experiment 1 name | The name of the first experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| exp2 | Experiment 2 name | The name of the second experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
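
Conceptually, vocabulary overlap is a set comparison. The sketch below takes two token lists and reports how many entries are shared; how the tool loads each experiment's vocabulary, and which statistics it reports, are not described here, so the function name and the returned fields are assumptions.

```python
def vocab_overlap(vocab1, vocab2):
    """Compare two vocabularies given as iterables of tokens."""
    set1, set2 = set(vocab1), set(vocab2)
    shared = set1 & set2
    return {
        "shared": len(shared),
        "only_exp1": len(set1 - set2),
        "only_exp2": len(set2 - set1),
        # Fraction of experiment 1's vocabulary also present in experiment 2.
        "exp1_coverage": len(shared) / len(set1) if set1 else 0.0,
    }


print(vocab_overlap(["the", "cat", "s"], ["the", "dog", "s", "ing"]))
```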

Analyzing the results of an experiment

check_train_val_test_split

After a model has been trained and used to generate predictions for the test set, the check_train_val_test_split tool can be used to analyze the word distributions across the train, validate, and test sets for the source and target corpora. By default, the tool will generate high-level statistics regarding the occurrence of "unknown" words (i.e., words that occur in the validation set or in the test set, but not in the training set). The tool can also be used to generate detailed listings of these unknown words and their occurrence counts. It is also possible to have the tool compare these unknown words to the valid words found in the training set to identify possible misspellings. Output is saved in the word_count.xlsx file in the specified experiment folder.

usage: python -m silnlp.nmt.check_train_val_test_split [-h]
[--details] [--similar-words]
[--distance DIST] [--detok-val]
experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment to check. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --details | Show detailed word lists | Generate detailed lists of validation set and test set words that are not found in the training set. Separate lists are generated for the source and target corpora. Occurrence counts are provided for each identified word. |
| --similar-words | Find similar words | Compare each unknown word to the valid words found in the training set and identify possible misspellings in the validation and test sets. Levenshtein distance is used to identify the possible misspellings. |
| --distance DIST | Maximum Levenshtein distance for word similarity | By default, a Levenshtein distance of 1 is used to identify similar words in the training set. This parameter can be used to specify a different distance. |
| --detok-val | Detokenize the target validation set | Detokenize the target validation set. |
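
The two ideas behind this tool, finding words that never occur in the training set and matching them to training words within a small Levenshtein distance, can be sketched as follows. The function names are hypothetical and whitespace tokenization is assumed; the tool's own word segmentation may differ.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def unknown_words(train_lines, other_lines):
    """Count words in other_lines that never occur in train_lines."""
    train_vocab = {word for line in train_lines for word in line.split()}
    counts = Counter(word for line in other_lines for word in line.split())
    return {word: n for word, n in counts.items() if word not in train_vocab}


def similar_training_words(word, train_vocab, max_distance=1):
    """Training words within max_distance edits of an unknown word."""
    return sorted(w for w in train_vocab if levenshtein(word, w) <= max_distance)


train = ["the cat sat"]
test = ["the catt sat down down"]
print(unknown_words(train, test))
print(similar_training_words("catt", {"the", "cat", "sat"}))
```

Raising max_distance corresponds to passing a larger value to --distance: more candidate misspellings are found, at the cost of more false matches.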

diff_predictions

The diff_predictions tool can be used to compare the test set predictions to the reference sentences for an experiment. The tool generates a spreadsheet (diff_predictions.xlsx) with multiple comparison tabs. The comparison includes the test set source text, the target language reference text, the predictions, and the sentence-level BLEU scores for the predictions. Optionally, the tool can mark up each prediction to identify the differences between the reference text and the prediction. The source text can also be marked up to highlight test set words that are not found in the training set. The training set source/target sentence pairs can also be included in the output spreadsheet on a separate tab.

usage: python -m silnlp.nmt.diff_predictions [-h] [--last]
[--show-diffs] [--show-unknown] [--show-dict]
[--include-train] [--include-dict] [--analyze-digits]
[--preserve-case] [--tokenize TOK] [--scorers [scorer [scorer ...]]]
exp1

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| exp1 | Experiment name | The name of the experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --last | Use last result | Use last result instead of best one. |
| --show-diffs | Show differences (predictions vs reference) | Mark up the predictions to indicate where they differ from the reference text. |
| --show-unknown | Show unknown words in source verse | Mark up the test set source sentences to indicate words that do not occur in the training set. |
| --show-dict | Show dictionary words in source verse | Show dictionary words in source verse. |
| --include-train | Include the src/trg training corpora in the spreadsheet | Include the parallel source/target training sentence pairs in another tab in the spreadsheet. |
| --include-dict | Include the src/trg dictionary in the spreadsheet | Include the src/trg dictionary in the spreadsheet. |
| --analyze-digits | Perform digits analysis | Perform digits analysis. |
| --preserve-case | Score predictions with case preserved | Preserve case when calculating the sentence-level BLEU score for the source/target sentence pairs. By default, the tool will lowercase the source and target. Note that this behavior is secondary to the source/target case settings specified in the config.yml file; if those settings specify lowercasing, then this argument has no effect. |
| --tokenize TOK | Sacrebleu tokenizer (none, 13a, intl, zh, ja-mecab, char) | Specifies the Sacrebleu tokenizer that will be used to calculate the sentence-level BLEU score for each source/target sentence pair. (Default: 13a) |
| --scorers [scorer [scorer ...]] | List of scorers | Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'sentencebleu', 'chrf3', 'chrf3+', 'chrf3++', 'meteor', 'ter', 'wer', and 'spbleu'. |
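
The kind of markup produced by --show-diffs can be illustrated with the standard library's difflib. This is a sketch of word-level diffing in general, not the tool's own markup: the function name and the [-removed-] / {+added+} notation are choices made for this example.

```python
import difflib


def mark_differences(reference, prediction):
    """Mark up a prediction against its reference, word by word.

    Words only in the reference appear as [-word-]; words only in the
    prediction appear as {+word+}. The notation is illustrative.
    """
    ref_words, pred_words = reference.split(), prediction.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=pred_words, autojunk=False)
    marked = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            marked.extend(pred_words[j1:j2])
            continue
        if i2 > i1:  # words missing from the prediction
            marked.append("[-" + " ".join(ref_words[i1:i2]) + "-]")
        if j2 > j1:  # words the prediction added or substituted
            marked.append("{+" + " ".join(pred_words[j1:j2]) + "+}")
    return " ".join(marked)


print(mark_differences("in the beginning God created", "in the beginning God made"))
```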