Creating a finetuning command that simplifies overall finetuning #1

Open
wants to merge 12 commits into ON/finetune

Conversation

PonteIneptique

No description provided.

OrianeN and others added 12 commits April 4, 2024 19:16
Implemented via a new config parameter `load_pretrained_model`

Enable finetuning of another PaPie model by loading its state dict into the current model.

Customization of which parts to load is possible via the subparameter `load_pretrained_model["exclude"]`.
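
A minimal sketch of what the corresponding part of the JSON training config could look like (written here as a Python dict with the same keys; the "exclude" value is only an illustrative assumption):

```python
# Sketch of the "load_pretrained_model" section of the training config.
# Keys are the ones named in this PR; the concrete values are assumptions.
finetune_settings = {
    "load_pretrained_model": {
        "pretrained": "pretrained_model.tar",  # parent PaPie model to load
        "exclude": ["lm"],                     # parts of the state dict to skip (example value)
    }
}
```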
…ding

- rename variable `model_tar` to `pretrained`
- new nested function `load_state_dict_label_by_label` that loads wemb, cemb, lm and tasks with linear decoders label by label (see the sketch after this list)
- variable `model_parts_to_load` can only contain parts that are actually present in the model (e.g. "lm" only if `self.include_lm` is True)
- load tasks of the pretrained model even when the case doesn't match (e.g. the "pos" task of the pretrained model can be loaded into the "POS" task of the new model)
- raise NotImplementedError if the task to be loaded does not correspond to a LinearDecoder (support for AttentionalDecoder is planned; until then the user should exclude the task in the config)
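
To make the label-by-label loading concrete, here is a hedged sketch of the idea behind `load_state_dict_label_by_label` for a LinearDecoder-style task; apart from the general approach, the variable names and data structures are assumptions, not the PR's exact code:

```python
import torch

def load_linear_decoder_label_by_label(new_decoder, pretrained_weight,
                                       new_label2idx, old_label2idx):
    """Copy decoder rows for labels shared by the pretrained and the new model.

    Sketch only: assumes the decoder wraps an nn.Linear whose output rows are
    indexed by label, and that label-to-index mappings behave like plain dicts.
    """
    with torch.no_grad():
        for label, old_idx in old_label2idx.items():
            new_idx = new_label2idx.get(label)
            if new_idx is None:
                continue  # label absent from the new model: leave its row untouched
            new_decoder.weight[new_idx].copy_(pretrained_weight[old_idx])
```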
`MultiLabelEncoder.fit()` has a new boolean option `expand_mode`: when enabled, new labels/vocab from the fine-tuning data are added to the frequency counts, and a new method `LabelEncoder.expand_vocab()` extracts the new list of labels/vocab and appends the new entries at the end of `LabelEncoder.table` and `LabelEncoder.inverse_table`.

The `expand_mode` is optional and can be set to false/true (default true) in the JSON config ("load_pretrained_model"/"expand_labels"). If set to false in fine-tuning mode ("load_pretrained_model"/"pretrained": "pretrained_model.tar"), a new option `skip_fitted` is passed to `MultiLabelEncoder.fit()` instead, so that tasks present in the new model but not in the pretrained one can still be fitted (example use case: fine-tuning a model pretrained for POS tagging on a lemmatization task).

+ Fix a typo in `LabelEncoder.from_json()`
+ Reorganize imports in module dataset.py
+ Fix logging.basicConfig by adding the option force=True (fixes missing logs in stdout)
+ Change logs in module dataset.py to calls to a module-specific logger (best practice)
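
A minimal sketch of the appending behaviour described for `LabelEncoder.expand_vocab()` above, using a plain dict and list instead of the actual class:

```python
def expand_vocab(table, inverse_table, new_labels):
    """Append unseen labels at the end so that pretrained indices stay valid (sketch)."""
    for label in new_labels:
        if label not in table:
            table[label] = len(inverse_table)  # next free index
            inverse_table.append(label)
    return table, inverse_table

# Example: the pretrained vocab keeps indices 0-2, new entries get 3 and 4.
table = {"<UNK>": 0, "a": 1, "b": 2}
inverse_table = ["<UNK>", "a", "b"]
expand_vocab(table, inverse_table, ["b", "c", "d"])
# table == {"<UNK>": 0, "a": 1, "b": 2, "c": 3, "d": 4}
```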
Previously, if two vocabulary entries had the same uppercased form, the inverse_table would include both duplicates, but the table would map the uppercased entry only to the last index, leaving the other index missing.

Found in an experiment where 2 characters were uppercased to "M", at positions 606 and 651: the table then contained only "M": 651, index 606 was missing, and random errors occurred when the model tried to predict index 606.
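
For illustration, a self-contained repro of this kind of collision; the characters here are assumed examples, not the ones from the experiment:

```python
# Two distinct vocabulary entries that share the same uppercased form:
# "ſ".upper() == "S" == "s".upper()
inverse_table = ["s", "ſ"]
table = {c.upper(): i for i, c in enumerate(inverse_table)}
print(table)  # {'S': 1} -> index 0 has silently disappeared from the table,
              # mirroring the missing index 606 described in this commit.
```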
The option 'labels_mode' takes 3 possible values:
- "expand" (replaces expand_labels=true): append new vocab from the new data to the pretrained label encoders
- "skip" (replaces expand_labels=false): only fit new tasks that haven't been pretrained
- "replace": fit a new MultiLabelEncoder from scratch (pretrained params will still be loaded for common vocab entries)
The user can pass a seed value either from the command line with a new option `--seed` or from the config file, where the default value is 'auto'.
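
A hedged sketch of how an 'auto' seed could be resolved into a concrete integer before seeding the RNGs (not necessarily the PR's exact implementation):

```python
import random

def resolve_seed(seed="auto"):
    """Turn 'auto' into a random integer seed, otherwise use the given value (sketch)."""
    if seed == "auto":
        seed = random.randint(0, 2**32 - 1)
    random.seed(seed)
    return seed
```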
This script can be useful in fine-tuning or reporting scenarios,
as it aims to show the size of the vocabularies as well as
the number and size of some important layers in the model.
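
Not the script from this commit, but a generic sketch of the kind of report it describes, operating on a plain PyTorch state dict (how the PaPie model is deserialized into that dict is left aside):

```python
import torch

def report_layer_sizes(state_dict):
    """Print each parameter tensor's shape and element count, plus the total (generic sketch)."""
    total = 0
    for name, tensor in state_dict.items():
        if torch.is_tensor(tensor):
            print(f"{name:50s} {str(tuple(tensor.shape)):>18s} {tensor.numel():>12,d}")
            total += tensor.numel()
    print(f"{'total parameters':50s} {'':>18s} {total:>12,d}")
```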
When "labels_mode" is "expand" but the vocabulary max sizes ("char_max_size"/"word_max_size") are smaller than the parent model's vocabularies, the intended behavior is to keep only the most frequent entries from the parent vocabulary.

A bug in the code led to removing the entire vocabulary, including the reserved entries (e.g. <UNK>). This commit fixes that.
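
A sketch of the intended pruning behaviour (keep the most frequent entries up to the max size while never dropping reserved entries); the data structures and the reserved symbol names other than <UNK> are assumptions:

```python
from collections import Counter

def prune_vocab(freqs, max_size, reserved=("<UNK>", "<PAD>")):
    """Keep reserved symbols plus the most frequent entries up to max_size (sketch)."""
    kept = list(reserved)  # reserved entries always survive
    by_freq = [entry for entry, _ in Counter(freqs).most_common() if entry not in reserved]
    kept += by_freq[:max(0, max_size - len(kept))]
    return kept

# Example: max_size=4 keeps the 2 reserved symbols and the 2 most frequent entries.
prune_vocab({"a": 10, "b": 5, "c": 1}, max_size=4)  # ['<UNK>', '<PAD>', 'a', 'b']
```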
OrianeN (Owner) left a comment

Nice! I'm not sure I prefer this option over the config file, but I guess it's a matter of taste.
I'm also discovering the model._settings attribute, which I didn't find when writing the script get_pie_model_params.py ^^

Here are a few suggestions for improvement, what do you think?

@click.argument('input_model', type=click.Path(exists=True, file_okay=True, readable=True, dir_okay=False))
@click.argument('train', type=click.Path(exists=True, file_okay=True, readable=True, dir_okay=False))
@click.argument('dev', type=click.Path(exists=True, file_okay=True, readable=True, dir_okay=False))
# @click.option('--test', type=str, default=None)
OrianeN (Owner) commented:

Why did you comment out the option for the test set?
It is often necessary to declare a new test set when fine-tuning, for instance in case of transfer learning to a new language, or if the original path is no longer accessible.
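
For instance, the commented line could be restored in the same style as the other path arguments; the `finetune` stub below is just a placeholder for the actual command:

```python
import click

@click.command()
@click.option('--test',
              type=click.Path(exists=True, file_okay=True, readable=True, dir_okay=False),
              default=None)
def finetune(test):
    """Stub showing the restored, optional --test option (defaults to None)."""
```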

@click.option('--seed', type=int, default=None)
@click.option("--batch-size", type=int, default=None)
@click.option("--epochs", type=int, default=100)
@click.option("--mode", type=click.Choice(["expand", "skip", "replace"]), default="expand")
OrianeN (Owner) commented:

With the expand mode, I usually used the word/char min_freq/max_size settings - could you add them as options as well?
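
A sketch of what those options could look like next to the existing ones (option names follow the config keys mentioned in the comment; the `finetune` stub is a placeholder):

```python
import click

@click.command()
@click.option("--char-max-size", type=int, default=None)
@click.option("--word-max-size", type=int, default=None)
@click.option("--char-min-freq", type=int, default=None)
@click.option("--word-min-freq", type=int, default=None)
def finetune(char_max_size, word_max_size, char_min_freq, word_min_freq):
    """Stub showing where the vocabulary-size options could be added."""
```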

@click.option("--batch-size", type=int, default=None)
@click.option("--epochs", type=int, default=100)
@click.option("--mode", type=click.Choice(["expand", "skip", "replace"]), default="expand")
@click.option("--device", type=str, default="cpu")
OrianeN (Owner) commented:

Choosing the device automatically based on what's available (as the default value) could help avoid mistakes and save typing one more option.
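
One possible way to pick the default automatically, using torch's availability check (the `finetune` stub is a placeholder):

```python
import click
import torch

# Default to GPU when one is available, otherwise fall back to CPU.
DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

@click.command()
@click.option("--device", type=str, default=DEFAULT_DEVICE)
def finetune(device):
    """Stub showing an auto-detected default device."""
```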

@click.option("--epochs", type=int, default=100)
@click.option("--mode", type=click.Choice(["expand", "skip", "replace"]), default="expand")
@click.option("--device", type=str, default="cpu")
@click.option("--out", type=str, default="model")
OrianeN (Owner) commented:

IIRC, the default model filename when training is a concatenation using the "modelname" setting. Maybe you could do something similar here as well?
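
A hedged sketch of deriving the default output name from the settings; the exact concatenation PaPie uses at training time may differ, and the suffix here is an assumption:

```python
def default_output_name(settings, suffix="finetuned"):
    """Build a default output prefix from the "modelname" setting (sketch)."""
    modelname = getattr(settings, "modelname", None) or "model"
    return f"{modelname}-{suffix}"
```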
