Creating a finetuning command that simplifies overall finetuning #1

Open
wants to merge 12 commits into ON/finetune

Conversation

PonteIneptique

No description provided.

OrianeN and others added 12 commits April 4, 2024 19:16
Implemented via a new config parameter `load_pretrained_model`

Enable finetuning of another PaPie model by loading its state dict into the current model.

Customization of which parts to load is possible via the subparameter `load_pretrained_model["exclude"]`.
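
A minimal sketch of what the corresponding part of the JSON training config could look like (written here as a Python dict with the same keys; the "exclude" value is only an illustrative assumption):

```python
# Sketch of the "load_pretrained_model" section of the training config.
# Keys are the ones named in this PR; the concrete values are assumptions.
finetune_settings = {
    "load_pretrained_model": {
        "pretrained": "pretrained_model.tar",  # parent PaPie model to load
        "exclude": ["lm"],                     # parts of the state dict to skip (example value)
    }
}
```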
…ding

- rename variable `model_tar` to `pretrained`
- new nested function `load_state_dict_label_by_label` that loads wemb, cemb, lm and tasks with linear decoders label by label (see the sketch after this list)
- variable `model_parts_to_load` can only contain parts that are actually present in the model (e.g. "lm" only if `self.include_lm` is True)
- load tasks of the pretrained model even when the case doesn't match (e.g. the "pos" task of the pretrained model can be loaded into the "POS" task of the new model)
- raise NotImplementedError if the task to be loaded does not correspond to a LinearDecoder (support for AttentionalDecoder is planned; until then the user should exclude the task in the config)
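
To make the label-by-label loading concrete, here is a hedged sketch of the idea behind `load_state_dict_label_by_label` for a LinearDecoder-style task; apart from the general approach, the variable names and data structures are assumptions, not the PR's exact code:

```python
import torch

def load_linear_decoder_label_by_label(new_decoder, pretrained_weight,
                                       new_label2idx, old_label2idx):
    """Copy decoder rows for labels shared by the pretrained and the new model.

    Sketch only: assumes the decoder wraps an nn.Linear whose output rows are
    indexed by label, and that label-to-index mappings behave like plain dicts.
    """
    with torch.no_grad():
        for label, old_idx in old_label2idx.items():
            new_idx = new_label2idx.get(label)
            if new_idx is None:
                continue  # label absent from the new model: leave its row untouched
            new_decoder.weight[new_idx].copy_(pretrained_weight[old_idx])
```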
`MultiLabelEncoder.fit()` has a new boolean option `expand_mode`: when enabled, new labels/vocab from the fine-tuning data are added to the frequency counts, and a new method `LabelEncoder.expand_vocab()` extracts the new list of labels/vocab and appends the new entries at the end of `LabelEncoder.table` and `LabelEncoder.inverse_table`.

The `expand_mode` is optional and can be set to false/true (default true) in the JSON config ("load_pretrained_model"/"expand_labels"). If set to false in fine-tuning mode ("load_pretrained_model"/"pretrained": "pretrained_model.tar"), a new option `skip_fitted` is passed to `MultiLabelEncoder.fit()` instead, so that tasks present in the new model but not in the pretrained one can still be fitted (example use case: fine-tuning a model pretrained for POS tagging on a lemmatization task).

+ Fix a typo in `LabelEncoder.from_json()`
+ Reorganize imports in module dataset.py
+ Fix logging.basicConfig by adding the option force=True (fixes missing logs in stdout)
+ Change logs in module dataset.py to calls to a module-specific logger (best practice)
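
A minimal sketch of the appending behaviour described for `LabelEncoder.expand_vocab()` above, using a plain dict and list instead of the actual class:

```python
def expand_vocab(table, inverse_table, new_labels):
    """Append unseen labels at the end so that pretrained indices stay valid (sketch)."""
    for label in new_labels:
        if label not in table:
            table[label] = len(inverse_table)  # next free index
            inverse_table.append(label)
    return table, inverse_table

# Example: the pretrained vocab keeps indices 0-2, new entries get 3 and 4.
table = {"<UNK>": 0, "a": 1, "b": 2}
inverse_table = ["<UNK>", "a", "b"]
expand_vocab(table, inverse_table, ["b", "c", "d"])
# table == {"<UNK>": 0, "a": 1, "b": 2, "c": 3, "d": 4}
```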
Previously, if two vocabulary entries had the same uppercased form, the inverse_table would include both duplicates, but the table would map the uppercased entry only to the last index, leaving the other index missing.

Found in an experiment where 2 characters were uppercased to "M", at positions 606 and 651: the table then contained only "M": 651, index 606 was missing, and random errors occurred when the model tried to predict index 606.
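
For illustration, a self-contained repro of this kind of collision; the characters here are assumed examples, not the ones from the experiment:

```python
# Two distinct vocabulary entries that share the same uppercased form:
# "ſ".upper() == "S" == "s".upper()
inverse_table = ["s", "ſ"]
table = {c.upper(): i for i, c in enumerate(inverse_table)}
print(table)  # {'S': 1} -> index 0 has silently disappeared from the table,
              # mirroring the missing index 606 described in this commit.
```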
The option 'labels_mode' takes 3 possible values:
- "expand" (replaces expand_labels=true): append new vocab from the new data to the pretrained label encoders
- "skip" (replaces expand_labels=false): only fit new tasks that haven't been pretrained
- "replace": fit a new MultiLabelEncoder from scratch (pretrained params will still be loaded for common vocab entries)
The user can pass a seed value either from the command line with a new option `--seed` or from the config file, where the default value is 'auto'.
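
A hedged sketch of how an 'auto' seed could be resolved into a concrete integer before seeding the RNGs (not necessarily the PR's exact implementation):

```python
import random

def resolve_seed(seed="auto"):
    """Turn 'auto' into a random integer seed, otherwise use the given value (sketch)."""
    if seed == "auto":
        seed = random.randint(0, 2**32 - 1)
    random.seed(seed)
    return seed
```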
This script can be useful in fine-tuning or reporting scenarios,
as it aims to show the size of the vocabularies as well as
the number and size of some important layers in the model.
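
Not the script from this commit, but a generic sketch of the kind of report it describes, operating on a plain PyTorch state dict (how the PaPie model is deserialized into that dict is left aside):

```python
import torch

def report_layer_sizes(state_dict):
    """Print each parameter tensor's shape and element count, plus the total (generic sketch)."""
    total = 0
    for name, tensor in state_dict.items():
        if torch.is_tensor(tensor):
            print(f"{name:50s} {str(tuple(tensor.shape)):>18s} {tensor.numel():>12,d}")
            total += tensor.numel()
    print(f"{'total parameters':50s} {'':>18s} {total:>12,d}")
```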
When "labels_mode" is "expand" but the vocabulary max sizes ("char_max_size"/"word_max_size") are smaller than the parent model's vocabularies, the intended behavior is to keep only the most frequent entries from the parent vocabulary.

A bug in the code led to removing the entire vocabulary, including the reserved entries (e.g. <UNK>). This commit fixes that.
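
A sketch of the intended pruning behaviour (keep the most frequent entries up to the max size while never dropping reserved entries); the data structures and the reserved symbol names other than <UNK> are assumptions:

```python
from collections import Counter

def prune_vocab(freqs, max_size, reserved=("<UNK>", "<PAD>")):
    """Keep reserved symbols plus the most frequent entries up to max_size (sketch)."""
    kept = list(reserved)  # reserved entries always survive
    by_freq = [entry for entry, _ in Counter(freqs).most_common() if entry not in reserved]
    kept += by_freq[:max(0, max_size - len(kept))]
    return kept

# Example: max_size=4 keeps the 2 reserved symbols and the 2 most frequent entries.
prune_vocab({"a": 10, "b": 5, "c": 1}, max_size=4)  # ['<UNK>', '<PAD>', 'a', 'b']
```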
OrianeN (Owner) left a comment

Nice! I'm not sure I prefer this option over the config file, but I guess it's a matter of taste.
I'm also discovering the model._settings attribute, which I didn't find when writing the script get_pie_model_params.py ^^

Here are a few suggestions for improvement, what do you think?

@click.argument('input_model', type=click.Path(exists=True, file_okay=True, readable=True, dir_okay=False))
@click.argument('train', type=click.Path(exists=True, file_okay=True, readable=True, dir_okay=False))
@click.argument('dev', type=click.Path(exists=True, file_okay=True, readable=True, dir_okay=False))
# @click.option('--test', type=str, default=None)
OrianeN (Owner) commented:

Why did you comment out the option for the test set?
It is often necessary to declare a new test set when fine-tuning, for instance in case of transfer learning to a new language, or if the original path is no longer accessible.
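
For instance, the commented line could be restored in the same style as the other path arguments; the `finetune` stub below is just a placeholder for the actual command:

```python
import click

@click.command()
@click.option('--test',
              type=click.Path(exists=True, file_okay=True, readable=True, dir_okay=False),
              default=None)
def finetune(test):
    """Stub showing the restored, optional --test option (defaults to None)."""
```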

@click.option('--seed', type=int, default=None)
@click.option("--batch-size", type=int, default=None)
@click.option("--epochs", type=int, default=100)
@click.option("--mode", type=click.Choice(["expand", "skip", "replace"]), default="expand")
OrianeN (Owner) commented:

With the expand mode, I usually used the word/char min_freq/max_size settings - could you add them as options as well?
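
A sketch of what those options could look like next to the existing ones (option names follow the config keys mentioned in the comment; the `finetune` stub is a placeholder):

```python
import click

@click.command()
@click.option("--char-max-size", type=int, default=None)
@click.option("--word-max-size", type=int, default=None)
@click.option("--char-min-freq", type=int, default=None)
@click.option("--word-min-freq", type=int, default=None)
def finetune(char_max_size, word_max_size, char_min_freq, word_min_freq):
    """Stub showing where the vocabulary-size options could be added."""
```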

@click.option("--batch-size", type=int, default=None)
@click.option("--epochs", type=int, default=100)
@click.option("--mode", type=click.Choice(["expand", "skip", "replace"]), default="expand")
@click.option("--device", type=str, default="cpu")
OrianeN (Owner) commented:

Choosing the device automatically based on what's available (as the default value) could help avoid mistakes and save typing one more option.
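
One possible way to pick the default automatically, using torch's availability check (the `finetune` stub is a placeholder):

```python
import click
import torch

# Default to GPU when one is available, otherwise fall back to CPU.
DEFAULT_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

@click.command()
@click.option("--device", type=str, default=DEFAULT_DEVICE)
def finetune(device):
    """Stub showing an auto-detected default device."""
```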

@click.option("--epochs", type=int, default=100)
@click.option("--mode", type=click.Choice(["expand", "skip", "replace"]), default="expand")
@click.option("--device", type=str, default="cpu")
@click.option("--out", type=str, default="model")
OrianeN (Owner) commented:

IIRC, the default model filename when training is a concatenation using the "modelname" setting. Maybe you could do something similar here as well?
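
A hedged sketch of deriving the default output name from the settings; the exact concatenation PaPie uses at training time may differ, and the suffix here is an assumption:

```python
def default_output_name(settings, suffix="finetuned"):
    """Build a default output prefix from the "modelname" setting (sketch)."""
    modelname = getattr(settings, "modelname", None) or "model"
    return f"{modelname}-{suffix}"
```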
