Implement full model finetuning #9
base: master
Force-pushed from 402b687 to 2e571b0
Implemented via a new config parameter `load_pretrained_model`. Enables finetuning of another PaPie model by loading its state dict into the current model. Which parts to load can be customized via the subparameter `load_pretrained_model["exclude"]`.
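For illustration, the relevant part of the settings could look roughly like this (shown as a Python dict mirroring the JSON file; the keys `pretrained`, `exclude` and `labels_mode` come from this PR, the file name and the surrounding values are placeholders):

```python
# Hedged sketch of the new config section, written as a Python dict mirroring
# the JSON settings file. Only the key names mentioned in this PR are real.
settings = {
    # ... usual PaPie training settings ...
    "load_pretrained_model": {
        "pretrained": "pretrained_model.tar",  # parent model whose state dict is loaded
        "exclude": ["lm"],                     # parts whose weights should NOT be loaded
        "labels_mode": "expand",               # "expand" | "skip" | "replace" | "replace_fill"
    },
}
```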
…ding
- change variable name `model_tar` to `pretrained`
- new nested function `load_state_dict_label_by_label` that loads wemb, cemb, lm and tasks with linear decoders label by label
- variable `model_parts_to_load` can only contain parts that are actually in the model (e.g. "lm" only if `self.include_lm` is True)
- load tasks of pretrained models even when the case doesn't match (e.g. the "pos" task of the pretrained model can be loaded into the "POS" task of the new model)
- raise NotImplementedError if the task to be loaded does not correspond to a LinearDecoder (AttentionalDecoder support is planned; until then, the user should exclude the task in the config)
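For readers of the commit, a rough sketch (not the PR's actual code) of what loading a linear decoder "label by label" means: rows of the parent decoder are copied into the child decoder wherever the two label tables share a label. The function name and the `*_table` arguments are assumptions.

```python
import torch

def copy_linear_decoder_label_by_label(parent_linear, parent_table, child_linear, child_table):
    """Copy decoder rows for labels shared by parent and child models.

    parent_table / child_table map label strings to row indices; labels only
    present in the child keep their freshly initialized weights.
    """
    with torch.no_grad():
        for label, child_idx in child_table.items():
            parent_idx = parent_table.get(label)
            if parent_idx is None:
                continue  # new label: keep the random initialization
            child_linear.weight[child_idx] = parent_linear.weight[parent_idx]
            if child_linear.bias is not None:
                child_linear.bias[child_idx] = parent_linear.bias[parent_idx]
```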
`MultiLabelEncoder.fit()` has a new boolean option `expand_mode`, with which new labels/vocab are added to the freqs; a new method `LabelEncoder.expand_vocab()` is then called to extract a new list of labels/vocab and append the new entries at the end of `LabelEncoder.table` and `LabelEncoder.inverse_table`. The `expand_mode` option is optional and can be set to false/true (default true) in the JSON config ("load_pretrained_model"/"expand_labels"). If set to false in finetuning mode ("load_pretrained_model"/"pretrained": "pretrained_model.tar"), a new option `skip_fitted` is passed to `MultiLabelEncoder.fit()` instead, so that tasks defined in the new model but not in the pretrained one can still be fitted (example use case: fine-tune a model pretrained for POS tagging on a lemmatization task).
Also:
+ fix typo in `LabelEncoder.from_json()`
+ reorganize imports in module dataset.py
+ fix logging.basicConfig by adding the option force=True (fixes missing logs in stdout)
+ change logs in module dataset.py to calls to a module-specific logger (best practice)
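A minimal sketch of the "expand" behaviour described above, assuming `table` maps labels to indices and `inverse_table` is the index-to-label list (the attribute names come from the PR text, the function itself is illustrative):

```python
# Hedged sketch of the "expand" idea: append unseen labels at the end of an
# encoder's table/inverse_table so that pretrained indices stay valid.
def expand_vocab_sketch(table, inverse_table, new_labels):
    for label in new_labels:
        if label in table:
            continue                    # already known from the pretrained model
        table[label] = len(inverse_table)
        inverse_table.append(label)
    return table, inverse_table
```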
Previously, if two vocabulary entries had the same uppercasing, the inverse_table would include the duplicates, but the table would map the uppercased entry only to the last index, leaving one index missing. Found in an experiment where two characters were uppercased to "M", at positions 606 and 651: the table only contained "M": 651, so index 606 was missing, leading to random errors when the model tried to predict index 606.
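An illustrative reproduction of the problem (not the project's code): building the table as a plain dict over an uppercased inverse_table silently keeps only the last index for a colliding symbol.

```python
# Two distinct characters whose uppercase forms collide ("ſ" uppercases to "S").
inverse_table = ["a", "s", "ſ"]
upper_inverse = [c.upper() for c in inverse_table]    # ["A", "S", "S"]

# Building the table as a plain dict keeps only the LAST index for "S" ...
table = {sym: idx for idx, sym in enumerate(upper_inverse)}
print(table)   # {'A': 0, 'S': 2} -- index 1 has no entry

# ... so a prediction of index 1 can never be mapped back to a symbol.
```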
The option 'labels_mode' takes 3 possible values:
- "expand" (replaces expand_labels=true): append new vocab from the new data to the pretrained label encoders
- "skip" (replaces expand_labels=false): only fit new tasks that haven't been pretrained
- "replace": fit a new MultiLabelEncoder (pretrained params will still be loaded for common vocab entries)
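A hedged sketch of how the three modes differ in what gets fitted; the `fit()` keywords reuse the `expand_mode`/`skip_fitted` options mentioned earlier in this PR, but the exact signatures and constructor are assumptions.

```python
def fit_label_encoders(label_encoder, finetune_data, labels_mode):
    """Illustrative dispatch over the three labels_mode values described above."""
    if labels_mode == "expand":
        # keep the pretrained tables, append new labels/vocab from the new data
        label_encoder.fit(finetune_data, expand_mode=True)
    elif labels_mode == "skip":
        # leave pretrained tables untouched, fit only tasks absent from the parent
        label_encoder.fit(finetune_data, skip_fitted=True)
    elif labels_mode == "replace":
        # fit everything from scratch on the finetuning data; weights for entries
        # shared with the parent are still copied when its state dict is loaded
        label_encoder = type(label_encoder)()   # fresh encoder (constructor args assumed)
        label_encoder.fit(finetune_data)
    else:
        raise ValueError(f"unknown labels_mode: {labels_mode!r}")
    return label_encoder
```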
The user can pass a seed value either from the command line, with a new option `--seed`, or from the config file, where the default value is 'auto'.
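A minimal sketch of how such a `--seed`/'auto' option is typically resolved (the function and the 'auto' handling are assumptions, not the PR's code):

```python
import random
import numpy as np
import torch

def resolve_and_set_seed(seed="auto"):
    """Turn 'auto' into a fresh random seed, then seed the RNGs used in training."""
    if seed == "auto":
        seed = random.randint(0, 2**32 - 1)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    return seed   # returned so it can be logged and reproduced later
```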
This script can be useful in fine-tuning or reporting scenarios, as it aims to show the size of the vocabularies as well as the number and size of some important layers in the model.
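For illustration, a stripped-down version of such an inspection script could look like this; the loading call is an assumption (PaPie's .tar packaging may need its own loader rather than a raw `torch.load`):

```python
import torch

def report_model_sizes(path):
    """Print tensor names, shapes and parameter counts from a saved state dict.

    Hedged sketch: assumes `path` points to a checkpoint loadable with
    torch.load that contains a state dict.
    """
    state = torch.load(path, map_location="cpu")
    state_dict = state.get("state_dict", state) if isinstance(state, dict) else state
    total = 0
    for name, tensor in state_dict.items():
        if hasattr(tensor, "shape"):
            n = tensor.numel()
            total += n
            print(f"{name:60s} {tuple(tensor.shape)}  ({n:,} params)")
    print(f"total parameters: {total:,}")
```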
When "labels_mode" is "expand" but the vocabulary max sizes ("char_max_size"/"word_max_size") are smaller than the parent model's vocabularies, the intended behavior is to keep only the most frequent entries from the parent vocabulary. A bug in the code led to removing the entire vocabulary, including the reserved entries (e.g. <UNK>). This commit fixes that.
- New option 'replace_fill' enables replacing the vocab/labels with entries from the finetuning data, and then filling leftover spots with vocab/labels from the parent model (see the sketch after this list)
- Renamed variables in the MultiLabelEncoder.fit() method + created a property that returns a list with all LabelEncoder objects stored in the MultiLabelEncoder
- Modify the `__main__` of train.py to reflect the behavior when launching training from group.py (for debugging)
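A hedged sketch of the 'replace_fill' idea: build the vocabulary from the finetuning data first, then top it up with the parent model's most frequent entries until max_size is reached. The helper name, reserved tokens and argument shapes are assumptions, not the PR's code.

```python
from collections import Counter

def replace_fill_sketch(finetune_freqs, parent_freqs, max_size, reserved=("<UNK>", "<PAD>")):
    """Illustrative 'replace_fill': new data first, parent entries fill the rest."""
    vocab = list(reserved)                       # reserved tokens count toward max_size
    # 1) entries from the finetuning data, most frequent first
    for sym, _ in Counter(finetune_freqs).most_common():
        if len(vocab) >= max_size:
            return vocab
        if sym not in vocab:
            vocab.append(sym)
    # 2) fill leftover slots with the parent model's most frequent entries
    for sym, _ in Counter(parent_freqs).most_common():
        if len(vocab) >= max_size:
            break
        if sym not in vocab:
            vocab.append(sym)
    return vocab
```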
…e freqs + refactor LabelEncoder.expand_vocab
Using the parent LabelEncoder.freqs attribute makes it possible to store and use the frequencies of vocab entries/labels. This way, the min_freq config option can be applied, and only the most common parent items are added when the max_size option is set.
To inform whether all available new uppercase entries could be registered or not.
Different methods in the LabelEncoder class were altering the final vocabulary size when self.max_size is set, because the reserved entries were not always handled identically, leading to more or fewer model parameters than requested depending on the options passed. This commit attempts to fix this by counting the reserved tokens as part of max_size. In practice:
- compute_vocab will remove more entries to leave space for the reserved ones
- expand_vocab will subtract the reserved tokens when counting the number of slots left
- expand_vocab will no longer erroneously expand the size of the vocab when max_size is set to shrink the new vocab size
- register_upper will no longer subtract reserved entries twice when counting the number of slots left
Additionally, in expand_vocab the min_freq condition is now applied at the same time as the filtering that finds new symbols, for optimization purposes.
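A small worked example of the convention this commit settles on, i.e. reserved tokens count toward max_size (the numbers and helper are illustrative only):

```python
def slots_left(max_size, current_table_size):
    """Reserved entries are already in the table, so they count toward max_size."""
    return max(0, max_size - current_table_size)

# Example: max_size=100, 2 reserved tokens (<PAD>, <UNK>) plus 60 learned symbols
# already in the table -> 38 slots remain for new symbols, not 40.
print(slots_left(100, 2 + 60))   # 38
```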
Fix vocab sizes wrt. reserved tokens
Finetune: enable effectively excluding loading of parent weights in the wemb, cemb and task modules
New changes since April include some bug fixes, minor improvements such as prints/logs during the initialization phase, and a new option value "replace_fill" for the config option "load_pretrained_model"/"labels_mode" (which enables replacing the parent vocab with a new one computed from the fine-tuning data, completed with some parent labels if there is space left). I believe this PR is now ready for review.
I just added a section to the README to explain how to finetune PaPie models.
Implemented via a new config parameter `load_pretrained_model` and a new method `load_state_dict_from_pretrained()` in class `SimpleModel`. Enables finetuning of another PaPie model by loading its state dict into the current model. Which parts to load can be customized via the subparameter `load_pretrained_model["exclude"]`.

I already used my solution in an experiment in which I finetuned a PaPie POS tagger for Occitan: the model was pretrained on a large synthetic dataset, then finetuned with a smaller manually annotated dataset.
The results for the pretrained POS tagger are 91.19 / 82.72 / 89.24 (all tokens / unknown tokens / ambiguous tokens); the results for the finetuned POS tagger (tried with only one config, best of 5 runs) are 92.64 / 86.14 / 91.02.
I can also confirm that the state_dicts were successfully loaded, as I can see in the logs.
Not all parameters of the LM layers could be updated because the vocabulary size has changed (20002 for the pretrained model vs. 9818 for the finetuned model).
(I also developed it in a notebook to observe the state_dict tensors at every loading step.)
By the way, I noticed that the `load_pretrained_encoder` param might not be working, as I'm not sure the `pie.Encoder.load()` method can be called like this (`Encoder` doesn't seem to be imported in `pie.__init__`), and this method calls `pie.dataset.MultiLabelEncoder`, yet `MultiLabelEncoder` seems to have moved to `pie.data.dataset.MultiLabelEncoder`.

Since my solution enables loading only the encoder of a model, should I try to change the code so that the parameter `load_pretrained_encoder` also points to `load_state_dict_from_pretrained()` with `exclude=["lm", *tasks_names*]`?
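If that route is taken, the call could look roughly like this; a sketch only, since apart from the method name and the `exclude` parameter introduced in this PR, everything here (variable names, argument order) is assumed:

```python
# Hedged sketch: reuse the new loading path for the old "load_pretrained_encoder"
# option by excluding everything except the encoder-related parts.
# `model`, `path_to_parent` and `task_names` are placeholders.
model.load_state_dict_from_pretrained(
    path_to_parent,                      # parent model .tar
    exclude=["lm", *task_names],         # skip the LM and all task decoders
)
```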