
Implement full model finetuning #9

Open · wants to merge 22 commits into master from ON/finetune
Conversation

OrianeN commented Apr 5, 2024

Implemented via a new config parameter `load_pretrained_model` and a new method `load_state_dict_from_pretrained()` in class `SimpleModel`.

Enable finetuning of another PaPie model by loading its state dict into the current model.

Customization of which parts to load is possible via the subparameter `load_pretrained_model["exclude"]`.
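For illustration, a minimal sketch of what the relevant config section could look like (only the `pretrained` and `exclude` keys come from this PR; the surrounding values are made up):

```python
import json

# Hypothetical excerpt of a PaPie training config using the new parameter;
# only "pretrained" and "exclude" are taken from this PR, the rest is made up.
config = {
    "load_pretrained_model": {
        # path to the parent PaPie model whose state dict will be loaded
        "pretrained": "pretrained_model.tar",
        # model parts to skip when loading, e.g. the LM head or a task decoder
        "exclude": ["lm", "lemma"],
    }
}

with open("finetune_config.json", "w") as f:
    json.dump(config, f, indent=2)
```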

I have already used this solution in an experiment: a PaPie POS tagger for Occitan, pretrained on a large synthetic dataset, was finetuned with a smaller manually annotated dataset.

The results for the pretrained POS tagger are 91.19 / 82.72 / 89.24 (all tokens, unknown tokens, ambiguous tokens); the results of the finetuned POS tagger (tried with only one config, best of 5 runs) are 92.64 / 86.14 / 91.02.

I can also confirm that the state_dicts were successfully loaded, as I can see in the logs:

Loading pretrained model
Initialized 106/111 char embs
Initialized 2748/9818 fwd LM word parameters
Initialized 2748/9818 bwd LM word parameters

Not all parameters of the LM layers could be updated because the vocabulary size has changed (20002 for the pretrained model vs. 9818 for the finetuned model).

(I also developed it in a notebook to observe the state_dict tensors in all loading steps.)

By the way, I noticed that the `load_pretrained_encoder` param might not be working: I'm not sure the `pie.Encoder.load()` method can be called like this (`Encoder` doesn't seem to be imported in `pie.__init__`), and this method calls `pie.dataset.MultiLabelEncoder`, yet `MultiLabelEncoder` seems to have moved to `pie.data.dataset.MultiLabelEncoder`.
Since my solution enables loading only the encoder of a model, should I try to change the code so that the `load_pretrained_encoder` parameter also points to `load_state_dict_from_pretrained()` with `exclude=["lm", *tasks_names*]`?
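For clarity, a rough sketch of the redirection I have in mind (the `settings` dict, the task-name list, and the exact signature of `load_state_dict_from_pretrained()` are assumptions, not the current API):

```python
# Hypothetical redirection: reuse the new state-dict loading path for the
# existing `load_pretrained_encoder` option, excluding the LM and all tasks.
def maybe_load_pretrained_encoder(model, settings, task_names):
    pretrained_path = settings.get("load_pretrained_encoder")
    if pretrained_path:
        model.load_state_dict_from_pretrained(
            pretrained_path, exclude=["lm", *task_names]
        )
```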

@OrianeN OrianeN marked this pull request as draft April 11, 2024 12:28
@OrianeN OrianeN force-pushed the ON/finetune branch 3 times, most recently from 402b687 to 2e571b0 Compare April 18, 2024 08:40
Implemented via a new config parameter `load_pretrained_model`

Enable finetuning of another PaPie model by loading its state dict into the current model.

Customization of which parts to load is possible via the subparameter `load_pretrained_model["exclude"]`.
…ding

- change variable name `model_tar` to `pretrained`
- new nested function `load_state_dict_label_by_label` that loads wemb, cemb, lm and linear-decoder task weights label by label (see the sketch after this list)
- the variable `model_parts_to_load` can only contain parts that are actually present in the model (e.g. "lm" only if `self.include_lm` is True)
- load tasks of the pretrained model even when the case doesn't match (e.g. the "pos" task of the pretrained model can be loaded into the "POS" task of the new model)
- raise `NotImplementedError` if the task to be loaded does not correspond to a `LinearDecoder` (support for `AttentionalDecoder` is planned soon; in the meantime the user should exclude the task in the config)
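A minimal sketch of the label-by-label idea, with made-up tensors and tables (the actual method operates on the model's state dict and the PaPie label encoders):

```python
import torch

# For every label shared by the new and the pretrained vocabulary, copy the
# corresponding row of the pretrained weight matrix into the new one.
def load_weights_label_by_label(new_weight, old_weight, new_inverse_table, old_table):
    n_copied = 0
    with torch.no_grad():
        for new_idx, label in enumerate(new_inverse_table):
            old_idx = old_table.get(label)
            if old_idx is not None:
                new_weight[new_idx] = old_weight[old_idx]
                n_copied += 1
    return n_copied

# Toy character embedding with a partially overlapping vocabulary
old_table = {"a": 0, "b": 1, "c": 2}     # pretrained label -> index
old_emb = torch.randn(3, 4)
new_inverse_table = ["a", "c", "d"]      # "d" is new, keeps its fresh init
new_emb = torch.randn(3, 4)
copied = load_weights_label_by_label(new_emb, old_emb, new_inverse_table, old_table)
print(f"Initialized {copied}/{len(new_inverse_table)} char embs")
```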
`MultiLabelEncoder.fit()` has a new boolean option `expand_mode`: new labels/vocab are added to the freqs, then a new method `LabelEncoder.expand_vocab()` is called to extract the new labels/vocab and append new entries at the end of `LabelEncoder.table` and `LabelEncoder.inverse_table` (see the sketch below).

`expand_mode` is optional and can be set to False/True (default True) in the JSON config ("load_pretrained_model"/"expand_labels"). If set to False in finetuning mode ("load_pretrained_model"/"pretrained": "pretrained_model.tar"), a new option `skip_fitted` is passed to `MultiLabelEncoder.fit()` instead, so that tasks defined in the new model but not in the pretrained one can still be fitted (example use case: fine-tune a model pretrained for POS tagging on a lemmatization task).

+ Fix typo in `LabelEncoder.from_json()`
+ reorganize imports in module dataset.py
+ fix logging.basicConfig by adding the option force=True (fixes missing logs in stdout)
+ replace logs in module dataset.py with calls to a module-specific logger (best practice)
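A toy sketch of the expansion behaviour described above (a hypothetical stand-in, not PaPie's actual `expand_vocab()`; the real method also respects max_size and lives on the LabelEncoder class):

```python
from collections import Counter

# Append unseen symbols at the end of the vocabulary so that the indices of
# pretrained entries stay stable and their weights can still be reused.
def expand_vocab(table, inverse_table, freqs, min_freq=1):
    added = 0
    for sym, freq in freqs.most_common():
        if freq >= min_freq and sym not in table:
            table[sym] = len(inverse_table)
            inverse_table.append(sym)
            added += 1
    return added

table = {"<unk>": 0, "NOUN": 1, "VERB": 2}
inverse_table = ["<unk>", "NOUN", "VERB"]
new_freqs = Counter({"NOUN": 50, "ADJ": 7, "PUNCT": 3})
expand_vocab(table, inverse_table, new_freqs)
print(inverse_table)  # ['<unk>', 'NOUN', 'VERB', 'ADJ', 'PUNCT']
```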
Previously, if two vocabulary entries had the same uppercased form, the inverse_table would include the duplicates, but the table would map the uppercased entry only to the last index, leading to a missing index.

Found in an experiment where two characters were uppercased to "M", at positions 606 and 651: the table had only "M": 651, so index 606 was missing, leading to random errors when the model tried to predict index 606.
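A toy reproduction of the problem and of the fix, with illustrative names (not PaPie's actual code); here "s" and "ſ" (long s) both uppercase to "S":

```python
symbols = ["s", "ſ"]  # two distinct characters whose uppercase form is "S"

# Buggy registration: the second "S" overwrites the index of the first one
table, inverse_table = {}, []
for sym in symbols:
    upper = sym.upper()
    table[upper] = len(inverse_table)
    inverse_table.append(upper)
print(table, inverse_table)   # {'S': 1} ['S', 'S'] -> index 0 is unreachable

# Fixed registration: skip uppercase forms that are already in the table
table, inverse_table = {}, []
for sym in symbols:
    upper = sym.upper()
    if upper not in table:
        table[upper] = len(inverse_table)
        inverse_table.append(upper)
print(table, inverse_table)   # {'S': 0} ['S']
```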
The option 'labels_mode' takes 3 possible values:
- "expand" (replaces expand_labels=true): append new vocab from the new data to the pretrained label encoders
- "skip" (replaces expand_labels=false): only fit new tasks that haven't been pretrained
- "replace": fit a new MultiLabelEncoder (pretrained params will still be loaded for common vocab entries)
The user can pass a seed value either from the command line with a new option `--seed` or from the config file, where the default value is 'auto'.
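A minimal sketch of the 'auto' vs fixed-seed behaviour, assuming the seed is applied with the standard `random` and `torch` seeding calls (the exact places where PaPie seeds may differ):

```python
import random
import torch

# 'auto' draws a fresh seed; any other value is used as-is and makes runs reproducible.
def resolve_seed(seed="auto"):
    if seed == "auto":
        seed = random.randint(0, 2**32 - 1)
    random.seed(seed)
    torch.manual_seed(seed)
    return seed

print("Using seed:", resolve_seed("auto"))
print("Using seed:", resolve_seed(42))
```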
This script can be useful in fine-tuning or reporting scenarios,
as it aims to show the size of the vocabularies as well as
the number and size of some important layers in the model.
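A rough sketch of the kind of report such a script could produce; PaPie models are saved as .tar archives, so the real script unpacks the archive first, while this assumes a plain state-dict file:

```python
import torch

# Print each layer's name and shape plus the total parameter count.
def report_layers(state_dict_path):
    state_dict = torch.load(state_dict_path, map_location="cpu")
    total = 0
    for name, tensor in state_dict.items():
        total += tensor.numel()
        print(f"{name:60s} {tuple(tensor.shape)}")
    print(f"Total parameters: {total:,}")
```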
When "labels_mode" is "expand" but the vocabulary max sizes ("char_max_size"/"word_max_size") are smaller than the parent model's vocabularies, the intended behavior is to keep only the most frequent entries from the parent vocabulary.

A bug in the code led to removing the entire vocabulary, including the reserved entries (e.g. <UNK>). This commit fixes that.
- New option 'replace_fill' enables replacing the vocab/labels with entries from the finetuning data, then filling leftover slots with vocab/labels from the parent model (see the sketch after this list)
- Renamed variables in the MultiLabelEncoder.fit() method + created a property that returns a list of all LabelEncoder objects stored in the MultiLabelEncoder
- Modified the `__main__` of train.py to reflect the behavior when launching training from group.py (for debugging)
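An illustrative sketch of the 'replace_fill' idea (the function and variable names are made up; the real logic lives in the label encoders):

```python
from collections import Counter

# Fill with fine-tuning vocab first, then complete leftover slots with the
# most frequent entries from the parent model's vocabulary.
def replace_fill(finetune_freqs, parent_freqs, reserved, max_size):
    vocab = list(reserved)
    for sym, _ in finetune_freqs.most_common():
        if len(vocab) >= max_size:
            break
        if sym not in vocab:
            vocab.append(sym)
    for sym, _ in parent_freqs.most_common():
        if len(vocab) >= max_size:
            break
        if sym not in vocab:
            vocab.append(sym)
    return vocab

finetune_freqs = Counter({"ADV": 12, "NOUN": 30})
parent_freqs = Counter({"NOUN": 500, "VERB": 400, "ADJ": 300})
print(replace_fill(finetune_freqs, parent_freqs, ["<UNK>"], max_size=5))
# ['<UNK>', 'NOUN', 'ADV', 'VERB', 'ADJ']
```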
…e freqs + refactor LabelEncoder.expand_vocab

Using the parent LabelEncoder.freqs attributes enables storing and using frequencies of vocab entries/labels. This way, the min_freq config option can be used, and only the most common parent items are added with the max_size option.
To inform whether all available new uppercase entries could be registered or not.
OrianeN and others added 6 commits October 31, 2024 18:42
Different methods in the LabelEncoder class were altering the final vocabulary size when self.max_size is set, because the reserved entries were not always handled identically, leading to more or fewer model parameters than requested depending on the options passed.

This commit attempts to fix this by counting the reserved tokens as part of max_size (see the sketch after this list).

In practice:
- compute_vocab will remove more entries to leave space for the reserved ones
- expand_vocab will subtract the reserved tokens when counting the number of slots left
- expand_vocab will no longer erroneously expand the vocab size when max_size is set to shrink the new vocab
- register_upper will no longer subtract reserved entries twice when counting the number of slots left

Additionally, in expand_vocab the min_freq condition is now applied at the same time as the filtering for new symbols, for optimization purposes.
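An illustrative sketch of the intended accounting (made-up names): reserved tokens count toward max_size, so the number of free slots is max_size minus the number of reserved entries:

```python
from collections import Counter

# Keep the reserved entries, then fill the remaining slots with the most
# frequent entries so the final vocabulary never exceeds max_size.
def shrink_vocab(parent_freqs, reserved, max_size):
    free_slots = max_size - len(reserved)
    kept = [sym for sym, _ in parent_freqs.most_common(free_slots)]
    return list(reserved) + kept

parent_freqs = Counter({"the": 100, "of": 80, "and": 60, "king": 5})
print(shrink_vocab(parent_freqs, ["<PAD>", "<UNK>"], max_size=4))
# ['<PAD>', '<UNK>', 'the', 'of']
```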
Fix vocab sizes wrt. reserved tokens
Finetune: enable effectively excluding loading of parent weights in the wemb, cemb and task modules
OrianeN (Author) commented Nov 19, 2024

New changes since April include some bug fixes, minor improvements such as prints/logs during the initialization phase, and a new value "replace_fill" for the config option "load_pretrained_model"/"labels_mode" (it replaces the parent vocab with a new one computed from the fine-tuning data, completed with some parent labels if there is space left).

I believe this PR is now ready for review.

@OrianeN OrianeN marked this pull request as ready for review November 19, 2024 14:30
OrianeN (Author) commented Dec 10, 2024

I just added a section to the README to explain how to finetune PaPie models.
Don't hesitate to tell me if you'd still prefer to have this section in a separate file, and/or if you'd like to have other details in it.
