
Implement full model finetuning #9

Open · wants to merge 22 commits into master from ON/finetune
Conversation

OrianeN commented Apr 5, 2024

Implemented via a new config parameter `load_pretrained_model` and a new method `load_state_dict_from_pretrained()` in class `SimpleModel`.

Enable finetuning of another PaPie model by loading its state dict into the current model.

Customization of which parts to load is possible via the subparameter `load_pretrained_model["exclude"]`.
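For illustration, a minimal sketch of what the relevant config section could look like (only the `pretrained` and `exclude` keys come from this PR; the surrounding values are made up):

```python
import json

# Hypothetical excerpt of a PaPie training config using the new parameter;
# only "pretrained" and "exclude" are taken from this PR, the rest is made up.
config = {
    "load_pretrained_model": {
        # path to the parent PaPie model whose state dict will be loaded
        "pretrained": "pretrained_model.tar",
        # model parts to skip when loading, e.g. the LM head or a task decoder
        "exclude": ["lm", "lemma"],
    }
}

with open("finetune_config.json", "w") as f:
    json.dump(config, f, indent=2)
```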

I have already used this solution in an experiment: a PaPie POS tagger for Occitan, pretrained on a large synthetic dataset, was finetuned with a smaller manually annotated dataset.

The results for the pretrained POS tagger are 91.19 / 82.72 / 89.24 (all tokens, unknown tokens, ambiguous tokens); the results of the finetuned POS tagger (tried with only one config, best of 5 runs) are 92.64 / 86.14 / 91.02.

I can also confirm that the state_dicts were successfully loaded, as I can see in the logs:

Loading pretrained model
Initialized 106/111 char embs
Initialized 2748/9818 fwd LM word parameters
Initialized 2748/9818 bwd LM word parameters

Not all parameters of the LM layers could be updated because the vocabulary size has changed (20002 for the pretrained model vs. 9818 for the finetuned model).

(I also developed it in a notebook to observe the state_dict tensors in all loading steps.)

By the way, I noticed that the `load_pretrained_encoder` param might not be working: I'm not sure the `pie.Encoder.load()` method can be called like this (`Encoder` doesn't seem to be imported in `pie.__init__`), and this method calls `pie.dataset.MultiLabelEncoder`, yet `MultiLabelEncoder` seems to have moved to `pie.data.dataset.MultiLabelEncoder`.
Since my solution enables loading only the encoder of a model, should I try to change the code so that the `load_pretrained_encoder` parameter also points to `load_state_dict_from_pretrained()` with `exclude=["lm", *tasks_names*]`?
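For clarity, a rough sketch of the redirection I have in mind (the `settings` dict, the task-name list, and the exact signature of `load_state_dict_from_pretrained()` are assumptions, not the current API):

```python
# Hypothetical redirection: reuse the new state-dict loading path for the
# existing `load_pretrained_encoder` option, excluding the LM and all tasks.
def maybe_load_pretrained_encoder(model, settings, task_names):
    pretrained_path = settings.get("load_pretrained_encoder")
    if pretrained_path:
        model.load_state_dict_from_pretrained(
            pretrained_path, exclude=["lm", *task_names]
        )
```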

@OrianeN OrianeN marked this pull request as draft April 11, 2024 12:28
@OrianeN OrianeN force-pushed the ON/finetune branch 3 times, most recently from 402b687 to 2e571b0 Compare April 18, 2024 08:40
Implemented via a new config parameter `load_pretrained_model`

Enable finetuning of another PaPie model by loading its state dict into the current model.

Customization of which parts to load is possible via the subparameter `load_pretrained_model["exclude"]`.
…ding

- change variable name `model_tar` to `pretrained`
- new nested function `load_state_dict_label_by_label` that loads wemb, cemb, lm and linear-decoder task weights label by label (see the sketch after this list)
- the variable `model_parts_to_load` can only contain parts that are actually present in the model (e.g. "lm" only if `self.include_lm` is True)
- load tasks of the pretrained model even when the case doesn't match (e.g. the "pos" task of the pretrained model can be loaded into the "POS" task of the new model)
- raise `NotImplementedError` if the task to be loaded does not correspond to a `LinearDecoder` (support for `AttentionalDecoder` is planned soon; in the meantime the user should exclude the task in the config)
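A minimal sketch of the label-by-label idea, with made-up tensors and tables (the actual method operates on the model's state dict and the PaPie label encoders):

```python
import torch

# For every label shared by the new and the pretrained vocabulary, copy the
# corresponding row of the pretrained weight matrix into the new one.
def load_weights_label_by_label(new_weight, old_weight, new_inverse_table, old_table):
    n_copied = 0
    with torch.no_grad():
        for new_idx, label in enumerate(new_inverse_table):
            old_idx = old_table.get(label)
            if old_idx is not None:
                new_weight[new_idx] = old_weight[old_idx]
                n_copied += 1
    return n_copied

# Toy character embedding with a partially overlapping vocabulary
old_table = {"a": 0, "b": 1, "c": 2}     # pretrained label -> index
old_emb = torch.randn(3, 4)
new_inverse_table = ["a", "c", "d"]      # "d" is new, keeps its fresh init
new_emb = torch.randn(3, 4)
copied = load_weights_label_by_label(new_emb, old_emb, new_inverse_table, old_table)
print(f"Initialized {copied}/{len(new_inverse_table)} char embs")
```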
`MultiLabelEncoder.fit()` has a new boolean option `expand_mode`: new labels/vocab are added to the freqs, then a new method `LabelEncoder.expand_vocab()` is called to extract the new labels/vocab and append new entries at the end of `LabelEncoder.table` and `LabelEncoder.inverse_table` (see the sketch below).

`expand_mode` is optional and can be set to False/True (default True) in the JSON config ("load_pretrained_model"/"expand_labels"). If set to False in finetuning mode ("load_pretrained_model"/"pretrained": "pretrained_model.tar"), a new option `skip_fitted` is passed to `MultiLabelEncoder.fit()` instead, so that tasks defined in the new model but not in the pretrained one can still be fitted (example use case: fine-tune a model pretrained for POS tagging on a lemmatization task).

+ Fix typo in `LabelEncoder.from_json()`
+ reorganize imports in module dataset.py
+ fix logging.basicConfig by adding the option force=True (fixes missing logs in stdout)
+ replace logs in module dataset.py with calls to a module-specific logger (best practice)
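A toy sketch of the expansion behaviour described above (a hypothetical stand-in, not PaPie's actual `expand_vocab()`; the real method also respects max_size and lives on the LabelEncoder class):

```python
from collections import Counter

# Append unseen symbols at the end of the vocabulary so that the indices of
# pretrained entries stay stable and their weights can still be reused.
def expand_vocab(table, inverse_table, freqs, min_freq=1):
    added = 0
    for sym, freq in freqs.most_common():
        if freq >= min_freq and sym not in table:
            table[sym] = len(inverse_table)
            inverse_table.append(sym)
            added += 1
    return added

table = {"<unk>": 0, "NOUN": 1, "VERB": 2}
inverse_table = ["<unk>", "NOUN", "VERB"]
new_freqs = Counter({"NOUN": 50, "ADJ": 7, "PUNCT": 3})
expand_vocab(table, inverse_table, new_freqs)
print(inverse_table)  # ['<unk>', 'NOUN', 'VERB', 'ADJ', 'PUNCT']
```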
Previously, if two vocabulary entries had the same uppercased form, the inverse_table would include the duplicates, but the table would map the uppercased entry only to the last index, leading to a missing index.

Found in an experiment where two characters were uppercased to "M", at positions 606 and 651: the table had only "M": 651, so index 606 was missing, leading to random errors when the model tried to predict index 606.
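A toy reproduction of the problem and of the fix, with illustrative names (not PaPie's actual code); here "s" and "ſ" (long s) both uppercase to "S":

```python
symbols = ["s", "ſ"]  # two distinct characters whose uppercase form is "S"

# Buggy registration: the second "S" overwrites the index of the first one
table, inverse_table = {}, []
for sym in symbols:
    upper = sym.upper()
    table[upper] = len(inverse_table)
    inverse_table.append(upper)
print(table, inverse_table)   # {'S': 1} ['S', 'S'] -> index 0 is unreachable

# Fixed registration: skip uppercase forms that are already in the table
table, inverse_table = {}, []
for sym in symbols:
    upper = sym.upper()
    if upper not in table:
        table[upper] = len(inverse_table)
        inverse_table.append(upper)
print(table, inverse_table)   # {'S': 0} ['S']
```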
The option 'labels_mode' takes 3 possible values:
- "expand" (replaces expand_labels=true): append new vocab from the new data to the pretrained label encoders
- "skip" (replaces expand_labels=false): only fit new tasks that haven't been pretrained
- "replace": fit a new MultiLabelEncoder (pretrained params will still be loaded for common vocab entries)
The user can pass a seed value either from the command line with a new option `--seed` or from the config file, where the default value is 'auto'.
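A minimal sketch of the 'auto' vs fixed-seed behaviour, assuming the seed is applied with the standard `random` and `torch` seeding calls (the exact places where PaPie seeds may differ):

```python
import random
import torch

# 'auto' draws a fresh seed; any other value is used as-is and makes runs reproducible.
def resolve_seed(seed="auto"):
    if seed == "auto":
        seed = random.randint(0, 2**32 - 1)
    random.seed(seed)
    torch.manual_seed(seed)
    return seed

print("Using seed:", resolve_seed("auto"))
print("Using seed:", resolve_seed(42))
```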
This script can be useful in fine-tuning or reporting scenarios,
as it aims to show the size of the vocabularies as well as
the number and size of some important layers in the model.
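A rough sketch of the kind of report such a script could produce; PaPie models are saved as .tar archives, so the real script unpacks the archive first, while this assumes a plain state-dict file:

```python
import torch

# Print each layer's name and shape plus the total parameter count.
def report_layers(state_dict_path):
    state_dict = torch.load(state_dict_path, map_location="cpu")
    total = 0
    for name, tensor in state_dict.items():
        total += tensor.numel()
        print(f"{name:60s} {tuple(tensor.shape)}")
    print(f"Total parameters: {total:,}")
```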
When "labels_mode" is "expand" but the vocabulary max sizes ("char_max_size"/"word_max_size") are smaller than the parent model's vocabularies, the intended behavior is to keep only the most frequent entries from the parent vocabulary.

A bug in the code led to removing the entire vocabulary, including the reserved entries (e.g. <UNK>). This commit fixes that.
- New option 'replace_fill' enables replacing the vocab/labels with entries from the finetuning data, then filling leftover slots with vocab/labels from the parent model (see the sketch after this list)
- Renamed variables in the MultiLabelEncoder.fit() method + created a property that returns a list of all LabelEncoder objects stored in the MultiLabelEncoder
- Modified the `__main__` of train.py to reflect the behavior when launching training from group.py (for debugging)
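An illustrative sketch of the 'replace_fill' idea (the function and variable names are made up; the real logic lives in the label encoders):

```python
from collections import Counter

# Fill with fine-tuning vocab first, then complete leftover slots with the
# most frequent entries from the parent model's vocabulary.
def replace_fill(finetune_freqs, parent_freqs, reserved, max_size):
    vocab = list(reserved)
    for sym, _ in finetune_freqs.most_common():
        if len(vocab) >= max_size:
            break
        if sym not in vocab:
            vocab.append(sym)
    for sym, _ in parent_freqs.most_common():
        if len(vocab) >= max_size:
            break
        if sym not in vocab:
            vocab.append(sym)
    return vocab

finetune_freqs = Counter({"ADV": 12, "NOUN": 30})
parent_freqs = Counter({"NOUN": 500, "VERB": 400, "ADJ": 300})
print(replace_fill(finetune_freqs, parent_freqs, ["<UNK>"], max_size=5))
# ['<UNK>', 'NOUN', 'ADV', 'VERB', 'ADJ']
```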
…e freqs + refactor LabelEncoder.expand_vocab

Using the parent LabelEncoder.freqs attributes enables storing and using frequencies of vocab entries/labels. This way, the min_freq config option can be used, and only the most common parent items are added with the max_size option.
To inform whether all available new uppercase entries could be registered or not.
OrianeN and others added 6 commits October 31, 2024 18:42
Different methods in the LabelEncoder class were altering the final vocabulary size when self.max_size is set, because the reserved entries were not always handled identically, leading to more or fewer model parameters than requested depending on the options passed.

This commit attempts to fix this by counting the reserved tokens as part of max_size (see the sketch after this list).

In practice:
- compute_vocab will remove more entries to leave space for the reserved ones
- expand_vocab will subtract the reserved tokens when counting the number of slots left
- expand_vocab will no longer erroneously expand the vocab size when max_size is set to shrink the new vocab
- register_upper will no longer subtract reserved entries twice when counting the number of slots left

Additionally, in expand_vocab the min_freq condition is now applied at the same time as the filtering for new symbols, for optimization purposes.
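An illustrative sketch of the intended accounting (made-up names): reserved tokens count toward max_size, so the number of free slots is max_size minus the number of reserved entries:

```python
from collections import Counter

# Keep the reserved entries, then fill the remaining slots with the most
# frequent entries so the final vocabulary never exceeds max_size.
def shrink_vocab(parent_freqs, reserved, max_size):
    free_slots = max_size - len(reserved)
    kept = [sym for sym, _ in parent_freqs.most_common(free_slots)]
    return list(reserved) + kept

parent_freqs = Counter({"the": 100, "of": 80, "and": 60, "king": 5})
print(shrink_vocab(parent_freqs, ["<PAD>", "<UNK>"], max_size=4))
# ['<PAD>', '<UNK>', 'the', 'of']
```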
Fix vocab sizes wrt. reserved tokens
Finetune: enable effectively excluding loading of parent weights in the wemb, cemb and task modules
OrianeN (Author) commented Nov 19, 2024

New changes since April include some bug fixes, minor improvements such as prints/logs during the initialization phase, and a new value "replace_fill" for the config option "load_pretrained_model"/"labels_mode" (it replaces the parent vocab with a new one computed from the fine-tuning data, completed with some parent labels if there is space left).

I believe this PR is now ready for review.

@OrianeN OrianeN marked this pull request as ready for review November 19, 2024 14:30
OrianeN (Author) commented Dec 10, 2024

I just added a section to the README to explain how to finetune PaPie models.
Don't hesitate to tell me if you'd still prefer to have this section in a separate file, and/or if you'd like to have other details in it.
