sequence labelling, n-fold training should use separate preprocessors #102

de-code opened this issue Apr 23, 2020 · 1 comment

de-code commented Apr 23, 2020

(Not sure if this was discussed before.)
It seems that the train_nfold method allows the preprocessor to see the whole dataset.
It may be more correct to use a separate preprocessor for each split.
The difference would be how the model handles unseen characters, feature tokens, etc.
(How likely it is that some characters appear only in the validation split may depend on the dataset.)

/cc @kermitt2 @lfoppiano
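
For concreteness, here is a minimal sketch of what a per-fold setup could look like. Only scikit-learn's KFold is a real API here; build_preprocessor, build_model, model.train and model.eval are hypothetical placeholders standing in for the project's actual preprocessor and model classes.

```python
# Minimal sketch of per-fold preprocessing.
# Assumption: build_preprocessor/build_model are illustrative placeholders,
# not the project's actual API; only sklearn's KFold is real.
import numpy as np
from sklearn.model_selection import KFold

def train_nfold_separate_preprocessors(x, y, build_preprocessor, build_model, n_folds=10):
    """Fit one fresh preprocessor per fold, on the training split only,
    so the held-out split really contains unseen characters/words/features."""
    x = np.asarray(x, dtype=object)
    y = np.asarray(y, dtype=object)
    scores = []
    for fold, (train_idx, valid_idx) in enumerate(
            KFold(n_splits=n_folds, shuffle=True).split(x)):
        preprocessor = build_preprocessor()            # un-fitted, created per fold
        preprocessor.fit(x[train_idx], y[train_idx])   # vocab/char/feature maps from this fold only
        model = build_model(preprocessor)
        model.train(x[train_idx], y[train_idx])
        score = model.eval(x[valid_idx], y[valid_idx])  # assumed to return e.g. an f1 value
        scores.append(score)
        print(f"fold {fold}: {score}")
    return scores
```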

@kermitt2

Thank you @de-code!
I don't think it was discussed before.

This is correct: the preprocessor is initialized on the whole training set in the n-fold scenario.

At first glance, my impression is that it has no impact. Using a fold for training restricts the model to the characters, words and feature values present in that fold, so nothing is learned for indices outside the fold. When we evaluate on the excluded 10%, unseen characters/words/feature values might be mapped to existing indices, but I don't see how that can affect the processing if those indices are ignored by the model: whether they are mapped to UNK, to 0 or to 104, they are all discrete values for which nothing has been learnt.
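
A toy illustration of this argument (the vocabularies and the out-of-fold token below are made up): whichever index an out-of-fold token ends up with, a model trained only on that fold has never updated the parameters associated with it.

```python
# Toy illustration (made-up vocabularies): an out-of-fold token gets *some*
# index in both setups, but the model trained on the fold never learns
# anything for that index either way.
vocab_whole_set = {"<UNK>": 0, "the": 1, "cat": 2, "xylophone": 3}  # preprocessor fitted on all data
vocab_fold_only = {"<UNK>": 0, "the": 1, "cat": 2}                  # preprocessor fitted on the fold

def to_index(token, vocab):
    # unseen tokens fall back to the UNK index
    return vocab.get(token, vocab["<UNK>"])

# 'xylophone' only occurs in the held-out 10%:
print(to_index("xylophone", vocab_whole_set))  # 3 -> an index the fold's model never trained on
print(to_index("xylophone", vocab_fold_only))  # 0 -> the generic UNK index, also untrained
```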

As for the embeddings: since the whole set of static embeddings is loaded, unseen words are mapped to embeddings independently of whether they occur in the training set.
