sequence labelling, n-fold training should use separate preprocessors #102

de-code opened this issue Apr 23, 2020 · 1 comment

de-code commented Apr 23, 2020

(Not sure if this was discussed before.)
It seems that the train_nfold method allows the preprocessor to see the whole dataset.
It may be more correct to use a separate preprocessor for each split.
The difference would be how the model handles unseen characters, feature tokens, etc.
(How likely it is that some characters appear only in the validation split may depend on the dataset.)

/cc @kermitt2 @lfoppiano
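
For concreteness, here is a minimal sketch of what a per-fold setup could look like. Only scikit-learn's KFold is a real API here; build_preprocessor, build_model, model.train and model.eval are hypothetical placeholders standing in for the project's actual preprocessor and model classes.

```python
# Minimal sketch of per-fold preprocessing.
# Assumption: build_preprocessor/build_model are illustrative placeholders,
# not the project's actual API; only sklearn's KFold is real.
import numpy as np
from sklearn.model_selection import KFold

def train_nfold_separate_preprocessors(x, y, build_preprocessor, build_model, n_folds=10):
    """Fit one fresh preprocessor per fold, on the training split only,
    so the held-out split really contains unseen characters/words/features."""
    x = np.asarray(x, dtype=object)
    y = np.asarray(y, dtype=object)
    scores = []
    for fold, (train_idx, valid_idx) in enumerate(
            KFold(n_splits=n_folds, shuffle=True).split(x)):
        preprocessor = build_preprocessor()            # un-fitted, created per fold
        preprocessor.fit(x[train_idx], y[train_idx])   # vocab/char/feature maps from this fold only
        model = build_model(preprocessor)
        model.train(x[train_idx], y[train_idx])
        score = model.eval(x[valid_idx], y[valid_idx])  # assumed to return e.g. an f1 value
        scores.append(score)
        print(f"fold {fold}: {score}")
    return scores
```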

@kermitt2

Thank you @de-code!
I don't think it was discussed before.

This is correct: the preprocessor is initialized on the whole training set in the n-fold scenario.

At first glance, my impression is that it has no impact. Using a fold for training restricts the model to the characters, words and feature values present in that fold, so nothing is learned for indices outside the fold. When we evaluate on the excluded 10%, unseen characters/words/feature values might be mapped to existing indices, but I don't see how that can affect the processing if those indices are ignored by the model: whether they are mapped to UNK, to 0 or to 104, they are all discrete values for which nothing has been learnt.
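
A toy illustration of this argument (the vocabularies and the out-of-fold token below are made up): whichever index an out-of-fold token ends up with, a model trained only on that fold has never updated the parameters associated with it.

```python
# Toy illustration (made-up vocabularies): an out-of-fold token gets *some*
# index in both setups, but the model trained on the fold never learns
# anything for that index either way.
vocab_whole_set = {"<UNK>": 0, "the": 1, "cat": 2, "xylophone": 3}  # preprocessor fitted on all data
vocab_fold_only = {"<UNK>": 0, "the": 1, "cat": 2}                  # preprocessor fitted on the fold

def to_index(token, vocab):
    # unseen tokens fall back to the UNK index
    return vocab.get(token, vocab["<UNK>"])

# 'xylophone' only occurs in the held-out 10%:
print(to_index("xylophone", vocab_whole_set))  # 3 -> an index the fold's model never trained on
print(to_index("xylophone", vocab_fold_only))  # 0 -> the generic UNK index, also untrained
```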

As for the embeddings: since the whole set of static embeddings is loaded, unseen words are mapped to embeddings independently of whether they occur in the training set.
