(Not sure if this was discussed before.)
It seems that the train_nfold method allows the preprocessor to see the whole dataset. It may be more correct to use a separate preprocessor for each split.
The difference would be in how the model handles unseen characters, feature tokens, etc.
(How likely it is that the validation split contains characters missing from the training folds may depend on the dataset.)
/cc @kermitt2 @lfoppiano
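For concreteness, a minimal self-contained sketch of the per-fold idea. The `ToyPreprocessor` below is purely illustrative (it is not DeLFT's actual preprocessor class): the vocabulary is fitted on the training fold alone, so validation-fold tokens that were never seen during fitting fall back to UNK, exactly as they would on genuinely new data. With a single preprocessor fitted on the whole dataset, that fallback would almost never trigger during cross-validation, even though it does at inference time.

```python
from sklearn.model_selection import KFold

UNK = 1  # index 0 is reserved for padding

class ToyPreprocessor:
    """Toy stand-in for a sequence-labelling preprocessor (illustrative only)."""

    def fit(self, sentences):
        # Build the vocabulary from the training fold only.
        vocab = sorted({token for sentence in sentences for token in sentence})
        self.index = {token: i + 2 for i, token in enumerate(vocab)}  # 0=pad, 1=UNK
        return self

    def transform(self, sentences):
        # Tokens absent from the fitted vocabulary fall back to UNK.
        return [[self.index.get(token, UNK) for token in sentence]
                for sentence in sentences]

sentences = [["the", "cat"], ["a", "dog"], ["the", "dog"], ["a", "bird"]]

for train_idx, valid_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(sentences):
    # One preprocessor per split, fitted on the training fold alone.
    preprocessor = ToyPreprocessor().fit([sentences[i] for i in train_idx])
    print(preprocessor.transform([sentences[i] for i in valid_idx]))
```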
Thank you @de-code!
I don't think it was discussed before.
This is correct, the preprocessor is initialized on the whole training set in the n-fold scenario.
At first glance, my impression is that it has no impact. Using a fold for training restricts the model to the characters, words and feature values present in that fold, so nothing is learnt for indices that occur only outside it. When we evaluate on the excluded 10%, unseen characters/words/feature values might be mapped to existing indices, but I don't see how that can affect the results, since those indices are ignored by the model: whether they are mapped to UNK, to 0 or to 104, these are all discrete values for which nothing was learnt.
About the embeddings: since the complete static embeddings are loaded, unseen words are mapped to their embeddings independently of whether the word occurs in the training set.
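To illustrate the argument, a small NumPy sketch (a hypothetical one-step SGD update on an embedding table, not DeLFT code): only the rows actually looked up during training receive gradient updates, so a row whose index never occurs in the training fold keeps its initialization, just like UNK.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
embeddings = rng.normal(size=(vocab_size, dim))
initial = embeddings.copy()

# Indices that actually occur in the training fold; index 3 never does.
train_batch = np.array([0, 2, 2, 5])
upstream_grad = rng.normal(size=(len(train_batch), dim))  # stand-in gradient

# One SGD step on the embedding table: gradients are scattered only onto
# the rows that were looked up.
np.add.at(embeddings, train_batch, -0.1 * upstream_grad)

# Row 3 is untouched, so whether a validation token maps to index 3 or to
# UNK makes no difference: nothing was learnt for either.
assert np.allclose(embeddings[3], initial[3])
```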