Changes to input format of pretokenized text
Since spaCy 3.2.0, the requirements on the data that is passed to a spaCy pipeline have become stricter. This means that
passing a list of pretokenized tokens (["This", "is", "a", "pretokenized", "sentence"]) is not accepted any more. Therefore,
the is_tokenized option needed to be adapted to reflect this. It is still possible to pass a string in which tokens
are separated by whitespace, e.g. "This is a pretokenized sentence", which will continue to work for spaCy and
stanza. Support for pretokenized data has been dropped for UDPipe.
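
For code that still builds token lists, one way to migrate is to join the tokens into a single whitespace-separated string before handing it to the parser. The sketch below is a minimal illustration in plain Python; the helper name to_pretokenized_string is hypothetical and not part of this library, and the check on each token is an assumption about what a whitespace-separated format can represent.

```python
from typing import List


def to_pretokenized_string(tokens: List[str]) -> str:
    """Join a list of tokens into one whitespace-separated string.

    Hypothetical helper for illustration only; it is not part of the library.
    """
    for token in tokens:
        # An empty token would vanish, and a token with inner whitespace
        # would be read back as several tokens, so reject both.
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"Token cannot be whitespace-separated: {token!r}")
    return " ".join(tokens)


# Old-style input: a list of tokens (no longer accepted).
tokens = ["This", "is", "a", "pretokenized", "sentence"]

# New-style input: a single string with tokens separated by whitespace.
text = to_pretokenized_string(tokens)
print(text)
```

The resulting string is the form that would then be passed to a spaCy or stanza pipeline configured for pretokenized input.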
Specific changes:
- [conllparser] Breaking change: is_tokenized is not a valid argument to ConllParser any more.
- [utils/conllparser] Breaking change: when using UDPipe, pretokenized data is not supported any more.
- [utils] Breaking change: SpacyPretokenizedTokenizer.__call__ does not support a list of tokens any more.