Is it expected that this code can be applied to a Chinese-language dataset with only minor changes?

I understand that I will need to provide the following:

I would very much appreciate any insight into any known reasons why this might not work.

Yes, only the modifications you mentioned should be needed. The trickiest part will be the Chinese SpaCy models, which are not officially supported.
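As a rough sketch (my own illustration, not anything shipped with DrQA): spaCy exposes a Chinese language class even without an officially released pretrained model, so a blank pipeline can at least be instantiated and later extended with custom-trained components:

```python
# A rough sketch, not an official DrQA or spaCy recipe. Depending on the
# spaCy version, the Chinese tokenizer may require the jieba package.
import spacy

nlp = spacy.blank("zh")  # blank Chinese pipeline; no POS/NER out of the box
doc = nlp("这是一个分词后的样例")
print([token.text for token in doc])
```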
@hitvoice Thanks very much for the confirmation!

One more question: given that Chinese has no natural word boundaries, does it make any difference to DrQA whether the dataset is word-segmented first (e.g., with a tool such as Jieba)? Or, since SpaCy does tokenization in its own way, can I assume I don't need to do anything special in this respect?
You should tokenize your Chinese data first. Prepare your data as "这是 一个 分词 后 的 样例" (tokens separated by spaces; the sentence reads "this is a segmented example") and provide the corresponding POS and NER tags. This is not a simple copy-and-paste job; a lot of work and modification is needed for Chinese support.
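For illustration, here is a minimal sketch (my own example, not part of DrQA) of producing that space-separated format with Jieba. Jieba can also supply POS tags via its posseg module; NER tags would have to come from a separate tool or model:

```python
# A minimal sketch (not part of DrQA) of segmenting Chinese text into the
# space-separated format described above, with POS tags from Jieba.
import jieba
import jieba.posseg as pseg

text = "这是一个分词后的样例"

# Space-separated tokens, e.g. "这是 一个 分词 后 的 样例"
print(" ".join(jieba.lcut(text)))

# (token, POS) pairs; NER tags would need a separate tool or model.
print([(word, pos) for word, pos in pseg.lcut(text)])
```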