In stream mode, English words have no spaces after detokenization and Chinese characters are garbled #197
Comments
Hi @lucasjinreal. We need more information in order to assist you in resolving the issue. May I ask which model you are using? Are you using it through the API or through Python?
@peakji I think it's not related to the model. I'm simply using LLaMA. The reason is that decoding ids one at a time can give a different result from decoding the same ids together as a sentence. For instance, for the ids [34, 56, 656], the tokenizer would decode: I love u. But if you decode them one by one, you get: Iloveu. The spaces are not preserved, and Chinese characters are even worse. However, I'm not sure whether this is really the cause. These are the problems I'm actually seeing, though. What do you think? (Mainly, simple words lose the spaces they had in the original, and Chinese is decoded incorrectly.)
Or maybe something is missing inside your StreamTokenizer (for example, some ids being ignored)? Can you try decoding the ids one by one and printing the result?
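A minimal sketch of the effect described above, assuming a SentencePiece-style tokenizer in which each piece carries a word-boundary marker (U+2581) that the decoder turns into a space and then strips from the start of the result. The ids are the ones from the comment, but the id-to-piece mapping and the `decode` helper are hypothetical and are not taken from LLaMA or Basaran.

```python
# Hypothetical SentencePiece-style vocabulary: the ids come from the comment
# above, but the id-to-piece mapping is invented for illustration.
VOCAB = {34: "\u2581I", 56: "\u2581love", 656: "\u2581u"}


def decode(ids):
    """Join pieces, turn the boundary marker into a space, drop one leading space."""
    text = "".join(VOCAB[i] for i in ids).replace("\u2581", " ")
    return text[1:] if text.startswith(" ") else text


ids = [34, 56, 656]

# Decoding the whole sequence keeps the spaces between words.
whole = decode(ids)

# Decoding each id on its own strips the leading space every time,
# so the words run together when the pieces are concatenated.
one_by_one = "".join(decode([i]) for i in ids)

print(whole)       # I love u
print(one_by_one)  # Iloveu
```

Under this assumption the spaces are not lost by the model; they are dropped because every single-token decode treats its token as the start of a new text.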
I was wrong.
There is an example of the LLaMA tokenizer in the test case, which also includes Chinese characters: https://github.com/hyperonym/basaran/blob/master/tests/test_tokenizer.py#L48
@peakji Thanks. I'm just using the tokenizer of StreamModel, and the Chinese decoding errors still exist. I still can't get the spaces between English words either. I think the output stream has some problem. How can I combine the model and the tokenizer so that the correct words are printed in the terminal?
I get no spaces, and the Chinese is wrong as well (try print(word, end='')). I don't want a line break after every word, and I don't want unexpected spaces between non-English characters.
Could you please provide some example code for us to reproduce the issue? The output in your first screenshot is apparently not from
@peakji The second one is. I first create model = StreamModel(model, tokenizer) and then use model.tokenizer to decode. Could you provide a demo that prints the correct values without line breaks (printing every word correctly, one by one)?
You shouldn't use The correct way could be either:
a. Call the model directly without the need for manual detokenization: basaran/examples/basaran-python-library/main.py, lines 8 to 9 in 5ef5ef0
b. Create an instance of StreamTokenizer and use that instead: basaran/tests/test_tokenizer.py, lines 54 to 61 in 5ef5ef0
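To illustrate why a stateful detokenizer helps, here is a rough sketch of one common streaming approach: decode all ids seen so far and emit only the newly added suffix. This is not Basaran's StreamTokenizer implementation. `stream_decode`, `toy_decode`, and the vocabulary are hypothetical names made up for this example, and the sketch assumes that decoding a prefix of the ids always yields a prefix of the full text.

```python
from typing import Callable, Iterable, Iterator, List


def stream_decode(
    ids: Iterable[int], decode: Callable[[List[int]], str]
) -> Iterator[str]:
    """Yield the text added by each new id (hypothetical helper, not a library API)."""
    seen: List[int] = []
    emitted = ""
    for token_id in ids:
        seen.append(token_id)
        text = decode(seen)        # decode everything received so far
        yield text[len(emitted):]  # emit only the part that is new
        emitted = text


# Toy SentencePiece-style decoder; the id-to-piece mapping is invented.
VOCAB = {34: "\u2581I", 56: "\u2581love", 656: "\u2581u"}


def toy_decode(ids: List[int]) -> str:
    text = "".join(VOCAB[i] for i in ids).replace("\u2581", " ")
    return text[1:] if text.startswith(" ") else text


chunks = list(stream_decode([34, 56, 656], toy_decode))
for chunk in chunks:
    # end="" avoids a line break per word; flush=True shows each chunk at once.
    print(chunk, end="", flush=True)
print()
```

Re-decoding the full prefix on every step costs quadratic time in the sequence length, so a real implementation would likely keep a bounded window instead. The point of the sketch is only that the leading space of each word survives because no token is decoded in isolation.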
@peakji Thank you! I have solved the first problem, and the English looks OK now. But Chinese is still not OK: some Chinese characters are fine, while others still come out garbled.
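One possible cause of partially garbled Chinese, offered as an assumption since the thread does not confirm it: a character that is not in the vocabulary may be split into several byte-level tokens, and decoding each fragment on its own produces replacement characters. The sketch below uses only the Python standard library and stands in for the tokenizer by splitting a UTF-8 string into single bytes.

```python
import codecs

# "你好" is 6 bytes in UTF-8, 3 per character. Splitting into single bytes
# stands in for a tokenizer that emits one byte-level token at a time.
data = "你好".encode("utf-8")
pieces = [data[i:i + 1] for i in range(len(data))]

# Decoding each fragment independently: every lone byte is invalid UTF-8,
# so each one becomes U+FFFD (the replacement character).
naive = "".join(p.decode("utf-8", errors="replace") for p in pieces)

# An incremental decoder buffers incomplete sequences and only returns
# text once a full character has arrived.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
streamed = [decoder.decode(p) for p in pieces]
streamed.append(decoder.decode(b"", final=True))  # flush any leftover bytes

print(repr(naive))
print(streamed)
print("".join(streamed))
```

If this is what is happening, characters that map to a single token would print correctly while the ones split into bytes would not, which would match the "some are OK, some are not" symptom.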
We need more information to assist you in resolving the issue. These screenshots alone don't provide much valuable information. Could you please provide the code you are testing for us to reproduce?
How can I resolve this problem?