Replies: 2 comments
-
I believe the problem comes from the API in streaming mode; turning streaming off works well. Here is my code snippet, FYI: local_llm = ChatOpenAI(
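A fuller sketch of the workaround (the model name and endpoint here are placeholders for a local vLLM deployment, not from my original snippet):

```python
from langchain_openai import ChatOpenAI

# Placeholder endpoint/model values; adjust to your deployment.
local_llm = ChatOpenAI(
    model="Qwen2.5-14B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    streaming=False,  # workaround: usage stats look sane with streaming off
)

result = local_llm.invoke("...")  # your long Chinese text here
print(result.usage_metadata)     # input_tokens / output_tokens / total_tokens
```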
-
That's definitely unexpected (the theoretical maximum is 4 tokens per char). There have been several fixes to usage stats since vLLM 0.6.0. Can you try upgrading vLLM?
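To double-check, you can tokenize the same text locally with the model's tokenizer and compare against the server-reported usage (a sketch using the Hugging Face tokenizer for Qwen2.5-14B-Instruct; the ratios in the comments are the rule-of-thumb expectation, not measured values):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

text = "..."  # your 10000-char Chinese input
token_ids = tokenizer.encode(text)
print(len(text), len(token_ids), len(token_ids) / len(text))
# For Chinese text this ratio should be roughly 0.55-0.67 tokens per char,
# nowhere near the ~200 tokens per char that the reported usage implies.
```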
-
Hi all, I want to summarize long Chinese text extracted from a PDF. However, when the input exceeds 12000 chars, repetition and hallucination appear.
But the most confusing part is the token count: the reported token count is almost 100 times the character count. When I tested on gpt-4o, the token count was about 1x the character count. Is there something I missed?
Per the Qwen doc,
https://qwen.readthedocs.io/en/latest/getting_started/concepts.html
"As a rule of thumb, 1 token is 3
4 characters for English texts and 1.51.8 characters for Chinese texts."deploy info:
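To make the gap concrete, here is the back-of-the-envelope expectation for my 12000-char input versus what include_usage reports (my arithmetic, using the doc's rule of thumb):

```python
chars = 12000
# Qwen rule of thumb for Chinese: 1 token per 1.5-1.8 characters
print(chars / 1.8, chars / 1.5)  # expected: roughly 6667 to 8000 tokens
# Reported below for the 12000-char run: input_tokens = 3158311,
# i.e. about 400x the expected count.
```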
Deploy info:
vLLM 0.6
Qwen2.5-14B-Instruct
Length info:

input_len: 10000
output_len: 535
output from include_usage:
{'input_tokens': 2046816, 'output_tokens': 47585, 'total_tokens': 2094401, 'input_token_details': {}, 'output_token_details': {}}

input_len: 12000
output_len: 678
output from include_usage:
{'input_tokens': 3158311, 'output_tokens': 81002, 'total_tokens': 3239313, 'input_token_details': {}, 'output_token_details': {}}

input_len: 14000
output_len: 42
output from include_usage:
{'input_tokens': 3396911, 'output_tokens': 69377, 'total_tokens': 3466288, 'input_token_details': {}, 'output_token_details': {}}
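A minimal way to cross-check these numbers, bypassing LangChain and reading usage straight from the OpenAI-compatible stream (base_url, api_key, and model name are assumptions for a local vLLM deployment):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen2.5-14B-Instruct",
    messages=[{"role": "user", "content": "..."}],  # the long Chinese text
    stream=True,
    stream_options={"include_usage": True},
)

usage = None
for chunk in stream:
    # Per the OpenAI streaming contract, usage is None on content chunks
    # and set only on the final chunk. If a server attaches (cumulative)
    # usage to every chunk and the client sums them, totals get inflated
    # by roughly the number of chunks -- the same order of magnitude as
    # the blow-up seen above. This is a guess at the failure mode, not
    # a confirmed diagnosis.
    if chunk.usage is not None:
        usage = chunk.usage

print(usage)  # tokens as reported by the final chunk only
```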