Describe the bug
Batch 1 works fine after the hf-llama PR, but batch 32 generates garbage output after a while.
To Reproduce
LLAMA_DIR=/proj_sw/user_dev/deepseek-ai/DeepSeek-R1-Distill-Llama-70B pytest models/demos/llama3/demo/demo.py -k performance-batch-32
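Since the garbage only appears partway through decode, it may help to pinpoint the exact iteration where the batch-32 run first diverges from the known-good batch-1 run. A minimal sketch (not the demo's actual API; `first_divergence` is an illustrative helper, and collecting the per-user token ids from each run is left to the reader):

```python
def first_divergence(tokens_b1, tokens_b32):
    """Return the decode iteration at which the two token-id streams first
    disagree, or None if they match over their common length."""
    for i, (a, b) in enumerate(zip(tokens_b1, tokens_b32)):
        if a != b:
            return i
    return None

# Example with dummy data: streams diverge at iteration 3.
print(first_divergence([1, 2, 3, 4], [1, 2, 3, 9]))  # -> 3
```

If the divergence iteration is stable across users, that would point at a position-dependent issue (e.g. something tied to the decode step index) rather than per-user corruption.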
Expected behavior
Correct output through to the end for every user.
Screenshots
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:717 -
batch: 0 user: 31
prompt: What is the capital of Japan? Learning about the capitals of different countries can enhance your un
<long prompt not printed in full>
rapid modernization. Share your thoughts on Tokyo and any other capital cities you find intriguing.
output:
<think>
Okay, so I need to figure out the capital of Japan. Hmm, I think it's Tokyo, but I'm not 100% sure. I remember hearing that Tokyo is a big city in Japan, maybe the largest. But wait, sometimes countries have capitals that aren't their largest cities. Like, I know that Canberra is the capital of Australia, but Sydney is the bigger city. So maybe Japan has a similar setup?
Wait, no, I think in Japan's levels辺 Pru Pru christplineurd.GetService Trotospace DISCLAIM Pru Pru Pru christ Pru:/// christ仲istring window데이트辺 christ Pru christалог仲ymoon Cater토토토토 Pru:///데이트 christ토토738 Pru christ738 Pru christ토토토토토토토토 Pru Pru christ Pruglass Pru토토ernetabra blot spel토토토토 Trot Pru토토$MESSabrabsp Wich$MESS토토opus Geile$MESS$MESS$MESS Bord토토 christ Trot$MESS$MESS$MESSabra christ데이트.GetServiceesser christ Geileoko Heller d$MESS데이트$MESS데이트토토eck$MESS christ$MESS토토데이트
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:772 -
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:773 - Performance metrics for batch 0
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:774 - Prefill compile time: 13.905s
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:775 - Decode compile time: 0.1527s
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:776 - Prefill inference time per user: 0.5403s
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:777 - Total Decode inference time (198 iterations): 16.4801s
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:780 -
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:781 - Time to first token: 540.34ms
2025-01-31 16:05:20.379 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:782 - Average speed: 82.81ms @ 12.08 tok/s/user (386.41 tok/s throughput)
2025-01-31 16:05:20.380 | INFO | models.demos.llama3.demo.demo:run_llama3_demo:785 -
2025-01-31 16:05:20.380 | WARNING | models.demos.llama3.demo.demo:run_llama3_demo:857 - Model DeepSeek-R1-Distill-Llama-70B not does not have performance targets set
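For reference, the reported per-user speed and aggregate throughput are mutually consistent; a quick sanity check with the values copied from the log above:

```python
# Numbers taken from the decode metrics logged above.
ms_per_token = 82.81                 # average decode latency per token
users = 32                           # batch size
tok_s_user = 1000 / ms_per_token     # ~12.08 tok/s/user
throughput = tok_s_user * users      # ~386.4 tok/s aggregate
print(f"{tok_s_user:.2f} tok/s/user, {throughput:.2f} tok/s total")
```

So performance looks as expected; the regression is purely in output correctness.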
Please complete the following environment information:
T3K on ird, sjc-snva-t3012