[Misc] Add offline test for disaggregated prefill #12418
base: main
Conversation
Signed-off-by: Shaoting Feng <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
good
Thanks for the PR. Please note that we expect users to learn this feature from the examples, ideally by using or modifying them directly for their own use cases, so please provide as many comments and explanations as possible. In addition, since using prefill disaggregation in offline inference is not trivial, could you also elaborate on the scenario?
prompts = [
    "Hello, my name is",
    # "Hi, your name is", # To simulate transmission failure
Can you elaborate? How does this comment simulate the failure?
This is to trigger partial prefill of requests in a batch. The prefill node receives two requests, while the decode node receives three, so the decode node only receives the KV cache for requests 1 and 3. It reuses that KV cache and performs prefill itself for request 2.
This example demonstrates how to use disaggregated prefill and how the decode node handles receiving KV cache for only a subset of the requests in a batch.
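A minimal sketch of the two prompt lists this scenario assumes (the decode-side list is inferred from the description above rather than copied from the PR diff):

```python
# Sketch only: the prefill node skips request 2, so the decode node receives
# the KV cache for requests 1 and 3 and must prefill request 2 itself.
prefill_prompts = [
    "Hello, my name is",
    # "Hi, your name is",  # omitted to simulate a transmission failure
    "Tell me a very long story",
]
decode_prompts = [
    "Hello, my name is",
    "Hi, your name is",  # no KV cache is transferred for this request
    "Tell me a very long story",
]
```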
def run_prefill(prefill_done):
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
Please add comments to explain that we use GPU 0 for prefill and GPU 1 for decode.
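For illustration, the requested comments could look like this; run_decode and its body are assumed here, not quoted from the diff:

```python
import os

def run_prefill(prefill_done):
    # The prefill node is pinned to GPU 0; the decode node uses GPU 1.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    ...

def run_decode(prefill_done):
    # The decode node is pinned to GPU 1.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    ...
```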
# "Hi, your name is", # To simulate transmission failure | ||
"Tell me a very long story", | ||
] | ||
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=1) |
max_tokens=1 seems not a very good disaggregated prefill example.
Thanks for your comment. In the offline case, this example also plays the role of the proxy API server. As shown in the README for disaggregated prefill (https://github.com/vllm-project/vllm/tree/main/vllm/distributed/kv_transfer), the max_tokens of the prefill node should be set to 1.
Likewise, in the online case, benchmarks/disagg_benchmarks/disagg_prefill_proxy_server.py sets prefill_request['max_tokens'] = 1 in its handle_request function.
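A small sketch of that split between the two nodes; the decode-side max_tokens value is illustrative, not taken from the PR:

```python
from vllm import SamplingParams

# Prefill node: only the KV cache is needed, so generate a single token.
prefill_sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=1)

# Decode node: use the real generation budget (value chosen for illustration).
decode_sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=100)
```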
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct",
          kv_transfer_config=ktc,
          max_model_len=2000,
          gpu_memory_utilization=0.8)
Some GPUs may hit OOM with this ratio. Please document the GPU you used and suggest a ratio for other GPUs.
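One way the example could document this, sketched here; the GPU size and alternative values below are placeholders, not figures stated in this thread:

```python
# Placeholder comment: document the GPU actually used, e.g.
# "Tested on a single 80GB GPU with gpu_memory_utilization=0.8."
# On GPUs with less memory, lower the ratio (e.g. 0.6) and/or reduce
# max_model_len to avoid out-of-memory errors.
GPU_MEMORY_UTILIZATION = 0.8
MAX_MODEL_LEN = 2000
```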
# To keep the prefill node running in case the decode node is not done
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    print("Script stopped by user.")
This won't stop until the user presses Ctrl+C, right? Then this is not a good offline example...
The prefill node will automatically terminate once the decode node completes. The code is as follows:
# Terminate the prefill node when decode is finished
decode_process.join()
prefill_process.terminate()
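For context, a runnable sketch of the orchestration those lines belong to, using the process and event names that appear in this thread (the worker bodies are stubbed out and only assumed to behave as described above):

```python
import time
from multiprocessing import Event, Process

def run_prefill(prefill_done):
    # ... run prefill on GPU 0; the KV cache is transferred to the decode node ...
    prefill_done.set()
    # Keep the prefill node alive until the main process terminates it.
    while True:
        time.sleep(1)

def run_decode(prefill_done):
    # Wait until prefill has produced the KV cache, then run decode on GPU 1.
    prefill_done.wait()
    # ... run decode using the received KV cache ...

if __name__ == "__main__":
    prefill_done = Event()
    prefill_process = Process(target=run_prefill, args=(prefill_done,))
    decode_process = Process(target=run_decode, args=(prefill_done,))
    prefill_process.start()
    decode_process.start()

    # Terminate the prefill node when decode is finished
    decode_process.join()
    prefill_process.terminate()
```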
Signed-off-by: Shaoting Feng <[email protected]>
@comaniac @WangErXiao Thank you very much for your feedback. I have added additional comments and addressed your questions. An offline disaggregated prefill example is valuable for the disaggregated prefill roadmap, as it simplifies debugging. If there are any further concerns about the code, I would greatly appreciate your input.
This PR adds an offline test for the disaggregated prefill use case.