
[WIP] [V1] TPU support #11936

Open · wants to merge 1 commit into base: main

Conversation

alexm-redhat (Collaborator) commented Jan 10, 2025

This PR is a rebase and modification of @robertgshaw2-redhat's original PR for TPU support in vLLM V1 from 1.5 months ago (#10241).

Currently, the TPU attention kernel does not support mixing prefills and decodes in the same scheduler iteration. As a result, this PR separates the requests into (1) prefills and (2) decodes and executes each group separately. The Google team is working on a new TPU attention kernel that will allow mixing prefills and decodes; the moment it is ready, we will be able to remove the separation logic and unify the requests (which will also provide better performance).
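
To make the split concrete, here is a minimal sketch of the idea (the helper names _run_prefills and _run_decodes are illustrative, not the PR's exact API):

def execute_model(self, scheduler_output):
    # Split the scheduled requests into prefills (prompt not yet in the KV
    # cache) and decodes (requests that are already generating).
    prefill_reqs, decode_reqs = [], []
    for req_id in scheduler_output.num_scheduled_tokens:
        if self.requests[req_id].num_computed_tokens == 0:
            prefill_reqs.append(req_id)
        else:
            decode_reqs.append(req_id)

    sampled_token_ids = {}
    # Run each group separately, since the current Pallas attention kernel
    # cannot mix prefills and decodes in a single call.
    if prefill_reqs:
        sampled_token_ids.update(self._run_prefills(prefill_reqs, scheduler_output))
    if decode_reqs:
        sampled_token_ids.update(self._run_decodes(decode_reqs, scheduler_output))
    return sampled_token_ids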

Notes:

  1. @mgoin verified correctness with GSM8K on a TPU instance
  2. No TP > 1 support yet
  3. Only the greedy sampler for now (see the argmax sketch after this list)
  4. The V1 code had no support for multiple architectures (this PR adds support for CUDA and TPU), which results in some code duplication; we avoid it as much as possible by introducing base classes for the worker and model runner.
  5. Not performance optimized yet
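
Regarding note 3: with only greedy sampling, token selection reduces to an argmax over the last-position logits. A minimal sketch (not the PR's actual sampler code):

import torch

def greedy_sample(logits: torch.Tensor) -> torch.Tensor:
    # logits: [batch_size, vocab_size] scores for the last position of each
    # request. Greedy decoding simply picks the highest-scoring token per row.
    return torch.argmax(logits, dim=-1)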

Follow-up tasks (I may have missed something):

  1. Add all sampler options
  2. Add prefix caching (currently supported in V0 TPU)
  3. Add prefill chunking
  4. Integrate with Google's new attention kernel to support mixing prefills and decodes
  5. Optimize


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀


mergify bot commented Jan 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Comment on lines +382 to +261
return PrefillInputData(
request_ids=prefill_request_ids,
prompt_lens=prefill_prompt_lens,
token_ids=prefill_token_ids,
position_ids=prefill_position_ids,
attn_metadata=prefill_attn_metadata,
)
Contributor:

Remove the PrefillInputData data structure and make it consistent with gpu_model_runner?

Collaborator (Author):

This will be removed the moment Google provides the new attention kernel that supports chunked prefill.

vllm/v1/worker/tpu_model_runner.py (resolved review threads)
effective_query_lens=None,
))

def _prepare_inputs(self, scheduler_output: "SchedulerOutput"):
Contributor:

This is almost identical to the current gpu_model_runner implementation; consider reusing instead of duplicating?

Collaborator (Author):

Added model_runner_base.py to hold common funcs
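
For illustration, the shared-base-class shape looks roughly like this (class and method names are illustrative, not the exact contents of model_runner_base.py):

from abc import ABC, abstractmethod


class ModelRunnerBase(ABC):
    # Holds logic shared by the CUDA and TPU V1 model runners, e.g. request
    # bookkeeping and common input preparation.

    def _update_states(self, scheduler_output) -> None:
        # Common request-state updates shared across backends.
        ...

    @abstractmethod
    def execute_model(self, scheduler_output):
        # Backend-specific forward pass (CUDA graphs vs. XLA-compiled graphs).
        raise NotImplementedError


class TPUModelRunner(ModelRunnerBase):
    def execute_model(self, scheduler_output):
        # TPU-specific path: run prefills and decodes in separate batches.
        ...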

mgoin (Member) commented Jan 13, 2025

Successfully ran an eval on GSM8k

VLLM_USE_V1=1 lm_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-1.5B-Instruct,max_model_len=2048,max_num_seqs=512 --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen/Qwen2.5-1.5B-Instruct,max_model_len=2048,max_num_seqs=512), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.5989|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.5428|±  |0.0137|

parallel_config.worker_cls = \
"vllm.worker.multi_step_tpu_worker.MultiStepTPUWorker"
if envs.VLLM_USE_V1:
parallel_config.worker_cls = "vllm.v1.worker.tpu_worker.TRUWorker"
Member:

Suggested change
parallel_config.worker_cls = "vllm.v1.worker.tpu_worker.TRUWorker"
parallel_config.worker_cls = "vllm.v1.worker.tpu_worker.TPUWorker"

@@ -0,0 +1,148 @@
"""A GPU worker class."""
Member:

Suggested change
"""A GPU worker class."""
"""A TPU worker class."""

Collaborator (Author):

nice catch! :)


# TPU only supports DYNAMO_ONCE compilation level
if (compilation_config.level == CompilationLevel.NO_COMPILATION
or compilation_config.level == CompilationLevel.PIECEWISE):


The assert below makes sure it fails unless compilation_config.level < CompilationLevel.PIECEWISE. So do you still need to check if compilation_config.level == CompilationLevel.PIECEWISE here?

Collaborator (Author):

We can remove it

if scheduler_config.is_multi_step:
parallel_config.worker_cls = \
"vllm.worker.multi_step_tpu_worker.MultiStepTPUWorker"
if envs.VLLM_USE_V1:


Does it mean that in V1 there is no distinction between MultiStepTPUWorker and TPUWorker?

Member:

The purpose of V1 is to remove the need for multistep scheduling so we can simplify the scheduler.

# TODO: Add support for these
if envs.VLLM_USE_V1:
if vllm_config.cache_config.enable_prefix_caching:
logger.info("[V1][TPU] Disable prefix caching")


nit: should this and the logger.info below be logger.error?

Collaborator (Author):

changed to warning

Member:

Please revert these changes

Comment on lines 65 to 73
if (compilation_config.level == CompilationLevel.NO_COMPILATION
or compilation_config.level == CompilationLevel.PIECEWISE):
logger.info("[TPU] Forcing DYNAMO_ONCE compilation level")
Member:

Suggested change
if (compilation_config.level == CompilationLevel.NO_COMPILATION
or compilation_config.level == CompilationLevel.PIECEWISE):
logger.info("[TPU] Forcing DYNAMO_ONCE compilation level")
if compilation_config.level != CompilationLevel.DYNAMO_ONCE:
logger.warning("[TPU] Unsupported compilation level "
f"{compilation_config.level}. Forcing DYNAMO_ONCE.")

Collaborator (Author):

Good idea!

Comment on lines 70 to 82
assert compilation_config.level < CompilationLevel.PIECEWISE,\
"TPU does not support Inductor."
("Current compilation level = {} but needs to be less"
" than {}".format(
compilation_config.level,
CompilationLevel.PIECEWISE))
Member:

I would just remove this assert entirely and leave the above override+log

Collaborator (Author):

done
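
For reference, with the assert removed, the remaining guard from the suggestion above collapses to something like this (a sketch assuming the surrounding module's CompilationLevel enum and logger):

# TPU only supports the DYNAMO_ONCE compilation level.
if compilation_config.level != CompilationLevel.DYNAMO_ONCE:
    logger.warning("[TPU] Unsupported compilation level %s. Forcing DYNAMO_ONCE.",
                   compilation_config.level)
    compilation_config.level = CompilationLevel.DYNAMO_ONCE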


# TODO: Add support for these
if envs.VLLM_USE_V1:
if vllm_config.cache_config.enable_prefix_caching:
logger.info("[V1][TPU] Disable prefix caching")
Member:

Suggested change
logger.info("[V1][TPU] Disable prefix caching")
logger.warning("[V1][TPU] Disabling prefix caching")

Collaborator (Author):

changed

vllm/v1/worker/gpu_model_runner.py (resolved review thread)
Comment on lines 248 to 253
# TODO: Remove prompt_len param here
prefill_attn_metadata.append(
PallasMetadata(
num_prefills=1,
num_prefill_tokens=prompt_len, # NOTE: This is not used.
num_decode_tokens=0,
slot_mapping=slot_mapping.to(self.device),
multi_modal_placeholder_index_maps=None,
block_tables=None,
context_lens=None,
effective_query_lens=None,
))
Member:

Can you address this TODO?

Collaborator (Author):

Done

assert req_id is not None
req_state = self.requests[req_id]

# TODO: ASSERT NO CHUNKED PREFILL.
Member:

Implement this TODO

Collaborator (Author):

It looks like the current assert combination is good enough; a chunked prefill would make seq_len differ from req_state.num_tokens, so the existing asserts would catch it.

scheduler_output.num_scheduled_tokens[req_id])
assert seq_len == req_state.num_tokens

# TODO: Verify if req_id_to_index mapping is needed here!
Member:

Ditto

Collaborator (Author):

removed, it is an old comment

Comment on lines 450 to 452
# TODO: ASSERT NO PREFIX CACHING.
assert req_state.num_computed_tokens == 0
seq_len = (req_state.num_computed_tokens +
scheduler_output.num_scheduled_tokens[req_id])

# TODO: ASSERT NO CHUNKED PREFILL.
Member:

Could you make these asserts at the initialization level? Why would you need to assert this for each request?

Collaborator (Author):

They are now inside the platform's tpu.py; the ones here are just in case something in the code changes and breaks an assumption. All of these will change the moment we have a chunked-prefill attention kernel.
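
A sketch of what those config-time guards in the platform's tpu.py can look like (field names follow the snippets quoted above; the chunked_prefill_enabled check is an assumption here, not necessarily the PR's exact code):

# Reject or disable unsupported features once at engine initialization,
# instead of re-checking them for every request.
if envs.VLLM_USE_V1:
    if vllm_config.cache_config.enable_prefix_caching:
        logger.warning("[V1][TPU] Disabling prefix caching")
        vllm_config.cache_config.enable_prefix_caching = False
    assert not vllm_config.scheduler_config.chunked_prefill_enabled, \
        "[V1][TPU] Chunked prefill is not supported yet"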

mergify bot added the ci/build label Jan 16, 2025
Comment on lines 520 to 508
token_ids = torch.zeros((batch_size, seq_len),
dtype=torch.int32,
device=self.device)
Member:

Why do you build these dummy tensors each time rather than allocating the max in the initializer and taking slices for each run like the gpu_model_runner?

Collaborator (Author):

taking slices will result in copies as well, no?
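
For context, the two approaches under discussion look roughly like this (a sketch; whether the sliced views actually avoid copies and recompilation under XLA is exactly the open question here):

import torch

# Approach in this PR: build a fresh dummy tensor for each warmup shape.
def make_dummy_token_ids(batch_size: int, seq_len: int, device) -> torch.Tensor:
    return torch.zeros((batch_size, seq_len), dtype=torch.int32, device=device)

# gpu_model_runner-style approach: preallocate once at the maximum size in the
# initializer and take views of it for each run.
class PreallocatedInputs:
    def __init__(self, max_batch_size: int, max_seq_len: int, device):
        self.token_ids = torch.zeros((max_batch_size, max_seq_len),
                                     dtype=torch.int32, device=device)

    def view_for(self, batch_size: int, seq_len: int) -> torch.Tensor:
        return self.token_ids[:batch_size, :seq_len]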

# Use persistent cache to avoid XLA recompilation.
# NOTE(woosuk): Set per-rank cache path since different ranks
# can have slightly different XLA graphs.
world_size = self.parallel_config.world_size


In the v0 folder there are tpu_worker.py, tpu_model_runner.py, and pallas.py. Could you summarize how the three files in the v1 folder differ from their v0 counterparts?

Collaborator (Author):

The architecture of V1 is slightly different from V0, which required changing APIs. To avoid conflicts, when V1 was implemented these files were duplicated (with the necessary changes) for the NVIDIA backend. In this PR we do the same for the TPU backend, but also refactor common code into *_base.py classes to avoid duplication where possible.
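
For reference, the per-rank XLA compilation cache setup quoted at the top of this thread typically looks like the following (a sketch assuming torch_xla's persistent-cache API, with the cache root taken from a VLLM_XLA_CACHE_PATH-style environment variable):

import os

import torch_xla.runtime as xr

def init_xla_compilation_cache(parallel_config, cache_root: str) -> None:
    # Persistent compilation cache avoids recompiling XLA graphs on every
    # startup; each rank gets its own directory because different ranks can
    # trace slightly different graphs.
    world_size = parallel_config.world_size
    rank = xr.global_ordinal()
    per_rank_path = os.path.join(cache_root, f"tp{world_size}_rank{rank}")
    xr.initialize_cache(per_rank_path, readonly=False)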

alexm-redhat (Collaborator, Author) left a comment:

@mgoin @vanbasten23 thanks for the review comments!




mergify bot commented Jan 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alexm-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@@ -89,4 +89,4 @@ repos:
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
verbose: true
Collaborator:

nit

@@ -8,15 +8,15 @@
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
sampling_params = SamplingParams() #temperature=0.8, top_p=0.95)
Collaborator:

revert

@@ -34,4 +34,4 @@ run_mypy vllm/plugins
run_mypy vllm/prompt_adapter
run_mypy vllm/spec_decode
run_mypy vllm/worker
run_mypy vllm/v1
run_mypy vllm/v1
Collaborator:

nit

Signed-off-by: Alexander Matveev <[email protected]>
@vanbasten23 commented:

Hi @alexm-redhat, thanks for adding vLLM V1 support for TPU!
One quick question: this vLLM slide deck mentioned a few key changes in vLLM V1:

  • Simplified scheduler
  • Incremental input preparation
  • Piecewise CUDA graphs
  • Enhanced API server
  • More efficient Prefix caching
  • Fine-grained scheduling for VLMs

Could you help mark which changes are included in this PR and which will be made in future PRs?
cc @miladm
