Add ragged paged attention #8659
base: master
Conversation
torch_xla/experimental/pallas_kernels/ragged_paged_attention_kernel.py (several outdated review threads, resolved)
last_time_seeing_cur_physical_q_blk = jnp.logical_or(is_last_logical_q_blk, physical_q_blk_will_change)
should_store_to_hbm = jnp.logical_and(is_last_kv_blk_idx, last_time_seeing_cur_physical_q_blk)
@pl.when(should_store_to_hbm)
def store_to_hbm():  # pylint: disable=unused-variable
This function actually only stores to VMEM, and all the refs it uses are VMEM refs. We rely on the pipeline emitter to send the VMEM block back to HBM, and we can't store vregs directly to HBM inside the kernel. So I think the original name store_to_output makes more sense here.
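To illustrate that comment, here is a minimal, hypothetical Pallas sketch (not the PR's kernel; the names `_block_sum_kernel` / `block_sum`, the shapes, and the use of the current `BlockSpec` / `scratch_shapes` API are all assumptions on my part). The kernel accumulates partial results in a VMEM scratch and, on the last grid step, copies them into the output ref, which is itself a VMEM block; the pipeline emitter is what moves that block back to HBM afterwards.

```python
# Hypothetical sketch, not the PR's kernel: the guarded final store only ever
# writes to the VMEM output ref; the pipeline emitter flushes it to HBM, which
# is why `store_to_output` describes the code better than `store_to_hbm`.
import functools

import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl
from jax.experimental.pallas import tpu as pltpu


def _block_sum_kernel(x_ref, o_ref, acc_ref, *, num_kv_blocks):
  kv_blk = pl.program_id(0)

  @pl.when(kv_blk == 0)
  def init():  # pylint: disable=unused-variable
    acc_ref[...] = jnp.zeros_like(acc_ref)

  # Partial result for this kv block, kept in a float32 VMEM scratch.
  acc_ref[...] += x_ref[...].astype(jnp.float32)

  @pl.when(kv_blk == num_kv_blocks - 1)
  def store_to_output():  # pylint: disable=unused-variable
    # Still a VMEM write; the emitter copies o_ref back to HBM afterwards.
    o_ref[...] = acc_ref[...].astype(o_ref.dtype)


def block_sum(x, kv_block=128):
  """Sums x of shape (rows, num_kv_blocks * kv_block) into (rows, kv_block)."""
  rows, cols = x.shape
  num_kv_blocks = cols // kv_block
  return pl.pallas_call(
      functools.partial(_block_sum_kernel, num_kv_blocks=num_kv_blocks),
      grid=(num_kv_blocks,),
      in_specs=[pl.BlockSpec((rows, kv_block), lambda i: (0, i))],
      out_specs=pl.BlockSpec((rows, kv_block), lambda i: (0, 0)),
      out_shape=jax.ShapeDtypeStruct((rows, kv_block), x.dtype),
      scratch_shapes=[pltpu.VMEM((rows, kv_block), jnp.float32)],
  )(x)
```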
pages_per_sequence=pages_per_sequence,
num_tokens=num_tokens,
num_seqs=num_seqs,  # if these change, we need to recompile.
num_kv_pages_per_compute_block=num_kv_pages_per_compute_block,
mask_value=mask_value,
These are all calculated from either static shapes or static_argnames.
The overhead could be very large if some of them are too dynamic (num_seqs, num_tokens), because we would need to recompile many times. Maybe add a note or TODO here.
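To make the recompilation concern concrete, here is a small, generic illustration (not the PR's code; `fake_attention` and its arguments are made up): any argument listed in `static_argnames` is baked into the compiled program, so every new value forces a fresh trace and compile.

```python
# Hypothetical example: static arguments are part of the compilation cache key,
# so each new (num_seqs, num_tokens) pair triggers a retrace + recompile.
import functools

import jax
import jax.numpy as jnp

trace_count = 0


@functools.partial(jax.jit, static_argnames=("num_seqs", "num_tokens"))
def fake_attention(q, *, num_seqs, num_tokens):
  global trace_count
  trace_count += 1  # Python side effect: runs only while tracing.
  return q[:num_tokens] * num_seqs


q = jnp.ones((16, 8), dtype=jnp.float32)
fake_attention(q, num_seqs=2, num_tokens=8)
fake_attention(q, num_seqs=2, num_tokens=8)  # cache hit, no retrace
fake_attention(q, num_seqs=3, num_tokens=8)  # new static value -> recompile
print(trace_count)  # 2
```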
How is 65536 calculated?
I found a ticket where someone uses it. I recall the number is the VMEM limit on one TPU generation.
Force-pushed ad2f87c to 9e4b227 (…_extreme_one_tokens_per_sequence; …added runtime check.)
Force-pushed 9e4b227 to 7fe5071
Test plan: