Faster MHA backwards pass #22820

Merged · 1 commit merged into jax-ml:main on Aug 7, 2024
Conversation

@Rifur13 (Collaborator) commented on Aug 1, 2024

This PR implements a faster backwards pass for the multi-headed attention (MHA) Pallas kernel.
The biggest contributors to the speedup are:

  • Parallelizing the backwards pass across the sequence length
  • Efficiently pipelining the two for loops inside the bwd kernel

This builds on work from @tonywu95 in jax-ml/jax-triton#177 and is inspired by the triton tutorial https://github.com/triton-lang/triton/blob/main/python/tutorials/06-fused-attention.py.

Comparison against the XLA bwd pass across different configurations:

| Batch Size | Num Q Heads | Num KV Heads | Q Seq Len | KV Seq Len | Head Dim | DType | Kernel | Baseline | Improvement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | 24 | 24 | 328 | 328 | 32 | bfloat16 | 446.337us +/- 5.97 | 1164.975us +/- 18.78 | +61.69% |
| 16 | 24 | 24 | 328 | 328 | 64 | bfloat16 | 650.638us +/- 2.27 | 1182.095us +/- 40.87 | +44.96% |
| 16 | 24 | 24 | 328 | 328 | 128 | bfloat16 | 1236.514us +/- 4.47 | 1477.283us +/- 18.82 | +16.30% |
| 4 | 24 | 24 | 4096 | 4096 | 32 | bfloat16 | 10265.644us +/- 2.73 | 25743.450us +/- 18.58 | +60.12% |
| 4 | 24 | 24 | 4096 | 4096 | 64 | bfloat16 | 14350.043us +/- 18.76 | 27095.708us +/- 83.19 | +47.04% |
| 4 | 24 | 24 | 4096 | 4096 | 128 | bfloat16 | 25729.960us +/- 1733.48 | 29187.793us +/- 92.03 | +11.85% |
| 1 | 2 | 2 | 32768 | 32768 | 32 | bfloat16 | 13020.634us +/- 17.10 | 33502.725us +/- 517.01 | +61.14% |
| 1 | 2 | 2 | 32768 | 32768 | 64 | bfloat16 | 16153.004us +/- 1285.60 | 33949.966us +/- 328.75 | +52.42% |
| 1 | 2 | 2 | 32768 | 32768 | 128 | bfloat16 | 28013.174us +/- 476.28 | 35049.036us +/- 13.35 | +20.07% |
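
For readers who want to run a rough comparison like the table above, here is a minimal benchmarking sketch. It is not the harness that produced these numbers: it assumes the kernel is exposed as `mha` in `jax.experimental.pallas.ops.gpu.attention` with a `(batch, seq_len, num_heads, head_dim)` input layout and an optional `segment_ids` argument, and it uses simple wall-clock timing rather than a profiler. Check the module path and signature in your JAX version.

```python
import timeit

import jax
import jax.numpy as jnp
# Assumed import path for the Pallas MHA kernel; verify it for your JAX version.
from jax.experimental.pallas.ops.gpu.attention import mha

batch, heads, seq_len, head_dim = 4, 24, 4096, 64
keys = jax.random.split(jax.random.key(0), 3)
q, k, v = (jax.random.normal(kk, (batch, seq_len, heads, head_dim), jnp.bfloat16)
           for kk in keys)

def xla_attention(q, k, v):
  # Plain-XLA attention used as the baseline.
  logits = jnp.einsum('bqhd,bkhd->bhqk', q, k).astype(jnp.float32)
  probs = jax.nn.softmax(logits, axis=-1).astype(q.dtype)
  return jnp.einsum('bhqk,bkhd->bqhd', probs, v)

def pallas_attention(q, k, v):
  # Assumed signature: optional segment_ids, default scaling/options.
  return mha(q, k, v, segment_ids=None)

def make_grad(attn):
  # Summing the (upcast) output gives a scalar loss whose gradient exercises
  # the backward pass of the given attention implementation.
  loss = lambda q, k, v: attn(q, k, v).astype(jnp.float32).sum()
  return jax.jit(jax.grad(loss, argnums=(0, 1, 2)))

for name, attn in [('pallas', pallas_attention), ('xla', xla_attention)]:
  grad_fn = make_grad(attn)
  jax.block_until_ready(grad_fn(q, k, v))  # compile and warm up
  dt = timeit.timeit(lambda: jax.block_until_ready(grad_fn(q, k, v)), number=10) / 10
  print(f'{name}: {dt * 1e6:.1f} us per call')
```

Note that each timed call includes the forward recomputation done by `jax.grad`, so treat the numbers as relative rather than absolute.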

@Rifur13 Rifur13 requested a review from sharadmv August 1, 2024 17:34
@Rifur13 Rifur13 self-assigned this Aug 1, 2024
@sharadmv (Collaborator) left a comment:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!!


```python
del dv, dk

# Scan #2: dQ
```
Collaborator: Out of curiosity, is there an advantage to doing this in one kernel vs. two kernels?

@Rifur13 (Collaborator Author):
I think it comes down to the fact that there’s more work to do in a single kernel, which leads to better warp occupancy (a rough structural sketch follows after this list).

Other factors that influence GPU utilization:

  • There’s some data locality between the two loops, though it matters more for smaller sequence lengths.
  • The overhead of launching two kernels.
  • Making sure the two kernels actually execute in parallel: they would need to be launched on separate CUDA streams, and even then I don’t think that’s guaranteed.
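
To make the single-kernel structure concrete, here is a toy, self-contained Pallas kernel. It is not the MHA backward kernel from this PR: the shapes, block sizes, and the reductions it computes are invented for illustration, and it assumes a recent JAX where `pl.BlockSpec` takes `(block_shape, index_map)`. The point is only that the grid parallelizes over one axis (standing in for the sequence length) while the body runs two sequential scans over the other axis, mirroring the "Scan #1, then Scan #2: dQ" layout quoted above, so both scans share one launch and the data already brought in.

```python
import functools

import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def two_scan_kernel(x_ref, o_ref, *, block_k: int):
  # x_ref: a (block_q, seq_k) tile selected by the grid; o_ref: (block_q,).
  seq_k = x_ref.shape[1]
  num_k_blocks = seq_k // block_k

  # Scan #1 over the seq_k axis (stands in for the dK/dV loop).
  def scan1(i, acc):
    blk = x_ref[:, pl.ds(i * block_k, block_k)]
    return acc + blk.sum(axis=1)
  acc = jax.lax.fori_loop(0, num_k_blocks, scan1,
                          jnp.zeros((x_ref.shape[0],), jnp.float32))

  # Scan #2 over the same axis (stands in for the dQ loop). Keeping it in the
  # same kernel means a single launch and a chance to reuse tiles the first
  # scan already touched.
  def scan2(i, acc):
    blk = x_ref[:, pl.ds(i * block_k, block_k)]
    return acc + (blk * blk).sum(axis=1)
  acc = jax.lax.fori_loop(0, num_k_blocks, scan2, acc)

  o_ref[:] = acc

seq_q, seq_k, block_q, block_k = 1024, 512, 128, 128
x = jnp.ones((seq_q, seq_k), jnp.float32)
out = pl.pallas_call(
    functools.partial(two_scan_kernel, block_k=block_k),
    grid=(seq_q // block_q,),  # parallel over the "sequence" axis
    in_specs=[pl.BlockSpec((block_q, seq_k), lambda i: (i, 0))],
    out_specs=pl.BlockSpec((block_q,), lambda i: (i,)),
    out_shape=jax.ShapeDtypeStruct((seq_q,), jnp.float32),
    # pass interpret=True to try this on CPU without a GPU
)(x)
```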

@google-ml-butler google-ml-butler bot added the kokoro:force-run and pull ready labels Aug 6, 2024
@copybara-service copybara-service bot merged commit d3b6066 into jax-ml:main Aug 7, 2024
16 checks passed
@Rifur13 Rifur13 deleted the mha-faster branch August 7, 2024 23:18
@Rifur13 (Collaborator Author) commented on Aug 16, 2024

Here are the numbers compared to the previous kernel. Trying different values of num_warps and the block sizes can help to further increase performance on your hardware (a hypothetical tuning sketch follows after the table below).

Note that the improvement column is the relative speedup, not just the percentage increase.

| Batch Size | Num Q Heads | Num KV Heads | Q Seq Len | KV Seq Len | Head Dim | DType | Kernel | Baseline | Improvement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 16 | 24 | 24 | 328 | 328 | 32 | bfloat16 | 411.552us +/- 6.66 | 399.462us +/- 3.72 | -3.03% |
| 16 | 24 | 24 | 328 | 328 | 64 | bfloat16 | 642.733us +/- 1.68 | 666.949us +/- 2.04 | +3.63% |
| 16 | 24 | 24 | 328 | 328 | 128 | bfloat16 | 1214.854us +/- 4.02 | 2955.590us +/- 3.62 | +58.90% |
| 4 | 24 | 24 | 4096 | 4096 | 32 | bfloat16 | 9102.094us +/- 12.43 | 9859.752us +/- 7.32 | +7.68% |
| 4 | 24 | 24 | 4096 | 4096 | 64 | bfloat16 | 14404.937us +/- 59.16 | 14744.581us +/- 18.49 | +2.30% |
| 4 | 24 | 24 | 4096 | 4096 | 128 | bfloat16 | 21383.099us +/- 110.85 | 64695.606us +/- 65.18 | +66.95% |
| 1 | 2 | 2 | 32768 | 32768 | 32 | bfloat16 | 11271.190us +/- 938.29 | 411253.577us +/- 486.08 | +97.26% |
| 1 | 2 | 2 | 32768 | 32768 | 64 | bfloat16 | 17974.650us +/- 1989.63 | 672074.326us +/- 520.27 | +97.33% |
| 1 | 2 | 2 | 32768 | 32768 | 128 | bfloat16 | 30878.254us +/- 3374.88 | 2278354.646us +/- 4696.14 | +98.64% |
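
As a starting point for the tuning suggested above, here is a hypothetical sweep. It assumes the kernel accepts `block_q`, `block_k`, and `num_warps` keyword arguments and an optional `segment_ids` argument; these parameter names are assumptions, so verify them against the actual `mha` signature in your JAX version before relying on the sketch.

```python
import timeit

import jax
import jax.numpy as jnp
# Assumed import path; verify it for your JAX version.
from jax.experimental.pallas.ops.gpu.attention import mha

batch, heads, seq_len, head_dim = 4, 24, 4096, 64
keys = jax.random.split(jax.random.key(0), 3)
q, k, v = (jax.random.normal(kk, (batch, seq_len, heads, head_dim), jnp.bfloat16)
           for kk in keys)

def time_backward(block_q, block_k, num_warps):
  def attn(q, k, v):
    # Parameter names below are assumptions about the kernel's signature.
    return mha(q, k, v, segment_ids=None,
               block_q=block_q, block_k=block_k, num_warps=num_warps)
  loss = lambda q, k, v: attn(q, k, v).astype(jnp.float32).sum()
  grad_fn = jax.jit(jax.grad(loss, argnums=(0, 1, 2)))
  jax.block_until_ready(grad_fn(q, k, v))  # compile and warm up
  return timeit.timeit(lambda: jax.block_until_ready(grad_fn(q, k, v)),
                       number=10) / 10

for cfg in [(64, 64, 4), (128, 64, 4), (128, 128, 8)]:
  print(cfg, f'{time_backward(*cfg) * 1e6:.1f} us')
```

The best configuration depends on the GPU and problem shape, so sweep a few combinations rather than assuming one setting transfers across hardware.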
