Improvements for: Groupwise scaling along M for FP8 gemm #2095

LucasWilkinson · 2025-02-10T05:30:35Z

Various improvements to "Groupwise scaling along M" (#2037), namely to address #2087; for context, see vllm-project/vllm#11868 (comment).

Improvements:

Multiple threads now participate in copying the A scales.
Predication when loading the A scales, so that loads past the end of M are masked off when there is a partial M tile (because the problem shape is not evenly divisible by the M tile shape).
A more commonly used scale layout. CUTLASS currently uses a layout like:

(M_TILES, ScaleMsPerTile, K_TILES, L), ordered: (2, 0, 1, 3)

This PR moves to a standard M-major layout:

(M / ScaleGranularityM, K_TILES, L), ordered: (1, 0, 2)

which makes it much easier to integrate into inference libraries.
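As a rough host-side illustration of the M-major layout and the predicated scale load described above, here is a minimal sketch. It is not the kernel code from this PR: `ScaleALayout`, `make_scale_a_layout`, `scale_a_offset`, and `load_tile_scales` are hypothetical names. The sketch assumes "M-major" means the M scale index is contiguous, followed by the K tile and then the batch L. It also assumes the number of scale rows rounds up when M is not a multiple of ScaleGranularityM, and that masked-off entries are filled with zero.

```cpp
#include <cstddef>
#include <vector>

// Integer division rounding up.
constexpr std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Hypothetical description of the A-scale tensor with shape
// (M / ScaleGranularityM, K_TILES, L), assumed M-major (M index contiguous).
struct ScaleALayout {
  std::size_t granularity_m;  // ScaleGranularityM
  std::size_t scale_rows;     // M / ScaleGranularityM (rounded up here, an assumption)
  std::size_t k_tiles;        // K_TILES
  std::size_t batches;        // L
};

inline ScaleALayout make_scale_a_layout(std::size_t M, std::size_t K, std::size_t L,
                                        std::size_t granularity_m, std::size_t tile_k) {
  return ScaleALayout{granularity_m, ceil_div(M, granularity_m), ceil_div(K, tile_k), L};
}

// Linear offset of the scale that applies to row m of A, K tile k_tile, batch l.
inline std::size_t scale_a_offset(const ScaleALayout& s, std::size_t m,
                                  std::size_t k_tile, std::size_t l) {
  return (m / s.granularity_m) + s.scale_rows * (k_tile + s.k_tiles * l);
}

// Gather the scales for one M tile. Rows past the end of the problem are
// predicated off and left at 0.0f (the fill value is an assumption of this sketch).
inline std::vector<float> load_tile_scales(const std::vector<float>& scales,
                                           const ScaleALayout& s, std::size_t tile_m,
                                           std::size_t m_tile, std::size_t k_tile,
                                           std::size_t l) {
  const std::size_t rows_per_tile = tile_m / s.granularity_m;
  std::vector<float> out(rows_per_tile, 0.0f);
  for (std::size_t i = 0; i < rows_per_tile; ++i) {
    const std::size_t row = m_tile * rows_per_tile + i;
    if (row < s.scale_rows) {  // predicate: only read in-bounds scale rows
      out[i] = scales[scale_a_offset(s, row * s.granularity_m, k_tile, l)];
    }
  }
  return out;
}
```

For example, with M = 300, a 128-row M tile, and ScaleGranularityM = 1, the third M tile covers rows 256 to 383, so only its first 44 entries are read from the scale tensor and the rest are masked off.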

These improvements were part of vLLM's adoption of this kernel https://github.com/vllm-project/vllm/blob/v0.7.1/csrc/cutlass_extensions/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8_blockwise_scaling.hpp (PR: vllm-project/vllm#11868) and are currently in wide-scale use. Our goal is to rely on the CUTLASS implementation, but that is currently not possible given the issues above.

Signed-off-by: Lucas Wilkinson <[email protected]>

hwu36 · 2025-02-21T03:01:21Z

@LucasWilkinson, we upstreamed our change to the groupwise scaling kernels. There are some conflicts in this PR that need to be resolved.

Our change is mainly:

Extend the groupwise scaling GEMM to support groupwise scaling along both the M and N dimensions in FP8 GEMM.
In examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu, two parameters, ScaleGranularityM and ScaleGranularityN, control the scaling mode:


ScaleGranularityM == 128 and ScaleGranularityN == 128 --> 2Dx2D scaling (block-wise scaling, same as 67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu; 2Dx2D refers to the shape of the scaling factor)

ScaleGranularityM == 1 and ScaleGranularityN == 128 --> 1Dx2D scaling

ScaleGranularityM == 128 and ScaleGranularityN == 1 --> 2Dx1D scaling

ScaleGranularityM == 1 and ScaleGranularityN == 1 --> 1Dx1D scaling
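The four combinations above can be summarized in a small sketch. It is only an illustration, not code from the example: `kTile`, `scaling_mode`, and `scales_per_tile` are hypothetical names, and a tile extent of 128 along both M and N is assumed, since 128 is the granularity that yields block-wise scaling in the list above.

```cpp
#include <string>

// Tile extent assumed for illustration (the list above uses 128 for block-wise scaling).
constexpr int kTile = 128;

// Map (ScaleGranularityM, ScaleGranularityN) to the mode names used above.
// Granularity == tile extent gives one scale per tile along that dimension ("2D");
// granularity == 1 gives one scale per row or column ("1D").
inline std::string scaling_mode(int granularity_m, int granularity_n) {
  auto dim = [](int g) -> std::string {
    if (g == kTile) return "2D";
    if (g == 1) return "1D";
    return "other";  // intermediate granularities are not named in the list above
  };
  return dim(granularity_m) + "x" + dim(granularity_n);
}

// Number of scale factors needed per tile along one dimension,
// assuming the granularity evenly divides the tile extent.
constexpr int scales_per_tile(int granularity) { return kTile / granularity; }
```

Under these assumptions, `scaling_mode(1, 128)` yields the 1Dx2D case, with 128 scales per tile along M and a single scale per tile along N.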

LucasWilkinson added 3 commits February 10, 2025 04:34

fix blockwise fp8 kernels

358af4e

Signed-off-by: Lucas Wilkinson <[email protected]>

wip, < 128 not working

8fd2846

Signed-off-by: Lucas Wilkinson <[email protected]>

fix < 128

db87722

Signed-off-by: Lucas Wilkinson <[email protected]>

LucasWilkinson changed the title ~~[WIP][Bugfix] Bug fixes for: Groupwise scaling along M for FP8 gemm~~ Improvements: Groupwise scaling along M for FP8 gemm Feb 10, 2025

LucasWilkinson changed the title ~~Improvements: Groupwise scaling along M for FP8 gemm~~ Improvements for: Groupwise scaling along M for FP8 gemm Feb 10, 2025

LucasWilkinson marked this pull request as ready for review February 10, 2025 21:07

LucasWilkinson mentioned this pull request Feb 10, 2025

[QST] how to use groupwise scaling along M for FP8 gemm to impelement per-token-per-128-channel and blockwise? #2087

Open

yizhang2077 mentioned this pull request Feb 11, 2025

support blockwise fp8 matmul kernel sgl-project/sglang#3267

Merged

4 tasks
