Efficient GEMV #1700
Replies: 2 comments 3 replies
---
Input from @manupak: there are batched 4x4 mfmas that could be used for this (if we have a batch dimension).
---
After running some experiments and analysis, I realized that the arithmetic intensity is almost 1, so the kernel is bandwidth bound. Using https://github.com/ROCm/rocm_bandwidth_test I estimated the effective device bandwidth (not sure if this is the best way to do it). With that estimate, just loading A and B and storing C should take about 63k ns (gfx942) or 742k ns (gfx1101). Using tuningRunner + perfRunner (exhaustive tuning, transB=true) we measure 58k ns for gfx942 and 729k ns for gfx1101. So it looks like no further improvement is possible, given that the kernel is limited by bandwidth; we should instead focus on fusing this kernel with the previous or next operations. Additionally, I coded a specialized HIP kernel for GEMV and it achieves the same runtime.
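A rough back-of-the-envelope model of the reasoning above (a sketch; the sizes and the bandwidth figure are illustrative assumptions, not the measured values from the thread). For an fp16 GEMV, the matrix traffic dominates, so arithmetic intensity lands near 1 flop/byte and the bandwidth roofline gives a lower bound on runtime:

```python
# Hypothetical roofline estimate for a 1 x K times K x N GEMV.
# Sizes and bandwidth below are illustrative assumptions.

def gemv_roofline(n, k, bytes_per_elem, bw_gb_s):
    """Return (arithmetic intensity in flop/byte, minimum runtime in ns)."""
    flops = 2.0 * k * n                              # one mul + one add per B element
    bytes_moved = (k * n + k + n) * bytes_per_elem   # load B and x, store y
    intensity = flops / bytes_moved
    # 1 GB/s == 1e9 bytes / 1e9 ns == 1 byte/ns, so dividing bytes by GB/s gives ns.
    min_time_ns = bytes_moved / bw_gb_s
    return intensity, min_time_ns

# fp16 (2 bytes/element), made-up problem size and bandwidth:
ai, t_ns = gemv_roofline(n=4096, k=4096, bytes_per_elem=2, bw_gb_s=3000.0)
print(ai)    # just under 1.0 flop/byte -> bandwidth bound
print(t_ns)  # lower bound on kernel time in ns at the assumed bandwidth
```

Any measured time close to this bound means the kernel is already near the memory roofline, which is what the tuningRunner numbers show.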
---
We got a request from migraphx to improve our vector-matrix multiplication. When M=1, N>>1 and K>>1, if we use (for example) wmma instructions, we pad M to 16 with zeros and do extra work, using only 1/16 of the instruction's output.
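A plain-Python illustration of the waste described above (not actual wmma code; the 16-row tile and the tiny sizes are assumptions for demonstration). Padding the M=1 input to a 16-row tile produces a 16 x N result of which only row 0 is the GEMV output:

```python
# Sketch: pad M=1 to the 16-row tile a wmma-style instruction expects,
# and observe that 15 of the 16 output rows are wasted zeros.

def matmul(a, b):
    """Naive reference matmul over lists of lists."""
    m, k, n = len(a), len(a[0]), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

TILE_M = 16
k, n = 4, 4
x = [[1.0] * k]                                       # the real 1 x K vector
B = [[2.0] * n for _ in range(k)]                     # K x N matrix
padded = x + [[0.0] * k for _ in range(TILE_M - 1)]   # zero-pad M up to 16

out = matmul(padded, B)
print(out[0])        # row 0 is the useful GEMV result
print(1 / TILE_M)    # fraction of the tile output actually used: 0.0625
```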
Possible implementation
I think we could use the non-accel path of rocMLIR for this. A possible strategy:
Steps/Tickets