Efficient GEMV #1700
Replies: 2 comments 3 replies
---
Input from @manupak: there are batched 4x4 mfmas that could be used for this (if we have a batch dimension).
---
After running some experiments and analysis, I realized that the arithmetic intensity is almost 1, so the kernel is bandwidth bound. Using https://github.com/ROCm/rocm_bandwidth_test I estimated the effective device bandwidth (not sure if this is the best way to do it). With that estimate, just loading A and B and storing C should take about 63k ns (gfx942) or 742k ns (gfx1101). Using tuningRunner + perfRunner (exhaustive tuning, transB=true) we measure 58k ns for gfx942 and 729k ns for gfx1101. So it looks like no further improvement is possible, given that the kernel is limited by bandwidth; we should instead focus on fusing this kernel with the previous or next operations. Additionally, I coded a specialized HIP kernel for GEMV and it achieves the same runtime.
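A rough back-of-the-envelope model of the reasoning above (a sketch; the sizes and the bandwidth figure are illustrative assumptions, not the measured values from the thread). For an fp16 GEMV, the matrix traffic dominates, so arithmetic intensity lands near 1 flop/byte and the bandwidth roofline gives a lower bound on runtime:

```python
# Hypothetical roofline estimate for a 1 x K times K x N GEMV.
# Sizes and bandwidth below are illustrative assumptions.

def gemv_roofline(n, k, bytes_per_elem, bw_gb_s):
    """Return (arithmetic intensity in flop/byte, minimum runtime in ns)."""
    flops = 2.0 * k * n                              # one mul + one add per B element
    bytes_moved = (k * n + k + n) * bytes_per_elem   # load B and x, store y
    intensity = flops / bytes_moved
    # 1 GB/s == 1e9 bytes / 1e9 ns == 1 byte/ns, so dividing bytes by GB/s gives ns.
    min_time_ns = bytes_moved / bw_gb_s
    return intensity, min_time_ns

# fp16 (2 bytes/element), made-up problem size and bandwidth:
ai, t_ns = gemv_roofline(n=4096, k=4096, bytes_per_elem=2, bw_gb_s=3000.0)
print(ai)    # just under 1.0 flop/byte -> bandwidth bound
print(t_ns)  # lower bound on kernel time in ns at the assumed bandwidth
```

Any measured time close to this bound means the kernel is already near the memory roofline, which is what the tuningRunner numbers show.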
---
We got a request from migraphx to improve our vector-matrix multiplication. When M=1, N>>1 and K>>1, if we use (for example) wmma instructions, we pad M to 16 with zeros and do extra work, using only 1/16 of the instruction's output.
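A plain-Python illustration of the waste described above (not actual wmma code; the 16-row tile and the tiny sizes are assumptions for demonstration). Padding the M=1 input to a 16-row tile produces a 16 x N result of which only row 0 is the GEMV output:

```python
# Sketch: pad M=1 to the 16-row tile a wmma-style instruction expects,
# and observe that 15 of the 16 output rows are wasted zeros.

def matmul(a, b):
    """Naive reference matmul over lists of lists."""
    m, k, n = len(a), len(a[0]), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

TILE_M = 16
k, n = 4, 4
x = [[1.0] * k]                                       # the real 1 x K vector
B = [[2.0] * n for _ in range(k)]                     # K x N matrix
padded = x + [[0.0] * k for _ in range(TILE_M - 1)]   # zero-pad M up to 16

out = matmul(padded, B)
print(out[0])        # row 0 is the useful GEMV result
print(1 / TILE_M)    # fraction of the tile output actually used: 0.0625
```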
Possible implementation
I think we could use the non-accel path of rocMLIR for this. A possible strategy:
Steps/Tickets