CPU/CUDA: fix GQA mul mat back, add CUDA support #11380

JohannesGaessler · 2025-01-23T21:50:41Z

On master the backward pass for matrix multiplication does not work correctly when the broadcasting for GQA is involved. However, this is not being detected because all of the relevant gradient tests are being skipped for speed. This PR fixes the backward pass and adds CUDA support. To make the backward pass work I am adding an extra parameter to ggml_repeat_back because the GQA broadcasting is different from e.g. the one in ggml_repeat.
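To make the difference between the two broadcasting schemes concrete, here is a minimal sketch, not the ggml implementation. It assumes a repeat tiles its source (expanded head i comes from source head i % n_kv), while the GQA broadcast in the matrix multiplication places the copies of one source head next to each other (expanded head i comes from source head i / r). Each head's gradient is reduced to a single float, and the names reduce_tiled, reduce_adjacent, and n_kv are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Backward of a tiled repeat: expanded head i is assumed to come from
// source head i % n_kv, so its gradient is accumulated there.
static std::vector<float> reduce_tiled(const std::vector<float> & grad, int n_kv) {
    assert(n_kv > 0 && grad.size() % n_kv == 0);
    std::vector<float> out(n_kv, 0.0f);
    for (int i = 0; i < (int) grad.size(); ++i) {
        out[i % n_kv] += grad[i];
    }
    return out;
}

// Backward of a GQA-style broadcast: expanded head i is assumed to come from
// source head i / r, where r is the number of copies per source head.
static std::vector<float> reduce_adjacent(const std::vector<float> & grad, int n_kv) {
    assert(n_kv > 0 && grad.size() % n_kv == 0);
    const int r = (int) grad.size() / n_kv;
    std::vector<float> out(n_kv, 0.0f);
    for (int i = 0; i < (int) grad.size(); ++i) {
        out[i / r] += grad[i];
    }
    return out;
}
```

With the gradients {1, 2, 3, 4} and n_kv = 2, reduce_tiled gives {4, 6} and reduce_adjacent gives {3, 7}. Both outputs have the same shape, but they sum different input elements.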

This PR also adds minor fixes to other backward passes. After this PR it should not be necessary to make further changes to ggml ops for #10544.

ggerganov · 2025-01-24T08:17:51Z

Can the adjacent logic be performed automatically, without explicitly passing the argument to ggml_repeat_back()? Not 100% sure, but maybe check whether the repeat operation requires broadcast (i.e. nr1 > 1 || nr2 > 1) and, if so, use the adjacent == true branch? I could be missing something though.

JohannesGaessler · 2025-01-24T08:22:31Z

No, the problem is that the shape is the same but different values need to be iterated over. Although, now that I'm writing this, I'm realizing that you could get the same result by inserting a call to ggml_view and adding CUDA support for noncontiguous inputs. I'll do that instead.
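The view idea can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: a single generic reduction sums over the repetition axis of a 2D strided view, and only the strides decide which elements are combined. The struct view_2d and the functions reduce_repeats, view_tiled, and view_adjacent are hypothetical; strides are counted in elements rather than bytes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 2D strided view over a flat buffer; strides are in elements.
struct view_2d {
    const float * data;
    int ne0, ne1; // extents: ne0 = source heads, ne1 = copies per source head
    int s0, s1;   // strides of dimension 0 and dimension 1
    float at(int i0, int i1) const { return data[i0*s0 + i1*s1]; }
};

// Generic reduction: sums over dimension 1 (the repetition axis) for every i0.
static std::vector<float> reduce_repeats(const view_2d & v) {
    std::vector<float> out(v.ne0, 0.0f);
    for (int i1 = 0; i1 < v.ne1; ++i1) {
        for (int i0 = 0; i0 < v.ne0; ++i0) {
            out[i0] += v.at(i0, i1);
        }
    }
    return out;
}

// Tiled layout: copy i1 of source head i0 sits at i1*n_kv + i0 (dim 0 contiguous).
static view_2d view_tiled(const std::vector<float> & grad, int n_kv) {
    assert(n_kv > 0 && grad.size() % n_kv == 0);
    const int r = (int) grad.size() / n_kv;
    return { grad.data(), n_kv, r, 1, n_kv };
}

// Adjacent layout: copy i1 of source head i0 sits at i0*r + i1, so
// dimension 0 of the view is no longer contiguous.
static view_2d view_adjacent(const std::vector<float> & grad, int n_kv) {
    assert(n_kv > 0 && grad.size() % n_kv == 0);
    const int r = (int) grad.size() / n_kv;
    return { grad.data(), n_kv, r, r, 1 };
}
```

For the gradients {1, 2, 3, 4, 5, 6} and n_kv = 2, reduce_repeats(view_tiled(...)) gives {9, 12} and reduce_repeats(view_adjacent(...)) gives {6, 15}. The reduction itself needs no extra parameter, but it has to accept a noncontiguous input, which is why the approach depends on noncontiguous support in the backend.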

ggerganov · 2025-01-24T08:35:40Z

Yup, sounds like a better alternative.

JohannesGaessler · 2025-01-24T10:20:20Z

I found and fixed another bug in the CUDA code for OUT_PROD related to dimension 1 not being contiguous.

github-actions bot added the testing, Nvidia GPU, and ggml labels Jan 23, 2025

CPU/CUDA: fix (GQA) mul mat back, add CUDA support

ae4cca3

JohannesGaessler force-pushed the back-misc branch from 2cf6f8b to ae4cca3 Compare January 24, 2025 10:19

ggerganov approved these changes Jan 24, 2025


JohannesGaessler merged commit 8137b4b into ggerganov:master Jan 24, 2025
45 checks passed

anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025

CPU/CUDA: fix (GQA) mul mat back, add CUDA support (ggerganov#11380)

dd59cc6
