
CUDA: fix mul_mat_vec for CC 6.0 #11775

Open · JohannesGaessler wants to merge 1 commit into master

Conversation

JohannesGaessler (Collaborator)

Fixes #10318 (comment).

The problem is that by default the code is compiled for compute capabilities 5.2, 6.1, 7.0, and 7.5. A GP100 has compute capability 6.0, the minimum for FP16 intrinsics. The host code reports that it can do MMV with those intrinsics, but without GGML_CUDA_F16 there is no actual device code available. This PR is more of a band-aid fix that simply makes GPUs with compute capability 6.0 use FP32 arithmetic if the code was not compiled with GGML_CUDA_F16. Medium-term I intend to revise the handling of these intrinsics and will do a proper fix at that time.
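
To make the mismatch concrete, here is a hedged sketch of the failure mode; the kernel and the host-side commentary are simplified stand-ins, not the actual ggml sources:

```cpp
#include <cuda_fp16.h>

// Hedged sketch of the failure mode (simplified; not the actual ggml kernels).
__global__ void mul_mat_vec_sketch(const half2 *x, const half2 *y, float *dst) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 600
    // FP16 path: half2 intrinsics like __hmul2 are only emitted for CC >= 6.0.
    const half2 p = __hmul2(x[threadIdx.x], y[threadIdx.x]);
    dst[threadIdx.x] = __low2float(p) + __high2float(p);
#else
    // Compiled for CC 5.2, this branch is what ends up in the binary:
    // the FP16 path above does not exist in the generated device code.
    dst[threadIdx.x] = 0.0f;
#endif
}

// Host side: fast_fp16_available(cc) is true for a GP100 (cc == 600) because
// the hardware supports fast FP16. But with the default target list
// {5.2, 6.1, 7.0, 7.5} no device code was built for CC 6.0; the runtime falls
// back to the 5.2 variant, so dispatching the FP16 path misbehaves or crashes.
```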

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 9, 2025
slaren (Collaborator) commented on Feb 9, 2025

Wouldn't it be possible to check if the architecture is in __CUDA_ARCH_LIST__? Maybe we should do that more often instead of assuming that certain kernels are available.
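
For context, __CUDA_ARCH_LIST__ is a macro that nvcc (with recent CUDA toolkits) defines in both the host and device compilation passes, listing every architecture the translation unit is being compiled for. A hedged illustration, assuming the default target list described in the PR description:

```cpp
// Illustrative only: for a build targeting CCs 5.2, 6.1, 7.0, and 7.5, e.g.
//   nvcc -gencode arch=compute_52,code=sm_52 \
//        -gencode arch=compute_61,code=sm_61 \
//        -gencode arch=compute_70,code=sm_70 \
//        -gencode arch=compute_75,code=sm_75 ...
// nvcc defines:
//   __CUDA_ARCH_LIST__ == 520,610,700,750
// Note that 600 is absent, which is exactly what a compile-time check can detect.
```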

A review comment on the changed lines:

```cpp
    const enum ggml_prec prec = fast_fp16_available(cc) ? ggml_prec(dst->op_params[0]) : GGML_PREC_F32;
#else
    // FIXME by default there is no code for CC 6.0 so trying to use FP16 intrinsics results in a crash
    const enum ggml_prec prec = fast_fp16_available(cc) && cc != 600 ? ggml_prec(dst->op_params[0]) : GGML_PREC_F32;
```

Add to common code:

```cpp
#ifdef __CUDA_ARCH_LIST__
constexpr bool ggml_cuda_has_arch_impl(int) {
    return false;
}

template<class ... Archs>
constexpr bool ggml_cuda_has_arch_impl(int arch, int first, Archs... rest) {
    return arch == first || ggml_cuda_has_arch_impl(arch, rest...);
}

constexpr bool ggml_cuda_has_arch(int arch) {
    return ggml_cuda_has_arch_impl(arch, __CUDA_ARCH_LIST__);
}
#else
constexpr bool ggml_cuda_has_arch(int) {
    return false;
}
#endif // __CUDA_ARCH_LIST__
```

Then:

Suggested change:

```diff
-const enum ggml_prec prec = fast_fp16_available(cc) && cc != 600 ? ggml_prec(dst->op_params[0]) : GGML_PREC_F32;
+const enum ggml_prec prec = fast_fp16_available(cc) && (cc != 600 || ggml_cuda_has_arch(600)) ? ggml_prec(dst->op_params[0]) : GGML_PREC_F32;
```

This will keep the check at compile time, so it shouldn't add any overhead. Though I am sure you could come up with a more generic check.
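
For intuition, a hedged sketch of how the constexpr lookup resolves, with the architecture list values assumed to be those of the default build (520, 610, 700, 750):

```cpp
// Illustrative only: the literal list below stands in for __CUDA_ARCH_LIST__
// as nvcc would define it for the default build (note there is no 600 entry).
constexpr bool ggml_cuda_has_arch_impl(int) { return false; }

template<class ... Archs>
constexpr bool ggml_cuda_has_arch_impl(int arch, int first, Archs... rest) {
    // Compile-time recursion: compare against the head, then the tail.
    return arch == first || ggml_cuda_has_arch_impl(arch, rest...);
}

constexpr bool ggml_cuda_has_arch(int arch) {
    // In the real suggestion this is ggml_cuda_has_arch_impl(arch, __CUDA_ARCH_LIST__).
    return ggml_cuda_has_arch_impl(arch, 520, 610, 700, 750);
}

static_assert( ggml_cuda_has_arch(610), "sm_61 device code is available");
static_assert(!ggml_cuda_has_arch(600), "no sm_60 device code -> use FP32");
```

Because every call is constexpr with constant arguments, the comparison chain folds away during compilation and the runtime check reduces to a constant.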
