Avoid unnecessary calls to cuFuncSetAttribute #2678

Open · wants to merge 1 commit into main
Conversation

@rafbiels (Contributor) commented on Feb 7, 2025

Calling cuFuncSetAttribute to set CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES is required to launch kernels that use more than 48 kB of local memory[1] (CUDA dynamic shared memory). Without it, cuLaunchKernel fails with CUDA_ERROR_INVALID_VALUE. However, calling cuFuncSetAttribute introduces synchronisation in the CUDA runtime that blocks until all in-flight H2D/D2H memory copies have finished (the reason is unclear), effectively preventing kernel launches from overlapping with memory copies. This causes significant performance degradation in some workflows, specifically in applications that launch overlapping memory copies and kernels from multiple host threads into multiple CUDA streams on the same GPU.
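For context, a minimal sketch of the baseline requirement using the CUDA Driver API (the function and launch parameters here are illustrative, not taken from the adapter):

```cpp
#include <cuda.h>

// Sketch only: assumes `func`, `stream`, and kernel arguments are already set up.
CUresult launchWithLargeSharedMem(CUfunction func, CUstream stream,
                                  void **kernelArgs, size_t sharedMemBytes) {
  // Without this call, cuLaunchKernel returns CUDA_ERROR_INVALID_VALUE
  // whenever sharedMemBytes exceeds the 48 kB default limit.
  CUresult err = cuFuncSetAttribute(
      func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
      static_cast<int>(sharedMemBytes));
  if (err != CUDA_SUCCESS)
    return err;

  return cuLaunchKernel(func, /*gridDimX=*/1, 1, 1,
                        /*blockDimX=*/256, 1, 1,
                        static_cast<unsigned>(sharedMemBytes), stream,
                        kernelArgs, /*extra=*/nullptr);
}
```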

Avoid the CUDA runtime synchronisation, and the resulting poor performance, by removing the cuFuncSetAttribute call unless it is strictly necessary: call it only when the user requests a specific carveout (via environment variables) or when the kernel launch would fail without it (local memory size > 48 kB). This recovers good performance with default settings for kernels that use little or no local memory.
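For illustration, a minimal sketch of the guard described above (the names `maybeRaiseSharedMemLimit`, `carveoutRequested`, and the 48 kB constant are assumptions for this sketch, not the adapter's actual code):

```cpp
#include <cuda.h>

constexpr size_t kDefaultDynamicSharedMemLimit = 48 * 1024; // 48 kB

// Only call cuFuncSetAttribute when the launch would otherwise fail or
// when the user explicitly requested a carveout; the call synchronises
// with in-flight H2D/D2H copies, so it is skipped whenever possible.
CUresult maybeRaiseSharedMemLimit(CUfunction func, size_t sharedMemBytes,
                                  bool carveoutRequested) {
  if (!carveoutRequested && sharedMemBytes <= kDefaultDynamicSharedMemLimit)
    return CUDA_SUCCESS;

  return cuFuncSetAttribute(func,
                            CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                            static_cast<int>(sharedMemBytes));
}
```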

No effect on kernel execution time was observed after removing the attribute call, across a wide range of tested kernels using various amounts of local memory.

[1] Related to the 48 kB static shared memory limit; see the footnote for "Maximum amount of shared memory per thread block" in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability

rafbiels requested a review from a team as a code owner (Feb 7, 2025 19:46)
rafbiels requested a review from npmiller (Feb 7, 2025 19:46)
github-actions bot added the cuda (CUDA adapter specific issues) label (Feb 7, 2025)
@rafbiels (Contributor, Author) commented on Feb 7, 2025

This issue came from a customer benchmark. I cannot share the code, but I tested the change after adding a configurable amount of shared memory to their benchmark and observed the following results:
[benchmark results plot: benchmark-UR-cuFuncSetAttribute-patch]

Also, for the record, here is how the CUDA runtime synchronisation looks in nsys:

Without cuFuncSetAttribute (operations overlap):
[nsys timeline screenshot: nsys-cuda-noattr]

With cuFuncSetAttribute (locking occurs):
[nsys timeline screenshot: nsys-cuda-setattr]

@rafbiels (Contributor, Author) commented on Feb 7, 2025

intel/llvm PR: intel/llvm#16928
