Avoid unnecessary calls to cuFuncSetAttribute #2678

Open · wants to merge 1 commit into main
Conversation

@rafbiels (Contributor) commented on Feb 7, 2025

Calling cuFuncSetAttribute to set CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES is required to launch kernels that use more than 48 kB of local memory[1] (CUDA dynamic shared memory). Without it, cuLaunchKernel fails with CUDA_ERROR_INVALID_VALUE. However, calling cuFuncSetAttribute introduces synchronisation in the CUDA runtime that blocks until all in-flight H2D/D2H memory copies have finished (the reason is unclear), effectively preventing kernel launches from overlapping with memory copies. This causes significant performance degradation in some workflows, specifically in applications that launch overlapping memory copies and kernels from multiple host threads into multiple CUDA streams on the same GPU.
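For context, a minimal sketch of the baseline requirement using the CUDA Driver API (the function and launch parameters here are illustrative, not taken from the adapter):

```cpp
#include <cuda.h>

// Sketch only: assumes `func`, `stream`, and kernel arguments are already set up.
CUresult launchWithLargeSharedMem(CUfunction func, CUstream stream,
                                  void **kernelArgs, size_t sharedMemBytes) {
  // Without this call, cuLaunchKernel returns CUDA_ERROR_INVALID_VALUE
  // whenever sharedMemBytes exceeds the 48 kB default limit.
  CUresult err = cuFuncSetAttribute(
      func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
      static_cast<int>(sharedMemBytes));
  if (err != CUDA_SUCCESS)
    return err;

  return cuLaunchKernel(func, /*gridDimX=*/1, 1, 1,
                        /*blockDimX=*/256, 1, 1,
                        static_cast<unsigned>(sharedMemBytes), stream,
                        kernelArgs, /*extra=*/nullptr);
}
```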

Avoid the CUDA runtime synchronisation, and the resulting poor performance, by removing the cuFuncSetAttribute call unless it is strictly necessary: call it only when the user requests a specific carveout (via environment variables) or when the kernel launch would fail without it (local memory size > 48 kB). This recovers good performance with default settings for kernels that use little or no local memory.
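For illustration, a minimal sketch of the guard described above (the names `maybeRaiseSharedMemLimit`, `carveoutRequested`, and the 48 kB constant are assumptions for this sketch, not the adapter's actual code):

```cpp
#include <cuda.h>

constexpr size_t kDefaultDynamicSharedMemLimit = 48 * 1024; // 48 kB

// Only call cuFuncSetAttribute when the launch would otherwise fail or
// when the user explicitly requested a carveout; the call synchronises
// with in-flight H2D/D2H copies, so it is skipped whenever possible.
CUresult maybeRaiseSharedMemLimit(CUfunction func, size_t sharedMemBytes,
                                  bool carveoutRequested) {
  if (!carveoutRequested && sharedMemBytes <= kDefaultDynamicSharedMemLimit)
    return CUDA_SUCCESS;

  return cuFuncSetAttribute(func,
                            CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                            static_cast<int>(sharedMemBytes));
}
```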

No effect on kernel execution time was observed after removing the attribute call, across a wide range of tested kernels using various amounts of local memory.

[1] Related to the 48 kB static shared memory limit; see the footnote for "Maximum amount of shared memory per thread block" in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability

rafbiels requested a review from a team as a code owner (Feb 7, 2025 19:46)
rafbiels requested a review from npmiller (Feb 7, 2025 19:46)
github-actions bot added the cuda (CUDA adapter specific issues) label (Feb 7, 2025)
@rafbiels (Contributor, Author) commented on Feb 7, 2025

This issue came from a customer benchmark. I cannot share the code, but I tested the change after adding a configurable amount of shared memory to their benchmark and observed the following results:
[benchmark results plot: benchmark-UR-cuFuncSetAttribute-patch]

Also, for the record, here is how the CUDA runtime synchronisation looks in nsys:

Without cuFuncSetAttribute (operations overlap):
[nsys timeline screenshot: nsys-cuda-noattr]

With cuFuncSetAttribute (locking occurs):
[nsys timeline screenshot: nsys-cuda-setattr]

@rafbiels (Contributor, Author) commented on Feb 7, 2025

intel/llvm PR: intel/llvm#16928
