- update docs
- update docs
- add `memcmp`, `memmove` and `memchr` implementations
- implement tests
- Use cuda::std::min/max in Thrust (NVIDIA#3364)
- Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361)
  * implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`
- Cleanup util_arch (NVIDIA#2773)
- Deprecate thrust::null_type (NVIDIA#3367)
- Deprecate cub::DeviceSpmv (NVIDIA#3320)
  Fixes: NVIDIA#896
- Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246)
  * fixes segment offset generation
  * switches to analytical verification
  * switches to analytical verification for pairs
  * fixes spelling
  * adds tests for large number of segments
  * fixes narrowing conversion in tests
  * addresses review comments
  * fixes includes
- Compile basic infra test with C++17 (NVIDIA#3377)
- Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308)
  * fixes segment offset generation
  * switches to analytical verification
  * switches to analytical verification for pairs
  * addresses review comments
  * introduces segment offset type
  * adds tests for large number of segments
  * adds support for large number of segments
  * drops segment offset type
  * fixes thrust namespace
  * removes about-to-be-deprecated cub iterators
  * no exec specifier on defaulted ctor
  * fixes gcc7 linker error
  * uses local_segment_index_t throughout
  * determine offset type based on type returned by segment iterator begin/end iterators
  * minor style improvements
- Exit with error when RAPIDS CI fails.
  (NVIDIA#3385)
- cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)
  * Introduce gpu_struct decorator and typing
  * Enable `reduce` to accept arrays of structs as inputs
  * Add test for reducing arrays-of-struct
  * Update documentation
  * Use a numpy array rather than ctypes object
  * Change zeros -> empty for output array and temp storage
  * Add a TODO for typing GpuStruct
  * Documentation updates
  * Remove test_reduce_struct_type from test_reduce.py
  * Revert to `to_cccl_value()` accepting ndarray + GpuStruct
  * Bump copyrights
  Co-authored-by: Ashwin Srinath <[email protected]>
- Deprecate thrust::async (NVIDIA#3324)
  Fixes: NVIDIA#100
- Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)
- Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314)
  * add compiler-specific path
  * fix device code path
  * add _CCCL_ASSUME
- Deprecate thrust::numeric_limits (NVIDIA#3366)
- Replace `typedef` with `using` in libcu++ (NVIDIA#3368)
- Deprecate thrust::optional (NVIDIA#3307)
  Fixes: NVIDIA#3306
- Upgrade to Catch2 3.8 (NVIDIA#3310)
  Fixes: NVIDIA#1724
- refactor `<cuda/std/cstdint>` (NVIDIA#3325)
  Co-authored-by: Bernhard Manfred Gruber <[email protected]>
- Update CODEOWNERS (NVIDIA#3331)
  * Update CODEOWNERS
  * Update CODEOWNERS
  * Update CODEOWNERS
  * [pre-commit.ci] auto code formatting
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Fix sign-compare warning (NVIDIA#3408)
- Implement more cmath functions to be usable on host and device (NVIDIA#3382)
  * Implement more cmath functions to be usable on host and device
  * Implement math roots functions
  * Implement exponential functions
- Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)
  * Redefine and deprecate thrust::remove_cvref
  Co-authored-by: Michael Schellenberger Costa <[email protected]>
- Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418)
  NVHPC cannot decide at compile time where the code would run, so _CCCL_ASSERT within a constexpr function breaks it. Fix this by always using the host definition, which should also work on device. Fixes NVIDIA#3411
- Extend CUB reduce benchmarks (NVIDIA#3401)
  * Rename max.cu to custom.cu, since it uses a custom operator
  * Extend types covered by min.cu to all fundamental types
  * Add some notes on how to collect tuning parameters
  Fixes: NVIDIA#3283
- Update upload-pages-artifact to v3 (NVIDIA#3423)
  * Update upload-pages-artifact to v3
  * Empty commit
  Co-authored-by: Ashwin Srinath <[email protected]>
- Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421)
- `std::linalg` accessors and `transposed_layout` (NVIDIA#2962)
- Add round up/down to multiple (NVIDIA#3234)
- [FEA]: Introduce Python module with CCCL headers (NVIDIA#3201)
  * Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative
  * Run `copy_cccl_headers_to_aude_include()` before `setup()`
  * Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.
  * Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel
  * Bug fix: cuda/_include only exists after shutil.copytree() ran.
  * Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py
  * Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)
  * Replace := operator (needs Python 3.8+)
  * Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md
  * Restore original README.md: `pip3 install -e` now works on first pass.
  * cuda_cccl/README.md: FOR INTERNAL USE ONLY
  * Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment)). Command used: ci/update_version.sh 2 8 0
  * Modernize pyproject.toml, setup.py. Triggers for this change: NVIDIA#3201 (comment), NVIDIA#3201 (comment)
  * Install CCCL headers under cuda.cccl.include. Trigger for this change: NVIDIA#3201 (comment). Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.
  * Factor out cuda_cccl/cuda/cccl/include_paths.py
  * Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative
  * Add missing Copyright notice.
  * Add missing __init__.py (cuda.cccl)
  * Add `"cuda.cccl"` to `autodoc.mock_imports`
  * Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)
  * Add # TODO: move this to a module-level import
  * Modernize cuda_cooperative/pyproject.toml, setup.py
  * Convert cuda_cooperative to use hatchling as build backend.
  * Revert "Convert cuda_cooperative to use hatchling as build backend." This reverts commit 61637d6.
  * Move numpy from [build-system] requires -> [project] dependencies
  * Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH
  * Remove copy_license() and use license_files=["../../LICENSE"] instead.
  * Further modernize cuda_cccl/setup.py to use pathlib
  * Trivial simplifications in cuda_cccl/pyproject.toml
  * Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code
  * Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml
  * Add taplo-pre-commit to .pre-commit-config.yaml
  * taplo-pre-commit auto-fixes
  * Use pathlib in cuda_cooperative/setup.py
  * CCCL_PYTHON_PATH in cuda_cooperative/setup.py
  * Modernize cuda_parallel/pyproject.toml, setup.py
  * Use pathlib in cuda_parallel/setup.py
  * Add `# TOML lint & format` comment.
  * Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml
  * Use pathlib in cuda/cccl/include_paths.py
  * pre-commit autoupdate (EXCEPT clang-format, which was manually restored)
  * Fixes after git merge main
  * Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

    ```
    =========================== warnings summary ===========================
    tests/test_reduce.py::test_reduce_non_contiguous
      /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

      Traceback (most recent call last):
        File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
          bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                           ^^^^^^^^^^^^^^^^^
      AttributeError: '_Reduce' object has no attribute 'build_result'

      warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))
    -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
    =============== 1 passed, 93 deselected, 1 warning in 0.44s ===============
    ```

  * Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`
  * Introduce cuda_cooperative/constraints.txt
  * Also add cuda_parallel/constraints.txt
  * Add `--constraint constraints.txt` in ci/test_python.sh
  * Update Copyright dates
  * Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024). For completeness: the other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.
  * Remove unused cuda_parallel jinja2 dependency (noticed by chance).
  * Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.
  * Make cuda_cooperative, cuda_parallel testing completely independent.
  * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]
  * Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
  * Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]
  * Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]". This reverts commit ea33a21. Error message: NVIDIA#3201 (comment)
  * Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
  * Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]
  * Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]
  * Restore original ci/matrix.yaml [skip-rapids]
  * Use for loop in test_python.sh to avoid code duplication.
  * Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]
  * Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]
  * Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]". This reverts commit ec206fd.
  * Implement suggestion by @shwina (NVIDIA#3201 (review))
  * Address feedback by @leofang
  Co-authored-by: Bernhard Manfred Gruber <[email protected]>
- cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)
  * Add optional stream argument to reduce_into()
  * Add tests to check for reduce_into() stream behavior
  * Move protocol related utils to separate file and rework __cuda_stream__ error messages
  * Fix synchronization issue in stream test and add one more invalid stream test case
  * Rename cuda stream validation function after removing leading underscore
  * Unpack values from __cuda_stream__ instead of indexing
  * Fix linting errors
  * Handle TypeError when unpacking invalid __cuda_stream__ return
  * Use stream to allocate cupy memory in new stream test
- Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434)
- Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)
  * Deprecate `cub::{min, max}` and replace internal uses with those from libcu++. Fixes NVIDIA#3404
- Fix CI issues (NVIDIA#3443)
- Remove deprecated `cub::min` (NVIDIA#3450)
  * Remove deprecated `cuda::{min,max}`
  * Drop unused `thrust::remove_cvref` file
- Fix typo in builtin (NVIDIA#3451)
- Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)
- uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436)
- Default transform_iterator's copy ctor (NVIDIA#3395)
  Fixes: NVIDIA#2393
- Turn C++ dialect warning into error (NVIDIA#3453)
- Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437)
  * uses thrust's dynamic dispatch for merge_sort
  * [pre-commit.ci] auto code formatting
  Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Refactor allocator handling of contiguous_storage (NVIDIA#3050)
  Co-authored-by: Michael Schellenberger Costa <[email protected]>
- Drop thrust::detail::integer_traits (NVIDIA#3391)
- Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)
  Co-authored-by: Michael Schellenberger Costa <[email protected]>
- Improve docs of std headers (NVIDIA#3416)
- Drop C++11 and C++14 support for all of cccl (NVIDIA#3417)
  * Drop C++11 and C++14 support for all of cccl
  Co-authored-by: Bernhard Manfred Gruber <[email protected]>
- Deprecate a few CUB macros (NVIDIA#3456)
- Deprecate thrust universal iterator categories (NVIDIA#3461)
- Fix launch args order (NVIDIA#3465)
- Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432)
- add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)
- Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433)
- Drop universal iterator categories (NVIDIA#3474)
- Ensure that headers in `<cuda/*>` can be built with a C++ only compiler (NVIDIA#3472)
- Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470). Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements.
  Co-authored-by: Michael Schellenberger Costa <[email protected]>
- Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)
  * moves emptykernel to detail ns
  * second batch
  * third batch
  * fourth batch
  * fixes cuda parallel
  * concatenates nested namespaces
- Deprecate block/warp algo specializations (NVIDIA#3455)
  Fixes: NVIDIA#3409
- Refactor CUB's util_debug (NVIDIA#3345)