Switch to using strict channel priority during RAPIDS builds #84

Open · vyasr opened this issue Jul 22, 2024 · 18 comments
@vyasr (Contributor) commented Jul 22, 2024

RAPIDS conda packages currently do not install successfully when using strict channel priority, which has caused some difficulty for users in the past. Strict channel priority also generally leads to faster solves. The reason that RAPIDS requires flexible channel priority is that some packages have historically been published to both the rapidsai[-nightly] and conda-forge channels. Typically this occurred because RAPIDS needed specific versions/builds of packages that were not yet available on conda-forge. However, in recent years we have moved to a much stronger reliance on building and maintaining conda-forge packages as needed, so most of the packages that we've done this for in the past (ucx, nccl) are now regularly available on conda-forge and no longer updated on the rapidsai[-nightly] channel.
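
For reference, strict priority can be enabled either per command or globally; a minimal sketch (the environment name and package spec here are placeholders):

  # per-command: add the flag to the solve
  mamba create -n rapids-strict --strict-channel-priority \
      -c rapidsai -c conda-forge -c nvidia rapids

  # globally: set it in the user's condarc
  conda config --set channel_priority strict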

We should clean out the old packages in the rapidsai[-nightly] channel that prevent strict solving from working. Rather than removing them altogether, we can move them under a new label so that old versions could still be installed by specifying that label (although in general installing old versions will be quite challenging without a fully specified environment lock file anyway, given how conda-forge's global pinnings move over time and how other packages there are released).

@raydouglass (Member) commented:

This is mostly documenting some of my tests for installing and running older versions of RAPIDS.

We also need to test arm64 installs because RAPIDS supported arm64 before many conda-forge packages did and we released those packages in our rapidsai conda channel.

This is the test script I used to check for import errors. It is not comprehensive. https://gist.github.com/raydouglass/ff100a114c2a370b68131af55959afc0

Test machine:

  • Driver 550.78
  • System CTK 12.3
  • x86_64
  • Ubuntu 22.04.4
  • 2x Quadro RTX 8000
  • Tests were run bare-metal unless otherwise stated

Here is the conda list output for each environment below: https://gist.github.com/raydouglass/5948d6cab3d3c9f29cc02533bb2b4d25

23.02

Solved and tested with mamba create -n rapids-23.02 python=3.10 cudatoolkit=11.8 rapids=23.02.

22.02

Solved with mamba create -n rapids-22.02 python=3.9 cudatoolkit=11.5 rapids=22.02.

Test errored with:

Traceback (most recent call last):
  File "/home/rdouglass/workspace/snippets/test_rapids.py", line 2, in <module>
    import cudf
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/cudf/__init__.py", line 5, in <module>
    validate_setup()
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/cudf/utils/gpu_utils.py", line 20, in validate_setup
    from rmm._cuda.gpu import (
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/rmm/__init__.py", line 16, in <module>
    from rmm import mr
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/rmm/mr.py", line 14, in <module>
    from rmm._lib.memory_resource import (
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/rmm/_lib/__init__.py", line 15, in <module>
    from .device_buffer import DeviceBuffer
  File "rmm/_lib/device_buffer.pyx", line 1, in init rmm._lib.device_buffer
TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))

I think this is a system CTK issue since running the script in the original unmodified rapidsai/rapidsai:22.02-cuda11.5-runtime-ubuntu20.04-py3.9 image works for cudf/cuml. I did not reinstall the rapids package in the container.
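
For reference, a sketch of how that container check can be reproduced (the docker flags are standard; the local script path is assumed, and you may need to activate the image's rapids conda environment before running the script):

  docker run --gpus all --rm -it \
      -v $PWD/test_rapids.py:/test_rapids.py \
      rapidsai/rapidsai:22.02-cuda11.5-runtime-ubuntu20.04-py3.9 bash
  # then, inside the container:
  python /test_rapids.py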

0.10

This is the first version with the rapids meta package.

Solves with mamba create -n rapids-0.10 python=3.6 cudatoolkit=9.2 rapids=0.10

I did not test this.

@vyasr (Contributor, Author) commented Sep 23, 2024

Now that we have an idea of what works, the next step is to figure out what could break with strict channel priority and packages removed. The approach I would follow is to run the same installation commands as above, but adding the --strict-channel-priority flag. For a first pass, a dry run should be sufficient. For each version of RAPIDS tested, inspect the output list of packages and find which ones are being installed from the rapidsai channel. If any of them are packages that we plan to remove from the rapidsai channel, add those to the command with a channel specifier, e.g. for ucx: mamba create ... rapids=${VERSION} conda-forge::ucx. This will force the conda solver to pull the ucx package from conda-forge instead of rapidsai[-nightly].

The dry runs should be sufficient to put together this list and evaluate what will solve. Once a complete list is compiled going back to 23.02 (selected since that was the last working version tested above, though we could go back further), we should actually create the environments (no dry run) and run the test script posted above to see if any results change. I expect the dry runs will tell us most of what we need to know, though: unless two channels host incompatible binaries of the same package at the same version (hopefully unlikely), a successful solve should give us package versions that function according to the constraints in our package dependency spec.
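
As a concrete sketch of that first pass (the environment name is illustrative; the pins are the 23.02 values from the tests above, and conda-forge::ucx is one example of the channel specifier):

  mamba create --dry-run -n rapids-23.02-strict --strict-channel-priority \
      --override-channels -c rapidsai -c conda-forge -c nvidia \
      python=3.10 cudatoolkit=11.8 rapids=23.02 conda-forge::ucx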

@gforsyth commented Jan 17, 2025

Testing package removals

Using a resurrected version of https://github.com/regro/conda-metachannel/ I was able to test the impact on various solves and environments by blocking packages currently available in the rapidsai channel (simulating the end-state if we move those packages under a label).

The current removal list consists of these packages from rapidsai:

[linux-64] = ['blazingsql', 'blazingsql-build-env', 'blazingsql-notebook-env', 'clang', 'clx', 'cupy', 'faiss', 'faiss-gpu', 'faiss-proc', 'libclx', 'libcudf_cffi', 'libcypher-parser', 'libfaiss', 'libfaiss-avx2', 'libnvstrings', 'rapids-blazing', 'rapids-build-env', 'rapids-doc-env', 'rapids-notebook-env', 'rapids-scout', 'rapids-scout-local', 'strings_udf', 'ucx']

[linux-aarch64] = ['cupy', 'faiss', 'faiss-gpu', 'faiss-proc', 'libcypher-parser', 'libfaiss', 'nccl', 'pyarrow', 'rapids-build-env', 'rapids-notebook-env', 'rapids-scout-local', 'strings_udf', 'ucx']

[noarch] = ['cmake_setuptools', 'dask-xgboost', 'datashader', 'python-whois']

Results

All environments that could install were installed and then run against the test script linked above (https://gist.github.com/raydouglass/ff100a114c2a370b68131af55959afc0).

All x86 test runs were on my work laptop, running:

  • Ubuntu 24.04
  • Driver: 565.57.01
  • RTX 3500 AD104GLM

All aarch64 test runs were on an NVIDIA labs machine, running:

  • Ubuntu 22.04 LTS
  • Driver: 535.161.08
  • A100

The results are additionally divided by whether the single datashader tarball in rapidsai/noarch is removed. More on that below.

All of these installs are of the form:

mamba create -n {name} python={python_version} [cuda-version=12 | cudatoolkit=11.8] rapids={ver} -c rapidsai -c conda-forge -c nvidia (--strict-channel-priority)? (--platform=linux-aarch64)? --override-channels

With datashader=0.13.1a removed

| RAPIDS version | arch    | CUDA version | Installs | Strict Priority | Passes Tests | Failure reason       |
|----------------|---------|--------------|----------|-----------------|--------------|----------------------|
| 24.12          | x86     | 12           | Y        | Y               | Y            |                      |
| 24.10          | x86     | 12           | Y        | Y               | Y            |                      |
| 24.08          | x86     | 12           | Y        | Y               | Y            |                      |
| 24.06          | x86     | 12           | Y        | Y               | Y            |                      |
| 24.04          | x86     | 12           | Y        | N               | Y            | missing libcumlprims |
| 24.02          | x86     | 12           | Y        | N               | Y            | missing libcumlprims |
| 24.12          | aarch64 | 12           | Y        | Y               | Y            |                      |
| 24.10          | aarch64 | 12           | Y        | Y               | Y            |                      |
| 24.08          | aarch64 | 12           | Y        | Y               | Y            |                      |
| 24.06          | aarch64 | 12           | Y        | Y               | Y            |                      |
| 24.04          | aarch64 | 12           | Y        | N               | Y            | missing libcumlprims |
| 24.02          | aarch64 | 12           | Y        | N               | Y            | missing libcumlprims |

libcumlprims was added to rapidsai starting in 24.06, so strict solves fail before version 24.06.

| RAPIDS version | arch    | CUDA version | Installs | Strict Priority | Passes Tests | Failure reason                  |
|----------------|---------|--------------|----------|-----------------|--------------|---------------------------------|
| 24.12          | x86     | 11           | Y        | N               | Y            | cuda-profiler-api>=11.4.240,<12 |
| 24.10          | x86     | 11           | Y        | N               | Y            | cuda-profiler-api>=11.4.240,<12 |
| 24.08          | x86     | 11           | Y        | N               | Y            | cuda-profiler-api>=11.4.240,<12 |
| 24.06          | x86     | 11           | Y        | N               | Y            | cuda-profiler-api>=11.4.240,<12 |
| 24.04          | x86     | 11           | Y        | N               | Y            | libcumlprims                    |
| 24.02          | x86     | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.12          | x86     | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.10          | x86     | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.08          | x86     | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.06          | x86     | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.04          | x86     | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.02          | x86     | 11           | N        |                 |              | datashader=0.13.1a              |
| 24.12          | aarch64 | 11           | Y        | N               | Y            | cuda-profiler-api>=11.4.240,<12 |
| 24.10          | aarch64 | 11           | Y        | N               | Y            | cuda-profiler-api>=11.4.240,<12 |
| 24.08          | aarch64 | 11           | Y        | N               | Y            | cuda-profiler-api>=11.4.240,<12 |
| 24.06          | aarch64 | 11           | Y        | N               | Y            | cuda-profiler-api>=11.4.240,<12 |
| 24.04          | aarch64 | 11           | Y        | N               | Y            | libcumlprims                    |
| 24.02          | aarch64 | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.12          | aarch64 | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.10          | aarch64 | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.08          | aarch64 | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.06          | aarch64 | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.04          | aarch64 | 11           | Y        | N               | Y            | libcumlprims                    |
| 23.02          | aarch64 | 11           | N        |                 |              | datashader=0.13.1a              |

cuda-profiler-api is only available on conda-forge for >=12, so strict priority won't work for CUDA 11 (cuda-profiler-api<12 can't be added to conda-forge without also adding older versions of cuda-toolkit, which seems like too much work for this effort).

With datashader=0.13.1a not removed

If datashader=0.13.1a is left in place, then strict solves fail: conda is looking for datashader>=0.14, but the presence of datashader=0.13.1a in rapidsai constrains the search space to the rapidsai channel only (for that particular dependency).

None of the RAPIDS environments can be installed with --strict-channel-priority, but rapids-23.02 will install (without strict priority).
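
For reference, one way to see that constraint for yourself is to ask each channel what it offers (standard conda search usage; the versions returned will vary over time):

  conda search 'datashader' -c rapidsai --override-channels
  conda search 'datashader>=0.14' -c conda-forge --override-channels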

Recommendations

I would very much welcome opinions from the broader build team on this.

Below are a few general approaches and the top-level pros and cons, as I see them:

Remove the packages listed above, including datashader (recommended)

  • All RAPIDS versions back to 23.04 install without changes to the install command.
  • All RAPIDS versions >=24.06 with CUDA12 now solve with --strict-channel-priority.
  • RAPIDS version 23.02 will need to specify a label in the install command for it to work.

Remove the packages listed above, keep datashader, and add all datashader versions >=0.14 to rapidsai

  • All RAPIDS versions back to 23.04 install without changes to the install command.
  • All RAPIDS versions >=24.06 with CUDA12 now solve with --strict-channel-priority.
  • RAPIDS version 23.02 will install without changes to the install command.
  • We have to build and upload several versions of datashader and continue doing so until we drop support for 23.02

Remove the packages listed above, including datashader, and add all libcumlprims versions from nvidia to rapidsai

  • Outcome is the same as the recommended approach, but --strict-channel-priority might work for RAPIDS versions >=23.04 with CUDA12 (this is harder to test at the moment).
  • We have to build and upload several versions of libcumlprims to rapidsai and continue doing so until we drop support for 24.04

@bdice (Contributor) commented Jan 17, 2025

> Remove the packages listed above, including datashader (recommended)

Let's do this option. I think the outcome of "All RAPIDS versions back to 23.04 install without changes to the install command" meets our requirements for backwards-looking support. Strict channel priority is only important to us in a forward-looking context, so moving old versions of libcumlprims to rapidsai is not worthwhile in my opinion.

@raydouglass (Member) commented:

Wow, thanks @gforsyth! I suspect supporting back about two years should be enough, but since there were skeptics, I've added an item to clarify this to the PIC sync agenda on Tuesday.

However, just want to clarify whether the tests were done with just linux-64 or linux-aarch64 (or both) architectures.

One concern I had is that we built and published several third-party linux-aarch64 packages, but I don't remember for which RAPIDS versions. I suspect it was before 23.04, but it would be good to verify that aarch64 works for the same versions as well.

@gforsyth commented:

> However, just want to clarify whether the tests were done with just linux-64 or linux-aarch64 (or both) architectures.

Great point @raydouglass -- this was just on my work laptop, so linux-64 only. I'll look into options for testing against linux-aarch64

@raydouglass (Member) commented:

> I'll look into options for testing against linux-aarch64

We have some arm64 machines in the RDS lab, you can file an issue to get access: https://github.com/rapidsai/ops/issues/new?template=02-rds-lab-machine-and-access-request.yml

@jakirkham (Member) commented:

One can also test other architectures on the same machine using the --platform flag

For example running this Windows install...

conda create --dry-run --platform win-64 --name tst_win_64 zlib

...gives me this on my Mac:

Channels:
 - conda-forge
Platform: win-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/jkirkham/miniforge/envs/tst_win_64

  added / updated specs:
    - zlib


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libzlib-1.3.1              |       h2466b09_2          54 KB  conda-forge
    ucrt-10.0.22621.0          |       h57928b3_1         547 KB  conda-forge
    vc-14.3                    |      ha32ba9b_23          17 KB  conda-forge
    vc14_runtime-14.42.34433   |      he29a5d6_23         737 KB  conda-forge
    zlib-1.3.1                 |       h2466b09_2         105 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         1.4 MB

The following NEW packages will be INSTALLED:

  libzlib            conda-forge/win-64::libzlib-1.3.1-h2466b09_2 
  ucrt               conda-forge/win-64::ucrt-10.0.22621.0-h57928b3_1 
  vc                 conda-forge/win-64::vc-14.3-ha32ba9b_23 
  vc14_runtime       conda-forge/win-64::vc14_runtime-14.42.34433-he29a5d6_23 
  zlib               conda-forge/win-64::zlib-1.3.1-h2466b09_2 



DryRunExit: Dry run. Exiting.

@jakirkham (Member) commented:

Thanks for digging into this Gil! 🙏

> Remove the packages listed above, including datashader (recommended)

Agree this is a good recommendation

> I suspect supporting back about two years should be enough, but since there were skeptics, I've added an item to clarify this to the PIC sync agenda on Tuesday.

A tweak on Gil's approach could be, instead of deleting the packages, to move them to a label, like legacy.

For all practical purposes those packages would still be ignored for the solve. There would just be a way for users to get those back by adding this label
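
For example (hypothetical, assuming the label ends up being called legacy), an old install command would only need one extra channel entry ahead of rapidsai:

  mamba create -n rapids-22.02 python=3.9 cudatoolkit=11.5 rapids=22.02 \
      -c rapidsai/label/legacy -c rapidsai -c conda-forge -c nvidia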

Another advantage of a label is we can always tweak it again if needed

Anyways agree this would be good to discuss. Thanks for bringing it up Ray! 🙏

@gforsyth commented:

Updated my results above with dry-run solves for linux-aarch64 and a few additional proposed "removals" (relabels).
Still need to test that those environments function as expected, but in terms of environment solutions, aarch64 matches x86 in terms of where strict solves will work and how far back things will continue to work.

@mmccarty assigned gforsyth and unassigned KyleFromNVIDIA on Jan 23, 2025
@gforsyth commented:

Our approach moving forward is to target this package relabeling effort to land with 25.04. That will give us 2 years of backwards compatibility, where install commands will work without any changes.

Any install commands that are >2 years old will require specifying a label, like legacy.

This work is valuable on its own for speed of environment solving and better compatibility guarantees with possible future conda defaults. It is, however, no longer blocking the rattler-build work, as it is now possible to disable strict channel priority with rattler-build.

@gforsyth commented:

Ok, I've run all the tests on both x86 and aarch64 and confirmed that the plan as documented above will have the same impact on both architectures.

@vyasr (Contributor, Author) commented Jan 24, 2025

Awesome, thanks Gil! Sounds like we're in good shape to make this happen in the next release then. Correct me if I am wrong, but practically speaking the only action item to enable strict channel priority is the move of the various packages behind a label, right? Trying to gauge how we actually move forward here.

If we're saying that we want to break the backwards compatibility when we release 25.04, but the actual breakage occurs when we move the packages, then in practice it sounds like our action items would be:

  • Start working on rattler-build early in 25.04 using flexible channel priority since that is now supported. Plan to have those PRs all merged during the release cycle.
  • During the 25.04 release, ops should move all of the packages listed above behind a legacy label. This should not happen before the release or we will break 23.04 installs before the 25.04 release.

Does that sound right?
CC @raydouglass for the label changes

@bdice (Contributor) commented Jan 24, 2025

(Yes, and… to the above)

Let’s try to move towards strict priority in builds where possible. I think it might work already for CUDA 12 builds, for a subset of RAPIDS packages. Whatever constraints we find, we can tighten them gradually and use flexible priority in the meantime.

@gforsyth commented:

Yes, I think that's right. I think the only action item we have in the shorter term is to add some kind of banner to https://docs.rapids.ai/install/ to let folks know that installs older than 23.06 will require (small) adjustments

@bdice (Contributor) commented Jan 24, 2025

Since this is a long-term announcement, we should publish a RAPIDS Support Notice (https://docs.rapids.ai/notices/) with more information on what actions will be needed from users. We can link to that RSN on the install page, release blogs, and other communications.

Here are some recent pull requests with RSNs that you can use as examples when writing this up: https://github.com/rapidsai/docs/pulls?q=is%3Apr+is%3Aclosed+RSN

@vyasr (Contributor, Author) commented Jan 24, 2025

Agreed. I think for rollout, during the 25.02 release we can switch CI over on a per-package basis by adding conda config --set channel_priority strict to the conda CI scripts and seeing what works. Once we're confident that every package is doing this, we can update the condarc template we use for our images and push out a new one.

I'm not sure of the best timing for the latter; it might have to be done right at the end of code freeze in order to line up with the timelines proposed above, vis-à-vis only removing the packages from rapidsai right at 25.04 release time. I'll let Ray comment on what he thinks is best when he's back.
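
As a sketch, the condarc template change for the images might amount to something like this (hypothetical excerpt; the real template lives with our image tooling):

  channels:
    - rapidsai
    - conda-forge
    - nvidia
  channel_priority: strict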

@gforsyth commented:

Opened a PR in rapidsai/docs to add the RSN
