Switch to using strict channel priority during RAPIDS builds #84
Comments
This is mostly documenting some of my tests for installing and running older versions of RAPIDS. We also need to test arm64 installs because RAPIDS supported arm64 before many conda-forge packages did and we released those packages in our
This is the test script I used to check for import errors. It is not comprehensive. https://gist.github.com/raydouglass/ff100a114c2a370b68131af55959afc0
Test machine:
Here are the
|
Now that we have an idea of what works, the next step is to figure out what could break with strict channel priority and packages removed. The approach I would follow is to run the same installation commands as above, but adding the |
Testing package removals
Using a resurrected version of https://github.com/regro/conda-metachannel/ I was able to test the impact on various solves and environments by blocking packages currently available in the
The current removal list consists of these files from
Results
All environments that could install were installed and then run against the test script linked above (https://gist.github.com/raydouglass/ff100a114c2a370b68131af55959afc0).
All x86 test runs were on my work laptop, running:
All aarch64 test runs were on an NVIDIA labs machine, running:
The results are additionally divided by whether or not the single
All of these installs are of the form:
mamba create -n {name} python={python_version} [cuda-version=12 | cudatoolkit=11.8] rapids={ver} -c rapidsai -c conda-forge -c nvidia (--strict-channel-priority)? (--platform=linux-aarch64)? --override-channels
With
|
RAPIDS version | arch | CUDA version | Installs | Strict Priority | Passes Tests | Failure reason |
---|---|---|---|---|---|---|
24.12 | x86 | 12 | Y | Y | Y | |
24.10 | x86 | 12 | Y | Y | Y | |
24.08 | x86 | 12 | Y | Y | Y | |
24.06 | x86 | 12 | Y | Y | Y | |
24.04 | x86 | 12 | Y | N | Y | missing libcumlprims |
24.02 | x86 | 12 | Y | N | Y | missing libcumlprims |
24.12 | aarch64 | 12 | Y | Y | Y | |
24.10 | aarch64 | 12 | Y | Y | Y | |
24.08 | aarch64 | 12 | Y | Y | Y | |
24.06 | aarch64 | 12 | Y | Y | Y | |
24.04 | aarch64 | 12 | Y | N | Y | missing libcumlprims |
24.02 | aarch64 | 12 | Y | N | Y | missing libcumlprims |
`libcumlprims` was added to `rapidsai` starting in 24.06, so strict solves fail before version 24.06.
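The install commands for these runs all follow the single template quoted earlier in this comment, so the test matrix can be generated rather than typed by hand. The following is a minimal sketch of that idea, not the script actually used for these results. The function name, the environment naming scheme, and the `python=3.11` value are assumptions made for illustration; the channel order, CUDA specs, and flags are copied from the template.

```python
from itertools import product

# Channel order copied from the install command template in this thread.
CHANNELS = ("rapidsai", "conda-forge", "nvidia")


def build_install_command(rapids_version, cuda_major, python_version,
                          strict=False, aarch64=False):
    """Return one `mamba create` command, as an argument list, for a single test case."""
    # The template pins CUDA 12 with `cuda-version=12` and CUDA 11 with `cudatoolkit=11.8`.
    if cuda_major == 12:
        cuda_spec = "cuda-version=12"
    elif cuda_major == 11:
        cuda_spec = "cudatoolkit=11.8"
    else:
        raise ValueError(f"unsupported CUDA major version: {cuda_major}")

    arch = "aarch64" if aarch64 else "x86"
    # Hypothetical environment naming scheme, chosen only to keep names unique.
    name = f"rapids-{rapids_version}-cuda{cuda_major}-{arch}"

    cmd = ["mamba", "create", "-n", name,
           f"python={python_version}", cuda_spec, f"rapids={rapids_version}"]
    for channel in CHANNELS:
        cmd += ["-c", channel]
    if strict:
        cmd.append("--strict-channel-priority")
    if aarch64:
        cmd.append("--platform=linux-aarch64")
    cmd.append("--override-channels")
    return cmd


if __name__ == "__main__":
    # python=3.11 is an assumed placeholder, not a value taken from the thread.
    for version, strict in product(["24.12", "24.06", "24.04"], (False, True)):
        print(" ".join(build_install_command(version, 12, "3.11", strict=strict)))
```

Returning an argument list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting.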
RAPIDS version | arch | CUDA version | Installs | Strict Priority | Passes Tests | Failure reason |
---|---|---|---|---|---|---|
24.12 | x86 | 11 | Y | N | Y | cuda-profiler-api>=11.4.240,<12 |
24.10 | x86 | 11 | Y | N | Y | cuda-profiler-api>=11.4.240,<12 |
24.08 | x86 | 11 | Y | N | Y | cuda-profiler-api>=11.4.240,<12 |
24.06 | x86 | 11 | Y | N | Y | cuda-profiler-api>=11.4.240,<12 |
24.04 | x86 | 11 | Y | N | Y | libcumlprims |
24.02 | x86 | 11 | Y | N | Y | libcumlprims |
23.12 | x86 | 11 | Y | N | Y | libcumlprims |
23.10 | x86 | 11 | Y | N | Y | libcumlprims |
23.08 | x86 | 11 | Y | N | Y | libcumlprims |
23.06 | x86 | 11 | Y | N | Y | libcumlprims |
23.04 | x86 | 11 | Y | N | Y | libcumlprims |
23.02 | x86 | 11 | N | | | datashader=0.13.1a |
24.12 | aarch64 | 11 | Y | N | Y | cuda-profiler-api>=11.4.240,<12 |
24.10 | aarch64 | 11 | Y | N | Y | cuda-profiler-api>=11.4.240,<12 |
24.08 | aarch64 | 11 | Y | N | Y | cuda-profiler-api>=11.4.240,<12 |
24.06 | aarch64 | 11 | Y | N | Y | cuda-profiler-api>=11.4.240,<12 |
24.04 | aarch64 | 11 | Y | N | Y | libcumlprims |
24.02 | aarch64 | 11 | Y | N | Y | libcumlprims |
23.12 | aarch64 | 11 | Y | N | Y | libcumlprims |
23.10 | aarch64 | 11 | Y | N | Y | libcumlprims |
23.08 | aarch64 | 11 | Y | N | Y | libcumlprims |
23.06 | aarch64 | 11 | Y | N | Y | libcumlprims |
23.04 | aarch64 | 11 | Y | N | Y | libcumlprims |
23.02 | aarch64 | 11 | N | | | datashader=0.13.1a |
`cuda-profiler-api` is only available on `conda-forge` for `>=12`, so strict priority won’t work. (`cuda-profiler-api<12` can’t be added to `conda-forge` without also adding older versions of `cuda-toolkit`, which seems like too much work for this effort.)
With `datashader=0.13.1a` not removed
If `datashader=0.13.1a` is left in place, then strict solves fail because `conda` is looking for `datashader>=0.14`, but the presence of `datashader=0.13.1a` in `rapidsai` constrains the search space to the `rapidsai` channel only (for that particular dependency).
None of the `rapids` environments can install with `--strict-channel-priority`. But `rapids-23.02` will install.
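The mechanism described above can be illustrated with a toy model. This is a deliberately simplified sketch, not conda's actual solver: it assumes that under strict priority the first channel carrying a package name is the only one searched for that package, and that under flexible priority every channel is a candidate. The index contents and version tuples are illustrative stand-ins (the pre-release `0.13.1a` is approximated as `(0, 13, 1)`).

```python
def candidates(name, channels, strict):
    """Return the (channel, version) pairs the toy solver may consider for `name`.

    `channels` is a list of (channel_name, {package_name: [version tuples]})
    given in priority order, highest priority first.
    """
    found = []
    for channel_name, packages in channels:
        versions = packages.get(name, [])
        if not versions:
            continue
        found += [(channel_name, version) for version in versions]
        if strict:
            # Strict priority: the first channel that carries the package name
            # wins, and lower-priority channels are ignored for this package.
            break
    return found


def can_satisfy(name, minimum, channels, strict):
    """True if any candidate version of `name` is at least `minimum`."""
    return any(version >= minimum
               for _, version in candidates(name, channels, strict))


# Toy index: versions are illustrative, not a copy of the real channels.
TOY_INDEX = [
    ("rapidsai", {"datashader": [(0, 13, 1)]}),
    ("conda-forge", {"datashader": [(0, 13, 0), (0, 14, 0)]}),
]

# Same index after the old datashader build is removed from rapidsai.
CLEANED_INDEX = [
    ("rapidsai", {}),
    ("conda-forge", {"datashader": [(0, 13, 0), (0, 14, 0)]}),
]

if __name__ == "__main__":
    print(can_satisfy("datashader", (0, 14), TOY_INDEX, strict=True))      # False
    print(can_satisfy("datashader", (0, 14), TOY_INDEX, strict=False))     # True
    print(can_satisfy("datashader", (0, 14), CLEANED_INDEX, strict=True))  # True
```

The same shape explains the `libcumlprims` failures: once a package name exists in the higher-priority channel, older versions that live only in a lower-priority channel are no longer visible to a strict solve.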
Recommendations
I would very much welcome opinions from the broader build team on this.
Below are a few general approaches and the top-level pros and cons, as I see them:
Remove the packages listed above, including `datashader` (recommended)
- All RAPIDS versions back to 23.04 install without changes to the install command.
- All RAPIDS versions >=24.06 with `CUDA12` now solve with `--strict-channel-priority`.
- RAPIDS version 23.02 will need to specify a label in the `install` command for it to work.
Remove the packages listed above, don’t remove `datashader`, and add all `datashader` versions `>=0.14` to `rapidsai`
- All RAPIDS versions back to 23.04 install without changes to the install command.
- All RAPIDS versions >=24.06 with `CUDA12` now solve with `--strict-channel-priority`.
- RAPIDS version 23.02 will install without changes to the install command.
- We have to build and upload several versions of `datashader` and continue doing so until we drop support for `23.02`.
Remove the packages listed above, including `datashader`, and add all `libcumlprims` versions from `nvidia` to `rapidsai`
- Outcome is the same as the recommended approach, but `--strict-channel-priority` might work for RAPIDS versions >=23.04 with `CUDA12` (this is harder to test at the moment).
- We have to build and upload several versions of `libcumlprims` to `rapidsai` and continue doing so until we drop support for `24.04`.
Let's do this option. I think the outcome of "All RAPIDS versions back to 23.04 install without changes to the install command" meets our requirements for backwards-looking support. Strict channel priority is only important to us in a forward-looking context, so moving old versions of |
Wow thanks @gforsyth! I suspect supporting back about two years should be enough, but since there were skeptics, I've added clarifying this to the PIC sync agenda on Tuesday. However, just want to clarify whether the tests were done with just One concern I had is that we built and published several third-party |
Great point @raydouglass -- this was just on my work laptop, so |
We have some arm64 machines in the RDS lab, you can file an issue to get access: https://github.com/rapidsai/ops/issues/new?template=02-rds-lab-machine-and-access-request.yml |
One can also test other architectures on the same machine using the For example running this Windows install...
...gives me this on my Mac:
|
Thanks for digging into this Gil! 🙏
Agree this is a good recommendation
A tweak on Gil's approach could be, instead of deleting the packages, to move them to a label, like
For all practical purposes those packages would still be ignored for the solve. There would just be a way for users to get those back by adding this label
Another advantage of a label is that we can always tweak it again if needed.
Anyways, agree this would be good to discuss. Thanks for bringing it up Ray! 🙏 |
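To make the label idea concrete: conda addresses labeled packages on a channel with the `channel/label/name` form, so users who need the relabeled packages would add one more `-c` argument. The sketch below builds such an argument list. The label name `example-label` is purely hypothetical (the actual label is not given here), and placing the labeled channel directly after the main channel is also an assumption.

```python
def labeled_channel(channel, label):
    """Return the `channel/label/name` form used to address labeled packages."""
    return f"{channel}/label/{label}"


def channel_args(channels, extra_label=None):
    """Build `-c` arguments, optionally adding a labeled variant of the first channel."""
    args = []
    for index, channel in enumerate(channels):
        args += ["-c", channel]
        if index == 0 and extra_label is not None:
            # Assumption: the labeled channel is listed right after the main one.
            args += ["-c", labeled_channel(channel, extra_label)]
    return args


if __name__ == "__main__":
    # "example-label" is a hypothetical placeholder, not the label chosen by the team.
    print(" ".join(channel_args(["rapidsai", "conda-forge", "nvidia"],
                                extra_label="example-label")))
```

Without `extra_label`, the function returns the same channel arguments as the current install commands, which matches the goal that recent versions need no changes.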
Updated my results above with dry-run solves for |
Our approach moving forward is to target this package relabeling effort to land with 25.04. That will give us 2 years of backwards compatibility, where install commands will work without any changes. Any install commands that are >2 years old will require specifying a label, like This work is valuable on its own for speed of environment solving and better compatibility guarantees with possible future |
Ok, I've run all the tests on both x86 and aarch64 and confirmed that the plan as documented above will have the same impact on both architectures. |
Awesome thanks Gil! Sounds like we're in good shape to make this happen in the next release then. Correct me if I am wrong, but practically speaking the only action item to enable strict channel priority is the move of the various packages behind a label, right? Trying to gauge how we actually move forward here. If we're saying that we want to break the backwards compatibility when we release 25.04, but the actual breakage occurs when we move the package, then in practice it sounds like our action items would be:
Does that sound right? |
(Yes, and… to the above) Let’s try to move towards strict priority in builds where possible. I think it might work already for CUDA 12 builds, for a subset of RAPIDS packages. Whatever constraints we find, we can tighten them gradually and use flexible priority in the meantime. |
Yes, I think that's right. I think the only action item we have in the shorter term is to add some kind of banner to https://docs.rapids.ai/install/ to let folks know that installs older than 23.06 will require (small) adjustments |
Since this is a long-term announcement, we should publish a RAPIDS Support Notice (https://docs.rapids.ai/notices/) with more information on what actions will be needed from users. We can link to that RSN on the install page, release blogs, and other communications. Here are some recent pull requests with RSNs that you can use as examples when writing this up: https://github.com/rapidsai/docs/pulls?q=is%3Apr+is%3Aclosed+RSN |
Agreed. I think for rollout what we can do is during the 25.02 release we can switch package CI on a per package basis by adding |
Opened a PR in rapidsai/docs to add the RSN |
RAPIDS conda packages currently do not install successfully when using strict channel priority. This has caused some difficulty for users in the past. Strict channel priority also generally leads to faster solves. The reason that RAPIDS requires flexible channel priority is that some packages have historically been published to both the rapidsai[-nightly] and conda-forge channels. Typically this occurred because RAPIDS needed specific versions/builds of packages that were not yet available on conda-forge. However, in recent years we have moved to a much stronger reliance on building and maintaining conda-forge packages as needed, so most of the packages that we've done this for in the past (ucx, nccl) are now made regularly available on conda-forge and are no longer updated on the rapidsai[-nightly] channel.
We should clean out the old packages in the rapidsai[-nightly] channel that prevent strict solving from working. Rather than removing them altogether, we can move them under a new label so that old versions could still be installed with that label specified (although in general installing old versions will be quite challenging without a fully specified environment lock file anyway, due to how conda-forge's global pinnings move and how other packages there are released).