Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve design and address comments in the new CI #11079

Merged
merged 29 commits into from
Dec 10, 2024
Merged

Conversation

hcho3
Copy link
Collaborator

@hcho3 hcho3 commented Dec 8, 2024

Follow up to #11001

Apply feedback from #11046. See #11046 to learn about the rationale for changes.

@hcho3 hcho3 marked this pull request as ready for review December 10, 2024 08:49
@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 10, 2024

@trivialfis Most of the changes here are stylistic and do not affect the development process too much. However, this pull request contains one major change that significantly changes the development flow.

Currently, we build Docker containers on the fly as part of the CI run for the pull request. But after this PR, we will start to have a separate CI pipeline to build and push CI containers in https://github.com/dmlc/xgboost-devops.

Why the change?

To make the code more legible and easier to reason about. The current script docker_build.sh is hard to understand because it does three things at once:

  1. build a container
  2. fetch a container from a private registry (for caching); and
  3. push the new container back to the registry.

Often a workflow would call docker_build.sh just to fetch a pre-built container from the registry. This makes it confusing to newcomers, as noted by @jameslamb in #11046 (comment) and #11046 (comment).

This PR separates the concerns by having each script do one thing only.

  • docker_build.sh (xgboost-devops repo) builds a container and pushes it to the registry.
  • docker_run.py (xgboost repo) fetches the container from the registry.

To ensure that the container stays up to date. In the current setup, we use the Docker registry as a caching layer for containers. This setup suffers from two problems.

  1. False cache hit: New versions of packages are released but the containers are not refreshed in time because the Dockerfile did not get a (textual) change. For example, the xgb-ci.gpu container installs scikit-learn (no constraints specified). Even though a new version of scikit-learn (1.6.0) was released today, the container was not refreshed because the Dockerfile for the container, Dockerfile.gpu, was unchanged. This behavior is a side effect of how Docker caches builds.
  2. Containers get re-built at unexpected timing: Whenever the base image gets updated, the whole Docker container gets re-built. The re-build tends to happen at an unfortunate timing, causing us to wait longer for the CI execution for time-critical PRs.

In the new setting, xgboost-devops repo hosts a dedicated CI pipeline to build CI containers at regular scheduled time. (It is triggered by push, pull_request, as well as schedule.) So the first issue is addressed. The second issue is also addressed, as PRs on the XGBoost repo will spend a predictable amount of time pulling the containers from the registry.

Lastly, other projects also de-couples building of CI containers from the consumption. Example: LightGBM, Rapids.

I've written a guide to show how CI containers are updated: https://xgboost--11079.org.readthedocs.build/en/11079/contrib/ci.html#making-changes-to-ci-containers. Can you take a look at it and see if it makes sense?

@hcho3 hcho3 merged commit 598133e into dmlc:master Dec 10, 2024
52 of 57 checks passed
@hcho3 hcho3 deleted the update_ci branch December 10, 2024 21:25
Copy link
Contributor

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding to the @ from #11046 (comment) ... sorry I couldn't get to it sooner!

I know this is already merged, but left a few small suggestions you might want to. consider for followup. All very minor... I don't have any other significant structural changes to request.

This looks awesome! Thanks so much for giving me the opportunity to review and for talking through all the comments I left on #11046 with me.

ops/pipeline/build-test-jvm-packages.sh Show resolved Hide resolved
python-package/pyproject.toml Show resolved Hide resolved
ops/pipeline/manage-artifacts.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants