Pace build optional CI #1460

FlorianDeconinck · 2023-11-29T16:43:47Z

As climate models developed at NOAA and NASA are leveraging DaCe more and more for their performance backend, we have seen multiple occurrences of major/minor version updates breaking downstream code.

This optional GitHub action is an attempt to reduce that breakage by allowing the DaCe ecosystem to pull a vetted version of the Pace climate model and run a subset of the regression tests that should exercise enough of DaCe to catch a good number of errors.

NASA takes responsibility for keeping this CI clean and working. For any non-DaCe issue, please ping @FlorianDeconinck.

All data and model code are under open-source licenses.

FlorianDeconinck · 2023-11-29T16:45:23Z

@phschaad @BenWeber42

As discussed, here's an attempt at bringing in a CI that exercises a fraction (Remapping) of the model.

This shouldn't be merged up until we all agree on the general principle and execution time.

phschaad · 2023-12-04T09:20:31Z

Thank you for this nice CI bundle @FlorianDeconinck !
We have discussed this and have concluded that depending on the time it takes to run this CI workflow, it may be desirable to even make the workflow non-optional.

FlorianDeconinck · 2023-12-04T14:15:31Z

If we think it'll be good to have it in the regular suite we can tailor what we extract from the model.

Basically you have 3 big one-rank tests (in descending order of complexity):

Remapping (<- this is the one in the PR)
Tracer advection
D_SW

And then we have 2 bigger and more complex tests that need 6 ranks (because of halo exchanges):

FVDynamics (basically the entire dynamical core)
Acoustics

FVDynamics is a bit of a beast. The inner Acoustics loop would cover about 75% of the code, so it might offer the best ratio of code coverage to run time.

On the other hand, for a lightning-fast test, D_SW is the cornerstone of the acoustics loop and often the buggiest code.

My only concern is that installing the entire Python stack is already not fast. I'll come back with some numbers.

phschaad · 2023-12-04T14:49:26Z

In that case I'll be interested to see what the numbers say for the Remapping test, which should help us decide if we keep it optional or make it a required test. In any case having it as part of the available actions will make a lot of sense.

As for the multi-rank tests we will have to see what the best strategy is to avoid using GH runners here. From what I understand it would already be a big help if we get the 3 single rank tests up and running asap and that would buy us some time to strategize about the dynamical core and acoustics tests?

FlorianDeconinck · 2023-12-04T14:56:39Z

Agreed. You can force a mpirun --oversubscribe, but depending on the quality of the hardware this might not be a real option...
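For reference, here is a minimal sketch of how an oversubscribed 6-rank launch could be assembled. It assumes Open MPI, whose `mpirun` accepts `--oversubscribe`; the `build_mpirun_command` helper and the `tests/` pytest target are hypothetical placeholders, not what the Pace CI actually runs.

```python
from __future__ import annotations

import shlex


def build_mpirun_command(
    ranks: int, test_command: list[str], oversubscribe: bool = True
) -> list[str]:
    """Assemble an mpirun invocation as an argument list (Open MPI assumed)."""
    command = ["mpirun", "-n", str(ranks)]
    if oversubscribe:
        # Lets Open MPI start more ranks than there are available slots/cores.
        command.append("--oversubscribe")
    return command + test_command


# Hypothetical test target: the real Pace test invocation may differ.
cmd = build_mpirun_command(6, ["python", "-m", "pytest", "tests/"])
print(shlex.join(cmd))
```

Oversubscribed ranks share cores, so timings from such a run would say little about real multi-rank performance.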

I'd say we can get decent code coverage with the single-rank tests, and as we implement more model code under DaCe we can also refine those to cover more patterns.

Running the multi-rank tests on the DaCe side might not even be a good idea. After all, on our side we need to run full coverage, including science testing, for every major version of DaCe. At that stage, something like a webhook from this repo to ours triggering a job on our HPC makes more sense. Let's call it a future endeavor ;)

BenWeber42

Thanks for the nice PR!

> This shouldn't be merged up until we all agree on the general principle and execution time.

Agreed, we should clarify some minor aspects w.r.t. this CI.

.github/workflows/pace-build-ci.yml

Checkout DaCe before install Better base pip upgrade

FlorianDeconinck · 2023-12-04T19:23:22Z

Okay, here's a proposal based on timings from my box, which is not that great.

Timing for dace:gpu backend runtime:

Remapping: 3mn37s
D_SW: 4mn38s
Riem_Solver_C: 0mn45s

Overall 9mn on top of the install of the stack, so potentially around 15mn for a pretty sturdy test.
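As a quick sanity check on the total above, here is a minimal sketch that sums the three measured runtimes. The `parse_duration` helper and the `timings` dictionary are illustrative names, not part of the CI script; the values are copied from the list above.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a string such as '3mn37s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)mn(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


# Timings reported for the dace:gpu backend (copied from the comment above).
timings = {
    "Remapping": "3mn37s",
    "D_SW": "4mn38s",
    "Riem_Solver_C": "0mn45s",
}

total_seconds = sum(parse_duration(t) for t in timings.values())
print(f"{total_seconds // 60}mn{total_seconds % 60:02d}s")  # prints 9mn00s
```

The three tests add up to 540 seconds, which matches the 9mn figure; the remaining ~6mn of the 15mn estimate is the stack install.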

What do you think?

FlorianDeconinck · 2023-12-05T19:05:34Z

Alright @BenWeber42 / @phschaad

I've updated the script to run the tests in order from fastest to slowest: RiemSolverC, D_SW and then Remapping.

The workflow is on workflow_dispatch for now. I propose that, if the code is good enough, we merge this so we can trigger it in GitHub; if the timings are decent, we can then open a PR adding a master/ci-fix trigger.

phschaad · 2023-12-06T13:45:20Z

This seems like a good strategy, yes - Thank you!

BenWeber42

Thanks for the additional CI!

Follow up of #1460
- [x] Fixed the `ci` script (including `git checkout` issues around selecting the correct `dace`)
- [x] Move `D_SW` to execute only on rank 0 to avoid rebuild
- [x] Swapped Riemann Solver on C-grid for D-grid for better coverage

~~WARNING: this PR is blocked by #1477~~
~~WARNING: this PR is blocked by #1568~~

Co-authored-by: Tal Ben-Nun

Pace build optional CI

db885c8

phschaad approved these changes Dec 4, 2023

phschaad enabled auto-merge (squash) December 4, 2023 09:18

Merge branch 'master' into pace_optional_ci

03dbd6c

phschaad disabled auto-merge December 4, 2023 09:19

BenWeber42 requested changes Dec 4, 2023

FlorianDeconinck added 2 commits December 4, 2023 14:01

Merge remote-tracking branch 'origin/master' into pace_optional_ci

169744b

Use ci/DaCe branch on Pace side for canonical test

240c781

Checkout DaCe before install Better base pip upgrade

Add RiemSolver_C and D_SW to the tests

d4e5677

tbennun approved these changes Dec 5, 2023

Merge branch 'master' into pace_optional_ci

cc58108

phschaad enabled auto-merge (squash) December 6, 2023 13:45

Merge branch 'master' into pace_optional_ci

067f0cf

FlorianDeconinck requested a review from BenWeber42 December 8, 2023 22:38

BenWeber42 removed their request for review December 11, 2023 14:09

BenWeber42 approved these changes Dec 11, 2023

Merge branch 'master' into pace_optional_ci

71eabd2

phschaad merged commit b6e1c9d into spcl:master Dec 12, 2023
11 checks passed

FlorianDeconinck deleted the pace_optional_ci branch December 12, 2023 14:56

FlorianDeconinck mentioned this pull request Dec 12, 2023

NOAA/NASA pyFV3 CI on every commit #1478

Merged

3 tasks
