(SE) Utilizing more Cylc features #490

Dooruk · 2024-12-20T15:33:51Z

One of the main reasons we use Cylc is its design, which includes many features useful for DA cycling. We currently exercise some of them, but we should take advantage of more. I will create this as an epic, and it can expand as users think of more features to add.

Necessary ones:

Retry certain tasks in case they fail: Most of the time, failures of Run.. tasks are caused by filesystem issues rather than configuration problems. Cylc should try running such a task at least one more time before giving up. This is especially important for suites involving ensembles, because even if 31 members succeed and 1 fails, the workflow will stall.
Execute certain tasks in TSE staging area: This would mainly be relevant for GEOS simulations, but if it provides a significant efficiency gain for others we could benefit from it.
Use compiler flags for different environments: For compute performance we should use certain flags (compiler and node-type dependent). An example for Intel is below:

setenv I_MPI_FABRICS shm:ofi
setenv I_MPI_OFI_PROVIDER psm3

From Matt: The latter is required for Intel MPI on the Milan nodes. The former is good to use as well.

Hold before certain tasks: @rtodling mentioned this one. The ability for a suite to "hold" before a certain task, say RunJediVarExecutable, so that different configurations could be tested swiftly without needing to run swell create for a whole new suite. JCSDA's https://github.com/JCSDA-internal/skylab uses this feature in a different workflow engine called EWOK.
Restart a failed suite from a certain cycle & task: This one is tricky. I know it is possible with Cylc, but I am not sure how it would play out with Swell. After a suite has cycled for a certain number of days, we would want to continue from that point rather than restarting altogether.
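For the environment-dependent MPI settings listed above, one option might be to attach them to a task family in flow.cylc so that only tasks running on the relevant node type inherit them. This is a rough sketch, assuming Cylc 8 syntax. The family name MILAN_INTEL is hypothetical, and how Swell's templates would generate this section is not addressed here.

```ini
[runtime]
    # Hypothetical family holding settings for Intel MPI on Milan nodes
    [[MILAN_INTEL]]
        [[[environment]]]
            I_MPI_FABRICS = shm:ofi
            I_MPI_OFI_PROVIDER = psm3

    # Any task that should pick up these variables inherits the family
    [[RunJediVarExecutable]]
        inherit = MILAN_INTEL
```

Cylc exports the variables under [[[environment]]] in the job script, so the task itself would not need its own setenv lines.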

Optional but useful ones:

Workflow sending email if suite stops/fails. I was able to do this with a sandbox Cylc setup.
Using more SLURM and Milan node features:
- Assign different walltime for tasks: The default is 1hr but most tasks take less than this, so requesting a shorter walltime could improve queue times.
- Shared nodes: Milan nodes have a feature to share nodes between users; for low-demand jobs this could save queue time.
- Per-host: Not sure about this one. @rtodling uses it, and it is relevant for compute performance.
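As a starting point for the email and walltime items above, here is a minimal sketch, assuming Cylc 8 syntax. The email address and the 20-minute limit are placeholders rather than tested values, and RunJediVarExecutable is used only as an example task.

```ini
[scheduler]
    [[events]]
        # Send mail when the workflow stalls or aborts
        mail events = stall, abort
    [[mail]]
        # Placeholder address
        to = user@example.com

[runtime]
    [[RunJediVarExecutable]]
        # Per-task wallclock limit (placeholder value); with the SLURM
        # job runner Cylc uses this to set the job's --time request
        execution time limit = PT20M
```

Shared-node and per-host settings are left out because they depend on site-specific SLURM options.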


Dooruk · 2024-12-20T18:41:57Z

Case in point for retrying: the task had SLURM issues (pink square) but is now running after two failed attempts and with zero changes. I just happened to be monitoring:

ashiklom · 2025-01-10T16:42:52Z

For retrying tasks: As a first pass, this can be added directly to the cylc task configurations.

https://cylc.github.io/cylc-doc/stable/html/user-guide/writing-workflows/runtime.html#task-retry-on-failure
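A minimal sketch of what that could look like, assuming Cylc 8 syntax. The delays and retry counts are illustrative only, and RunJediVarExecutable stands in for whichever tasks need it.

```ini
[runtime]
    [[RunJediVarExecutable]]
        # If the job fails while running: retry after 1 minute,
        # then twice more at 5-minute intervals
        execution retry delays = PT1M, 2*PT5M
        # If job submission itself fails (e.g. a scheduler hiccup):
        # retry twice at 2-minute intervals
        submission retry delays = 2*PT2M
```

Putting the same settings in a family that all ensemble member tasks inherit would avoid repeating them per task.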

Dooruk added enhancement New feature or request Epic labels Dec 20, 2024

Dooruk assigned ashiklom, Dooruk and mranst Dec 20, 2024
