(SE) Utilizing more Cylc features #490

Open
10 tasks
Dooruk opened this issue Dec 20, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request Epic

Comments

@Dooruk
Collaborator

Dooruk commented Dec 20, 2024

One of the main reasons we use Cylc is its design, which includes many useful features for DA cycling. We currently exercise some of them, but we should take advantage of more. I am creating this as an epic; as users think of more features, they can add them here and the list can grow.

Necessary ones:

  • Retry certain tasks in case they fail: Most failures of Run.. tasks are caused by filesystem issues rather than configuration problems. Cylc should try running such a task at least one more time before giving up. This is especially important for suites involving ensembles: even if 31 members succeed and 1 fails, the workflow will stall.
  • Execute certain tasks in TSE staging area: This would mainly be relevant for GEOS simulations but if it provides significant efficiency for others we could benefit from it.
  • Use compiler flags for different environments: For compute performance we should use certain flags (compiler and node type dependent). An example for Intel MPI is below:
setenv I_MPI_FABRICS shm:ofi
setenv I_MPI_OFI_PROVIDER psm3

From Matt: The latter is required for Intel MPI on the Milan nodes. The former is good to use as well.

  • Hold before certain tasks: @rtodling mentioned this one. The ability for a suite to "hold" before a certain task, say RunJediVarExecutable, so that different configurations could be tested swiftly without having to run swell create for a whole new suite. JCSDA's https://github.com/JCSDA-internal/skylab uses this feature in a different workflow engine called EWOK.
  • Restart a failed suite from a certain cycle & task: This one is tricky. I know it is possible with Cylc, but I am not sure how it would play out with Swell. After a suite has cycled for a certain number of days, we would want to continue from that point rather than restarting from scratch.
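
For the Intel MPI settings above, one possible way to apply them per task is through the environment section of the Cylc runtime configuration. This is only a sketch, assuming Cylc 8 flow.cylc syntax: the family name INTEL_MPI_MILAN is hypothetical, and RunJediVarExecutable is the task name as written in this issue, which may not match the actual task name in a Swell suite.

```
[runtime]
    # Hypothetical family that groups the Intel MPI settings for Milan nodes.
    [[INTEL_MPI_MILAN]]
        [[[environment]]]
            # Values taken from the setenv lines quoted in this issue.
            I_MPI_FABRICS = shm:ofi
            I_MPI_OFI_PROVIDER = psm3

    # Task name as written in this issue; it opts in by inheriting the family.
    [[RunJediVarExecutable]]
        inherit = INTEL_MPI_MILAN
```

Keeping the variables in a family would let tasks opt in with a single inherit line, so a different compiler or node type could get its own family without touching the tasks themselves.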

Optional but useful ones:

  • Workflow sending email if suite stops/fails. I was able to do this with a sandbox Cylc setup.
  • Using more SLURM and Milan node features:
    • Assign different walltimes to tasks: The default is 1 hr, but most tasks take less than this, so requesting shorter walltimes could improve queue times.
    • Shared nodes: Milan nodes have a feature to share nodes between users; for low-demand jobs this could save queue time.
    • Per-host: Not sure about this one, @rtodling uses it, relevant for compute performance.
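
As a rough illustration of the email and walltime items, the fragment below assumes Cylc 8 flow.cylc syntax. The email address is a placeholder, the 20-minute limit is an arbitrary illustrative value, and RunJediVarExecutable is the task name as written in this issue. Node sharing and per-host settings are left out because the site-specific SLURM directives for them are not given here.

```
[scheduler]
    [[events]]
        # Send an email when the workflow stalls or aborts.
        mail events = stall, abort
    [[mail]]
        # Placeholder address.
        to = user@example.com

[runtime]
    [[root]]
        # Default wallclock limit inherited by every task (1 hr, as noted above).
        execution time limit = PT1H

    [[RunJediVarExecutable]]
        # Illustrative shorter limit for a task known to finish quickly.
        execution time limit = PT20M
```

With a job runner that supports it, such as SLURM, Cylc should translate the execution time limit into the corresponding walltime request at submission.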
@Dooruk added the enhancement and Epic labels on Dec 20, 2024
@Dooruk
Collaborator Author

Dooruk commented Dec 20, 2024

Case in point for retrying: the task had SLURM issues (pink square) but is now running after two failed attempts, with zero changes. I just happened to be monitoring:

Screenshot 2024-12-20 at 1 40 56 PM

@ashiklom
Collaborator

For retrying tasks: As a first pass, this can be added directly to the cylc task configurations.

https://cylc.github.io/cylc-doc/stable/html/user-guide/writing-workflows/runtime.html#task-retry-on-failure
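
A minimal sketch of what that could look like, assuming Cylc 8 flow.cylc syntax. The task name is the one written in this issue, and the delay values are illustrative rather than tuned for any particular system:

```
[runtime]
    [[RunJediVarExecutable]]
        # If the job fails, retry after 1 minute, then twice more
        # at 5-minute intervals before the task is marked as failed.
        execution retry delays = PT1M, 2*PT5M

        # If job submission itself fails (for example, a transient
        # SLURM error), retry the submission up to 3 times, 2 minutes apart.
        submission retry delays = 3*PT2M
```

Placing the same settings under a task family or under root would apply them to a group of tasks or to the whole suite, which may suit the ensemble case where a single failed member stalls the workflow.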
