One of the main reasons we use Cylc is its design, which includes many features useful for DA cycling. We currently use some of them but should take advantage of more. I am creating this as an epic so the list can grow as users think of more features.
Necessary ones:
Retry certain tasks when they fail: Most of the time, failures of Run.. tasks are caused by filesystem issues rather than configuration problems. Cylc should try running such a task at least one more time before giving up. This is especially important for suites involving ensembles: even if 31 members succeed and 1 fails, the workflow will stall.
Execute certain tasks in TSE staging area: This would mainly be relevant for GEOS simulations but if it provides significant efficiency for others we could benefit from it.
Use compiler flags for different environments: For compute performance we should use certain flags, which depend on the compiler and node type. An example for Intel is below:
From Matt: The latter is required for Intel MPI on the Milan nodes. The former is good to use as well.
Hold before certain tasks: @rtodling mentioned this one. The ability for a suite to "hold" before a certain task, say RunJediVarExecutable, so that different configurations can be tested quickly without having to run swell create for a whole new suite. JCSDA's https://github.com/JCSDA-internal/skylab uses this feature in a different workflow engine called EWOK.
Restart a failed suite from a certain cycle and task: This one is tricky. I know it is possible with Cylc, but I am not sure how it would play out with Swell. After a suite has cycled for a certain number of days, we would want to continue from that point rather than restart from scratch.
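For the retry item above, a minimal sketch of what the generated flow.cylc could contain, assuming Cylc 8 syntax and assuming RunJediVarExecutable appears as a runtime namespace in the suite. The delay values are illustrative, not a recommendation:

```ini
# flow.cylc fragment (Cylc 8 syntax assumed).
[runtime]
    [[RunJediVarExecutable]]
        # If the job fails, resubmit it after 1 minute, and if it
        # fails again, once more after 5 minutes. The task is only
        # marked as failed after the last retry is exhausted.
        execution retry delays = PT1M, PT5M
```

Because namespaces inherit settings, the same line could be placed in a family that the ensemble member tasks inherit from, so that a single failed member is retried instead of stalling the workflow.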
Optional but useful ones:
Workflow sending email if suite stops/fails. I was able to do this with a sandbox Cylc setup.
Using more SLURM and Milan node features:
Assign different walltimes to tasks: The default is 1 hr, but most tasks take less than this, so requesting shorter walltimes could improve queue times.
Shared nodes: Milan nodes have a feature to share nodes between users. For low-demand jobs this could save queue time.
Per-host: Not sure about this one, @rtodling uses it, relevant for compute performance.
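The email and walltime items above could look roughly like the following in flow.cylc. This is a sketch assuming Cylc 8 syntax; the address, the task name placement, and the 20-minute limit are hypothetical placeholders:

```ini
# flow.cylc fragment (Cylc 8 syntax assumed).
[scheduler]
    [[events]]
        # Send an email when the workflow stalls or aborts.
        mail events = stall, abort
    [[mail]]
        # Hypothetical address, replace with the suite owner's.
        to = user@example.com

[runtime]
    [[RunJediVarExecutable]]
        # Per-task walltime. Cylc passes this to the job runner as the
        # job's time limit (for SLURM, the --time request).
        execution time limit = PT20M
```

Node-specific SLURM options such as shared-node or per-host settings would go under the task's [[[directives]]] section. The exact option names depend on the site configuration, so they are left out here.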
Case in point for retrying: the task had SLURM issues (pink square) but is now running after two failed attempts, with zero changes. I just happened to be monitoring.