Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul GH Action workflows #440

Merged
merged 7 commits into from
Mar 5, 2024
Merged

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Feb 28, 2024

Description of proposed changes

Overhaul the GH Action workflows with the main goal to use the shared pathogen-repo-build workflow.
This allows the GH Action workflow success/failure status to match the AWS Batch job status and will allow us to monitor the success/failure of automated workflows via the Pathogen workflow status page.

I made other improvements to use workflow inputs in order to reduce the 8 different GH Action workflows down to 2 workflows. See commits for details.

Related issue(s)

Related to nextstrain/.github#46

Checklist

@joverlee521
Copy link
Contributor Author

The test GISAID build completed successfully:

  • using the trial name and uploaded files to to s3://nextstrain-ncov-private/trial/overhaul-gh-action-workflows/
  • using the custom image nextstrain/ncov-ingest:branch-fix-image
  • using the TEST_SLACK_CHANNEL variable and sent messages to #scratch

It still fetched from the upstream database when the fetch_from_database input was false, but I believe this should be fixed with 4d5b611.


The test GenBank build failed during upload due to permission issues because I forgot the NextstrainNcovIngestUploader policies allow the "files/ncov/open/" prefix (not "files/workflows/ncov/open/").

I'll update the trial prefix to "files/ncov/open/trial/<TRIAL_NAME>/" in order to match the production prefix. I don't think we will ever want to update the ncov production URLs since they are advertised in our SARS-CoV-2 docs.

@joverlee521 joverlee521 force-pushed the overhaul-gh-action-workflows branch from 4d5b611 to 9ccf797 Compare February 29, 2024 20:57
@joverlee521
Copy link
Contributor Author

Updated GenBank trial prefix in force-push

Running a new GenBank trial build.

@joverlee521
Copy link
Contributor Author

GenBank trial build completed successfully and uploaded outputs to s3://nextstrain-data/files/ncov/open/trial/overhaul-gh-action-workflows/.

I'm doing one final trial run today to test the fetch_from_database input which should be fixed by 4c15c84. If all goes well, I'll merge this on Monday so that I can monitor the automated runs.

@joverlee521 joverlee521 force-pushed the overhaul-gh-action-workflows branch from 4c15c84 to b926bd1 Compare March 1, 2024 18:46
@joverlee521
Copy link
Contributor Author

Okay, actual final trial run with final fixes to fetch_from_database in b926bd1 🤞

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be possible to combine the gisaid + genbank workflows into one (by adding a bit more conditional logic) - the workflow files are almost identical. Did you choose not to do this to allow the actions page to easily distinguish between the two?

That brings up a related point -- if we are using the actions page to monitor progress for the foreseeable future, is the best view into this to filter by event:scheduled to ignore trial / branch runs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you choose not to do this to allow the actions page to easily distinguish between the two?

Yeah, I think it's nice to be able to easily distinguish between GISAID and GenBank workflows. (The pathogen status page currently groups by repo/workflow name). Also, the scheduled GitHub Action workflows do not use inputs, so I think the conditionals to trigger scheduled workflows with different default values could get pretty messy.

if we are using the actions page to monitor progress for the foreseeable future, is the best view into this to filter by event:scheduled to ignore trial / branch runs?

Yup, that's should be the easiest way to monitor the automated runs for now. I think it'd also be reasonable to add a new status page that only filters to the scheduled events.

Update the fetch-and-ingest workflows to use the `trial_name` input to
determine whether the workflow should trigger downstream workflows
and the S3 destination prefix for uploaded files.

This changes the S3 destinations for both GISAID and GenBank test builds.

GISAID
  Old: s3://nextstrain-ncov-private/branch/${GITHUB_BRANCH}/
  New: s3://nextstrain-ncov-private/trial/${TRIAL_NAME}/

GenBank:
  Old: s3://nextstrain-staging/files/ncov/open/branch/${GITHUB_BRANCH}
  New: s3://nextstrain-data/files/ncov/open/trial/${TRIAL_NAME}

This follows the prefix patterns used by other more recent GitHub Action
workflows. By using the trial input, we also no longer have to maintain
extra copies of the "*-branch" workflows.

Note that I'm not updating the ingest-* workflows because they will be
made obsolete in the following commit.
Update the fetch-and-ingest workflows to use the `fetch_from_database`
input to determine whether the workflow should fetch from upstream
databases. This means we no longer have to maintain extra copies of the
"ingest-*" workflows.

One downside is this input is _not_ included with the repository_dispatch
event. However, it seems like everyone's using the workflow_dispatch via
the GitHub UI instead of the repository_dispatch anyways:

```
$ gh api /repos/nextstrain/ncov-ingest/actions/runs?event=workflow_dispatch --jq .total_count --paginate
60
$ gh api /repos/nextstrain/ncov-ingest/actions/runs?event=repository_dispatch --jq .total_count --paginate
0
```
Removes the need to use the `bin/write-envdir` script.
Also removes the need to use the roundabout `--exec env` invocations,
which means we no longer need to pass the `--cores`, `--resources`, and
`--printshellcmds` flags to Snakemake. These are automatically passed
through the default `snakemake` exec.

https://github.com/nextstrain/cli/blob/4fa53169059047454fb10b55a584395239b61a1a/nextstrain/cli/command/build.py#L147
https://github.com/nextstrain/cli/blob/4fa53169059047454fb10b55a584395239b61a1a/nextstrain/cli/command/build.py#L190-L236
Allows the GH Action workflows to stay connected to the AWS Batch jobs
when using the latest pathogen-repo-build workflow. This way, the status
of the GH Action workflow accurately reflects the status of the ingest
workflow.

We can then also use the Nextstrain pathogen workflow status page¹ to
monitor the success/failure of the automated workflows.

¹ https://nextstrain.github.io/status/pathogen-workflows.html
There are many instances in that past where we've made temp commits
(e.g. ddbf15d and aab1f5f)
to edit the Docker image used for a test run of the workflow.

This input follows the pattern of more recent GH Action workflows to use
an input to modify the Docker image.
Prevent the trial runs from being mixed with the automated runs by
sending Slack notifications to the `TEST_SLACK_CHANNEL` (#scratch).
If the GH Action event trigger is not `workflow_dispatch`, then set the
default value for `FETCH_FROM_DATABASE` to `true`. Otherwise, use the
provided input value.

Previously the `FETCH_FROM_DATABASE` was always set to `true` because
the ternary statement would return the value after `||` when the input
value was `false`. Note, the expression

```
FETCH_FROM_DATABASE: ${{ inputs.fetch_from_database == '' && true || inputs.fetch_from_database }}
```

also does _not_ work because GH Action expressions performs loose
equality comparisons that coerces the type to a number if types do not
match.¹ This coerced `false` to `0` and the empty string to `0` leading
the ternary operator to always return the default `true` value.

¹ https://docs.github.com/en/actions/learn-github-actions/expressions#operators
@joverlee521 joverlee521 force-pushed the overhaul-gh-action-workflows branch from b926bd1 to 3d92f48 Compare March 4, 2024 18:21
@joverlee521
Copy link
Contributor Author

Ah, missed today's automated runs. I'll merge EOD today in case there are other comments.

@joverlee521 joverlee521 merged commit d06e653 into master Mar 5, 2024
1 check passed
@joverlee521 joverlee521 deleted the overhaul-gh-action-workflows branch March 5, 2024 01:26
@corneliusroemer
Copy link
Member

Thanks for cleaning up the workflows @joverlee521!

I just tried this out and it took a bit of time to understand how to configure the workflow input params to get it to do what I want (run on branch, do a test run, no fetch).

Maybe there's scope for a single "test" workflow that simply does repository dispatch with these params? The new single workflow is great for maintenance but might cause errors when misused due to complexity in selecting the right inputs. A neutered test dispatch might reduce accidental misuse likelihood.

@corneliusroemer
Copy link
Member

Oh wow, I just discovered that the logs are streamed from AWS batch - that's amazing! Great job @joverlee521!

image

@joverlee521
Copy link
Contributor Author

Oh wow, I just discovered that the logs are streamed from AWS batch - that's amazing!

This is all thanks to @tsibley's update to the pathogen-repo-build workflow to let it stay attached to the AWS Batch jobs 🙌

@joverlee521
Copy link
Contributor Author

I just tried this out and it took a bit of time to understand how to configure the workflow input params to get it to do what I want (run on branch, do a test run, no fetch).

Maybe there's scope for a single "test" workflow that simply does repository dispatch with these params? The new single workflow is great for maintenance but might cause errors when misused due to complexity in selecting the right inputs. A neutered test dispatch might reduce accidental misuse likelihood.

Are there additional docs that I can add to clarify the use of the workflow input params?

If we wanted to support repository dispatch, we would have to complicate the workflows a bit to check for github.event.client_payload.<input_name> params in addition to the input.<input_name> checks. I'm not sure that's worth adding right now. I'm not too worried about "misuse" since cancelling the GitHub workflow run will also cancel the attached AWS Batch job now.

@@ -99,6 +113,7 @@ jobs:
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
PAT_GITHUB_DISPATCH: ${{ secrets.GH_TOKEN_NEXTSTRAIN_BOT_WORKFLOW_DISPATCH }}
TRIAL_NAME: ${{ inputs.trial_name }}
FETCH_FROM_DATABASE: ${{ inputs.fetch_from_database != '' && inputs.fetch_from_database || true }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, combined with the type: boolean of inputs.fetch_from_database, is going to result in FETCH_FROM_DATABASE being either the strings true or false, right?

And then the Bash condition (earlier up) is [[ "$FETCH_FROM_DATABASE" ]], which evaluates to true (exits 0) for both [[ "true" ]] and [[ "false" ]]. Yeah?

So does setting inputs.fetch_from_database to false actually work? (or am I missing something?)

Copy link
Member

@tsibley tsibley Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see it gets changed/fixed in a later commit, but also #440 (comment).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love to see this. :-)

- name: notify_pipeline_failed
if: ${{ failure() }}
run: ./vendored/notify-on-job-fail Ingest nextstrain/ncov-ingest
$CONFIG_OVERRIDES
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this opens up a command-line injection via inputs.trial_name (though in practice it's probably fine given our usage).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH, I've repeated this pattern across other GH Action workflows 🙈

What would be the safer thing to do here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best thing is building up an array of strings, which can then be expanded with quoting to prevent subsequent word splitting. But with the config overrides being computed in a separate job, you can't pass an array, only a single long string. One way forward with that is to use shell-quoting semantics inside that single long string (i.e. nested quoting), but that gets tedious and is easy to mess up. Stepping back, why did the config override get moved into a separate job? Does it need to be?

Or, as an alternative to passing around a command-line snippet in a string/env var, we could produce another YAML file and pass that instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back, why did the config override get moved into a separate job? Does it need to be?

It had to get moved to a separate job because the config override is not currently part of the pathogen-repo-build workflow. Since configs are pretty different across pathogen workflows, I didn't want to shoehorn it into the reusable workflow.

Or, as an alternative to passing around a command-line snippet in a string/env var, we could produce another YAML file and pass that instead?

Ah, so build a custom config YAML that is uploaded/download as an artifact of the GH Action workflow. I think this is doable with updating pathogen-repo-build workflow to take a artifact name as an input and download the artifact as part of the reusable workflow. Yeah, this seems like the way to go if we are unable fully standardize configs across pathogens.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It had to get moved to a separate job because the config override is not currently part of the pathogen-repo-build workflow. Since configs are pretty different across pathogen workflows, I didn't want to shoehorn it into the reusable workflow.

Can't it go in the run input of the pathogen-repo-build workflow call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, yes 🤦‍♀️

run: |
config="--config"

if [[ "$FETCH_FROM_DATABASE" ]]; then
if "$FETCH_FROM_DATABASE"; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relies on running the programs true and false... 🫣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging! Updating in #444

joverlee521 added a commit that referenced this pull request Apr 19, 2024
joverlee521 added a commit that referenced this pull request Apr 19, 2024
Prompted by post-merge review
#440 (comment)

Instead of running `true` and `false` programs, explicitly test
that `FETCH_FROM_DATABASE` is true.
joverlee521 added a commit that referenced this pull request Apr 19, 2024
Prompted by post-merge review
#440 (comment)

Instead of running `true` and `false` programs, explicitly test
that `FETCH_FROM_DATABASE` is true.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants