-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use nightly nextclade tree #1170
base: master
Are you sure you want to change the base?
Conversation
- [x] switch Nextclade dataset in directory format (which allows to replace dataset files) - [x] replace reference tree in the dataset with the nightly tree from https://nextstrain.org/staging/nextclade/sars-cov-2 This allows to bypass laggy Nextclade dataset updates and use the latest data always. Which may or may not be what we want. This aims to be a workaround until the dataset updates are sorted out. Potential problems: - nightly trees are not systematically reviewed and can contain bugs - does any other parts of the dataset need to be updated along with the tree? (such as pathogen.json) - does any other repos need to be updated to use nightly tree? (e.g. ncov-ingest) i.e. is there an assumption that the exact same dataset is used in 2 or more places?
I am very much behind on the latest developments in ncov repo, so I made this PR to just highlight the idea. Please feel free to adjust the code to the current best practices or let me know what needs to be adjusted. This is a potentially significant change in terms of science, so I expect this to be reviewed by Richard or Trevor before it happens. The definitive fix is to setup timely semi-automated reviewable updates of Nextclade SC2 datasets. |
Resolves error: ``` ImproperOutputException in rule prepare_nextclade in file /home/runner/work/ncov/ncov/workflow/snakemake_rules/main_workflow.smk, line 452: Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule prepare_nextclade: output: data/sars-cov-2-nextclade-defaults affected files: data/sars-cov-2-nextclade-defaults ``` https://github.com/nextstrain/ncov/actions/runs/13007476969/job/36277462988?pr=1170#step:8:136
We would need to update ncov-ingest so that the latest lineages are included in the metadata.tsv on S3, which is used to create the inputs for model results on https://nextstrain.org/sars-cov-2/forecasts. There's some data provenance tracking (previously discussed in nextstrain/ncov-ingest#458) that would get hairy with just overriding the tree.json in the dataset. The ncov-ingest caching system depends on the Nextclade + dataset versions, so if we use the nightly tree, would we drop the cache and run all sequences to Nextclade every day? |
Yes, looks like there is a number of problems with this approach which I did not realize. So let's ignore this PR for now. If now or later it is decided this idea is worth considering again, then here is a few more thoughts:
The basic idea here was that we decouple ncov from dependency on Nextclade dataset updates. In this particular case, to make ncov updates faster than Nextclade updates. Custom datasets and overriding dataset files (by replacing the files or adding individual But yes, the proper Nextclade dataset updates would be much better. I added this GitHub Action to pull nigtly trees and make automated PRs into nextclade_data for review: https://github.com/nextstrain/nextclade_data/blob/7d659a93803c5a8dd755a65e3a0d4b1f44350602/.github/workflows/update-sars-cov-2-datasets.yml (which is equivalent to overriding the tree each Monday, but comes in form of a new dataset version once the PR is reviewed, merged and released), but it needs to be done properly with help from @corneliusroemer, because not only the tree needs to be updated. |
Description of proposed changes
This allows to bypass laggy Nextclade dataset updates and use the latest data always. Which may or may not be what we want.
This aims to be a workaround until the dataset updates are sorted out.
Potential problems:
Related issue(s)
None I could find
Testing
I need help with this
Release checklist
If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:
docs/src/reference/change_log.md
in this pull request to document these changes and the new version number.If this pull request introduces new features, complete the following steps:
docs/src/reference/change_log.md
in this pull request to document these changes by the date they were added.