feat: use nightly nextclade tree #1170

Open · wants to merge 2 commits into master

Conversation

@ivan-aksamentov (Member) commented Jan 28, 2025

Description of proposed changes

This allows bypassing laggy Nextclade dataset updates and always using the latest data, which may or may not be what we want.

This aims to be a workaround until the dataset updates are sorted out.

Potential problems:

  • nightly trees are not systematically reviewed and can contain bugs
  • do any other parts of the dataset need to be updated along with the tree (such as pathogen.json)?
  • do any other repos need to be updated to use the nightly tree (e.g. ncov-ingest)? i.e. is there an assumption that the exact same dataset is used in two or more places?

Related issue(s)

None I could find

Testing

I need help with this

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

- [x] switch the Nextclade dataset to directory format (which allows replacing dataset files)
- [x] replace the reference tree in the dataset with the nightly tree from https://nextstrain.org/staging/nextclade/sars-cov-2
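
In shell terms, the two steps above amount to roughly the following sketch (the nightly-tree download URL is a placeholder, since this PR does not spell out where the staging tree file is served from):

```
# Download the official SARS-CoV-2 dataset as a plain directory
# (directory format makes the individual files replaceable).
nextclade dataset get \
  --name sars-cov-2 \
  --output-dir data/sars-cov-2-nextclade-defaults

# Overwrite the dataset's reference tree with the nightly tree.
# NIGHTLY_TREE_URL is a placeholder for wherever the tree behind
# https://nextstrain.org/staging/nextclade/sars-cov-2 is served.
curl -fsSL "$NIGHTLY_TREE_URL" -o data/sars-cov-2-nextclade-defaults/tree.json
```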

@ivan-aksamentov (Member, Author) commented Jan 28, 2025

I am very much behind on the latest developments in the ncov repo, so I made this PR just to highlight the idea. Please feel free to adjust the code to the current best practices, or let me know what needs to be adjusted.

This is a potentially significant change in terms of science, so I expect this to be reviewed by Richard or Trevor before it happens.

The definitive fix is to set up timely, semi-automated, reviewable updates of Nextclade SC2 datasets.

Resolves error:

```
ImproperOutputException in rule prepare_nextclade in file /home/runner/work/ncov/ncov/workflow/snakemake_rules/main_workflow.smk, line 452:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule prepare_nextclade:
    output: data/sars-cov-2-nextclade-defaults
    affected files:
        data/sars-cov-2-nextclade-defaults
```

https://github.com/nextstrain/ncov/actions/runs/13007476969/job/36277462988?pr=1170#step:8:136
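
The shape of that fix is just Snakemake's directory() flag on the rule output; a minimal sketch, not the actual rule from main_workflow.smk:

```
rule prepare_nextclade:
    output:
        # A directory output must be flagged explicitly, otherwise Snakemake
        # raises the ImproperOutputException shown in the log above.
        nextclade_dataset=directory("data/sars-cov-2-nextclade-defaults"),
    shell:
        """
        nextclade dataset get --name sars-cov-2 --output-dir {output.nextclade_dataset}
        """
```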
@joverlee521 (Contributor) commented:

> do any other repos need to be updated to use the nightly tree? (e.g. ncov-ingest)

We would need to update ncov-ingest so that the latest lineages are included in the metadata.tsv on S3, which is used to create the inputs for model results on https://nextstrain.org/sars-cov-2/forecasts.

There's some data provenance tracking (previously discussed in nextstrain/ncov-ingest#458) that would get hairy with just overriding the tree.json in the dataset. The ncov-ingest caching system depends on the Nextclade + dataset versions, so if we use the nightly tree, would we drop the cache and run all sequences through Nextclade every day?

@ivan-aksamentov (Member, Author) commented Jan 29, 2025

Yes, it looks like there are a number of problems with this approach which I did not realize. So let's ignore this PR for now.

If now or later it is decided this idea is worth considering again, here are a few more thoughts:

  • perhaps the tree override can use a fixed version of the tree (assuming versioning is enabled for https://nextstrain.org/staging/nextclade/sars-cov-2, or older trees can be obtained otherwise, e.g. from S3)
  • ncov-ingest could then be adjusted to use the same tree version as ncov
  • updates could be made less frequent than daily: for example, take the latest tree version on each given Monday and use it for the entire week, or even less frequently, e.g. take the tree built on the 15th of each month and use it for a month
  • the tree JSON contains an updated field, which allows tracking the tree version once it is obtained (a one-liner example follows this list)
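
For example, assuming the usual Auspice JSON layout with a top-level meta object (path here is illustrative), the version date can be read with jq:

```
jq -r '.meta.updated' data/sars-cov-2-nextclade-defaults/tree.json
```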

The basic idea here was to decouple ncov from its dependency on Nextclade dataset updates, in this particular case to make ncov updates faster than Nextclade updates. Custom datasets and overriding dataset files (by replacing the files or adding individual --input-* args, e.g. --input-tree) are an advanced Nextclade feature, but it is a legal, official feature and advanced workflows like ncov can use it if needed.
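
For illustration, the --input-tree override route looks roughly like this (paths are placeholders):

```
# Run Nextclade against the official dataset, but swap in a pinned or nightly
# reference tree via the individual file override.
nextclade run \
  --input-dataset data/sars-cov-2-nextclade-defaults \
  --input-tree data/nightly_tree.json \
  --output-all results/nextclade \
  data/sequences.fasta
```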

But yes, proper Nextclade dataset updates would be much better. I added this GitHub Action to pull nightly trees and make automated PRs into nextclade_data for review: https://github.com/nextstrain/nextclade_data/blob/7d659a93803c5a8dd755a65e3a0d4b1f44350602/.github/workflows/update-sars-cov-2-datasets.yml (which is equivalent to overriding the tree each Monday, but comes in the form of a new dataset version once the PR is reviewed, merged and released). However, it needs to be done properly with help from @corneliusroemer, because the tree is not the only thing that needs to be updated.
