feat: use nightly nextclade tree #1170

Open · wants to merge 2 commits into master

Conversation

@ivan-aksamentov (Member) commented Jan 28, 2025

Description of proposed changes

This allows bypassing laggy Nextclade dataset updates and always using the latest data, which may or may not be what we want.

This aims to be a workaround until the dataset updates are sorted out.

Potential problems:

  • nightly trees are not systematically reviewed and can contain bugs
  • do any other parts of the dataset need to be updated along with the tree (such as pathogen.json)?
  • do any other repos need to be updated to use the nightly tree (e.g. ncov-ingest)? i.e. is there an assumption that the exact same dataset is used in two or more places?

Related issue(s)

None I could find

Testing

I need help with this

Release checklist

If this pull request introduces backward incompatible changes, complete the following steps for a new release of the workflow:

  • Determine the version number for the new release by incrementing the most recent release (e.g., "v2" from "v1").
  • Update docs/src/reference/change_log.md in this pull request to document these changes and the new version number.
  • After merging, create a new GitHub release with the new version number as the tag and release title.

If this pull request introduces new features, complete the following steps:

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

- [x] switch the Nextclade dataset to directory format (which allows replacing dataset files)
- [x] replace the reference tree in the dataset with the nightly tree from https://nextstrain.org/staging/nextclade/sars-cov-2
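
In shell terms, the two steps above amount to roughly the following sketch (the nightly-tree download URL is a placeholder, since this PR does not spell out where the staging tree file is served from):

```
# Download the official SARS-CoV-2 dataset as a plain directory
# (directory format makes the individual files replaceable).
nextclade dataset get \
  --name sars-cov-2 \
  --output-dir data/sars-cov-2-nextclade-defaults

# Overwrite the dataset's reference tree with the nightly tree.
# NIGHTLY_TREE_URL is a placeholder for wherever the tree behind
# https://nextstrain.org/staging/nextclade/sars-cov-2 is served.
curl -fsSL "$NIGHTLY_TREE_URL" -o data/sars-cov-2-nextclade-defaults/tree.json
```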

@ivan-aksamentov (Member, Author) commented Jan 28, 2025

I am very much behind on the latest developments in the ncov repo, so I made this PR just to highlight the idea. Please feel free to adjust the code to the current best practices, or let me know what needs to be adjusted.

This is a potentially significant change in terms of science, so I expect this to be reviewed by Richard or Trevor before it happens.

The definitive fix is to set up timely, semi-automated, reviewable updates of Nextclade SC2 datasets.

Resolves error:

```
ImproperOutputException in rule prepare_nextclade in file /home/runner/work/ncov/ncov/workflow/snakemake_rules/main_workflow.smk, line 452:
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule prepare_nextclade:
    output: data/sars-cov-2-nextclade-defaults
    affected files:
        data/sars-cov-2-nextclade-defaults
```

https://github.com/nextstrain/ncov/actions/runs/13007476969/job/36277462988?pr=1170#step:8:136
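
The shape of that fix is just Snakemake's directory() flag on the rule output; a minimal sketch, not the actual rule from main_workflow.smk:

```
rule prepare_nextclade:
    output:
        # A directory output must be flagged explicitly, otherwise Snakemake
        # raises the ImproperOutputException shown in the log above.
        nextclade_dataset=directory("data/sars-cov-2-nextclade-defaults"),
    shell:
        """
        nextclade dataset get --name sars-cov-2 --output-dir {output.nextclade_dataset}
        """
```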
@joverlee521 (Contributor) commented:

> do any other repos need to be updated to use the nightly tree? (e.g. ncov-ingest)

We would need to update ncov-ingest so that the latest lineages are included in the metadata.tsv on S3, which is used to create the inputs for model results on https://nextstrain.org/sars-cov-2/forecasts.

There's some data provenance tracking (previously discussed in nextstrain/ncov-ingest#458) that would get hairy with just overriding the tree.json in the dataset. The ncov-ingest caching system depends on the Nextclade + dataset versions, so if we use the nightly tree, would we drop the cache and run all sequences through Nextclade every day?

@ivan-aksamentov (Member, Author) commented Jan 29, 2025

Yes, it looks like there are a number of problems with this approach which I did not realize. So let's ignore this PR for now.

If now or later it is decided this idea is worth considering again, here are a few more thoughts:

  • perhaps the tree override can use a fixed version of the tree (assuming versioning is enabled for https://nextstrain.org/staging/nextclade/sars-cov-2, or older trees can be obtained otherwise, e.g. from S3)
  • ncov-ingest could then be adjusted to use the same tree version as ncov
  • updates could be made less frequent than daily: for example, take the latest tree version on each given Monday and use it for the entire week, or even less frequently, e.g. take the tree built on the 15th of each month and use it for a month
  • the tree JSON contains an updated field, which allows tracking the tree version once it is obtained (a one-liner example follows this list)
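
For example, assuming the usual Auspice JSON layout with a top-level meta object (path here is illustrative), the version date can be read with jq:

```
jq -r '.meta.updated' data/sars-cov-2-nextclade-defaults/tree.json
```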

The basic idea here was to decouple ncov from its dependency on Nextclade dataset updates, in this particular case to make ncov updates faster than Nextclade updates. Custom datasets and overriding dataset files (by replacing the files or adding individual --input-* args, e.g. --input-tree) are an advanced Nextclade feature, but it is a legal, official feature and advanced workflows like ncov can use it if needed.
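
For illustration, the --input-tree override route looks roughly like this (paths are placeholders):

```
# Run Nextclade against the official dataset, but swap in a pinned or nightly
# reference tree via the individual file override.
nextclade run \
  --input-dataset data/sars-cov-2-nextclade-defaults \
  --input-tree data/nightly_tree.json \
  --output-all results/nextclade \
  data/sequences.fasta
```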

But yes, proper Nextclade dataset updates would be much better. I added this GitHub Action to pull nightly trees and make automated PRs into nextclade_data for review: https://github.com/nextstrain/nextclade_data/blob/7d659a93803c5a8dd755a65e3a0d4b1f44350602/.github/workflows/update-sars-cov-2-datasets.yml (which is equivalent to overriding the tree each Monday, but comes in the form of a new dataset version once the PR is reviewed, merged and released). However, it needs to be done properly with help from @corneliusroemer, because the tree is not the only thing that needs to be updated.
