-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SARS-CoV-2 ORF1b mutation reporting differences between SnpEff and Nextclade #263
Comments
Thanks for reporting @peterk87! Not sure what the best solution is here because by default the pipeline will and has always used the same reference fasta and annotation for reproducibility. It is up to the users to overwrite this configuration if required. Either we go low maintenance and add in a warning to ask users to point to their annotation or we deal with it internally somehow. Thoughts? Ping @saramonzon. |
yes..we've encountered this issue too, it's not only the orf1b gene, the case is that the NC_045512.2 fasta from refseq and the MN908947.3 fasta from genebank are exactly the same, but the gff from refseq is more updated and therefore correct. It does not have a good solution in terms of reproducibility, it is true that if we want the best results for the defaults parameters we should be using the gff for NC_045512.2, but that means manually changing either the primer bed files or the gff... |
Most other resources like the ARTIC primer sets etc all use We could introduce a breaking change in the next release where we use a more updated annotation for Are you using this annotation @saramonzon ? |
I've just checked they've just updated de gff for both genebank and refseq!! |
Nice! Thanks @saramonzon @peterk87 are you able to download the latest annotation for |
Hi @drpatelh, Thanks for looking into this so quickly! I was hoping there was a SnpEff option to get the correct AA coordinates for ORF1b mutations. Unfortunately, the updated GFF from NCBI did not produce the results we wanted. For now we've resorted to using a custom GFF based on In the modified GFF, the ORF1ab gene entry is replaced with 2 gene entries for ORF1a and ORF1b: $ diff GCA_009858895.3_ASM985889v3_genomic.220131.gff MN908947.3-orf1a-orf1b-gene-split.gff 10,12c10,13
< MN908947.3 Genbank gene 266 21555 . + . ID=gene-orf1ab;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding
< MN908947.3 Genbank CDS 266 13468 . + 0 ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
< MN908947.3 Genbank CDS 13468 21555 . + 0 ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
---
> MN908947.3 Genbank gene 266 13468 . + . ID=gene-ORF1a;Name=ORF1a;gbkey=Gene;gene=ORF1a;gene_biotype=protein_coding
> MN908947.3 Genbank CDS 266 13468 . + 0 ID=cds-QHD43415.1;Parent=gene-ORF1a;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1a;product=ORF1a polyprotein;protein_id=QHD43415.1
> MN908947.3 Genbank gene 13468 21555 . + . ID=gene-ORF1b;Name=ORF1b;gbkey=Gene;gene=ORF1b;gene_biotype=protein_coding
> MN908947.3 Genbank CDS 13468 21555 . + 0 ID=cds-QHD43415.1;Parent=gene-ORF1b;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1b;product=ORF1b polyprotein;protein_id=QHD43415.1
Like you said, it's not great from a reproducibility standpoint to use a modified unofficial GFF, but at least we get the expected results for the ORF1b region. |
Thanks @peterk87 ! That is very helpful. Not sure what to do about this since most of the standard annotations don't have an explicit entry for |
It's also interesting to note that GISAID/CoVsurver reports AA substitutions for the component NSPs of ORF1ab/ORF1a, e.g.
This info is also present in the GISAID metadata table ( |
We've encountered issues with the reporting of ORF1ab mutations falling in the ORF1b region where the -1 frameshift is not taken into account by SnpEff.
Nextclade uses a different GFF than viralrecon when using MN908947.3 genome profile:
https://github.com/nextstrain/nextclade/blob/96e42468cae447afdbc9d0eda73f160e8049e6e4/data/sars-cov-2/genemap.gff#L8
https://github.com/nf-core/test-datasets/blob/viralrecon/genome/MN908947.3/GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz
So for example a C14322T nucleotide substitution will be reported by SnpEff as a T4686I AA substitution, but Nextclade will only report the nucleotide substitution and no AA substitution.
The AA at position 4686 should be Y instead of the T reported by SnpEff:
screenshot from https://www.ncbi.nlm.nih.gov/protein/QHD43415.1?report=graph
Potential solution
Use a custom GFF with separate ORF1a and ORF1b "gene" entries like Nextclade does.
The text was updated successfully, but these errors were encountered: