Remove 2339 names that map to seqs on multiple lineage branches #2466
…ee that are on different lineage branches.
@corneliusroemer I will merge this tomorrow unless you would like some time to check it first.
Thanks for this! I ran a quick check to see whether this would drop any lineages into precarious territory in terms of the number of designations left, but that's not the case, so no objection from me. Counts before and after pruning for affected lineages
I've had a look at the names of the duplicates. It looks like these are not exact name duplicates but rather the GenBank and GISAID strain names, respectively. We could in theory have a rule giving GISAID precedence over GenBank to resolve ambiguities, but the differences are so small that it might not be worth the trouble. Commands I used: Run this on master and PR branch:
Thanks @corneliusroemer for taking a look! Besides the duplicates, there are some interesting cases from GenBase (CNCB) where the same name is reused for many sequences like this:
I pinged CNCB about that, asking if they could ask the submitters to use distinct names, but no response so far. Meanwhile, GISAID curators seem to be adding disambiguating suffixes for those (MSCDC-10, MSCDC-10-2, MSCDC-10-3, and so on). So in that case it would be better to go with GISAID names than public.
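The preference for GISAID names over public ones discussed above could be expressed as a simple source-priority lookup. This is only a sketch: the function name `preferred_name`, the source keys `"gisaid"` and `"genbank"`, and the dictionary layout are hypothetical and not taken from any script in this repository.

```python
# Hypothetical priority order: earlier sources win when a sequence has several names.
SOURCE_PRIORITY = ("gisaid", "genbank")


def preferred_name(names_by_source, priority=SOURCE_PRIORITY):
    """Return the name from the highest-priority source that provides one.

    names_by_source maps a source label (assumed keys, e.g. "gisaid")
    to the strain name that source uses for the same sequence.
    Returns None when no listed source has a non-empty name.
    """
    for source in priority:
        name = names_by_source.get(source)
        if name:
            return name
    return None


# Example: the public name is reused, the GISAID name carries a suffix.
example = {"genbank": "MSCDC-10", "gisaid": "MSCDC-10-2"}
chosen = preferred_name(example)
```

With a rule like this, a reused public name would only be kept when no GISAID name is available for that sequence.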
48k names in lineages.csv map to multiple sequences in the UCSC UShER tree, sometimes because of deduplication failures and sometimes because the same name is used in multiple submissions to the same repository, so that one name ends up attached to multiple accessions. Of those 48k, 2339 names map to sequences that are on different lineage branches in the UShER tree, which makes it unclear which of the accessions/sequences the name is meant to refer to. Remove those 2339 names from lineages.csv so we can rely on more uniquely mapped names.
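The filtering described above can be sketched roughly as follows. This is not the script used for this PR: it assumes that each tree sequence is available as a (name, lineage) pair, uses the lineage label as a stand-in for "lineage branch", and assumes lineages.csv rows have a `taxon` column. The function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def ambiguous_names(name_lineage_pairs):
    """Return the names whose sequences carry more than one distinct lineage.

    name_lineage_pairs: iterable of (name, lineage), one pair per sequence
    in the tree (assumed input format). A name that maps to several
    sequences on the same lineage is not reported.
    """
    lineages_by_name = defaultdict(set)
    for name, lineage in name_lineage_pairs:
        lineages_by_name[name].add(lineage)
    return {name for name, lins in lineages_by_name.items() if len(lins) > 1}


def prune_designations(rows, names_to_drop):
    """Keep only designation rows whose 'taxon' is not in names_to_drop."""
    return [row for row in rows if row["taxon"] not in names_to_drop]


# Made-up example: "name2" appears on two different lineages in the tree.
tree_pairs = [
    ("name1", "B.1"), ("name1", "B.1"),   # duplicated, but same lineage
    ("name2", "BA.2"), ("name2", "XBB.1"),
    ("name3", "BA.5"),
]
designations = [
    {"taxon": "name1", "lineage": "B.1"},
    {"taxon": "name2", "lineage": "BA.2"},
    {"taxon": "name3", "lineage": "BA.5"},
]

to_drop = ambiguous_names(tree_pairs)
pruned = prune_designations(designations, to_drop)
```

In this sketch a name that maps to several sequences on the same lineage is kept, which matches the intent of removing only the subset of multiply mapped names that land on different branches.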