diff --git a/joss.07604/10.21105.joss.07604.crossref.xml b/joss.07604/10.21105.joss.07604.crossref.xml new file mode 100644 index 0000000000..635cf2231a --- /dev/null +++ b/joss.07604/10.21105.joss.07604.crossref.xml @@ -0,0 +1,237 @@ + + + + 20250221141529-e8ecd121437a2980a32668cfeb644099c6023d23 + 20250221141529 + + JOSS Admin + admin@theoj.org + + The Open Journal + + + + + Journal of Open Source Software + JOSS + 2475-9066 + + 10.21105/joss + https://joss.theoj.org + + + + + 02 + 2025 + + + 10 + + 106 + + + + Taxonomy Resolver: A Python package for building and filtering taxonomy trees + + + + Fábio + Madeira + + European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom + + https://orcid.org/0000-0001-8728-9449 + + + Nandana + Madhusoodanan + + European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom + + https://orcid.org/0000-0001-5004-152X + + + Joonheung + Lee + + European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom + + https://orcid.org/0000-0002-5760-2761 + + + Alberto + Eusebi + + European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom + + https://orcid.org/0000-0001-5179-7724 + + + Ania + Niewielska + + European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom + + https://orcid.org/0000-0003-0989-3389 + + + Sarah + Butcher + + European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom + + https://orcid.org/0000-0002-4494-5124 + + + + 02 + 21 + 2025 + + 
+ 7604 + + + 10.21105/joss.07604 + + + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + + + + Software archive + 10.5281/zenodo.14847194 + + + GitHub review issue + https://github.com/openjournals/joss-reviews/issues/7604 + + + + 10.21105/joss.07604 + https://joss.theoj.org/papers/10.21105/joss.07604 + + + https://joss.theoj.org/papers/10.21105/joss.07604.pdf + + + + + + NCBI Taxonomy: A comprehensive update on curation, resources and tools + Schoch + Database: The Journal of Biological Databases and Curation + 2020 + 10.1093/database/baaa062 + 1758-0463 + 2020 + Schoch, C. L., Ciufo, S., Domrachev, M., Hotton, C. L., Kannan, S., Khovanskaya, R., Leipe, D., Mcveigh, R., O’Neill, K., Robbertse, B., Sharma, S., Soussov, V., Sullivan, J. P., Sun, L., Turner, S., & Karsch-Mizrachi, I. (2020). NCBI Taxonomy: A comprehensive update on curation, resources and tools. Database: The Journal of Biological Databases and Curation, 2020, baaa062. https://doi.org/10.1093/database/baaa062 + + + BLAST+: Architecture and applications + Camacho + BMC bioinformatics + 10 + 10.1186/1471-2105-10-421 + 1471-2105 + 2009 + Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: Architecture and applications. BMC Bioinformatics, 10, 421. https://doi.org/10.1186/1471-2105-10-421 + + + ggtree: An R package for visualization and annotation of phylogenetic trees with their covariates and other associated data + Yu + Methods in Ecology and Evolution + 1 + 8 + 10.1111/2041-210X.12628 + 2017 + Yu, G., Smith, D. K., Zhu, H., Guan, Y., & Lam, T. T.-Y. (2017). ggtree: An R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution, 8(1), 28–36. 
https://doi.org/10.1111/2041-210X.12628 + + + Entrez Direct: E-utilities on the Unix Command Line + Kans + Entrez Programming Utilities Help [Internet] + 2024 + Kans, J. (2024). Entrez Direct: E-utilities on the Unix Command Line. In Entrez Programming Utilities Help [Internet]. National Center for Biotechnology Information (US). https://www.ncbi.nlm.nih.gov/books/NBK179288/ + + + Pandas-dev/pandas: Pandas + The pandas development team + 10.5281/zenodo.13819579 + 2024 + The pandas development team. (2024). Pandas-dev/pandas: Pandas. Zenodo. https://doi.org/10.5281/zenodo.13819579 + + + The Search for Common Origin: Homology Revisited + Ochoterena + Systematic Biology + 5 + 68 + 10.1093/sysbio/syz013 + 1063-5157 + 2019 + Ochoterena, H., Vrijdaghs, A., Smets, E., & Claßen-Bockhoff, R. (2019). The Search for Common Origin: Homology Revisited. Systematic Biology, 68(5), 767–780. https://doi.org/10.1093/sysbio/syz013 + + + A globally integrated structure of taxonomy to support biodiversity science and conservation + Sandall + Trends in Ecology & Evolution + 12 + 38 + 10.1016/j.tree.2023.08.004 + 0169-5347 + 2023 + Sandall, E. L., Maureaud, A. A., Guralnick, R., McGeoch, M. A., Sica, Y. V., Rogan, M. S., Booher, D. B., Edwards, R., Franz, N., Ingenloff, K., Lucas, M., Marsh, C. J., McGowan, J., Pinkert, S., Ranipeta, A., Uetz, P., Wieczorek, J., & Jetz, W. (2023). A globally integrated structure of taxonomy to support biodiversity science and conservation. Trends in Ecology & Evolution, 38(12), 1143–1153. https://doi.org/10.1016/j.tree.2023.08.004 + + + Chapter 4 - Nested Set Model of Hierarchies + Celko + Joe Celko’s Trees and Hierarchies in SQL for Smarties + 10.1016/B978-155860920-4/50005-2 + 978-1-55860-920-4 + 2004 + Celko, J. (2004). Chapter 4 - Nested Set Model of Hierarchies. In J. Celko (Ed.), Joe Celko’s Trees and Hierarchies in SQL for Smarties (pp. 45–99). Morgan Kaufmann. 
https://doi.org/10.1016/B978-155860920-4/50005-2 + + + Anytree: Python tree data library + Cofe Code and contributors + GitHub repository + 2024 + Cofe Code and contributors. (2024). Anytree: Python tree data library. In GitHub repository. GitHub. https://github.com/c0fec0de/anytree + + + BigTree: Tree implementation and methods for python, integrated with list, dictionary, pandas and polars DataFrame. + Kay Jan W. and contributors + GitHub repository + 2024 + Kay Jan W. and contributors. (2024). BigTree: Tree implementation and methods for python, integrated with list, dictionary, pandas and polars DataFrame. In GitHub repository. GitHub. https://github.com/kayjan/bigtree + + + The EMBL-EBI Job Dispatcher sequence analysis tools framework in 2024 + Madeira + Nucleic Acids Research + W1 + 52 + 10.1093/nar/gkae241 + 0305-1048 + 2024 + Madeira, F., Madhusoodanan, N., Lee, J., Eusebi, A., Niewielska, A., Tivey, A. R. N., Lopez, R., & Butcher, S. (2024). The EMBL-EBI Job Dispatcher sequence analysis tools framework in 2024. Nucleic Acids Research, 52(W1), W521–W525. https://doi.org/10.1093/nar/gkae241 + + + + + + diff --git a/joss.07604/10.21105.joss.07604.pdf b/joss.07604/10.21105.joss.07604.pdf new file mode 100644 index 0000000000..232cf36c80 Binary files /dev/null and b/joss.07604/10.21105.joss.07604.pdf differ diff --git a/joss.07604/paper.jats/10.21105.joss.07604.jats b/joss.07604/paper.jats/10.21105.joss.07604.jats new file mode 100644 index 0000000000..e4a7987f8a --- /dev/null +++ b/joss.07604/paper.jats/10.21105.joss.07604.jats @@ -0,0 +1,517 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +7604 +10.21105/joss.07604 + +Taxonomy Resolver: A Python package for building and +filtering taxonomy trees + + + +https://orcid.org/0000-0001-8728-9449 + +Madeira +Fábio + + +* + + +https://orcid.org/0000-0001-5004-152X + +Madhusoodanan +Nandana + + + + +https://orcid.org/0000-0002-5760-2761 + +Lee +Joonheung + + + + +https://orcid.org/0000-0001-5179-7724 + +Eusebi +Alberto + + + + +https://orcid.org/0000-0003-0989-3389 + +Niewielska +Ania + + + + +https://orcid.org/0000-0002-4494-5124 + +Butcher +Sarah + + + + + +European Molecular Biology Laboratory, European +Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, +Hinxton, Cambridge CB10 1SD, United Kingdom + + + + +* E-mail: + + +26 +9 +2024 + +10 +106 +7604 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2025 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +Python +Taxonomy +Tree +Hierarchy +NCBI Taxonomy +NCBI BLAST+ +Nested Set Model +Modified Preorder Tree Traversal + + + + + + Summary +

Taxonomy classification provides an important source of information for studying biological systems. It is a key component for many areas of biological sciences research, particularly genetics, evolutionary biology, biodiversity and conservation (Sandall et al., 2023). Common ancestry, homology and conservation of sequence and structure are all central ideas in biology that are directly related to the evolutionary history of any group of organisms (Ochoterena et al., 2019). The National Center for Biotechnology Information (NCBI) Taxonomy (Schoch et al., 2020) provides a curated classification and nomenclature for all the organisms in the public sequence databases, across the taxonomic ranks (i.e. Domain, Kingdom, Phylum, Class, Order, Family, Genus and Species).

+

Here we describe Taxonomy Resolver, a Python module and command-line interface (CLI) application for building and filtering taxonomy trees based on the NCBI Taxonomy. Taxonomy Resolver streamlines the process of manipulating trees, enabling fast tree traversal, searching and filtering.

+
+ + Statement of need +

The NCBI Taxonomy Database (Schoch et al., 2020) provides a hierarchically arranged list of organisms across all domains of life found in the sequence databases. Tree filtering, i.e. generation of tree subsets, referred to as subtrees, has various applications for sequence analysis, particularly for reducing the search space of sequence similarity searching algorithms. A sequence dataset composed of sequences from diverse taxa can be searched more quickly if only the subset of sequences belonging to the taxa of interest is selected.

+

The NCBI BLAST+ suite is the most widely used toolset in bioinformatics for performing sequence similarity search (Camacho et al., 2009). The suite provides a Bash script (get_species_taxids.sh) to convert NCBI Taxonomy identifiers (TaxIDs) or text into TaxIDs suitable for filtering sequence searches. While this is a useful utility, it only works with sequences submitted to GenBank or other NCBI-hosted databases, and more importantly, it relies on making API calls via Entrez Direct (EDirect) (Kans, 2024). EDirect requires an internet connection and does not scale well when working with large sequence datasets. Other general-purpose tree libraries exist for Python (e.g. anytree (Cofe Code and contributors, 2024) and bigtree (Kay Jan W. and contributors, 2024)) and R (e.g. ggtree (Yu et al., 2017)), but they do not support the core features provided by Taxonomy Resolver or focus mainly on tree visualisation. The development of Taxonomy Resolver started in 2020 and aims to provide user-friendly interfaces for working directly with the NCBI Taxonomy hierarchical dataset.

+
+ + Features +

Taxonomy Resolver has been developed with simplicity in mind and can be used either as a standard Python module or as a CLI application. The main tasks performed by Taxonomy Resolver are:

+ + +

downloading the NCBI Taxonomy classification hierarchy “dump” from the NCBI FTP server

+
+ +

building complete taxonomy tree data structures or partial trees, i.e. subtrees

+
+ +

searching for particular TaxIDs at any level of the taxonomy hierarchy, using fast tree traversal

+
+ +

validating TaxIDs against the NCBI Taxonomy or any given subtree

+
+ +

generating the lists of taxa that compose any subtree, at any level of the taxonomy hierarchy

+
+ +

filtering a tree based on the inclusion and/or exclusion of certain TaxIDs

+
+ +

writing and loading tree data structures using Python’s object serialisation

+
+ +

generating partial and complete trees in NEWICK format

+
+
+
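As a rough illustration of the NEWICK output mentioned in the list above, the sketch below serialises a toy child-to-parent map into a NEWICK string. It is not the Taxonomy Resolver implementation: the function name `to_newick`, the `parents` map and the node IDs are hypothetical, and the recursion is only suitable for small trees.

```python
def to_newick(parents: dict, root: int) -> str:
    """Serialise the subtree under `root` as a NEWICK string (toy sketch)."""
    # Invert the child -> parent map into parent -> [children].
    children = {}
    for child, parent in parents.items():
        if child != root:
            children.setdefault(parent, []).append(child)

    def render(node: int) -> str:
        kids = sorted(children.get(node, []))
        if not kids:
            return str(node)  # a leaf is written as its label only
        # Internal node: parenthesised children followed by the node label.
        return "(" + ",".join(render(k) for k in kids) + ")" + str(node)

    return render(root) + ";"


# Hypothetical IDs, not real TaxIDs; the root points to itself.
parents = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
newick_full = to_newick(parents, 1)  # complete tree
newick_sub = to_newick(parents, 2)   # partial tree rooted at node 2
print(newick_full)
print(newick_sub)
```

A recursive renderer keeps the example short; for a tree the size of the NCBI Taxonomy an iterative walk would be the safer choice.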
+ + Implementation +

A taxonomy tree is a hierarchical structure that can be seen as a collection of deeply nested containers: nodes connected by edges, following the hierarchy from the parent node (the root) down to the child nodes (the leaves). An object-oriented programming (OOP) tree implementation based on recursion typically scales poorly for large trees, such as the NCBI Taxonomy, which is composed of >2.6 million nodes. To improve performance, Taxonomy Resolver represents the tree structure following the Nested Set Model, which is a technique developed to represent hierarchical data in relational databases lacking recursion capabilities. This allows for efficient and inexpensive querying of parent-child relationships. The full tree is traversed following the Modified Preorder Tree Traversal (MPTT) strategy (Celko, 2004), in which each node in the tree is visited twice. In a preorder traversal, the root node is visited first, followed by a recursive preorder traversal of the left subtree and then of the right subtree, until every node has been visited. The modified strategy allows capturing the ‘left’ and ‘right’ ($lft$ and $rgt$, respectively) boundaries of each subtree, which are stored as two additional attributes. Finding a subtree is as simple as searching for the nodes where $lft > lft_{node}$ and $rgt < rgt_{node}$, with $lft_{node}$ and $rgt_{node}$ being the boundaries of the node of interest. Likewise, finding the full path to a node is as simple as searching for the nodes where $lft < lft_{node}$ and $rgt > rgt_{node}$. Traversal attributes, depth and node indexes are captured for each tree node and are stored as a pandas DataFrame (The pandas development team, 2024).

+
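The nested set queries described above can be sketched on a toy tree. This is a minimal illustration under stated assumptions, not the Taxonomy Resolver code: `build_nested_set`, `descendants` and `ancestors` are hypothetical helper names, the node IDs are made up rather than real TaxIDs, and the walk is written iteratively only to avoid recursion limits.

```python
import pandas as pd


def build_nested_set(parents: dict, root: int) -> pd.DataFrame:
    """Assign lft/rgt/depth to each node using an iterative MPTT walk."""
    # Invert the child -> parent map into parent -> [children].
    children = {}
    for child, parent in parents.items():
        if child != root:
            children.setdefault(parent, []).append(child)

    rows = {}
    counter = 0
    # Each node is pushed twice: once to open its interval, once to close it.
    stack = [(root, 0, False)]
    while stack:
        node, depth, closing = stack.pop()
        counter += 1
        if closing:
            rows[node]["rgt"] = counter  # second visit sets the right boundary
            continue
        rows[node] = {"id": node, "depth": depth, "lft": counter, "rgt": counter}
        stack.append((node, depth, True))
        # Reverse order so that the smallest child ID is visited first.
        for child in sorted(children.get(node, []), reverse=True):
            stack.append((child, depth + 1, False))
    return pd.DataFrame(list(rows.values()))


def descendants(tree: pd.DataFrame, node_id: int) -> list:
    """Subtree query: lft > node's lft and rgt < node's rgt."""
    node = tree.loc[tree["id"] == node_id].iloc[0]
    mask = (tree["lft"] > node["lft"]) & (tree["rgt"] < node["rgt"])
    return tree.loc[mask, "id"].tolist()


def ancestors(tree: pd.DataFrame, node_id: int) -> list:
    """Path query: lft < node's lft and rgt > node's rgt, root first."""
    node = tree.loc[tree["id"] == node_id].iloc[0]
    mask = (tree["lft"] < node["lft"]) & (tree["rgt"] > node["rgt"])
    return tree.loc[mask].sort_values("lft")["id"].tolist()


# Hypothetical IDs, not real TaxIDs; the root points to itself.
parents = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
tree = build_nested_set(parents, root=1)
print(tree)
print(descendants(tree, 2))  # [4, 5]
print(ancestors(tree, 4))    # [1, 2]
```

Once the boundaries are stored, both queries are single vectorised comparisons over two integer columns, with no tree traversal at query time.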

Taxonomy Resolver has been developed to take advantage of the Nested Set Model tree structure, so it can perform fast validation and create lists of taxa that compose a particular subtree. Inclusion and exclusion lists can also be seamlessly used to produce subset trees with wide applications, particularly for sequence similarity search. Taxonomy Resolver has been used in production since 2020, serving thousands of users every month. It enables taxonomy filtering features for NCBI BLAST+ provided by the popular EMBL-EBI Job Dispatcher service, available from https://www.ebi.ac.uk/jdispatcher/sss/ncbiblast (Madeira et al., 2024).

+
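One way to picture inclusion and exclusion lists on a nested set table is shown below. This is a sketch under assumptions, not the package's API: the helper names `subtree_ids`, `filter_ids` and `validate` are hypothetical, the table uses made-up IDs with hand-written boundaries, and the semantics chosen here (keep everything under the included nodes, then drop everything under the excluded nodes) may differ from what Taxonomy Resolver does.

```python
import pandas as pd

# Toy nested set table (hypothetical IDs, not real TaxIDs).
# Node 1 is the root; 2 and 3 are its children; 4 and 5 sit under 2; 6 under 3.
tree = pd.DataFrame(
    {
        "id": [1, 2, 4, 5, 3, 6],
        "lft": [1, 2, 3, 5, 8, 9],
        "rgt": [12, 7, 4, 6, 11, 10],
    }
)


def subtree_ids(tree: pd.DataFrame, node_id: int) -> set:
    """IDs of `node_id` and all of its descendants (inclusive bounds)."""
    node = tree.loc[tree["id"] == node_id].iloc[0]
    mask = (tree["lft"] >= node["lft"]) & (tree["rgt"] <= node["rgt"])
    return set(tree.loc[mask, "id"].tolist())


def filter_ids(tree: pd.DataFrame, include, exclude=()) -> set:
    """Union of the included subtrees minus the excluded subtrees."""
    kept = set()
    for node_id in include:
        kept |= subtree_ids(tree, node_id)
    for node_id in exclude:
        kept -= subtree_ids(tree, node_id)
    return kept


def validate(ids, allowed: set) -> bool:
    """True if every ID belongs to the allowed set (e.g. a filtered subtree)."""
    return all(i in allowed for i in ids)


kept = filter_ids(tree, include=[1], exclude=[2])
print(sorted(kept))            # [1, 3, 6]
print(validate([3, 6], kept))  # True
print(validate([4], kept))     # False
```

The resulting ID set is the kind of list that could be passed to a sequence search tool to restrict hits to the taxa of interest.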
+ + Acknowledgements +

We would like to thank current and past members of the EMBL-EBI for their continued support. We would also like to thank EMBL and its funders.

+
+ + + + + + + + SchochConrad L. + CiufoStacy + DomrachevMikhail + HottonCarol L. + KannanSivakumar + KhovanskayaRogneda + LeipeDetlef + McveighRichard + O’NeillKathleen + RobbertseBarbara + SharmaShobha + SoussovVladimir + SullivanJohn P. + SunLu + TurnerSeán + Karsch-MizrachiIlene + + NCBI Taxonomy: A comprehensive update on curation, resources and tools + Database: The Journal of Biological Databases and Curation + 202001 + 2020 + 1758-0463 + 10.1093/database/baaa062 + 32761142 + baaa062 + + + + + + + CamachoChristiam + CoulourisGeorge + AvagyanVahram + MaNing + PapadopoulosJason + BealerKevin + MaddenThomas L. + + BLAST+: Architecture and applications + BMC bioinformatics + 200912 + 10 + 1471-2105 + 10.1186/1471-2105-10-421 + 20003500 + 421 + + + + + + + YuGuangchuang + SmithDavid K. + ZhuHuachen + GuanYi + LamTommy Tsan-Yuk + + ggtree: An R package for visualization and annotation of phylogenetic trees with their covariates and other associated data + Methods in Ecology and Evolution + 2017 + 8 + 1 + https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12628 + 10.1111/2041-210X.12628 + 28 + 36 + + + + + + KansJonathan + + Entrez Direct: E-utilities on the Unix Command Line + Entrez Programming Utilities Help [Internet] + National Center for Biotechnology Information (US) + 202407 + 20240925 + https://www.ncbi.nlm.nih.gov/books/NBK179288/ + + + + + + The pandas development team + + Pandas-dev/pandas: Pandas + Zenodo + 202409 + 20240925 + https://zenodo.org/records/13819579 + 10.5281/zenodo.13819579 + + + + + + OchoterenaHelga + VrijdaghsAlexander + SmetsErik + Claßen-BockhoffRegine + + The Search for Common Origin: Homology Revisited + Systematic Biology + 201909 + 20240926 + 68 + 5 + 1063-5157 + https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6701455/ + 10.1093/sysbio/syz013 + 30796841 + 767 + 780 + + + + + + SandallEmily L. + MaureaudAurore A. + GuralnickRobert + McGeochMelodie A. + SicaYanina V. + RoganMatthew S. + BooherDouglas B. 
+ EdwardsRobert + FranzNico + IngenloffKate + LucasMaisha + MarshCharles J. + McGowanJennifer + PinkertStefan + RanipetaAjay + UetzPeter + WieczorekJohn + JetzWalter + + A globally integrated structure of taxonomy to support biodiversity science and conservation + Trends in Ecology & Evolution + 202312 + 20240926 + 38 + 12 + 0169-5347 + https://www.sciencedirect.com/science/article/pii/S016953472300215X + 10.1016/j.tree.2023.08.004 + 1143 + 1153 + + + + + + CelkoJoe + + Chapter 4 - Nested Set Model of Hierarchies + Joe Celko’s Trees and Hierarchies in SQL for Smarties + + CelkoJoe + + Morgan Kaufmann + San Francisco + 200401 + 20240926 + 978-1-55860-920-4 + https://www.sciencedirect.com/science/article/pii/B9781558609204500052 + 10.1016/B978-155860920-4/50005-2 + 45 + 99 + + + + + + Cofe Code and contributors + + Anytree: Python tree data library + GitHub repository + GitHub + 2024 + https://github.com/c0fec0de/anytree + + + + + + Kay Jan W. and contributors + + BigTree: Tree implementation and methods for python, integrated with list, dictionary, pandas and polars DataFrame. + GitHub repository + GitHub + 2024 + https://github.com/kayjan/bigtree + + + + + + MadeiraFábio + MadhusoodananNandana + LeeJoonheung + EusebiAlberto + NiewielskaAnia + TiveyAdrian R N + LopezRodrigo + ButcherSarah + + The EMBL-EBI Job Dispatcher sequence analysis tools framework in 2024 + Nucleic Acids Research + 202404 + 52 + W1 + 0305-1048 + https://doi.org/10.1093/nar/gkae241 + 10.1093/nar/gkae241 + W521 + W525 + + + + +