Supplementary Note 4 - Prior Art

There have been numerous attempts at standardising knowledge graphs and making biomedical data stores more interoperable [@doi:10.1093/bib/bbac404;@doi:10.1146/annurev-biodatasci-010820-091627]. They can be divided into three broad classes representing increasing levels of abstraction of the KG build process:

  1. Centrally maintained databases include task-oriented data collections such as OmniPath [@doi:10.1038/nmeth.4077] or the CKG [@doi:10.1038/s41587-021-01145-6]. They are the least flexible form of knowledge representation, usually bound to a specific research purpose, and are highly dependent on their primary maintainers for continuous functioning. BioCypher reduces the development and maintenance overhead that usually accompanies such a resource, making a task-specific KG feasible for smaller and less bioinformatics-focused groups. These databases usually do not conform to any standard in their knowledge representation, which hinders their integration. In contrast, with BioCypher, we migrate OmniPath, CKG, and other popular databases onto an interoperable KG framework.

  2. Explicit standard formats or modelling languages include the Biolink model [@doi:10.1111/cts.13302], BEL [@doi:10.1016/j.drudis.2013.12.011], GO-CAM [@doi:10.1038/s41588-019-0500-1], SBML [@doi:10.15252/msb.20199110], BioPAX [@doi:10.1038/nbt.1666], and PSI-MI [@doi:10.1038/nbt926]. There are many more, each a solution to a very specific problem, as reviewed elsewhere [@doi:10.1016/j.drudis.2013.12.011;@doi:10.1093/bioinformatics/bti718]; some are part of the COMBINE standard ecosystem [@doi:10.3389/fbioe.2015.00019]. Their main shortcoming is the rigidity that follows from their data model definitions: to represent data in one of these languages, the user needs to fully adopt it. If the task exceeds the scope of the language, the user needs to either look for alternatives, or introduce new features into the language, which can be a lengthy process. In addition, some features may be incompatible, and thus, one centrally maintained language definition is fundamentally limited. With BioCypher, each of the above languages can be adopted as the basis for a particular knowledge graph; in fact, we use the Biolink model as a basic ontology. Inside our framework, these languages can be freely and transparently exchanged, modified, extended, and hybridised, as we show in several of our case studies (e.g., “Tumour board” extends Biolink with Sequence Ontology and Disease Ontology).

  3. KG frameworks provide a means to build KGs, similar to the idea of BioCypher [@doi:10.1101/2020.04.30.071407;@doi:10.1101/631812;@doi:10.1101/2020.08.17.254839;@doi:10.1186/s12859-022-04932-3]. However, most tie themselves tightly to a particular standard format or modelling language ecosystem, thereby inheriting many of the limitations described above. The Knowledge Graph Hub provides a data loader pipeline, KGX allows conversion of KGs between different technical formats, and RTX-KG2 builds a fixed, semantically standardised KG; all three adhere to the Biolink model [@doi:10.1101/2020.08.17.254839;@doi:10.1186/s12859-022-04932-3]. Bio2BEL is an extensive framework to transform primary databases into BEL [@doi:10.1101/631812]. PheKnowLator is the only tool that is conceptually similar to BioCypher in that it allows the creation of knowledge graphs under different data models [@doi:10.1101/2020.04.30.071407]. However, it appears to be aimed at knowledge representation experts, requiring considerable bioinformatics and ontology expertise. Although it is fully customisable, it does not support flexible recombination of modular components.
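
The idea of extending one ontology with part of another, mentioned under the second class above, can be illustrated with a minimal sketch. This is not BioCypher's implementation: the function names `hybridise` and `ancestors` are hypothetical, the ontologies are toy child-to-parent maps, and the terms are illustrative rather than copied from Biolink or the Sequence Ontology. The sketch also assumes that the extension map contains only the subtree below its root.

```python
from typing import Dict, List, Optional

Ontology = Dict[str, Optional[str]]  # child term -> parent term (None for a root)

# Toy "base" ontology (illustrative terms only).
base: Ontology = {
    "entity": None,
    "biological entity": "entity",
    "sequence variant": "biological entity",
    "disease": "biological entity",
}

# Toy "extension" ontology whose root corresponds to a term in the base.
extension: Ontology = {
    "sequence_variant": None,
    "SNV": "sequence_variant",
    "insertion": "sequence_variant",
}


def hybridise(base: Ontology, extension: Ontology,
              ext_root: str, join_node: str) -> Ontology:
    """Graft the terms below `ext_root` in `extension` under `join_node` in `base`."""
    if join_node not in base:
        raise KeyError(f"join node {join_node!r} is not in the base ontology")
    merged = dict(base)  # leave the inputs untouched
    for child, parent in extension.items():
        if child == ext_root:
            continue  # the root itself is represented by the join node
        merged[child] = join_node if parent == ext_root else parent
    return merged


def ancestors(ontology: Ontology, term: str) -> List[str]:
    """Return the chain of parents from `term` up to the root."""
    path: List[str] = []
    parent = ontology[term]
    while parent is not None:
        path.append(parent)
        parent = ontology[parent]
    return path


merged = hybridise(base, extension, "sequence_variant", "sequence variant")
print(ancestors(merged, "SNV"))
```

After grafting, a term from the extension resolves through the base hierarchy, so a query for a general base term can also reach the more specific terms of the extension.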

The strategy of subgraph extraction to yield smaller, user-specific KGs has been implemented previously, for instance by CROssBAR (v1), ROBOKOP, and the BioThings Explorer [@doi:10.1093/nar/gkab543;@doi:10.1093/bioinformatics/btz604;@doi:10.1186/s12859-018-2041-5]. However, these rely on single (and thus enormous) harmonised KGs from which the subgraphs are extracted, as opposed to BioCypher’s modular approach [@doi:10.1111/cts.12592]. While the “top-down” approach of first building a massive KG and then extracting subgraphs from it is a valid means to arrive at a particular knowledge representation, the effort involved is detrimental to the efficiency and democratisation of the process. A secondary consequence of this large primary effort is that alternative representations of the initial KG will probably not be attempted, hindering flexible knowledge representation. In contrast, the “bottom-up” approach we follow in BioCypher emphasises modular recombination and flexible representation with little overhead.
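
The difference between the two approaches can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not the code of any tool named above: the three adapter functions and the names `build_top_down` and `build_bottom_up` are hypothetical, and each adapter stands in for one primary resource.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)


# Hypothetical adapters, each standing in for one primary resource.
def protein_interactions() -> List[Edge]:
    return [("P1", "interacts_with", "P2")]


def drug_targets() -> List[Edge]:
    return [("D1", "targets", "P1")]


def disease_genes() -> List[Edge]:
    return [("G1", "associated_with", "DIS1")]


ADAPTERS: Dict[str, Callable[[], List[Edge]]] = {
    "protein_interactions": protein_interactions,
    "drug_targets": drug_targets,
    "disease_genes": disease_genes,
}


def build_top_down(wanted_relations: Set[str]) -> List[Edge]:
    """Build the full harmonised graph first, then extract a subgraph."""
    full_graph = [edge for adapter in ADAPTERS.values() for edge in adapter()]
    return [edge for edge in full_graph if edge[1] in wanted_relations]


def build_bottom_up(selected: Iterable[str]) -> List[Edge]:
    """Run only the adapters the task needs and combine their output."""
    graph: List[Edge] = []
    for name in selected:
        graph.extend(ADAPTERS[name]())
    return graph


top_down = build_top_down({"interacts_with", "targets"})
bottom_up = build_bottom_up(["protein_interactions", "drug_targets"])
print(top_down == bottom_up)
```

Both routes yield the same task-specific graph here, but the top-down route runs every adapter regardless of the task, whereas the bottom-up route runs only the two that were selected.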

Ontology mapping has been leveraged for data integration by consortia such as the Monarch Initiative (which is the parent organisation of the MONDO Disease Ontology and the Biolink model, among others) as well as individual projects, such as KaBOB [@doi:10.1534/genetics.116.188870;@doi:10.1186/s12859-015-0559-3]. While conceptually related to BioCypher in their use of ontologies and biomedical data, these are massive efforts that are not amenable to replication by the average research group. We aim to close this gap by providing an agile and modular framework that facilitates the reuse of the valuable resources generated by those projects.

There exist alternatives to workflows that involve KGs. While the premise of our manuscript is that KGs are an important part of sustainable and trustworthy machine learning in the biomedical sciences, “zero domain knowledge” approaches such as UniHPF [@doi:10.48550/arXiv.2211.08082] can do without prior knowledge in their inference process. Whether methods that forgo knowledge representation entirely can be as good as or better than methods that use knowledge representation is still a matter of discussion [@doi:10.1038/s41551-022-00942-x;@doi:10.1101/2022.05.01.489928;@doi:10.1101/2022.12.07.22283238;@doi:10.48550/arxiv.2210.09338;@doi:10.1016/j.artint.2021.103627;@doi:10.48550/arXiv.2205.15952;@doi:10.1093/bioinformatics/btac001]. One aspect that is apparent from modern developments in large language models is that prior knowledge-free models appear to be very data hungry; while billion-parameter models are very impressive in their text and image processing capabilities, we do not have nearly enough data in molecular biomedicine to train a GPT-like model, even if we had the funds to train it. In addition, even in prior knowledge-free deep models, a semantically enriched knowledge graph can still be useful as an in-process component [@doi:10.1609/aaai.v36i10.21286]. To address these and other performance-related questions, we want to facilitate the creation of benchmarks and standard datasets through the modular nature of our framework.