Output modularisation #4

ecarrenolozano · 2024-10-10T12:39:25Z

ecarrenolozano
Oct 10, 2024

Output modularisation

Re #361 and discussions therein, we should clarify which architecture changes we need to make to streamline the output API to represent all BioCypher functions and yet have the simplest, most intuitive functionality for the user.

@kpto made the observation that isolation is needed (the core should not need to know about the output); @slobentanzer proposes to focus core representation on the basic tuple representation (three-element for nodes, five-element for edges), and to have independent output modules that can be chosen by some configuration / API. The exact nature of these configuration options and API choices are the subject of discussion here. For that, we work with diagrams for visualising the architecture.
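To make the proposal concrete, here is a minimal sketch of a core that only handles tuples and delegates to output modules selected by a configuration key. The tuple layouts and all names (`OUTPUTS`, `register_output`, `write_graph`) are assumptions for illustration, not existing BioCypher API.

```python
from typing import Callable, Iterable

# Assumed tuple layouts (illustrative only):
#   node: (node_id, label, properties)
#   edge: (edge_id, source_id, target_id, label, properties)
Node = tuple[str, str, dict]
Edge = tuple[str, str, str, str, dict]

# Hypothetical registry mapping a configuration value to an output function.
OUTPUTS: dict[str, Callable[[Iterable[Node], Iterable[Edge]], object]] = {}


def register_output(name: str):
    """Register an output module under a configuration name."""
    def decorator(func):
        OUTPUTS[name] = func
        return func
    return decorator


@register_output("dict")
def to_dict(nodes, edges):
    """Toy output: collect the streams into plain dictionaries."""
    return {
        "nodes": {n[0]: {"label": n[1], **n[2]} for n in nodes},
        "edges": [
            {"id": e[0], "source": e[1], "target": e[2], "label": e[3], **e[4]}
            for e in edges
        ],
    }


def write_graph(config: dict, nodes, edges):
    """The core only looks up the output by name; it knows nothing else."""
    return OUTPUTS[config["output"]](nodes, edges)


nodes = [("P1", "protein", {"taxon": "9606"}), ("P2", "protein", {})]
edges = [("E1", "P1", "P2", "interacts_with", {"score": 0.9})]
result = write_graph({"output": "dict"}, nodes, edges)
```

In this sketch, adding a new output only means registering another function; the core itself stays untouched.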

Architecture Diagrams

We include and expand our catalog of UML diagrams. You can view the diagrams (just click below). However, if you want to contribute, join our Zulip Channel and request access to edit the files.

BioCypher Architecture Diagrams

slobentanzer · 2024-10-10T12:48:14Z

slobentanzer
Oct 10, 2024
Maintainer

I suggest moving to .drawio.svg files, stored in the biocypher repository, as soon as possible. For instance, https://github.com/biocypher/biocypher/blob/main/docs/write_mode.drawio.svg. We can create a subfolder for the diagrams in the docs folder. This will allow more seamless collaborative work on the diagrams, including making changes via PRs.

@kpto FYI, this is where we now lead the current output architecture discussion.

0 replies

kpto · 2024-10-11T10:16:26Z

kpto
Oct 11, 2024

@slobentanzer To view and edit a diagram stored in GitHub, the app draw.io needs to be installed in the organisation. I have requested an installation to biocypher, please check.

Also, I don't think a mega thread to discuss everything is manageable. This thread may serve as an index of other architectural discussions, but a discussion of a specific part should have its own thread.

Of course if you know a few tricks of GitHub that can address my concern, please share them with me :)

11 replies

slobentanzer Oct 11, 2024
Maintainer

Please give more constructive feedback (what do you need, which permissions do I need to modify for you)

kpto Oct 11, 2024

@slobentanzer You are the owner and only you can see what granular controls you have. If you ask me, I can only guess at permissions that would allow me to create labels and edit PRs, which is not constructive either, because I don't know exactly how actions are mapped to granular permissions in GitHub. I can do some research and then tell you, though.

slobentanzer Oct 11, 2024
Maintainer

Yeah that would be great. I'd prefer if Edwin and yourself were maintainers of these processes as well. I don't know, for instance, where the permissions need to be updated (Discussions are associated to a repository, so is it a repo-level setting or organisation-level); or, which exact roles/permissions would be best for a maintainer role. If you find out something, please let me know. :)

kpto Oct 11, 2024

I will still do some research, but regarding this matter, I don't think I should have any permission to do what I proposed. I don't think we should have permissions to edit any issues/PRs owned by others; otherwise the workflow cannot be enforced. The additional label should be created by you, and the PR should be edited by @ecarrenolozano. Do you really mean you want us to be maintainers of this repo?

slobentanzer Oct 11, 2024
Maintainer

Thinking about roles is important no doubt, but you are conflating some things here. And yes, my medium-term goal is that you will be core contributors and maintainers/moderators to the extent that you can guide other contributors without my intervention. My points:

This here is not a repo, this is the Discussion feature of the organisation. Naturally, we should not modify others' issues or PRs.
You suggested, among other things, to add a label to the Discussion. Even if Edwin has to change the title and body of the discussion topic, adding labels to things is something that maintainers and triage role members usually do. It is not necessary that only the originator of an issue or discussion adds a label; in fact, often only the maintainers can label things.
You suggested to make the single discussion more specific, but add general instructions for discussions at a higher level. I don't see why you should not be able to do that, conceptually. That is what I would like you to research: which role should do these kinds of things? Imagine someone external opens a discussion: which permissions would you need to moderate this discussion without my help?

ecarrenolozano · 2024-10-21T10:08:28Z

ecarrenolozano
Oct 21, 2024
Author

slobentanzer Oct 21, 2024
Maintainer

Thanks @ecarrenolozano. Fully agree that transparent definition of what to add / change is great.

To clarify, you just pasted the sequence diagram that @ryxx0811 posted before, right? This does not reflect your points above. Should we draw one that does?

We can think of a way to complement the already made template (Issue: Add New Component).

Can you clarify that point? The New Component issue is for adapters and pipelines, so use cases of BioCypher. Do you want to add a template for adding/changing BioCypher features?

slobentanzer Oct 21, 2024
Maintainer

I think one essential decision is the internal format we use for the KG. I think it would be useful to research what would be the most flexible and technically efficient internal representation of the graph that fulfils your above requirements. We have the BioCypherNode and BioCypherEdge classes already; at the moment they are @dataclass(frozen=True) type dictionaries.

Should the internal representation be individual collections of these classes for all node and edge types?
How should we store them to allow in-memory, on-demand streaming, and "offline mode"?
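One possible reading of the first question, as a minimal sketch: a store that keeps one collection per node type and hands nodes out lazily. `Node` is only a stand-in for the existing frozen dataclasses; its field names and the `InMemoryStore` class are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Stand-in for a node class; field names are assumptions."""
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


class InMemoryStore:
    """Hypothetical store keeping one collection per node type."""

    def __init__(self):
        self._nodes_by_label = defaultdict(list)

    def add(self, node: Node) -> None:
        self._nodes_by_label[node.label].append(node)

    def stream(self, label: str):
        """Yield nodes of one type lazily, so consumers can stream."""
        yield from self._nodes_by_label.get(label, [])


store = InMemoryStore()
store.add(Node("P1", "protein", {"taxon": "9606"}))
store.add(Node("G1", "gene"))
store.add(Node("P2", "protein"))
protein_ids = [n.node_id for n in store.stream("protein")]
```

An "offline" variant could keep the same `add`/`stream` interface and back it with files instead of lists; that part is not shown here.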

ecarrenolozano Oct 21, 2024
Author

Thanks @ecarrenolozano. Fully agree that transparent definition of what to add / change is great.

To clarify, you just pasted the sequence diagram that @ryxx0811 posted before, right? This does not reflect your points above. Should we draw one that does?

Yes, this is just a copy paste to show the modeling diagrams (if they are added) in the format proposed. I need to update the reply with the updated diagrams.

We can think of a way to complement the already made template (Issue: Add New Component).

Can you clarify that point? The New Component issue is for adapters and pipelines, so use cases of BioCypher. Do you want to add a template for adding/changing BioCypher features?

Exactly, for adding/changing BioCypher features I could create a new template that involves the new format as a basis for proposing those changes. If you agree, I could do that.

I think one essential decision is the internal format we use for the KG. I think it would be useful to research what would be the most flexible and technically efficient internal representation of the graph that fulfils your above requirements. We have the BioCypherNode and BioCypherEdge classes already; at the moment they are @dataclass(frozen=True) type dictionaries.
1. Should the internal representation be individual collections of these classes for all node and edge types?

2. How should we store them to allow in-memory, on-demand streaming, and "offline mode"?

I will look for information about this.

slobentanzer Oct 21, 2024
Maintainer

@ecarrenolozano yes, agree to all points

ecarrenolozano · 2024-10-22T06:19:36Z

ecarrenolozano
Oct 22, 2024
Author

I attach a diagram, as a complement for this discussion.

1 reply

slobentanzer Oct 22, 2024
Maintainer

@ecarrenolozano this may not be self-explanatory. Can you add the conclusions from our discussion as well?

kpto · 2024-10-22T10:55:47Z

kpto
Oct 22, 2024

A bit late to comment, but what is the internal representation for? Why isn't it just another output like the others? An output is agnostic to its destination; it is merely something that complies with the interface. Whether it ends up using a database or memory does not matter. You can have multiple outputs at the same time: a database connection output, a database import data output, a networkx output and a pandas output. Isn't that more configurable for users?
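A minimal sketch of that idea, assuming a hypothetical `Output` interface with `add_node` and `finish` methods (none of these names exist in BioCypher): every node is fanned out to all configured outputs, and an in-memory output is just one of them.

```python
from typing import Protocol


class Output(Protocol):
    """Hypothetical interface every output has to comply with."""

    def add_node(self, node: tuple) -> None: ...
    def finish(self) -> None: ...


class ListOutput:
    """Keeps nodes in memory."""

    def __init__(self):
        self.nodes = []

    def add_node(self, node):
        self.nodes.append(node)

    def finish(self):
        pass


class CountingOutput:
    """Only counts nodes; stands in for, e.g., a database writer."""

    def __init__(self):
        self.count = 0
        self.finished = False

    def add_node(self, node):
        self.count += 1

    def finish(self):
        self.finished = True


def run(nodes, outputs: list[Output]) -> None:
    """Fan every node out to all configured outputs, then close them."""
    for node in nodes:
        for output in outputs:
            output.add_node(node)
    for output in outputs:
        output.finish()


memory, counter = ListOutput(), CountingOutput()
run([("P1", "protein", {}), ("P2", "protein", {})], [memory, counter])
```

Leaving `ListOutput` out of the list is then all it takes to avoid holding the graph in memory.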

14 replies

kpto Oct 22, 2024

@slobentanzer It seems to me that you think a model, apart from its data structure, must also define the relevant business logic. This is not true, and it is indeed also a bad practice.

Rather than this:

class Entity:
    def __init__(self):
        self.propA = None

    def businessA(self):
        pass

    def businessB(self):
        pass

entity = Entity()
entity.businessA()

Do this:

from business import businessA, businessB  # hypothetical module holding the logic

class Entity:
    def __init__(self):
        self.propA = None

entity = Entity()
processedEntity = businessA.process(entity)
processedEntity = businessB.process(processedEntity)

Not only does it make your program more modular by separating responsibilities, your logic also becomes testable, like this:

entity = MockEntity()
processedEntity = businessA.process(entity)
processedEntity.assert_logic_is_correct()

You may argue that I can still test the original design like this:

entity = Entity()
entity.businessA()
assert(entity.someProp == truth)

But if businessB requires an entity that is processed by businessA first, you can't unit test your methods, because they must run together:

entity = Entity()
entity.businessA()
entity.businessB()
assert(entity.businessBRelatedProp == truth)  # failed, but because of businessA or businessB?

But with the business logic externalised, you can have a mock entity which acts differently, like this:

class MockEntityJustForBusinessB:
    def __init__(self):
        self.quirkyPropForTestingBusinessB = None

entity = MockEntityJustForBusinessB()
processedEntity = businessB.process(entity)
processedEntity.assert_logic_is_correct()

kpto Oct 22, 2024

@slobentanzer After I posted my comment above, I researched a bit further, and it seems that your approach and mine are referred to as the rich and the anemic domain model, respectively. I can see that a rich domain model is often preferred in Python, especially in research-related packages. Perhaps I am too influenced by my previous work experience and am unconsciously selling it too hard. Still, the pattern I learnt separates responsibilities very well, and if I drop it, I don't know how to organise code well. Taking this from BioCypherEdge as an example:

    def __post_init__(self):
        """
        Check for reserved keywords.
        """

        if ":TYPE" in self.properties.keys():
            logger.debug(
                "Keyword ':TYPE' is reserved for Neo4j. "
                "Removing from properties.",
                # "Renaming to 'type'."
            )
            # self.properties["type"] = self.properties[":TYPE"]
            del self.properties[":TYPE"]
        elif "id" in self.properties.keys():
            logger.debug(
                "Keyword 'id' is reserved for Neo4j. "
                "Removing from properties.",
                # "Renaming to 'type'."
            )
            # self.properties["type"] = self.properties[":TYPE"]
            del self.properties["id"]
        elif "_ID" in self.properties.keys():
            logger.debug(
                "Keyword '_ID' is reserved for Postgres. "
                "Removing from properties.",
                # "Renaming to 'type'."
            )
            # self.properties["type"] = self.properties[":TYPE"]
            del self.properties["_ID"]

Obviously, if I don't use Postgres, the Postgres keyword check does not apply to me. If I follow your pattern to fix it, it would become

        elif self.dbms == "postgres" and "_ID" in self.properties.keys():
            logger.debug(
                "Keyword '_ID' is reserved for Postgres. "
                "Removing from properties.",
                # "Renaming to 'type'."
            )
            # self.properties["type"] = self.properties[":TYPE"]
            del self.properties["_ID"]

This raises a few issues:

The object would need to have knowledge of the output injected, which does not seem right to me.
The check self.dbms == "postgres" is run repeatedly, which wastes resources.

But if the logic is externalised instead, I can do

from biocypher import config

def generate_validator(config):
    validators = []
    if config["output"] == "postgres":
        validators.append(PostgresKeywordValidator())
    if config["output"] == "neo4j":
        validators.append(Neo4jKeywordValidator())
    return CompositeValidator(validators)

validator = generate_validator(config)
validator.validate(nodes)

Doing so I solved all issues above but obviously this kind of pattern is a stranger to you.
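For readers unfamiliar with the pattern, a self-contained version of the snippet above could look as follows. The validator classes are hypothetical; the reserved keys are the ones from the quoted `__post_init__` (":TYPE" and "id" for Neo4j, "_ID" for Postgres). This is a sketch under those assumptions, not the BioCypher implementation.

```python
class KeywordValidator:
    """Removes reserved property keys; subclasses set `reserved`."""

    reserved: frozenset = frozenset()

    def validate(self, properties: dict) -> dict:
        return {k: v for k, v in properties.items() if k not in self.reserved}


class Neo4jKeywordValidator(KeywordValidator):
    reserved = frozenset({":TYPE", "id"})


class PostgresKeywordValidator(KeywordValidator):
    reserved = frozenset({"_ID"})


class CompositeValidator:
    """Applies a fixed list of validators in order."""

    def __init__(self, validators):
        self.validators = list(validators)

    def validate(self, properties: dict) -> dict:
        for validator in self.validators:
            properties = validator.validate(properties)
        return properties


def generate_validator(config: dict) -> CompositeValidator:
    """Decide once, from the configuration, which checks apply."""
    available = {
        "neo4j": Neo4jKeywordValidator,
        "postgres": PostgresKeywordValidator,
    }
    output = config.get("output")
    validators = [available[output]()] if output in available else []
    return CompositeValidator(validators)


validator = generate_validator({"output": "neo4j"})
cleaned = validator.validate({"id": "x", "_ID": 7, "name": "TP53"})
```

With the Neo4j configuration, "_ID" is left alone because the Postgres check was never added to the composite.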

I am actually also not sure whether the research packages I checked (pandas, scikit-learn, corneto) are a good reference. Most research packages do not have a global state like the BioCypher configuration. Can you think of any similar project that I can learn from?

slobentanzer Oct 23, 2024
Maintainer

@kpto this is fine, I am not attached to any patterns. I would be happy with that refactor. I think it is somewhat beside the point though: this is an internal process no user ever sees (and runs once, so we don't gain a lot by optimising this part, even if some conditional runs twice). We can refactor this as much as we want to make the code more maintainable and efficient. I said above:

we need to design the API for the users

I think this part of the argument is currently much more important than internal optimisations. I am happy to refactor the code to align with good practice, but the API discussion is somewhat independent of that. Please focus on the topic of the discussion and PR, which is the output modularisation and related changes in the user-facing API for now (modes such as in memory, online, offline; what should be the internal representation of nodes and edges).

slobentanzer Oct 23, 2024
Maintainer

Re your comment further above: I am aware that dependency injection is a thing, and I am also aware that some of the codebase is not what software engineers would have written. I am happy to make changes to any of this, but we need to focus on the prioritised issues; we will not fix the entire codebase at once. I want to particularly avoid premature optimisation. Let's get back to discussing the modes and internal representation, as those are what matters for this discussion and PR.

slobentanzer Oct 23, 2024
Maintainer

Check also #81 and #76, I have been considering this for a while. No definitive answer yet.

ecarrenolozano · 2024-10-28T09:59:50Z

ecarrenolozano
Oct 28, 2024
Author

Hi guys, Happy to interact again:

Understanding in-memory, online and offline mode terms.

Look at this diagram (I made this temporary diagram for the purpose of this discussion).
in-memory: the knowledge graph is created and lives in RAM. If the session ends or the computer restarts, the graph is lost.
online mode: refers to interacting with the graph: adding nodes and edges and converting to another format, for instance a Pandas DataFrame. These operations occur mainly in memory and, for this definition, are live operations.
offline mode: the graph is stored on disk, so it survives a shutdown of the computer.

What is the best data structure for the internal representation? We have several alternatives, but they are basically variations of edge lists, adjacency lists, and adjacency matrices. Take a look at the short presentation I prepared for you; it summarises the alternatives (pros/cons).

At this point, we can implement a data structure suitable for our needs, such as adjacency maps (using Python dictionaries, which are quite efficient). Problems could appear when we work with large graphs. In my brief research, I noticed that there are several libraries that build the graph using optimized routines in C/C++. Fortunately, all of the libraries listed below have Python bindings. We can study the possibility of using one of them to support all our graph construction:
- igraph
- graph-tool
- NetworKit

We need to define and run experiments to see which approach is best: using a data structure in native Python, or relying on external libraries that help us with this task. What do you think?
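As a baseline for such experiments, the adjacency map mentioned above can be sketched in a few lines of native Python. The class and method names are assumptions for illustration; this version is directed and keeps at most one edge per (source, target) pair.

```python
from collections import defaultdict


class AdjacencyMap:
    """Directed graph as dict-of-dicts: source -> target -> edge properties."""

    def __init__(self):
        self.nodes = {}                # node_id -> properties
        self.out = defaultdict(dict)   # source -> {target: properties}

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, target, **properties):
        # Create missing endpoints on the fly so the edge never dangles.
        self.nodes.setdefault(source, {})
        self.nodes.setdefault(target, {})
        self.out[source][target] = properties

    def neighbours(self, node_id):
        return list(self.out.get(node_id, {}))

    def has_edge(self, source, target):
        return target in self.out.get(source, {})


graph = AdjacencyMap()
graph.add_node("P1", label="protein")
graph.add_edge("P1", "P2", label="interacts_with")
graph.add_edge("P1", "P3", label="interacts_with")
```

Edge lookup and neighbour listing are dictionary operations here; parallel edges between the same pair of nodes would need a different value type (for example a list of property dictionaries).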

7 replies

kpto Oct 28, 2024

@slobentanzer In your design how do I turn off the in-memory representation if I don't need it and I want to avoid high memory usage?

slobentanzer Oct 28, 2024
Maintainer

@kpto you would use one of the other modes; online if you have a DB running somewhere, offline if you want to write to disk. Only in memory mode will keep the KG in memory. That is the main reason the modes exist, IMO. To distinguish between streaming and non-streaming behaviour.

One of the first simple things we could do in terms of configuration is switch from the current way of definition

offline: True

to more explicit

mode: < in memory | online | offline/disk >
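A minimal sketch of how such a setting could be resolved, including a fallback for the current `offline` flag. The `Mode` enum, the `resolve_mode` function, and the exact string values are assumptions; in particular, treating a legacy configuration without `offline: True` as online is an assumption of this sketch.

```python
from enum import Enum


class Mode(Enum):
    IN_MEMORY = "in memory"
    ONLINE = "online"
    OFFLINE = "offline"


def resolve_mode(config: dict) -> Mode:
    """Prefer the explicit `mode` key; fall back to the legacy `offline` flag."""
    if "mode" in config:
        # Raises ValueError for anything that is not a known mode.
        return Mode(config["mode"])
    # Assumption: a legacy config without `offline: True` means online.
    return Mode.OFFLINE if config.get("offline", False) else Mode.ONLINE


explicit = resolve_mode({"mode": "in memory"})
legacy = resolve_mode({"offline": True})
```

An unknown value such as `mode: streaming` fails immediately with a `ValueError`, which makes typos in the configuration visible early.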

slobentanzer Oct 28, 2024
Maintainer

@kpto it is also related to the API: if we had complete separation of only two modes, we could agree that we use add methods for in memory, and write methods for disk. But, since we also have in-between CRUD-type operations we could use add or write for (e.g., adding a node to a running DB somewhere), the distinction is not clear enough to only rely on the implicit separation of modes by the Python API; at least IMO.

slobentanzer Oct 28, 2024
Maintainer

Here are some older considerations (from OmniPath) about using tuples and dictionaries: saezlab/pypath#205 (comment)

A lot of the same reasoning went into the decision for using tuples as the stream into BioCypher from the adapters. Since this is the current solution, it is a reasonable baseline for the comparison we'll do.

slobentanzer Oct 28, 2024
Maintainer

I have created a new comment with my specs for this internal data structure (from a previous response): https://github.com/orgs/biocypher/discussions/377?sort=new#discussioncomment-11077185

slobentanzer · 2024-10-28T16:36:28Z

slobentanzer
Oct 28, 2024
Maintainer

The internal data structure for representing the KG

The data structure should be the minimal feature-complete implementation of any knowledge graph we aim to model; the internal technical structure we use to represent nodes and edges. It should be agnostic to input and output formats (as far as feasible) and technically efficient, to allow a low memory footprint, fast IO, streaming, etc. We currently have two things that share this task in some regards. I think it would be cleaner to agree on one, but there are some complications.

Tuples: the current interface format between adapters and core are collections of tuples (3- and 5-element) for nodes and edges.
Frozen dataclasses in _create.py: these are more formally defined typed data classes for nodes and edges (BioCypherNode and BioCypherEdge)

How they relate: the tuples are more like a minimal "convention" that is not enforced in code apart from some error messages. They are technically very efficient, but less controlled. They are currently the input stream into BioCypher and are "before normalisation", i.e., they are "translated" by the _translate module into the actual KG components, based on the schema configuration. After translation, they are BioCypherNode and BioCypherEdge instances (still collections, still streamable).
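The relationship can be sketched as a generator that turns raw tuples into typed instances one at a time, so the stream stays lazy. `TranslatedNode`, `LABEL_MAP`, and `translate` are hypothetical stand-ins; the real `_translate` module works from the schema configuration and does considerably more.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass(frozen=True)
class TranslatedNode:
    """Stand-in for the translated node class; fields are assumptions."""
    node_id: str
    label: str
    properties: dict = field(default_factory=dict)


# Hypothetical mapping from input labels to schema labels.
LABEL_MAP = {"uniprot_protein": "protein", "entrez_protein": "protein"}


def translate(tuples: Iterable[tuple]) -> Iterator[TranslatedNode]:
    """Lazily turn raw 3-tuples into typed nodes, skipping unmapped labels."""
    for node_id, label, properties in tuples:
        if label in LABEL_MAP:
            yield TranslatedNode(node_id, LABEL_MAP[label], dict(properties))


raw = iter([
    ("K8H6Q4", "uniprot_protein", {"taxon": "8366"}),
    ("X1", "unknown_type", {}),
])
translated = translate(raw)  # nothing is consumed yet
first = next(translated)
```

Because `translate` is a generator, only one tuple and one instance need to exist at a time unless the consumer decides to collect them.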

The current workflow is a tradeoff between simplicity of implementation (the tuples) and rigorous checks for alignment with the KG schema definition, ontologies, etc (the data classes). Particularly the existence of the _translate module, that takes care of aligning any input from any adapter with the ontologies used in the background, complicates the programming (but ideally makes the task of building a KG easier for the user).

In perspective:

I am not at all attached to the current representation. It should only be able to fulfil all these purposes; how it is implemented does not matter, ultimately.
- it should be technically capable (fast, efficient, streamable)
- it should be as simple as possible (but not simpler)
- it should be easy to use and reuse in the case of adapter design
- it should not result in redundant objects in any part of the build process
- it should be robust (type-checked at some point)
- it should be agnostic to the output adapter formats
- it should be able to reflect all potential properties of the KG (properties on nodes and edges, hyper-edges, whatever else is useful)
- it should be amenable to meta-programming (as one of the longer-term goals is to have the adapter code "write itself" using only the configuration and an input format definition)

8 replies

kpto Oct 31, 2024

The relationship between adapters and internal representation is that the adapter output format could be the internal representation (data structure). We currently pass collections of tuples from the adapter to the BioCypher core; in the simplest case of internal data structure, we could choose to store these collections in memory and build output objects from them on-demand. In this case, the adapter output and the internal representation would be the same thing.

If it will not be collections of tuples, then they will be related in the practical sense that we need to transform the output from the adapters (which currently are the collections of tuples, but this could also change) into the internal representation upon ingestion.

I am not sure what you mean by that

If the internal representation is really internal then it should stay internal, meaning that it is accessible via an interface and not constructible by other layers. A data transfer object (DTO) is a dumb object that is used for data exchange between subsystems (Web API <-> App) or layers in a software (Data layer <-> Business layer). Right now the tuple is used like that, but usually a more explicitly defined class is used to reduce human error.

From what I understand, the internal representation in the core layer will mostly be used for format conversion only, such as to_networkx or to_pandas. Without knowing whether it may have other uses, I don't see a reason to share it with other layers. I also don't see a reason why an adapter should output a graph object or, even if it does, why it should transmit the graph via an internal representation defined in the core layer. Practical questions appear, such as: how do we prevent the adapter layer from editing the graph after the input process if the graph was constructed by it and it holds the reference?

A more practical approach would be that the data transmittance was done via a simple interface. The use of a simple object to transfer data from adapter to core is good, I think; it's just that the tuple should be replaced by a more explicitly defined class. Even if a graph is needed in an adapter for whatever reason, in the end the graph should still be transmitted via the DTO.
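One possible shape for such a class, as a sketch only: `typing.NamedTuple` gives named, documented fields while instances remain ordinary tuples, so existing tuple-based adapters keep working. `NodeDTO`, `EdgeDTO`, and `ingest` are hypothetical names, and the field order is an assumption.

```python
from typing import NamedTuple


class NodeDTO(NamedTuple):
    """Hypothetical transfer object for nodes."""
    node_id: str
    label: str
    properties: dict


class EdgeDTO(NamedTuple):
    """Hypothetical transfer object for edges."""
    edge_id: str
    source_id: str
    target_id: str
    label: str
    properties: dict


def ingest(nodes):
    """Core-side entry point: accepts DTOs or plain 3-tuples alike."""
    return [NodeDTO(*node) for node in nodes]


ingested = ingest([
    NodeDTO("P1", "protein", {}),
    ("P2", "protein", {"taxon": "9606"}),
])
```

A tuple with the wrong number of elements fails in `NodeDTO(*node)` with a `TypeError`, which is the kind of human error an explicit class is meant to catch.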

The main objective is to reduce the redundancies in definition of the KG components that are currently needed both in the schema configuration and in the adapter code itself. Ideally, the KG schema would only be defined once, and all manual typing we currently do to link data streams to concepts in the KG would instead be automatic. Logically, it makes sense to do the general definition in the YAML, as this is the dedicated location for definitions. Consequently, it would be most streamlined if the adapter code could use the YAML instead of needing to be manually adjusted to changes of the definition. I.e., the adapter code could write itself following some rules, which means meta-programming. This includes the data structure of the adapter output stream into BioCypher, which is related to the internal representation as described above. Say we change a class in the schema definition to include an additional property: this change would somehow have to be reflected in the adapter code, such that the output of the adapter includes that property as well.

It seems to me that you want to extend the scope of the KG schema from "adapter to harmonisation" to "source data to harmonisation", so that a general user only needs to focus on that schema without knowing any programming.
If that is correct, then I suggest you rather consider this:

Have a generic adapter that covers a wide range of data format with dynamic property mapping configurable via the KG schema.
Merge the generic adapter with the core to form one unified layer

For an advanced user who wants to customise the adapter:

The unified layer has an optional argument for an advanced user to supply their customised adaption logic
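A minimal sketch of what this could look like for columnar input, assuming the mapping has already been read from the KG schema. The `MAPPING` layout, the column names, and the `read_nodes`/`default_adapt` functions are all hypothetical.

```python
import csv
import io

# Hypothetical mapping, as it might be read from the KG schema (YAML).
MAPPING = {
    "label": "protein",
    "id_column": "accession",
    "properties": {"taxon": "organism_id", "sequence": "seq"},
}


def default_adapt(row: dict, mapping: dict) -> tuple:
    """Build a node 3-tuple from one input row using the mapping."""
    properties = {
        prop: row[column] for prop, column in mapping["properties"].items()
    }
    return (row[mapping["id_column"]], mapping["label"], properties)


def read_nodes(handle, mapping: dict, adapt=default_adapt):
    """Generic reader; `adapt` lets advanced users plug in custom logic."""
    for row in csv.DictReader(handle):
        yield adapt(row, mapping)


data = io.StringIO("accession,organism_id,seq\nP1,9606,MKV\nP2,10090,MAL\n")
nodes = list(read_nodes(data, MAPPING))
```

Adding a property to the mapping changes the adapter output without touching any code, and the optional `adapt` argument covers the advanced-user case.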

I don't think we need meta-programming when the above could be implemented without it, unless you want performance optimisation, which is too early at this stage; otherwise it just complicates the project.

A side question, do you also plan to have one unified schema combining the BioCypher config and KG schema which ultimately defines how a KG is produced and BioCypher is merely a tool to manufacture it?

slobentanzer Oct 31, 2024
Maintainer

A more practical approach would be that the data transmittance was done via simple interface.

I had a similar idea once, but didn't have time to follow up. You would just create two classes for nodes and edges that have parameters < ID, label, properties > (and likewise for edges)?

I don't understand what you suggest with the generic adapter; an adapter is just one modular part of a pipeline (which is the main unit of organisation in BioCypher). Since @ecarrenolozano asked a question that also confuses adapter and pipeline (https://github.com/orgs/biocypher/discussions/387#discussion-7400403), I think this needs to be better explained in the documentation.

The individual adapters need to be agnostic of the downstream events. Their only purpose is to "know" the resource and harmonise the data stream to be conformal with what the core expects. The schema technically is not part of the adapter, although they need to be synchronised in order to yield a sensible KG (pipeline output). This synchronisation is what ultimately should be automatic (currently is manual). If you consider the graphical abstract on the home page of the docs, you can see this distinction (adapters, ontologies, and configuration are separate parts).

Let's not get too deep into the metaprogramming discussion until we have clarified the basics.

kpto Oct 31, 2024

I had a similar idea once, but didn't have time to follow up. You would just create two classes for nodes and edges that have parameters < ID, label, properties > (and likewise for edges)?

Yes

I don't understand what you suggest with the generic adapter; an adapter is just one modular part of a pipeline (which is the main unit of organisation in BioCypher). Since @ecarrenolozano asked a question that also confuses adapter and pipeline (https://github.com/orgs/biocypher/discussions/387#discussion-7400403), I think this needs to be better explained in the documentation.

An uber adapter that can read many formats and understands a unified KG schema, so that the mapping from source data (let's say columnar data, but I want to be more generic here) to node/edge DTO is dynamically configured by the unified KG schema. It does everything you mention:

"know" the resource and harmonise the data stream to be conformal with what the core expects.
This synchronisation is what ultimately should be automatic

The individual adapters need to be agnostic of the downstream events.

But you propose an internal representation that could be used by the adapters which contradicts this request.

slobentanzer Oct 31, 2024
Maintainer

Re point 1: this is straightforward, but would have to be benchmarked in relation to the existing approach. Tuples are a very efficient data structure, and a class adds some significant overhead. We need to make sure we don't slow down the process even more, as for large DBs this already takes a while.

Re point 2: this sounds interesting, and I'd be happy to discuss down the line. Seems like a subject that is some time in the future, anyways.

But you propose an internal representation that could be used by the adapters which contradicts this request.

No, I do not propose that. I am saying that (in the simplest case) the data structure used by the adapters and the internal format could be the same thing, i.e., collections of tuples. This does not suggest entanglement. Deciding to store collections of tuples as the internal representation of the core does not imply that the output stream of the adapters needs to be in any way aligned, nor does it need to be bidirectional. We could have dedicated classes as an interface in the adapters and still represent the KG in the core as collections of tuples.

Adapters: provide an output stream that is ingested by the core. Can be collections of tuples, or something else.

Core: has an internal representation of a KG built from adapter output streams. Can be collections of tuples, or something else.

It is always justified to start with the simplest implementation, and increase complexity only if warranted. Primitive types are simple and efficient in Python, and moving to more complex data structures should be thoroughly tested for need and justified by increased safety, user-friendliness, or other considerations.
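A minimal sketch of how the comparison could be set up with the standard library. The `Node` dataclass is a stand-in with assumed fields; absolute numbers depend on the machine and Python version, so none are given here.

```python
import sys
import timeit
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Stand-in class with assumed fields, used only for the comparison."""
    node_id: str
    label: str
    properties: dict


def make_tuple():
    return ("P1", "protein", {})


def make_dataclass():
    return Node("P1", "protein", {})


def benchmark(number: int = 100_000) -> dict:
    """Time object creation and report shallow object sizes."""
    return {
        "tuple_s": timeit.timeit(make_tuple, number=number),
        "dataclass_s": timeit.timeit(make_dataclass, number=number),
        # Shallow sizes only: referenced objects are not counted.
        "tuple_bytes": sys.getsizeof(make_tuple()),
        "dataclass_bytes": sys.getsizeof(make_dataclass()),
    }


results = benchmark()
```

For a realistic picture, the same measurement would have to be repeated on real adapter streams and with deep memory measurements rather than `sys.getsizeof`.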

slobentanzer Oct 31, 2024
Maintainer

See also this (empty) milestone from a while ago: https://github.com/biocypher/biocypher/milestone/4

ryxx0811 · 2024-11-15T18:02:22Z

ryxx0811
Nov 15, 2024

import networkx as nx
import time
from pympler import asizeof
from biocypher._logger import logger
import sys
import csv

# import data from csv files

# generate edges and nodes
with open('experiment/dataset_30_nodes_proteins.csv', mode='r') as file:
    reader = csv.reader(file)
    nodes = list(tuple(row) for row in reader)[1:]
with open('experiment/dataset_30_edges_interactions.csv', mode='r') as file:
    reader = csv.reader(file)
    edges = list(tuple(row) for row in reader)[1:]

print(len(nodes))
print(len(edges))
print(nodes)
print(edges)

30
47
[('K8H6Q4', 'uniprot_protein', "{'sequence': 'IIRADEMNSLSLGHDMSSWEAFASQQCPISQ', 'description': 'Lorem ipsum gbrzf', 'taxon': '8366', 'mass': 9564}"), ('M4O5Z6', 'uniprot_isoform', "{'sequence': 'EYFCCIWWKAEVWDHGRWRMIAWWFKMMWGSI', 'description': 'Lorem ipsum fufca', 'taxon': '422'}"), ('283708', 'entrez_protein', "{'sequence': 'RKILVLKYLIQKWYNYLCFPITQCMDMYAIIHIEIWKTR', 'description': 'Lorem ipsum syciq', 'taxon': '9606'}"), ('C0M0E2', 'uniprot_protein', "{'sequence': 'CEAFTRELWNNWTYKVWVFSQSNFGVDS', 'description': 'Lorem ipsum anylg', 'taxon': '6356'}"), ('R0W4F4', 'uniprot_isoform', "{'sequence': 'LGFRAESGVCEQVTWWMMHAHKVMMESPHNHGKEEHTKHQPIS', 'description': 'Lorem ipsum gkixn', 'taxon': '6049'}"), ('18338', 'entrez_protein', "{'sequence': 'LPGVVMCREERIMLYNRAPRDPPNWNMRTHSVPRCY', 'description': 'Lorem ipsum lngue', 'taxon': '9606'}"), ('Z1E2A6', 'uniprot_protein', "{'sequence': 'NKKMYKGAQFEIILQSYEQPSNFSKLWAGLAVSNGLRY', 'description': 'Lorem ipsum omjjm', 'taxon': '8287', 'mass': 9178}"), ('S8I1B6', 'uniprot_isoform', "{'sequence': 'CFCGSQMAAEPHGLTQDLCEPKCCMMQMHLPMVVNPIPQWLF', 'description': 'Lorem ipsum alpua', 'taxon': '9215'}"), ('118506', 'entrez_protein', "{'sequence': 'QQWKMEKMRKNMNQQFGDSMPYVSIIP', 'description': 'Lorem ipsum wqkpa', 'taxon': '9606'}"), ('R9Q0Z5', 'uniprot_protein', "{'sequence': 'VVLLWEFEAHQPWWAGRTLENLGMVHDICGMRNGCFAITLPRTNPQH', 'description': 'Lorem ipsum mcmgr', 'taxon': '8929', 'mass': 5}"), ('X9T0C1', 'uniprot_isoform', "{'sequence': 'YYFQMAIEETWFCMIWLPFNWWMKCYDLVRQPACFFQVQLFKVY', 'description': 'Lorem ipsum aggae', 'taxon': '6239'}"), ('551342', 'entrez_protein', "{'sequence': 'TYCEIWGPGIINWNDECIQQRTGPMWTQDKIYDISKIAMKWSCLCCG', 'description': 'Lorem ipsum nxvrc', 'taxon': '9606'}"), ('C6I8L6', 'uniprot_protein', "{'sequence': 'PESFRSWDFSSWRCYEEDQRTWQMGFIVQVTPKCARPTLMSMQGFAL', 'description': 'Lorem ipsum yynyt', 'taxon': '3431'}"), ('O2Q6R6', 'uniprot_isoform', "{'sequence': 'TYWHPMLHKVQLEHRWVTQLNGECSIECNSMNALVGKTVLKLSQTA', 'description': 
'Lorem ipsum sqcfw', 'taxon': '7014'}"), ('602260', 'entrez_protein', "{'sequence': 'FSCLFPNEYGICDQDGKNPLPPDCMYVPKYAQYAFEMFDQCI', 'description': 'Lorem ipsum ufdiy', 'taxon': '9606'}"), ('E0I3W8', 'uniprot_protein', "{'sequence': 'WQHDEKCHIEPFAELCDFGVWHC', 'description': 'Lorem ipsum hkaom', 'taxon': '9894'}"), ('C3D2H1', 'uniprot_isoform', "{'sequence': 'SNFQVKNWMRRAMYSWACSKFCCWVTPYKDDEQIAVEQ', 'description': 'Lorem ipsum noqxo', 'taxon': '27'}"), ('351968', 'entrez_protein', "{'sequence': 'PPTWVYNEEGNSLVHIFVSSHMLKHREVWFDWTWSHNHHQRWCY', 'description': 'Lorem ipsum iyjjl', 'taxon': '9606'}"), ('S0Z5V3', 'uniprot_protein', "{'sequence': 'PEYIFAHFEMINGSCVDPFADLYKHPHMMLPQVAQLDKYCASRQ', 'description': 'Lorem ipsum csetj', 'taxon': '307', 'mass': 5488}"), ('Z4M7Y6', 'uniprot_isoform', "{'sequence': 'ALSFRPYIWFALTRYWEKLPTCHYFQLA', 'description': 'Lorem ipsum ymkfm', 'taxon': '3346', 'mass': 159}"), ('264819', 'entrez_protein', "{'sequence': 'VRLYLKRHYRDGPSVINDPAPANWTAVGSLVLRNE', 'description': 'Lorem ipsum ypubi', 'taxon': '9606'}"), ('J8U4W9', 'uniprot_protein', "{'sequence': 'VPVQDKWDNYKYVGAWPEYAWEYYKLTWSKAMGND', 'description': 'Lorem ipsum bmygu', 'taxon': '3028'}"), ('V6K5M3', 'uniprot_isoform', "{'sequence': 'CLCCRHWYPRFFMVNGEFNNLDYHGDY', 'description': 'Lorem ipsum jgblf', 'taxon': '1277', 'mass': 3347}"), ('704945', 'entrez_protein', "{'sequence': 'RYFVELSQYETFEKTWMMFDMWEFSQF', 'description': 'Lorem ipsum fjvfh', 'taxon': '9606'}"), ('R1R4K3', 'uniprot_protein', "{'sequence': 'LHDQPSTGPEVMISNRPDGWDR', 'description': 'Lorem ipsum wgndi', 'taxon': '570', 'mass': 251}"), ('M6F1F8', 'uniprot_isoform', "{'sequence': 'DNFKTTNNYAWEMYYLGSLHKHAGYQVFP', 'description': 'Lorem ipsum uddvj', 'taxon': '761', 'mass': 1024}"), ('393815', 'entrez_protein', "{'sequence': 'NENHGGNFPHQFAVQSNTILKVDYGFTRCQPMLMPGG', 'description': 'Lorem ipsum ibvwg', 'taxon': '9606'}"), ('S4B5R2', 'uniprot_protein', "{'sequence': 'PVENEIAENRHCYSKQKIRLYWQNPYFNNWHWQFVWLR', 'description': 'Lorem ipsum 
ggqzo', 'taxon': '8454'}"), ('A3L9U5', 'uniprot_isoform', "{'sequence': 'QTTYGQVYLSIIAVPEYLDAQFSYASIGECSVNSWTTKGCPFWLM', 'description': 'Lorem ipsum xvhtl', 'taxon': '225'}"), ('315539', 'entrez_protein', "{'sequence': 'CHSLTVKTHNCLRVSVSSFNIFVDGEMCISIKEDKYWACHE', 'description': 'Lorem ipsum bdcti', 'taxon': '9606'}")]
[('', 'K8H6Q4', '283708', 'interacts_with', "{'method': 'Lorem ipsum bfhzl'}"), ('intact534397', 'K8H6Q4', 'O2Q6R6', 'interacts_with', "{'method': 'Lorem ipsum emzwu'}"), ('', 'K8H6Q4', 'Z4M7Y6', 'interacts_with', "{'source': 'signor', 'method': 'Lorem ipsum jkqge'}"), ('', 'K8H6Q4', '704945', 'interacts_with', "{'source': 'signor', 'method': 'Lorem ipsum iivyd'}"), ('intact717839', 'M4O5Z6', '602260', 'interacts_with', '{}'), ('', 'C0M0E2', '18338', 'interacts_with', "{'source': 'signor'}"), ('intact330395', 'C0M0E2', 'R9Q0Z5', 'interacts_with', "{'method': 'Lorem ipsum mdevp'}"), ('intact630247', 'R0W4F4', 'X9T0C1', 'interacts_with', "{'source': 'intact'}"), ('', '18338', '283708', 'interacts_with', "{'method': 'Lorem ipsum vsbme'}"), ('intact485599', '18338', 'C6I8L6', 'interacts_with', "{'source': 'intact', 'method': 'Lorem ipsum zijjn'}"), ('', '18338', 'A3L9U5', 'interacts_with', "{'source': 'signor'}"), ('intact280091', 'Z1E2A6', 'M4O5Z6', 'interacts_with', '{}'), ('', 'Z1E2A6', 'S8I1B6', 'interacts_with', "{'method': 'Lorem ipsum kfwhm'}"), ('intact966772', 'S8I1B6', 'R1R4K3', 'interacts_with', "{'method': 'Lorem ipsum daneh'}"), ('', 'S8I1B6', 'S4B5R2', 'interacts_with', "{'source': 'signor'}"), ('intact494994', 'S8I1B6', '315539', 'interacts_with', '{}'), ('intact910760', 'R9Q0Z5', 'M4O5Z6', 'interacts_with', '{}'), ('intact442842', 'R9Q0Z5', 'Z4M7Y6', 'interacts_with', "{'source': 'intact'}"), ('intact139611', 'X9T0C1', '315539', 'interacts_with', "{'method': 'Lorem ipsum sqxxv'}"), ('intact62376', '551342', 'C3D2H1', 'interacts_with', "{'source': 'signor'}"), ('intact92933', 'C6I8L6', 'R1R4K3', 'interacts_with', '{}'), ('intact593862', 'O2Q6R6', '18338', 'interacts_with', "{'source': 'intact'}"), ('intact359477', '602260', 'M6F1F8', 'interacts_with', "{'source': 'intact', 'method': 'Lorem ipsum ebflb'}"), ('', 'E0I3W8', 'V6K5M3', 'interacts_with', "{'method': 'Lorem ipsum aqvwd'}"), ('', 'C3D2H1', 'O2Q6R6', 'interacts_with', "{'source': 'intact'}"), 
('', '351968', '18338', 'interacts_with', "{'method': 'Lorem ipsum tofuo'}"), ('', '351968', '118506', 'interacts_with', "{'source': 'intact'}"), ('intact845029', '351968', 'O2Q6R6', 'interacts_with', "{'source': 'signor', 'method': 'Lorem ipsum bszvg'}"), ('', '351968', 'R1R4K3', 'interacts_with', "{'source': 'intact', 'method': 'Lorem ipsum whbrr'}"), ('intact667986', 'S0Z5V3', '351968', 'interacts_with', "{'source': 'intact'}"), ('', 'Z4M7Y6', 'M4O5Z6', 'interacts_with', '{}'), ('', 'Z4M7Y6', 'S0Z5V3', 'interacts_with', "{'method': 'Lorem ipsum ynmel'}"), ('', 'Z4M7Y6', 'R1R4K3', 'interacts_with', "{'source': 'signor', 'method': 'Lorem ipsum mxwey'}"), ('', '264819', 'R9Q0Z5', 'interacts_with', '{}'), ('intact733073', '264819', '704945', 'interacts_with', "{'method': 'Lorem ipsum lrchc'}"), ('intact228087', 'J8U4W9', 'R1R4K3', 'interacts_with', "{'source': 'intact', 'method': 'Lorem ipsum vcmay'}"), ('', 'V6K5M3', '351968', 'interacts_with', "{'source': 'signor', 'method': 'Lorem ipsum ayaeg'}"), ('', '704945', 'R1R4K3', 'interacts_with', "{'source': 'intact'}"), ('intact117923', 'R1R4K3', 'Z1E2A6', 'interacts_with', "{'source': 'intact', 'method': 'Lorem ipsum pdblj'}"), ('', 'R1R4K3', 'S0Z5V3', 'interacts_with', "{'source': 'intact'}"), ('', 'M6F1F8', 'C0M0E2', 'interacts_with', "{'method': 'Lorem ipsum zfnod'}"), ('intact908082', 'M6F1F8', 'Z4M7Y6', 'interacts_with', "{'source': 'signor'}"), ('intact845342', '393815', 'V6K5M3', 'interacts_with', '{}'), ('', 'A3L9U5', '283708', 'interacts_with', "{'source': 'intact'}"), ('intact170930', 'A3L9U5', 'V6K5M3', 'interacts_with', "{'source': 'intact', 'method': 'Lorem ipsum ztxns'}"), ('', 'A3L9U5', '315539', 'interacts_with', "{'method': 'Lorem ipsum mhrxf'}"), ('', '315539', 'V6K5M3', 'interacts_with', "{'source': 'intact', 'method': 'Lorem ipsum ontfr'}")]

translate nodes and edges

# Translate the raw tuples into BioCypherNode and BioCypherEdge objects.
# Properties arrive as stringified dicts, so they are parsed with ast.literal_eval.
import ast
from biocypher._create import BioCypherNode, BioCypherEdge
tnodes = [
    BioCypherNode(node_id=node[0],
                  node_label=node[1],
                  properties=ast.literal_eval(node[2]))
    for node in nodes
]
tedges = [
    BioCypherEdge(source_id=edge[1],
                  target_id=edge[2],
                  relationship_label=edge[3],
                  relationship_id=edge[0],
                  properties=ast.literal_eval(edge[4]))
    for edge in edges
]

print(len(tnodes))
print(len(tedges))
print(tnodes)
print(tedges)

30
47
[BioCypherNode(node_id='K8H6Q4', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'IIRADEMNSLSLGHDMSSWEAFASQQCPISQ', 'description': 'Lorem ipsum gbrzf', 'taxon': '8366', 'mass': 9564, 'id': 'K8H6Q4', 'preferred_id': 'id'}), BioCypherNode(node_id='M4O5Z6', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'EYFCCIWWKAEVWDHGRWRMIAWWFKMMWGSI', 'description': 'Lorem ipsum fufca', 'taxon': '422', 'id': 'M4O5Z6', 'preferred_id': 'id'}), BioCypherNode(node_id='283708', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'RKILVLKYLIQKWYNYLCFPITQCMDMYAIIHIEIWKTR', 'description': 'Lorem ipsum syciq', 'taxon': '9606', 'id': '283708', 'preferred_id': 'id'}), BioCypherNode(node_id='C0M0E2', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'CEAFTRELWNNWTYKVWVFSQSNFGVDS', 'description': 'Lorem ipsum anylg', 'taxon': '6356', 'id': 'C0M0E2', 'preferred_id': 'id'}), BioCypherNode(node_id='R0W4F4', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'LGFRAESGVCEQVTWWMMHAHKVMMESPHNHGKEEHTKHQPIS', 'description': 'Lorem ipsum gkixn', 'taxon': '6049', 'id': 'R0W4F4', 'preferred_id': 'id'}), BioCypherNode(node_id='18338', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'LPGVVMCREERIMLYNRAPRDPPNWNMRTHSVPRCY', 'description': 'Lorem ipsum lngue', 'taxon': '9606', 'id': '18338', 'preferred_id': 'id'}), BioCypherNode(node_id='Z1E2A6', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'NKKMYKGAQFEIILQSYEQPSNFSKLWAGLAVSNGLRY', 'description': 'Lorem ipsum omjjm', 'taxon': '8287', 'mass': 9178, 'id': 'Z1E2A6', 'preferred_id': 'id'}), BioCypherNode(node_id='S8I1B6', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'CFCGSQMAAEPHGLTQDLCEPKCCMMQMHLPMVVNPIPQWLF', 'description': 'Lorem ipsum alpua', 'taxon': '9215', 'id': 'S8I1B6', 'preferred_id': 'id'}), BioCypherNode(node_id='118506', node_label='entrez_protein', 
preferred_id='id', properties={'sequence': 'QQWKMEKMRKNMNQQFGDSMPYVSIIP', 'description': 'Lorem ipsum wqkpa', 'taxon': '9606', 'id': '118506', 'preferred_id': 'id'}), BioCypherNode(node_id='R9Q0Z5', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'VVLLWEFEAHQPWWAGRTLENLGMVHDICGMRNGCFAITLPRTNPQH', 'description': 'Lorem ipsum mcmgr', 'taxon': '8929', 'mass': 5, 'id': 'R9Q0Z5', 'preferred_id': 'id'}), BioCypherNode(node_id='X9T0C1', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'YYFQMAIEETWFCMIWLPFNWWMKCYDLVRQPACFFQVQLFKVY', 'description': 'Lorem ipsum aggae', 'taxon': '6239', 'id': 'X9T0C1', 'preferred_id': 'id'}), BioCypherNode(node_id='551342', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'TYCEIWGPGIINWNDECIQQRTGPMWTQDKIYDISKIAMKWSCLCCG', 'description': 'Lorem ipsum nxvrc', 'taxon': '9606', 'id': '551342', 'preferred_id': 'id'}), BioCypherNode(node_id='C6I8L6', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'PESFRSWDFSSWRCYEEDQRTWQMGFIVQVTPKCARPTLMSMQGFAL', 'description': 'Lorem ipsum yynyt', 'taxon': '3431', 'id': 'C6I8L6', 'preferred_id': 'id'}), BioCypherNode(node_id='O2Q6R6', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'TYWHPMLHKVQLEHRWVTQLNGECSIECNSMNALVGKTVLKLSQTA', 'description': 'Lorem ipsum sqcfw', 'taxon': '7014', 'id': 'O2Q6R6', 'preferred_id': 'id'}), BioCypherNode(node_id='602260', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'FSCLFPNEYGICDQDGKNPLPPDCMYVPKYAQYAFEMFDQCI', 'description': 'Lorem ipsum ufdiy', 'taxon': '9606', 'id': '602260', 'preferred_id': 'id'}), BioCypherNode(node_id='E0I3W8', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'WQHDEKCHIEPFAELCDFGVWHC', 'description': 'Lorem ipsum hkaom', 'taxon': '9894', 'id': 'E0I3W8', 'preferred_id': 'id'}), BioCypherNode(node_id='C3D2H1', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 
'SNFQVKNWMRRAMYSWACSKFCCWVTPYKDDEQIAVEQ', 'description': 'Lorem ipsum noqxo', 'taxon': '27', 'id': 'C3D2H1', 'preferred_id': 'id'}), BioCypherNode(node_id='351968', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'PPTWVYNEEGNSLVHIFVSSHMLKHREVWFDWTWSHNHHQRWCY', 'description': 'Lorem ipsum iyjjl', 'taxon': '9606', 'id': '351968', 'preferred_id': 'id'}), BioCypherNode(node_id='S0Z5V3', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'PEYIFAHFEMINGSCVDPFADLYKHPHMMLPQVAQLDKYCASRQ', 'description': 'Lorem ipsum csetj', 'taxon': '307', 'mass': 5488, 'id': 'S0Z5V3', 'preferred_id': 'id'}), BioCypherNode(node_id='Z4M7Y6', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'ALSFRPYIWFALTRYWEKLPTCHYFQLA', 'description': 'Lorem ipsum ymkfm', 'taxon': '3346', 'mass': 159, 'id': 'Z4M7Y6', 'preferred_id': 'id'}), BioCypherNode(node_id='264819', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'VRLYLKRHYRDGPSVINDPAPANWTAVGSLVLRNE', 'description': 'Lorem ipsum ypubi', 'taxon': '9606', 'id': '264819', 'preferred_id': 'id'}), BioCypherNode(node_id='J8U4W9', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'VPVQDKWDNYKYVGAWPEYAWEYYKLTWSKAMGND', 'description': 'Lorem ipsum bmygu', 'taxon': '3028', 'id': 'J8U4W9', 'preferred_id': 'id'}), BioCypherNode(node_id='V6K5M3', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'CLCCRHWYPRFFMVNGEFNNLDYHGDY', 'description': 'Lorem ipsum jgblf', 'taxon': '1277', 'mass': 3347, 'id': 'V6K5M3', 'preferred_id': 'id'}), BioCypherNode(node_id='704945', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'RYFVELSQYETFEKTWMMFDMWEFSQF', 'description': 'Lorem ipsum fjvfh', 'taxon': '9606', 'id': '704945', 'preferred_id': 'id'}), BioCypherNode(node_id='R1R4K3', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'LHDQPSTGPEVMISNRPDGWDR', 'description': 'Lorem ipsum wgndi', 'taxon': 
'570', 'mass': 251, 'id': 'R1R4K3', 'preferred_id': 'id'}), BioCypherNode(node_id='M6F1F8', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'DNFKTTNNYAWEMYYLGSLHKHAGYQVFP', 'description': 'Lorem ipsum uddvj', 'taxon': '761', 'mass': 1024, 'id': 'M6F1F8', 'preferred_id': 'id'}), BioCypherNode(node_id='393815', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'NENHGGNFPHQFAVQSNTILKVDYGFTRCQPMLMPGG', 'description': 'Lorem ipsum ibvwg', 'taxon': '9606', 'id': '393815', 'preferred_id': 'id'}), BioCypherNode(node_id='S4B5R2', node_label='uniprot_protein', preferred_id='id', properties={'sequence': 'PVENEIAENRHCYSKQKIRLYWQNPYFNNWHWQFVWLR', 'description': 'Lorem ipsum ggqzo', 'taxon': '8454', 'id': 'S4B5R2', 'preferred_id': 'id'}), BioCypherNode(node_id='A3L9U5', node_label='uniprot_isoform', preferred_id='id', properties={'sequence': 'QTTYGQVYLSIIAVPEYLDAQFSYASIGECSVNSWTTKGCPFWLM', 'description': 'Lorem ipsum xvhtl', 'taxon': '225', 'id': 'A3L9U5', 'preferred_id': 'id'}), BioCypherNode(node_id='315539', node_label='entrez_protein', preferred_id='id', properties={'sequence': 'CHSLTVKTHNCLRVSVSSFNIFVDGEMCISIKEDKYWACHE', 'description': 'Lorem ipsum bdcti', 'taxon': '9606', 'id': '315539', 'preferred_id': 'id'})]
[BioCypherEdge(source_id='K8H6Q4', target_id='283708', relationship_label='interacts_with', relationship_id='', properties={'method': 'Lorem ipsum bfhzl'}), BioCypherEdge(source_id='K8H6Q4', target_id='O2Q6R6', relationship_label='interacts_with', relationship_id='intact534397', properties={'method': 'Lorem ipsum emzwu'}), BioCypherEdge(source_id='K8H6Q4', target_id='Z4M7Y6', relationship_label='interacts_with', relationship_id='', properties={'source': 'signor', 'method': 'Lorem ipsum jkqge'}), BioCypherEdge(source_id='K8H6Q4', target_id='704945', relationship_label='interacts_with', relationship_id='', properties={'source': 'signor', 'method': 'Lorem ipsum iivyd'}), BioCypherEdge(source_id='M4O5Z6', target_id='602260', relationship_label='interacts_with', relationship_id='intact717839', properties={}), BioCypherEdge(source_id='C0M0E2', target_id='18338', relationship_label='interacts_with', relationship_id='', properties={'source': 'signor'}), BioCypherEdge(source_id='C0M0E2', target_id='R9Q0Z5', relationship_label='interacts_with', relationship_id='intact330395', properties={'method': 'Lorem ipsum mdevp'}), BioCypherEdge(source_id='R0W4F4', target_id='X9T0C1', relationship_label='interacts_with', relationship_id='intact630247', properties={'source': 'intact'}), BioCypherEdge(source_id='18338', target_id='283708', relationship_label='interacts_with', relationship_id='', properties={'method': 'Lorem ipsum vsbme'}), BioCypherEdge(source_id='18338', target_id='C6I8L6', relationship_label='interacts_with', relationship_id='intact485599', properties={'source': 'intact', 'method': 'Lorem ipsum zijjn'}), BioCypherEdge(source_id='18338', target_id='A3L9U5', relationship_label='interacts_with', relationship_id='', properties={'source': 'signor'}), BioCypherEdge(source_id='Z1E2A6', target_id='M4O5Z6', relationship_label='interacts_with', relationship_id='intact280091', properties={}), BioCypherEdge(source_id='Z1E2A6', target_id='S8I1B6', 
relationship_label='interacts_with', relationship_id='', properties={'method': 'Lorem ipsum kfwhm'}), BioCypherEdge(source_id='S8I1B6', target_id='R1R4K3', relationship_label='interacts_with', relationship_id='intact966772', properties={'method': 'Lorem ipsum daneh'}), BioCypherEdge(source_id='S8I1B6', target_id='S4B5R2', relationship_label='interacts_with', relationship_id='', properties={'source': 'signor'}), BioCypherEdge(source_id='S8I1B6', target_id='315539', relationship_label='interacts_with', relationship_id='intact494994', properties={}), BioCypherEdge(source_id='R9Q0Z5', target_id='M4O5Z6', relationship_label='interacts_with', relationship_id='intact910760', properties={}), BioCypherEdge(source_id='R9Q0Z5', target_id='Z4M7Y6', relationship_label='interacts_with', relationship_id='intact442842', properties={'source': 'intact'}), BioCypherEdge(source_id='X9T0C1', target_id='315539', relationship_label='interacts_with', relationship_id='intact139611', properties={'method': 'Lorem ipsum sqxxv'}), BioCypherEdge(source_id='551342', target_id='C3D2H1', relationship_label='interacts_with', relationship_id='intact62376', properties={'source': 'signor'}), BioCypherEdge(source_id='C6I8L6', target_id='R1R4K3', relationship_label='interacts_with', relationship_id='intact92933', properties={}), BioCypherEdge(source_id='O2Q6R6', target_id='18338', relationship_label='interacts_with', relationship_id='intact593862', properties={'source': 'intact'}), BioCypherEdge(source_id='602260', target_id='M6F1F8', relationship_label='interacts_with', relationship_id='intact359477', properties={'source': 'intact', 'method': 'Lorem ipsum ebflb'}), BioCypherEdge(source_id='E0I3W8', target_id='V6K5M3', relationship_label='interacts_with', relationship_id='', properties={'method': 'Lorem ipsum aqvwd'}), BioCypherEdge(source_id='C3D2H1', target_id='O2Q6R6', relationship_label='interacts_with', relationship_id='', properties={'source': 'intact'}), BioCypherEdge(source_id='351968', 
target_id='18338', relationship_label='interacts_with', relationship_id='', properties={'method': 'Lorem ipsum tofuo'}), BioCypherEdge(source_id='351968', target_id='118506', relationship_label='interacts_with', relationship_id='', properties={'source': 'intact'}), BioCypherEdge(source_id='351968', target_id='O2Q6R6', relationship_label='interacts_with', relationship_id='intact845029', properties={'source': 'signor', 'method': 'Lorem ipsum bszvg'}), BioCypherEdge(source_id='351968', target_id='R1R4K3', relationship_label='interacts_with', relationship_id='', properties={'source': 'intact', 'method': 'Lorem ipsum whbrr'}), BioCypherEdge(source_id='S0Z5V3', target_id='351968', relationship_label='interacts_with', relationship_id='intact667986', properties={'source': 'intact'}), BioCypherEdge(source_id='Z4M7Y6', target_id='M4O5Z6', relationship_label='interacts_with', relationship_id='', properties={}), BioCypherEdge(source_id='Z4M7Y6', target_id='S0Z5V3', relationship_label='interacts_with', relationship_id='', properties={'method': 'Lorem ipsum ynmel'}), BioCypherEdge(source_id='Z4M7Y6', target_id='R1R4K3', relationship_label='interacts_with', relationship_id='', properties={'source': 'signor', 'method': 'Lorem ipsum mxwey'}), BioCypherEdge(source_id='264819', target_id='R9Q0Z5', relationship_label='interacts_with', relationship_id='', properties={}), BioCypherEdge(source_id='264819', target_id='704945', relationship_label='interacts_with', relationship_id='intact733073', properties={'method': 'Lorem ipsum lrchc'}), BioCypherEdge(source_id='J8U4W9', target_id='R1R4K3', relationship_label='interacts_with', relationship_id='intact228087', properties={'source': 'intact', 'method': 'Lorem ipsum vcmay'}), BioCypherEdge(source_id='V6K5M3', target_id='351968', relationship_label='interacts_with', relationship_id='', properties={'source': 'signor', 'method': 'Lorem ipsum ayaeg'}), BioCypherEdge(source_id='704945', target_id='R1R4K3', relationship_label='interacts_with', 
relationship_id='', properties={'source': 'intact'}), BioCypherEdge(source_id='R1R4K3', target_id='Z1E2A6', relationship_label='interacts_with', relationship_id='intact117923', properties={'source': 'intact', 'method': 'Lorem ipsum pdblj'}), BioCypherEdge(source_id='R1R4K3', target_id='S0Z5V3', relationship_label='interacts_with', relationship_id='', properties={'source': 'intact'}), BioCypherEdge(source_id='M6F1F8', target_id='C0M0E2', relationship_label='interacts_with', relationship_id='', properties={'method': 'Lorem ipsum zfnod'}), BioCypherEdge(source_id='M6F1F8', target_id='Z4M7Y6', relationship_label='interacts_with', relationship_id='intact908082', properties={'source': 'signor'}), BioCypherEdge(source_id='393815', target_id='V6K5M3', relationship_label='interacts_with', relationship_id='intact845342', properties={}), BioCypherEdge(source_id='A3L9U5', target_id='283708', relationship_label='interacts_with', relationship_id='', properties={'source': 'intact'}), BioCypherEdge(source_id='A3L9U5', target_id='V6K5M3', relationship_label='interacts_with', relationship_id='intact170930', properties={'source': 'intact', 'method': 'Lorem ipsum ztxns'}), BioCypherEdge(source_id='A3L9U5', target_id='315539', relationship_label='interacts_with', relationship_id='', properties={'method': 'Lorem ipsum mhrxf'}), BioCypherEdge(source_id='315539', target_id='V6K5M3', relationship_label='interacts_with', relationship_id='', properties={'source': 'intact', 'method': 'Lorem ipsum ontfr'})]
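
The translation step above relies on `ast.literal_eval` to turn the stringified property dictionaries back into Python dicts. Below is a minimal, dependency-free sketch of that step. The `Node` and `Edge` named tuples are hypothetical stand-ins for `BioCypherNode` and `BioCypherEdge`, used only so the example runs without BioCypher installed.

```python
import ast
from typing import NamedTuple


class Node(NamedTuple):
    """Hypothetical stand-in for BioCypherNode (illustration only)."""
    node_id: str
    node_label: str
    properties: dict


class Edge(NamedTuple):
    """Hypothetical stand-in for BioCypherEdge (illustration only)."""
    relationship_id: str
    source_id: str
    target_id: str
    relationship_label: str
    properties: dict


# Raw tuples in the same shape as the imported data:
# three elements for nodes, five for edges, properties as strings.
raw_nodes = [
    ("K8H6Q4", "uniprot_protein", "{'taxon': '8366', 'mass': 9564}"),
    ("283708", "entrez_protein", "{'taxon': '9606'}"),
]
raw_edges = [
    ("", "K8H6Q4", "283708", "interacts_with", "{'method': 'Lorem ipsum bfhzl'}"),
]

# literal_eval only accepts Python literals, so it cannot run arbitrary code
# the way eval() could when parsing the stringified dicts.
nodes_t = [Node(n[0], n[1], ast.literal_eval(n[2])) for n in raw_nodes]
edges_t = [Edge(e[0], e[1], e[2], e[3], ast.literal_eval(e[4])) for e in raw_edges]

print(nodes_t[0].properties["mass"])
print(edges_t[0].properties)
```

Note that the numeric `mass` value comes back as an `int`, while `taxon` stays a string because it was quoted in the source data.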

Dictionary as KG (Adjacency map)

# Using a dictionary (adjacency map) as the internal representation.
import networkx as nx

class BiocypherKG:
    def __init__(self):
        self._KG = {}

    def add_nodes(self, nodes):
        for node in nodes:
            node_id = node.get_id()
            if node_id not in self._KG:
                self._KG[node_id] = {
                    'edges': {},
                    'attributes': {
                        'preferred_id': node.get_preferred_id(),
                        'node_label': node.get_label(),
                        'properties': node.get_properties(),
                    },
                }

    def add_edges(self, edges):
        for edge in edges:
            source_id = edge.get_source_id()
            target_id = edge.get_target_id()
            if source_id not in self._KG:
                # The source node must exist before its edges can be stored.
                raise KeyError(
                    f'Knowledge Graph has no {source_id} key. '
                    f'Call add_nodes() to add the {source_id} node and its properties.'
                )
            # Only the first edge per (source, target) pair is kept.
            if target_id not in self._KG[source_id]['edges']:
                self._KG[source_id]['edges'][target_id] = {
                    'relationship_label': edge.get_label(),
                    'relationship_id': edge.get_id(),
                    'properties': edge.get_properties(),
                }

    def get_KG(self):
        return self._KG

    def to_networkx(self):
        G = nx.DiGraph()
        for node_id, entry in self._KG.items():
            G.add_node(node_id, **entry['attributes'])
            for target_id, edge_attrs in entry['edges'].items():
                G.add_edge(node_id, target_id, **edge_attrs)
        return G

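import time
# asizeof is assumed to come from the pympler package.
from pympler import asizeof

bkd = BiocypherKG()

# Note: the timed span includes the conversion to NetworkX.
t_1 = time.time()
bkd.add_nodes(tnodes)
bkd.add_edges(tedges)
bkd.to_networkx()
t_2 = time.time()
KG = bkd.get_KG()
size = asizeof.asizeof(KG)
print(f'Time(dict):{(t_2-t_1)*1000} ms.')
print(f'Size of KG:{size/1024} kb.')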

Time(dict):0.6196498870849609 ms.
Size of KG:78.28125 kb.
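
To make the shape of this adjacency map easier to see without BioCypher or NetworkX installed, here is a small sketch using plain dicts. The helper names `add_node` and `add_edge` are hypothetical; the nested layout mirrors the class above. It also shows one consequence of keying edges by target ID: a second edge between the same pair of nodes is silently ignored.

```python
# Plain-dict sketch of the adjacency map layout (illustration only).
kg = {}


def add_node(kg, node_id, label, properties):
    """Register a node once; later calls with the same ID are ignored."""
    if node_id not in kg:
        kg[node_id] = {
            "edges": {},
            "attributes": {"node_label": label, "properties": properties},
        }


def add_edge(kg, source_id, target_id, label, properties):
    """Store an edge under its source node, keyed by the target ID."""
    if source_id not in kg:
        raise KeyError(f"unknown source node {source_id!r}")
    edges = kg[source_id]["edges"]
    if target_id not in edges:
        edges[target_id] = {"relationship_label": label, "properties": properties}


add_node(kg, "K8H6Q4", "uniprot_protein", {"taxon": "8366"})
add_node(kg, "283708", "entrez_protein", {"taxon": "9606"})
add_edge(kg, "K8H6Q4", "283708", "interacts_with", {"source": "signor"})
# Same source and target again: this second edge is not stored.
add_edge(kg, "K8H6Q4", "283708", "interacts_with", {"source": "intact"})

print(kg["K8H6Q4"]["edges"])
```

If parallel edges between the same two nodes matter, the inner value would need to be a list of edges (or be keyed by relationship ID) rather than a single dict.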

Tuple(BiocypherNode, BiocypherEdge, BiocypherRelAsEdge) as KG

# Using collections of tuples (BiocypherNode, BiocypherEdge, BiocypherRelAsEdge)
# as the internal representation, loaded into a NetworkX DiGraph.
class BiocypherKG:
    def __init__(self):
        self.G = nx.DiGraph()

    def add_nodes(self, nodes):
        for node in nodes:
            # The graph is keyed by node ID, so test the ID, not the object.
            if node.get_id() not in self.G:
                self.G.add_node(
                    node.get_id(),
                    label=node.get_label(),
                    properties=node.get_properties()
                )

    def add_edges(self, edges):
        for edge in edges:
            # Skip (source, target) pairs that are already connected.
            if not self.G.has_edge(edge.get_source_id(), edge.get_target_id()):
                self.G.add_edge(
                    edge.get_source_id(),
                    edge.get_target_id(),
                    relationship_label=edge.get_label(),
                    relationship_id=edge.get_id(),
                    properties=edge.get_properties()
                )

bkt = BiocypherKG()

t_1 = time.time()
bkt.add_nodes(tnodes)
bkt.add_edges(tedges)
t_2 = time.time()
# Size of the translated objects vs. the raw tuples.
tsize = asizeof.asizeof(tnodes) + asizeof.asizeof(tedges)
size = asizeof.asizeof(nodes) + asizeof.asizeof(edges)
print(f'Time(collections of tuples):{(t_2-t_1)*1000} ms.')
print(f'Size of KG (translated):{tsize/1024} kb.')
print(f'Size of KG :{size/1024} kb.')

Time(collections of tuples):0.3809928894042969 ms.
Size of KG (translated):61.4140625 kb.
Size of KG :27.7421875 kb.

Results

	Dictionary	Collection of tuples (not translated)	Collection of BN, BE, BRE (translated)
Time	0.6196 ms	0.3810 ms	0.3810 ms
Size	78.281 kb	27.742 kb	61.414 kb
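
Single sub-millisecond timings taken with `time.time()` can vary noticeably between runs. Below is a sketch of how such a comparison could be repeated with `timeit` from the standard library. `build_dict` and `build_list` are hypothetical placeholder workloads, not the code measured above, and `best_ms` is an illustrative helper name.

```python
import timeit


def build_dict(n):
    """Hypothetical workload: adjacency map with n nodes linked in a chain."""
    kg = {i: {"edges": {}, "attributes": {}} for i in range(n)}
    for i in range(n - 1):
        kg[i]["edges"][i + 1] = {"relationship_label": "interacts_with"}
    return kg


def build_list(n):
    """Hypothetical workload: n node tuples and n - 1 edge tuples."""
    nodes = [(i, "protein", {}) for i in range(n)]
    edges = [("", i, i + 1, "interacts_with", {}) for i in range(n - 1)]
    return nodes, edges


def best_ms(func, n=30, repeat=5, number=200):
    """Best per-call time in ms over `repeat` batches of `number` calls."""
    batches = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(batches) / number * 1000


dict_ms = best_ms(build_dict)
list_ms = best_ms(build_list)
print(f"dict: {dict_ms:.4f} ms, tuples: {list_ms:.4f} ms")
```

Taking the minimum over several batches is a common way to reduce noise from other processes; the absolute numbers will still depend on the machine.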

@ecarrenolozano @slobentanzer These are the results of the experiment. One issue with using a dictionary as the internal representation is that add_nodes must be called before an edge involving new nodes can be added. The adjacency map is keyed by node_id, so add_edges raises an error when the source node is not yet present.
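
One possible way around this ordering constraint, sketched under the assumption that placeholder nodes are acceptable: `dict.setdefault` creates a minimal entry for an unseen source or target ID when an edge is added, and a later `add_node` call fills in the attributes. The class `AdjacencyKG` and its methods are hypothetical and not part of BioCypher.

```python
class AdjacencyKG:
    """Hypothetical adjacency-map KG that tolerates edges arriving before nodes."""

    def __init__(self):
        self._kg = {}

    def _entry(self, node_id):
        # Create a placeholder entry the first time an ID is seen.
        return self._kg.setdefault(node_id, {"edges": {}, "attributes": None})

    def add_node(self, node_id, label, properties):
        entry = self._entry(node_id)
        # Fill in attributes only if the node is new or still a placeholder.
        if entry["attributes"] is None:
            entry["attributes"] = {"node_label": label, "properties": properties}

    def add_edge(self, source_id, target_id, label, properties):
        # Make sure both endpoints exist, then keep the first edge per pair.
        self._entry(target_id)
        self._entry(source_id)["edges"].setdefault(
            target_id, {"relationship_label": label, "properties": properties}
        )

    def placeholders(self):
        """IDs that were referenced by an edge but never added as nodes."""
        return [k for k, v in self._kg.items() if v["attributes"] is None]


kg = AdjacencyKG()
kg.add_edge("K8H6Q4", "283708", "interacts_with", {"source": "signor"})
kg.add_node("K8H6Q4", "uniprot_protein", {"taxon": "8366"})
print(kg.placeholders())  # → ['283708']
```

The trade-off is that dangling references are no longer caught at insertion time, so a check such as `placeholders()` would be needed before writing output.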

0 replies
