Output modularisation #4
-
I suggest moving to .drawio.svg files, stored in the biocypher repository, as soon as possible. For instance, https://github.com/biocypher/biocypher/blob/main/docs/write_mode.drawio.svg. We can create a subfolder for the diagrams in the docs folder. This will allow more seamless collaborative work on the diagrams, including making changes via PRs. @kpto FYI, this is where we are now holding the current output architecture discussion.
-
@slobentanzer To view and edit a diagram stored in GitHub, the draw.io app needs to be installed in the organisation. I have requested an installation for biocypher, please check. Also, I don't think a mega thread that discusses everything is manageable. This thread may serve as an index of other architectural discussions, but the discussion of a specific part should have its own thread. Of course, if you know a few GitHub tricks that can address my concern, please share them with me :)
-
In parallel to the work that @ryxx0811 has done so far, I propose this format to ground the requirements in such a way that the deliverables are a bit more specific. For future features, this should be an input that essentially defines what exactly should be developed. We can think of a way to complement the existing template (Issue: Add New Component).
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
NetworkX Output Component
Description: Add functionality to retrieve a NetworkX graph directly from BioCypher.
Issues/Pull Requests related: #361, #358
Software Requirements Specification
1. Introduction
Purpose: Implement functionality for users to create, edit, and save a NetworkX knowledge graph with the BioCypher framework. Support in-memory graph modifications (adding/removing nodes and edges) and enable persistence (disk or database).
2. Functional Requirements (FR)
3. Non-Functional Requirements (NFR)
4. External dependencies (ED)
5. Acceptance Criteria (AC)
Appendix. Modeling Diagrams
sequenceDiagram
participant Biocypher Core
participant Networkx KG
participant Pandas DF
participant Networkx Writer
participant DBMS
Biocypher Core ->> Networkx KG: add_nodes/add_edges
Networkx KG ->> Pandas DF: add_nodes/add_edges
Pandas DF ->> Pandas DF: add_table
Pandas DF ->> Networkx KG: dataframe[node]/dataframe[edge]
Networkx KG ->> Biocypher Core: dataframe[node]/dataframe[edge]
Biocypher Core ->> Networkx KG: get_KG
Networkx KG ->> Networkx KG: get_KG
Networkx KG ->> Biocypher Core: Knowledge Graph
Biocypher Core ->> DBMS: write_to_dbms(optional)
DBMS ->> DBMS: write
Biocypher Core ->> Networkx Writer: write_to_file(optional)
Networkx Writer ->> Networkx Writer: write
-
A bit late to comment, but what is the internal representation for? Why isn't it just another output like the others? An output is agnostic to its destination; it is merely something that complies with the interface, and whether it ends up using a database or memory does not matter. You can have multiple outputs at the same time: a database connection output, a database import data output, a networkx output and a pandas output. Wouldn't that be more configurable for users?
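To make the idea above concrete, here is a minimal sketch of what "complying with the interface" could look like. Everything in it is an assumption made for illustration: the `Output` protocol, its method names, the two toy outputs and the `Core` class are hypothetical and not part of the BioCypher API. It only shows that the core can fan the same data out to several outputs without knowing where each one writes.

```python
from typing import Iterable, Protocol

# Assumed tuple shapes, for illustration only:
# node = (id, label, properties), edge = (id, source, target, label, properties)
Node = tuple[str, str, dict]
Edge = tuple[str, str, str, str, dict]


class Output(Protocol):
    """Hypothetical interface: any object with these two methods is an output."""

    def write_nodes(self, nodes: Iterable[Node]) -> None: ...

    def write_edges(self, edges: Iterable[Edge]) -> None: ...


class InMemoryOutput:
    """Keeps everything in lists (stand-in for a networkx or pandas output)."""

    def __init__(self) -> None:
        self.nodes: list[Node] = []
        self.edges: list[Edge] = []

    def write_nodes(self, nodes: Iterable[Node]) -> None:
        self.nodes.extend(nodes)

    def write_edges(self, edges: Iterable[Edge]) -> None:
        self.edges.extend(edges)


class CountingOutput:
    """Only counts what it receives (stand-in for a database or file writer)."""

    def __init__(self) -> None:
        self.n_nodes = 0
        self.n_edges = 0

    def write_nodes(self, nodes: Iterable[Node]) -> None:
        self.n_nodes += sum(1 for _ in nodes)

    def write_edges(self, edges: Iterable[Edge]) -> None:
        self.n_edges += sum(1 for _ in edges)


class Core:
    """Hypothetical core: forwards data to every output, knows nothing else."""

    def __init__(self, outputs: list[Output]) -> None:
        self._outputs = outputs

    def add_nodes(self, nodes: Iterable[Node]) -> None:
        batch = list(nodes)  # materialise once so every output sees the same data
        for output in self._outputs:
            output.write_nodes(batch)

    def add_edges(self, edges: Iterable[Edge]) -> None:
        batch = list(edges)
        for output in self._outputs:
            output.write_edges(batch)


memory = InMemoryOutput()
counter = CountingOutput()
core = Core([memory, counter])
core.add_nodes([("p1", "protein", {}), ("p2", "protein", {})])
core.add_edges([("e1", "p1", "p2", "interacts_with", {})])
```

With this shape, adding a new destination means writing one more class with the same two methods; the core does not change.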
-
Hi guys, Happy to interact again:
At this point, we could implement a data structure suited to our needs, such as an adjacency map (Python dictionaries are quite efficient for this). Problems could appear when we work with large graphs. In my brief research, I noticed that several libraries build the knowledge graph using optimized C/C++ routines. Fortunately, all of the libraries in the following list come with Python bindings. We could study the possibility of using one of the following libraries to support all our graph construction:
We need to define and run experiments to see which approach is best: a data structure in native Python, or external libraries that help us with this task. What do you think?
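As a starting point for such experiments, here is a minimal sketch of a measurement harness using only the standard library. The `build_adjacency_map` and `measure` helpers and the ring-shaped test graph are assumptions made for illustration; a builder based on one of the external libraries could be passed to `measure` in the same way to compare time and peak memory.

```python
import time
import tracemalloc


def build_adjacency_map(n_nodes: int) -> dict:
    """Build a ring graph as a dict of dicts: node -> {neighbour: attributes}."""
    graph = {i: {} for i in range(n_nodes)}
    for i in range(n_nodes):
        graph[i][(i + 1) % n_nodes] = {"label": "interacts_with"}
    return graph


def measure(builder, *args):
    """Return (result, elapsed seconds, peak bytes allocated) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    result = builder(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


graph, seconds, peak_bytes = measure(build_adjacency_map, 10_000)
print(f"{len(graph)} nodes in {seconds * 1000:.1f} ms, peak {peak_bytes / 1024:.0f} KiB")
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used inside a C/C++ library may be underreported and would need a separate measurement.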
-
The internal data structure for representing the KG
The data structure should be the minimal feature-complete implementation of any knowledge graph we aim to model; the internal technical structure we use to represent nodes and edges. It should be agnostic to input and output formats (as far as feasible) and technically efficient, to allow a low memory footprint, fast IO, streaming, etc. We currently have two things that share this task in some regards. I think it would be cleaner to agree on one, but there are some complications.
How they relate: the tuples are more like a minimal "convention" that is not enforced in code apart from some error messages. They are technically very efficient, but less controlled. They are currently the input stream into BioCypher and are "before normalisation", i.e., they are "translated" by the _translate module into the actual KG components, based on the schema configuration. After translation, they are
The current workflow is a tradeoff between simplicity of implementation (the tuples) and rigorous checks for alignment with the KG schema definition, ontologies, etc (the data classes). Particularly the existence of the _translate module, that takes care of aligning any input from any adapter with the ontologies used in the background, complicates the programming (but ideally makes the task of building a KG easier for the user).
In perspective:
-
import networkx as nx
import time
from pympler import asizeof
from biocypher._logger import logger
import sys
import csv
import data from csv files
# generate edges and nodes
with open('experiment/dataset_30_nodes_proteins.csv', mode='r', newline='') as file:
    reader = csv.reader(file)
    nodes = list(tuple(row) for row in reader)[1:]  # skip the header row
with open('experiment/dataset_30_edges_interactions.csv', mode='r', newline='') as file:
    reader = csv.reader(file)
    edges = list(tuple(row) for row in reader)[1:]  # skip the header row
print(len(nodes))
print(len(edges))
print(nodes)
print(edges)
translate nodes and edges
# translate to BioCypherNode and BioCypherEdge
import ast
from biocypher._create import BioCypherNode,BioCypherEdge
tnodes = [
BioCypherNode(node_id=node[0],
node_label=node[1],
properties=ast.literal_eval(node[2]))
for node in nodes
]
tedges = [
BioCypherEdge(source_id=edge[1],
target_id=edge[2],
relationship_label=edge[3],
relationship_id=edge[0],
properties=ast.literal_eval(edge[4]))
for edge in edges
]
print(len(tnodes))
print(len(tedges))
print(tnodes)
print(tedges)
Dictionary as KG (Adjacency map)
# using a dictionary as the internal representation
class BiocypherKG:
    def __init__(self):
        # node_id -> {'edges': {target_id: edge attributes}, 'attributes': {...}}
        self._KG = {}

    def add_nodes(self, nodes):
        for node in nodes:
            node_id = node.get_id()
            if node_id not in self._KG:
                self._KG[node_id] = {}
                self._KG[node_id]['edges'] = {}
                self._KG[node_id]['attributes'] = {
                    'preferred_id': node.get_preferred_id(),
                    'node_label': node.get_label(),
                    'properties': node.get_properties(),
                }

    def add_edges(self, edges):
        for edge in edges:
            source_id = edge.get_source_id()
            target_id = edge.get_target_id()
            if source_id in self._KG:
                # keep the first edge seen for a given (source, target) pair
                if target_id not in self._KG[source_id]['edges']:
                    self._KG[source_id]['edges'][target_id] = {
                        'relationship_label': edge.get_label(),
                        'relationship_id': edge.get_id(),
                        'properties': edge.get_properties(),
                    }
            else:
                raise TypeError(
                    f'Knowledge Graph has no {source_id} key. '
                    f'Call add_nodes() to add the {source_id} node and its properties.'
                )

    def get_KG(self):
        return self._KG

    def to_networkx(self):
        G = nx.DiGraph()
        for node_id, entry in self._KG.items():
            G.add_node(node_id, **entry['attributes'])
            for target_id, edge_attributes in entry['edges'].items():
                G.add_edge(node_id, target_id, **edge_attributes)
        return G

bkd = BiocypherKG()
t_1 = time.time()
bkd.add_nodes(tnodes)
bkd.add_edges(tedges)
bkd.to_networkx()  # the conversion is included in the measured time
t_2 = time.time()
KG = bkd.get_KG()
size = asizeof.asizeof(KG)
print(f'Time(dict):{(t_2-t_1)*1000} ms.')
print(f'Size of KG:{size/1024} kb.')
Tuple(BiocypherNode, BiocypherEdge, BiocypherRelAsEdge) as KG
# using collections of tuples (BiocypherNode, BiocypherEdge, BiocypherRelAsEdge) as the internal representation
class BiocypherKG:
    def __init__(self):
        self.G = nx.DiGraph()

    def add_nodes(self, nodes):
        for node in nodes:
            # the graph is keyed by node ID, so test the ID, not the node object
            if node.get_id() not in self.G:
                self.G.add_node(
                    node.get_id(),
                    label=node.get_label(),
                    properties=node.get_properties()
                )

    def add_edges(self, edges):
        for edge in edges:
            source_id = edge.get_source_id()
            target_id = edge.get_target_id()
            # keep the first edge seen for a given (source, target) pair
            if not self.G.has_edge(source_id, target_id):
                self.G.add_edge(
                    source_id,
                    target_id,
                    relationship_label=edge.get_label(),
                    relationship_id=edge.get_id(),
                    properties=edge.get_properties()
                )

bkt = BiocypherKG()
t_1 = time.time()
bkt.add_nodes(tnodes)
bkt.add_edges(tedges)
t_2 = time.time()
tsize = asizeof.asizeof(tnodes) + asizeof.asizeof(tedges)
size = asizeof.asizeof(nodes) + asizeof.asizeof(edges)
print(f'Time(collections of tuples):{(t_2-t_1)*1000} ms.')
print(f'Size of KG (translated):{tsize/1024} kb.')
print(f'Size of KG :{size/1024} kb.')
Results
@ecarrenolozano @slobentanzer These are the results of the experiment. One issue with using a dictionary as the internal representation is that the add_nodes function must be called before an edge that involves new nodes can be added to the dictionary. This is because the dictionary uses node_id as the key in the adjacency map.
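One possible way around this ordering constraint, sketched here as an assumption rather than a proposal that was tested in the experiment, is to create a placeholder entry whenever an edge refers to an unknown node and to fill in its attributes later. The `AdjacencyMap` class and its method names are hypothetical and work on plain values instead of BioCypherNode and BioCypherEdge objects to keep the example self-contained.

```python
class AdjacencyMap:
    """Adjacency map that tolerates edges arriving before their nodes."""

    def __init__(self):
        # node_id -> {'attributes': dict or None, 'edges': {target_id: attributes}}
        self._kg = {}

    def _ensure_node(self, node_id):
        # Create a placeholder entry the first time a node ID is seen.
        return self._kg.setdefault(node_id, {"attributes": None, "edges": {}})

    def add_node(self, node_id, label, properties):
        entry = self._ensure_node(node_id)
        entry["attributes"] = {"node_label": label, "properties": properties}

    def add_edge(self, edge_id, source_id, target_id, label, properties):
        self._ensure_node(target_id)
        self._ensure_node(source_id)["edges"][target_id] = {
            "relationship_id": edge_id,
            "relationship_label": label,
            "properties": properties,
        }

    def dangling_nodes(self):
        """Node IDs referenced by an edge but never added with attributes."""
        return [n for n, entry in self._kg.items() if entry["attributes"] is None]


kg = AdjacencyMap()
kg.add_edge("e1", "p1", "p2", "interacts_with", {})  # neither node exists yet
dangling_before = kg.dangling_nodes()
kg.add_node("p1", "protein", {"name": "A"})
dangling_after = kg.dangling_nodes()
```

The tradeoff is that missing nodes are no longer caught at insertion time; a check such as `dangling_nodes()` would have to run before the graph is written out.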
-
Output modularisation
Re #361 and discussions therein, we should clarify which architecture changes we need to make to streamline the output API to represent all BioCypher functions and yet have the simplest, most intuitive functionality for the user.
@kpto made the observation that isolation is needed (the core should not need to know about the output); @slobentanzer proposes to focus core representation on the basic tuple representation (three-element for nodes, five-element for edges), and to have independent output modules that can be chosen by some configuration / API. The exact nature of these configuration options and API choices is the subject of discussion here. For that, we work with diagrams for visualising the architecture.
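As a rough illustration of this proposal, the sketch below keeps the core data as plain tuples and selects an output module by name from a configuration dictionary. The registry, the `output` configuration key, the two example outputs and the tuple field order (ID, label, properties for nodes; ID, source, target, label, properties for edges) are all assumptions made for the example, not the agreed design.

```python
from typing import Callable

# Hypothetical registry: output name -> function taking (nodes, edges)
OUTPUTS: dict[str, Callable[[list, list], object]] = {}


def register_output(name: str):
    """Decorator that makes an output module selectable by name."""
    def decorator(func):
        OUTPUTS[name] = func
        return func
    return decorator


@register_output("adjacency")
def to_adjacency(nodes, edges):
    # node tuple: (id, label, properties)
    graph = {node_id: {} for node_id, _label, _props in nodes}
    # edge tuple: (id, source, target, label, properties)
    for _edge_id, source, target, label, _props in edges:
        graph.setdefault(source, {})[target] = label
    return graph


@register_output("edge_list")
def to_edge_list(nodes, edges):
    return [(source, label, target) for _id, source, target, label, _props in edges]


def build(config: dict, nodes: list, edges: list):
    """Pick the output named in the (hypothetical) 'output' config key."""
    name = config["output"]
    try:
        writer = OUTPUTS[name]
    except KeyError:
        raise ValueError(f"Unknown output {name!r}; available: {sorted(OUTPUTS)}") from None
    return writer(nodes, edges)


nodes = [("p1", "protein", {"name": "A"}), ("p2", "protein", {"name": "B"})]
edges = [("e1", "p1", "p2", "interacts_with", {"score": 0.9})]
adjacency = build({"output": "adjacency"}, nodes, edges)
edge_list = build({"output": "edge_list"}, nodes, edges)
```

The core only ever handles the tuples and the registry lookup, so it stays isolated from how any particular output stores or writes the graph.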
Architecture Diagrams
We include and expand our catalog of UML diagrams. You can view the diagrams (just click below). However, if you want to contribute, join our Zulip Channel and request access to edit the files.