Clarify what sort of biomedical data ARAs will be required to obtain … #5

Open
wants to merge 1 commit into
base: master

dkoslicki
Member

@dkoslicki dkoslicki commented Apr 24, 2020

…from KPs or other translator resources.

In particular, while obtaining Message nodes and edges from Translator resources makes complete sense imho, ARAs frequently require additional information to perform reasoning. Examples include:

  1. General use case: utilizing KS (and perhaps KP) data dumps to speed performance.
    • Specific use case: NCBIeUtils is a slow API endpoint. When an ARA receives or creates a Message KG with many thousands of nodes and edges, it may wish to annotate all pairs of nodes, or all edges, with their co-occurrence frequency in the PubMed literature. A copy of PubMed/baseline cached in a local database lets these (tens of) thousands of queries run quickly. Even though a Translator KP could technically provide a faster API endpoint than NCBIeUtils for this purpose, the latency penalty of remote calls would remain, i.e., keep these latency numbers in mind.
  2. General use case: requiring locally aggregated graphs or other non-Translator resources for machine learning/reasoning purposes.
    • Specific use case: An ARA may need access to a specifically formatted graph (not associated with any particular Message) in order to use graph convolutional neural network methods (or node embedding methods for downstream traditional ML techniques). The ARA would store and use this information for purposes such as link prediction and answer scoring. It seems unreasonable to require a KP to provide this resource for one particular ARA.
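
As a rough illustration of use case 1, the sketch below annotates every pair of nodes with a co-occurrence count taken from a local SQLite index instead of a remote API. The table name `term_pmid`, the helper names, and the toy rows are all assumptions made for this example; they are not part of any Translator component or of the PubMed baseline format.

```python
import sqlite3
from itertools import combinations


def build_local_index(rows):
    """Load (term, pmid) rows into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE term_pmid (term TEXT, pmid INTEGER)")
    conn.execute("CREATE INDEX idx_term ON term_pmid (term)")
    conn.executemany("INSERT INTO term_pmid VALUES (?, ?)", rows)
    return conn


def cooccurrence_count(conn, term_a, term_b):
    """Count the articles (PMIDs) in which both terms appear."""
    query = (
        "SELECT COUNT(DISTINCT a.pmid) FROM term_pmid a "
        "JOIN term_pmid b ON a.pmid = b.pmid "
        "WHERE a.term = ? AND b.term = ?"
    )
    return conn.execute(query, (term_a, term_b)).fetchone()[0]


def annotate_pairs(conn, node_ids):
    """Return {(a, b): count} for every unordered pair of node identifiers."""
    return {
        (a, b): cooccurrence_count(conn, a, b)
        for a, b in combinations(node_ids, 2)
    }


# Toy data standing in for a locally cached literature index.
rows = [("aspirin", 1), ("aspirin", 2), ("headache", 2), ("headache", 3), ("fever", 1)]
conn = build_local_index(rows)
pair_counts = annotate_pairs(conn, ["aspirin", "headache", "fever"])
```

Each lookup here is a local query, so annotating tens of thousands of pairs avoids one network round trip per pair.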

@cbizon
Collaborator

cbizon commented Apr 26, 2020

@dkoslicki - these are great points, and I'd like to ask a few questions to help clarify my understanding.

So for use case 1, is it fair to say that this is mostly a concern about caching? If so, then I think it's mostly orthogonal to whether ARAs can get non-KP information. So, if caching of KPs were allowed, then I think we would want to make a KP for that information, such that any of the ARAs could use it. I guess one problem could be if the data being pulled is not amenable to being served via a ReasonerAPI, but I'm not sure if that's the problem here...

For use case 2, I agree that we need to be able to instantiate those aggregate graphs for all kinds of big-graph analysis. The question in my mind is whether they can be created from KPs. My hope is that they could be - the KGX tools developed by @cmungall and @deepakunni3 among others should allow this to happen pretty easily? Or do you see a need to bring in non-KP information into such a graph?

@dkoslicki
Member Author

@cbizon

For use case 1, yes, this is “mostly” a concern about caching. Hitting an API is significantly slower than hitting a database stored in memory. While KPs could provide endpoints that allow bulk download of the data:

  1. Your point about ReasonerAPI perhaps not being amenable to the task is well taken (and is apparently proposed on the agenda for the current “relay” breakout groups on Thursday).
  2. Some data sources are not provided by any KP (e.g., we (Team Expander Agent) have previously used the Veterans Affairs (VA) National Drug File as a cached source of information). It seems onerous to ask a KP team to roll out such a data source when we’re using it for a specific task (e.g., use case 2).

For use case 2: I think the aggregated graphs (from tools such as KGX) would be a great starting point, but additional (non-KP) information might be required. For example, some GNNs require graphs to have a specific format (easily done by post-processing, say, a KGX aggregated graph), but also require additional information decorating node/edge properties (specifically, numerical values related to non-KP-derived, or only partially KP-derived, training data, etc.). It would be nice to clarify whether it’s “ok” for ARAs to keep and control these modified graphs (originally derived from some Translator KP-aggregated graph(s)) as local sources of information to inform machine learning models.
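
To make the post-processing step concrete, here is a minimal sketch of turning an aggregated graph into the kind of input many GNN libraries expect: an integer node index, a two-row edge list, and a per-node feature matrix. The node and edge dictionaries, the `to_gnn_inputs` helper, and the `extra_features` values are hypothetical; in particular, the numeric features stand in for the non-KP-derived decorations mentioned above.

```python
def to_gnn_inputs(nodes, edges, extra_features, default=0.0):
    """Convert node/edge records into (index, edge_index, features).

    nodes: list of dicts with an "id" key.
    edges: list of dicts with "subject" and "object" keys.
    extra_features: {node_id: float}, e.g. values derived from training data.
    """
    # Map each node identifier to a contiguous integer index.
    index = {node["id"]: i for i, node in enumerate(nodes)}

    # Build a two-row (source, target) edge list, skipping dangling edges.
    sources, targets = [], []
    for edge in edges:
        if edge["subject"] in index and edge["object"] in index:
            sources.append(index[edge["subject"]])
            targets.append(index[edge["object"]])

    # One feature row per node, in index order; missing values get the default.
    features = [[extra_features.get(node["id"], default)] for node in nodes]
    return index, [sources, targets], features


# Toy graph with placeholder identifiers.
nodes = [{"id": "EX:drug"}, {"id": "EX:gene"}, {"id": "EX:disease"}]
edges = [
    {"subject": "EX:drug", "object": "EX:gene"},
    {"subject": "EX:gene", "object": "EX:disease"},
    {"subject": "EX:gene", "object": "EX:missing"},  # dropped: unknown node
]
index, edge_index, features = to_gnn_inputs(nodes, edges, {"EX:drug": 0.7})
```

The resulting structures are plain lists, so they can be handed to whichever tensor library the ARA uses; the modified graph they encode is the locally held artifact in question.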

@cbizon
Collaborator

cbizon commented May 5, 2020

@dkoslicki - did the Relay discussion help with the clarification here? I think that the outcome of that discussion was that:

  • caching is fine, as long as the things being cached were from KPs
  • If an ARA needs non KP-served data it should try to get a KP team to stand them up, but at least in the short term, standing up their own KP is ok.
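
A minimal sketch of the first point, caching data that originated from a KP, might look like the following. The `CachedKP` class and the `fake_kp` function are hypothetical stand-ins; a real ARA would wrap its own KP client and would likely add expiry and persistence.

```python
import json


class CachedKP:
    """Read-through cache in front of a KP query function (illustrative only)."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that actually queries the KP
        self._cache = {}      # serialized query -> response
        self.misses = 0       # number of calls forwarded to the KP

    def query(self, message):
        # Serialize with sorted keys so equivalent queries share one cache entry.
        key = json.dumps(message, sort_keys=True)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fetch(message)
        return self._cache[key]


def fake_kp(message):
    """Stand-in for a remote KP call; returns a tiny graph fragment."""
    return {"nodes": [message["curie"]], "edges": []}


kp = CachedKP(fake_kp)
first = kp.query({"curie": "EXAMPLE:123"})
second = kp.query({"curie": "EXAMPLE:123"})  # served from the cache
```

Because everything in the cache came from the wrapped KP call, this pattern stays within the "cached from KPs" condition described above.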

@dkoslicki
Member Author

> @dkoslicki - did the Relay discussion help with the clarification here? I think that the outcome of that discussion was that:
>
>   • caching is fine, as long as the things being cached were from KPs
>   • If an ARA needs non KP-served data it should try to get a KP team to stand them up, but at least in the short term, standing up their own KP is ok.

Yup! I think that accurately summarizes the discussion. Though on point 2, "in the short term" may end up being longer than we think (as KP teams have their own milestones to prioritize). Might be wise to make the architecture doc just not say anything/much on this point (as I think is currently the case) until a KP/KS registry is settled on, a request/ticketing system is set up, etc.

@jzollars
Contributor

@NCATSTranslator/architecture-core: Test: Review Pull Request.

@jzollars jzollars requested a review from a team June 10, 2020 19:44
Collaborator

@Rosinaweber Rosinaweber left a comment


About No. 7:
[[ARAs obtain Message nodes and edges only via KPs (or other ARAs), not from locally-cached aggregated graphs or non-Translator data sources.]]
It sounds odd that ARAs can obtain Message nodes and edges from other ARAs: to provide those, wouldn't the providing ARAs need to cache graphs locally? Apologies if I'm missing something.

4 participants