
Resnik computations getting stuck #296

Open
caufieldjh opened this issue Sep 15, 2022 · 3 comments

@caufieldjh
Member

In the continued adventures of Resnik -
In semsim, I've found that running the Resnik computation on KGPhenio seems to get stuck.
The DAG has 49,291 nodes, including the Upheno nodes, which we need because we can't get paths between phenotype ontology nodes without them.

With code as follows:

prefixes = ["HP", "MP"]
cutoff = 2.5

resnik_model = DAGResnik()
# dag is the KGPhenio DAG; counts are the per-node occurrence counts used for IC
resnik_model.fit(dag, node_counts=counts)
rs_df = resnik_model.get_similarities_from_bipartite_graph_from_edge_node_prefixes(
    source_node_prefixes=prefixes,
    destination_node_prefixes=prefixes,
    minimum_similarity=cutoff,
    return_similarities_dataframe=True,
).astype("category", copy=True)

will consume as much memory as is available without ever completing.
I tried this on a cloud instance with 128 GB of memory today, and the process was killed for running out of memory. It ran for ~4 hours at 100% on all 16 vCPUs, with memory holding at around 55 GB before climbing past 120 GB within about ten more minutes.

Is the Resnik calculation getting stuck in the DAG somewhere?

I've previously been able to get some output from this function, but only with an earlier version that didn't support specifying a minimum_similarity.

Embiggen is 0.11.38, ensmallen is 0.8.24.
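
In case it's useful, the workaround I have in mind is to stream results to disk one pair at a time instead of materializing the whole dataframe. A minimal sketch, assuming some single-pair entry point exists (get_similarity_from_node_names below is a hypothetical placeholder, not a call I've confirmed in embiggen):

import csv

def stream_similarities(resnik_model, sources, destinations, cutoff, path):
    # Write qualifying pairs one row at a time so memory stays bounded,
    # trading speed for the ability to finish at all.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "destination", "resnik"])
        for src in sources:
            for dst in destinations:
                # Hypothetical per-pair call; a stand-in for whatever
                # single-pair API embiggen actually exposes.
                score = resnik_model.get_similarity_from_node_names(src, dst)
                if score >= cutoff:
                    writer.writerow([src, dst, score])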

@hrshdhgd @justaddcoffee

@LucaCappelletti94
Collaborator

Roughly how many edges are you expecting to receive?

@pnrobinson
Member

One optimization I have made in our Java code exploits the fact that if we start with an ontology whose subontologies do not intermingle, we do not need to explicitly calculate the IC of term pairs whose MICA we already know to be the root (e.g., a liver term and an ear term). This results in a large saving. Luca, can we do a code review and figure out whether this might make sense here?
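
Sketched in Python rather than our actual Java, with illustrative helpers (ic maps a term to its information content, mica returns the most informative common ancestor, and subontology maps a term to the root of its non-intermingling subontology):

def resnik_with_shortcut(a, b, ic, mica, subontology):
    # Resnik similarity with the disjoint-subontology shortcut.
    if subontology[a] != subontology[b]:
        # Terms from non-intermingling subontologies (e.g., liver vs. ear)
        # can only meet at the global root, whose IC is ~0, so skip the
        # expensive MICA search entirely.
        return 0.0
    return ic[mica(a, b)]

With a positive cutoff like the 2.5 used above, every cross-subontology pair drops out without touching the DAG at all.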

@justaddcoffee
Member

> Roughly how many edges are you expecting to receive?

For this experiment (HP versus MP phenotypes), I think there are roughly 49k nodes and 93k edges, so it's not particularly large.

So a memory peak of >120 GB when computing the all-by-all Resnik similarity and only storing values above a fairly high cutoff (>2.5 IC, I think) is kind of surprising to me...
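
Back-of-envelope, assuming roughly 40k of those nodes are HP/MP terms (my guess, not a measured count):

n_terms = 40_000           # assumed number of HP + MP terms
pairs = n_terms ** 2       # all-by-all pairs before the cutoff filter
bytes_per_row = 4 + 4 + 4  # two uint32 node ids + one float32 score
print(f"{pairs:,} pairs, ~{pairs * bytes_per_row / 1e9:.0f} GB dense")
# -> 1,600,000,000 pairs, ~19 GB dense

So even materializing every pair with narrow dtypes is on the order of 20 GB; blowing past 120 GB makes me suspect 64-bit intermediates or extra copies are being held before the cutoff filter is applied.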

> This results in a large saving. Luca, can we do a code review and figure out whether this might make sense here?

Ping me too, please! I'd like to sit in.
