
Resnik computations getting stuck #296

Open
caufieldjh opened this issue Sep 15, 2022 · 3 comments

@caufieldjh
Member

In the continued adventures of Resnik -
In semsim, I've found that running the Resnik computation on KGPhenio seems to get stuck.
The DAG has 49,291 nodes, including the Upheno nodes, which we need because we can't get paths between phenotype ontology nodes without them.

With code as follows:

prefixes = ["HP", "MP"]
cutoff = 2.5

resnik_model = DAGResnik()
# dag is the KGPhenio DAG; counts are the per-node occurrence counts used for IC
resnik_model.fit(dag, node_counts=counts)
rs_df = resnik_model.get_similarities_from_bipartite_graph_from_edge_node_prefixes(
    source_node_prefixes=prefixes,
    destination_node_prefixes=prefixes,
    minimum_similarity=cutoff,
    return_similarities_dataframe=True,
).astype("category", copy=True)

will consume as much memory as is available without ever completing.
I tried this on a cloud instance with 128 GB of memory today, and the process was killed for running out of memory. It ran for ~4 hours at 100% on all 16 vCPUs, with memory holding at around 55 GB before climbing past 120 GB within about ten more minutes.

Is the Resnik calculation getting stuck in the DAG somewhere?

I've previously been able to get some output from this function, but only with an earlier version that didn't support specifying a minimum_similarity.

Embiggen is 0.11.38, ensmallen is 0.8.24.
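
In case it's useful, the workaround I have in mind is to stream results to disk one pair at a time instead of materializing the whole dataframe. A minimal sketch, assuming some single-pair entry point exists (get_similarity_from_node_names below is a hypothetical placeholder, not a call I've confirmed in embiggen):

import csv

def stream_similarities(resnik_model, sources, destinations, cutoff, path):
    # Write qualifying pairs one row at a time so memory stays bounded,
    # trading speed for the ability to finish at all.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source", "destination", "resnik"])
        for src in sources:
            for dst in destinations:
                # Hypothetical per-pair call; a stand-in for whatever
                # single-pair API embiggen actually exposes.
                score = resnik_model.get_similarity_from_node_names(src, dst)
                if score >= cutoff:
                    writer.writerow([src, dst, score])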

@hrshdhgd @justaddcoffee

@LucaCappelletti94
Collaborator

Roughly how many edges are you expecting to receive?

@pnrobinson
Member

One optimization I have made in our Java code exploits the fact that if we start with an ontology whose subontologies do not intermingle, we do not need to explicitly calculate the IC of term pairs whose MICA we already know to be the root (e.g., a liver term and an ear term). This results in a large saving. Luca, can we do a code review and figure out whether this might make sense here?
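
Sketched in Python rather than our actual Java, with illustrative helpers (ic maps a term to its information content, mica returns the most informative common ancestor, and subontology maps a term to the root of its non-intermingling subontology):

def resnik_with_shortcut(a, b, ic, mica, subontology):
    # Resnik similarity with the disjoint-subontology shortcut.
    if subontology[a] != subontology[b]:
        # Terms from non-intermingling subontologies (e.g., liver vs. ear)
        # can only meet at the global root, whose IC is ~0, so skip the
        # expensive MICA search entirely.
        return 0.0
    return ic[mica(a, b)]

With a positive cutoff like the 2.5 used above, every cross-subontology pair drops out without touching the DAG at all.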

@justaddcoffee
Member

> Roughly how many edges are you expecting to receive?

For this experiment (HP versus MP phenotypes), I think there are roughly 49k nodes and 93k edges, so it's not particularly large.

So a memory peak of >120 GB when computing the all-by-all Resnik similarity and only storing values above a fairly high cutoff (>2.5 IC, I think) is kind of surprising to me...
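
Back-of-envelope, assuming roughly 40k of those nodes are HP/MP terms (my guess, not a measured count):

n_terms = 40_000           # assumed number of HP + MP terms
pairs = n_terms ** 2       # all-by-all pairs before the cutoff filter
bytes_per_row = 4 + 4 + 4  # two uint32 node ids + one float32 score
print(f"{pairs:,} pairs, ~{pairs * bytes_per_row / 1e9:.0f} GB dense")
# -> 1,600,000,000 pairs, ~19 GB dense

So even materializing every pair with narrow dtypes is on the order of 20 GB; blowing past 120 GB makes me suspect 64-bit intermediates or extra copies are being held before the cutoff filter is applied.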

> This results in a large saving. Luca, can we do a code review and figure out whether this might make sense here?

Ping me too, please! I'd like to sit in.
