Adds support for `jaccard_coefficient` #62

rlratzel · 2025-01-09T17:30:12Z

Adds support for jaccard_coefficient to nx-cugraph.

This includes a test, but relies largely on the existing test coverage provided by NetworkX. The test included here could (and should) be submitted to NetworkX in a separate PR, since it does not cover anything unique to nx-cugraph.

A benchmark is also included, with results showing a 2-4X speedup. I've seen a much, much larger speedup on a different graph (a large movie review bipartite graph, showing 966s for NX, 2s for nx-cugraph = ~500X), so I need to investigate further. This investigation need not prevent this PR from being merged now though.

nv-rliu

lgtm!

eriknw · 2025-01-21T14:27:25Z

nx_cugraph/algorithms/link_prediction.py

+    if ebunch is None:
+        # FIXME: is there a more efficient way to do this (on GPU or
+        # otherwise)?
+        ebunch = list(nx.non_edges(G))
+        if not ebunch:
+            return iter([])


Here's an alternative:

Suggested change

-    if ebunch is None:
-        # FIXME: is there a more efficient way to do this (on GPU or
-        # otherwise)?
-        ebunch = list(nx.non_edges(G))
-        if not ebunch:
-            return iter([])
+    G = _to_undirected_graph(G)
+    if ebunch is None:
+        A = cp.tri(G._N, G._N, dtype=bool)
+        A[G.src_indices, G.dst_indices] = True
+        src_indices, dst_indices = cp.nonzero(~A)
+        if src_indices.size == 0:
+            return iter([])
+        src_indices = src_indices.astype(index_dtype)
+        dst_indices = dst_indices.astype(index_dtype)

There are other variations and alternatives; for example, you could use a mask:

    mask = G.src_indices < G.dst_indices
    A[G.src_indices[mask], G.dst_indices[mask]] = True

Observe that these go straight to indices, so there is no need to call e.g. G._list_to_nodearray(src_indices) to do remapping from keys to indices.

Also, if possible, I would advise against using the graph object (such as calling nx.non_edges(G)) before converting the graph via e.g. G = _to_undirected_graph(G). I think it's best to make fewer assumptions about the input.
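To make the triangular-matrix trick above concrete, here is a minimal sketch that uses NumPy in place of CuPy so it runs without a GPU (the calls used here, `tri` and `nonzero`, exist in both libraries). The function name `non_edge_indices` and the toy graph are assumptions for illustration and are not part of nx-cugraph.

```python
import numpy as np


def non_edge_indices(src_indices, dst_indices, num_nodes):
    """Return (src, dst) index arrays for every pair i < j that has no edge."""
    # The lower triangle, diagonal included, starts out True. That excludes
    # self-pairs and the mirrored (j, i) copy of each undirected pair.
    A = np.tri(num_nodes, num_nodes, dtype=bool)
    # Mark every stored edge as True too.
    A[src_indices, dst_indices] = True
    # Entries that are still False are non-edges in the strict upper triangle.
    return np.nonzero(~A)


# Path graph 0-1-2 plus an isolated node 3, stored symmetrically.
src = np.array([0, 1, 1, 2])
dst = np.array([1, 0, 2, 1])
non_src, non_dst = non_edge_indices(src, dst, 4)
pairs = list(zip(non_src.tolist(), non_dst.tolist()))
print(pairs)
```

Note that this builds a dense N x N boolean matrix, so memory grows quadratically with the number of nodes.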

rlratzel · 2025-01-23

OK, thanks. I'm going to benchmark and apply one of these suggestions (if they're faster, which I'm assuming they are).

I'm a bit confused by this statement though:

I would advise against using the graph object (such as calling nx.non_edges(G)) before converting the graph via e.g. G = _to_undirected_graph(G). I think it's best to make fewer assumptions about the input.

If I need to call a NX function like nx.non_edges(), why would I first convert the input to a CudaGraph using _to_undirected_graph()? Shouldn't we be assuming the input is a compatible NX Graph type, or are we allowing CudaGraph objects to be passed as well? I suspect I'm missing something. (side note: this is ideal info for a docstring, but I believe the decorator bubbles them up to the corresponding nx function docstring, which I wouldn't want here.)


eriknw

LGTM! Thanks for the updates. I left one minor suggestion.

Did you ever dig into understanding the benchmark performance? What's the procedure for updating the benchmark docs?

Adding jaccard makes me want to also add adamic_adar_index and preferential_attachment (should be easy to implement, and they show up in examples and learning material, but they're not commonly used by networkx dependents--but neither is jaccard!).

eriknw · 2025-01-27T23:38:21Z

nx_cugraph/classes/graph.py

+        # FIXME: the SGGraph constructor arg "symmetrize" will perform all
+        # symmetrization steps required by libcugraph. The edge_array check
+        # should be kept, but all other code in this `if` block should be
+        # removed if possible.


symmetrize= was added in 24.10 here, rapidsai/cugraph#4649, so I think it makes a lot of sense to investigate using it and removing some code. Note that symmetrize here can be "union" or "intersection", but I think PLC only does "union", so we'd still need virtually all the code here. Perhaps we could use _get_int_dtype to determine which dtype to cast to in order to make this more efficient. I'm also curious how the performance of this code compares to symmetrizing in PLC.

eriknw · 2025-01-27T23:50:27Z

nx_cugraph/algorithms/link_prediction.py

+        # checked. If not done, plc.jaccard_coefficients() will accept node IDs
+        # not in the graph and return a coefficient of 0 for them, which is not
+        # compatible with NX.
+        if (not hasattr(G, "key_to_id") or G.key_to_id is None) and (


No need for hasattr here; G is converted above, and we use this convention (converting to CudaGraph) heavily throughout the code.

Suggested change

-        if (not hasattr(G, "key_to_id") or G.key_to_id is None) and (
+        if G.key_to_id is None and (

rlratzel · 2025-01-28

I ended up refactoring the ebunch node check in order to pass some tests added to ensure we behave like NX. The change consolidates the additional valid node checks in _list_to_nodearray, but let me know if you see any issues.


rlratzel · 2025-01-28T08:29:02Z

Did you ever dig into understanding the benchmark performance? What's the procedure for updating the benchmark docs?

IIRC, the nx-cugraph speedup over NX increases as the average degree and the size of ebunch increase, and graph size doesn't come into play since Jaccard just compares arbitrary pairs of nodes and their neighbors. We could add an additional dataset (or generate one) to emphasize this in our benchmarks.
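As a reminder of why degree and ebunch size dominate the cost, here is a minimal pure-Python sketch of the Jaccard coefficient for one node pair: the size of the shared neighborhood divided by the size of the combined neighborhood. The adjacency dict and the helper name `jaccard` are assumptions for illustration; this is not the nx-cugraph or PLC implementation.

```python
def jaccard(adj, u, v):
    """Jaccard coefficient of u and v, given a dict mapping node -> neighbor set."""
    union = adj[u] | adj[v]
    if not union:
        # Like NetworkX, report 0 when neither node has any neighbors.
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


# Toy undirected graph: triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

# One set intersection and one union per pair, so the work scales with the
# number of pairs and the degrees of the nodes in each pair.
scores = {(u, v): jaccard(adj, u, v) for u, v in [(0, 1), (0, 3)]}
print(scores)
```

Each pair only touches the two neighbor sets involved, which is consistent with the observation that overall graph size matters less than average degree and the number of pairs requested.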

What's the procedure for updating the benchmark docs?

@nv-rliu are there docs (README, something else) on updating the benchmark docs?

nv-rliu · 2025-01-28T09:56:44Z

@nv-rliu are there docs (README, something else) on updating the benchmark docs?

There aren't any new docs on our side for updating the benchmark numbers in the table. The process would be to update bench_algos.py to contain the benchmark we want to include, and then repeat the same process with Perflab and specify that we want results for the new algo (i.e. Jaccard) and then once we get those we can include them in the table. Once this is merged, I can get right on that.

eriknw · 2025-01-28T12:01:07Z

nx_cugraph/classes/graph.py

+            valids = [isinstance(n, int) and n >= 0 and n < self._N for n in nodes]
+            if not all(valids):
+                raise ValueError(nodes[valids.index(False)])


Heh, right, values like 4.5 are also invalid.

There are many types of integers floating around so isinstance(n, int) may be inadequate, and isinstance(n, numbers.Integral) is slow, so here's an alternative:

    N = self._N
    for node in nodes:
        try:
            node = index(node)  # Ensure integral value
        except TypeError:
            raise KeyError(node) from None
        if node < 0 or node >= N:
            raise KeyError(node)

where we import index from operator.

This is a little more strict than NetworkX

    In [2]: G = nx.complete_graph(10)
    In [3]: 1 in G
    Out[3]: True
    In [4]: 1.0 in G
    Out[4]: True

so another alternative could be

    N = self._N
    for node in nodes:
        try:
            n = int(node)
        except TypeError:
            raise KeyError(node) from None
        if n != node or n < 0 or n >= N:
            raise KeyError(node)

Also, what do you think of making this verification opt-in by adding a keyword argument to the method? I'm conflicted, b/c I like performance. We have sometimes played a little fast and loose in the name of performance, and we may not catch every invalid node that NetworkX would catch. This has been somewhat intentional: functions in networkx don't have consistent behavior, so I would want to have e.g. randomized tests like Ross talked about to stress both networkx and networkx backends for exceptional behavior. OTOH, it's probably safest to do this check here.
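The two validation styles discussed above can be compared side by side in a standalone sketch. The function names (`check_strict`, `check_lenient`, `is_valid`) are hypothetical and do not exist in nx-cugraph. The lenient variant here also catches `ValueError` and `OverflowError`, which is an addition beyond the snippet above, so that inputs such as `"abc"` or `float("inf")` surface as `KeyError` as well.

```python
from operator import index


def check_strict(nodes, n_nodes):
    """Accept only true integral types (int, NumPy integers, ...) in [0, n_nodes)."""
    for node in nodes:
        try:
            idx = index(node)  # raises TypeError for floats, strings, None, ...
        except TypeError:
            raise KeyError(node) from None
        if idx < 0 or idx >= n_nodes:
            raise KeyError(node)


def check_lenient(nodes, n_nodes):
    """Also accept values equal to an integer, such as 1.0, as NetworkX does."""
    for node in nodes:
        try:
            idx = int(node)
        except (TypeError, ValueError, OverflowError):
            raise KeyError(node) from None
        if idx != node or idx < 0 or idx >= n_nodes:
            raise KeyError(node)


def is_valid(check, nodes, n_nodes):
    """Helper for demonstration: True if the check raises no KeyError."""
    try:
        check(nodes, n_nodes)
    except KeyError:
        return False
    return True


print(is_valid(check_strict, [0, 1.0], 10))   # strict rejects 1.0
print(is_valid(check_lenient, [0, 1.0], 10))  # lenient accepts 1.0
```

Both variants reject 4.5, negative values, and values at or above the node count. They differ only in whether a float that equals an integer is accepted.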

eriknw · 2025-01-28T16:10:33Z

Maybe a silly question, but can plc.all_pairs_jaccard_coefficients be useful here? How is it different from plc.jaccard_coefficients?

rlratzel added 7 commits January 8, 2025 18:29

Initial commit, still running tests.

9513ed1

Removed docstring to avoid having it as an admonition in generated docs.

69a63ba

NX tests passing.

8958183

Updates comment.

9c7d048

Updates comment.

9ceb0af

Adds initial benchmark for Jaccard.

6eadc55

Updates and adds comments.

6e4b2bc

rlratzel requested a review from eriknw January 9, 2025 17:30

rlratzel self-assigned this Jan 9, 2025

rlratzel requested a review from a team as a code owner January 9, 2025 17:30

github-actions bot added the benchmarks label Jan 9, 2025

rlratzel added the improvement and non-breaking labels Jan 9, 2025

nv-rliu approved these changes Jan 9, 2025


nv-rliu approved these changes Jan 13, 2025


eriknw requested changes Jan 21, 2025


rlratzel added 2 commits January 23, 2025 10:46

Updates code for computing pairs when ebunch is None and check for valid nodes, updates comments and FIXMEs.

f17e162

Removes passing arg with default value.

893b6bc

eriknw approved these changes Jan 28, 2025


rlratzel added 2 commits January 28, 2025 01:21

Merge remote-tracking branch 'upstream/branch-25.02' into branch-25.02-jaccard

34fe191

Updates ebunch node check, adds test for valid ebunch.

8040fd0

eriknw reviewed Jan 28, 2025


eriknw mentioned this pull request Jan 28, 2025

Add jaccard_coefficient algorithm to nx-cugraph rapidsai/cugraph-docs#87

Open
