
Adds support for jaccard_coefficient #62

Open · rlratzel wants to merge 11 commits into base: branch-25.02

Conversation

@rlratzel (Contributor) commented Jan 9, 2025

Adds support for jaccard_coefficient to nx-cugraph.

This includes a test, but it relies largely on the existing test coverage provided by NetworkX. The test added here could (should) be submitted to NetworkX in a separate PR, since it does not cover anything unique to nx-cugraph.

A benchmark is also included, with results showing a 2-4X speedup. I've seen a much larger speedup on a different graph (a large movie-review bipartite graph: 966s for NetworkX vs. 2s for nx-cugraph, roughly 500X), so I need to investigate further. That investigation need not block merging this PR, though.

[image: benchmark results]

@rlratzel rlratzel requested a review from eriknw January 9, 2025 17:30
@rlratzel rlratzel self-assigned this Jan 9, 2025
@rlratzel rlratzel requested a review from a team as a code owner January 9, 2025 17:30
@rlratzel rlratzel added labels: improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change) Jan 9, 2025
@nv-rliu (Contributor) left a comment:

lgtm!

Comment on lines 29 to 34:

```python
if ebunch is None:
    # FIXME: is there a more efficient way to do this (on GPU or
    # otherwise)?
    ebunch = list(nx.non_edges(G))
if not ebunch:
    return iter([])
```
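For context, here is a pure-Python sketch of what that fallback materializes: every unordered node pair that is not an edge. The adjacency dict is a made-up 4-node graph for illustration, not code from this PR.

```python
# Pure-Python illustration of what list(nx.non_edges(G)) produces:
# every unordered pair of nodes with no edge between them.
# Made-up 4-node graph with edges (0, 1) and (2, 3).
from itertools import combinations

adj = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
non_edges = [(u, v) for u, v in combinations(adj, 2) if v not in adj[u]]
print(non_edges)  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```

Note that this enumerates O(N^2) pairs in Python, which is why the FIXME asks about a more efficient (GPU or otherwise) approach for large graphs.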
Contributor:
Here's an alternative:

Suggested change:

```diff
-if ebunch is None:
-    # FIXME: is there a more efficient way to do this (on GPU or
-    # otherwise)?
-    ebunch = list(nx.non_edges(G))
-if not ebunch:
-    return iter([])
+G = _to_undirected_graph(G)
+if ebunch is None:
+    A = cp.tri(G._N, G._N, dtype=bool)
+    A[G.src_indices, G.dst_indices] = True
+    src_indices, dst_indices = cp.nonzero(~A)
+    if src_indices.size == 0:
+        return iter([])
+    src_indices = src_indices.astype(index_dtype)
+    dst_indices = dst_indices.astype(index_dtype)
```

There are other variations and alternatives; for example, you could use a mask:

```python
mask = G.src_indices < G.dst_indices
A[G.src_indices[mask], G.dst_indices[mask]] = True
```
Observe that these go straight to indices, so there is no need to call e.g. G._list_to_nodearray(src_indices) to do remapping from keys to indices.

Also, if possible, I would advise against using the graph object (such as calling nx.non_edges(G)) before converting the graph via e.g. G = _to_undirected_graph(G). I think it's best to make fewer assumptions about the input.
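To see why the tri-based suggestion works, here is a minimal sketch using NumPy in place of CuPy (np.tri and np.nonzero have the same semantics as cp.tri and cp.nonzero). The 4-node graph and its src/dst arrays are made up for illustration.

```python
# Sketch of the tri-mask approach to enumerating non-edges.
# Graph: 4 nodes, undirected edges (0, 1) and (2, 3).
import numpy as np

N = 4
src_indices = np.array([0, 2])  # one direction per edge is enough here
dst_indices = np.array([1, 3])

# The lower triangle (including the diagonal) is True, so ~A keeps only
# strictly-upper-triangular pairs, i.e. each unordered pair (u, v), u < v, once.
A = np.tri(N, N, dtype=bool)
A[src_indices, dst_indices] = True   # mark existing edges as "not a non-edge"
non_src, non_dst = np.nonzero(~A)

non_edges = list(zip(non_src.tolist(), non_dst.tolist()))
print(non_edges)  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```

Everything stays in index space on arrays, which is what makes the key-to-index remapping unnecessary.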

@rlratzel (Contributor, Author) replied:

ok thanks, I'm going to benchmark and apply one of these suggestions (if they're faster, which I'm assuming they are).

I'm a bit confused by this statement though:

> I would advise against using the graph object (such as calling nx.non_edges(G)) before converting the graph via e.g. G = _to_undirected_graph(G). I think it's best to make fewer assumptions about the input.

If I need to call a NX function like nx.non_edges(), why would I first convert the input to a CudaGraph using _to_undirected_graph()? Shouldn't we be assuming the input is a compatible NX Graph type, or are we allowing CudaGraph objects to be passed as well? I suspect I'm missing something. (side note: this is ideal info for a docstring, but I believe the decorator bubbles them up to the corresponding nx function docstring, which I wouldn't want here.)

[Eight resolved review threads on nx_cugraph/algorithms/link_prediction.py, most marked outdated.]
@eriknw (Contributor) left a comment:

LGTM! Thanks for the updates. I left one minor suggestion.

Did you ever dig into understanding the benchmark performance? What's the procedure for updating the benchmark docs?

Adding jaccard makes me want to also add adamic_adar_index and preferential_attachment. They should be easy to implement, and they show up in examples and learning material, though they're not commonly used by networkx dependents (but neither is jaccard!).

Comment on lines +1013 to +1016:

```python
# FIXME: the SGGraph constructor arg "symmetrize" will perform all
# symmetrization steps required by libcugraph. The edge_array check
# should be kept, but all other code in this `if` block should be
# removed if possible.
```

Contributor:

symmetrize= was added in 24.10 (rapidsai/cugraph#4649), so I think it makes a lot of sense to investigate using it and removing some code. Note that symmetrize here can be "union" or "intersection", but I think PLC only does "union", so we'd still need virtually all of the code here. Perhaps we could use _get_int_dtype to determine which dtype to cast to in order to make this more efficient. I'm also curious how the performance of this code compares to symmetrizing in PLC.
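For reference, "union" symmetrization means each edge ends up present in both directions. A hedged NumPy sketch of the idea (symmetrize_union is an illustrative name, not the nx-cugraph or PLC API):

```python
# Illustrative "union" symmetrization of an edge list: take the union of
# (u, v) and (v, u) for every input edge, deduplicating pairs that were
# already stored in both directions.
import numpy as np

def symmetrize_union(src, dst):
    """Return src/dst arrays containing each edge in both directions."""
    all_src = np.concatenate([src, dst])
    all_dst = np.concatenate([dst, src])
    # Deduplicate (u, v) pairs; np.unique sorts rows lexicographically.
    pairs = np.unique(np.stack([all_src, all_dst], axis=1), axis=0)
    return pairs[:, 0], pairs[:, 1]

src = np.array([0, 1, 1])
dst = np.array([1, 0, 2])
s, d = symmetrize_union(src, dst)
print(list(zip(s.tolist(), d.tolist())))  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

An "intersection" variant would instead keep only pairs present in both directions in the input; PLC's symmetrize (per the comment above) appears to cover only the union case.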

```python
# checked. If not done, plc.jaccard_coefficients() will accept node IDs
# not in the graph and return a coefficient of 0 for them, which is not
# compatible with NX.
if (not hasattr(G, "key_to_id") or G.key_to_id is None) and (
```

Contributor:

No need for hasattr here; G is converted above, and we use this convention (converting to CudaGraph) heavily throughout the code.

Suggested change:

```diff
-if (not hasattr(G, "key_to_id") or G.key_to_id is None) and (
+if G.key_to_id is None and (
```

@rlratzel (Contributor, Author) replied:

I ended up refactoring the ebunch node check in order to pass some tests added to ensure we behave like NX. The change consolidates the additional valid node checks in _list_to_nodearray, but let me know if you see any issues.

@rlratzel (Contributor, Author) commented:

> Did you ever dig into understanding the benchmark performance? What's the procedure for updating the benchmark docs?

IIRC, nx-cugraph's speedup over NX increases as average degree and the size of ebunch increase; graph size doesn't come into play, since Jaccard only compares arbitrary pairs of nodes and their neighbors. We could add a dataset (or generate one) to emphasize this in our benchmarks.
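To illustrate the point about graph size: the Jaccard coefficient of a pair depends only on the two endpoints' neighbor sets, J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|. A pure-Python sketch on a made-up adjacency dict (not code from this PR):

```python
# Jaccard coefficient for one node pair. Only the two neighbor sets
# matter, so the cost per pair scales with degree, not with graph size.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

def jaccard(adj, u, v):
    nu, nv = adj[u], adj[v]
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

print(jaccard(adj, 0, 3))  # |{1}| / |{1, 2}| = 0.5
```

This is why the benchmark-relevant knobs are average degree and the number of pairs in ebunch.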

> What's the procedure for updating the benchmark docs?

@nv-rliu are there docs (README, something else) on updating the benchmark docs?

@nv-rliu (Contributor) commented Jan 28, 2025:

> @nv-rliu are there docs (README, something else) on updating the benchmark docs?

There aren't any docs on our side for updating the benchmark numbers in the table. The process is to update bench_algos.py to contain the benchmark we want to include, then repeat the same process with Perflab, specifying that we want results for the new algo (i.e., Jaccard). Once we get those results, we can include them in the table. Once this is merged, I can get right on that.

Comment on lines +1156 to +1158:

```python
valids = [isinstance(n, int) and n >= 0 and n < self._N for n in nodes]
if not all(valids):
    raise ValueError(nodes[valids.index(False)])
```

Contributor:

Heh, right, values like 4.5 are also invalid.

There are many types of integers floating around so isinstance(n, int) may be inadequate, and isinstance(n, numbers.Integral) is slow, so here's an alternative

```python
N = self._N
for node in nodes:
    try:
        node = index(node)  # Ensure integral value
    except TypeError:
        raise KeyError(node) from None
    if node < 0 or node >= N:
        raise KeyError(node)
```

where we import index from the operator module.

This is a little more strict than NetworkX:

```python
In [2]: G = nx.complete_graph(10)

In [3]: 1 in G
Out[3]: True

In [4]: 1.0 in G
Out[4]: True
```

so another alternative could be:

```python
N = self._N
for node in nodes:
    try:
        n = int(node)
    except TypeError:
        raise KeyError(node) from None
    if n != node or n < 0 or n >= N:
        raise KeyError(node)
```

Also, what do you think of making this verification opt-in by adding a keyword argument to the method? I'm conflicted, b/c I like performance. We have sometimes played a little fast and loose in the name of performance, and we may not catch every invalid node that NetworkX would catch. This has been somewhat intentional: functions in networkx don't have consistent behavior, so I would want to have e.g. randomized tests like Ross talked about to stress both networkx and networkx backends for exceptional behavior. OTOH, it's probably safest to do this check here.
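For a concrete feel for the operator.index approach discussed above, here is a self-contained sketch (validate_node is an illustrative wrapper, not code from this PR):

```python
# operator.index accepts true integral types (including NumPy ints via
# __index__) but raises TypeError for floats like 4.5, whereas int()
# would silently truncate them. Out-of-range IDs are rejected separately.
from operator import index

def validate_node(node, N):
    """Raise KeyError for anything that is not an in-range integral node ID."""
    try:
        n = index(node)  # TypeError for non-integral types such as 4.5
    except TypeError:
        raise KeyError(node) from None
    if n < 0 or n >= N:
        raise KeyError(node)
    return n

print(validate_node(3, 10))  # 3
for bad in (4.5, -1, 10):
    try:
        validate_node(bad, 10)
    except KeyError as e:
        print("rejected:", e.args[0])  # rejected: 4.5 / -1 / 10
```

The int()-based alternative above trades a little strictness (1.0 passes, matching NetworkX's `1.0 in G` behavior) for the same range safety.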

@eriknw (Contributor) commented Jan 28, 2025:

Maybe a silly question, but can plc.all_pairs_jaccard_coefficients be useful here? How is it different from plc.jaccard_coefficients?

Labels: benchmarks, improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
3 participants