-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds support for jaccard_coefficient
#62
base: branch-25.02
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
if ebunch is None: | ||
# FIXME: is there a more efficient way to do this (on GPU or | ||
# otherwise)? | ||
ebunch = list(nx.non_edges(G)) | ||
if not ebunch: | ||
return iter([]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an alternative:
if ebunch is None: | |
# FIXME: is there a more efficient way to do this (on GPU or | |
# otherwise)? | |
ebunch = list(nx.non_edges(G)) | |
if not ebunch: | |
return iter([]) | |
G = _to_undirected_graph(G) | |
if ebunch is None: | |
A = cp.tri(G._N, G._N, dtype=bool) | |
A[G.src_indices, G.dst_indices] = True | |
src_indices, dst_indices = cp.nonzero(~A) | |
if src_indices.size == 0: | |
return iter([]) | |
src_indices = src_indices.astype(index_dtype) | |
dst_indices = dst_indices.astype(index_dtype) |
there are other variations and alternatives; for example, you could use a mask:
mask = G.src_indices < G.dst_indices
A[G.src_indices[mask], G.dst_indices[mask]] = True
Observe that these go straight to indices, so there is no need to call e.g. G._list_to_nodearray(src_indices)
to do remapping from keys to indices.
Also, if possible, I would advise against using the graph object (such as calling nx.non_edges(G)
) before converting the graph via e.g. G = _to_undirected_graph(G)
. I think it's best to make fewer assumptions about the input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok thanks, I'm going to benchmark and apply one of these suggestions (if they're faster, which I'm assuming they are).
I'm a bit confused by this statement though:
I would advise against using the graph object (such as calling nx.non_edges(G)) before converting the graph via e.g. G = _to_undirected_graph(G). I think it's best to make fewer assumptions about the input.
If I need to call a NX function like nx.non_edges()
, why would I first convert the input to a CudaGraph
using _to_undirected_graph()
? Shouldn't we be assuming the input is a compatible NX Graph
type, or are we allowing CudaGraph
objects to be passed as well? I suspect I'm missing something. (side note: this is ideal info for a docstring, but I believe the decorator bubbles them up to the corresponding nx function docstring, which I wouldn't want here.)
…lid nodes, updates comments and FIXMEs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the updates. I left one minor suggestion.
Did you ever dig into understanding the benchmark performance? What's the procedure for updating the benchmark docs?
Adding jaccard makes me want to also add adamic_adar_index
and preferential_attachment
(should be easy to implement, and they show up in examples and learning material, but they're not commonly used by networkx dependents--but neither is jaccard!).
# FIXME: the SGGraph constructor arg "symmetrize" will perform all | ||
# symmetrization steps required by libcugraph. The edge_array check | ||
# should be kept, but all other code in this `if` block should be | ||
# removed if possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
symmetrize=
was added in 24.10 here, rapidsai/cugraph#4649, so I think it makes a lot of sense to investigate using it and removing some code. Note that symmetrize
here can be "union"
and "intersection"
, but I think PLC only does "union"
, so we'd still need virtually all the code here. Perhaps we could use _get_int_dtype
to determine what dtype we should cast to to make this more efficient. I'm also curious how the performance of this code compares to symmetrizing in PLC.
# checked. If not done, plc.jaccard_coefficients() will accept node IDs | ||
# not in the graph and return a coefficient of 0 for them, which is not | ||
# compatible with NX. | ||
if (not hasattr(G, "key_to_id") or G.key_to_id is None) and ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for hasattr
here; G is converted above, and we use this convention (converting to CudaGraph) heavily throughout the code.
if (not hasattr(G, "key_to_id") or G.key_to_id is None) and ( | |
if G.key_to_id is None and ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up refactoring the ebunch
node check in order to pass some tests added to ensure we behave like NX. The change consolidates the additional valid node checks in _list_to_nodearray
, but let me know if you see any issues.
IIRC, nx-cugraph speedup over NX increases as average degree and the size of
@nv-rliu are there docs (README, something else) on updating the benchmark docs? |
There aren't any new docs on our side for updating the benchmark numbers in the table. The process would be to update |
valids = [isinstance(n, int) and n >= 0 and n < self._N for n in nodes] | ||
if not all(valids): | ||
raise ValueError(nodes[valids.index(False)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, right, values like 4.5 are also invalid.
There are many types of integers floating around so isinstance(n, int)
may be inadequate, and isinstance(n, numbers.Integral)
is slow, so here's an alternative
N = self._N
for node in nodes:
try:
node = index(node) # Ensure integral value
except TypeError:
raise KeyError(node) from None
if node < 0 or node >= N:
raise KeyError(node)
where we import index
from operators
.
This is a little more strict than NetworkX
In [2]: G = nx.complete_graph(10)
In [3]: 1 in G
Out[3]: True
In [4]: 1.0 in G
Out[4]: True
so another alternative could be
N = self._N
for node in nodes:
try:
n = int(node)
except TypeError:
raise KeyError(node) from None
if n != node or n < 0 or n >= N:
raise KeyError(node)
Also, what do you think of making this verification opt-in by adding a keyword argument to the method? I'm conflicted, b/c I like performance. We have sometimes played a little fast and loose in the name of performance, and we may not catch every invalid node that NetworkX would catch. This has been somewhat intentional: functions in networkx don't have consistent behavior, so I would want to have e.g. randomized tests like Ross talked about to stress both networkx and networkx backends for exceptional behavior. OTOH, it's probably safest to do this check here.
Maybe a silly question, but can |
Adds support for
jaccard_coefficient
to nx-cugraph.This includes a test, but relies largely on the existing test coverage provided by NetworkX. The test included here could (should) be submitted to NetworkX though in a separate PR, since it is not covering anything unique to nx-cugraph.
A benchmark is also included, with results showing 2-4X speedup. I've seen much, much larger speedup on a different graph (large movie review bipartite graph, showing 966s for NX, 2s for nx-cugraph = ~500X), so I need to investigate further. This investigation need not prevent this PR from being merged now though.