Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev/dev skrub #626

Open
wants to merge 58 commits into
base: master
Choose a base branch
from
Open

Dev/dev skrub #626

wants to merge 58 commits into from

Conversation

lmeyerov
Copy link
Contributor

@lmeyerov lmeyerov commented Dec 31, 2024

Updating from dirty_cat to skrub:

Breaking

  • Change from dirty-cat to skrub:
    • new SuperVectorizer -> TableVectorizer interface
  • pip install graphistry[umap-learn] and pip install graphistry[ai] are now python 3.9+ (was 3.8+)
  • Plottable's _node_dbscan / _edge_dbscan are now _dbscan_nodes / _dbscan_edges

Feat

  • featurize/umap transform(): Ensure helper transforms get same feature_cols_in
  • numpy 2 support
  • more of umap, dbscan, featurize fields are tracked in Plottable

Infra

  • [umap-learn] install: Replace dirty-cat with skrub, unpin scikit-learn
  • [umap-learn] + [ai] unpin deps - scikit, scipy, torch (now 2), etc
  • Add optional rapids setup.py target

Refactor

  • Move more type models to models/compute/{feature,umap,cluster}
  • Turn more print => logger

Tests

  • Stop ignoring warnings in featurize and umap
  • python version tests use corresponding python version for mypy
  • umap tests: py 3.8, 3.9 => 3.9..3.12
  • ai tests: py 3.8, 3.9 => 3.9..3.12
  • plugin tests check for module imports

Fixes

  • GPU AI pathways work more, and stay longer on-gpu
  • Remove lint/type ignores and fix root causes

WIP

  • GPU CI


self.g2 = g2
fenc = g2._node_encoder
self.X, self.Y = fenc.X, fenc.y
self.EMB = g2._node_embedding
self.emb, self.x, self.y = g2.transform_umap(
ndf_reddit, ndf_reddit, kind="nodes", return_graph=False
ndf_reddit, ndf_reddit[['label', 'type']], kind="nodes", return_graph=False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be needed...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i'll back out, and flip to warning

)
self.g3 = g2.transform_umap(
ndf_reddit, ndf_reddit, kind="nodes", return_graph=True
ndf_reddit, ndf_reddit[['label', 'type']], kind="nodes", return_graph=True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that scrub doesn't like the extra columns? I am un clear on where this happens later in pipeline.

setup.py Outdated Show resolved Hide resolved
@lmeyerov
Copy link
Contributor Author

lmeyerov commented Jan 5, 2025

Our setup.py had, for [test], a dirty-cat related pin on scipy/sklearn that was causing a lot of the confusion here, removing makes the below go away


This may be just requiring py3.10+, not py3.9+, for umap


umaplearn -> sklearn -> scipy seems to fail on missing coo_matrix.A on call simplicial_set = normalize(simplicial_set, norm="max")

It seems scipy started deprecating/moving A around then, and unclear how sklearn works around that

self = <test_umap_utils.TestUMAPMethods testMethod=test_edge_umap>

    @pytest.mark.skipif(not has_umap, reason="requires umap feature dependencies")
    def test_edge_umap(self):
        g = graphistry.edges(triangleEdges, "src", "dst")
        use_cols = [edge_ints, edge_floats, edge_numeric]
        targets = [edge_target]
>       self._test_umap(
            g,
            use_cols=use_cols,
            targets=targets,
            name="Edge UMAP with `(target, use_col)=`",
            kind="edges",
            df=triangleEdges,
        )

graphistry/tests/test_umap_utils.py:4[60](https://github.com/graphistry/pygraphistry/actions/runs/12616887167/job/35158605081?pr=626#step:8:61): 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
graphistry/tests/test_umap_utils.py:404: in _test_umap
    g2 = g.umap(
graphistry/umap_utils.py:729: in umap
    res = res._process_umap(
graphistry/umap_utils.py:465: in _process_umap
    emb = res._umap_fit_transform(X_, y_, umap_fit_kwargs, umap_transform_kwargs)
graphistry/umap_utils.py:342: in _umap_fit_transform
    self.umap_fit(X, y, umap_fit_kwargs)
graphistry/umap_utils.py:319: in umap_fit
    self._umap.fit(X, y, **umap_fit_kwargs)
pygraphistry/lib/python3.11/site-packages/umap/umap_.py:2711: in fit
    self.graph_ = discrete_metric_simplicial_set_intersection(
pygraphistry/lib/python3.11/site-packages/umap/umap_.py:855: in discrete_metric_simplicial_set_intersection
    return reset_local_connectivity(simplicial_set)
pygraphistry/lib/python3.11/site-packages/umap/umap_.py:767: in reset_local_connectivity
    simplicial_set = normalize(simplicial_set, norm="max")
pygraphistry/lib/python3.11/site-packages/sklearn/utils/_param_validation.py:214: in wrapper
    return func(*args, **kwargs)
pygraphistry/lib/python3.11/site-packages/sklearn/preprocessing/_data.py:18[63](https://github.com/graphistry/pygraphistry/actions/runs/12616887167/job/35158605081?pr=626#step:8:64): in normalize
    mins, maxes = min_max_axis(X, 1)
pygraphistry/lib/python3.11/site-packages/sklearn/utils/sparsefuncs.py:512: in min_max_axis
    return _sparse_min_max(X, axis=axis)
pygraphistry/lib/python3.11/site-packages/sklearn/utils/sparsefuncs.py:472: in _sparse_min_max
    _sparse_min_or_max(X, axis, np.minimum),
pygraphistry/lib/python3.11/site-packages/sklearn/utils/sparsefuncs.py:4[65](https://github.com/graphistry/pygraphistry/actions/runs/12616887167/job/35158605081?pr=626#step:8:66): in _sparse_min_or_max
    return _min_or_max_axis(X, axis, min_or_max)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

X = <Compressed Sparse Row sparse matrix of dtype 'float32'
	with 24 stored elements and shape (12, 12)>
axis = 1, min_or_max = <ufunc 'minimum'>

    def _min_or_max_axis(X, axis, min_or_max):
        N = X.shape[axis]
        if N == 0:
            raise ValueError("zero-size array to reduction operation")
        M = X.shape[1 - axis]
        mat = X.tocsc() if axis == 0 else X.tocsr()
        mat.sum_duplicates()
        major_index, value = _minor_reduce(mat, min_or_max)
        not_full = np.diff(mat.indptr)[major_index] < N
        value[not_full] = min_or_max(value[not_full], 0)
        mask = value != 0
        major_index = np.compress(mask, major_index)
        value = np.compress(mask, value)
    
        if axis == 0:
            res = sp.coo_matrix(
                (value, (np.zeros(len(value)), major_index)), dtype=X.dtype, shape=(1, M)
            )
        else:
            res = sp.coo_matrix(
                (value, (major_index, np.zeros(len(value)))), dtype=X.dtype, shape=(M, 1)
            )
>       return res.A.ravel()
E       AttributeError: 'coo_matrix' object has no attribute 'A'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants