Dev/dev skrub #626

lmeyerov · 2024-12-31T10:44:41Z

Updating from dirty_cat to skrub:

Breaking

Change from dirty-cat to skrub:
- new SuperVectorizer -> TableVectorizer interface
pip install graphistry[umap-learn] and pip install graphistry[ai] are now python 3.9+ (was 3.8+)
Plottable's _node_dbscan / _edge_dbscan are now _dbscan_nodes / _dbscan_edges
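
For downstream code that reads these private fields and has to run against releases from before and after the rename, a small compatibility accessor is one option. This is only a sketch: `get_dbscan_model` and the `SimpleNamespace` stand-ins are hypothetical and not part of graphistry; only the four field names come from the notes above.

```python
from types import SimpleNamespace

# New field name -> old field name, taken from the Breaking notes above.
OLD_NAME_FOR = {
    "_dbscan_nodes": "_node_dbscan",
    "_dbscan_edges": "_edge_dbscan",
}


def get_dbscan_model(plottable, kind="nodes"):
    """Hypothetical helper: read the DBSCAN field under its new or old name."""
    if kind not in ("nodes", "edges"):
        raise ValueError(f"kind must be 'nodes' or 'edges', got {kind!r}")
    new_name = f"_dbscan_{kind}"
    # Prefer the new name, then fall back to the pre-rename name.
    for name in (new_name, OLD_NAME_FOR[new_name]):
        if hasattr(plottable, name):
            return getattr(plottable, name)
    return None


# Stand-in objects for illustration only; real code would pass a Plottable.
old_style = SimpleNamespace(_node_dbscan="old-node-model")
new_style = SimpleNamespace(_dbscan_nodes="new-node-model")
```

Since these are underscore-prefixed fields, relying on them at all is fragile; the helper only localizes that risk to one place.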

Feat

featurize/umap transform(): Ensure helper transforms get same feature_cols_in
numpy 2 support
more of umap, dbscan, featurize fields are tracked in Plottable

Infra

[umap-learn] install: Replace dirty-cat with skrub, unpin scikit-learn
[umap-learn] + [ai] unpin deps - scikit, scipy, torch (now 2), etc
Add optional rapids setup.py target
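
Replacing dirty-cat with skrub changes both the package name and the class name (SuperVectorizer becomes TableVectorizer). A rough sketch of how calling code could resolve whichever one is installed follows; `resolve_table_vectorizer` is a hypothetical helper, not something this PR adds, and it assumes only that `skrub.TableVectorizer` and the older `dirty_cat.SuperVectorizer` are importable when their packages are present.

```python
import importlib


def resolve_table_vectorizer():
    """Return (package_name, vectorizer_class), or (None, None) if neither is installed."""
    # Try skrub first, then the older dirty_cat name.
    candidates = [("skrub", "TableVectorizer"), ("dirty_cat", "SuperVectorizer")]
    for module_name, class_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        cls = getattr(module, class_name, None)
        if cls is not None:
            return module_name, cls
    return None, None


backend, Vectorizer = resolve_table_vectorizer()
print(f"table vectorizer backend: {backend}")
```

The two classes do not take identical constructor arguments, so a real migration would also need to review the keyword arguments passed at each call site.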

Refactor

Move more type models to models/compute/{feature,umap,cluster}
Turn more print => logger
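
As an illustration of the print => logger change, here is a minimal sketch using the standard `logging` module. The function name and message are made up for the example and do not come from the PR.

```python
import logging

# One module-level logger per file is the usual convention.
logger = logging.getLogger(__name__)


def describe_features(n_rows, n_cols):
    # Before: print(f"Featurized {n_rows} rows into {n_cols} columns")
    # After: %-style arguments are only formatted if DEBUG is enabled.
    logger.debug("Featurized %d rows into %d columns", n_rows, n_cols)
    return n_rows, n_cols
```

Callers can then opt in to the output with `logging.basicConfig(level=logging.DEBUG)` instead of getting unconditional prints.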

Tests

Stop ignoring warnings in featurize and umap
python version tests use corresponding python version for mypy
umap tests: py 3.8, 3.9 => 3.9..3.12
ai tests: py 3.8, 3.9 => 3.9..3.12
plugin tests check for module imports
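
One way to make plugin tests conditional on module imports is to probe for the module and skip when it is absent. The sketch below uses only the standard library (`importlib.util.find_spec` and `unittest.skipUnless`); the PR's own tests use pytest's `skipif` with a `has_umap` flag, and the `has_module` helper and test class here are hypothetical.

```python
import importlib.util
import unittest


def has_module(name):
    """Return True if `name` can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised for dotted names whose parent package is missing,
        # or for modules left in an unusual state.
        return False


has_umap = has_module("umap")


class TestUmapPlugin(unittest.TestCase):
    @unittest.skipUnless(has_umap, "requires umap feature dependencies")
    def test_import(self):
        import umap  # noqa: F401
```

Locating a module does not guarantee that importing it succeeds, so a stricter check would attempt the import inside a try/except.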

Fixes

More GPU AI pathways work, and data stays on-GPU longer
Remove lint/type ignores and fix root causes

WIP

GPU CI

silkspace · 2024-12-31T19:59:20Z

graphistry/tests/test_umap_utils.py


        self.g2 = g2
        fenc = g2._node_encoder
        self.X, self.Y = fenc.X, fenc.y
        self.EMB = g2._node_embedding
        self.emb, self.x, self.y = g2.transform_umap(
-            ndf_reddit, ndf_reddit, kind="nodes", return_graph=False
+            ndf_reddit, ndf_reddit[['label', 'type']], kind="nodes", return_graph=False


This should not be needed...

lmeyerov (reply): Yeah, I'll back this out and flip it to a warning.

silkspace · 2024-12-31T20:00:05Z

graphistry/tests/test_umap_utils.py

        )
        self.g3 = g2.transform_umap(
-            ndf_reddit, ndf_reddit, kind="nodes", return_graph=True
+            ndf_reddit, ndf_reddit[['label', 'type']], kind="nodes", return_graph=True


It seems that skrub doesn't like the extra columns? I am unclear on where this happens later in the pipeline.
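
The idea discussed in this thread (accept a batch with extra columns and warn, rather than making callers pre-trim to `['label', 'type']`) could look roughly like the sketch below. It uses plain dict rows instead of DataFrames to stay dependency-free, `align_to_fit_columns` is a hypothetical name, and it is not the implementation in this PR.

```python
import warnings


def align_to_fit_columns(rows, fit_columns):
    """Keep only the columns seen at fit time, warning about any extras."""
    if not rows:
        return []
    seen = set().union(*(row.keys() for row in rows))
    # Columns the fitted encoder expects but the batch never provides.
    missing = [c for c in fit_columns if c not in seen]
    if missing:
        raise ValueError(f"batch is missing fit-time columns: {missing}")
    extra = sorted(seen - set(fit_columns))
    if extra:
        warnings.warn(f"ignoring columns not seen during fit: {extra}", stacklevel=2)
    # Preserve fit-time column order; absent cells become None.
    return [{c: row.get(c) for c in fit_columns} for row in rows]


batch = [{"label": "a", "type": "x", "title": "first post"}]
aligned = align_to_fit_columns(batch, ["label", "type"])
```

Warning on extras while raising on missing columns is a design choice for the sketch: extras can be dropped safely, whereas a missing fit-time column cannot be reconstructed.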

setup.py

lmeyerov · 2025-01-05T06:30:21Z

Our setup.py had, for [test], a dirty-cat-related pin on scipy/sklearn that was causing a lot of the confusion here; removing it makes the failure below go away.

This may just mean requiring py3.10+, not py3.9+, for umap.

umap-learn -> sklearn -> scipy seems to fail on a missing coo_matrix.A attribute in the call simplicial_set = normalize(simplicial_set, norm="max").

It seems scipy started deprecating/moving A around then, and it is unclear how sklearn works around that.

Failing test run (below): https://github.com/graphistry/pygraphistry/actions/runs/12616887167/job/35158605081?pr=626
Works when run with:
- scipy<1.14.0 => scipy-1.13.1 (May 2024) https://github.com/scipy/scipy/releases/tag/v1.14.0
- => scikit-learn-1.3.2 (Oct 2023)

self = <test_umap_utils.TestUMAPMethods testMethod=test_edge_umap>

    @pytest.mark.skipif(not has_umap, reason="requires umap feature dependencies")
    def test_edge_umap(self):
        g = graphistry.edges(triangleEdges, "src", "dst")
        use_cols = [edge_ints, edge_floats, edge_numeric]
        targets = [edge_target]
>       self._test_umap(
            g,
            use_cols=use_cols,
            targets=targets,
            name="Edge UMAP with `(target, use_col)=`",
            kind="edges",
            df=triangleEdges,
        )

graphistry/tests/test_umap_utils.py:460: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
graphistry/tests/test_umap_utils.py:404: in _test_umap
    g2 = g.umap(
graphistry/umap_utils.py:729: in umap
    res = res._process_umap(
graphistry/umap_utils.py:465: in _process_umap
    emb = res._umap_fit_transform(X_, y_, umap_fit_kwargs, umap_transform_kwargs)
graphistry/umap_utils.py:342: in _umap_fit_transform
    self.umap_fit(X, y, umap_fit_kwargs)
graphistry/umap_utils.py:319: in umap_fit
    self._umap.fit(X, y, **umap_fit_kwargs)
pygraphistry/lib/python3.11/site-packages/umap/umap_.py:2711: in fit
    self.graph_ = discrete_metric_simplicial_set_intersection(
pygraphistry/lib/python3.11/site-packages/umap/umap_.py:855: in discrete_metric_simplicial_set_intersection
    return reset_local_connectivity(simplicial_set)
pygraphistry/lib/python3.11/site-packages/umap/umap_.py:767: in reset_local_connectivity
    simplicial_set = normalize(simplicial_set, norm="max")
pygraphistry/lib/python3.11/site-packages/sklearn/utils/_param_validation.py:214: in wrapper
    return func(*args, **kwargs)
pygraphistry/lib/python3.11/site-packages/sklearn/preprocessing/_data.py:1863: in normalize
    mins, maxes = min_max_axis(X, 1)
pygraphistry/lib/python3.11/site-packages/sklearn/utils/sparsefuncs.py:512: in min_max_axis
    return _sparse_min_max(X, axis=axis)
pygraphistry/lib/python3.11/site-packages/sklearn/utils/sparsefuncs.py:472: in _sparse_min_max
    _sparse_min_or_max(X, axis, np.minimum),
pygraphistry/lib/python3.11/site-packages/sklearn/utils/sparsefuncs.py:465: in _sparse_min_or_max
    return _min_or_max_axis(X, axis, min_or_max)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

X = <Compressed Sparse Row sparse matrix of dtype 'float32'
	with 24 stored elements and shape (12, 12)>
axis = 1, min_or_max = <ufunc 'minimum'>

    def _min_or_max_axis(X, axis, min_or_max):
        N = X.shape[axis]
        if N == 0:
            raise ValueError("zero-size array to reduction operation")
        M = X.shape[1 - axis]
        mat = X.tocsc() if axis == 0 else X.tocsr()
        mat.sum_duplicates()
        major_index, value = _minor_reduce(mat, min_or_max)
        not_full = np.diff(mat.indptr)[major_index] < N
        value[not_full] = min_or_max(value[not_full], 0)
        mask = value != 0
        major_index = np.compress(mask, major_index)
        value = np.compress(mask, value)
    
        if axis == 0:
            res = sp.coo_matrix(
                (value, (np.zeros(len(value)), major_index)), dtype=X.dtype, shape=(1, M)
            )
        else:
            res = sp.coo_matrix(
                (value, (major_index, np.zeros(len(value)))), dtype=X.dtype, shape=(M, 1)
            )
>       return res.A.ravel()
E       AttributeError: 'coo_matrix' object has no attribute 'A'
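
The failing line in the traceback is `res.A.ravel()`. In code you control, the same result can be had through `.toarray()`, which scipy sparse matrices provide, instead of the `.A` shorthand. The helper below is a hypothetical sketch of that substitution; here the failing call sits inside scikit-learn, so the practical fix in this thread is choosing compatible scipy and scikit-learn versions rather than patching.

```python
def dense_ravel(sparse_like):
    """Flatten a matrix-like object to 1-D without requiring the `.A` shorthand."""
    if hasattr(sparse_like, "toarray"):
        # scipy sparse matrices and arrays expose toarray().
        dense = sparse_like.toarray()
    else:
        # Fallback for objects that only expose `.A`, such as numpy.matrix.
        dense = sparse_like.A
    return dense.ravel()
```

With scipy installed, `dense_ravel(scipy.sparse.coo_matrix(...))` would take the `toarray()` branch and never touch `.A`.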

lmeyerov added 7 commits December 31, 2024 02:37

feat(skrub): upgrade from dirty_cat

c0db6bb

fix(umap): transform drop y from X

068491b

infra(ci): gha does not support py3.14

08a913c

infra(ci): remove py3.13 as sklearn gha not ready

e1ebc1c

infra(gha py): remove 3.12 bc torch < 2

2ea2824

infra(ci): typecheck use appropriate py version

8e44774

infra(py): unpin torch for py3.12

7fdef80

silkspace reviewed Dec 31, 2024

lmeyerov added 9 commits December 31, 2024 20:09

garden(featurize): types

c3431e4

refactor(FastEncoder): non-null y

f9767a8

wip(transform): match batch X y to train X y

527c61f

fix(feat): backout incorrect feat standaridzations

fa09511

infra(typecheck): handle py x.y.z version formats

e17440e

infra(umap testers): full pytest overriding

b34d525

refactor(feat test): split

237bd35

refactor(transform param names): trained -> fit

d6cd846

fix(featurize): ydf

60177c5

lmeyerov mentioned this pull request Jan 1, 2025

[BUG] umap dirty_cat on colab #607

Open

lmeyerov added 4 commits January 4, 2025 20:00

feat(transform): skrub preconditioning

78badb2

feat(tests): do not ignore warnings

ca1cc22

fix(feat utils): python 3.9 typing

95c6a9b

fix(feat utils): python 3.9 typing

a12a4df

aucahuasi reviewed Jan 5, 2025

setup.py (outdated, resolved)

lmeyerov added 2 commits January 4, 2025 21:25

fix(umap): scipy 1.15 breakage?

c35e79f

fix(umap): scipy 1.14 breakage?

4913f37

lmeyerov added 3 commits January 4, 2025 22:43

infra(scikit): increase minimum version for umap learn

9aa5857

fix(umap): require higher scipy

286d25e

infra(umap): py3.10+

b243f0f

lmeyerov added 30 commits January 5, 2025 00:39

fix(feat): edge cases

90c56fd

infra(ci): try reenabling py 3.9 for umap

75c56bd

infra(ci): try reenabling py3.8 for umap, ai

e59d444

infra(deps): umaps use of skrub means python 3.9+

bb73f30

fix(test): cuml umap engine arg pos

47083c7

infra(gpu ci): wip

06db67d

infra(setup.py): rapids

d696be6

infra(ci): swap dc with skrub

6889698

refactor(print): to logger

7d5a604

fix(test): plugin tests conditional on deps

2fd87c5

fix(dgl): pass gpu tests

2b340cd

refactor(ModelDict): move to models

d194783

refactor(graph kind): external model

ef91ba3

refactor(umap models): factor out

bf2d8a0

refactor(interfaces): umap field names, typed interfaces

0ca9af4

garden(deadcode): remove

0d7a4c5

fix(feat): gpu support

aa75872

fix(umap): gpu

04223ca

refactor(dbscan): new interfaces

aaa878b

fix(dbscan): gpu mode

49433be

refactor(umap tests): decouple

46179ab

fix(gpu mode): more ai paths

cc76655

infra(dgl): bin tester

86f224c

infra(ci): dgl

25f259a

fix(gpu runner): do not suppress feat tests

c0b543c

infra(ai tester): py3.10

5e37bcf

infra(gpu ci): update - wip

1564a05

security(gpu ci): require admin

f04c731

Merge branch 'master' into dev/dev-skrub

a4ac601

Merge branch 'master' into dev/dev-skrub

9a3a886
