(fix): indexing performance in `backed` mode for `boolean` case #1233

ilan-gold · 2023-11-15T12:14:41Z

Here's my shot at it. As I explain in the comment (and in the issue), scipy converts a boolean index to integer indices, which is why the fix has to happen before the mtx access. The other option would be to "solve" the integer case, but as you point out, there are some nasty edge cases there with sorting, repeated indices, etc. So in the interest of getting something out, I have this.
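To illustrate the idea, here is a minimal numpy-only sketch of turning a boolean row mask into contiguous slices and gathering the CSR vectors one slice at a time. The names `mask_to_slices` and `subset_csr_rows` are hypothetical and are not the functions in this PR; the real code operates on the backed group, not on in-memory arrays.

```python
import numpy as np


def mask_to_slices(mask) -> list[slice]:
    """Turn a 1D boolean mask into one slice per run of True values."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so every run has a detectable start and stop.
    padded = np.concatenate(([False], mask, [False]))
    # +1 marks a False->True edge (run start), -1 a True->False edge (run stop).
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    return [slice(int(a), int(b)) for a, b in zip(starts, stops)]


def subset_csr_rows(data, indices, indptr, slices):
    """Gather CSR vectors for the rows covered by `slices`.

    Each slice costs one contiguous read of `data` and `indices`,
    instead of one read per selected row.
    """
    if not slices:
        return data[:0], indices[:0], np.array([0])
    data_parts, indices_parts, row_counts = [], [], []
    for s in slices:
        lo, hi = indptr[s.start], indptr[s.stop]
        data_parts.append(data[lo:hi])
        indices_parts.append(indices[lo:hi])
        # Number of stored values in each row of this slice.
        row_counts.append(np.diff(indptr[s.start : s.stop + 1]))
    new_indptr = np.concatenate(([0], np.cumsum(np.concatenate(row_counts))))
    return np.concatenate(data_parts), np.concatenate(indices_parts), new_indptr


# 4 x 3 CSR matrix with rows [1,0,2], [0,0,0], [0,3,0], [4,5,0].
data = np.array([1, 2, 3, 4, 5])
indices = np.array([0, 2, 1, 0, 1])
indptr = np.array([0, 2, 2, 3, 5])
row_mask = np.array([True, False, True, True])
sub_data, sub_indices, sub_indptr = subset_csr_rows(
    data, indices, indptr, mask_to_slices(row_mask)
)
```

With integer indices the run structure is not available up front, which is why the sorting and repeated-index edge cases make that path harder.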

I set up a little benchmarking thing. Here's the setup:

import numpy as np
import zarr
from anndata.experimental import read_elem, write_elem, sparse_dataset
from scipy import sparse
import h5py

# Create data
rng = np.random.default_rng()

X = sparse.random(100_000, 10_000, density=0.01, random_state=rng, format="csr")
group = zarr.open("demo.zarr", mode="w")
write_elem(group, "X", X)
X_zarr = sparse_dataset(group["X"])

h5_group = h5py.File("demo.h5", mode="w")
write_elem(h5_group, "X", X, dataset_kwargs={"compression": "lzf"})
X_h5 = sparse_dataset(h5_group["X"])

# randomized indices
inds = rng.choice(X_zarr.shape[0], 100, replace=False)
inds.sort()

mask = np.zeros(X_zarr.shape[0], dtype=bool)
for i in range(0, len(inds) - 1, 2):
    mask[inds[i]:inds[i+1]] = True

# non-random indices, with alternating one false and n true
def make_alternating_mask(n):
    mask_alternating = np.ones(X_zarr.shape[0], dtype=bool)
    for i in range(0, X_zarr.shape[0], n):
        mask_alternating[i] = False     
    return mask_alternating

Once you have this, set the following in order to use make_alternating_mask with contiguous slices of size num_contiguous (as opposed to the random indices in mask):

num_contiguous = 10

You can then profile different use cases with X_zarr and X_h5, using np.where on the mask to trigger "old behavior."

For example,

 %prun -s cumtime X_zarr[make_alternating_mask(num_contiguous)]

uses the "new" behavior because num_contiguous = 10 > 7, the cutoff above which the optimization is faster, and

 %prun -s cumtime X_zarr[np.where(make_alternating_mask(num_contiguous))[0]]

yields the old behavior. For me, the first one is orders of magnitude faster (apparently due to far fewer zarr accesses). A quick "timing" check can also be performed for different values of num_contiguous:

def timings_zarr(num_contiguous):
    print('****Proposed solution****')
    print('-------------------------')
    print('Zarr, random indices')
    print('-------------------------')
    %time X_zarr[mask]
    print('-------------------------')
    print('Zarr, alternating mask with size ', num_contiguous, ' slices')
    print('-------------------------')
    %time X_zarr[make_alternating_mask(num_contiguous)]
    print('\n****Old behavior****') # achieved using integer indices
    print('-------------------------')
    print('Zarr, random indices')
    print('-------------------------')
    %time X_zarr[np.where(mask)[0]]
    print('-------------------------')
    print('Zarr, alternating mask with size ', num_contiguous, ' slices')
    print('-------------------------')
    %time X_zarr[np.where(make_alternating_mask(num_contiguous))[0]]

def timings_h5(num_contiguous):
    print('\n****Proposed solution****')
    print('-------------------------')
    print('h5, random indices')
    print('-------------------------')
    %time X_h5[mask]
    print('-------------------------')
    print('h5, alternating mask with size ', num_contiguous, ' slices')
    print('-------------------------')
    %time X_h5[make_alternating_mask(num_contiguous)]
    print('\n****Old behavior****') # achieved using integer indices
    print('-------------------------')
    print('h5, random indices')
    print('-------------------------')
    %time X_h5[np.where(mask)[0]]
    print('-------------------------')
    print('h5, alternating mask with size ', num_contiguous, ' slices')
    print('-------------------------')
    %time X_h5[np.where(make_alternating_mask(num_contiguous))[0]]
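The cutoff of 7 mentioned above could be applied with a check along these lines. This is only a sketch: it assumes the break-even point is compared against the mean length of the True runs in the mask, and the names `mean_run_length`, `should_use_slices`, and `alternating_mask` are hypothetical, not taken from the PR.

```python
import numpy as np

# Assumption: 7 is the break-even value quoted in the description above.
BREAK_EVEN_RUN_LENGTH = 7


def mean_run_length(mask) -> float:
    """Average length of the contiguous runs of True values in a 1D mask."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.concatenate(([False], mask, [False]))
    # A run starts wherever a True value follows a False value.
    n_runs = np.count_nonzero(padded[1:] & ~padded[:-1])
    if n_runs == 0:
        return 0.0
    return float(mask.sum()) / n_runs


def should_use_slices(mask, cutoff=BREAK_EVEN_RUN_LENGTH) -> bool:
    """Prefer slice-based access only when runs are long enough to pay off."""
    return mean_run_length(mask) > cutoff


def alternating_mask(size, n):
    """Same shape of mask as make_alternating_mask: one False every n rows."""
    mask = np.ones(size, dtype=bool)
    mask[::n] = False
    return mask
```

Under this assumption, a mask with one False every 10 rows has runs of 9 and takes the slice path, while one False every 4 rows has runs of 3 and falls back to integer indexing.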

Addresses the boolean type case in Better indexing performance for backed sparse arrays via slices #1224 and fixes numpy regression in sparse indexing #1254
Tests added
Release note added (or unnecessary)

ilan-gold · 2023-11-15T12:21:29Z

Potential other needed items:

Benchmarking included here?
Tests? What are we testing here? Can we do np.where vs. slices?

anndata/_core/sparse_dataset.py

codecov · 2023-11-15T12:30:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (522d7ea) 85.14% compared to head (84e0864) 83.08%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1233      +/-   ##
==========================================
- Coverage   85.14%   83.08%   -2.06%     
==========================================
  Files          34       34              
  Lines        5432     5458      +26     
==========================================
- Hits         4625     4535      -90     
- Misses        807      923     +116

Flag	Coverage Δ
gpu-tests	`?`

Flags with carried forward coverage won't be shown.

Files	Coverage Δ
anndata/_core/sparse_dataset.py	`93.83% <100.00%> (+1.35%)`	⬆️

... and 7 files with indirect coverage changes

ivirshup

> Benchmarking included here?

Some simple benchmark cases here would be great: https://github.com/scverse/anndata/blob/main/benchmarks/benchmarks/sparse_dataset.py

> Tests? What are we testing here? Can we do np.where vs. slices?

If we're not already testing for equivalency of these two cases we should definitely do that

anndata/_core/sparse_dataset.py

ilan-gold · 2023-12-18T08:00:01Z

@ivirshup I just noticed that https://github.com/scverse/anndata/blob/main/anndata/_core/index.py#L104 has np.where as well. So we might want to remove that as well, but I think it is a separate issue. I don't know the history of it.

ivirshup · 2024-01-10T11:11:35Z

I've moved the indexing methods out of the class, since neither actually used the self parameter and both operated purely on their arguments. This is more like the other indexing code that lives in the get_... functions.

ilan-gold · 2024-01-10T14:41:28Z

Thanks! Will keep this in mind for next time.

ivirshup · 2024-01-10T15:02:09Z

> @ivirshup I just noticed that https://github.com/scverse/anndata/blob/main/anndata/_core/index.py#L104 has np.where as well. So we might want to remove that as well, but I think it is a separate issue. I don't know the history of it.

I suspect this was because it was easier to do this than to deal with boolean indices everywhere. There may also be some issues with how the various _subset methods react when a boolean array is passed through, but I think our indexing tests should make this clear if this is changed.

Could you open an issue/PR for this?

ivirshup · 2024-01-10T15:02:23Z

Then feel free to merge this


ilan-gold added 6 commits November 10, 2023 13:52

(feat): first pass heuristic

ebe05d2

(feat): the function is faster within the call?

a68bc06

(feat): add break-even point

4b95cd8

(chore): light refactor

4257469

(fix): not everything can be converted to slices

878a9df

(chore): add comment.

9ae39da

Merge branch 'main' into ig/backed_sparse_indexing_performance

f3f9e74

ilan-gold commented Nov 15, 2023

anndata/_core/sparse_dataset.py (outdated, resolved)

ilan-gold commented Nov 15, 2023

anndata/_core/sparse_dataset.py (outdated, resolved)

ilan-gold requested a review from ivirshup November 15, 2023 12:30

ivirshup added performance 🐌 topic: backed backend: zarr labels Dec 7, 2023

ivirshup added this to the 0.10.4 milestone Dec 7, 2023

ivirshup added the skip-gpu-ci label Dec 7, 2023

ivirshup reviewed Dec 7, 2023

anndata/_core/sparse_dataset.py (outdated, resolved)

anndata/_core/sparse_dataset.py (outdated, resolved)

anndata/_core/sparse_dataset.py (outdated, resolved)

anndata/_core/sparse_dataset.py (outdated, resolved)

ilan-gold added 5 commits December 8, 2023 12:11

Merge branch 'main' into ig/backed_sparse_indexing_performance

da562e5

(feat): first pass sparse from elems

e30b34e

(chore): rename

070ea1e

(chore): try refactor

66212e3

(chore): add test

080dfac

ilan-gold commented Dec 8, 2023

anndata/_core/sparse_dataset.py (outdated, resolved)

ilan-gold added 3 commits December 11, 2023 09:26

(feat): add benchmark

fea30a7

(fix): higher numpy version bug

013de2a

(refactor): efficient -> compressed, index method

617a53d

ilan-gold requested a review from ivirshup December 11, 2023 15:17

ilan-gold commented Dec 11, 2023

anndata/_core/sparse_dataset.py (outdated, resolved)

Merge branch 'main' into ig/backed_sparse_indexing_performance

f1fc495

(refactor): use two conditions again

2aacc3f

Merge branch 'main' into ig/backed_sparse_indexing_performance

a1cab60

flying-sheep modified the milestones: 0.10.4, 0.10.5 Jan 4, 2024

Minor simplification

8d9a3b3

ivirshup added 2 commits January 10, 2024 12:13

Merge branch 'main' into ig/backed_sparse_indexing_performance

27f9f2c

Release note

84e0864

ivirshup approved these changes Jan 10, 2024


ilan-gold mentioned this pull request Jan 10, 2024

Check indexing arg types #1293

Closed

ilan-gold merged commit ab43f8d into scverse:main Jan 10, 2024
13 checks passed

ilan-gold deleted the ig/backed_sparse_indexing_performance branch January 10, 2024 15:29

meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Jan 10, 2024

Backport PR scverse#1233: (fix): indexing performance in backed mod…

51d1878

…e for `boolean` case

meeseeksmachine mentioned this pull request Jan 10, 2024

Backport PR #1233 on branch 0.10.x ((fix): indexing performance in backed mode for boolean case) #1294

Merged

flying-sheep pushed a commit that referenced this pull request Jan 11, 2024

Backport PR #1233 on branch 0.10.x ((fix): indexing performance in `b…

c63461b

…acked` mode for `boolean` case) (#1294) Co-authored-by: Ilan Gold <[email protected]>

ilan-gold assigned ilan-gold and unassigned ilan-gold Jan 11, 2024
