Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(fix): indexing performance in backed mode for boolean case #1233

Merged

Conversation

ilan-gold
Copy link
Contributor

@ilan-gold ilan-gold commented Nov 15, 2023

Here's my shot at it. As I explain in the comment (also in the issue), the boolean case is converted by scipy to numeric integers which is why the solution has to go before mtx access. The other option would be to "solve" the integer case but as you point out, there are some nasty edge cases there with sorting, repeated indices etc. So in the interest of getting something out, I have this.

I set up a little benchmarking thing. Here's the setup:

import numpy as np
import zarr
from anndata.experimental import read_elem, write_elem, sparse_dataset
from scipy import sparse
import h5py

# Create data
rng = np.random.default_rng()

X = sparse.random(100_000, 10_000, density=0.01, random_state=rng, format="csr")
group = zarr.open("demo.zarr", mode="w")
write_elem(group, "X", X)
X_zarr = sparse_dataset(group["X"])

h5_group = h5py.File("demo.h5", mode="w")
write_elem(h5_group, "X", X, dataset_kwargs={"compression": "lzf"})
X_h5 = sparse_dataset(h5_group["X"])

# randomized indices
inds = np.random.choice(X_zarr.shape[0], 100, replace=False)
inds.sort()

mask = np.zeros(X_zarr.shape[0], dtype=bool)
for i in range(0, len(inds) - 1, 2):
    mask[inds[i]:inds[i+1]] = True

# non-random indices, with alternating one false and n true
def make_alternating_mask(n):
    mask_alternating = np.ones(X_zarr.shape[0], dtype=bool)
    for i in range(0, X_zarr.shape[0], n):
        mask_alternating[i] = False     
    return mask_alternating

Once you have this, I would set the following in order to use make_alternating_mask with different size num_contiguous contiguous slices (as opposed to random indices in mask):

num_contiguous = 10

You can then profile different use cases with X_zarr and X_h5, using np.where on the mask to trigger "old behavior."

For example,

 %prun -s cumtime X_zarr[make_alternating_mask(num_contiguous)]

uses the "new" behavior because num_contiguous=10>7, the cutoff point where the optimization is faster and

 %prun -s cumtime X_zarr[np.where(make_alternating_mask(num_contiguous))]

yields the old behavior. For me, the first one is orders of magnitude faster (due to, it appears, many less zarr accesses). A quick "timing" check can also be performed for different values of num_contiguous

def timings_zarr(num_contiguous):
    print('****Proposed solution****')
    print('-------------------------')
    print('Zarr, random indices')
    print('-------------------------')
    %time X_zarr[mask]
    print('-------------------------')
    print('Zarr, alternating mask with size ', num_contiguous, ' slices')
    print('-------------------------')
    %time X_zarr[make_alternating_mask(num_contiguous)]
    print('\n****Old behavior****') # achieved using integer indices
    print('-------------------------')
    print('Zarr, random indices')
    print('-------------------------')
    %time X_zarr[np.where(mask)[0]]
    print('-------------------------')
    print('Zarr, alternating mask with size ', num_contiguous, ' slices')
    print('-------------------------')
    %time X_zarr[np.where(make_alternating_mask(num_contiguous))[0]]

def timings_h5(num_contiguous):
    print('\n****Proposed solution****')
    print('-------------------------')
    print('h5, random indices')
    print('-------------------------')
    %time X_h5[mask]
    print('-------------------------')
    print('h5, alternating mask with size ', num_contiguous, ' slices')
    print('-------------------------')
    %time X_h5[make_alternating_mask(num_contiguous)]
    print('\n****Old behavior****') # achieved using integer indices
    print('-------------------------')
    print('h5, random indices')
    print('-------------------------')
    %time X_h5[np.where(mask)[0]]
    print('-------------------------')
    print('h5, alternating mask with size ', num_contiguous, ' slices')
    print('-------------------------')
    %time X_h5[np.where(make_alternating_mask(num_contiguous))[0]]

@ilan-gold
Copy link
Contributor Author

ilan-gold commented Nov 15, 2023

Potential other needed items:

  • Benchmarking included here?
  • Tests? What are we testing here? Can we do np.where vs. slices?

@ilan-gold ilan-gold requested a review from ivirshup November 15, 2023 12:30
Copy link

codecov bot commented Nov 15, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (522d7ea) 85.14% compared to head (84e0864) 83.08%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1233      +/-   ##
==========================================
- Coverage   85.14%   83.08%   -2.06%     
==========================================
  Files          34       34              
  Lines        5432     5458      +26     
==========================================
- Hits         4625     4535      -90     
- Misses        807      923     +116     
Flag Coverage Δ
gpu-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
anndata/_core/sparse_dataset.py 93.83% <100.00%> (+1.35%) ⬆️

... and 7 files with indirect coverage changes

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmarking included here?

Some simple benchmark cases here would be great: https://github.com/scverse/anndata/blob/main/benchmarks/benchmarks/sparse_dataset.py

Tests? What are we testing here? Can we do np.where vs. slices?

If we're not already testing for equivalency of these two cases we should definitely do that

anndata/_core/sparse_dataset.py Outdated Show resolved Hide resolved
anndata/_core/sparse_dataset.py Outdated Show resolved Hide resolved
anndata/_core/sparse_dataset.py Outdated Show resolved Hide resolved
anndata/_core/sparse_dataset.py Outdated Show resolved Hide resolved
@ilan-gold ilan-gold requested a review from ivirshup December 11, 2023 15:17
@ilan-gold
Copy link
Contributor Author

@ivirshup I just noticed that https://github.com/scverse/anndata/blob/main/anndata/_core/index.py#L104 has np.where as well. So we might want to remove that as well, but I think it is a separate issue. I don't know the history of it.

@flying-sheep flying-sheep modified the milestones: 0.10.4, 0.10.5 Jan 4, 2024
@ivirshup
Copy link
Member

I've moved the indexing methods out of the class since neither actually used the self parameter, and purely operated on the arguments. This is more like the other indexing code that live in the get_... functions.

@ilan-gold
Copy link
Contributor Author

Thanks! Will keep this in mind for next time.

@ivirshup
Copy link
Member

@ivirshup I just noticed that https://github.com/scverse/anndata/blob/main/anndata/_core/index.py#L104 has np.where as well. So we might want to remove that as well, but I think it is a separate issue. I don't know the history of it.

I suspect this was because it was easier to do this than have to deal with boolean indices everywhere. There may also be some issues with how the various _subset methods react when a boolean array is passed through, but think our indexing tests should make this clear if this is changed.

Could you open an issue/ pr for this?

@ivirshup
Copy link
Member

Then feel free to merge this

@ilan-gold ilan-gold merged commit ab43f8d into scverse:main Jan 10, 2024
13 checks passed
@ilan-gold ilan-gold deleted the ig/backed_sparse_indexing_performance branch January 10, 2024 15:29
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Jan 10, 2024
flying-sheep pushed a commit that referenced this pull request Jan 11, 2024
…acked` mode for `boolean` case) (#1294)

Co-authored-by: Ilan Gold <[email protected]>
@ilan-gold ilan-gold assigned ilan-gold and unassigned ilan-gold Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

numpy regression in sparse indexing
3 participants