-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(fix): indexing performance in backed
mode for boolean
case
#1233
(fix): indexing performance in backed
mode for boolean
case
#1233
Conversation
Potential other needed items:
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1233 +/- ##
==========================================
- Coverage 85.14% 83.08% -2.06%
==========================================
Files 34 34
Lines 5432 5458 +26
==========================================
- Hits 4625 4535 -90
- Misses 807 923 +116
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Benchmarking included here?
Some simple benchmark cases here would be great: https://github.com/scverse/anndata/blob/main/benchmarks/benchmarks/sparse_dataset.py
Tests? What are we testing here? Can we do np.where vs. slices?
If we're not already testing for equivalency of these two cases we should definitely do that
@ivirshup I just noticed that https://github.com/scverse/anndata/blob/main/anndata/_core/index.py#L104 has |
I've moved the indexing methods out of the class since neither actually used the |
Thanks! Will keep this in mind for next time. |
I suspect this was because it was easier to do this than have to deal with boolean indices everywhere. There may also be some issues with how the various Could you open an issue/ pr for this? |
Then feel free to merge this |
…e for `boolean` case
…acked` mode for `boolean` case) (#1294) Co-authored-by: Ilan Gold <[email protected]>
Here's my shot at it. As I explain in the comment (also in the issue), the
boolean
case is converted byscipy
to numeric integers which is why the solution has to go beforemtx
access. The other option would be to "solve" theinteger
case but as you point out, there are some nasty edge cases there with sorting, repeated indices etc. So in the interest of getting something out, I have this.I set up a little benchmarking thing. Here's the setup:
Once you have this, I would set the following in order to use
make_alternating_mask
with different sizenum_contiguous
contiguous slices (as opposed to random indices inmask
):You can then profile different use cases with
X_zarr
andX_h5
, usingnp.where
on the mask to trigger "old behavior."For example,
uses the "new" behavior because
num_contiguous=10>7
, the cutoff point where the optimization is faster andyields the old behavior. For me, the first one is orders of magnitude faster (due to, it appears, many less
zarr
accesses). A quick "timing" check can also be performed for different values ofnum_contiguous
boolean
type case in Better indexing performance for backed sparse arrays via slices #1224 and fixes numpy regression in sparse indexing #1254