Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: filter out null values when sampling for index training #3404

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented Jan 21, 2025

We were not filtering out null values when sampling. Because we often call array.values() on Arrow arrays, which ignores the null bitmap, we are often silently treating the nulls as zeros (or possibly undefined values). Only thing that caught these nulls is an assertion. However, residualization occurring with L2 and Cosine often meant that these values were transformed and null information was lost before the assertion, which is why it got past previous unit tests.

This PR adds more assertions validating there aren't nulls, and makes sure the sampling code handles null vectors.

Closes #3402
Closes #3400

@github-actions github-actions bot added the bug Something isn't working label Jan 21, 2025
@wjones127 wjones127 marked this pull request as ready for review January 22, 2025 00:25
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 82.35294% with 9 lines in your changes missing coverage. Please review.

Project coverage is 78.74%. Comparing base (7f60aa0) to head (db277d8).

Files with missing lines Patch % Lines
rust/lance/src/index/vector/utils.rs 81.25% 3 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3404      +/-   ##
==========================================
+ Coverage   78.72%   78.74%   +0.02%     
==========================================
  Files         250      250              
  Lines       90879    90929      +50     
  Branches    90879    90929      +50     
==========================================
+ Hits        71540    71603      +63     
+ Misses      16397    16377      -20     
- Partials     2942     2949       +7     
Flag Coverage Δ
unittests 78.74% <82.35%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -2215,6 +2222,77 @@ mod tests {
.await;
}

#[rstest]
#[tokio::test]
async fn test_create_index_nulls(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm thinking should we add some tests for verifying recall? then we can know whether flat search handles nulls well.

it might be good to modify this test https://github.com/lancedb/lance/blob/main/rust/lance/src/index/vector/ivf/v2.rs to contain half rows with nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, is there a way to count rows that are present in the index? I assume if it’s null then we don’t write it to the index file, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the test so it asserts we can use search to get all the non-null vectors back. But I am not getting the results I expect. I could use your advice to know what the expected behavior of these indices should be when there are lots of null vectors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems no such way to count that now, it could be easy for v3 index by counting the num rows of storage file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could use your advice to know what the expected behavior of these indices should be when there are lots of null vectors.

@BubbleCal Could you help me make sense of the output of this test? https://github.com/lancedb/lance/actions/runs/12918160780/job/36026117407?pr=3404

I was expecting search to only return non-null rows, but it seems like we are getting some null vectors in the results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants