-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: filter out null values when sampling for index training #3404
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3404 +/- ##
==========================================
+ Coverage 78.72% 78.74% +0.02%
==========================================
Files 250 250
Lines 90879 90929 +50
Branches 90879 90929 +50
==========================================
+ Hits 71540 71603 +63
+ Misses 16397 16377 -20
- Partials 2942 2949 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@@ -2215,6 +2222,77 @@ mod tests { | |||
.await; | |||
} | |||
|
|||
#[rstest] | |||
#[tokio::test] | |||
async fn test_create_index_nulls( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm thinking should we add some tests for verifying recall? then we can know whether flat search handles nulls well.
it might be good to modify this test https://github.com/lancedb/lance/blob/main/rust/lance/src/index/vector/ivf/v2.rs to contain half rows with nulls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking, is there a way to count rows that are present in the index? I assume if it’s null then we don’t write it to the index file, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the test so it asserts we can use search to get all the non-null vectors back. But I am not getting the results I expect. I could use your advice to know what the expected behavior of these indices should be when there are lots of null vectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems no such way to count that now, it could be easy for v3 index by counting the num rows of storage file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could use your advice to know what the expected behavior of these indices should be when there are lots of null vectors.
@BubbleCal Could you help me make sense of the output of this test? https://github.com/lancedb/lance/actions/runs/12918160780/job/36026117407?pr=3404
I was expecting search to only return non-null rows, but it seems like we are getting some null vectors in the results.
We were not filtering out null values when sampling. Because we often call
array.values()
on Arrow arrays, which ignores the null bitmap, we are often silently treating the nulls as zeros (or possibly undefined values). Only thing that caught these nulls is an assertion. However, residualization occurring with L2 and Cosine often meant that these values were transformed and null information was lost before the assertion, which is why it got past previous unit tests.This PR adds more assertions validating there aren't nulls, and makes sure the sampling code handles null vectors.
Closes #3402
Closes #3400