
fix: don't eagerly materialize fields that the user hasn't asked for #3442

Merged
merged 9 commits into from
Feb 11, 2025

Conversation

westonpace
Contributor

We added logic a while back to eagerly materialize fields if they are narrow and there is a filter. However, we forgot to ensure that those fields are actually part of the final projection. The result is that we end up loading many columns the user doesn't want and then throwing them away.

This fix changes the set of fields we load to only be those that are asked for.
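The difference between the old and new behavior can be sketched roughly as follows. This is a simplified illustration, not the actual scanner code: `Field`, `NARROW_THRESHOLD`, and the two `early_fields_*` functions are hypothetical names invented for this example, and the "narrow" cutoff is an assumed value. Only `is_early_field` echoes a name that appears in the diff, and its body here is a guess.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a schema field; not Lance's real type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    /// Approximate width of one value in bytes (an assumption for this sketch).
    pub bytes_per_value: usize,
}

/// Assumed cutoff below which a field counts as "narrow".
pub const NARROW_THRESHOLD: usize = 16;

/// A field is cheap enough to load eagerly if it is narrow.
pub fn is_early_field(f: &Field) -> bool {
    f.bytes_per_value <= NARROW_THRESHOLD
}

/// Behavior described as the bug: every narrow field in the dataset is
/// loaded eagerly, whether or not the user asked for it.
pub fn early_fields_before_fix(dataset: &[Field], _desired: &BTreeSet<String>) -> Vec<String> {
    dataset
        .iter()
        .filter(|f| is_early_field(f))
        .map(|f| f.name.clone())
        .collect()
}

/// Behavior described as the fix: start from the requested fields and keep
/// only those that are cheap enough to load eagerly.
pub fn early_fields_after_fix(dataset: &[Field], desired: &BTreeSet<String>) -> Vec<String> {
    dataset
        .iter()
        .filter(|f| desired.contains(&f.name))
        .filter(|f| is_early_field(f))
        .map(|f| f.name.clone())
        .collect()
}
```

With a dataset of `id` (8 bytes), `score` (4 bytes), and `embedding` (4096 bytes) and a projection of `id` and `embedding`, the first function would also load `score`, while the second loads only `id` eagerly and leaves `embedding` for late materialization.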

github-actions bot added the bug and python labels on Feb 11, 2025
Comment on lines +1034 to +1039
if !self.projection_plan.physical_schema.fields.is_empty() {
    return Err(Error::invalid_input(
        "count_rows should not be called on a plan selecting columns".to_string(),
        location!(),
    ));
}
Contributor Author

I'm a little torn on this error. Ideally, we would just silently blank out the projection plan and then create the count plan. However, to do that we either have to clone the scan, which is a pretty big thing to be cloning, or modify the scanner, which would maybe not be what users would expect from count_rows.

For now, I want to get something out soon, so I'm just raising an error, on the assumption that Scanner::count_rows is a mostly internal method anyway (users should use Dataset::count_rows).
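The trade-off above can be illustrated with a small sketch. Everything here is hypothetical: this `Scanner` is a toy struct with invented fields and a `String` error, not Lance's real scanner, and it exists only to show the "fail fast instead of cloning or mutating" choice.

```rust
/// Hypothetical, heavily simplified scanner; not Lance's real `Scanner`.
#[derive(Debug, Default)]
pub struct Scanner {
    /// Columns the caller asked for; empty means "no projection".
    projected_columns: Vec<String>,
    /// Row count of the (imaginary) underlying dataset.
    total_rows: u64,
}

impl Scanner {
    pub fn new(total_rows: u64) -> Self {
        Scanner {
            projected_columns: Vec::new(),
            total_rows,
        }
    }

    /// Builder-style projection, consuming and returning the scanner.
    pub fn project(mut self, columns: &[&str]) -> Self {
        self.projected_columns = columns.iter().map(|c| c.to_string()).collect();
        self
    }

    /// Takes `&self`, so it can neither clear the projection in place nor
    /// avoid a clone if it wanted a projection-free copy. It rejects the
    /// call instead, mirroring the error discussed in the comment.
    pub fn count_rows(&self) -> Result<u64, String> {
        if !self.projected_columns.is_empty() {
            return Err(
                "count_rows should not be called on a plan selecting columns".to_string(),
            );
        }
        Ok(self.total_rows)
    }
}
```

In this sketch the workaround is simply to count on a scanner that has no projection, for example `Scanner::new(10).count_rows()`, rather than on one built with `.project(&["id"])`.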

Contributor

I think you're right that it is pretty internal. Plus, it's easy to work around.

github-actions bot added the java label on Feb 11, 2025
// Start with the desired schema
.union_schema(desired_schema)
// Subtract columns that are expensive
.subtract_predicate(|f| !self.is_early_field(f))
Contributor

Do "early" and "eager" mean the same thing in this vocabulary?

@westonpace westonpace merged commit c70d1d2 into lancedb:main Feb 11, 2025
26 of 27 checks passed
@codecov-commenter

Codecov Report

Attention: Patch coverage is 90.72165% with 9 lines in your changes missing coverage. Please review.

Project coverage is 78.91%. Comparing base (8a61b69) to head (b3df0b2).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/scanner.rs | 90.32% | 2 Missing and 7 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3442      +/-   ##
==========================================
- Coverage   78.93%   78.91%   -0.03%     
==========================================
  Files         251      251              
  Lines       92267    92390     +123     
  Branches    92267    92390     +123     
==========================================
+ Hits        72833    72910      +77     
- Misses      16463    16504      +41     
- Partials     2971     2976       +5     
Flag Coverage Δ
unittests 78.91% <90.72%> (-0.03%) ⬇️

Labels
bug (Something isn't working), java, python
4 participants