-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: incremental Snapshot update #651
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Signed-off-by: Robert Pack <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #651 +/- ##
==========================================
+ Coverage 83.66% 83.69% +0.03%
==========================================
Files 75 75
Lines 16949 17087 +138
Branches 16949 17087 +138
==========================================
+ Hits 14180 14301 +121
- Misses 2085 2088 +3
- Partials 684 698 +14 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Robert Pack <[email protected]>
@@ -166,21 +166,59 @@ impl LogSegment { | |||
.filter_ok(|x| x.is_commit()) | |||
.try_collect()?; | |||
|
|||
// Return a FileNotFound error if there are no new commits. | |||
// Callers can then decide how to handle this case. | |||
require!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Splitting it up makes it clearer that we want nonempty 👌
// Callers can then decide how to handle this case. | ||
require!( | ||
!ascending_commit_files.is_empty(), | ||
Error::file_not_found("No new commits.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the error should be more like "No commits in range", or "Failed to find commits in the range". "New commits" makes sense for an update operation, but CDF could operate over old commits.
/// The other LogSegment must be contiguous with this one, i.e. the end | ||
/// version of this LogSegment must be one less than the start version of | ||
/// the other LogSegment and contain only commit files. | ||
pub(crate) fn extend(&mut self, other: &LogSegment) -> DeltaResult<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this passes by reference. I'm not sure when we'd need to keep around such a log segment besides extending an existing one. Should we pass ownership and avoid the clone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd go a step further -- most of this should probably be inlined at the call site, which eliminates many of the correctness checks that caller already had to perform (or can prove are redundant for other reasons)
(see also my other comment about making an entirely new snapshot instead of mutating an existing one)
|
||
let (metadata, protocol) = log_segment.read_metadata_opt(engine)?; | ||
if let Some(p) = &protocol { | ||
p.ensure_read_supported()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the column mapping mode also be checked in the case only Protocol is updated? the column_mapping_mode
function takes protocol as a parameter, so I imagine whatever the output column_mapping_mode is should be re-validated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact this is another usecase for my TableConfiguration PR where we do all the validations in one place without having to think about these cases.
assert_eq!(snapshot.protocol().min_reader_version(), 1); | ||
assert_eq!(snapshot.table_properties.enable_type_widening, None); | ||
|
||
snapshot.update(1, &engine).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such a clean n simple api to update! 👌
@@ -137,6 +137,9 @@ pub struct TableProperties { | |||
/// whether to enable row tracking during writes. | |||
pub enable_row_tracking: Option<bool>, | |||
|
|||
/// whether to enable type widening |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see enableTypeWidening
in the protocol. How come it's not part of the protocol doesn't have it when other docs do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's still an RFC at this point: https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md. We can totally implement support for it, but details of the feature could technically change before the RFC is merged into the actual spec. I believe the feature name also needs to be typeWidening-preview
or similar, until that happens.
@@ -137,6 +137,9 @@ pub struct TableProperties { | |||
/// whether to enable row tracking during writes. | |||
pub enable_row_tracking: Option<bool>, | |||
|
|||
/// whether to enable type widening | |||
pub enable_type_widening: Option<bool>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re
Note
removed the breaking change label, since calling code should not require updates.
The semver check fails here with the message:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.38.0/src/lints/constructible_struct_adds_field.ronFailed in:
Summary semver requires new major version: 1 major and 0 minor checks failed
field TableProperties.enable_type_widening in /home/runner/work/delta-kernel-rs/delta-kernel-rs/kernel/src/table_properties.rs:141
That seems like a legit (tho annoying) callout of a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep +1 came here to comment the same thing (this is a breaking change) - also this seems a tangential change, do we want to include it here?
pub fn update( | ||
&mut self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear to me that we want a mut update
? An engine could conceivably want to keep the original snapshot because e.g. other queries are still using it. And if there is no sharing, then we should probably just consume self.
Either way, it seems better to return an entirely new Snapshot? Especially since that makes this code exception safe (an error partway through doesn't leave any visible changes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also -- if we do take &self
then we may want to consider returning a Cow
, since a common case is that the passed-in snapshot is still the latest. If consuming self, it's easy enough to return self as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea perhaps we need some discussion on what API we want here? I had a PR that gives a new Snapshot
with a Snapshot::new_from
API: #549
/// The other LogSegment must be contiguous with this one, i.e. the end | ||
/// version of this LogSegment must be one less than the start version of | ||
/// the other LogSegment and contain only commit files. | ||
pub(crate) fn extend(&mut self, other: &LogSegment) -> DeltaResult<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd go a step further -- most of this should probably be inlined at the call site, which eliminates many of the correctness checks that caller already had to perform (or can prove are redundant for other reasons)
(see also my other comment about making an entirely new snapshot instead of mutating an existing one)
Note
removed the breaking change label, since calling code should not require updates.
What changes are proposed in this pull request?
This PR proposes a new api on
Snapshot
to handle updating a snapshot to a more recent version.This allows engines like duckdb or delta-rs to hold on to a
Snapshot
and avoid potentially expensive operations on the object store by only the new log entries.This PR affects the following public APIs
update
method toSnapshot
.enable_type_widening
onTableProperties
(this was useful to use the type widening table for testing)Snapshot::try_new
How was this change tested?
LogSegment
andSnapshot::update