Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: incremental Snapshot update #651

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Jan 18, 2025

Note

removed the breaking change label, since calling code should not require updates.

What changes are proposed in this pull request?

This PR proposes a new api on Snapshot to handle updating a snapshot to a more recent version.

    /// Update the `Snapshot` to the target version.
    ///
    /// # Parameters
    /// - `target_version`: desired version of the `Snapshot` after update, defaults to latest.
    /// - `engine`: Implementation of [`Engine`] apis.
    ///
    /// # Returns
    /// - boolean flag indicating if the `Snapshot` was updated.
    pub fn update(
        &mut self,
        target_version: impl Into<Option<Version>>,
        engine: &dyn Engine,
    ) -> DeltaResult<bool>

This allows engines like duckdb or delta-rs to hold on to a Snapshot and avoid potentially expensive operations on the object store by only the new log entries.

This PR affects the following public APIs

  • adds a new update method to Snapshot.
  • exposes enable_type_widening on TableProperties (this was useful to use the type widening table for testing)
  • extends permissible inputs to Snapshot::try_new

How was this change tested?

  • added new tests for both LogSegment and Snapshot::update

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 18, 2025
Signed-off-by: Robert Pack <[email protected]>
Copy link

codecov bot commented Jan 18, 2025

Codecov Report

Attention: Patch coverage is 90.34483% with 14 lines in your changes missing coverage. Please review.

Project coverage is 83.69%. Comparing base (8494126) to head (9dcc067).

Files with missing lines Patch % Lines
kernel/src/snapshot.rs 82.25% 2 Missing and 9 partials ⚠️
kernel/src/log_segment.rs 96.96% 1 Missing ⚠️
kernel/src/log_segment/tests.rs 97.87% 0 Missing and 1 partial ⚠️
kernel/src/table_properties/deserialize.rs 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #651      +/-   ##
==========================================
+ Coverage   83.66%   83.69%   +0.03%     
==========================================
  Files          75       75              
  Lines       16949    17087     +138     
  Branches    16949    17087     +138     
==========================================
+ Hits        14180    14301     +121     
- Misses       2085     2088       +3     
- Partials      684      698      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@roeap roeap removed the breaking-change Change that will require a version bump label Jan 18, 2025
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 18, 2025
@roeap roeap removed the breaking-change Change that will require a version bump label Jan 18, 2025
@@ -166,21 +166,59 @@ impl LogSegment {
.filter_ok(|x| x.is_commit())
.try_collect()?;

// Return a FileNotFound error if there are no new commits.
// Callers can then decide how to handle this case.
require!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting it up makes it clearer that we want nonempty 👌

// Callers can then decide how to handle this case.
require!(
!ascending_commit_files.is_empty(),
Error::file_not_found("No new commits.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the error should be more like "No commits in range", or "Failed to find commits in the range". "New commits" makes sense for an update operation, but CDF could operate over old commits.

/// The other LogSegment must be contiguous with this one, i.e. the end
/// version of this LogSegment must be one less than the start version of
/// the other LogSegment and contain only commit files.
pub(crate) fn extend(&mut self, other: &LogSegment) -> DeltaResult<()> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this passes by reference. I'm not sure when we'd need to keep around such a log segment besides extending an existing one. Should we pass ownership and avoid the clone?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go a step further -- most of this should probably be inlined at the call site, which eliminates many of the correctness checks that caller already had to perform (or can prove are redundant for other reasons)

(see also my other comment about making an entirely new snapshot instead of mutating an existing one)


let (metadata, protocol) = log_segment.read_metadata_opt(engine)?;
if let Some(p) = &protocol {
p.ensure_read_supported()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the column mapping mode also be checked in the case only Protocol is updated? the column_mapping_mode function takes protocol as a parameter, so I imagine whatever the output column_mapping_mode is should be re-validated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact this is another usecase for my TableConfiguration PR where we do all the validations in one place without having to think about these cases.

assert_eq!(snapshot.protocol().min_reader_version(), 1);
assert_eq!(snapshot.table_properties.enable_type_widening, None);

snapshot.update(1, &engine).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a clean n simple api to update! 👌

@@ -137,6 +137,9 @@ pub struct TableProperties {
/// whether to enable row tracking during writes.
pub enable_row_tracking: Option<bool>,

/// whether to enable type widening
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see enableTypeWidening in the protocol. How come it's not part of the protocol doesn't have it when other docs do?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's still an RFC at this point: https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md. We can totally implement support for it, but details of the feature could technically change before the RFC is merged into the actual spec. I believe the feature name also needs to be typeWidening-preview or similar, until that happens.

@@ -137,6 +137,9 @@ pub struct TableProperties {
/// whether to enable row tracking during writes.
pub enable_row_tracking: Option<bool>,

/// whether to enable type widening
pub enable_type_widening: Option<bool>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re

Note
removed the breaking change label, since calling code should not require updates.

The semver check fails here with the message:

A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.38.0/src/lints/constructible_struct_adds_field.ron

Failed in:
Summary semver requires new major version: 1 major and 0 minor checks failed
field TableProperties.enable_type_widening in /home/runner/work/delta-kernel-rs/delta-kernel-rs/kernel/src/table_properties.rs:141

That seems like a legit (tho annoying) callout of a breaking change?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep +1 came here to comment the same thing (this is a breaking change) - also this seems a tangential change, do we want to include it here?

Comment on lines +164 to +165
pub fn update(
&mut self,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me that we want a mut update? An engine could conceivably want to keep the original snapshot because e.g. other queries are still using it. And if there is no sharing, then we should probably just consume self.

Either way, it seems better to return an entirely new Snapshot? Especially since that makes this code exception safe (an error partway through doesn't leave any visible changes).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also -- if we do take &self then we may want to consider returning a Cow, since a common case is that the passed-in snapshot is still the latest. If consuming self, it's easy enough to return self as needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea perhaps we need some discussion on what API we want here? I had a PR that gives a new Snapshot with a Snapshot::new_from API: #549

/// The other LogSegment must be contiguous with this one, i.e. the end
/// version of this LogSegment must be one less than the start version of
/// the other LogSegment and contain only commit files.
pub(crate) fn extend(&mut self, other: &LogSegment) -> DeltaResult<()> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go a step further -- most of this should probably be inlined at the call site, which eliminates many of the correctness checks that caller already had to perform (or can prove are redundant for other reasons)

(see also my other comment about making an entirely new snapshot instead of mutating an existing one)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants