
[Kernel] Adds protocol checks to the public getChanges API on TableImpl #3651

Merged

Conversation

allisonport-db (Collaborator)

Which Delta project/connector is this regarding?

  • [ ] Spark
  • [ ] Standalone
  • [ ] Flink
  • [x] Kernel
  • [ ] Other (fill in here)

Description

To avoid reading invalid tables, Kernel should check that any Protocol action it reads is supported by Kernel. This PR makes the current API private and adds a public API around it that performs this check whenever Protocol is included in the set of actions to be read from the file.

Also removes the "byVersion" part of the API name since we are adding separate timestamp APIs in #3650.
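As a rough illustration of the pattern the description outlines (all class and method names below are simplified stand-ins, not Delta Kernel's actual API), the private raw API is wrapped by a public one that validates any Protocol action it encounters:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative stand-ins only; not Delta Kernel's real classes.
public class GetChangesSketch {
    // Reader features this sketch "supports" (illustrative set).
    static final Set<String> SUPPORTED_READER_FEATURES =
            new HashSet<>(Arrays.asList("columnMapping", "deletionVectors"));

    static class Protocol {
        final Set<String> readerFeatures;
        Protocol(Set<String> f) { readerFeatures = f; }
    }

    // A "batch" here is just the protocol values found in one commit file;
    // null means the row carried no protocol action.
    static Stream<List<Protocol>> getRawChanges(List<List<Protocol>> commits) {
        return commits.stream();  // stands in for the now-private raw API
    }

    // Public wrapper: validates every protocol action it sees.
    static Stream<List<Protocol>> getChanges(List<List<Protocol>> commits) {
        return getRawChanges(commits).map(batch -> {
            for (Protocol p : batch) {
                if (p != null && !SUPPORTED_READER_FEATURES.containsAll(p.readerFeatures)) {
                    throw new UnsupportedOperationException(
                            "Table requires reader features Kernel does not support");
                }
            }
            return batch;
        });
    }

    public static void main(String[] args) {
        List<List<Protocol>> ok = List.of(
                Arrays.asList(null, new Protocol(Set.of("columnMapping"))));
        System.out.println("batches read: " + getChanges(ok).count());

        List<List<Protocol>> bad = List.of(
                List.of(new Protocol(Set.of("someFutureFeature"))));
        try {
            getChanges(bad).count();  // stream is lazy; count() forces the check
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected unsupported table");
        }
    }
}
```

The key point is that the check happens per batch inside the wrapper, so callers of the public API can never silently read a table whose protocol Kernel does not understand.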

How was this patch tested?

Adds unit tests.


@vkorukanti vkorukanti left a comment


LGTM

return getRawChanges(engine, startVersion, endVersion, actionSet)
    .map(
        batch -> {
          int protocolIdx =
Collaborator


get the protocol col index once outside of the .map on line 207.

Collaborator Author


We can't access the schema without the batch.

@@ -48,19 +45,19 @@ public class TableFeatures {
   ////////////////////
 
   public static void validateReadSupportedTable(
-      Protocol protocol, Metadata metadata, String tablePath) {
+      Protocol protocol, Optional<Metadata> metadata, String tablePath) {
Collaborator


Can we just add a method that checks that readerFeatures doesn't contain anything Kernel doesn't know about? Similar to SUPPORTED_WRITER_FEATURES, add a SUPPORTED_READER_FEATURES?

Collaborator


With that, this method can also be refactored to use the set and then do specific metadata checks based on the reader feature (e.g. column mapping).

Collaborator Author


Factored out SUPPORTED_READER_FEATURES. It still seems fine to me to include the metadata as an optional parameter; if it's present we do the metadata checks.

A separate function would still need to check versions plus the features.
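A simplified sketch of the shape this discussion lands on (feature names and specific checks are illustrative, not Kernel's exact logic): validation runs against a SUPPORTED_READER_FEATURES set, and the metadata-dependent checks run only when a Metadata is present:

```java
import java.util.*;

// Simplified sketch with stand-in classes; not Delta Kernel's real code.
public class TableFeaturesSketch {
    // Illustrative set of reader features this sketch "supports".
    static final Set<String> SUPPORTED_READER_FEATURES =
            new HashSet<>(Arrays.asList("columnMapping", "deletionVectors", "timestampNtz"));

    static class Protocol {
        final int minReaderVersion;
        final Set<String> readerFeatures;
        Protocol(int v, Set<String> f) { minReaderVersion = v; readerFeatures = f; }
    }

    static class Metadata {
        final String columnMappingMode;  // e.g. "none", "name", "id"
        Metadata(String mode) { columnMappingMode = mode; }
    }

    static void validateReadSupportedTable(
            Protocol protocol, Optional<Metadata> metadata, String tablePath) {
        // Version check (threshold is illustrative).
        if (protocol.minReaderVersion > 3) {
            throw new UnsupportedOperationException(
                    "Unsupported reader version at " + tablePath);
        }
        // Feature check against the supported set.
        for (String feature : protocol.readerFeatures) {
            if (!SUPPORTED_READER_FEATURES.contains(feature)) {
                throw new UnsupportedOperationException(
                        "Unsupported reader feature '" + feature + "' at " + tablePath);
            }
        }
        // Metadata-dependent checks run only when metadata is available.
        metadata.ifPresent(m -> {
            if (protocol.readerFeatures.contains("columnMapping")
                    && !Set.of("none", "name", "id").contains(m.columnMappingMode)) {
                throw new UnsupportedOperationException(
                        "Unknown column mapping mode: " + m.columnMappingMode);
            }
        });
    }

    public static void main(String[] args) {
        Protocol p = new Protocol(3, Set.of("columnMapping"));
        validateReadSupportedTable(p, Optional.empty(), "/tmp/t");  // version + feature checks only
        validateReadSupportedTable(p, Optional.of(new Metadata("name")), "/tmp/t");
        System.out.println("ok");
    }
}
```

With Optional.empty(), callers that only have a Protocol (e.g. the getChanges path) still get the version and feature checks.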


@vkorukanti vkorukanti left a comment


Just realized one thing: what happens if the connector requests only ADDs and REMOVEs, but there is a protocol update between the requested versions? The updated protocol could make the returned ADD info wrong: a new protocol feature (unknown to Kernel) may add extra fields to ADD that Kernel won't be reading.


@scottsand-db scottsand-db left a comment


LGTM after implementing my minor feedback

return getRawChanges(engine, startVersion, endVersion, copySet)
    .map(
        batch -> {
          int protocolIdx = batch.getSchema().indexOf("protocol"); // must exist
Collaborator


is this a common pattern throughout kernel code?

i.e. have a batch, get the index of a particular column, and then for each non-null value apply a function?

would it be cleaner to abstract this to a method on top of the ColumnarBatch interface that lets us apply a function for all non-null values for a particular column name?

if this occurrence is rare, then LGTM, and no need for premature optimization

Collaborator Author


I don't think it's that common, but we probably do it elsewhere once or twice. I don't think an abstraction would necessarily belong on the ColumnarBatch interface, though; maybe if we end up supporting UDF-like expressions this could be done there.
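For illustration, the helper floated above could look roughly like this; ColumnarBatch here is a toy stand-in, and forEachNonNull is a hypothetical utility, not an existing Kernel method:

```java
import java.util.*;
import java.util.function.*;

// Toy sketch of "apply a function to every non-null value of a named column".
public class BatchHelperSketch {
    // Stand-in for a columnar batch: a schema plus column-major data.
    static class ColumnarBatch {
        final List<String> schema;
        final List<List<Object>> columns;
        ColumnarBatch(List<String> schema, List<List<Object>> columns) {
            this.schema = schema; this.columns = columns;
        }
    }

    // Hypothetical helper: looks up the column index once, then visits
    // each non-null value in that column.
    static void forEachNonNull(ColumnarBatch batch, String columnName, Consumer<Object> fn) {
        int idx = batch.schema.indexOf(columnName);
        if (idx < 0) return;  // column not present in this batch
        for (Object value : batch.columns.get(idx)) {
            if (value != null) fn.accept(value);
        }
    }

    public static void main(String[] args) {
        ColumnarBatch batch = new ColumnarBatch(
                List.of("add", "protocol"),
                List.of(Arrays.asList("a1", "a2", null),
                        Arrays.asList(null, null, "p1")));
        List<Object> seen = new ArrayList<>();
        forEachNonNull(batch, "protocol", seen::add);
        System.out.println(seen);  // only the non-null protocol values
    }
}
```

This keeps the index lookup in one place, which is the cleanup the comment was after, without committing the real ColumnarBatch interface to it.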

@allisonport-db allisonport-db merged commit 27cdcb9 into delta-io:master Sep 11, 2024
17 checks passed

3 participants