-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] Adds protocol checks to the public getChanges API on TableImpl #3651
[Kernel] Adds protocol checks to the public getChanges API on TableImpl #3651
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
return getRawChanges(engine, startVersion, endVersion, actionSet) | ||
.map( | ||
batch -> { | ||
int protocolIdx = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get the protocol col index once outside of the .map
on line 207.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't access the schema w/o the batch
@@ -48,19 +45,19 @@ public class TableFeatures { | |||
//////////////////// | |||
|
|||
public static void validateReadSupportedTable( | |||
Protocol protocol, Metadata metadata, String tablePath) { | |||
Protocol protocol, Optional<Metadata> metadata, String tablePath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just add a method that checks readerFeatures doesn't contain that Kernel doesn't know. Similar to SUPPORTED_WRITER_FEATURES
, add SUPPORTED_READER_FEATURES
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with that this method can also be refactored to use the set and then do specific metadata checks based on the readerFEatuer (e.g. column mapping)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Factored out the SUPPORTED_READER_FEATURES
. I think it still seems fine to me to include the metadata
as an optional parameter; if it's present we do the metadata checks.
Separate fx would still need to check versions + the features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized one thing: what happens if the connector just requests ADD, REMOVEs. but there is a protocol update between the versions requested? The updated protocol could cause the ADD info returned wrong. The new protocol feature (unknown to Kernel) may add some extra fields to ADD which Kernel won't be reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after implmenting my minor feedback
kernel/kernel-api/src/main/java/io/delta/kernel/internal/TableImpl.java
Outdated
Show resolved
Hide resolved
kernel/kernel-api/src/main/java/io/delta/kernel/internal/TableFeatures.java
Outdated
Show resolved
Hide resolved
return getRawChanges(engine, startVersion, endVersion, copySet) | ||
.map( | ||
batch -> { | ||
int protocolIdx = batch.getSchema().indexOf("protocol"); // must exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a common pattern throughout kernel code?
i.e. have a batch, get the index of a particular column, and then for each non-null value apply a function?
would it be cleaner to abstract this to a method on top of the ColumnarBatch interface that lets us apply a function for all non-null values for a particular column name?
if this occurrence is rare, then LGTM, and no need for premature optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's that common but we probably do it elsewhere once or twice. But I don't think an abstraction would necessarily belong on the ColumnarBatch interface though, maybe if we end up supporting UDF-like expressions this could be done there
Which Delta project/connector is this regarding?
Description
To avoid reading invalid tables, Kernel should check that any read protocol actions are supported by Kernel. This PR makes the current API private, and adds a public API around it that does this check when the
Protocol
is included in the list of actions to be read from the file.Also removes the "byVersion" part of the API name since we are adding separate timestamp APIs in #3650.
How was this patch tested?
Adds unit tests.