feat!: Introduce TableConfiguration to jointly manage metadata, protocol, and table properties #644
base: main

Conversation
Force-pushed from 3f74148 to a6d74d6
kernel/src/table_configuration.rs
Outdated
/// See the documentation of [`TableChanges`] for more details.
///
/// [`TableChanges`]: crate::table_changes::TableChanges
pub fn can_read_cdf(&self) -> DeltaResult<()> {
@zachschuermann @nicklan wdyt of this sort of API? The general pattern would be can_(read | write)_{table_feature}
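A rough sketch of what that shape could look like (illustrative only; simplified stand-in types, and the write-side method name is hypothetical):

struct TableConfiguration;

impl TableConfiguration {
    // reader-side check, following the can_(read|write)_{table_feature} pattern;
    // String stands in for the kernel's error type in this sketch
    fn can_read_cdf(&self) -> Result<(), String> {
        Ok(())
    }
    // writer-side check; only defined for features that writers care about
    fn can_write_deletion_vectors(&self) -> Result<(), String> {
        Ok(())
    }
}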
Note that we'd only create such functions if/when we need them, and if they make sense. For example, if there is a feature that is only relevant to readers, there would be no can_write_{table_feature}.
Is there a reason it doesn't return DeltaResult<bool>? Seems like there are cases (most of them?) where we don't necessarily want to throw, we just want to learn (but if something goes wrong while trying to learn, an error response is still warranted). Can always add a corresponding assert_can_XXX method if the error version is actually needed?
I was using the pattern of Ok(()) => true, Err(_) => false, with an explanation why. We use a similar pattern here. Do you think we should rethink this approach?
I think it depends entirely on whether "most" callers of a given method would treat false as an error. Protocol checks would of course treat it like an error, but other places in the code merely want to make control flow decisions based on the answer. I guess Result for control flow isn't as bad as exceptions for control flow, but in general it seems preferable to use Result for cases where something actually went wrong?
Yeah I expect most of the callers to use this as a bool or have a custom error message. I'll switch to bool 👍
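Presumably the result would look something like this sketch (simplified types; assert_cdf_read_supported is the hypothetical error-carrying wrapper mentioned above):

struct TableConfiguration {
    cdf_enabled: bool,
}

impl TableConfiguration {
    // bool form: callers that only need to make control-flow decisions
    fn is_cdf_read_supported(&self) -> bool {
        self.cdf_enabled
    }
    // error form layered on top, for callers that do treat `false` as an error
    fn assert_cdf_read_supported(&self) -> Result<(), String> {
        if self.is_cdf_read_supported() {
            Ok(())
        } else {
            Err("change data feed is not supported on this table".to_string())
        }
    }
}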
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files
@@ Coverage Diff @@
##             main     #644      +/-   ##
==========================================
+ Coverage   84.05%   84.21%   +0.16%
==========================================
  Files          75       76       +1
  Lines       17251    17427     +176
  Branches    17251    17427     +176
==========================================
+ Hits        14500    14677     +177
+ Misses       2052     2050       -2
- Partials      699      700       +1

☔ View full report in Codecov by Sentry.
@@ -412,7 +412,7 @@ impl Scan {
    partition_columns: self.snapshot.metadata().partition_columns.clone(),
I would think that everything related to protocol or metadata should reside in table configuration now, which would include table schema, partition columns, etc. Idea being, the P&M are "raw" information (with significant cross-validation requirements), which the table configuration aims to encapsulate cleanly. Almost nobody should be accessing the P&M directly after that.
One useful thought experiment (inspired by Delta spark's SnapshotDescriptor trait) would be: how many parts of the code that currently take a Snapshot could/should take a TableConfiguration instead? (The distinction being that TableConfiguration accesses require neither I/O nor significant compute, while true snapshot operations like scans do.) Methods that work with a Snapshot but which do not require an Engine are likely candidates.
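As a sketch of what that refactor would mean in practice (simplified stand-in types; the free function is hypothetical):

struct TableConfiguration {
    partition_columns: Vec<String>,
}

// Before: fn partition_columns(snapshot: &Snapshot) -> &[String], which drags
// in the whole snapshot. After: take only the configuration, since this needs
// neither I/O nor an Engine.
fn partition_columns(config: &TableConfiguration) -> &[String] {
    &config.partition_columns
}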
Hmm, meaning that we actually want to expose only table_configuration() from Snapshot instead of table_configuration() + protocol() + metadata()?
^ expose in pub(crate) ofc
yes, exactly. Pass a TC if you need to know things about the snapshot; pass a snapshot only if you actually need to scan it
I think we can do this in the other issue I made. Instead of refactoring to make them pub(crate), we'll change all consumers to use TableConfiguration instead.
Going by the thought experiment, it also seems like this should hold a version for the table configuration 🤔. A schema for a table only makes sense given a fixed version.
Edit: same for the table_root. Now it's starting to feel like I'm blowing up the responsibilities of TableConfiguration. I'll sit on these thoughts for a bit more.
It'll probably be important to find the dividing line between this and ResolvedTable once catalogs come into play 😵
Yes, and yes
pub fn schema(&self) -> &Schema {
-    &self.schema
+    self.table_configuration.schema()
}

/// Table [`Metadata`] at this `Snapshot`s version.
pub fn metadata(&self) -> &Metadata {
-    &self.metadata
+    self.table_configuration.metadata()
}

/// Table [`Protocol`] at this `Snapshot`s version.
pub fn protocol(&self) -> &Protocol {
How invasive would it be to remove these three methods from Snapshot? (Not in this PR.) Conceptually, P&M are Delta details that shouldn't be pub in the first place (and they're "raw" information that is easily misused). Schema is a border case, since we use it a lot and it's not a Delta-specific concept.
Just tried it, and dev-visibility works fine. pub(crate) makes an integration test fail to compile since it reads P & M.
You've convinced me that P&M shouldn't be exposed to "regular users" of kernel. I think it fits in well with the spirit of developer-visibility to have them, though. If dev-vis exposes all the actions, folks with that feature flag probably want to see these implementation details.
Given all that, I'd advocate for dev-vis for these methods :)
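Concretely, the dev-vis version would presumably mirror the attribute pattern this PR already uses on is_cdf_read_supported (a sketch, not the final code):

impl Snapshot {
    /// Table [`Protocol`] at this `Snapshot`s version.
    // pub(crate) by default; promoted to pub when the developer-visibility
    // feature flag is enabled, via the `visibility` crate's attribute macro
    #[cfg_attr(feature = "developer-visibility", visibility::make(pub))]
    pub(crate) fn protocol(&self) -> &Protocol {
        self.table_configuration.protocol()
    }
}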
Since you mentioned not to do it in this PR, I made an issue to keep track of it here: #648
Agree that dev-visibility is reasonable. If the other PR would be tiny, we could consider folding it into this PR -- but I suspect it will be big enough (and not directly on topic here) to be worth splitting out.
It is indeed a small change, actually. I think I'd instead fold the issue above into the one for consolidating all the protocol validation in TableConfiguration: this PR for introducing TC, that second PR for cementing it as the one-stop shop for validated P&M.
EDIT: To clarify, making methods dev-visible is small. Removing them outright is more work.
}

/// Table [`Protocol`] at this `Snapshot`s version.
pub fn protocol(&self) -> &Protocol {
-    &self.protocol
+    self.table_configuration.protocol()
}

/// Get the [`TableProperties`] for this [`Snapshot`].
pub fn table_properties(&self) -> &TableProperties {
Should (the whole raw set of) table properties be pub? Many table properties are mandated by the Delta spec as part of specific features, and shouldn't be manipulated directly. Many others are mandated by delta-spark convention, but we don't actually support them all here. Should we hide the raw properties and instead surface the ones users should actually see/set via individual methods on the table configuration?
This is another place where I imagine these details fit within the spirit of developer-visibility, imo. A use case I can think of is as follows: kernel doesn't support retention duration/vacuum, and iirc we don't plan to support it. Suppose delta-rs wanted to see these properties; they'd want access to the table properties. Dev-visibility is a good approach to letting them access these details.
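For illustration, surfacing one vetted property instead of the raw map might look like this (field and method names are made up for the sketch):

use std::time::Duration;

struct TableProperties {
    deleted_file_retention_duration: Option<Duration>,
}

struct TableConfiguration {
    table_properties: TableProperties,
}

impl TableConfiguration {
    // expose a single property (e.g. for delta-rs to implement vacuum)
    // rather than handing out the whole raw TableProperties struct
    fn deleted_file_retention_duration(&self) -> Option<Duration> {
        self.table_properties.deleted_file_retention_duration
    }
}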
kernel/src/table_configuration.rs
Outdated
pub fn column_mapping_mode(&self) -> &ColumnMappingMode {
    &self.column_mapping_mode
}
pub fn schema(&self) -> &Schema {
    self.schema.as_ref()
}
pub fn protocol(&self) -> &Protocol {
    &self.protocol
}
pub fn metadata(&self) -> &Metadata {
    &self.metadata
}
pub fn table_properties(&self) -> &TableProperties {
Can we make these pub(crate) (with developer-visibility) or even private? Ideally, few if any callers should need access to these raw objects because the various accessor methods cover their needs.
To make them private, we'd have to remove the protocol/metadata methods from the snapshot. You'd mentioned that should be in another PR, but I could go ahead and remove them here :)
If it doesn't bloat this PR, we could consider doing it here? Otherwise, it belongs in the other PR as you say.
kernel/src/snapshot.rs
Outdated
@@ -25,11 +24,7 @@ const LAST_CHECKPOINT_FILE_NAME: &str = "_last_checkpoint";
pub struct Snapshot {
    pub(crate) table_root: Url,
aside: given that we have a fn table_root() and a fn _log_segment(), it seems we don't need to have these be pub(crate)? Also: why did we go with the _log_segment naming convention? Seems pythonic, and I haven't seen any similar naming in our code 🤔
Agree that's weird. I wonder if somebody was unaware that field names and method names don't collide in Rust?
changed the title: TableConfiguration to jointly manage metadata, protocol, and column mapping → TableConfiguration to jointly manage metadata, protocol, and table properities
changed the title: TableConfiguration to jointly manage metadata, protocol, and table properities → TableConfiguration to jointly manage metadata, protocol, and table properties
if snapshot.table_configuration().is_cdf_read_supported() {
    Ok(())
} else {
    Err(Error::change_data_feed_unsupported(snapshot.version()))
}
aside: Intuitively, there should be a better way to express this in Rust... but I can't find anything :(
agreed!! There is nice support for options, but I couldn't find anything. I'll give it some more thought 👍
update: The only options seem to be:
- adding another crate
- converting to an Option and then to a Result, which is more steps than it's worth :/
Looks like we're stuck with this.
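For reference, the option-then-result route would look roughly like this; agreed that it's arguably no clearer than the plain if/else:

// bool -> Option<()> -> Result<(), Error>, using std's bool::then_some and
// Option::ok_or_else (no extra crate needed)
snapshot
    .table_configuration()
    .is_cdf_read_supported()
    .then_some(())
    .ok_or_else(|| Error::change_data_feed_unsupported(snapshot.version()))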
kernel/src/table_configuration.rs
Outdated
-    Error::unsupported("Change data feed not supported when column mapping is enabled")
-);
-Ok(())
+protocol_supported && cdf_enabled && column_mapping_disabled
I assume CDF + CM is a TODO?
Correct.
    .protocol()
    .has_writer_feature(&WriterFeatures::DeletionVectors)
    && self.protocol.min_writer_version() == 7;
read_supported && write_supported
Is it normal to store RW features in both the reader and writer lists? I thought each feature appeared in only one list, and writers are required to check both? The spec says:

    to write a table, writers must implement and respect all features listed in writerFeatures. Because writers have to read the table (or only the Delta log) before write, they must implement and respect all reader features as well.

At least, I would have interpreted the above to mean "writer expects to find DV (a reader feature) in the reader list" rather than "writer checks both reader and writer lists and expects to find it in both".
I just checked, and we do indeed need to check both lists. Taken from our table-with-dv-small:

{"protocol":{"minReaderVersion":3,"minWriterVersion":7,"readerFeatures":["deletionVectors"],"writerFeatures":["deletionVectors"]}}

DV is in both lists, so we should check both lists.
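A minimal sketch of the both-lists check (simplified stand-in types; the version numbers come from the protocol action above):

struct Protocol {
    min_reader_version: u32,
    min_writer_version: u32,
    reader_features: Vec<String>,
    writer_features: Vec<String>,
}

impl Protocol {
    // deletionVectors must appear in BOTH lists, as in the
    // table-with-dv-small protocol action quoted above
    fn is_deletion_vector_supported(&self) -> bool {
        let read_supported = self.min_reader_version == 3
            && self.reader_features.iter().any(|f| f == "deletionVectors");
        let write_supported = self.min_writer_version == 7
            && self.writer_features.iter().any(|f| f == "deletionVectors");
        read_supported && write_supported
    }
}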
kernel/src/table_configuration.rs
Outdated
pub(crate) fn column_mapping_mode(&self) -> &ColumnMappingMode {
    &self.column_mapping_mode
}
/// The [`Schema`] of this table configuration's [`Metadata`]
These docs are similar to the ones in Snapshot. Ex:

/// Version of this `Snapshot` in the table.
pub fn version(&self) -> Version {
    self.log_segment.end_version
}

I'm not convinced these are useful doc comments. Anyone have thoughts?
One-liners like this are always tricky... in theory the name is self-describing. On the other hand, a single short sentence to confirm that for the reader isn't hard to write and may do some good?
kernel/src/table_configuration.rs
Outdated
    version: Version,
) -> DeltaResult<Self> {
    // important! before a read/write to the table we must check it is supported
    protocol.ensure_read_supported()?;
TODO: We should be checking that writes are supported as well.
We also need to check that the set of write features is a subset of the read features. I think I'll leave that to the future PR that ports all of the protocol checks to TableConfiguration.
well - I think this needs to be configurable (we don't need to check whether writes are supported if we only read the table)? This feels like more of a code smell that we might not be doing it in the right spot.
Force-pushed from 2ffdeea to 829cf43
///
/// [`TableChanges`]: crate::table_changes::TableChanges
#[cfg_attr(feature = "developer-visibility", visibility::make(pub))]
pub(crate) fn is_cdf_read_supported(&self) -> bool {
I don't do unit tests here because TableChanges already tests this path. I think it's worth creating an issue to port some of that to unit tests here.
Note that the testing in TableChanges is more of an integration test that reads a delta log and fails for various reasons. I don't think it's worth keeping those integration tests.
If all this sounds good, I'll go ahead and make the issue :)
Not sure I quite followed... are you proposing that integration tests are unhelpful in general and should always be simplified to unit tests? Or that the specific existing TC tests don't need to be written as integration tests? Or something else?
Sorry, I'm saying that the integration tests can be simplified to unit tests. Unit tests make it clear that we're hitting all the edge cases that we want to hit. Lmk if you think my judgement is off though :)
I read that as: we have some TableChanges (integration) tests that basically need to be migrated to unit tests here, as they aren't specific to TableChanges and should just exercise this in isolation (in UT).
If that's the case - sounds good :)
Also added this to the tracking issue for #650
use super::TableConfiguration;

#[test]
fn dv_supported_not_enabled() {
I'm not sure how useful these tests are :/ This is a lot of lines to test pretty simple logic. With more *_is_supported and *_is_enabled methods, this gets unwieldy.
I could try to factor this out, but covering all cases would involve specifying reader/writer versions and configurations. At that point, we're just creating P & M structs through function parameters.
Would appreciate some advice on this :)
It seems like we want to verify that the information from the underlying P&M gets surfaced correctly by TC? But that gets annoying, because as you say we end up writing very specific tests that exactly fit the prod code they aim to exercise (such tests tend to be short-sighted). Combine that with a very large combination space for table features and it gets ugly fast.
darn okay. I think I'll leave more thorough testing for when we do more Metadata/Protocol validation. If we test that we correctly validated P&M + Schema + column mapping, the *_is_supported and *_is_enabled methods should be small enough that they're not worth testing.
I'm not sure we have a way around the issue of "many combinations" that we want to test.. I would probably tackle this by introducing some framework (function/macro/etc.) for passing some large matrix of combinations and expected outputs? can help out more here if needed
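A small declarative macro could keep such a matrix readable. A sketch (make_config is a hypothetical helper that builds a TableConfiguration from reader/writer versions and table properties):

// each case is (reader_version, writer_version, properties) => expected
macro_rules! dv_support_cases {
    ($($name:ident: ($r:expr, $w:expr, $props:expr) => $expected:expr,)*) => {
        $(
            #[test]
            fn $name() {
                // make_config is a hypothetical test helper
                let config = make_config($r, $w, $props);
                assert_eq!(config.is_deletion_vector_supported(), $expected);
            }
        )*
    };
}

dv_support_cases! {
    dv_supported: (3, 7, &[("delta.enableDeletionVectors", "true")]) => true,
    dv_unsupported_old_protocol: (1, 2, &[]) => false,
}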
Hmm, I think this is something I'll have to think about more and address in a follow-up PR.
Edit: Added this to the tracking issue for improving TableConfiguration: #650
LGTM
check_table_config(&start_snapshot)?;
check_table_config(&end_snapshot)?;
future looking: If we move to a ResolvedTable concept, do we actually need two snapshots? Or could we use just a start snapshot and an end resolved table?
agree I think we just need bits from ResolvedTable at the end
If ResolvedTable can hold the entire TableConfiguration, then we actually only need ResolvedTable for both start and end. CDF doesn't need the start_snapshot's log segment.
flushing comments
}

/// Get the [`TableProperties`] for this [`Snapshot`].
pub fn table_properties(&self) -> &TableProperties {
did we effectively remove the table_properties API?
It's just moving from Snapshot to TableConfiguration, no?
https://github.com/delta-io/delta-kernel-rs/pull/644/files#diff-56f657b4ffdaff037266acb6a20e8723f0985a49b487f289cc2a5ce41bb8ca19R82
Zach makes a good point tho, it's best to stay consistent. The plan is to eventually move all of these fields into the table config, but I think I'll revert this and match it up with the protocol, metadata, and other fields.
kernel/src/table_configuration.rs
Outdated
@@ -0,0 +1,262 @@
//! Provides a high level api to check feature support/enablement for a table.
nit: can we expound on this? i.e. include that it encapsulates Protocol/Metadata/TableProperties and, since they are all intertwined, performs cross-validation in order to expose a nice API for bits like column mapping, etc.
Addressed this. PTAL!
protocol: Protocol,
schema: SchemaRef,
table_properties: TableProperties,
column_mapping_mode: ColumnMappingMode,
do we need this? can't we just compute it in a fn column_mapping_mode()?
kernel/src/table_configuration.rs
Outdated
/// another, and that the kernel supports reading from this table.
pub(crate) fn try_new(
I wonder if there's a better way for us to communicate this? Embedding the "can I read this table" check into TableConfiguration::try_new() feels unintuitive?
Perhaps it's easier to think about if we s/read/access/ -- if kernel encounters a reader feature it doesn't understand, we can't do anything with that table, because the interpretation of key fields could have changed, the log replay algorithm could have been updated, etc. Think e.g. of how v2 checkpoints change which files we should be looking for in the _delta_log dir, or how in-commit timestamps can change which version of the table a given timestamp resolves to.
And then we have writer[-only] features, which only matter if we want to change the table in some way.
I agree that it feels unintuitive. There are three questions here:
1) Is this table configuration valid?
2) Can we read from the table (ensure_read_supported)?
3) Can we write to the table (ensure_write_supported)?
It feels like we ought to split these concepts. The following are true; I provide my reasoning in [1]:
- Claim 1: For both reads and writes, we should check ensure_read_supported.
- Claim 2: For reads, it is important that we don't check write support.
So while it would be fine to check ensure_read_supported in try_new (claim 1), we definitely can't check ensure_write_supported there (claim 2).
I think we should have separate functions TableConfiguration::is_read_supported and TableConfiguration::is_write_supported, and try_new will only be responsible for asserting that the table config is valid. These separately address the three questions above.
I'll go do that. Lmk if I'm missing something @zachschuermann @scovich

[1]:
Claim 1: Both reads and writes check ensure_read_supported. Reads of course check ensure_read_supported. In the case where we want to write, we need to perform a read anyway. The protocol states:

    to write a table, writers must implement and respect all features listed in writerFeatures. Because writers have to read the table (or only the Delta log) before write, they must implement and respect all reader features as well.

This is why it's not incorrect to check 1 & 2 in try_new.

Claim 2: Checking ensure_write_supported on the read path is incorrect. Suppose we don't support appendOnly and we want to read a table. ensure_write_supported would fail on this, but this is a perfectly fine table for us to read. This is why it's incorrect to check ensure_write_supported in try_new.
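A sketch of the proposed split, with simplified stand-ins (the real checks would consult feature lists, not just version numbers):

struct Protocol {
    min_reader_version: u32,
    min_writer_version: u32,
}

struct TableConfiguration {
    protocol: Protocol,
}

impl TableConfiguration {
    // only asserts that the configuration is internally valid; it no longer
    // decides whether this kernel can read or write the table
    fn try_new(protocol: Protocol) -> Result<Self, String> {
        Ok(TableConfiguration { protocol })
    }
    fn is_read_supported(&self) -> bool {
        self.protocol.min_reader_version <= 3
    }
    // claim 1: a write is also a read, so write support requires read support
    fn is_write_supported(&self) -> bool {
        self.is_read_supported() && self.protocol.min_writer_version <= 7
    }
}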
The biggest reason to keep the read-supported check in try_new is that it's a narrow waist. Everyone who wants to access the table in any way needs to pass that check first, and putting it there catches them all.
For writes, a similar narrow waist would be the transaction builder -- tho that doesn't cover things like metadata cleanup and vacuum, which don't perform any commit.
We really don't want to forget those checks... Bad Things can happen.
let column_mapping_mode = column_mapping_mode(&protocol, &table_properties);

// validate column mapping mode -- all schema fields should be correctly (un)annotated
validate_schema_column_mapping(&schema, column_mapping_mode)?;
ah I suppose regarding my comment on storing column mapping mode above... I guess since we already compute it for validation, we may as well remember it?
yep!
Force-pushed from 3d0044b to 64b44ec
What changes are proposed in this pull request?

This PR introduces the TableConfiguration struct, which is used to perform feature enablement checks on both the table's protocol and metadata.

Problem statement

To check that a feature is enabled, you often must check both that a certain reader/writer feature is present and that a table property is set to true. For example, a writer must check both the delta.enableDeletionVectors table property and that the deletionVectors writer feature is present in the table's Protocol. Probing two disparate structs to do a single check is error-prone and may lead these metadata/protocol checks to fall out of sync. Moreover, checks are performed in the CDF path, the snapshot scan path, and the read path, so there are many ways in which protocol and metadata checks can diverge from one another. Put simply, the problems are:
Solution

TableConfiguration consolidates all protocol and metadata checks in one place. It also ensures that the logic for checking feature enablement is kept consistent throughout the codebase. This addresses problems 1-4 outlined above.

Closes: #571
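As an illustration of the shape of the change (simplified stand-ins, not the kernel's actual field names):

struct Protocol {
    writer_features: Vec<String>,
}

struct TableProperties {
    enable_deletion_vectors: bool,
}

struct TableConfiguration {
    protocol: Protocol,
    table_properties: TableProperties,
}

impl TableConfiguration {
    // previously every call site probed Protocol and TableProperties
    // separately; here the paired checks live together and cannot drift apart
    fn is_deletion_vector_enabled(&self) -> bool {
        self.table_properties.enable_deletion_vectors
            && self
                .protocol
                .writer_features
                .iter()
                .any(|f| f == "deletionVectors")
    }
}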
How was this change tested?

Existing tests pass.
TODO: add more tests