-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat!(ffi): support field nullability in schema visitor #656
feat!(ffi): support field nullability in schema visitor #656
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this -- it's a day one gap I'm surprised we got away with ignoring for so long.
ffi/src/schema.rs
Outdated
@@ -143,21 +147,21 @@ pub unsafe extern "C" fn visit_schema( | |||
fn visit_struct_fields(visitor: &EngineSchemaVisitor, s: &StructType) -> usize { | |||
let child_list_id = (visitor.make_field_list)(visitor.data, s.fields.len()); | |||
for field in s.fields() { | |||
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id); | |||
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id, field.is_nullable()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The visit_schema_item
helper is private, so we should consider where to inject the new arg for maximum readability going forward -- probably not at end like this?
Honestly, the current arg order is already a bit weird, so maybe we should reorder the whole thing:
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id, field.is_nullable()); | |
visit_schema_item(field.name(), field.data_type(), field.is_nullable(), visitor, child_list_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea I agree if it isn't too much trouble to just reorder the args like proposed here - it matches the typical parameter ordering that i'm aware of in Arrow, etc.
ffi/src/schema.rs
Outdated
// TODO: Should we no longer support value_contains_null argument in visitor_map | ||
// if nullability of the value will anyway be known through simple type visitor? | ||
// Same for contains_null in visitor_array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good question. This is anyway a breaking change (because all the type visitors need to be updated with the new nullable
arg. If we don't remove this now-redundant arg, it's just another thing that can potentially get out of sync as the code base evolves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... we still need a boolean nullability arg, but this PR changes its interpretation. Before this PR, it indicates whether the map contains nullable values. After this PR, it needs to indicate whether the map itself is nullable (e.g. because the map is a nullable struct field). Passing two booleans would just invite confusion IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we change the semantics of that boolean to "whether the map itself is nullable" we still retain the "whether the map contains nullable values" by leveraging the primitive's own nullability right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as the TODO itself observes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we change the semantics of that boolean to "whether the map itself is nullable" we still retain the "whether the map contains nullable values" by leveraging the primitive's own nullability right?
yes, that's what I meant, e.g. user can find out whether map value / array value is nullable not from visitor_map
/visitor_array
argument contains_null
, but from its own visitor (whatever it is - visitor_string, etc). Therefore having map/array value nullability indication in two places is strange and ambiguous, so I'd remove it from visitor_map
/visitor_array
(and leave there only argument which indicates whether map/array itself is nullable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep sorry just me clarifying :)
ffi/src/schema.rs
Outdated
is_nullable: bool, | ||
contains_null: bool, // if this array can contain null values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_nullable: bool, | |
contains_null: bool, // if this array can contain null values | |
is_nullable: bool, |
(see commentary about the TODO flagged below)
ffi/src/schema.rs
Outdated
is_nullable: bool, | ||
value_contains_null: bool, // if this map can contain null values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_nullable: bool, | |
value_contains_null: bool, // if this map can contain null values | |
is_nullable: bool, |
(see commentary about the TODO flagged below)
ffi/src/schema.rs
Outdated
precision: u8, | ||
scale: u8, | ||
), | ||
|
||
/// Visit a `string` belonging to the list identified by `sibling_list_id`. | ||
pub visit_string: | ||
extern "C" fn(data: *mut c_void, sibling_list_id: usize, name: KernelStringSlice), | ||
extern "C" fn(data: *mut c_void, sibling_list_id: usize, name: KernelStringSlice, is_nullable: bool), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure fmt check will fail on this and similar long lines below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kssenii you can just run cargo fmt
to get everything autoformatted nicely!
Not sure why semver didn't flag the callback signature change as breaking?? |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #656 +/- ##
==========================================
- Coverage 84.05% 83.92% -0.13%
==========================================
Files 75 75
Lines 17251 17277 +26
Branches 17251 17277 +26
==========================================
Hits 14500 14500
- Misses 2052 2078 +26
Partials 699 699 ☔ View full report in Codecov by Sentry. |
ffi/examples/read-table/schema.h
Outdated
@@ -2,6 +2,7 @@ | |||
#include "read_table.h" | |||
#include "kernel_utils.h" | |||
#include <stdint.h> | |||
#include <time.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the time header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I got this error otherwise
/home/kssenii/workspace/ClickHouse2/contrib/delta-kernel-rs/ffi/examples/read-table/read_table.c:168:24: error: implicit declaration of function ‘gmtime’ [-Wimplicit-function-declaration]
168 | struct tm *tm_info = gmtime(&tv.tv_sec);
| ^~~~~~
/home/kssenii/workspace/ClickHouse2/contrib/delta-kernel-rs/ffi/examples/read-table/read_table.c:9:1: note: ‘gmtime’ is defined in header ‘<time.h>’; this is probably fixable by adding ‘#include <time.h>’
8 | #include "schema.h"
+++ |+#include <time.h>
9 | #include "kernel_utils.h"
(just noticed I should have added it in read_table.c instead of schema.h, but anyway that fixed the problem).
I will remove this header, would leave it for myself locally only.
ffi/src/schema.rs
Outdated
@@ -143,21 +147,21 @@ pub unsafe extern "C" fn visit_schema( | |||
fn visit_struct_fields(visitor: &EngineSchemaVisitor, s: &StructType) -> usize { | |||
let child_list_id = (visitor.make_field_list)(visitor.data, s.fields.len()); | |||
for field in s.fields() { | |||
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id); | |||
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id, field.is_nullable()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea I agree if it isn't too much trouble to just reorder the args like proposed here - it matches the typical parameter ordering that i'm aware of in Arrow, etc.
ffi/src/schema.rs
Outdated
precision: u8, | ||
scale: u8, | ||
), | ||
|
||
/// Visit a `string` belonging to the list identified by `sibling_list_id`. | ||
pub visit_string: | ||
extern "C" fn(data: *mut c_void, sibling_list_id: usize, name: KernelStringSlice), | ||
extern "C" fn(data: *mut c_void, sibling_list_id: usize, name: KernelStringSlice, is_nullable: bool), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kssenii you can just run cargo fmt
to get everything autoformatted nicely!
ffi/src/schema.rs
Outdated
// TODO: Should we no longer support value_contains_null argument in visitor_map | ||
// if nullability of the value will anyway be known through simple type visitor? | ||
// Same for contains_null in visitor_array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we change the semantics of that boolean to "whether the map itself is nullable" we still retain the "whether the map contains nullable values" by leveraging the primitive's own nullability right?
ffi/src/schema.rs
Outdated
call!( | ||
visit_array, | ||
visit_array_item(visitor, at, at.contains_null) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: more compact to factor out the nested function call:
call!( | |
visit_array, | |
visit_array_item(visitor, at, at.contains_null) | |
) | |
let child_list_id = visit_array_item(visitor, at, at.contains_null); | |
call!(visit_array, child_list_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to apply fmt one more time after f2fa5c5, so now it folded back to one line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
nit: I think we need to tweak the PR title's prefix a tad? Maybe |
yep thanks for flagging, editted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM just had a quick comment on removing those prints!
ffi/examples/read-table/schema.h
Outdated
char* name_ptr = malloc(sizeof(char) * (name.len + 24)); | ||
snprintf(name_ptr, name.len + 1, "%s", name.ptr); | ||
snprintf(name_ptr + name.len, 24, " (contains null: %s)", contains_null ? "true" : "false"); | ||
snprintf(name_ptr + name.len, 24, " (is nullable: %s)", is_nullable ? "true" : "false"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we only have 22 byte max now? (it's either 21 or 22 bytes depending on if it's true/false)
@kssenii can you merge in latest main? There should be a button in the github UI, but I'm not the PR owner so I can't do it for you. |
What changes are proposed in this pull request?
Added support for field nullability in ffi api. It was not supported before, but it still important to know whether fields are nullable or not, because some applications might work more efficiently with non-nullable types.
This PR affects the following public APIs
New argument
is_nullable
is added toEngineSchemaVisitor
visitor functions.How was this change tested?
Tested on ClickHouse's delta lake related tests.