Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!(ffi): support field nullability in schema visitor #656

Merged

Conversation

kssenii
Copy link
Contributor

@kssenii kssenii commented Jan 22, 2025

  • new feature PR: "feat: support for field nullability in ffi api"

What changes are proposed in this pull request?

Added support for field nullability in ffi api. It was not supported before, but it still important to know whether fields are nullable or not, because some applications might work more efficiently with non-nullable types.

This PR affects the following public APIs

New argument is_nullable is added to EngineSchemaVisitor visitor functions.

How was this change tested?

Tested on ClickHouse's delta lake related tests.

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this -- it's a day one gap I'm surprised we got away with ignoring for so long.

ffi/src/schema.rs Show resolved Hide resolved
@@ -143,21 +147,21 @@ pub unsafe extern "C" fn visit_schema(
fn visit_struct_fields(visitor: &EngineSchemaVisitor, s: &StructType) -> usize {
let child_list_id = (visitor.make_field_list)(visitor.data, s.fields.len());
for field in s.fields() {
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id);
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id, field.is_nullable());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The visit_schema_item helper is private, so we should consider where to inject the new arg for maximum readability going forward -- probably not at end like this?

Honestly, the current arg order is already a bit weird, so maybe we should reorder the whole thing:

Suggested change
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id, field.is_nullable());
visit_schema_item(field.name(), field.data_type(), field.is_nullable(), visitor, child_list_id);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I agree if it isn't too much trouble to just reorder the args like proposed here - it matches the typical parameter ordering that i'm aware of in Arrow, etc.

Comment on lines 192 to 194
// TODO: Should we no longer support value_contains_null argument in visitor_map
// if nullability of the value will anyway be known through simple type visitor?
// Same for contains_null in visitor_array.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good question. This is anyway a breaking change (because all the type visitors need to be updated with the new nullable arg. If we don't remove this now-redundant arg, it's just another thing that can potentially get out of sync as the code base evolves.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... we still need a boolean nullability arg, but this PR changes its interpretation. Before this PR, it indicates whether the map contains nullable values. After this PR, it needs to indicate whether the map itself is nullable (e.g. because the map is a nullable struct field). Passing two booleans would just invite confusion IMO.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we change the semantics of that boolean to "whether the map itself is nullable" we still retain the "whether the map contains nullable values" by leveraging the primitive's own nullability right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as the TODO itself observes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we change the semantics of that boolean to "whether the map itself is nullable" we still retain the "whether the map contains nullable values" by leveraging the primitive's own nullability right?

yes, that's what I meant, e.g. user can find out whether map value / array value is nullable not from visitor_map/visitor_array argument contains_null, but from its own visitor (whatever it is - visitor_string, etc). Therefore having map/array value nullability indication in two places is strange and ambiguous, so I'd remove it from visitor_map/visitor_array (and leave there only argument which indicates whether map/array itself is nullable).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep sorry just me clarifying :)

Comment on lines 56 to 57
is_nullable: bool,
contains_null: bool, // if this array can contain null values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is_nullable: bool,
contains_null: bool, // if this array can contain null values
is_nullable: bool,

(see commentary about the TODO flagged below)

Comment on lines 68 to 69
is_nullable: bool,
value_contains_null: bool, // if this map can contain null values
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
is_nullable: bool,
value_contains_null: bool, // if this map can contain null values
is_nullable: bool,

(see commentary about the TODO flagged below)

precision: u8,
scale: u8,
),

/// Visit a `string` belonging to the list identified by `sibling_list_id`.
pub visit_string:
extern "C" fn(data: *mut c_void, sibling_list_id: usize, name: KernelStringSlice),
extern "C" fn(data: *mut c_void, sibling_list_id: usize, name: KernelStringSlice, is_nullable: bool),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure fmt check will fail on this and similar long lines below?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kssenii you can just run cargo fmt to get everything autoformatted nicely!

@scovich scovich added the breaking-change Change that will require a version bump label Jan 22, 2025
@scovich
Copy link
Collaborator

scovich commented Jan 22, 2025

Not sure why semver didn't flag the callback signature change as breaking??

Copy link

codecov bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 0% with 36 lines in your changes missing coverage. Please review.

Project coverage is 83.92%. Comparing base (45eedf2) to head (44c58cf).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
ffi/src/schema.rs 0.00% 36 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #656      +/-   ##
==========================================
- Coverage   84.05%   83.92%   -0.13%     
==========================================
  Files          75       75              
  Lines       17251    17277      +26     
  Branches    17251    17277      +26     
==========================================
  Hits        14500    14500              
- Misses       2052     2078      +26     
  Partials      699      699              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -2,6 +2,7 @@
#include "read_table.h"
#include "kernel_utils.h"
#include <stdint.h>
#include <time.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the time header?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I got this error otherwise

/home/kssenii/workspace/ClickHouse2/contrib/delta-kernel-rs/ffi/examples/read-table/read_table.c:168:24: error: implicit declaration of function ‘gmtime’ [-Wimplicit-function-declaration]
  168 |   struct tm *tm_info = gmtime(&tv.tv_sec);
      |                        ^~~~~~
/home/kssenii/workspace/ClickHouse2/contrib/delta-kernel-rs/ffi/examples/read-table/read_table.c:9:1: note: ‘gmtime’ is defined in header ‘<time.h>’; this is probably fixable by adding ‘#include <time.h>’
    8 | #include "schema.h"
  +++ |+#include <time.h>
    9 | #include "kernel_utils.h"

(just noticed I should have added it in read_table.c instead of schema.h, but anyway that fixed the problem).
I will remove this header, would leave it for myself locally only.

@@ -143,21 +147,21 @@ pub unsafe extern "C" fn visit_schema(
fn visit_struct_fields(visitor: &EngineSchemaVisitor, s: &StructType) -> usize {
let child_list_id = (visitor.make_field_list)(visitor.data, s.fields.len());
for field in s.fields() {
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id);
visit_schema_item(field.data_type(), field.name(), visitor, child_list_id, field.is_nullable());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I agree if it isn't too much trouble to just reorder the args like proposed here - it matches the typical parameter ordering that i'm aware of in Arrow, etc.

precision: u8,
scale: u8,
),

/// Visit a `string` belonging to the list identified by `sibling_list_id`.
pub visit_string:
extern "C" fn(data: *mut c_void, sibling_list_id: usize, name: KernelStringSlice),
extern "C" fn(data: *mut c_void, sibling_list_id: usize, name: KernelStringSlice, is_nullable: bool),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kssenii you can just run cargo fmt to get everything autoformatted nicely!

Comment on lines 192 to 194
// TODO: Should we no longer support value_contains_null argument in visitor_map
// if nullability of the value will anyway be known through simple type visitor?
// Same for contains_null in visitor_array.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we change the semantics of that boolean to "whether the map itself is nullable" we still retain the "whether the map contains nullable values" by leveraging the primitive's own nullability right?

Comment on lines 268 to 271
call!(
visit_array,
visit_array_item(visitor, at, at.contains_null)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: more compact to factor out the nested function call:

Suggested change
call!(
visit_array,
visit_array_item(visitor, at, at.contains_null)
)
let child_list_id = visit_array_item(visitor, at, at.contains_null);
call!(visit_array, child_list_id)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to apply fmt one more time after f2fa5c5, so now it folded back to one line

Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@scovich
Copy link
Collaborator

scovich commented Jan 22, 2025

nit: I think we need to tweak the PR title's prefix a tad? Maybe feat[ffi api]!: or something?
@zachschuermann ?

@zachschuermann
Copy link
Collaborator

nit: I think we need to tweak the PR title's prefix a tad? Maybe feat[ffi api]!: or something? @zachschuermann ?

yep thanks for flagging, editted

@zachschuermann zachschuermann changed the title ffi api: support field nullability feat![ffi]: support field nullability Jan 22, 2025
@zachschuermann zachschuermann changed the title feat![ffi]: support field nullability feat![ffi]: support field nullability in schema visitor Jan 22, 2025
@zachschuermann zachschuermann changed the title feat![ffi]: support field nullability in schema visitor feat!(ffi): support field nullability in schema visitor Jan 22, 2025
@zachschuermann zachschuermann self-requested a review January 22, 2025 20:04
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just had a quick comment on removing those prints!

ffi/examples/read-table/schema.h Show resolved Hide resolved
Comment on lines 129 to 131
char* name_ptr = malloc(sizeof(char) * (name.len + 24));
snprintf(name_ptr, name.len + 1, "%s", name.ptr);
snprintf(name_ptr + name.len, 24, " (contains null: %s)", contains_null ? "true" : "false");
snprintf(name_ptr + name.len, 24, " (is nullable: %s)", is_nullable ? "true" : "false");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only have 22 byte max now? (it's either 21 or 22 bytes depending on if it's true/false)

@scovich
Copy link
Collaborator

scovich commented Jan 23, 2025

@kssenii can you merge in latest main? There should be a button in the github UI, but I'm not the PR owner so I can't do it for you.

@zachschuermann zachschuermann merged commit d999b5c into delta-io:main Jan 23, 2025
19 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants