Parquet: Implement Variant writers #12323

Open
wants to merge 3 commits into
base: main
from

Conversation

rdblue
Contributor

@rdblue rdblue commented Feb 18, 2025

This PR implements Variant writers for Parquet based on a Parquet schema passed into the writer builder. It works basically the same as #12139.

@rdblue rdblue force-pushed the variant-parquet-writers branch 3 times, most recently from 360f531 to 086a16c Compare February 20, 2025 23:38
@@ -85,6 +85,16 @@ public static ShreddedObject object(VariantMetadata metadata) {
return new ShreddedObject(metadata);
}

public static ShreddedObject object(VariantObject object) {
Contributor Author

Used to create a shredded object from an existing object when writing. It uses the object's metadata.

This avoids exposing VariantObject.metadata because metadata is carried by Variant instead of values.

@@ -62,7 +71,7 @@ protected ParquetValueWriter<?> timestampWriter(ColumnDescriptor desc, boolean i
}
}

- private class WriteBuilder extends ParquetTypeVisitor<ParquetValueWriter<?>> {
+ private class WriteBuilder extends TypeWithSchemaVisitor<ParquetValueWriter<?>> {
Contributor Author

In order to detect a variant type, this needs to use the original Iceberg schema. When Parquet exposes the VARIANT logical type annotation, we can update this to no longer require the schema.

@@ -192,6 +194,11 @@ public WriteBuilder schema(Schema newSchema) {
return this;
}

public WriteBuilder variantShreddingFunc(BiFunction<Integer, String, Type> func) {
Contributor Author

This is a function passed to schema conversion. Each variant's field ID and name are passed to this function to determine its shredded type (typed_value). I'm using a callback function to avoid exposing a way to set the Parquet schema directly here.
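
For illustration, a callback of this shape could be built from a lookup table keyed by field ID. This is a minimal sketch, not the PR's code: `ShreddingFuncSketch` and `byFieldId` are hypothetical names, a plain `String` stands in for the Parquet `Type` that the real callback returns, and treating a `null` result as "leave this variant unshredded" is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper; not part of Iceberg or Parquet.
final class ShreddingFuncSketch {
  private ShreddingFuncSketch() {}

  // Returns a callback that maps a variant column (field ID, name) to a
  // description of its typed_value type. A String stands in for the Parquet
  // Type used by the real builder method.
  static BiFunction<Integer, String, String> byFieldId(Map<Integer, String> shreddedTypes) {
    // Defensive copy so later changes to the caller's map do not affect the callback.
    Map<Integer, String> copy = new HashMap<>(shreddedTypes);
    // Assumption: null means "no shredding for this variant".
    return (fieldId, name) -> copy.get(fieldId);
  }

  public static void main(String[] args) {
    Map<Integer, String> types = new HashMap<>();
    types.put(3, "int64");
    BiFunction<Integer, String, String> func = byFieldId(types);
    System.out.println("field 3 (payload): " + func.apply(3, "payload"));
    System.out.println("field 4 (other): " + func.apply(4, "other"));
  }
}
```

Keeping the decision behind a function means the caller only answers "what type for this variant?" and never hands over a whole Parquet schema, which matches the motivation given in the comment above.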

private String name = "table";
private WriteSupport<?> writeSupport = null;
- private Function<MessageType, ParquetValueWriter<?>> createWriterFunc = null;
+ private BiFunction<Schema, MessageType, ParquetValueWriter<?>> createWriterFunc = null;
Contributor Author

This change supports passing the schema to the write builder function.
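
The diff excerpt does not show how callers that still supply a one-argument function are handled. One possible approach, sketched here with generic type parameters rather than the real `Schema`, `MessageType`, and `ParquetValueWriter` classes, is to wrap the one-argument function so that it ignores the schema. `WriterFuncAdapter` and `ignoringSchema` are hypothetical names.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical helper; not part of Iceberg.
final class WriterFuncAdapter {
  private WriterFuncAdapter() {}

  // Adapts a factory that takes only the file type (M) into one that also
  // accepts a schema (S). The schema argument is ignored by the wrapper.
  static <S, M, W> BiFunction<S, M, W> ignoringSchema(Function<M, W> legacy) {
    return (schema, messageType) -> legacy.apply(messageType);
  }

  public static void main(String[] args) {
    // Strings stand in for the schema and message type; Integer for the writer.
    BiFunction<String, String, Integer> func = ignoringSchema(String::length);
    System.out.println(func.apply("ignored schema", "abcd"));
  }
}
```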

@@ -71,6 +71,10 @@ public static UnboxedWriter<Short> shorts(ColumnDescriptor desc) {
return new ShortWriter(desc);
}

public static <T> ParquetValueWriter<T> unboxed(ColumnDescriptor desc) {
Contributor Author

Needed to expose this to get float and double working. Those writers currently require a metrics builder that expects a non-null field ID.

private final VariantMetadata metadata;
private final VariantValue value;

VariantData(VariantMetadata metadata, VariantValue value) {
Contributor Author

This was originally in #12304, but I moved it here after reverting changes to variant classes in that PR. This PR uses it to create objects that are passed to writers and created by readers.

@rdblue rdblue force-pushed the variant-parquet-writers branch from 67c2bab to f4296c3 Compare February 21, 2025 23:18
@github-actions github-actions bot added the API label Feb 21, 2025
@rdblue
Contributor Author

rdblue commented Feb 21, 2025

Rebased after moving variants to API in #12374.

@rdblue rdblue force-pushed the variant-parquet-writers branch from f4296c3 to f3e3ccc Compare February 21, 2025 23:20
1 participant