Support for timestamp downcasting when loading data to iceberg tables #1045
I also have a patch for this ready, but it seems I don't have permission to push a new branch and create a PR:

```diff
diff --git a/pyiceberg/io/pyarrow.py b/pyiceberg/io/pyarrow.py
index b99c3b1..153b8a5 100644
--- a/pyiceberg/io/pyarrow.py
+++ b/pyiceberg/io/pyarrow.py
@@ -2303,6 +2303,8 @@ def _check_pyarrow_schema_compatible(
 def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
+    from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE
+
     for file_path in file_paths:
         input_file = io.new_input(file_path)
         with input_file.open() as input_stream:
@@ -2313,7 +2315,12 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        downcast_ns_timestamp_to_us = Config().get_bool(DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE) or False
+        _check_pyarrow_schema_compatible(
+            schema,
+            parquet_metadata.schema.to_arrow_schema(),
+            downcast_ns_timestamp_to_us
+        )
         statistics = data_file_statistics_from_parquet_metadata(
             parquet_metadata=parquet_metadata,
```
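For context on what the flag controls: downcasting a nanosecond timestamp to microseconds is just an integer truncation of the epoch value. A minimal pure-Python sketch (illustrative only, not pyiceberg code — the real cast happens inside Arrow):

```python
def downcast_ns_to_us(ts_ns: int) -> int:
    """Truncate a nanoseconds-since-epoch value to microsecond precision.

    Downcasting drops sub-microsecond detail: integer division by 1_000.
    """
    return ts_ns // 1_000

ns = 1_700_000_000_123_456_789  # nanoseconds since epoch
us = downcast_ns_to_us(ns)
print(us)  # 1700000000123456 -- the trailing 789 nanoseconds are dropped
```

This is lossy, which is why pyiceberg makes it an explicit opt-in rather than silently truncating.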
Hi @fusion2222, thank you for raising this issue. We'd love to see your contribution to the project! Could you try forking the project and then creating a new branch there to create the PR?
Here's the setup I currently use
It would be good to add this to the Contributing docs :)
@kevinjqliu @sungwy can I help here? I faced the same issue while trying to write a CSV into Iceberg using pyiceberg/catalog/sql.py
I think we can add this to the documentation
The documentation is at https://py.iceberg.apache.org/configuration/#nanoseconds-support. Do you think there's a better place to signal this to the users?
I've faced the same issue when loading data using the `Table.add_files` method. It fails with this error:
As @fusion2222 already mentioned, setting `downcast-ns-timestamp-to-us-on-write` has no effect here.
@rotem-ad I don't see an active PR for this issue. Would you like to open one? Happy to review.
@kevinjqliu I followed your instructions and created a PR: https://github.com/apache/iceberg-python/pull/1569. All credit to @fusion2222 for this patch. I would love to see this issue fixed ASAP. Thanks!
hey folks, looks like there are currently 2 open PRs on this issue. Let's standardize on one of them and add a test case (see my comment here). There's currently a test to show that
At this point I've spent about 4 hours looking at how to adjust the unit test mentioned by @kevinjqliu, but the code is complicated and hard to navigate for a newcomer. Even if I correct the exception-raising assertion, the old test still asserts on a warning message printed to stdout/stderr. The message looks like this:
This is triggered by
Hi @fusion2222 and @lloyd-EA - thank you both for jumping on the issue and contributing the PRs! I agree with @kevinjqliu here that it would be good to consolidate our efforts on a single PR before duplicating them further. I apologize that the purpose of the error message in the
[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
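To illustrate the decision the patched code path feeds into: `_check_pyarrow_schema_compatible` now receives the downcast flag and uses it when comparing timestamp columns. The sketch below is a hypothetical simplification (the function name and shape are invented for illustration; the real check walks the full Arrow schema), assuming Iceberg's timestamp type is microsecond precision:

```python
def resolve_timestamp_unit(unit: str, downcast_ns_timestamp_to_us: bool = False) -> str:
    """Toy model of the compatibility decision for one timestamp column.

    Iceberg stores timestamps at microsecond precision, so nanosecond
    data is accepted only when the caller opts into downcasting.
    """
    if unit == "us":
        return "us"  # already matches the table's precision
    if unit == "ns" and downcast_ns_timestamp_to_us:
        return "us"  # explicit opt-in: nanoseconds truncated on write
    raise TypeError(f"Unsupported timestamp unit for an Iceberg table: {unit!r}")
```

This is why `add_files` fails on nanosecond columns today: the flag is read from config for normal writes, but never reaches the check on the `add_files` path, so it behaves as if it were always `False`.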
Thanks @sungwy! I'm working on your suggestion here. Just having a little trouble getting PySpark working in my local env; I'll see if I can fix it. @fusion2222 I'm happy to merge your branch into mine and continue with my PR, or you can merge mine into yours, continue with your PR, and decline mine. I really don't mind :)
I will close my PR. The codebase doesn't seem newcomer-friendly, and it seems @lloyd-EA already has some experience with the pyiceberg library.
Sorry you had that experience, @fusion2222! There's of course a lot of Iceberg-specific context in this repository, and I'm hoping we can continue working to build a library that's easy for new contributors to join. I'm cross-posting our finding on @lloyd-EA's PR, where we discovered that Spark-Iceberg does in fact have a problem reading This may also mean that we may only be able to add
Apache Iceberg version
0.7.0 (latest release)
Please describe the bug 🐞
As of release 0.7.0, pyiceberg tables support a new data-loading method, `pyiceberg.table.Table.add_files()`. However, this method currently does not respect the well-documented setting `downcast-ns-timestamp-to-us-on-write`. The setting always defaults to `False` when `Table.add_files()` is used, even if the config file explicitly specifies `downcast-ns-timestamp-to-us-on-write: "True"`. The environment variable `PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE` is not respected either.
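For reference, the configuration that the report expects `add_files()` to honor (per the pyiceberg configuration docs linked above) is a config-file entry:

```yaml
# ~/.pyiceberg.yaml -- setting reported as ignored by Table.add_files()
downcast-ns-timestamp-to-us-on-write: "True"
```

or, equivalently, the environment variable `PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE=True`; both are honored by regular writes but, per this report, not by `add_files()`.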