Support for timestamp downcasting when loading data to Iceberg tables #1045

Open

fusion2222 opened this issue Aug 12, 2024 · 16 comments

@fusion2222

Apache Iceberg version

0.7.0 (latest release)

Please describe the bug 🐞

As of release 0.7.0, pyiceberg tables support a new data-loading method, pyiceberg.table.Table.add_files(). However, this method currently does not respect the well-documented setting downcast-ns-timestamp-to-us-on-write.

When Table.add_files() is used, downcast-ns-timestamp-to-us-on-write always defaults to False, even if the config file explicitly specifies downcast-ns-timestamp-to-us-on-write: "True".

The environment variable PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE is not respected either.
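
A minimal reproduction sketch (the catalog and table names are hypothetical, and the table is assumed to have a single us-precision timestamptz column):

import os
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Enable the flag via the documented environment variable.
os.environ["PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE"] = "true"

# Hypothetical catalog and table names for illustration.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Write a parquet file whose timestamp column has 'ns' precision.
pq.write_table(
    pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns", tz="UTC"))}),
    "/tmp/events.parquet",
)

# Despite the env var above, this raises:
#   TypeError: Iceberg does not yet support 'ns' timestamp precision. ...
table.add_files(["/tmp/events.parquet"])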

@fusion2222
Author

I also have a patch ready for this; however, it seems I don't have permission to push a new branch and create a PR:

diff --git a/pyiceberg/io/pyarrow.py b/pyiceberg/io/pyarrow.py
index b99c3b1..153b8a5 100644
--- a/pyiceberg/io/pyarrow.py
+++ b/pyiceberg/io/pyarrow.py
@@ -2303,6 +2303,8 @@ def _check_pyarrow_schema_compatible(
 
 
 def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
+    from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE
+
     for file_path in file_paths:
         input_file = io.new_input(file_path)
         with input_file.open() as input_stream:
@@ -2313,7 +2315,12 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        downcast_ns_timestamp_to_us = Config().get_bool(DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE) or False
+        _check_pyarrow_schema_compatible(
+            schema,
+            parquet_metadata.schema.to_arrow_schema(),
+            downcast_ns_timestamp_to_us
+        )
 
         statistics = data_file_statistics_from_parquet_metadata(
             parquet_metadata=parquet_metadata,
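
For context on the lookup used in the patch: Config() merges the .pyiceberg.yaml file with PYICEBERG_-prefixed environment variables, so both configuration paths described in the issue resolve through the same call. A minimal sketch, assuming DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE holds the key string shown below:

from pyiceberg.utils.config import Config

# Resolves "downcast-ns-timestamp-to-us-on-write" from .pyiceberg.yaml
# or from PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE. get_bool()
# returns None when the key is unset, hence the `or False` in the patch.
downcast_ns_timestamp_to_us = Config().get_bool("downcast-ns-timestamp-to-us-on-write") or False
print(downcast_ns_timestamp_to_us)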

@sungwy
Collaborator

sungwy commented Aug 12, 2024

Hi @fusion2222, thank you for raising this issue. We'd love to see your contribution to the project!

Could you try forking the project and then creating a new branch there to create the PR?

@kevinjqliu
Contributor

Here's the setup I currently use

  • Git clone apache/iceberg-python repo (this repo)
  • Fork the repo (kevinjqliu/iceberg-python)
  • Add the forked repo as a remote named kevinjqliu (git remote add kevinjqliu git@github.com:kevinjqliu/iceberg-python.git).
  • Create a new branch (git checkout -b kevinjqliu/issue-1045)
  • Add changes and commit
  • Push the branch to the forked repo (git push kevinjqliu)
  • Come back to this repo on GitHub and open a PR

It would be good to add this to the Contributing docs :)

@tusharchou
Contributor

@kevinjqliu @sungwy can I help here? I faced the same issue while trying to write a CSV into an Iceberg table using pyiceberg/catalog/sql.py.

@tusharchou
Contributor

tusharchou commented Oct 29, 2024

import os
os.environ['PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE'] = 'true'

I think we can add this to the documentation.

@kevinjqliu
Contributor

The documentation is at https://py.iceberg.apache.org/configuration/#nanoseconds-support

Do you think there's a better place to signal this to the users?

@rotem-ad

rotem-ad commented Jan 1, 2025

I've faced the same issue when loading data using the Table.add_files method. It fails with this error:

TypeError: Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write.

As @fusion2222 already mentioned, setting PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE to 'true' does not help. It's completely ignored, since its value is simply not passed to the _check_pyarrow_schema_compatible function when add_files is used.
IMO this is a pretty serious bug, but the fix is super easy (as shown in @fusion2222's patch).
Can someone please merge the fix for this? If not, I guess I'll try to create a PR for the issue.

@kevinjqliu
Contributor

@rotem-ad I don't see an active PR for this issue. Would you like to open one? Happy to review.

@lloyd-EA

@kevinjqliu I followed your instructions on creating a PR

https://github.com/apache/iceberg-python/pull/1569

All credit to @fusion2222 for this patch.

I would love to see this issue fixed ASAP. Thanks!

@fusion2222
Author

The link mentioned by @lloyd-EA doesn't seem to work. I created another PR based on the patch from my earlier comment, as you suggested @sungwy.

#1572

The only thing holding me up was writing unit tests; for a newcomer they seem complicated.

@kevinjqliu
Contributor

Hey folks, it looks like there are currently 2 open PRs for this issue.

Let's standardize on one of them and add a test case (see my comment here)

There's currently a test showing that add_files fails with ns timestamps, so we'll need to modify this test too:

def test_add_files_with_timestamp_tz_ns_fails(session_catalog: Catalog, format_version: int, mocker: MockerFixture) -> None:
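
A rough sketch of how the assertion could flip once the flag is honored. This exercises the compatibility check that the patch routes the flag into; the schemas are made up for the example, and _check_pyarrow_schema_compatible is a private helper, so treat the exact import as an assumption:

import pyarrow as pa
from pyiceberg.io.pyarrow import _check_pyarrow_schema_compatible
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestamptzType

# Hypothetical single-column schemas: the Iceberg side stores 'us'
# precision, while the parquet file being added carries 'ns'.
iceberg_schema = Schema(NestedField(field_id=1, name="ts", field_type=TimestamptzType(), required=False))
arrow_schema = pa.schema([pa.field("ts", pa.timestamp("ns", tz="UTC"))])

# With the flag off this raises the TypeError quoted above; with the
# flag on it passes, which is what the fixed add_files path relies on.
_check_pyarrow_schema_compatible(iceberg_schema, arrow_schema, True)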

@fusion2222
Author

fusion2222 commented Jan 26, 2025

At this point I've spent about 4 hours looking at the code, trying to work out how to adjust the unit test mentioned by @kevinjqliu. But the code is complicated, and it is hard to navigate for a newcomer.

Even if I correct the exception-raising assertion, the old test still asserts on a warning message printed to stdout / stderr. The message looks like this:

Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write.

This is triggered by pyiceberg.io.pyarrow._ConvertToIceberg.primitive, which is called deep inside Table.add_files. Sadly, the code has no comments explaining why this warning is shown. Is this some sort of limitation in a third-party library, or something else? I don't really know whether it is safe to just remove the assertion for this warning, or what the best course of action is.

@sungwy
Collaborator

sungwy commented Jan 28, 2025

Hi @fusion2222 and @lloyd-EA - thank you both for jumping on the issue and contributing the PRs! I agree with @kevinjqliu here that it would be good to consolidate on a single PR before duplicating our efforts further.

I apologize that the purpose of the error message in the _ConvertToIceberg visitor isn't clear without additional context; here's my attempt at explaining it. The original PR that introduced ns downcasting only supports downcasting for write operations, because when we rewrite the data files we can set the parquet column to the timestamp logical type [1] with the desired downcasted us precision, which made it the more straightforward case to implement.

add_files, on the other hand, does not rewrite the data files. So if we support adding data files with ns precision, we aren't actually rewriting them with us precision; we are keeping the data files at ns precision. Hence we will need to focus on verifying that we can read parquet files added with ns precision when the Iceberg schema is of us precision.

The message "Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write." is raised within the _ConvertToIceberg visitor, which visits a pyarrow schema and converts it to an Iceberg schema. It should be silenced when the _downcast_ns_timestamp_to_us attribute is set to True on the visitor.

[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
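
To make the visitor behavior concrete, here is a small sketch using the schema-conversion helper. _pyarrow_to_schema_without_ids is private, so treat the exact name and signature as an assumption based on this thread:

import pyarrow as pa
from pyiceberg.io.pyarrow import _pyarrow_to_schema_without_ids

arrow_schema = pa.schema([pa.field("ts", pa.timestamp("ns", tz="UTC"))])

# With downcast disabled, the _ConvertToIceberg visitor raises the
# TypeError quoted above; with it enabled, the field converts to an
# Iceberg timestamptz field with us precision.
iceberg_schema = _pyarrow_to_schema_without_ids(arrow_schema, downcast_ns_timestamp_to_us=True)
print(iceberg_schema)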

@lloyd-EA

Thanks @sungwy! I'm working on your suggestion here. I'm just having a little trouble getting PySpark working in my local env; I'll see if I can fix it.

@fusion2222 I'm happy to merge your branch into mine and continue with my PR, or you can merge mine into yours, continue with your PR, and decline mine. I really don't mind :)

@fusion2222
Author

I will close my PR. The codebase doesn't seem newcomer-friendly, and it looks like @lloyd-EA already has some experience with the pyiceberg library.

@sungwy
Collaborator

sungwy commented Jan 29, 2025

Sorry you had that experience, @fusion2222! There is of course a lot of Iceberg-specific context in this repository, and I'm hoping we can keep working to build a library that is easy for new contributors to join.

I'm cross-posting our finding from @lloyd-EA's PR, where we discovered that Spark-Iceberg does in fact have a problem reading ns timestamp data files when the Iceberg table field type is stored as us timestamp in the schema. We will need to investigate on the Spark-Iceberg side to understand whether this is a bug or expected behavior.

This may also mean that we can only add ns timestamp files to a V3 spec table.
