Support for timestamp downcasting when loading data to Iceberg tables #1045

Open

fusion2222 opened this issue Aug 12, 2024 · 16 comments

@fusion2222

Apache Iceberg version

0.7.0 (latest release)

Please describe the bug 🐞

As of release 0.7.0, pyiceberg tables support a new data-loading method, pyiceberg.table.Table.add_files(). However, this method currently does not respect the well-documented setting downcast-ns-timestamp-to-us-on-write.

When Table.add_files() is used, downcast-ns-timestamp-to-us-on-write always defaults to False, even if the config file explicitly specifies downcast-ns-timestamp-to-us-on-write: "True".

The environment variable PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE is not respected either.
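
A minimal reproduction sketch (the catalog and table names are hypothetical, and the table is assumed to have a single us-precision timestamptz column):

import os
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Enable the flag via the documented environment variable.
os.environ["PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE"] = "true"

# Hypothetical catalog and table names for illustration.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Write a parquet file whose timestamp column has 'ns' precision.
pq.write_table(
    pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns", tz="UTC"))}),
    "/tmp/events.parquet",
)

# Despite the env var above, this raises:
#   TypeError: Iceberg does not yet support 'ns' timestamp precision. ...
table.add_files(["/tmp/events.parquet"])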

@fusion2222
Author

I also have a patch ready for this; however, it seems I don't have permission to push a new branch and create a PR:

diff --git a/pyiceberg/io/pyarrow.py b/pyiceberg/io/pyarrow.py
index b99c3b1..153b8a5 100644
--- a/pyiceberg/io/pyarrow.py
+++ b/pyiceberg/io/pyarrow.py
@@ -2303,6 +2303,8 @@ def _check_pyarrow_schema_compatible(
 
 
 def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_paths: Iterator[str]) -> Iterator[DataFile]:
+    from pyiceberg.table import DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE
+
     for file_path in file_paths:
         input_file = io.new_input(file_path)
         with input_file.open() as input_stream:
@@ -2313,7 +2315,12 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        downcast_ns_timestamp_to_us = Config().get_bool(DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE) or False
+        _check_pyarrow_schema_compatible(
+            schema,
+            parquet_metadata.schema.to_arrow_schema(),
+            downcast_ns_timestamp_to_us
+        )
 
         statistics = data_file_statistics_from_parquet_metadata(
             parquet_metadata=parquet_metadata,
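
For context on the lookup used in the patch: Config() merges the .pyiceberg.yaml file with PYICEBERG_-prefixed environment variables, so both configuration paths described in the issue resolve through the same call. A minimal sketch, assuming DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE holds the key string shown below:

from pyiceberg.utils.config import Config

# Resolves "downcast-ns-timestamp-to-us-on-write" from .pyiceberg.yaml
# or from PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE. get_bool()
# returns None when the key is unset, hence the `or False` in the patch.
downcast_ns_timestamp_to_us = Config().get_bool("downcast-ns-timestamp-to-us-on-write") or False
print(downcast_ns_timestamp_to_us)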

@sungwy
Collaborator

sungwy commented Aug 12, 2024

Hi @fusion2222, thank you for raising this issue. We'd love to see your contribution to the project!

Could you try forking the project and then creating a new branch there to create the PR?

@kevinjqliu
Contributor

Here's the setup I currently use

  • Git clone apache/iceberg-python repo (this repo)
  • Fork the repo (kevinjqliu/iceberg-python)
  • Add the forked repo as a remote named kevinjqliu (git remote add kevinjqliu git@github.com:kevinjqliu/iceberg-python.git).
  • Create a new branch (git checkout -b kevinjqliu/issue-1045)
  • Add changes and commit
  • Push the branch to the forked repo (git push kevinjqliu)
  • Come back to this repo on GitHub and open a PR

It would be good to add this to the Contributing docs :)

@tusharchou
Contributor

@kevinjqliu @sungwy can I help here? I faced the same issue while trying to write a CSV into an Iceberg table using pyiceberg/catalog/sql.py.

@tusharchou
Contributor

tusharchou commented Oct 29, 2024

import os
os.environ['PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE'] = 'true'

I think we can add this to the documentation.

@kevinjqliu
Contributor

The documentation is at https://py.iceberg.apache.org/configuration/#nanoseconds-support

Do you think there's a better place to signal this to the users?

@rotem-ad

rotem-ad commented Jan 1, 2025

I've faced the same issue when loading data using the Table.add_files method. It fails with this error:

TypeError: Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write.

As @fusion2222 already mentioned, setting PYICEBERG_DOWNCAST_NS_TIMESTAMP_TO_US_ON_WRITE to 'true' does not help. It's completely ignored, since its value is simply not passed to the _check_pyarrow_schema_compatible function when add_files is used.
IMO this is a pretty serious bug, but the fix is super easy (as shown in @fusion2222's patch).
Can someone please merge the fix for this? If not, I guess I'll try to create a PR for the issue.

@kevinjqliu
Contributor

@rotem-ad I don't see an active PR for this issue. Would you like to open one? Happy to review.

@lloyd-EA

@kevinjqliu I followed your instructions on creating a PR

https://github.com/apache/iceberg-python/pull/1569

All credit to @fusion2222 for this patch.

I would love to see this issue fixed ASAP. Thanks!

@fusion2222
Author

The link mentioned by @lloyd-EA doesn't seem to work. I created another PR based on the patch from my earlier comment, as you suggested @sungwy.

#1572

The only thing holding me up was writing unit tests; for a newcomer they seem complicated.

@kevinjqliu
Contributor

Hey folks, it looks like there are currently 2 open PRs for this issue.

Let's standardize on one of them and add a test case (see my comment here)

There's currently a test showing that add_files fails with ns timestamps, so we'll need to modify this test too:

def test_add_files_with_timestamp_tz_ns_fails(session_catalog: Catalog, format_version: int, mocker: MockerFixture) -> None:
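
A rough sketch of how the assertion could flip once the flag is honored. This exercises the compatibility check that the patch routes the flag into; the schemas are made up for the example, and _check_pyarrow_schema_compatible is a private helper, so treat the exact import as an assumption:

import pyarrow as pa
from pyiceberg.io.pyarrow import _check_pyarrow_schema_compatible
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestamptzType

# Hypothetical single-column schemas: the Iceberg side stores 'us'
# precision, while the parquet file being added carries 'ns'.
iceberg_schema = Schema(NestedField(field_id=1, name="ts", field_type=TimestamptzType(), required=False))
arrow_schema = pa.schema([pa.field("ts", pa.timestamp("ns", tz="UTC"))])

# With the flag off this raises the TypeError quoted above; with the
# flag on it passes, which is what the fixed add_files path relies on.
_check_pyarrow_schema_compatible(iceberg_schema, arrow_schema, True)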

@fusion2222
Author

fusion2222 commented Jan 26, 2025

At this point I've spent about 4 hours looking at the code, trying to work out how to adjust the unit test mentioned by @kevinjqliu. But the code is complicated, and it is hard to navigate for a newcomer.

Even if I correct the exception-raising assertion, the old test still asserts on a warning message printed to stdout / stderr. The message looks like this:

Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write.

This is triggered by pyiceberg.io.pyarrow._ConvertToIceberg.primitive, which is called deep inside Table.add_files. Sadly, the code has no comments explaining why this warning is shown. Is this some sort of limitation in a third-party library, or something else? I don't really know whether it is safe to just remove the assertion for this warning, or what the best course of action is.

@sungwy
Collaborator

sungwy commented Jan 28, 2025

Hi @fusion2222 and @lloyd-EA - thank you both for jumping on the issue and contributing the PRs! I agree with @kevinjqliu here that it would be good to consolidate on a single PR before duplicating our efforts further.

I apologize that the purpose of the error message in the _ConvertToIceberg visitor isn't clear without additional context; here's my attempt at explaining it. The original PR that introduced ns downcasting only supports downcasting for write operations, because when we rewrite the data files we can set the parquet column to the timestamp logical type [1] with the desired downcasted us precision, which made it the more straightforward case to implement.

add_files, on the other hand, does not rewrite the data files. So if we support adding data files with ns precision, we aren't actually rewriting them with us precision; we are keeping the data files at ns precision. Hence we will need to focus on verifying that we can read parquet files added with ns precision when the Iceberg schema is of us precision.

The message "Iceberg does not yet support 'ns' timestamp precision. Use 'downcast-ns-timestamp-to-us-on-write' configuration property to automatically downcast 'ns' to 'us' on write." is raised within the _ConvertToIceberg visitor, which visits a pyarrow schema and converts it to an Iceberg schema. It should be silenced when the _downcast_ns_timestamp_to_us attribute is set to True on the visitor.

[1] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
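
To make the visitor behavior concrete, here is a small sketch using the schema-conversion helper. _pyarrow_to_schema_without_ids is private, so treat the exact name and signature as an assumption based on this thread:

import pyarrow as pa
from pyiceberg.io.pyarrow import _pyarrow_to_schema_without_ids

arrow_schema = pa.schema([pa.field("ts", pa.timestamp("ns", tz="UTC"))])

# With downcast disabled, the _ConvertToIceberg visitor raises the
# TypeError quoted above; with it enabled, the field converts to an
# Iceberg timestamptz field with us precision.
iceberg_schema = _pyarrow_to_schema_without_ids(arrow_schema, downcast_ns_timestamp_to_us=True)
print(iceberg_schema)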

@lloyd-EA

Thanks @sungwy! I'm working on your suggestion here. I'm just having a little trouble getting PySpark working in my local env; I'll see if I can fix it.

@fusion2222 I'm happy to merge your branch into mine and continue with my PR, or you can merge mine into yours, continue with your PR, and decline mine. I really don't mind :)

@fusion2222
Author

I will close my PR. The codebase doesn't seem newcomer-friendly, and it looks like @lloyd-EA already has some experience with the pyiceberg library.

@sungwy
Collaborator

sungwy commented Jan 29, 2025

Sorry you had that experience, @fusion2222! There is of course a lot of Iceberg-specific context in this repository, and I'm hoping we can keep working to build a library that is easy for new contributors to join.

I'm cross-posting our finding from @lloyd-EA's PR, where we discovered that Spark-Iceberg does in fact have a problem reading ns timestamp data files when the Iceberg table field type is stored as us timestamp in the schema. We will need to investigate on the Spark-Iceberg side to understand whether this is a bug or expected behavior.

This may also mean that we can only add ns timestamp files to a V3 spec table.
