Remove parsing table name in `row_filter` #1689

geruh · 2025-02-20T01:50:46Z

This PR removes one of the three deprecated items that were scheduled for removal in the 0.9.0 release.

All items marked for removal:

- Table name reference in scan expression
  iceberg-python/pyiceberg/expressions/parser.py, line 95 in efc8b5a: `removed_in="0.9.0",`
- REST catalog client AUTH_URL (REST: Remove deprecated AUTH_URL #1691)
  iceberg-python/pyiceberg/catalog/rest.py, line 324 in efc8b5a: `removed_in="0.9.0",`
- botocore session (AWS: Undeprecate botocore_session #1692)
  iceberg-python/pyiceberg/catalog/__init__.py, line 790 in efc8b5a: `removed_in="0.9.0",`

Currently there are three items marked for removal in this release. However, based on the ongoing discussion, it appears that the other two items have not yet been replaced with a proper solution. As a result, this PR only addresses the removal of the table name reference in scan expressions while we await further resolution on the others.

cc: @Fokko @kevinjqliu @HonahX

Fokko · 2025-02-20T10:12:31Z

pyiceberg/expressions/parser.py

-        )
-    # TODO: Once this is removed, we will no longer take just the last index of parsed column result
-    # And introduce support for parsing filter expressions with nested fields.
+        raise ValueError(f"Cannot parse expressions with table names or nested fields, got: {".".join(result.column)}")
    return Reference(result.column[-1])


Based on the comment, I think this is what we want:

Suggested change

-    return Reference(result.column[-1])
+    return Reference('.'.join(result.column))

We should not raise a ValueError, and allow for nested fields:

    def test_nested_fields() -> None:
        assert LessThan("location.x", DecimalLiteral(Decimal(52.00))) == parser.parse("location.x < 52.00")
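To make the difference between the two `return` lines concrete, here is a minimal sketch in plain Python. It assumes the parser hands back the column as a list of dot-separated parts, as `result.column` appears to in the diff. The helper names `reference_name_old` and `reference_name_new` are hypothetical and not part of PyIceberg.

```python
from typing import Sequence


def reference_name_old(column: Sequence[str]) -> str:
    """Previous behaviour: keep only the last part, dropping any prefix."""
    return column[-1]


def reference_name_new(column: Sequence[str]) -> str:
    """Suggested behaviour: keep the full dotted path."""
    return ".".join(column)


# Assumption: "location.x" is parsed into these two parts.
parts = ["location", "x"]
old_name = reference_name_old(parts)
new_name = reference_name_new(parts)
```

With the old behaviour the `location` prefix is silently dropped, so the reference points at `x`. With the join, the full path `location.x` is preserved for binding against the schema.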

Looping in the expert @sungwy

sungwy · 2025-02-20

Hello!

There are two missing features in PyIceberg right now with expression parsing:

- Support for parsing field names with "." delimiters
- Support for parsing nested fields

I think @Fokko's suggested change will make it work for both of those cases, and I think there will be value in testing the Reference binding for both cases in tests/expressions/test_expressions.py.

I can think of three test cases:

    import pytest

    from pyiceberg.expressions import BoundIsNull, BoundReference, IsNull, Reference
    from pyiceberg.schema import Schema
    from pyiceberg.types import NestedField, StringType, StructType


    def test_nested_bind() -> None:
        # "name.first" resolves to the field nested inside the "name" struct
        schema = Schema(NestedField(2, "name", StructType(NestedField(3, "first", StringType()))), schema_id=1)
        bound = BoundIsNull(BoundReference(schema.find_field(3), schema.accessor_for_field(3)))
        assert IsNull(Reference("name.first")).bind(schema) == bound


    def test_bind_dot_name() -> None:
        # "name.first" resolves to a top-level field whose name contains a dot
        schema = Schema(NestedField(2, "name.first", StringType()), schema_id=1)
        bound = BoundIsNull(BoundReference(schema.find_field(2), schema.accessor_for_field(2)))
        assert IsNull(Reference("name.first")).bind(schema) == bound


    def test_bind_ambiguous_name() -> None:
        # A nested field and a dotted top-level field both map to "name.first"
        with pytest.raises(ValueError) as exc_info:
            schema = Schema(NestedField(2, "name", StructType(NestedField(3, "first", StringType()))), NestedField(4, "name.first", StringType()), schema_id=1)
        # Checked outside the `with` block so the assertion actually runs
        assert "Invalid schema, multiple fields for name name.first: 3 and 4" in str(exc_info.value)
        ## bound = BoundIsNull(BoundReference(schema.find_field(3), schema.accessor_for_field(3)))
        ## assert IsNull(Reference("name.first")).bind(schema) == bound

Where the last one should throw an error that's propagated from the Schema having ambiguous names.

I had originally thought that this work would be a lot more complicated because we would want to distinguish nested fields from simple field names with "." delimiters. But I think we can just rely on PyIceberg not supporting ambiguous field names for now as we introduce nested parsing support.
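The reason ambiguity surfaces at the schema level can be illustrated with a toy name index. This is a rough sketch only: `ToyField` and `index_by_name` below are hypothetical stand-ins written for this illustration, not PyIceberg's `NestedField` or its actual indexing code. The error message simply mirrors the one quoted in the test above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToyField:
    """Hypothetical stand-in for a schema field; not PyIceberg's NestedField."""

    field_id: int
    name: str
    children: Tuple["ToyField", ...] = ()


def index_by_name(fields: Tuple[ToyField, ...], prefix: str = "") -> Dict[str, int]:
    """Map every full dotted name to its field ID, rejecting duplicates."""
    index: Dict[str, int] = {}

    def add(name: str, field_id: int) -> None:
        if name in index:
            raise ValueError(
                f"Invalid schema, multiple fields for name {name}: {index[name]} and {field_id}"
            )
        index[name] = field_id

    for field in fields:
        full_name = f"{prefix}{field.name}"
        add(full_name, field.field_id)
        # Children are indexed under "<parent>.<child>".
        for child_name, child_id in index_by_name(field.children, f"{full_name}.").items():
            add(child_name, child_id)
    return index


nested = (ToyField(2, "name", (ToyField(3, "first"),)),)
dotted = (ToyField(2, "name.first"),)
ambiguous = nested + (ToyField(4, "name.first"),)
```

In this toy model, `nested` and `dotted` each index `name.first` to exactly one field ID, so a joined reference resolves cleanly in both. `ambiguous` raises a `ValueError` when indexed, because the same dotted name would point at two different IDs.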

geruh · 2025-02-20

Cool, I did see that @sungwy had a PR open for this (#965), and there were some concerns with this approach, particularly this comment: #965 (comment), unless I'm missing something. Let me catch up here and see if I can add this.

Alright, I made the changes to include the joining logic for the column references. As long as there's parity between a bound schema and the expression, I don’t see any issues.

One thing I noticed is that if we have a field whose name contains a dot, like `foo.bar`, and a field `baz` nested inside it, the reference is treated as `foo.bar.baz`, and the expression respects that.

Additionally, it's worth noting that the parser currently doesn't allow a quoted identifier to represent a single field containing a dot, such as `"foo.bar".baz`, due to the expectations set here: https://github.com/apache/iceberg-python/blob/main/pyiceberg/expressions/parser.py#L82

kevinjqliu · 2025-02-20

Thanks for making the change!

> Additionally, it's worth noting that the parser currently doesn’t allow a quoted identifier to represent a single field containing a dot, such as "foo.bar".baz.

I think it's worth adding a test for this to show that it's currently not supported.

Fokko · 2025-02-21

The biggest thing I want to unlock here is nested fields.

> But I think we can just rely on PyIceberg not supporting ambiguous field names for now as we introduce nested parsing support

I think the probability of a collision between a nested field and a field with a dot in its name is low. If this becomes a problem, we could always extend (in a separate PR) the Reference class to allow passing in a tuple to make it explicit.

Before doing this, we would also need to establish a way to express this in the SQL-like syntax:

SELECT location.x, location.`field.with.dots` FROM table

As an example from Databricks: https://docs.databricks.com/aws/en/sql/language-manual/functions/dotsign
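As a rough illustration of what such an explicit syntax could look like, the sketch below splits a dotted path into a tuple of segments, treating backtick-quoted segments as single names. This is an assumption-laden sketch: `split_field_path` is a hypothetical helper, the backtick convention is borrowed from the SQL example above, and none of this reflects PyIceberg's actual parser or `Reference` class.

```python
import re
from typing import Tuple

# One path segment: either a backtick-quoted name (may contain dots)
# or a bare name without dots or backticks.
_SEGMENT = re.compile(r"`([^`]+)`|([^.`]+)")


def split_field_path(path: str) -> Tuple[str, ...]:
    """Split a dotted path into segments, honouring backtick quoting."""
    segments = []
    pos = 0
    while True:
        match = _SEGMENT.match(path, pos)
        if match is None:
            raise ValueError(f"Cannot parse field path: {path!r}")
        # Exactly one of the two groups is set for a successful match.
        segments.append(match.group(1) or match.group(2))
        pos = match.end()
        if pos == len(path):
            return tuple(segments)
        if path[pos] != ".":
            raise ValueError(f"Cannot parse field path: {path!r}")
        pos += 1  # skip the "." separator


plain = split_field_path("location.x")
quoted = split_field_path("location.`field.with.dots`")
```

A tuple such as `("location", "field.with.dots")` keeps the struct boundary explicit, which a single joined string cannot do. The trade-off is that the expression syntax and the reference type would both need to carry that extra structure.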

Fokko · 2025-02-20T10:12:50Z

Good one @geruh, thanks for picking this up!

kevinjqliu

LGTM! Thanks for adding the tests

kevinjqliu · 2025-02-21T15:56:42Z

Thanks @geruh for the contribution and @sungwy @Fokko for the review :)
Let's release 0.9.0!

geruh added 2 commits February 19, 2025 17:35

Deprecate for 0.9.0 release

3573cfb

Deprecate for 0.9.0 release

2fdd981

Fokko reviewed Feb 20, 2025

Fokko added this to the PyIceberg 0.9.0 release milestone Feb 20, 2025

geruh added 3 commits February 20, 2025 13:28

Joining column ref's on nested columns

a7f6bb0

Add another test with nested and dot name

7bb4e7d

Add test for parser with quoted dot column

ab3fc86

kevinjqliu approved these changes Feb 21, 2025

kevinjqliu requested review from sungwy and Fokko February 21, 2025 02:02

Fokko approved these changes Feb 21, 2025

sungwy approved these changes Feb 21, 2025

kevinjqliu changed the title ~~Deprecate for 0.9.0 release~~ Remove parsing table name in row_filter Feb 21, 2025

kevinjqliu merged commit 948486e into apache:main Feb 21, 2025
7 checks passed
