Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove parsing table name in row_filter #1689

Merged
merged 5 commits into from
Feb 21, 2025
Merged

Conversation

geruh
Copy link
Contributor

@geruh geruh commented Feb 20, 2025

This PR deprecates one of the three items that were planned for the 0.9.0 release.

All items marked for removal:

Currently there are three items marked for release. However, based on the ongoing discussion, it appears that the other two items. have not yet been replaced with a proper solution. As a result, this PR only addresses the deprecation of Table name reference in scan expression while we await further resolution on the others.

cc: @Fokko @kevinjqliu @HonahX

)
# TODO: Once this is removed, we will no longer take just the last index of parsed column result
# And introduce support for parsing filter expressions with nested fields.
raise ValueError(f"Cannot parse expressions with table names or nested fields, got: {".".join(result.column)}")
return Reference(result.column[-1])
Copy link
Contributor

@Fokko Fokko Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the comment, I think this is what we want:

Suggested change
return Reference(result.column[-1])
return Reference('.'.join(result.column))

We should not raise a ValueError, and allow for nested fields:

def test_nested_fields() -> None:
    assert LessThan("location.x", DecimalLiteral(Decimal(52.00))) == parser.parse("location.x < 52.00")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looping in the expert @sungwy

Copy link
Collaborator

@sungwy sungwy Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello!

There's two missing features in PyIceberg right now with expression parsing:

  • Support for parsing field names with "." delimiters
  • Support for parsing nested fields

I think @Fokko 's suggested change will make it work for both those cases, and I think there will be value in testing the Reference binding for both cases in tests/expressions/test_expressions.py.

I can think of three test cases:

def test_nested_bind() -> None:
    schema = Schema(NestedField(2, "name", StructType(NestedField(3, "first", StringType()))), schema_id=1)
    bound = BoundIsNull(BoundReference(schema.find_field(3), schema.accessor_for_field(3)))
    assert IsNull(Reference("name.first")).bind(schema) == bound


def test_bind_dot_name() -> None:
    schema = Schema(NestedField(2, "name.first", StringType()), schema_id=1)
    bound = BoundIsNull(BoundReference(schema.find_field(2), schema.accessor_for_field(2)))
    assert IsNull(Reference("name.first")).bind(schema) == bound


def test_bind_ambiguous_name() -> None:
    with pytest.raises(ValueError) as exc_info:
        schema = Schema(NestedField(2, "name", StructType(NestedField(3, "first", StringType()))), NestedField(4, "name.first", StringType()), schema_id=1)
    assert "Invalid schema, multiple fields for name name.first: 3 and 4" in str(exc_info)
    ## bound = BoundIsNull(BoundReference(schema.find_field(3), schema.accessor_for_field(3)))
    ## assert IsNull(Reference("name.first")).bind(schema) == bound

Where the last one should throw an error that's propagated from the Schema having ambiguous names.

I had originally thought that this work would be a lot more complicated because we would want to distinguish the nested fields from simple field names with "." delimiters. But I think we can just rely on PyIceberg not supporting ambiguous field names for now as we introduce nested parsing support

Copy link
Contributor Author

@geruh geruh Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool I did see @sungwy had a PR open for this #965, and there were some concerns with this approach. Particularity this comment #965 (comment) unless I'm missing something. Let me catch up here and see if I can add this.

Copy link
Contributor Author

@geruh geruh Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I made the changes to include the joining logic for the column references. As long as there's parity between a bound schema and the expression, I don’t see any issues.

One thing I noticed is that if we have a nested field with a dot, like foo.bar, and a field nested inside it as baz, it's treated as foo.bar.baz, and the expression respects that.

Additionally, it's worth noting that the parser currently doesn’t allow a quoted identifier to represent a single field containing a dot, such as "foo.bar".baz. Due to the expectations set here: https://github.com/apache/iceberg-python/blob/main/pyiceberg/expressions/parser.py#L82

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for making the change!

Additionally, it's worth noting that the parser currently doesn’t allow a quoted identifier to represent a single field containing a dot, such as "foo.bar".baz.

I think its worth adding a test for this and show that its currently not supported

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest one I want to unlock here are nested fields.

But I think we can just rely on PyIceberg not supporting ambiguous field names for now as we introduce nested parsing support

I think the probability of having a collation between a nested field, and a field with a dot is low. If this becomes a problem, we could always extend (in a separate PR) the Reference class to allow passing in a tuple to make it explicit.

Before doing this, we would also need to establish a way to express this in the SQL-like syntax:

SELECT 
    location.x, 
    location.`field.with.dots`
FROM table

As an example from Databricks: https://docs.databricks.com/aws/en/sql/language-manual/functions/dotsign

@Fokko
Copy link
Contributor

Fokko commented Feb 20, 2025

Good one @geruh, thanks for picking this up!

@Fokko Fokko added this to the PyIceberg 0.9.0 release milestone Feb 20, 2025
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding the tests

@kevinjqliu kevinjqliu requested review from sungwy and Fokko February 21, 2025 02:02
@kevinjqliu kevinjqliu changed the title Deprecate for 0.9.0 release Remove parsing table name in row_filter Feb 21, 2025
@kevinjqliu kevinjqliu merged commit 948486e into apache:main Feb 21, 2025
7 checks passed
@kevinjqliu
Copy link
Contributor

Thanks @geruh for the contribution and @sungwy @Fokko for the review :)
Let's release 0.9.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants