-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove parsing table name in row_filter
#1689
Conversation
pyiceberg/expressions/parser.py
Outdated
) | ||
# TODO: Once this is removed, we will no longer take just the last index of parsed column result | ||
# And introduce support for parsing filter expressions with nested fields. | ||
raise ValueError(f"Cannot parse expressions with table names or nested fields, got: {".".join(result.column)}") | ||
return Reference(result.column[-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the comment, I think this is what we want:
return Reference(result.column[-1]) | |
return Reference('.'.join(result.column)) |
We should not raise a ValueError
, and allow for nested fields:
def test_nested_fields() -> None:
assert LessThan("location.x", DecimalLiteral(Decimal(52.00))) == parser.parse("location.x < 52.00")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looping in the expert @sungwy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello!
There's two missing features in PyIceberg right now with expression parsing:
- Support for parsing field names with "." delimiters
- Support for parsing nested fields
I think @Fokko 's suggested change will make it work for both those cases, and I think there will be value in testing the Reference binding for both cases in tests/expressions/test_expressions.py
.
I can think of three test cases:
def test_nested_bind() -> None:
schema = Schema(NestedField(2, "name", StructType(NestedField(3, "first", StringType()))), schema_id=1)
bound = BoundIsNull(BoundReference(schema.find_field(3), schema.accessor_for_field(3)))
assert IsNull(Reference("name.first")).bind(schema) == bound
def test_bind_dot_name() -> None:
schema = Schema(NestedField(2, "name.first", StringType()), schema_id=1)
bound = BoundIsNull(BoundReference(schema.find_field(2), schema.accessor_for_field(2)))
assert IsNull(Reference("name.first")).bind(schema) == bound
def test_bind_ambiguous_name() -> None:
with pytest.raises(ValueError) as exc_info:
schema = Schema(NestedField(2, "name", StructType(NestedField(3, "first", StringType()))), NestedField(4, "name.first", StringType()), schema_id=1)
assert "Invalid schema, multiple fields for name name.first: 3 and 4" in str(exc_info)
## bound = BoundIsNull(BoundReference(schema.find_field(3), schema.accessor_for_field(3)))
## assert IsNull(Reference("name.first")).bind(schema) == bound
Where the last one should throw an error that's propagated from the Schema having ambiguous names.
I had originally thought that this work would be a lot more complicated because we would want to distinguish the nested fields from simple field names with "." delimiters. But I think we can just rely on PyIceberg not supporting ambiguous field names for now as we introduce nested parsing support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool I did see @sungwy had a PR open for this #965, and there were some concerns with this approach. Particularity this comment #965 (comment) unless I'm missing something. Let me catch up here and see if I can add this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I made the changes to include the joining logic for the column references. As long as there's parity between a bound schema and the expression, I don’t see any issues.
One thing I noticed is that if we have a nested field with a dot, like foo.bar,
and a field nested inside it as baz
, it's treated as foo.bar.baz
, and the expression respects that.
Additionally, it's worth noting that the parser currently doesn’t allow a quoted identifier to represent a single field containing a dot, such as "foo.bar".baz
. Due to the expectations set here: https://github.com/apache/iceberg-python/blob/main/pyiceberg/expressions/parser.py#L82
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for making the change!
Additionally, it's worth noting that the parser currently doesn’t allow a quoted identifier to represent a single field containing a dot, such as "foo.bar".baz.
I think its worth adding a test for this and show that its currently not supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The biggest one I want to unlock here are nested fields.
But I think we can just rely on PyIceberg not supporting ambiguous field names for now as we introduce nested parsing support
I think the probability of having a collation between a nested field, and a field with a dot is low. If this becomes a problem, we could always extend (in a separate PR) the Reference
class to allow passing in a tuple to make it explicit.
Before doing this, we would also need to establish a way to express this in the SQL-like syntax:
SELECT
location.x,
location.`field.with.dots`
FROM table
As an example from Databricks: https://docs.databricks.com/aws/en/sql/language-manual/functions/dotsign
Good one @geruh, thanks for picking this up! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for adding the tests
row_filter
This PR deprecates one of the three items that were planned for the 0.9.0 release.
All items marked for removal:
Table name reference in scan expression
iceberg-python/pyiceberg/expressions/parser.py
Line 95 in efc8b5a
REST catalog client AUTH_URL (REST: Remove deprecated
AUTH_URL
#1691)iceberg-python/pyiceberg/catalog/rest.py
Line 324 in efc8b5a
botocore session (AWS: Undeprecate
botocore_session
#1692)iceberg-python/pyiceberg/catalog/__init__.py
Line 790 in efc8b5a
Currently there are three items marked for release. However, based on the ongoing discussion, it appears that the other two items. have not yet been replaced with a proper solution. As a result, this PR only addresses the deprecation of
Table name reference in scan expression
while we await further resolution on the others.cc: @Fokko @kevinjqliu @HonahX