-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix join type coercion when joining 2 relations with the same name via DataFrame
API
#14387
Conversation
let ctx = SessionContext::new(); | ||
|
||
// Test that join will coerce column types when necessary | ||
// even when the relations don't have unique names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: when the columns don't have unique names, is it safe to assume that the unqualified columns in left_cols
/right_cols
refer to the left/right relations, respectively? Since technically a valid join predicate is for example, t1.id = t1.id
.
Seems like it would make sense to me most of the time! But there is some level of implicitness here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when the columns don't have unique names, is it safe to assume that the unqualified columns in left_cols/right_cols refer to the left/right relations, respectively?
Yes, i think so: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.join
Though perhaps we could make that explicit in the docs 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would be a good idea! 🙌🏾
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance you are willing to make a PR? I would be happy to review it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Add a small doc update here #14393
@@ -190,7 +190,15 @@ impl<'a> TypeCoercionRewriter<'a> { | |||
.map(|(lhs, rhs)| { | |||
// coerce the arguments as though they were a single binary equality | |||
// expression | |||
let (lhs, rhs) = self.coerce_binary_op(lhs, Operator::Eq, rhs)?; | |||
let left_schema = join.left.schema(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Join output schema is the left schema concatenated with the right schema
Normally, since the lhs
and rhs
have fully qualified names (t1.a
, t2.a
) resolving their types in terms of the output schema is not a problem
However, when the relations on both sides have the same relation name but different types (which can happen with DataFrames) it is important to resolve the types in terms of the left/right schema
let ctx = SessionContext::new(); | ||
|
||
// Test that join will coerce column types when necessary | ||
// even when the relations don't have unique names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when the columns don't have unique names, is it safe to assume that the unqualified columns in left_cols/right_cols refer to the left/right relations, respectively?
Yes, i think so: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.join
Though perhaps we could make that explicit in the docs 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
what are "anonymous types"? |
Update: i see it is the title -- I will fix |
DataFrame
API
@@ -1121,6 +1121,39 @@ async fn join() -> Result<()> { | |||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn join_coercion_unnnamed() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should "unnnamed" be corrected to "unnamed"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you. That is a typo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
async fn join_coercion_unnnamed() -> Result<()> { | |
async fn join_coercion_unnamed() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM👍
Thanks for the review @jonahgao |
Which issue does this PR close?
Invalid comparison operation: Utf8 == Utf8View
error during LEFT ANTI JOIN #13510Rationale for this change
Fix a bug
What changes are included in this PR?
Resolve join predicates using the correct schema
Are these changes tested?
Yes, a new unit test is added
Are there any user-facing changes?
Bug fix