Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix join type coercion when joining 2 relations with the same name via DataFrame API #14387

Merged
merged 1 commit into from
Feb 1, 2025

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 31, 2025

Which issue does this PR close?

Rationale for this change

Fix a bug

What changes are included in this PR?

Resolve join predicates using the correct schema

Are these changes tested?

Yes, a new unit test is added

Are there any user-facing changes?

Bug fix

@alamb alamb changed the title Fix join type coercion Fix join type coercion with anonymous types Jan 31, 2025
@alamb alamb marked this pull request as ready for review January 31, 2025 16:50
let ctx = SessionContext::new();

// Test that join will coerce column types when necessary
// even when the relations don't have unique names
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: when the columns don't have unique names, is it safe to assume that the unqualified columns in left_cols/right_cols refer to the left/right relations, respectively? Since technically a valid join predicate is for example, t1.id = t1.id.

Seems like it would make sense to me most of the time! But there is some level of implicitness here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the columns don't have unique names, is it safe to assume that the unqualified columns in left_cols/right_cols refer to the left/right relations, respectively?

Yes, i think so: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.join

Though perhaps we could make that explicit in the docs 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a good idea! 🙌🏾

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you are willing to make a PR? I would be happy to review it :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Add a small doc update here #14393

@@ -190,7 +190,15 @@ impl<'a> TypeCoercionRewriter<'a> {
.map(|(lhs, rhs)| {
// coerce the arguments as though they were a single binary equality
// expression
let (lhs, rhs) = self.coerce_binary_op(lhs, Operator::Eq, rhs)?;
let left_schema = join.left.schema();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Join output schema is the left schema concatenated with the right schema

Normally, since the lhs and rhs have fully qualified names (t1.a, t2.a) resolving their types in terms of the output schema is not a problem

However, when the relations on both sides have the same relation name but different types (which can happen with DataFrames) it is important to resolve the types in terms of the left/right schema

let ctx = SessionContext::new();

// Test that join will coerce column types when necessary
// even when the relations don't have unique names
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the columns don't have unique names, is it safe to assume that the unqualified columns in left_cols/right_cols refer to the left/right relations, respectively?

Yes, i think so: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.join

Though perhaps we could make that explicit in the docs 🤔

Copy link
Contributor

@rkrishn7 rkrishn7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@findepi
Copy link
Member

findepi commented Jan 31, 2025

what are "anonymous types"?

@alamb
Copy link
Contributor Author

alamb commented Jan 31, 2025

what are "anonymous types"?

I am not sure I understand what you are referring to

Update: i see it is the title -- I will fix

@alamb alamb changed the title Fix join type coercion with anonymous types Fix join type coercion when joining 2 relations with the same name via DataFrame API Jan 31, 2025
@@ -1121,6 +1121,39 @@ async fn join() -> Result<()> {
Ok(())
}

#[tokio::test]
async fn join_coercion_unnnamed() -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "unnnamed" be corrected to "unnamed"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you. That is a typo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
async fn join_coercion_unnnamed() -> Result<()> {
async fn join_coercion_unnamed() -> Result<()> {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM👍

@alamb alamb merged commit e9c35a6 into apache:main Feb 1, 2025
26 of 27 checks passed
@alamb
Copy link
Contributor Author

alamb commented Feb 1, 2025

Thanks for the review @jonahgao

@alamb alamb deleted the alamb/join_type_coercion branch February 1, 2025 17:14
cj-zhukov pushed a commit to cj-zhukov/datafusion that referenced this pull request Feb 3, 2025
alamb added a commit to alamb/datafusion that referenced this pull request Feb 3, 2025
alamb added a commit that referenced this pull request Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: Invalid comparison operation: Utf8 == Utf8View error during LEFT ANTI JOIN
4 participants