Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50694][SQL] Support withColumns / withColumnsRenamed in subqueries #49386

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

ueshin
Copy link
Member

@ueshin ueshin commented Jan 6, 2025

What changes were proposed in this pull request?

Supports withColumns / withColumnsRenamed in subqueries.

Why are the changes needed?

When the query is used as a subquery by adding col.outer(), withColumns or withColumnsRenamed doesn't work because they need analyzed plans.

Does this PR introduce any user-facing change?

Yes, those APIs are available in subqueries.

How was this patch tested?

Added the related tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon
Copy link
Member

cc @cloud-fan too

newChild: LogicalPlan): UnresolvedWithColumns = copy(child = newChild)
}

case class UnresolvedWithColumnsRenamed(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need new logical plans. See how we implement a similar feature for SQL pipe: 68be1da

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can reuse Project with UnresolvedStarExceptOrReplace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, df.withColumn can append or replace columns, we should probably extend UnresolvedStarExceptOrReplace for this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@ueshin ueshin Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be some semantic differences even with withColumnsRenamed:

  • If the specified column names are missing in the df, df.withColumnsRenamed ignores whereas UnresolvedStarExceptOrReplace throws an exception.
  • df.withColumnsRenamed respects the argument order, e.g.,
    test("SPARK-46260: withColumnsRenamed should respect the Map ordering") {
      val df = spark.range(10).toDF()
      assert(df.withColumnsRenamed(ListMap("id" -> "a", "a" -> "b")).columns === Array("b"))
      assert(df.withColumnsRenamed(ListMap("a" -> "b", "id" -> "a")).columns === Array("a"))
    }
    
    whereas UnresolvedStarExceptOrReplace throws an exception.

I guess we should keep the new plans to be safer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. One question is shall we add new logical plans or new star-like expressions? star-like expressions are more flexible as they work for function input as well (e.g. struct(*)). And we might be able to unify them with the SQL pipe one with extra flags.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to the star-like expressions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants