
[SPARK-50762][SQL] Add Analyzer rule for resolving SQL scalar UDFs #49414

Conversation

@allisonwang-db allisonwang-db commented Jan 8, 2025

What changes were proposed in this pull request?

This PR adds a new Analyzer rule, ResolveSQLFunctions, that resolves scalar SQL UDFs by replacing each SQLFunctionExpression with the actual function body. It currently supports the following operators: Project, Filter, Join, and Aggregate.

For example:

CREATE FUNCTION area(width DOUBLE, height DOUBLE) RETURNS DOUBLE
RETURN width * height;

and this query

SELECT area(a, b) FROM t;

will be resolved as

Project [area(width, height) AS area]
  +- Project [a, b, CAST(a AS DOUBLE) AS width, CAST(b AS DOUBLE) AS height]
    +- Relation [a, b]
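The shape of this rewrite can be sketched with a tiny, self-contained expression tree (plain Scala; `Attr`, `Cast`, `Mul`, `Alias`, and `inlineArea` are illustrative stand-ins, not the actual Catalyst classes):

```scala
// Minimal stand-ins for Catalyst expressions (illustrative only).
sealed trait Expr
case class Attr(name: String) extends Expr
case class Cast(child: Expr, to: String) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr

// Inline `area(a, b)`: cast the actual arguments to the declared parameter
// types in a bottom projection, then evaluate the function body over those
// aliased columns in the projection above.
def inlineArea(a: Expr, b: Expr): (Seq[Alias], Alias) = {
  val bottom = Seq(
    Alias(Cast(a, "DOUBLE"), "width"),
    Alias(Cast(b, "DOUBLE"), "height"))
  val top = Alias(Mul(Attr("width"), Attr("height")), "area")
  (bottom, top)
}
```

This mirrors the resolved plan above: the bottom Project carries the casts of `a` and `b` to the declared parameter types, and the top Project computes the body over `width` and `height`.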

Why are the changes needed?

To support SQL UDFs.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New SQL query tests. More tests will be added once table function resolution is supported.

Was this patch authored or co-authored using generative AI tooling?

No

if !f.resolved || AggregateExpression.containsAggregate(cond) ||
ResolveGroupingAnalytics.hasGroupingFunction(cond) ||
cond.containsPattern(TEMP_RESOLVED_COLUMN) =>
// If the filter's condition contains aggregate expressions or grouping functions or temp
Contributor

Suggested change
// If the filter's condition contains aggregate expressions or grouping functions or temp
// If the filter's condition contains aggregate expressions or grouping expressions or temp

val topProject = if (topProjectList.nonEmpty) Project(topProjectList, newAgg) else newAgg
topProject
Contributor

Suggested change
val topProject = if (topProjectList.nonEmpty) Project(topProjectList, newAgg) else newAgg
topProject
if (topProjectList.nonEmpty) Project(topProjectList, newAgg) else newAgg

* Example (aggregate):
* Before: foo(c1) + foo(max(c2)) + max(foo(c2))
* After: foo(c1) + foo(max_c2) + max_foo_c2
* Extracted expressions: [c1, max(c2) AS max_c2, max(foo(c2)) AS max_foo_c2]
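The extraction described in this doc comment can be sketched with a small self-contained model (plain Scala; `Max` stands in for an aggregate function, `Foo` for a SQL UDF call, and `extractAndRewrite` with `_aggN` alias names is an illustrative simplification of the real helper):

```scala
import scala.collection.mutable.ArrayBuffer

// Minimal stand-ins for Catalyst expressions (illustrative only).
sealed trait Expr
case class Attr(name: String) extends Expr
case class Max(child: Expr) extends Expr             // an aggregate function
case class Foo(child: Expr) extends Expr             // a SQL UDF call
case class Add(left: Expr, right: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr

// Replace each outermost aggregate function with a reference to an extracted,
// aliased expression collected in `aggExprs`; the rewritten expression then
// lives in a Project above the Aggregate.
def extractAndRewrite(e: Expr, aggExprs: ArrayBuffer[Alias]): Expr = e match {
  case m: Max =>
    val name = s"_agg${aggExprs.size}"
    aggExprs += Alias(m, name)
    Attr(name)
  case Foo(c)      => Foo(extractAndRewrite(c, aggExprs))
  case Add(l, r)   => Add(extractAndRewrite(l, aggExprs), extractAndRewrite(r, aggExprs))
  case Alias(c, n) => Alias(extractAndRewrite(c, aggExprs), n)
  case other       => other
}
```

On `foo(c1) + foo(max(c2)) + max(foo(c2))` this extracts `max(c2)` and `max(foo(c2))` whole (so the latter's UDF call stays inside the Aggregate), while `foo(c1)` keeps only the plain column reference.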
Contributor

@cloud-fan cloud-fan Jan 8, 2025

This reminds me of Aggregate normalization we did in RewriteWithExpression, which moves the result projection from Aggregate and puts it in a new Project node above Aggregate.

It's no harm to do this normalization but for safety we only do it when we have to, like the With expression and SQL UDF.

h.copy(child = a.copy(child = rewrite(a.child)))

case a: Aggregate if a.resolved && hasSQLFunctionExpression(a.expressions) =>
val child = rewrite(a.child)
Contributor

shall we rewrite SQL function top-down? Then the newly created Project under Aggregate can be rewritten in one pass.

Contributor

Or we create a util function that only rewrites a single node, then we call it at the end of Aggregate rewriting to rewrite the newly created Project.

Contributor Author

Yeah, that's something that can be explored. I plan to add more tests in the upcoming PRs to ensure correctness first, and after that we can make further improvements.

@allisonwang-db allisonwang-db force-pushed the spark-50762-resolve-scalar-udf branch from b865df5 to 1b26abe on January 8, 2025 at 12:37
// Outer references also need to be wrapped because the function input may already
// contain outer references.
val outer = expr.transform {
case a: Attribute => OuterReference(a)
Contributor

Why do we use OuterReference if we always rewrite SQL UDFs to scalar expressions?

Contributor Author

The first step to resolve a SQL UDF is to verify that the function body (expression or subquery) can be resolved correctly using the captured SQL configs. We wrap the function inputs with outer references so that we can run a simple analyzer on top:

Project [CAST(width * height AS DOUBLE) AS area]
  +- Project [CAST(outer(a) AS DOUBLE) AS width, CAST(outer(b) AS DOUBLE) AS height]
    +- OneRowRelation

Once analyzed, the next step is to inline the SQL UDF body into the original query plan tree (rewriteSQLFunctions).
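That wrapping step amounts to a bottom-up transform over the input expression. A minimal sketch, assuming illustrative stand-in classes (`Attr`, `OuterRef`, `Mul`, `wrapOuter`) rather than Catalyst's:

```scala
// Minimal stand-ins for Catalyst expressions (illustrative only).
sealed trait Expr
case class Attr(name: String) extends Expr
case class OuterRef(attr: Attr) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr

// Wrap every attribute in the function input with an outer reference, so the
// function body can be analyzed on top of OneRowRelation with the wrapped
// inputs treated as correlated outer columns.
def wrapOuter(e: Expr): Expr = e match {
  case a: Attr   => OuterRef(a)
  case Mul(l, r) => Mul(wrapOuter(l), wrapOuter(r))
  case other     => other
}
```

For example, wrapping the input `a * b` yields `outer(a) * outer(b)`, matching the `outer(...)` markers in the plan above.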

* A wrapper node for a SQL scalar function expression.
*/
case class SQLScalarFunction(function: SQLFunction, inputs: Seq[Expression], child: Expression)
extends UnaryExpression with UnaryLike[Expression] with Unevaluable {
Contributor

Suggested change
extends UnaryExpression with UnaryLike[Expression] with Unevaluable {
extends UnaryExpression with Unevaluable {

*
* Analyzed plan:
*
* Project [foo(x) AS foo]
Contributor

How do we eliminate the Aggregate and Filter?

// aggregate functions. They need to be extracted into a project list above the
// current aggregate.
val aggExprs = ArrayBuffer.empty[NamedExpression]
val topProjectList = aggregateExpressions.map(extractAndRewrite(_, aggExprs))
Contributor

@cloud-fan cloud-fan Jan 9, 2025

We can follow RewriteWithExpression to get the top project in a simpler way:

val PhysicalAggregation(groupingExprs, aggExprs, resultExprs, _, _) = a
val newGroupingExprs = groupingExprs.map(rewriteSQLFunctions(_, bottomProjectList))
val newAggExprs = aggExprs.map(rewriteSQLFunctions(_, bottomProjectList))
...

Another issue is that grouping expressions may appear in aggregateExpressions as well, and we want to avoid duplicating the SQL function expression. This can be solved by PullOutGroupingExpressions. We can create a util function in PullOutGroupingExpressions that rewrites a single Aggregate, then leverage it here. To put everything together:

val rewritten = PullOutGroupingExpressions.rewriteAgg(a)
val PhysicalAggregation(groupingExprs, aggExprs, resultExprs, _, _) = rewritten
val newAggExprs = aggExprs.map(rewriteSQLFunctions(_, bottomProjectList))
// no need to rewrite grouping expr as it won't contain SQL UDF now.
Project(resultExprs, rewritten.copy(
  aggExprs = newAggExprs,
  child = Project(bottomProjectList, rewritten.child))
)

Contributor

Note: this is a more aggressive rewrite, which rewrites all aggregate/grouping expressions with the same idea used for rewriting SQL UDFs. The code is simpler but the plan change is larger. I'm also OK with keeping the current implementation as it is.

Contributor Author

Sounds good. We can explore this once we have more test coverage.

/**
* Test suite for SQL user-defined functions (UDFs).
*/
class SQLFunctionSuite extends QueryTest with SharedSparkSession {
Contributor

Is this a duplication of the golden file tests? Those are also end-to-end.

Contributor Author

It's intended to test plan structures (for more complicated queries) and to cover other DDL commands in the future (such as DESCRIBE).

@allisonwang-db allisonwang-db force-pushed the spark-50762-resolve-scalar-udf branch from 73b650f to 46fb145 on January 13, 2025 at 21:51
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in bba6839 Jan 14, 2025