[Gold Standard] Enable stats with spark 2.4 with a sample query #429
base: master
== Physical Plan ==
CollectLimit 100
+- *(5) Project [cs_bill_customer_sk#1, ss_customer_sk#2]
+- *(5) SortMergeJoin [cs_sold_date_sk#3], [ss_sold_date_sk#4], Inner
Note: SortMergeJoin is used here because stats are enabled.
@@ -554,17 +553,35 @@ trait TPCDSBase extends SparkFunSuite with SparkInvolvedSuite {
""".stripMargin)
}

private val originalCBCEnabled = conf.cboEnabled
All changes from this line onward are taken directly from the Spark codebase. See https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala#L609
src/test/scala/com/microsoft/hyperspace/goldstandard/TPCDSTableStats.scala
Outdated
LGTM, thanks @apoorvedave1!
))
)
// scalastyle:on line.size.limit
}
nit: new line
src/test/scala/com/microsoft/hyperspace/goldstandard/TPCDSTableStats.scala
Outdated
LGTM 🚀
What is the context for this pull request?
As discussed previously, this PR enables stats for Spark's query plans.
What changes were proposed in this pull request?
Enable stats in the native Spark gold standard test cases. The stats are taken directly from the Apache Spark 3.0+ codebase, from this file: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSTableStats.scala
This ensures that joins between large tables use SortMergeJoin instead of a broadcast join.
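To illustrate why stats change the join strategy, here is a minimal sketch of a size-based choice. It is a simplified model, not Spark's planner code: `JoinChoiceSketch`, `chooseJoin`, and the byte sizes are hypothetical names and values made up for illustration. The only real Spark detail it relies on is the 10 MB default of `spark.sql.autoBroadcastJoinThreshold`.

```scala
// Simplified model of size-based join selection; NOT Spark's actual planner code.
object JoinChoiceSketch {
  // Default value of spark.sql.autoBroadcastJoinThreshold: 10 MB.
  val BroadcastThresholdBytes: Long = 10L * 1024 * 1024

  sealed trait JoinStrategy
  case object BroadcastHashJoin extends JoinStrategy
  case object SortMergeJoin extends JoinStrategy

  // Hypothetical helper: broadcast only when the smaller side is
  // estimated to fit under the threshold, otherwise sort-merge.
  def chooseJoin(leftSizeBytes: Long, rightSizeBytes: Long): JoinStrategy = {
    val smaller = math.min(leftSizeBytes, rightSizeBytes)
    if (smaller <= BroadcastThresholdBytes) BroadcastHashJoin else SortMergeJoin
  }

  def main(args: Array[String]): Unit = {
    // Empty test tables without stats look tiny, so broadcast is chosen.
    println(chooseJoin(0L, 0L))
    // With injected stats (illustrative sizes), both sides exceed the threshold.
    println(chooseJoin(5L * 1024 * 1024 * 1024, 2L * 1024 * 1024 * 1024))
  }
}
```

In the gold standard tests the tables hold no data, so without injected stats every table appears small enough to broadcast. Supplying realistic size estimates is what pushes the plan toward SortMergeJoin.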
Does this PR introduce any user-facing change?
No
How was this patch tested?
unit test