[SPARK-50767][SQL] Remove codegen of from_json
#49411
Could you update the test results for |
Well, okay. JDK17: https://github.com/panbingkun/spark/actions/runs/12666974432 |
+1 for including |
Does it mean any time we add codegen support for some functions, there is a risk of perf regression? Are we sure |
@panbingkun After giving it some more thought, we could try enabling If it weren't a |
I found the following content in the log:
|
object FromJsonBenchmark extends SqlBasedBenchmark {
  import spark.implicits._
  def withFilter(rowsNum: Int, numIters: Int): Unit = {
    val benchmark = new Benchmark("from_json in Filter", rowsNum, output = output)
    withTempPath { path =>
      prepareDataInfo(benchmark)
      val numCols = 500
      val schema = writeWideRow(path.getAbsolutePath, rowsNum, numCols)
      val jsonValue = from_json($"value", schema)
      val predicate = jsonValue.getField(s"col0") >= lit(100000) ||
        jsonValue.getField(s"col50") >= lit(100000) ||
        jsonValue.getField(s"col123") >= lit(100000)
      val caseName = s"from_object, codegen: no"
      benchmark.addCase(caseName, numIters) { _ =>
        val df = spark.read
          .text(path.getAbsolutePath)
          .where(predicate)
        df.write.mode("overwrite").format("noop").save()
      }
      benchmark.run()
    }
  }
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val numIters = 3
    runBenchmark("Benchmark for performance of from_json codegen") {
      withFilter(1_000_000, numIters)
    }
  }
}
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes 61029 62195 1781 0.0 61028.8 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes 61391 66201 7157 0.0 61391.2 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes 60653 61195 481 0.0 60652.7 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no 61289 62508 1155 0.0 61288.6 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no 61219 61663 386 0.0 61218.9 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no 61056 61362 287 0.0 61055.9 1.0X |
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes 2325 2341 26 0.0 23249.6 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes 2237 2266 36 0.0 22373.5 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: yes 2317 2403 74 0.0 23172.3 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no 2264 2286 20 0.0 22639.3 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no 2475 3010 554 0.0 24752.2 1.0X
OpenJDK 64-Bit Server VM 17.0.10+7-LTS on Mac OS X 15.2
Apple M2
from_json in Filter: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
from_object, codegen: no 2315 2780 480 0.0 23150.6 1.0X |
From the above scenario, it seems that there is no performance regression, and I am investigating other reasons. |
before: code-gen-before.txt It can be seen that after |
@cloud-fan @dongjoon-hyun @panbingkun How do we proceed with this issue? I think the risk of generating huge As more functions come to support code generation, the probability of generating huge methods will increase. It should apply to more than just filters, right? Perhaps we need to find a more universal approach to split the generated methods in order to avoid this risk? |
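To make the idea of splitting generated methods concrete, here is a toy sketch in plain Scala. It is not Spark's codegen API: `SplitSketch`, `chunk`, `render`, and the character-count threshold are all hypothetical names and simplifications chosen for illustration. It only shows the general shape of the technique: group statements so that no single emitted method grows past a limit, then call the helpers in order.

```scala
// Toy sketch only: plain Scala, not Spark's codegen API. All names are hypothetical.
object SplitSketch {
  // Greedily groups statements so that each group's total character count
  // stays within maxChars (a single oversized statement gets its own group).
  def chunk(statements: Seq[String], maxChars: Int): Seq[Seq[String]] = {
    val chunks = scala.collection.mutable.ArrayBuffer.empty[Seq[String]]
    var current = Vector.empty[String]
    var size = 0
    statements.foreach { s =>
      if (current.nonEmpty && size + s.length > maxChars) {
        chunks += current
        current = Vector.empty[String]
        size = 0
      }
      current = current :+ s
      size += s.length
    }
    if (current.nonEmpty) chunks += current
    chunks.toList
  }

  // Renders one helper method per group plus a driver that calls them in order.
  def render(statements: Seq[String], maxChars: Int): String = {
    val groups = chunk(statements, maxChars)
    val helpers = groups.zipWithIndex.map { case (body, i) =>
      val indented = body.map(line => "  " + line).mkString("\n")
      "void part" + i + "() {\n" + indented + "\n}"
    }
    val calls = groups.indices.map(i => "  part" + i + "();").mkString("\n")
    (helpers :+ ("void run() {\n" + calls + "\n}")).mkString("\n\n")
  }
}
```

A real generator would have to budget on bytecode size rather than source characters and pass shared state between helpers, which this sketch deliberately ignores.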
Give me some time to look at the root cause. |
In the val df = spark.read
.text(path.getAbsolutePath)
.where(predicate)
df.write.mode("overwrite").format("noop").save()
Ultimately, optimize the 500 calls to
|
val predicateCode = generatePredicateCode( |
there is no subexpressionElimination optimization here, so 500 calls will ultimately be applied to JsonToStructs.
If we can implement subexpressionElimination optimization in the method spark/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala Lines 68 to 80 in 0123a5e
cc @cloud-fan |
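The effect described above can be sketched with a toy model in plain Scala. Nothing here is Catalyst code: `Parse` merely stands in for an expensive expression such as JsonToStructs, `FieldGte` for a `getField(...) >= lit(...)` comparison, and the per-row cache is an assumed, simplified form of subexpression elimination. The point is only that, without sharing, every occurrence of the same parse in the predicate is evaluated again.

```scala
import scala.collection.mutable

// Toy model, not Spark's Catalyst classes: all names here are hypothetical.
object SubExprSketch {
  // Stands in for an expensive expression such as JsonToStructs.
  final case class Parse(input: String)

  sealed trait Pred
  // Stands in for jsonValue.getField(field) >= lit(bound).
  final case class FieldGte(source: Parse, field: String, bound: Int) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  final class Evaluator(shareCommon: Boolean) {
    var parseCalls: Int = 0
    // Per-row cache keyed by structurally equal Parse nodes.
    private val cache = mutable.Map.empty[Parse, Map[String, Int]]

    // The "expensive" work: parse "k=v,k=v" into a map and count the call.
    private def runParse(p: Parse): Map[String, Int] = {
      parseCalls += 1
      p.input.split(",").map { kv =>
        val parts = kv.split("=")
        parts(0) -> parts(1).toInt
      }.toMap
    }

    private def parsed(p: Parse): Map[String, Int] =
      if (shareCommon) cache.getOrElseUpdate(p, runParse(p)) else runParse(p)

    def eval(pred: Pred): Boolean = pred match {
      case FieldGte(src, field, bound) => parsed(src)(field) >= bound
      case Or(l, r) => eval(l) || eval(r)
    }
  }

  // Evaluates a three-way OR over the same parsed value and reports how many
  // times the parse ran. All comparisons are false, so nothing short-circuits.
  def parseCallsFor(shareCommon: Boolean): Int = {
    val json = Parse("col0=1,col50=2,col123=3")
    val predicate = Or(
      Or(FieldGte(json, "col0", 100000), FieldGte(json, "col50", 100000)),
      FieldGte(json, "col123", 100000))
    val evaluator = new Evaluator(shareCommon)
    evaluator.eval(predicate)
    evaluator.parseCalls
  }
}
```

In this model `SubExprSketch.parseCallsFor(false)` runs the parse three times while `SubExprSketch.parseCallsFor(true)` runs it once, which is the difference subexpression elimination is meant to make in the filter path.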
So the answer to this question is: In the |
@panbingkun great investigation! +1 to implement subexpression elimination for |
So this feature doesn't need to be reverted, right? |
any progress? @panbingkun |
It is being implemented and will take some time. |
What changes were proposed in this pull request?
The PR aims to remove codegen of from_json.
Why are the changes needed?
Based on the discussion and testing with SubExprEliminationBenchmark in #48466 (comment), after implementing codegen for from_json, there is a performance regression in the withFilter scenario with subExprElimination = true and codegen = true.
Let's remove it first and resubmit it after we solve the above issue.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass GA & Manually test.
Was this patch authored or co-authored using generative AI tooling?
No.