[SPARK-48816][SQL] Shorthand for interval converters in UnivocityParser

### What changes were proposed in this pull request? Directly call `IntervalUtils.castStringToDTInterval/castStringToYMInterval` instead of creating Cast expressions to evaluate. - Benchmarks indicated a 10% time-saving. - Bad record recording might not work if the cast handles the exceptions early ### Why are the changes needed? - pref improvement - Bugfix for bad record recording ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing existing tests and benchmark tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#47227 from yaooqinn/SPARK-48816. Authored-by: Kent Yao <[email protected]> Signed-off-by: Kent Yao <[email protected]>
jingz-db · Jul 22, 2024 · f287c78 · f287c78
1 parent 154e350
commit f287c78
Show file tree

Hide file tree

Showing 4 changed files with 134 additions and 85 deletions.
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
@@ -26,7 +26,7 @@ import com.univocity.parsers.csv.CsvParser
 import org.apache.spark.SparkUpgradeException
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.catalyst.{InternalRow, NoopFilters, OrderedFilters}
-import org.apache.spark.sql.catalyst.expressions.{Cast, EmptyRow, ExprUtils, GenericInternalRow, Literal}
+import org.apache.spark.sql.catalyst.expressions.{ExprUtils, GenericInternalRow}
 import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.catalyst.util.LegacyDateFormats.FAST_DATE_FORMAT
 import org.apache.spark.sql.errors.{ExecutionErrors, QueryExecutionErrors}
@@ -260,12 +260,14 @@ class UnivocityParser(
 
     case ym: YearMonthIntervalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        Cast(Literal(datum), ym).eval(EmptyRow)
+        IntervalUtils.castStringToYMInterval(
+          UTF8String.fromString(datum), ym.startField, ym.endField)
       }
 
     case dt: DayTimeIntervalType => (d: String) =>
       nullSafeDatum(d, name, nullable, options) { datum =>
-        Cast(Literal(datum), dt).eval(EmptyRow)
+        IntervalUtils.castStringToDTInterval(
+          UTF8String.fromString(datum), dt.startField, dt.endField)
       }
 
     case udt: UserDefinedType[_] =>

diff --git a/sql/core/benchmarks/CSVBenchmark-jdk21-results.txt b/sql/core/benchmarks/CSVBenchmark-jdk21-results.txt
@@ -2,69 +2,76 @@
 Benchmark to measure CSV read/write performance
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-One quoted string                                 23353          23432          75          0.0      467067.4       1.0X
+One quoted string                                 23962          24182         316          0.0      479231.3       1.0X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 1000 columns                               56825          57244         679          0.0       56825.1       1.0X
-Select 100 columns                                20482          20568          86          0.0       20481.7       2.8X
-Select one column                                 16968          17000          36          0.1       16967.7       3.3X
-count()                                            3366           3378          11          0.3        3366.4      16.9X
-Select 100 columns, one bad input field           28347          28379          30          0.0       28346.6       2.0X
-Select 100 columns, corrupt record field          32401          32450          42          0.0       32401.2       1.8X
+Select 1000 columns                               56724          57115         570          0.0       56724.1       1.0X
+Select 100 columns                                20740          20855         115          0.0       20739.7       2.7X
+Select one column                                 17304          17377         114          0.1       17304.3       3.3X
+count()                                            3719           3740          21          0.3        3719.0      15.3X
+Select 100 columns, one bad input field           24943          24999          69          0.0       24943.2       2.3X
+Select 100 columns, corrupt record field          28306          28341          31          0.0       28306.2       2.0X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Select 10 columns + count()                       11174          11195          18          0.9        1117.4       1.0X
-Select 1 column + count()                          7666           7694          24          1.3         766.6       1.5X
-count()                                            2042           2048           5          4.9         204.2       5.5X
+Select 10 columns + count()                       10977          10982           5          0.9        1097.7       1.0X
+Select 1 column + count()                          7406           7554         131          1.4         740.6       1.5X
+count()                                            1550           1558           9          6.5         155.0       7.1X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-Create a dataset of timestamps                      854            882          27         11.7          85.4       1.0X
-to_csv(timestamp)                                  6166           6174          13          1.6         616.6       0.1X
-write timestamps to files                          6480           6575         158          1.5         648.0       0.1X
-Create a dataset of dates                           948            949           1         10.6          94.8       0.9X
-to_csv(date)                                       4471           4474           3          2.2         447.1       0.2X
-write dates to files                               4599           4616          15          2.2         459.9       0.2X
+Create a dataset of timestamps                      845            847           3         11.8          84.5       1.0X
+to_csv(timestamp)                                  5546           5597          57          1.8         554.6       0.2X
+write timestamps to files                          5760           5768           8          1.7         576.0       0.1X
+Create a dataset of dates                          1053           1064           9          9.5         105.3       0.8X
+to_csv(date)                                       4115           4122           9          2.4         411.5       0.2X
+write dates to files                               4102           4108           5          2.4         410.2       0.2X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Read dates and timestamps:                                             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -----------------------------------------------------------------------------------------------------------------------------------------------------
-read timestamp text from files                                                  1200           1213          12          8.3         120.0       1.0X
-read timestamps from files                                                     11576          11601          22          0.9        1157.6       0.1X
-infer timestamps from files                                                    23234          23253          16          0.4        2323.4       0.1X
-read date text from files                                                       1115           1162          44          9.0         111.5       1.1X
-read date from files                                                           10978          11006          43          0.9        1097.8       0.1X
-infer date from files                                                          22588          22604          13          0.4        2258.8       0.1X
-timestamp strings                                                               1224           1236          21          8.2         122.4       1.0X
-parse timestamps from Dataset[String]                                          13566          13595          41          0.7        1356.6       0.1X
-infer timestamps from Dataset[String]                                          25057          25094          36          0.4        2505.7       0.0X
-date strings                                                                    1618           1626           7          6.2         161.8       0.7X
-parse dates from Dataset[String]                                               12784          12816          34          0.8        1278.4       0.1X
-from_csv(timestamp)                                                            12008          12088          69          0.8        1200.8       0.1X
-from_csv(date)                                                                 11930          11938          12          0.8        1193.0       0.1X
-infer error timestamps from Dataset[String] with default format                14366          14394          35          0.7        1436.6       0.1X
-infer error timestamps from Dataset[String] with user-provided format          14380          14412          52          0.7        1438.0       0.1X
-infer error timestamps from Dataset[String] with legacy format                 14439          14453          21          0.7        1443.9       0.1X
+read timestamp text from files                                                  1107           1119          16          9.0         110.7       1.0X
+read timestamps from files                                                      9511           9553          49          1.1         951.1       0.1X
+infer timestamps from files                                                    19084          19114          27          0.5        1908.4       0.1X
+read date text from files                                                       1036           1046          14          9.7         103.6       1.1X
+read date from files                                                            8299           8309          15          1.2         829.9       0.1X
+infer date from files                                                          17290          17294           4          0.6        1729.0       0.1X
+timestamp strings                                                               1188           1197           7          8.4         118.8       0.9X
+parse timestamps from Dataset[String]                                          11442          11458          14          0.9        1144.2       0.1X
+infer timestamps from Dataset[String]                                          21076          21116          39          0.5        2107.6       0.1X
+date strings                                                                    1651           1659          10          6.1         165.1       0.7X
+parse dates from Dataset[String]                                               10181          10186           5          1.0        1018.1       0.1X
+from_csv(timestamp)                                                            10023          10062          34          1.0        1002.3       0.1X
+from_csv(date)                                                                  9335           9351          15          1.1         933.5       0.1X
+infer error timestamps from Dataset[String] with default format                11187          11205          16          0.9        1118.7       0.1X
+infer error timestamps from Dataset[String] with user-provided format          11201          11216          13          0.9        1120.1       0.1X
+infer error timestamps from Dataset[String] with legacy format                 11210          11227          17          0.9        1121.0       0.1X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-w/o filters                                        4302           4383         137          0.0       43020.6       1.0X
-pushdown disabled                                  4206           4220          13          0.0       42058.8       1.0X
-w/ filters                                          776            784          10          0.1        7756.3       5.5X
+w/o filters                                        4365           4377          13          0.0       43653.8       1.0X
+pushdown disabled                                  4348           4370          22          0.0       43477.7       1.0X
+w/ filters                                          695            713          29          0.1        6950.2       6.3X
+
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
+AMD EPYC 7763 64-Core Processor
+Interval:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------------------------------
+Read as Intervals                                  7089           7096           7          0.4        2362.1       1.0X
+Read Raw Strings                                   2071           2075           6          1.4         690.1       3.4X