All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Spark 3.0 Migration
- Migrate to Spark version 3.0.1, Hadoop 3.2.1 and Scala 2.12
- Spark 3 uses the Proleptic Gregorian calendar.
In case there are problems when data sources have dates before 1582 or other problematic formats, as a quick fix we can set the
following Spark parameters in the pipelines:
"spark.sql.legacy.timeParserPolicy": "LEGACY", "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "LEGACY", "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "LEGACY"
An example of an exception related to parsing dates and timestamps looks like this:
SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '00/00/0000' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
Note 1: there are also two other exceptions that we observed when reading or writing Parquet files with old date/time formats. They look very similar to the Spark upgrade exception above, but point to the respective spark.sql.legacy.parquet.datetimeRebaseModeInXXXXX property (InRead or InWrite) instead.
Note 2: the settings provided above should cover all the exceptions enumerated here for a given data source.
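As a rough illustration of the quick fix above, the three legacy settings can be kept in one mapping and passed to a job either as `spark-submit` flags or through a session builder. This is only a sketch: the helper names `to_spark_submit_args` and `apply_to_builder` are hypothetical and not part of this project, and how the pipelines actually receive Spark parameters may differ.

```python
# The three legacy date/time settings quoted in the entry above.
LEGACY_DATETIME_CONF = {
    "spark.sql.legacy.timeParserPolicy": "LEGACY",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "LEGACY",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead": "LEGACY",
}


def to_spark_submit_args(conf):
    """Turn a conf mapping into `--conf key=value` arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


def apply_to_builder(builder, conf):
    """Apply each setting via builder.config(key, value), e.g. on SparkSession.builder.

    Hypothetical helper; it accepts any object exposing a config(key, value)
    method that returns a builder.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


if __name__ == "__main__":
    # Print the flags that could be appended to a spark-submit command line.
    print(" ".join(to_spark_submit_args(LEGACY_DATETIME_CONF)))
```

With PySpark available, `apply_to_builder(SparkSession.builder, LEGACY_DATETIME_CONF).getOrCreate()` would be one way to start a session with these settings. Keeping the settings in a single mapping makes it easier to remove them once the affected data sources are cleaned up.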
- Fix reconciliation execution time by removing unneeded caching stage.
- Enable multi-line option for append loads
- Fix duplicate issues generated by the latest changes applied to CompetitorDataPreprocessor.
- Make init condensation optional (enabled by default).
- Modify append load to support more complex partitioning strategies without file_regex
- Added support for configuring the write load mode and the number of output files in append load
- Support for specifying the quote and escape characters. See the Spark DataFrameReader documentation for how to specify them: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html
- Support for multiple partition attributes (non date-derived) and single non date-derived partition attributes.
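
For the quote and escape character support listed above, the following sketch shows how such values map onto the `quote` and `escape` options of Spark's CSV reader. The helper `csv_read_options` and the input path are hypothetical; how this project exposes these settings in its own configuration is not shown here.

```python
def csv_read_options(quote='"', escape="\\", header=True):
    """Build an options mapping for Spark's CSV reader (hypothetical helper).

    The keys "quote", "escape" and "header" are CSV option names documented
    for DataFrameReader; the defaults here are only illustrative.
    """
    return {
        "header": str(header).lower(),
        "quote": quote,
        "escape": escape,
    }


if __name__ == "__main__":
    # Example: a file that quotes fields with single quotes.
    print(csv_read_options(quote="'"))

# With PySpark installed, the mapping could be passed on like this
# (the path is a placeholder, not a real location):
#   df = spark.read.options(**csv_read_options(quote="'")).csv("/path/to/input")
```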