We are excited to announce the release of Delta Lake 3.3.0! This release includes several new features and improvements.
Highlights
- [Delta Spark] Support for Identity Columns, which assign unique values to each record inserted into a table.
- [Delta Spark] Support VACUUM LITE to deliver faster VACUUM for periodically run VACUUM commands.
- [Delta Spark] Support for Row Tracking Backfill to alter an existing table to enable Row Tracking. Row Tracking allows engines such as Spark to track row-level lineage in Delta Lake tables.
- [Delta Spark] Support for enhanced table state validation with version checksums, and improved Snapshot initialization performance based on these checksums.
- [Delta UniForm] Support for enabling UniForm Iceberg on existing tables without rewriting the data files using ALTER TABLE.
- [Delta Kernel] Support for reading Delta tables that have Type Widening enabled.
Details for each component are listed below.
Delta Spark
Delta Spark 3.3.0 is built on Apache Spark™ 3.5.3. As with Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.3.0/index.html
- API documentation: https://docs.delta.io/3.3.0/delta-apidoc.html#delta-spark
- Maven artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/3.3.0/
The key features of this release are:
- Support for Identity Columns: Delta Lake identity columns are a type of generated column that automatically assigns a unique value to each record inserted into a table. Users do not need to explicitly provide values for these columns during data insertion. Identity columns offer a straightforward and efficient mechanism to generate unique keys for table rows, combining ease of use with high performance. See the documentation for more information.
- Support VACUUM LITE to deliver a faster VACUUM for periodically scheduled VACUUM commands. When running VACUUM in LITE mode, instead of listing all files in the table directory, VACUUM uses the Delta transaction log to identify and remove files no longer referenced by any table version within the retention duration (see the VACUUM sketch after this list).
- Support for Row Tracking Backfill: The Row Tracking feature can now be enabled on existing Delta Lake tables to track row-level lineage in Delta Spark; previously this was only possible for new tables. Users can now run `ALTER TABLE table_name SET TBLPROPERTIES (delta.enableRowTracking = true)` to enable Row Tracking on an existing table. When enabled, users can identify rows across multiple versions of the table and can access this tracking information using the two metadata fields `_metadata.row_id` and `_metadata.row_commit_version` (see the Row Tracking sketch after this list). Refer to the documentation on Row Tracking for more information and examples.
- Delta Lake now generates version checksums for each table commit, providing stronger consistency guarantees and improved debugging capabilities. The checksum tracks detailed table metrics including file counts, table size, and data distribution histograms. This enables automatic detection of potential state inconsistencies and helps maintain table integrity in distributed environments. The state validation is performed on every checkpoint. The checksum is also used to bypass the initial Spark query that retrieves the Protocol and Metadata actions, decreasing snapshot initialization latency.
- Liquid clustering updates:
- Support OPTIMIZE FULL to fully recluster a Liquid table. This command optimizes all records in a table that uses liquid clustering, including data that might have previously been clustered.
- Support enabling liquid clustering on an existing unpartitioned Delta table using ALTER TABLE <table_name> CLUSTER BY (<clustering_columns>). Previously, liquid clustering could only be enabled at table creation (see the clustering sketch after this list).
- Support creating a clustered table from an external location.
- The In-Commit Timestamp table feature is no longer in preview. When enabled, this feature persists monotonically increasing timestamps within Delta commits, ensuring they are not affected by file operations. With this, time travel queries yield consistent results even if the table directory is relocated. This feature was available in preview in Delta 3.2 and is now generally available in Delta 3.3 (see the In-Commit Timestamp sketch after this list). See the documentation for more information.
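To make these commands concrete, here is a minimal PySpark sketch of a LITE vacuum. This and the following sketches are illustrative rather than definitive: the table name `events` is a placeholder, and the `VACUUM ... LITE` SQL form follows the Delta 3.3 documentation.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build a Delta-enabled SparkSession (delta-spark 3.3.0).
spark = configure_spark_with_delta_pip(
    SparkSession.builder.appName("delta-3.3-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
).getOrCreate()

# LITE mode consults the transaction log instead of listing the full
# table directory, which makes periodically scheduled VACUUM runs faster.
spark.sql("VACUUM events LITE")
```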
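Enabling Row Tracking on an existing table and reading the lineage fields looks like the following sketch, reusing the `spark` session and placeholder table from above.

```python
# Enable Row Tracking on an existing table; Delta backfills row tracking
# information for data that is already in the table.
spark.sql(
    "ALTER TABLE events SET TBLPROPERTIES ('delta.enableRowTracking' = 'true')"
)

# Row-level lineage is exposed through two metadata fields.
spark.sql(
    "SELECT _metadata.row_id, _metadata.row_commit_version FROM events"
).show()
```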
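The liquid clustering updates can be exercised the same way; a sketch with hypothetical clustering columns on the same placeholder table.

```python
# Enable liquid clustering on an existing unpartitioned Delta table.
spark.sql("ALTER TABLE events CLUSTER BY (event_date, user_id)")

# Recluster all records, including data clustered under an older layout.
spark.sql("OPTIMIZE events FULL")
```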
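And In-Commit Timestamps are switched on through a table property; the property name `delta.enableInCommitTimestamps` below is taken from the Delta documentation, and the timestamp literal is illustrative.

```python
# Persist monotonically increasing timestamps inside each commit so time
# travel stays consistent even if the table directory is relocated.
spark.sql(
    "ALTER TABLE events SET TBLPROPERTIES "
    "('delta.enableInCommitTimestamps' = 'true')"
)

# Timestamp-based time travel now resolves against in-commit timestamps.
spark.sql("SELECT * FROM events TIMESTAMP AS OF '2025-01-01 00:00:00'").show()
```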
Other notable changes include:
- Protocol upgrade/downgrade improvements
- Support dropping table features for columnMapping, vacuumProtocolCheck, and checkConstraints.
- Improve table protocol transitions to simplify the user journey when altering the table protocol.
- Support protocol version downgrades when the existing table features exist in the lower protocol version.
- Update protocol upgrade behavior such that when enabling a legacy feature via a table property (e.g., setting `delta.enableChangeDataFeed = true`), the protocol is upgraded to (1,7) and only that legacy feature is enabled. Previously, the minimum protocol version would be selected and all preceding legacy features enabled.
- Support enabling a table feature on a table using the Python DeltaTable API with `deltaTable.addFeatureSupport(...)` (see the `addFeatureSupport` sketch after this list).
- Type-widening improvements
- Support automatic type widening in Delta Sink when type widening is enabled on the table and schema evolution is enabled on the sink.
- Support type widening on nested fields when other nested fields in the same struct are referenced by check constraints or generated column expressions.
- Fix type-widening operation validation for map, array or struct columns used in generated column expressions or check constraints.
- Fix to read the file schema directly from the Parquet footers when identifying the files to be rewritten while dropping the type widening table feature.
- Fix using type widening on a table containing a char/varchar column.
- Liquid clustering improvements
- Fix liquid clustering to automatically fall back to Z-order clustering when clustering on a single column. Previously, any attempts to optimize the table would fail.
- Support RESTORE on clustered tables. Previously, RESTORE operations would not restore clustering metadata.
- Support SHOW TBLPROPERTIES for clustered tables.
- Support for partition-like data skipping filters (preview): When enabled by setting `spark.databricks.delta.skipping.partitionLikeFilters.enabled`, Delta applies arbitrary data skipping filters that reference Liquid clustering columns to files with the same min and max values on the clustering columns. This can decrease the number of files scanned for selective queries on large Liquid tables (see the configuration sketch after this list).
- Performance improvements
- Improve the performance of finding the last complete checkpoint with more efficient file listing.
- Push down query filters when reading CDF so the filters can be used for partition pruning and row group skipping.
- Fix streaming CDF queries to not read log entries beyond endOffset for reduced processing time.
- Optimize REORG TABLE to remove dropped columns from the parquet files.
- Optimize MERGE resolution by batching the resolution of target column names and the resolution of assignment expressions to improve analysis performance on wide tables.
- Miscellaneous enhancements
- Support the CLONE operation in the Scala DeltaTable API.
- Support running optimize operations in batches by setting the config `spark.databricks.delta.optimize.batchSize`.
- Support specifying a struct column that contains nested columns of unsupported stats types in `delta.dataSkippingStatsColumns`. Previously this would throw an exception.
- Support schema evolution for map or array root types.
- Support schema evolution for struct types nested within a map.
- Support non-deterministic expressions in the updated/inserted clauses of MERGE.
- Support reading the `_commit_timestamp` column using In-Commit Timestamps in CDF reads when ICT is enabled on a table (see the CDF sketch after this list).
- Write timestamp partition values in UTC. This prevents ambiguous behavior when reading timestamp partition columns that do not store timezone information.
- Miscellaneous bug fixes
- Support CREATE TABLE LIKE with user provided properties. Previously any properties that were provided in the SQL command were ignored and only the properties from the source table were used.
- Fix a bug where providing a query filter that compares two `Literal` expressions would cause an infinite loop when constructing data skipping filters.
- Fix In-Commit Timestamps to use `clock.currentTimeMillis()` instead of `System.nanoTime()` for large commits, since some systems return a very small number when `System.nanoTime()` is called.
- Support concurrent CREATE TABLE IF NOT EXISTS commands. Previously the second command would fail with a “TABLE_ALREADY_EXISTS” error.
- Fix to correctly read the `_commit_timestamp` column in Change Data Feed when reading from timezones other than UTC.
- Improve the error messages when a by-path Delta table cannot be accessed by forwarding the encountered error messages instead of always throwing a “missing delta table” exception.
- Fix inconsistencies with special characters in the min/max file statistics.
- Fix the Scala/Python DeltaTable merge API to execute using the Spark DataFrame API instead of completely bypassing Spark’s analysis. This allows the merge operation to be visible in the QueryExecutionListener.
- Fix an issue with CDF queries failing when using Spark Connect.
- Fix a bug in the conflict resolution for OPTIMIZE when a concurrent transaction adds deletion vectors.
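As referenced in the protocol notes above, here is a minimal sketch of the Python `addFeatureSupport` API; the table and feature names are illustrative, and `spark` is a Delta-enabled session as in the earlier sketches.

```python
from delta.tables import DeltaTable

dt = DeltaTable.forName(spark, "events")

# Upgrades the protocol to (1, 7) if needed and enables only the named
# feature, without dragging in unrelated legacy features.
dt.addFeatureSupport("deletionVectors")
```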
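The partition-like data skipping preview mentioned above is a session-level opt-in; this sketch assumes the flag accepts a simple boolean value.

```python
# Opt in to partition-like data skipping (preview): filters on Liquid
# clustering columns are applied to files whose min and max values on
# those columns are identical, as if they were partition filters.
spark.conf.set(
    "spark.databricks.delta.skipping.partitionLikeFilters.enabled", "true"
)
```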
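Finally, a sketch of a Change Data Feed read that exercises the improvements above: filters are pushed down, and `_commit_timestamp` reflects In-Commit Timestamps when ICT is enabled. The table `events` is a placeholder with CDF enabled.

```python
from pyspark.sql import functions as F

# Read the change feed from version 0 onward; the filter can now be used
# for partition pruning and row-group skipping.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("events")
    .filter(F.col("_change_type") == "insert")
    .select("_commit_version", "_commit_timestamp", "_change_type")
)
changes.show()
```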
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.3.0/delta-uniform.html
- Maven artifacts: delta-iceberg_2.12, delta-iceberg_2.13, delta-hudi_2.12, delta-hudi_2.13
You can now enable UniForm Iceberg on existing Delta tables without rewriting data files. You can then seamlessly read the table downstream in Iceberg clients such as Spark and Snowflake. See Enable by altering an existing table.
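A minimal sketch of that ALTER TABLE path follows; the property names `delta.enableIcebergCompatV2` and `delta.universalFormat.enabledFormats` are taken from the UniForm documentation, and `events` is a placeholder Delta table with a Delta-enabled SparkSession `spark` as in the earlier sketches.

```python
# Generate Iceberg metadata for an existing Delta table without
# rewriting any data files; Iceberg clients can then read the table.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```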
Other notable changes include:
- Support Timestamp-type partition columns for UniForm Iceberg.
- Support automatically running expireSnapshot on the UniForm Iceberg table to clean up old manifests whenever OPTIMIZE is run on the Delta table.
- Support retries for the Delta UniForm Iceberg conversion.
- Support list and map data types for UniForm Hudi.
- Miscellaneous bug fixes
Delta Kernel
- API documentation: https://docs.delta.io/3.3.0/api/java/kernel/index.html
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java and Rust libraries for building Delta connectors that can read from and write to Delta tables without needing to understand the Delta protocol details.
This release of Delta Kernel Java contains the following changes:
- Delta Kernel Java and Rust now support reading Delta tables that have Type Widening enabled. The default ParquetHandlers provided by both Delta Kernel implementations support reading tables to which any of the type changes covered by the feature have been applied.
- Support cleaning up expired log files as part of checkpointing.
- Support data skipping on timestamp and timestamp_ntz type columns.
- Support writing to tables with the inCommitTimestamp table feature enabled.
- Other notable read-side changes
- Support reading tables whose schemas contain field metadata of long type.
- Support pushing down predicates IS NULL and IS NOT NULL into the default parquet reader to prune the row groups read.
- Fix a bug with reading tables with spaces in the table path.
- Add a new utility function to determine if a given partition exists (i.e. actually contains data) in a snapshot.
- Support retrieving the timestamp of the latest commit in a snapshot using `Snapshot.getTimestamp`.
- Support getting the partition columns of the table for a snapshot.
- Support legacy Parquet schemas for map type and array type in the `DefaultParquetHandler`.
- Support reading timestamp_ntz columns that are stored in Parquet as INT96.
- Other notable write-side changes
- Write timestamps using the INT64 physical format in Parquet in the `DefaultParquetHandler`. Previously they were written as INT96, which is an outdated and deprecated format for timestamps.
- Fix a bug when writing decimal type data as binary in the `DefaultParquetHandler`.
- Fix to correctly escape special characters in the partition values when constructing the partition paths during writes.
- Expression support changes and fixes
- Lazily evaluate comparator expressions in the `DefaultExpressionHandler`. Previously, expressions would be eagerly evaluated for every row in the underlying vectors.
- Support the SQL expression LIKE for comparing strings.
- Support the SQL expression IS NOT DISTINCT FROM (aka null-safe equals).
- Fix string comparison in the `DefaultExpressionHandler` to use UTF8.
- Fix binary comparisons to use unsigned comparison in the `DefaultExpressionHandler`.
- Developer-facing improvements
- Enable a code formatter for Java.
- Enact a set of exception principles to encourage good exception practices.
Other projects
Delta Sharing Spark
- Upgrade the delta-sharing-client from 1.1.1 to 1.2.2.
- Support typeWidening, variantType, and timestampNtz table features for Delta Format Sharing.
Delta Storage
- Fix an issue where the S3DynamoDBLogStore (used for safe, concurrent multi-cluster writes to S3) would make extraneous GET calls to DynamoDB during Delta VACUUM operations, impacting performance.
Delta Flink
- Support partition columns of Date type in the Delta Sink.
Credits
Abhishek Radhakrishnan, Adam Binford, Alden Lau, Aleksei Shishkin, Alexey Shishkin, Allison Portis, Ami Oka, Amogh Jahagirdar, Andreas Chatzistergiou, Andrew Xue, Anish, Annie Wang, Avril Aysha, Bart Samwel, Burak Yavuz, Carmen Kwan, Charlene Lyu, ChengJi-db, Chirag Singh, Christos Stavrakakis, Cuong Nguyen, Dhruv Arya, Eduard Tudenhoefner, Felipe Pessoto, Fokko Driesprong, Fred Storage Liu, Hao Jiang, Hyukjin Kwon, Jacek Laskowski, Jackie Zhang, Jade Wang, James DeLoye, Jiaheng Tang, Jintao Shen, Johan Lasperas, Juliusz Sompolski, Jun, Jungtaek Lim, Kaiqi Jin, Kam Cheung Ting, Krishnan Paranji Ravi, Lars Kroll, Leon Windheuser, Lin Zhou, Liwen Sun, Lukas Rupprecht, Marko Ilić, Matt Braymer-Hayes, Maxim Gekk, Michael Zhang, Ming DAI, Mingkang Li, Nils Andre, Ole Sasse, Paddy Xu, Prakhar Jain, Qianru Lao, Qiyuan Dong, Rahul Shivu Mahadev, Rajesh Parangi, Rakesh Veeramacheneni, Richard Chen, Richard-code-gig, Robert Dillitz, Robin Moffatt, Ryan Johnson, Sabir Akhadov, Scott Sandre, Sergiu Pocol, Shawn Chang, Shixiong Zhu, Sumeet Varma, Tai Le Manh, Taiga Matsumoto, Tathagata Das, Thang Long Vu, Tom van Bussel, Tulio Cavalcanti, Venki Korukanti, Vishwas Modhera, Wenchen Fan, Yan Zhao, YotillaAntoni, Yumingxuan Guo, Yuya Ebihara, Zhipeng Mao, Zihao Xu, zzl-7