[Feature Request] Dynamic spark.databricks.delta.snapshotPartitions based on size of snapshot #3351
Feature request
Which Delta project/connector is this regarding?
Overview
Currently, the `spark.databricks.delta.snapshotPartitions` value is static. The idea is to make this value depend on the size of the snapshot, so that the cached snapshot ends up with well-sized partitions.
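For context, here is a minimal sketch of how the value is configured today, a single static number for the whole session, using the standard Spark configuration API (the values shown are only examples):

```scala
import org.apache.spark.sql.SparkSession

// Today the partition count used when caching Delta snapshot state is one
// static value, applied to every table regardless of snapshot size.
val spark = SparkSession.builder()
  .appName("delta-snapshot-partitions-example")
  // Default is 50; a tiny table and a huge table both get this many partitions.
  .config("spark.databricks.delta.snapshotPartitions", "50")
  .getOrCreate()

// The value can also be changed at runtime, but it still applies uniformly
// to every snapshot computed afterwards.
spark.conf.set("spark.databricks.delta.snapshotPartitions", "10")
```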
Motivation
Delta computes the snapshot to determine which Parquet files to read, and caches the snapshot to keep planning and execution performant. The number of cached partitions depends on `spark.databricks.delta.snapshotPartitions`. For bigger tables the default value of 50 may be reasonable, but for smaller tables it usually results in partitions of only a few bytes each. This does not play well with dynamic allocation: it is not recommended to kill an executor that has cached partitions on it, and by default Spark sets the decommission timeout to infinity for such executors. This often leaves an idle executor alive just because it holds a few bytes of cached Delta snapshot state. The converse is also true: for a bigger snapshot the value might be too small, causing the job to fail. This value should be abstracted away from users.
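To make the dynamic allocation interaction concrete, here is a hedged example; the properties below are standard Spark dynamic allocation settings, and the timeout value is only illustrative, not a recommendation:

```scala
import org.apache.spark.sql.SparkSession

// With dynamic allocation enabled, spark.dynamicAllocation.cachedExecutorIdleTimeout
// defaults to infinity, so an executor holding even a few bytes of cached
// snapshot state is never released: the idle-executor problem described above.
val spark = SparkSession.builder()
  .appName("delta-dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  // Possible workaround today: allow executors with cached blocks to be
  // reclaimed after a bounded idle time, at the cost of recomputing the
  // snapshot later.
  .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "300s")
  .getOrCreate()
```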
Further details
A naïve approach: can we leverage AQE here? Perhaps introduce a configuration that deals directly with the size of the snapshot and remove the partition-count setting (see the sketch below).
A simpler win could be to also allow users to configure the storage level of their snapshot caches.
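As a rough sketch of the size-based idea (not an existing Delta API; `targetPartitionBytes` and `estimatedSnapshotSizeBytes` are hypothetical names introduced here for illustration), the partition count could be derived from an estimated snapshot size and a target bytes-per-partition knob instead of a fixed number:

```scala
// Hypothetical sketch: choose the number of snapshot partitions from the
// estimated size of the snapshot state rather than a fixed config value.
object SnapshotPartitioningSketch {
  // Assumed knob: desired bytes of snapshot state per cached partition
  // (such a setting does not exist in Delta today).
  val targetPartitionBytes: Long = 32L * 1024 * 1024 // 32 MB

  // Clamp to a sane range so tiny tables get very few partitions while huge
  // snapshots still get enough parallelism.
  def partitionsFor(estimatedSnapshotSizeBytes: Long,
                    minPartitions: Int = 1,
                    maxPartitions: Int = 10000): Int = {
    val raw = math.ceil(estimatedSnapshotSizeBytes.toDouble / targetPartitionBytes).toInt
    math.min(maxPartitions, math.max(minPartitions, raw))
  }
}

// Example usage: a ~1 MB snapshot would be cached as a single partition
// instead of 50, while a ~10 GB snapshot would get roughly 320 partitions.
// SnapshotPartitioningSketch.partitionsFor(1L * 1024 * 1024)           // == 1
// SnapshotPartitioningSketch.partitionsFor(10L * 1024 * 1024 * 1024)   // == 320
```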
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?