Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Helper function to get recently updated partitions #130

Open
1 of 3 tasks
huynguyent opened this issue Feb 1, 2024 · 4 comments
Open
1 of 3 tasks

Helper function to get recently updated partitions #130

huynguyent opened this issue Feb 1, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@huynguyent
Copy link

Is your feature request related to a problem? Please describe.

For streaming systems (or batch systems that run in high frequency) that write data into delta tables, it's a common problem to have lots of small files. In many cases, it's not practical to auto compact because of various reasons, for example

  • Auto compaction is not available in Delta lake before 3.1.0
  • Auto compaction might not be well supported outside Spark

One way to solve this is to have a separate process that perform optimization regularly on these delta tables. However it's not a good idea to optimize the entire table whenever without any constraint. A few example reasons:

  • While in theory optimize is a no-op if the partitions weren't updated, it still takes some overhead per partition to determine it's a no-op. This could add up quite significantly when you have lots of partitions.
  • If the optimize operation included z-order, subsequent z-order operations won't be no-op even if the partitions weren't updated

Describe the solution you'd like
A helper function to find out which partitions have been updated between some time period, for example

def get_updated_partitions(delta_table: DeltaTable, start_time: datetime.datetime, end_time: datetime.datetime, exclude_optimize_operations: bool) -> list[dict[str, str]]

The exclude_optimize_operations flag is necessary because optimization operations themselves are also update operations. If the operations are not excluded, they might cause a feedback loop since they will keep showing up in the output.

All the information needed for this features should be available in the transaction log.

Describe alternatives you've considered
Optimizing the entire table and accept the overhead

Not sure what's a good alternative once z-order is used however

Additional context

N/A

Willingness to contribute

Would you be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the mack community.
  • No. I cannot contribute this feature at this time.
@huynguyent huynguyent added the enhancement New feature or request label Feb 1, 2024
@MrPowers
Copy link
Owner

MrPowers commented Feb 1, 2024

Sounds good to me. Only worry is that we won't be able to access the necessary information from the transaction log without using private methods.

The Delta Rust transaction log APIs expose more info. Anyways, I am cool with this, just want to make sure we rely on public interfaces!

@huynguyent
Copy link
Author

Sadly the transaction log information seems to only be exposed in the Scala version, not Python one :(

https://books.japila.pl/delta-lake-internals/DeltaLog/

If we wanna do this in pyspark, we would have to reach into the JVM get this information. Not sure if that would count as public interfaces though. The delta-spark library regularly reach into the JVM from the pyspark side, for example

https://github.com/delta-io/delta/blob/master/python/delta/tables.py#L52C14-L52C18

@MrPowers
Copy link
Owner

MrPowers commented Feb 1, 2024

@huynguyent - you can add this function to levi if you'd like: https://github.com/MrPowers/levi

There is a get_add_actions method in Delta Rust that probably exposes the necessary details: https://delta-io.github.io/delta-rs/api/delta_table/#deltalake.DeltaTable.get_add_actions

@huynguyent
Copy link
Author

Yup, levi might be a more straightforward place for this feature. I'll raise an isse and look into implementing it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants