Cherrypick the data rows [deleted or old values] from a past snapshot #12271

Open
1 of 3 tasks
Shekharrajak opened this issue Feb 14, 2025 · 8 comments
Labels
improvement PR that improves existing functionality

Comments

@Shekharrajak

Feature Request / Improvement

Hello team,

Is there a way to restore specific partitions or data rows from an old snapshot into the main snapshot?

Example:

When we delete a partition x from the main snapshot branch, a new commit and snapshot are created. Adding new partitions afterwards increments the snapshots further. If we then want to get the old partition x back, what APIs do we have? I could not find a way to mark deleted data files as live again in the manifest so that they are available in a new snapshot.

If we do not have such APIs, let's discuss the design for future version.

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
@Shekharrajak Shekharrajak added the improvement PR that improves existing functionality label Feb 14, 2025
@RussellSpitzer
Member

You do need to make a new snapshot. I would use the Table Append API to just re-add the files that were removed.

@Shekharrajak
Author

Thanks @RussellSpitzer for sharing! Will re-adding the files update the manifest and make the data queryable? Can you please share your solution or the APIs involved?

@manuzhang
Collaborator

I think Russell is referring to https://iceberg.apache.org/docs/nightly/api/#update-operations

@Shekharrajak
Author

Thanks, but I am still not clear: how can we identify the data files in the old snapshot for specific rows or partitions and add them to the latest snapshot?

@RussellSpitzer
Member

You have a lot of options: you can read the files while time traveling, check the metadata tables, or read the manifests directly.

Say I want to revert the files removed in snapshot A. I'd scan the entries metadata table for all data files that were removed in that snapshot and collect all the DataFile info. Then do something like (pseudo-code) table.newAppend().appendFile(file1).appendFile(file2)....
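The steps described above can be sketched with a toy in-memory model. This is not the Iceberg API: `ManifestEntry`, `removed_in_snapshot`, and `append_files` are hypothetical names used only for illustration, and the file paths and snapshot IDs are made up. The one detail taken from Iceberg is the manifest-entry status code (0 = existing, 1 = added, 2 = deleted) from the table spec. In a real table the last step would be the `table.newAppend().appendFile(...)` chain from the comment above, followed by a commit.

```python
from dataclasses import dataclass

# Manifest-entry status codes as defined in the Iceberg table spec.
EXISTING, ADDED, DELETED = 0, 1, 2


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical stand-in for one row of the entries metadata table."""
    status: int
    snapshot_id: int
    file_path: str
    partition: str


def removed_in_snapshot(entries, snapshot_id):
    """Return the paths of data files marked DELETED by the given snapshot."""
    return [
        e.file_path
        for e in entries
        if e.status == DELETED and e.snapshot_id == snapshot_id
    ]


def append_files(live_files, files_to_restore):
    """Model an append: a new file list that re-adds the restored files.

    Files that are already live are skipped so nothing is added twice.
    """
    restored = [f for f in files_to_restore if f not in live_files]
    return live_files + restored


# Illustrative history: snapshot 2 dropped partition x, snapshot 3 added z.
entries = [
    ManifestEntry(ADDED, 1, "data/x/file1.parquet", "x"),
    ManifestEntry(ADDED, 1, "data/y/file2.parquet", "y"),
    ManifestEntry(DELETED, 2, "data/x/file1.parquet", "x"),
    ManifestEntry(ADDED, 3, "data/z/file3.parquet", "z"),
]
live_files = ["data/y/file2.parquet", "data/z/file3.parquet"]

to_restore = removed_in_snapshot(entries, snapshot_id=2)
new_live_files = append_files(live_files, to_restore)
print(new_live_files)
```

The model keeps the original file list untouched and returns a new one, which mirrors the point made earlier in the thread that restoring files always produces a new snapshot instead of modifying an old one.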

@Shekharrajak
Author

Shekharrajak commented Feb 21, 2025

Thanks @RussellSpitzer, let me try. In the meantime, if you already have an example, please share it.

Also, I would like to re-add data files only for specific partition timestamps; not all deleted data files need to be reverted.
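Restoring only some partitions amounts to filtering the removed files before appending them. The sketch below is a plain Python illustration under stated assumptions, not Iceberg code: `RemovedFile` and `select_for_restore` are hypothetical names, and the day-valued partition column is assumed for the example. Only the status code 2 (deleted) comes from the Iceberg table spec.

```python
from dataclasses import dataclass

DELETED = 2  # manifest-entry status code for a removed file in the Iceberg spec


@dataclass(frozen=True)
class RemovedFile:
    """Hypothetical record for a data file found in a manifest entry."""
    status: int
    file_path: str
    partition_day: str  # assumed day-partition value, e.g. "2025-02-01"


def select_for_restore(entries, wanted_days):
    """Keep only removed files whose partition value is in wanted_days."""
    wanted = set(wanted_days)
    return [
        e.file_path
        for e in entries
        if e.status == DELETED and e.partition_day in wanted
    ]


# Illustrative input: three removed files in three different day partitions.
removed = [
    RemovedFile(DELETED, "data/day=2025-02-01/a.parquet", "2025-02-01"),
    RemovedFile(DELETED, "data/day=2025-02-02/b.parquet", "2025-02-02"),
    RemovedFile(DELETED, "data/day=2025-02-03/c.parquet", "2025-02-03"),
]

selected = select_for_restore(removed, ["2025-02-01", "2025-02-03"])
print(selected)
```

Only the selected files would then be passed to the append step, so the untouched partitions stay deleted.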

@Shekharrajak
Author

@RussellSpitzer
Member

RussellSpitzer commented Feb 21, 2025 via email


3 participants