Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement write.metadata.delete-after-commit.enabled to clean up old metadata files #1607

Merged
merged 18 commits into from
Feb 12, 2025

Conversation

kaushiksrini
Copy link
Contributor

Implements property write.metadata.delete-after-commit.enabled from https://iceberg.apache.org/docs/1.5.1/maintenance/#remove-old-metadata-files.

Closes #1199

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @kaushiksrini 🙌 This looks great.

I have one minor style comment. Could you also add this to the docs? Thanks!

Comment on lines 1642 to 1646
transaction = table.transaction()
update = transaction.update_schema()
update.add_column(path=f"new_column_{i}", field_type=IntegerType())
update.commit()
transaction.commit_transaction()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use the context managers? I think that's easier to read

Suggested change
transaction = table.transaction()
update = transaction.update_schema()
update.add_column(path=f"new_column_{i}", field_type=IntegerType())
update.commit()
transaction.commit_transaction()
with table.transaction() as transaction:
with transaction.update_schema() as update:
update.add_column(path=f"new_column_{i}", field_type=IntegerType())

@kaushiksrini
Copy link
Contributor Author

@Fokko thanks! used context managers and added to the documentation

@@ -63,6 +63,7 @@ Iceberg tables support table properties to configure table behavior.
| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk |
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
| `write.metadata.delete-after-commit.enabled` | Boolean | False | Whether to automatically delete old *tracked* metadata files after each table commit. It will retain a number of the most recent metadata files, which can be set using property `write.metadata.previous-versions-max`. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 867 to 868
if current_table is not None:
self._delete_old_metadata(io, current_table.metadata, updated_metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the right place for this operation?
looking at the java implementation, changes are committed to the table before old metadata files are deleted
https://github.com/apache/iceberg/blob/d935460bbed5da0ac205a67912b70fa682ee84df/core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java#L125-L126

At this point, changes are only applied to the table metadata but is not yet committed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_delete_old_metadata should be after table is committed

@kaushiksrini
Copy link
Contributor Author

@kevinjqliu thanks for the review! I moved the logic after the table is committed

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM! Thanks for the PR. I think _do_commit is the right place to add this.

pyiceberg/catalog/__init__.py Outdated Show resolved Hide resolved

# https://github.com/apache/iceberg/blob/f6faa58/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L527
# delete old metadata if METADATA_DELETE_AFTER_COMMIT_ENABLED is set to true
self.catalog._delete_old_metadata(self.io, self.metadata, response.metadata)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to https://github.com/apache/iceberg/blob/f6faa58dac57e03be6e02a43937ac7c15c770225/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L539-L544

can we add a comment here explaining how METADATA_PREVIOUS_VERSIONS_MAX is taken into account?

TableProperties.METADATA_PREVIOUS_VERSIONS_MAX,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also maybe we want to wrap this in try/catch and throw a warning as to not block the commit process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a try-catch, but the deleteFiles already has a warning for actually deleting the files. I could only see an exception arise from _delete_old_metadata's operations.

def delete_files(io: FileIO, files_to_delete: Set[str], file_type: str) -> None:
"""Delete files.
Log warnings if failing to delete any file.
Args:
io: The FileIO used to delete the object.
files_to_delete: A set of file paths to be deleted.
file_type: The type of the file.
"""
for file in files_to_delete:
try:
io.delete(file)
except OSError as exc:
logger.warning(msg=f"Failed to delete {file_type} file {file}", exc_info=exc)

tests/catalog/test_sql.py Outdated Show resolved Hide resolved
tests/catalog/test_sql.py Show resolved Hide resolved
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for following up

@Fokko Fokko changed the title Remove old metadata Clean up old metadata Feb 11, 2025
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested this locally, and it works great. Thanks @kaushiksrini for working on this, and thanks @kevinjqliu for the review 👍

@kaushiksrini
Copy link
Contributor Author

kaushiksrini commented Feb 12, 2025

@Fokko thanks for the review! and @kevinjqliu thanks for making the changes + review!

@kevinjqliu kevinjqliu merged commit d6dce6d into apache:main Feb 12, 2025
8 checks passed
@kevinjqliu
Copy link
Contributor

Thanks @kaushiksrini I applied the simple test changes via github. Thanks @Fokko for the review

@kevinjqliu kevinjqliu changed the title Clean up old metadata Implement write.metadata.delete-after-commit.enabled to clean up old metadata files Feb 12, 2025
@kevinjqliu kevinjqliu changed the title Implement write.metadata.delete-after-commit.enabled to clean up old metadata files Implement write.metadata.delete-after-commit.enabled to clean up old metadata files Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove old metadata files
3 participants