API to find out the number of datafiles deleted #12288

Open
Shekharrajak opened this issue Feb 16, 2025 · 4 comments
Labels
API question Further information is requested

Comments

@Shekharrajak

Query engine

JAVA API

Question

Using the Java API, I would like to find out how many data files were deleted by my delete operation, i.e. in the current snapshot compared with the previous snapshot.

Can anyone share how to achieve this? If there is no API for it, please review my code and comment on whether it is a good way of doing it:

    Snapshot snapshot = table.snapshot(latestSnapshotId);
    List<ManifestFile> manifestFiles = snapshot.allManifests(table.io());
    for (ManifestFile manifest : manifestFiles) {
        try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, table.io())) {
            for (DataFile file : reader) {
                if (manifest.hasDeletedFiles()) { // This ensures we're checking deleted files
                    LOGGER.info("Deleted DataFile: {}, Partition: {}, Manifest: {}",
                            file.path(), file.partition(), manifest.path());
                }
            }
        }
    }
@Shekharrajak Shekharrajak added the question Further information is requested label Feb 16, 2025
@singhpk234
Contributor

If it's just the count, you can use the snapshot summary: https://iceberg.apache.org/spec/?h=spec#metrics
People also use the partition summary to get a partition-level breakdown.
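
To make the summary approach concrete, here is a minimal sketch that reads counters from the `Map<String, String>` returned by `Snapshot.summary()`. The class `DeleteSummary` and its methods are hypothetical helpers, not part of the Iceberg API; the keys `deleted-data-files` and `added-delete-files` are the summary property names from the spec. It assumes that a non-zero `deleted-data-files` means whole files were removed from the table and that a non-zero `added-delete-files` means row-level deletes were written. A copy-on-write delete typically also removes files and re-adds the surviving rows in new ones, so it may be worth checking `added-data-files` as well.

```java
import java.util.Map;

// Hypothetical helper, not part of the Iceberg API. It only inspects the
// Map<String, String> that Snapshot.summary() returns.
final class DeleteSummary {
    enum Kind { NONE, WHOLE_FILES, ROW_LEVEL, BOTH }

    private DeleteSummary() {}

    // Summary values are strings; a missing key is treated as zero.
    static long count(Map<String, String> summary, String key) {
        String value = summary.get(key);
        return value == null ? 0L : Long.parseLong(value);
    }

    // Classifies a snapshot by which delete-related counters are non-zero.
    static Kind classify(Map<String, String> summary) {
        boolean removedFiles = count(summary, "deleted-data-files") > 0;
        boolean addedDeleteFiles = count(summary, "added-delete-files") > 0;
        if (removedFiles && addedDeleteFiles) {
            return Kind.BOTH;
        }
        if (removedFiles) {
            return Kind.WHOLE_FILES;
        }
        if (addedDeleteFiles) {
            return Kind.ROW_LEVEL;
        }
        return Kind.NONE;
    }
}
```

With a real table this would be called as `DeleteSummary.classify(table.currentSnapshot().summary())`.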

@Shekharrajak
Author

I found a test case:

Map<String, String> summary = table.currentSnapshot().summary();
assertThat(summary.get("deleted-data-files"))
    .as("Deleted files count must match")
    .isEqualTo("4");

But I want to understand whether the whole data file is marked as deleted or only some rows of the data file are marked as deleted. Can the file still contain other rows that were not removed by the delete?

@manuzhang
Collaborator

manuzhang commented Feb 17, 2025

Here's a solution (using Spark SQL as an example).

  1. Find the position delete files added by a snapshot from the entries metadata table, filtering on its snapshot_id.
select data_file.file_path from $db.$table.entries where snapshot_id='$snapshot_id' and data_file.content=1;
  2. Collect the affected data files from the contents of those position delete files, whose columns are file_path and pos.
select * from parquet.`$position_delete_file.parquet`
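
Once the rows of the position delete files have been read, the affected data files and the number of rows deleted in each can be derived by grouping on the data file path. A minimal Java sketch of that grouping step follows. It assumes the path column values have already been loaded into a list; the class and method names are hypothetical and not part of Iceberg.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the Iceberg API.
final class PositionDeleteGrouping {
    private PositionDeleteGrouping() {}

    // Input: one data file path per position delete row.
    // Output: data file path -> number of rows deleted in that file.
    static Map<String, Long> deletedRowsPerDataFile(List<String> dataFilePaths) {
        Map<String, Long> counts = new TreeMap<>();
        for (String path : dataFilePaths) {
            counts.merge(path, 1L, Long::sum);
        }
        return counts;
    }
}
```

The size of the returned map is the number of data files touched by row-level deletes. Comparing each count with the data file's record count would show whether any rows in it survive.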

@manuzhang manuzhang added the API label Feb 17, 2025
@Fokko
Contributor

Fokko commented Feb 17, 2025

Hey @Shekharrajak, I think you're already pretty close, but I'm not sure I understand what you're looking for. The snapshot summary properties are described here, maybe that helps :)
