Print un-pretty metadata JSON files without whitespace #12281

Open
1 of 3 tasks
istreeter opened this issue Feb 15, 2025 · 5 comments · May be fixed by #12318
Labels
improvement PR that improves existing functionality

Comments

@istreeter

Feature Request / Improvement

Currently, metadata files are pretty-printed, with lots of newlines and indentation. This is the relevant line of code, which uses the Jackson default pretty printer.

If we could write metadata files without redundant whitespace, it would save storage space and reduce network overhead.

This will have the most impact for tables with large metadata files. For example, I have seen a metadata file that was 53.6MB. After removing whitespace, this was reduced to 41.4MB. I have read other issues on GitHub which mention gigabyte-scale metadata files, e.g. in #9734.
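To make the size difference concrete, here is a minimal sketch using Python's standard `json` module rather than Jackson. The `metadata` document and its field names are made up for illustration and are not a real Iceberg metadata file, so the saving it prints will differ from the figures quoted above.

```python
import json

# Hypothetical stand-in for table metadata; field names are illustrative only.
metadata = {
    "format-version": 2,
    "snapshots": [
        {"snapshot-id": i, "summary": {"operation": "append"}}
        for i in range(1000)
    ],
}

# Pretty-printed form: one value per line, two-space indentation.
pretty = json.dumps(metadata, indent=2)
# Compact form: no whitespace between tokens.
compact = json.dumps(metadata, separators=(",", ":"))

saving = 1 - len(compact) / len(pretty)
print(f"pretty={len(pretty)} chars, compact={len(compact)} chars, saving={saving:.1%}")

# The same calculation applied to the file sizes reported in this issue.
reported_saving = 1 - 41.4 / 53.6
print(f"reported saving: {reported_saving:.1%}")  # → reported saving: 22.8%
```

The relative saving depends on how deeply nested the document is, since each nesting level adds indentation to every line beneath it.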

I cannot think of any downside to this suggested change. Metadata files are mainly read by machines, not humans. And if a human does want to inspect a metadata file, it is fairly easy to prettify a JSON file when needed.

I'd be happy to open a PR for this, if you think it's a good idea? It seems like an easy way to get a small but noticeable performance improvement for reads and writes.

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
@istreeter added the "improvement" label (PR that improves existing functionality) on Feb 15, 2025
@singhpk234
Contributor

This is interesting. Does removing the pretty-printing cause any issues for rendering in common UI tools or any other reader?

> I have seen a metadata file that was 53.6MB. After removing whitespace, this was reduced to 41.4MB.

A ~22% reduction in size sounds awesome! It looks like pretty-printing has existed in the codebase since day 1. It would be best to take this up on the dev list as well.

cc @bryanck, you might be interested ^^^

@istreeter
Author

> Does removing the pretty-printing cause any issues for rendering in common UI tools or any other reader?

It's hard to guess which UI tools get used to read Iceberg metadata files.

If someone opened it in a very primitive text editor, then yes, it would look ugly. But I'm fairly sure most modern developer tools have a way to format JSON for viewing.

> It would be best to take this up on the dev list as well.

Good idea -- I will join the dev list and ask the question over there.

@bryanck
Contributor

bryanck commented Feb 17, 2025

Turning on metadata file compression is the best option to reduce size, though stripping out formatting did yield ~5% savings in one large file I tested even after compressing. It's easy enough to run it through jq if you want it formatted.
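A rough way to reproduce this kind of comparison is sketched below. It assumes gzip as the compression codec and uses a made-up, highly repetitive document (the `doc` structure and paths are hypothetical), so the numbers it prints say nothing about real metadata files.

```python
import gzip
import json

# Hypothetical metadata-like document; the structure and paths are illustrative only.
doc = {
    "manifests": [
        {"path": f"s3://bucket/m-{i}.avro", "length": 1000 + i}
        for i in range(2000)
    ]
}

pretty = json.dumps(doc, indent=2).encode("utf-8")
compact = json.dumps(doc, separators=(",", ":")).encode("utf-8")

# Record (raw size, gzip size) for each formatting choice.
sizes = {}
for name, raw in (("pretty", pretty), ("compact", compact)):
    sizes[name] = (len(raw), len(gzip.compress(raw)))
    print(f"{name}: raw={sizes[name][0]} bytes, gzip={sizes[name][1]} bytes")
```

Comparing the two gzip sizes shows how much of the whitespace overhead survives compression for a given file.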

@bryanck
Contributor

bryanck commented Feb 17, 2025

I'm also curious whether exploring support for a binary format would make sense.

@istreeter
Author

In terms of just reducing file size, yes, you're right: either compression or a binary format would have more impact. But a binary format sounds like a difficult breaking change. It could be difficult to roll out, and even more difficult to roll back.

Stripping whitespace, by contrast, seems like a much easier win. If we assume parsers are indifferent to whitespace, then this is completely non-breaking.
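The whitespace assumption matches the JSON grammar, which treats whitespace between tokens as insignificant. A small sketch with Python's standard `json` module (the sample document is hypothetical) shows that the compact and pretty forms parse to the same value, and that a compact file can be re-formatted on demand:

```python
import json

# Hypothetical compact document; note the space inside the string value.
compact = '{"format-version":2,"properties":{"owner":"a b"},"snapshots":[]}'

# Re-format for human inspection, similar in spirit to piping through jq.
pretty = json.dumps(json.loads(compact), indent=2)

# Both forms decode to the same value; only inter-token whitespace differs.
assert json.loads(pretty) == json.loads(compact)
print(pretty)
```

Whitespace inside string values, such as the space in `"a b"`, is data and is left untouched by either form.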

By the way, I just sent a mail to the dev-list about this issue.


3 participants