Print un-pretty metadata JSON files without whitespace #12281
Comments
This is interesting. Does removing the pretty-printing cause any issues for rendering in common UI tools or any other reader?
A ~22% reduction in file size sounds awesome! It looks like pretty-printing has existed in the codebase since day one. It would be best to take this up on the dev list as well. cc @bryanck, you might be interested ^^^
It's hard to guess which UI tools get used to read Iceberg metadata files. If someone opened one in a very primitive text editor, then yes, it would look ugly. But I'm fairly sure most modern developer tools have a way to format JSON for viewing.
Good idea. I will join the dev list and ask the question over there.
Turning on metadata file compression is the best option to reduce size, though stripping out formatting did yield ~5% savings in one large file I tested even after compressing. It's easy enough to run it through
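To make the comparison above concrete, here is a minimal Python sketch that gzips the same document in pretty and compact form and prints the four sizes. It is only an illustration: the document is synthetic, `build_sample_doc` and `gzip_size` are hypothetical helper names, and the numbers it prints say nothing about real Iceberg metadata files or about the ~5% figure quoted above.

```python
import gzip
import json


def build_sample_doc(num_snapshots=1000):
    """Build a synthetic, metadata-like document (illustrative fields only)."""
    return {
        "format-version": 2,
        "snapshots": [
            {
                "snapshot-id": i,
                "timestamp-ms": 1700000000000 + i,
                "manifest-list": f"s3://example-bucket/table/metadata/snap-{i}.avro",
            }
            for i in range(num_snapshots)
        ],
    }


def gzip_size(payload: bytes) -> int:
    """Return the size in bytes of the gzip-compressed payload."""
    return len(gzip.compress(payload))


doc = build_sample_doc()
# Pretty form: newlines plus two-space indentation.
pretty = json.dumps(doc, indent=2).encode("utf-8")
# Compact form: no whitespace between tokens.
compact = json.dumps(doc, separators=(",", ":")).encode("utf-8")

for label, payload in (("pretty", pretty), ("compact", compact)):
    print(f"{label:8s} raw={len(payload):>8d} bytes  gzip={gzip_size(payload):>8d} bytes")
```

Running this on a real metadata file instead of the synthetic document would show how much of the whitespace saving survives compression for that particular table.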
I'm also curious whether exploring support for a binary format would make sense.
In terms of just reducing file size, yes, you're right: either compression or a binary format has more impact. But that sounds like a difficult breaking change. It could be difficult to roll out, and even more difficult to roll back from. Stripping whitespace, by contrast, seems like a much easier win. If we assume parsers are indifferent to whitespace, then it is completely non-breaking. By the way, I just sent a mail to the dev list about this issue.
Feature Request / Improvement
Currently, metadata files are pretty-printed, with lots of newlines and whitespace indentation. This is the relevant line of code, which uses the Jackson default pretty printer.
If we could write metadata files without redundant whitespace, it would save some storage space and network overhead.
This will have the most impact for tables with large metadata files. For example, I have seen a metadata file which was 53.6MB. After removing whitespace, it was reduced to 41.4MB. I have read other issues on GitHub which mention gigabyte-scale metadata files, e.g. in #9734.
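The size difference can be sketched with a few lines of Python. This is an illustration only, not Iceberg's Java/Jackson writer: the document is synthetic, the field names are a small illustrative subset, and `build_sample_metadata`, `serialized_sizes` and `relative_reduction` are hypothetical helpers. The last line applies the same arithmetic to the two sizes reported above, which works out to roughly a 22.8% reduction.

```python
import json


def build_sample_metadata(num_snapshots=1000):
    """Build a synthetic, metadata-like document (illustrative fields only)."""
    return {
        "format-version": 2,
        "location": "s3://example-bucket/warehouse/db/table",
        "snapshots": [
            {
                "snapshot-id": i,
                "timestamp-ms": 1700000000000 + i,
                "summary": {"operation": "append", "added-data-files": "1"},
                "manifest-list": f"s3://example-bucket/warehouse/db/table/metadata/snap-{i}.avro",
            }
            for i in range(num_snapshots)
        ],
    }


def serialized_sizes(doc):
    """Return (pretty_bytes, compact_bytes) for the same document."""
    pretty = json.dumps(doc, indent=2).encode("utf-8")
    compact = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    return len(pretty), len(compact)


def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


pretty_size, compact_size = serialized_sizes(build_sample_metadata())
print(f"synthetic: {pretty_size} -> {compact_size} bytes "
      f"({relative_reduction(pretty_size, compact_size):.1%} smaller)")
# Same arithmetic applied to the file sizes reported in this issue.
print(f"reported:  53.6MB -> 41.4MB ({relative_reduction(53.6, 41.4):.1%} smaller)")
```

The saving depends on how deeply nested the document is, so the synthetic percentage should not be read as a prediction for any real table.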
I cannot think of any downside to this suggested change. Metadata files are mainly read by machines, not humans. And if a human does want to inspect a metadata file, it is fairly easy to prettify a JSON file when needed.
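As a minimal sketch of prettifying on demand, the Python snippet below re-indents a compact JSON string for human inspection and checks that both forms parse to the same value. The sample string is a made-up fragment, and `prettify` is a hypothetical helper, not part of any Iceberg tooling.

```python
import json


def prettify(compact_text: str, indent: int = 2) -> str:
    """Re-serialize JSON text with indentation for human inspection."""
    return json.dumps(json.loads(compact_text), indent=indent)


# A made-up compact fragment standing in for a metadata file's contents.
compact_text = '{"format-version":2,"snapshots":[{"snapshot-id":1,"summary":{"operation":"append"}}]}'

pretty_text = prettify(compact_text)
print(pretty_text)

# Whitespace between tokens carries no meaning in JSON, so both forms
# parse to the same value.
assert json.loads(pretty_text) == json.loads(compact_text)
```

From a shell, `python -m json.tool <file>` gives the same kind of indented output without writing any code.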
I'd be happy to open a PR for this, if you think it's a good idea? It seems like an easy way to get a small but noticeable performance improvement for reads and writes.
Query engine
None
Willingness to contribute