Implement ZstdZarrCompressor #149
An alternative to the string lookup for the compressor would be to just pass in
struct ZstdZarrCompressor <: Zarr.Compressor
    compressor::CodecZstd.ZstdCompressor
    decompressor::CodecZstd.ZstdDecompressor
end
This doesn't support multithreaded use IIUC. I think this should be like
Lines 129 to 131 in f436713
struct ZlibCompressor <: Compressor
    clevel::Int
end
I initially wrote it like this, but then I was thinking about all the other potential parameters, even if they do not need to be serialized. I think what we should implement is the ability to copy a compressor.
Frankly, I'm somewhat confused about why one actually needs to serialize the compression level into the array metadata. You do not need that information to decompress the data.
Maybe I don't understand this correctly, but what would happen in a scenario where a user opens an existing array and wants to add some new data? Of course one can set a different compression level for the new chunks, but for consistency of the dataset I think it is good to write all compression parameters to the metadata struct.
But this implementation respects all of this, and the other compressors in Zarr.jl currently don't work multithreaded either, so this is OK from my side.
In addition to the thread safety issues, this also leaks memory.
For other potential parameters what about something like:
https://github.com/nhz2/ChunkCodecs.jl/blob/799b154bd400633f0ae3bd1cf78d0cc95957f2cf/ChunkCodecLibZstd/src/encode.jl#L21-L25
struct ZstdEncodeOptions <: EncodeOptions
    compressionLevel::Cint
    checksum::Bool
    advanced_parameters::Vector{Pair{Cint, Cint}}
end
Where the advanced parameters are set with ZSTD_CCtx_setParameter after the compression level and checksum options are set.
bump
I've been working on this in https://github.com/nhz2/ChunkCodecs.jl/tree/main/ChunkCodecLibZstd
if compressor isa AbstractString
    if haskey(compressortypes, String(compressor))
        compressor = compressortypes[compressor]()
This would make it impossible to set custom compression levels for the compression algorithm. Do we need another keyword argument for zcreate that gets passed to the compressor constructor?
The idea here is that the simple option of just passing a string will give you default compression options. If you want to specify the compression level, you can use the compressor constructor and pass the instantiated compressor instance.
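To make the two call paths concrete, here is a minimal sketch of the lookup, not the code in this PR. `Compressor` is a stand-in for `Zarr.Compressor`, and `DemoZstdCompressor` and `resolve_compressor` are hypothetical names invented for illustration. Only `compressortypes` comes from the diff above, and the default level of 3 is an assumption.

```julia
# Stand-in for Zarr.Compressor so the sketch runs without Zarr.jl loaded.
abstract type Compressor end

# Hypothetical compressor type that only records its level.
struct DemoZstdCompressor <: Compressor
    level::Int
end

# Keyword constructor; the default level of 3 is an assumption for this sketch.
DemoZstdCompressor(; level::Integer = 3) = DemoZstdCompressor(Int(level))

# Registry mapping a string id to a compressor type (name taken from the diff).
const compressortypes = Dict{String, Type{<:Compressor}}("zstd" => DemoZstdCompressor)

# A string resolves to a default-constructed compressor;
# an already instantiated compressor is passed through unchanged.
function resolve_compressor(compressor)
    if compressor isa AbstractString
        key = String(compressor)
        haskey(compressortypes, key) || throw(ArgumentError("unknown compressor: $key"))
        return compressortypes[key]()
    end
    return compressor
end
```

With this shape, `resolve_compressor("zstd")` yields the defaults, while `resolve_compressor(DemoZstdCompressor(level = 9))` keeps the custom level, so no extra keyword argument is needed for the second case.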
Thanks for the PR and sorry for missing it for such a long time. Probably we need to rebase and test this again. @mkitti in case you don't have the time right now I can try to rebase as well. Just let me know.
As for how to proceed here, I do think @nhz2 has some legitimate concerns. The way I would fix the threading issues is by making a copy of the compressor and decompressor before proceeding with any operations. The main reason for this is to keep all parameters and their validation code implemented in CodecZstd.jl rather than having to synchronize any changes here. Logic we have to encode in adapter code here will raise the maintenance burden going forward. I'm not sure if there is an easy way to make this copy using the public Zstandard API. We might have to scan the parameters and then set them manually.

That said, I do think we should consider proceeding with this pull request, mainly because it provides a baseline implementation of Zstandard for Zarr.jl, which currently does not exist. Subsequent pull requests from myself, Nathan, or others can revise the underlying implementation. The most important thing to get right here is the public API this presents to the Zarr.jl user.

Another outstanding issue is whether we intend this package to implement both Zarr v2 and Zarr v3, which have some important differences in how codecs are implemented, especially with regard to chains of codecs. The checksum parameter, for example, is now a requirement for the Zarr v3 codec, but I'm not sure if it is an accepted parameter for Zarr v2.
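As a rough illustration of the copy-before-use idea for thread safety, here is a sketch that uses only Base. `DemoCodec`, `fresh_copy`, `demo_encode!`, and `encode_chunks` are all hypothetical names; the "encoding" is a placeholder byte reversal, not Zstandard, and nothing here reflects the actual CodecZstd API.

```julia
# Hypothetical codec with mutable internal state, standing in for a real
# compression context that is unsafe to share between tasks.
mutable struct DemoCodec
    level::Int
    scratch::Vector{UInt8}
end

# Copy the parameters only and give the copy its own private state.
fresh_copy(c::DemoCodec) = DemoCodec(c.level, UInt8[])

# Placeholder transform (byte reversal) that mutates the codec's scratch buffer.
function demo_encode!(c::DemoCodec, data::Vector{UInt8})
    empty!(c.scratch)
    append!(c.scratch, data)
    reverse!(c.scratch)
    return copy(c.scratch)
end

# Each task works on its own copy, so no mutable state is shared.
function encode_chunks(template::DemoCodec, chunks::Vector{Vector{UInt8}})
    tasks = map(chunks) do chunk
        Threads.@spawn demo_encode!(fresh_copy(template), chunk)
    end
    return fetch.(tasks)
end
```

The design choice is that the user-facing object acts as a template holding parameters, while each operation gets a private working copy. Whether such a copy can be made through the public Zstandard API is the open question raised above.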
    transcode(z.compressor, a_uint8)
end

JSON.lower(z::ZstdZarrCompressor) = Dict("id"=>"zstd", "level" => z.compressor.level)
This level property is not documented or tested in CodecZstd, so it might be removed in a future non-breaking release. This is easy to fix by either documenting this property in CodecZstd, or keeping track of the compression level in the ZstdZarrCompressor struct.
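A minimal sketch of the second option, keeping the level in the wrapper itself so that serialization never touches codec internals. `LevelTrackingZstdCompressor`, `to_metadata`, and `from_metadata` are hypothetical names; `to_metadata` plays the role of the `JSON.lower` method in the diff but is a plain function so the sketch needs no packages, and `Compressor` is a stand-in for `Zarr.Compressor`.

```julia
# Stand-in for Zarr.Compressor so the sketch runs without Zarr.jl loaded.
abstract type Compressor end

# Hypothetical wrapper that stores the level as its own field.
struct LevelTrackingZstdCompressor <: Compressor
    level::Int
end

# Build the metadata entry from the wrapper's own field,
# not from an undocumented property of the underlying codec.
to_metadata(z::LevelTrackingZstdCompressor) = Dict("id" => "zstd", "level" => z.level)

# Rebuild the wrapper from a parsed metadata dictionary.
from_metadata(d::AbstractDict) = LevelTrackingZstdCompressor(d["level"])
```

Because the struct is immutable plain data, this layout is also close to the ZlibCompressor pattern quoted earlier in the thread.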
This is the main reason I would prefer to merge this PR soon as well. I would really like to start working on my v3 branch again, but would like to rebase on all outstanding PRs first, especially the ones involving codec interface changes, to see what would still be compatible with v3 and how much code can be reused or would have to be adapted.
Yes, it makes sense to merge this now and clean up things later, though I think in that case we should mark zstd support as experimental in the docs so we can more easily change the API in the future.
This implements ZstdZarrCompressor which wraps around CodecZstd as a package extension.
Part of the complication of using package extensions is getting a reference to new types defined in the extension. I created a mechanism by which you can specify the compressor as a string, which then looks up the type in a dictionary.
I'm also wondering if there might be a general way to wrap TranscodingStreams codecs into Zarr compressors.