Implement ZstdZarrCompressor #149

Open: wants to merge 6 commits into master

mkitti
Member

@mkitti mkitti commented Jun 25, 2024

This implements ZstdZarrCompressor which wraps around CodecZstd as a package extension.

Part of the complication of using package extensions is getting a reference to new types defined in the extension. I created a mechanism by which you could specify the compressor as a string, which would then lookup the type from a dictionary.
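The string lookup described above could look roughly like the sketch below. This is only an illustration under stated assumptions: `DemoZstdCompressor` and `resolve_compressor` are hypothetical names, the default level of 3 is arbitrary, and the dictionary is a stand-in rather than the actual Zarr.jl definition.

```julia
# Hypothetical stand-ins; these are not the actual Zarr.jl definitions.
abstract type Compressor end

struct DemoZstdCompressor <: Compressor
    level::Int
end
DemoZstdCompressor() = DemoZstdCompressor(3)  # assumed default level

# Registry from codec name to compressor type. A package extension could
# add its own entry when it is loaded.
const compressortypes = Dict{String, Type{<:Compressor}}()
compressortypes["zstd"] = DemoZstdCompressor

# Accept either a ready-made compressor instance or a registered name.
resolve_compressor(c::Compressor) = c
function resolve_compressor(name::AbstractString)
    key = String(name)
    haskey(compressortypes, key) || throw(ArgumentError("unknown compressor \"$key\""))
    return compressortypes[key]()  # default-constructed instance
end
```

With this shape, the caller never needs a reference to the type defined in the extension: the name is enough for default settings, and an explicit instance can still be passed through unchanged.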

I'm also wondering if there might be a general way to wrap TranscodingStreams codecs into Zarr compressors.

@coveralls

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9654302116

Details

  • 27 of 34 (79.41%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 88.316%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
src/Compressors.jl | 4 | 11 | 36.36%

Files with Coverage Reduction | New Missed Lines | %
src/Compressors.jl | 1 | 83.08%
Totals Coverage Status
Change from base Build 8981180163: -0.3%
Covered Lines: 839
Relevant Lines: 950

💛 - Coveralls

@coveralls

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9655786902

Details

  • 25 of 32 (78.13%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.4%) to 88.291%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
src/Compressors.jl | 4 | 11 | 36.36%

Files with Coverage Reduction | New Missed Lines | %
src/Compressors.jl | 1 | 83.08%
Totals Coverage Status
Change from base Build 8981180163: -0.4%
Covered Lines: 837
Relevant Lines: 948

💛 - Coveralls

@coveralls

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9656097799

Details

  • 26 of 32 (81.25%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.3%) to 88.397%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
src/Compressors.jl | 5 | 11 | 45.45%

Files with Coverage Reduction | New Missed Lines | %
src/Compressors.jl | 1 | 84.62%
Totals Coverage Status
Change from base Build 8981180163: -0.3%
Covered Lines: 838
Relevant Lines: 948

💛 - Coveralls

@coveralls

coveralls commented Jun 25, 2024

Pull Request Test Coverage Report for Build 9657292182

Details

  • 32 of 32 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.5%) to 89.135%

Totals Coverage Status
Change from base Build 8981180163: 0.5%
Covered Lines: 845
Relevant Lines: 948

💛 - Coveralls

@mkitti
Member Author

mkitti commented Jun 25, 2024

An alternative to the string lookup for the compressor would be to pass in a CodecZstd.ZstdCompressor directly, specifically an instance created by CodecZstd.ZstdFrameCompressor(). Via a conversion mechanism, we could wrap that into a Zarr.Compressor.
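One way such a conversion mechanism might work is through a `Base.convert` method, which Julia's default constructors call when filling a typed field. The sketch below uses hypothetical stand-in types (`DemoFrameCodec`, `DemoWrapped`, `DemoArrayMeta`) rather than the real CodecZstd or Zarr.jl types, so it shows the idea only, not the PR's implementation.

```julia
# Hypothetical stand-ins for Zarr.Compressor and a codec from another package.
abstract type Compressor end

struct DemoFrameCodec
    level::Int
end

# Wrapper that makes the foreign codec usable where a Compressor is expected.
struct DemoWrapped <: Compressor
    codec::DemoFrameCodec
end

# The conversion hook: wrap the foreign codec on demand.
Base.convert(::Type{Compressor}, c::DemoFrameCodec) = DemoWrapped(c)

# The default constructor converts its argument to the declared field type,
# so a DemoFrameCodec can be passed directly.
struct DemoArrayMeta
    compressor::Compressor
end

meta = DemoArrayMeta(DemoFrameCodec(5))
```

Here `meta.compressor` ends up as a `DemoWrapped`, without the caller naming the wrapper type.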

Comment on lines +16 to +19
struct ZstdZarrCompressor <: Zarr.Compressor
    compressor::CodecZstd.ZstdCompressor
    decompressor::CodecZstd.ZstdDecompressor
end
Member

This doesn't support multithreaded use IIUC. I think this should be like

Zarr.jl/src/Compressors.jl

Lines 129 to 131 in f436713

struct ZlibCompressor <: Compressor
    clevel::Int
end
where the struct only contains the parameters of the codec.
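To illustrate why a parameters-only struct helps with multithreaded use, here is a minimal sketch. It assumes a hypothetical mutable `DemoContext` standing in for a stateful codec object, and `demo_compress` merely copies bytes where a real codec would compress them. The point is only that each call creates its own mutable state, so the shared compressor value stays immutable.

```julia
abstract type Compressor end

# Hypothetical stand-in for a stateful codec context.
mutable struct DemoContext
    level::Int
    buffer::Vector{UInt8}
end

# Immutable, parameters-only compressor in the style of ZlibCompressor.
struct DemoParamCompressor <: Compressor
    clevel::Int
end

# Every call builds its own context, so concurrent calls share no mutable state.
function demo_compress(c::DemoParamCompressor, data::AbstractVector{UInt8})
    ctx = DemoContext(c.clevel, UInt8[])
    append!(ctx.buffer, data)  # placeholder for the real compression work
    return ctx.buffer
end

# The same compressor value can be used from several tasks at once.
c = DemoParamCompressor(3)
tasks = map(1:8) do i
    Threads.@spawn demo_compress(c, fill(UInt8(i), 4))
end
results = fetch.(tasks)
```

The cost of this design is that a context is allocated per call; whether that matters for real codecs is not something this sketch measures.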

Member Author

I initially wrote it like this, but then I was thinking about all the other potential parameters, even if they do not need to be serialized. I think what we should implement is the ability to copy a compressor.

Frankly, I'm somewhat confused about why one actually needs to serialize the compression level into the array metadata. You do not need that information to decompress the data.

Collaborator

Maybe I don't understand this correctly, but what would happen in a scenario where a user opens an existing array and wants to add some new data? Of course one can set a different compression level for the new chunks, but for consistency of the dataset I think it is good to write all compression parameters to the metadata struct

Collaborator

But this implementation respects all of this, and the other compressors in Zarr.jl currently don't work multithreaded either, so this is OK from my side.

Member

In addition to the thread safety issues, this also leaks memory.

Member

For other potential parameters what about something like:
https://github.com/nhz2/ChunkCodecs.jl/blob/799b154bd400633f0ae3bd1cf78d0cc95957f2cf/ChunkCodecLibZstd/src/encode.jl#L21-L25

struct ZstdEncodeOptions <: EncodeOptions
    compressionLevel::Cint
    checksum::Bool
    advanced_parameters::Vector{Pair{Cint, Cint}}
end

Where the advanced parameters are set with ZSTD_CCtx_setParameter after the compression level and checksum options are set.
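The ordering described here can be sketched without calling into libzstd at all. The helper `parameter_sequence` below is hypothetical: it only builds the list of parameter and value pairs in the order they would be handed to `ZSTD_CCtx_setParameter`. The numeric constants mirror the `ZSTD_cParameter` enum values in `zstd.h`, and the `EncodeOptions` supertype is declared locally as a stand-in.

```julia
# Values of the corresponding ZSTD_cParameter enum members in zstd.h.
const ZSTD_c_compressionLevel = Cint(100)
const ZSTD_c_windowLog = Cint(101)
const ZSTD_c_checksumFlag = Cint(201)

abstract type EncodeOptions end  # local stand-in for the ChunkCodecs supertype

struct ZstdEncodeOptions <: EncodeOptions
    compressionLevel::Cint
    checksum::Bool
    advanced_parameters::Vector{Pair{Cint, Cint}}
end

# Hypothetical helper: the order in which parameters would be applied.
# Level and checksum come first, then the advanced parameters.
function parameter_sequence(o::ZstdEncodeOptions)
    seq = Pair{Cint, Cint}[
        ZSTD_c_compressionLevel => o.compressionLevel,
        ZSTD_c_checksumFlag => Cint(o.checksum),
    ]
    append!(seq, o.advanced_parameters)
    return seq
end

opts = ZstdEncodeOptions(5, true, [ZSTD_c_windowLog => Cint(20)])
seq = parameter_sequence(opts)
```

Because the advanced parameters are applied last, an entry there that repeats the level or checksum parameter would presumably override the earlier setting.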

@lazarusA

bump

@nhz2
Member

nhz2 commented Dec 19, 2024

I've been working on this in https://github.com/nhz2/ChunkCodecs.jl/tree/main/ChunkCodecLibZstd


if compressor isa AbstractString
    if haskey(compressortypes, String(compressor))
        compressor = compressortypes[compressor]()
Collaborator

This would make it impossible to set custom compression levels for the compression algorithm. Do we need another keyword argument for zcreate that gets passed to the compressor constructor?

Member Author

The idea here is that the simple option of just passing a string will give you default compression options. If you want to specify the compression level, you can use the compressor's constructor and pass the instantiated compressor instance.

Collaborator

@meggart meggart left a comment

Thanks for the PR and sorry for missing it for such a long time. Probably we need to rebase and test this again. @mkitti in case you don't have the time right now I can try to rebase as well. Just let me know.

@mkitti
Member Author

mkitti commented Jan 4, 2025

As for how to proceed here, I do think @nhz2 has some legitimate concerns. The way I would fix the threading issues is by making a copy of the compressor and decompressor before proceeding with any operations.

The main reason for this is to keep all parameters and their validation code implemented in CodecZstd.jl rather than having to synchronize any changes here. Logic we have to encode in adapter code here will raise the maintenance burden going forward.

I'm not sure if there is an easy way to make this copy using the public Zstandard API. We might have to scan the parameters and then set them manually.

That said, I do think we should consider proceeding with this pull request, mainly because it provides a baseline implementation of Zstandard for Zarr.jl, which currently does not exist.

Subsequent pull requests from either myself, Nathan, or others can revise the underlying implementation.

The most important thing to get right here is the public API this presents to the Zarr.jl user.

Another outstanding issue is whether we intend this package to implement both Zarr v2 and Zarr v3, which have some important differences in how codecs are implemented, especially with regard to chains of codecs. The checksum parameter, for example, is now a requirement for the Zarr v3 codec, but I'm not sure if it is an accepted parameter for Zarr v2.

    transcode(z.compressor, a_uint8)
end

JSON.lower(z::ZstdZarrCompressor) = Dict("id"=>"zstd", "level" => z.compressor.level)
Member

This level property is not documented or tested in CodecZstd so it might be removed in a future non-breaking release. This is easy to fix by either documenting this property in CodecZstd, or keeping track of the compression level in the ZstdZarrCompressor struct.
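The second option, keeping the level in the struct, might look like the sketch below. `DemoZstdZarrCompressor` and `compressor_metadata` are hypothetical names: the function stands in for the `JSON.lower` method in the PR so that the example runs without the JSON package.

```julia
abstract type Compressor end

# Hypothetical variant that records the level itself instead of reading
# an undocumented field from the wrapped codec.
struct DemoZstdZarrCompressor <: Compressor
    level::Int
end

# Stand-in for JSON.lower: the dictionary written to the array metadata.
compressor_metadata(z::DemoZstdZarrCompressor) =
    Dict{String, Any}("id" => "zstd", "level" => z.level)

meta = compressor_metadata(DemoZstdZarrCompressor(5))
```

With the level stored here, serialization no longer depends on the internals of CodecZstd.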

@meggart
Collaborator

meggart commented Jan 6, 2025

Another outstanding issue is whether we intend this package to implement both Zarr v2 and Zarr v3, which have some important differences in how codecs are implemented, especially with regard to chains of codecs. The checksum parameter, for example, is now a requirement for the Zarr v3 codec, but I'm not sure if it is an accepted parameter for Zarr v2.

This is the main reason I would prefer to merge this PR soon as well. I would really like to start working on my v3 branch again, but would like to rebase on all outstanding PRs first, especially the ones involving codec interface changes, to see what would still be compatible with v3 and how much code can be reused or would have to be adapted.

@nhz2
Member

nhz2 commented Jan 6, 2025

Yes, it makes sense to merge this now and clean up things later, though I think in that case we should mark zstd support as experimental in the docs so we can more easily change the API in the future.
