-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fsspec compatible storage #115
Conversation
This looks good so far. A few ideas on your open questions:
I think we might add some marker for urls to be parsed via fsspec. For example, we could say that "fsspec:https://path/to/url" should be parsed by fsspec, so you add a regex to the
Yes it would be nice if this was optional somehow. I am wondering if the core fsspec wrapping would need to live inside Zarr.jl at all. I think making a FSSPEC.jl package which holds the PythonCall and CondaPkg dependencies plus some wrappers for parsing fsspec strings and wrapping their objects, would be quite intriguing. Fsspec itself has applications outside Zarr as well and the wrapper might be useful for other purposes as well. Then we could make and optional dependency on that package here and define the Zarr-specific things only if FSSPEC.jl is loaded.
Yes, all we need here for decompression is a pointer to the data, so I agree this could be done directly on the python memory.
Indeed, the shuffle filter implementation looks quite straightforward https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/_shuffle.pyx let me know if you need support in any way. |
I like that this method does have an advantage of simplicity in the implementation, as all I would need to do is modify
I think this is a really good idea! Though that would of course be quite a bit bigger than the scope of what's needed for this PR. I am thinking maybe for that a good starting point would be to try making a wrapper that implements the FilePathsBase.jl API?
I actually looked through the PythonCall.jl source code and found that there appears to be a convenient way to actually just expose the byte arrays as a Julia EDIT: Looks like this wouldn't work with |
I think it would not be a big problem to add a regex with
I think this would be up to you. In general, just installing the fsspec dependencies and exposing everything needed for the Zarr would already be worth a package. I does not have to be feature-complete from the start, but I can imagine it would be easy to add more general functionality or a FilePathsBase wrapper later since it would be only wrapping.
This would not be a huge problem for me. As long as this is well-tested we can do this. I think except for the kerchunk-stuff (which would be optional as well) you would not even have to make a lot of new test but you can simply re-use the existing Storage tests and just repeat them with an "fsspec"-prefix and then I think we should see if there are any problems arising. |
Alright, I think in that case a good starting point would be for the main FSSpec.jl package to just implement a using FSSpec
using Zarr
# fs is string macro for FSPath type defined in FSSpec.jl
url = fs"protocol://path/to/dataset"
g = zopen(url, consolidated=true) Where zopen(p::FSPath; kwargs...) = zopen(FSStore(p); kwargs...) |
This sounds very reasonable and I think the string macro is a great idea. |
Just to follow up on this, did you make any progress either on this PR or a potential FSSPec.jl? I would be very interested myself in this functionality so if there is some way I can help or contribute, let me know. |
Apologies, I have been a bit swamped with some important milestones coming up later this month and at home I have recently been adjusting with trying to raise two young kittens. I do plan to resume working on this but it will probably not be until the week after next. |
What is the current status of this PR? julia> ZGroup(st)
ERROR: Python: ClientConnectorCertificateError: Cannot connect to host mur-sst.s3.us-west-2.amazonaws.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')]
Python stacktrace:
[1] _wrap_create_connection
@ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:990
[2] _create_direct_connection
@ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:1200
[3] _create_direct_connection
@ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:1231
[4] _create_connection
@ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:907
[5] connect
@ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:540
[6] _request
@ aiohttp.client ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/client.py:573
[7] __aenter__
@ aiohttp.client ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/client.py:1186
[8] _cat_file
@ fsspec.implementations.http ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/implementations/http.py:226
[9] wait_for
@ asyncio.tasks ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/asyncio/tasks.py:510
[10] _cat
@ fsspec.asyn ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/asyn.py:446
[11] _runner
@ fsspec.asyn ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/asyn.py:56
[12] sync
@ fsspec.asyn ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/asyn.py:103
[13] wrapper
@ fsspec.asyn ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/asyn.py:118
[14] __getitem__
@ fsspec.mapping ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/mapping.py:151
Stacktrace:
[1] pythrow()
@ PythonCall ~/.julia/packages/PythonCall/wXfah/src/err.jl:94
[2] errcheck
@ ~/.julia/packages/PythonCall/wXfah/src/err.jl:10 [inlined]
[3] pygetitem(x::PythonCall.Py, k::String)
@ PythonCall ~/.julia/packages/PythonCall/wXfah/src/abstract/object.jl:169
[4] getindex
@ ~/.julia/packages/PythonCall/wXfah/src/Py.jl:296 [inlined]
[5] getindex
@ ~/.julia/dev/Zarr/src/Storage/fsspec.jl:24 [inlined]
[6] getindex(s::Zarr.FSStore, p::String, i::String)
@ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:54
[7] getattrs(s::Zarr.FSStore, p::String)
@ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:67
[8] ZGroup(s::Zarr.FSStore, mode::String, path::String; fill_as_missing::Bool)
@ Zarr ~/.julia/dev/Zarr/src/ZGroup.jl:30
[9] ZGroup
@ ~/.julia/dev/Zarr/src/ZGroup.jl:17 [inlined]
[10] ZGroup(s::Zarr.FSStore)
@ Zarr ~/.julia/dev/Zarr/src/ZGroup.jl:17
[11] top-level scope
@ REPL[11]:1 and when I try it with the following kerchunk file:
I get this error: julia> ZGroup(st)
ERROR: Python: KeyError: '.zattrs'
Python stacktrace:
[1] __getitem__
@ fsspec.mapping ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/mapping.py:155
Stacktrace:
[1] pythrow()
@ PythonCall ~/.julia/packages/PythonCall/wXfah/src/err.jl:94
[2] errcheck
@ ~/.julia/packages/PythonCall/wXfah/src/err.jl:10 [inlined]
[3] pygetitem(x::PythonCall.Py, k::String)
@ PythonCall ~/.julia/packages/PythonCall/wXfah/src/abstract/object.jl:169
[4] getindex
@ ~/.julia/packages/PythonCall/wXfah/src/Py.jl:296 [inlined]
[5] getindex
@ ~/.julia/dev/Zarr/src/Storage/fsspec.jl:24 [inlined]
[6] getindex(s::Zarr.FSStore, p::String, i::String)
@ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:54
[7] getattrs(s::Zarr.FSStore, p::String)
@ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:67
[8] ZGroup(s::Zarr.FSStore, mode::String, path::String; fill_as_missing::Bool)
@ Zarr ~/.julia/dev/Zarr/src/ZGroup.jl:30
[9] ZGroup
@ ~/.julia/dev/Zarr/src/ZGroup.jl:17 [inlined]
[10] ZGroup(s::Zarr.FSStore)
@ Zarr ~/.julia/dev/Zarr/src/ZGroup.jl:17
[11] top-level scope
@ REPL[14]:1 |
Pull Request Test Coverage Report for Build 4686001185Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
For the curious, the FSSpec element of this is implemented in asinghvi17/FSSpec.jl and there is a native Julia Kerchunk reader in asinghvi17/Kerchunk.jl. |
Here is a first draft of adding support for wrapping fsspec compatible storage into Zarr.jl. So far it seems to work as I would expect:
Still need to:
FSStore
and callzopen
as above, (2) Changezopen
to parse paths either natively or through fsspec with an optional argument (eguse_fsspec=true
), or (3) Have a separatezopen_py
orzopen_fsspec
that assumes the storage path is fsspec.Blosc.decompress!
to accept python arrays. It makes very little difference for the usual MUR SST benchmark I have been using since the raw chunks are very small, but could add a few seconds of time for larger chunks.Also unfortunately I couldn't get this to work with any kerchunk references files I had on hand yet since they all require the shuffle filter. This isn't implemented in Zarr.jl, but we should probably do this anyways as it's rather straightforward to add from looking at the code in numcodecs.