Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsspec compatible storage #115

Closed
wants to merge 1 commit into from
Closed

fsspec compatible storage #115

wants to merge 1 commit into from

Conversation

agoodm
Copy link

@agoodm agoodm commented Apr 13, 2023

Here is a first draft of adding support for wrapping fsspec compatible storage into Zarr.jl. So far it seems to work as I would expect:

julia> url = "https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1"
"https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1"

julia> g = zopen(url, consolidated=true)
ZarrGroup at Consolidated Consolidated Zarr.HTTPStore("https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1") and path 
Variables: lat analysed_sst analysis_error mask time lon sea_ice_fraction 

julia> s = g["analysed_sst"]
ZArray{Int16} of size 36000 x 17999 x 6443

julia> v1 = s[1:3600, 1:1799, 1:50];

julia> st = Zarr.FSStore(url)
Zarr.FSStore("https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1", <py fsspec.mapping.FSMap object at 0x7f8fca721360>)

julia> g2 = zopen(st, consolidated=true)
ZarrGroup at Consolidated Zarr.FSStore("https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1", <py fsspec.mapping.FSMap object at 0x7f8fca721360>) and path 
Variables: lat analysed_sst analysis_error mask time lon sea_ice_fraction 

julia> s2 = g2["analysed_sst"]
ZArray{Int16} of size 36000 x 17999 x 6443

julia> v2 = s2[1:3600, 1:1799, 1:50];

julia> all(v2 .== v1)
true

Still need to:

  • Decide on interface. Options include: (1) Manually wrap urls with FSStore and call zopen as above, (2) Change zopen to parse paths either natively or through fsspec with an optional argument (eg use_fsspec=true), or (3) Have a separate zopen_py or zopen_fsspec that assumes the storage path is fsspec.
  • Figure out how we are handling dependencies (PythonCall.jl is set to required for now but we may prefer to make it optional. Also may need to handle python deps via CondaPkg.jl)
  • For now raw chunk data is copied from python into Julia objects, but in principle it is actually possible to directly pass the native python chunk data directly to Blosc without always copying. I'll need to try overriding the Blosc.decompress! to accept python arrays. It makes very little difference for the usual MUR SST benchmark I have been using since the raw chunks are very small, but could add a few seconds of time for larger chunks.
  • Add tests

Also unfortunately I couldn't get this to work with any kerchunk references files I had on hand yet since they all require the shuffle filter. This isn't implemented in Zarr.jl, but we should probably do this anyways as it's rather straightforward to add from looking at the code in numcodecs.

@agoodm agoodm marked this pull request as draft April 13, 2023 20:07
@meggart
Copy link
Collaborator

meggart commented Apr 14, 2023

This looks good so far. A few ideas on your open questions:

Decide on interface. Options include: (1) Manually wrap urls with FSStore and call zopen as above, (2) Change zopen to parse paths either natively or through fsspec with an optional argument (eg use_fsspec=true), or (3) Have a separate zopen_py or zopen_fsspec that assumes the storage path is fsspec.

I think we might add some marker for urls to be parsed via fsspec. For example, we could say that "fsspec:https://path/to/url" should be parsed by fsspec, so you add a regex to the regexlist and define a storefromstring function that splits the fsspec from the front of the string and parses the rest.

Figure out how we are handling dependencies (PythonCall.jl is set to required for now but we may prefer to make it optional. Also may need to handle python deps via CondaPkg.jl)

Yes it would be nice if this was optional somehow. I am wondering if the core fsspec wrapping would need to live inside Zarr.jl at all. I think making a FSSPEC.jl package which holds the PythonCall and CondaPkg dependencies plus some wrappers for parsing fsspec strings and wrapping their objects, would be quite intriguing. Fsspec itself has applications outside Zarr as well and the wrapper might be useful for other purposes as well. Then we could make and optional dependency on that package here and define the Zarr-specific things only if FSSPEC.jl is loaded.

For now raw chunk data is copied from python into Julia objects, but in principle it is actually possible to directly pass the native python chunk data directly to Blosc without always copying. I'll need to try overriding the Blosc.decompress! to accept python arrays. It makes very little difference for the usual MUR SST benchmark I have been using since the raw chunks are very small, but could add a few seconds of time for larger chunks.

Yes, all we need here for decompression is a pointer to the data, so I agree this could be done directly on the python memory.

Also unfortunately I couldn't get this to work with any kerchunk references files I had on hand yet since they all require the shuffle filter. This isn't implemented in Zarr.jl, but we should probably do this anyways as it's rather straightforward to add from looking at the code in numcodecs.

Indeed, the shuffle filter implementation looks quite straightforward https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/_shuffle.pyx let me know if you need support in any way.

@agoodm
Copy link
Author

agoodm commented Apr 17, 2023

I think we might add some marker for urls to be parsed via fsspec. For example, we could say that "fsspec:https://path/to/url" should be parsed by fsspec, so you add a regex to the regexlist and define a storefromstring function that splits the fsspec from the front of the string and parses the rest.

I like that this method does have an advantage of simplicity in the implementation, as all I would need to do is modify storefromstring to default to FSStore if the path starts with fsspec:. The one disadvantage I see is that it would mean you would have to explicitly include this even for protocols that are not natively supported (eg "fsspec:reference://" instead of just "reference://" for kerchunk). But then again, the logic to support making that seamless would be more complicated and perhaps it might make sense in this context to always be explicit about using fsspec since it does require python.

Yes it would be nice if this was optional somehow. I am wondering if the core fsspec wrapping would need to live inside Zarr.jl at all. I think making a FSSPEC.jl package which holds the PythonCall and CondaPkg dependencies plus some wrappers for parsing fsspec strings and wrapping their objects, would be quite intriguing. Fsspec itself has applications outside Zarr as well and the wrapper might be useful for other purposes as well. Then we could make and optional dependency on that package here and define the Zarr-specific things only if FSSPEC.jl is loaded.

I think this is a really good idea! Though that would of course be quite a bit bigger than the scope of what's needed for this PR. I am thinking maybe for that a good starting point would be to try making a wrapper that implements the FilePathsBase.jl API?

Yes, all we need here for decompression is a pointer to the data, so I agree this could be done directly on the python memory.

I actually looked through the PythonCall.jl source code and found that there appears to be a convenient way to actually just expose the byte arrays as a Julia Vector{UInt8} without copying (https://github.com/cjdoris/PythonCall.jl/blob/b63bed8d4652d5428a89b001cb736a0baa4339d5/src/concrete/bytes.jl#L26-L29). That said this part of the API isn't officially documented and I should probably consult with the maintainers over there if using it would be a good idea in this context.

EDIT: Looks like this wouldn't work with Blosc.decompress! since resize! doesn't work on shared data. So directly calling blosc_decompress would still be needed.

@meggart
Copy link
Collaborator

meggart commented Apr 18, 2023

I like that this method does have an advantage of simplicity in the implementation, as all I would need to do is modify storefromstring to default to FSStore if the path starts with fsspec:. The one disadvantage I see is that it would mean you would have to explicitly include this even for protocols that are not natively supported (eg "fsspec:reference://" instead of just "reference://" for kerchunk). But then again, the logic to support making that seamless would be more complicated and perhaps it might make sense in this context to always be explicit about using fsspec since it does require python.

I think it would not be a big problem to add a regex with "reference://" to the regexlist and also just digest it with kerchunk, while forcing user to be explicit in other cases. On the other hand this might also become confusing, but there is always the option to create the store explicitly and using zopen on the store if one wants to use the python fsspec backend systematically.

I think this is a really good idea! Though that would of course be quite a bit bigger than the scope of what's needed for this PR. I am thinking maybe for that a good starting point would be to try making a wrapper that implements the FilePathsBase.jl API?

I think this would be up to you. In general, just installing the fsspec dependencies and exposing everything needed for the Zarr would already be worth a package. I does not have to be feature-complete from the start, but I can imagine it would be easy to add more general functionality or a FilePathsBase wrapper later since it would be only wrapping.

So directly calling blosc_decompress would still be needed.

This would not be a huge problem for me. As long as this is well-tested we can do this. I think except for the kerchunk-stuff (which would be optional as well) you would not even have to make a lot of new test but you can simply re-use the existing Storage tests and just repeat them with an "fsspec"-prefix and then I think we should see if there are any problems arising.

@agoodm
Copy link
Author

agoodm commented Apr 20, 2023

Alright, I think in that case a good starting point would be for the main FSSpec.jl package to just implement a FSPath type that wraps a URL (which will eventually be FilePathsBase compatible) and an FSMap that mainly exposes the fsspec mapper interface. Another thing I just thought of would be using a string macro like so:

using FSSpec
using Zarr

# fs is string macro for FSPath type defined in FSSpec.jl
url = fs"protocol://path/to/dataset" 
g = zopen(url, consolidated=true)

Where

zopen(p::FSPath; kwargs...) = zopen(FSStore(p); kwargs...)

@meggart
Copy link
Collaborator

meggart commented Apr 25, 2023

This sounds very reasonable and I think the string macro is a great idea.

@meggart
Copy link
Collaborator

meggart commented Jun 6, 2023

Just to follow up on this, did you make any progress either on this PR or a potential FSSPec.jl? I would be very interested myself in this functionality so if there is some way I can help or contribute, let me know.

@agoodm
Copy link
Author

agoodm commented Jun 10, 2023

Apologies, I have been a bit swamped with some important milestones coming up later this month and at home I have recently been adjusting with trying to raise two young kittens. I do plan to resume working on this but it will probably not be until the week after next.

@felixcremer
Copy link
Contributor

What is the current status of this PR?
I tried it today because there was some interest to have the possibility to open kerchunk files from Julia at the Big Data from Space conference. Please let me know if there is anything I could do to help push this forward?
When I run the example in the first post of this PR I get the following error:

julia> ZGroup(st)
ERROR: Python: ClientConnectorCertificateError: Cannot connect to host mur-sst.s3.us-west-2.amazonaws.com:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')]
Python stacktrace:
 [1] _wrap_create_connection
   @ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:990
 [2] _create_direct_connection
   @ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:1200
 [3] _create_direct_connection
   @ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:1231
 [4] _create_connection
   @ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:907
 [5] connect
   @ aiohttp.connector ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/connector.py:540
 [6] _request
   @ aiohttp.client ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/client.py:573
 [7] __aenter__
   @ aiohttp.client ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/aiohttp/client.py:1186
 [8] _cat_file
   @ fsspec.implementations.http ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/implementations/http.py:226
 [9] wait_for
   @ asyncio.tasks ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/asyncio/tasks.py:510
 [10] _cat
   @ fsspec.asyn ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/asyn.py:446
 [11] _runner
   @ fsspec.asyn ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/asyn.py:56
 [12] sync
   @ fsspec.asyn ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/asyn.py:103
 [13] wrapper
   @ fsspec.asyn ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/asyn.py:118
 [14] __getitem__
   @ fsspec.mapping ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/mapping.py:151
Stacktrace:
  [1] pythrow()
    @ PythonCall ~/.julia/packages/PythonCall/wXfah/src/err.jl:94
  [2] errcheck
    @ ~/.julia/packages/PythonCall/wXfah/src/err.jl:10 [inlined]
  [3] pygetitem(x::PythonCall.Py, k::String)
    @ PythonCall ~/.julia/packages/PythonCall/wXfah/src/abstract/object.jl:169
  [4] getindex
    @ ~/.julia/packages/PythonCall/wXfah/src/Py.jl:296 [inlined]
  [5] getindex
    @ ~/.julia/dev/Zarr/src/Storage/fsspec.jl:24 [inlined]
  [6] getindex(s::Zarr.FSStore, p::String, i::String)
    @ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:54
  [7] getattrs(s::Zarr.FSStore, p::String)
    @ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:67
  [8] ZGroup(s::Zarr.FSStore, mode::String, path::String; fill_as_missing::Bool)
    @ Zarr ~/.julia/dev/Zarr/src/ZGroup.jl:30
  [9] ZGroup
    @ ~/.julia/dev/Zarr/src/ZGroup.jl:17 [inlined]
 [10] ZGroup(s::Zarr.FSStore)
    @ Zarr ~/.julia/dev/Zarr/src/ZGroup.jl:17
 [11] top-level scope
    @ REPL[11]:1

and when I try it with the following kerchunk file:

julia> url = "https://dap.ceda.ac.uk/neodc/esacci/biomass/metadata/kerchunk/agb/maps/v4.0/ESACCI-BIOMASS-L4-AGB-CHANGE-100m-2018-2017-fv4.0-kr1.1.json?download=1"

I get this error:

julia> ZGroup(st)
ERROR: Python: KeyError: '.zattrs'
Python stacktrace:
 [1] __getitem__
   @ fsspec.mapping ~/.julia/dev/Zarr/.CondaPkg/env/lib/python3.12/site-packages/fsspec/mapping.py:155
Stacktrace:
  [1] pythrow()
    @ PythonCall ~/.julia/packages/PythonCall/wXfah/src/err.jl:94
  [2] errcheck
    @ ~/.julia/packages/PythonCall/wXfah/src/err.jl:10 [inlined]
  [3] pygetitem(x::PythonCall.Py, k::String)
    @ PythonCall ~/.julia/packages/PythonCall/wXfah/src/abstract/object.jl:169
  [4] getindex
    @ ~/.julia/packages/PythonCall/wXfah/src/Py.jl:296 [inlined]
  [5] getindex
    @ ~/.julia/dev/Zarr/src/Storage/fsspec.jl:24 [inlined]
  [6] getindex(s::Zarr.FSStore, p::String, i::String)
    @ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:54
  [7] getattrs(s::Zarr.FSStore, p::String)
    @ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:67
  [8] ZGroup(s::Zarr.FSStore, mode::String, path::String; fill_as_missing::Bool)
    @ Zarr ~/.julia/dev/Zarr/src/ZGroup.jl:30
  [9] ZGroup
    @ ~/.julia/dev/Zarr/src/ZGroup.jl:17 [inlined]
 [10] ZGroup(s::Zarr.FSStore)
    @ Zarr ~/.julia/dev/Zarr/src/ZGroup.jl:17
 [11] top-level scope
    @ REPL[14]:1

@coveralls
Copy link

coveralls commented May 20, 2024

Pull Request Test Coverage Report for Build 4686001185

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 28 (0.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-2.7%) to 83.718%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/Storage/fsspec.jl 0 28 0.0%
Totals Coverage Status
Change from base Build 4676462810: -2.7%
Covered Lines: 725
Relevant Lines: 866

💛 - Coveralls

@asinghvi17
Copy link
Member

For the curious, the FSSpec element of this is implemented in asinghvi17/FSSpec.jl and there is a native Julia Kerchunk reader in asinghvi17/Kerchunk.jl.

@agoodm agoodm closed this Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants