Add ZarrDataset extension #640

felixcremer · 2024-04-24T15:23:32Z

This is a first draft for opening zarr data with ZarrDatasets in Rasters.
I just copied the approach in the GRIBDatasets extension into a new extension.

This would still need some tests.

When I load the same data with this branch or with YAXArrays I get the following dimensions, hereby ds is the data loaded with YAXArrays and rs with Rasters. This is the ESDC tiny cube which is available here "https://s3.bgc-jena.mpg.de:9000/esdl-esdc-v3.0.2/esdc-16d-2.5deg-46x72x1440-3.0.2.zarr":

julia> dims(ds.burnt_area)
↓ lon Sampled{Float64} -178.75:2.5:178.75 ForwardOrdered Regular Points,
→ lat Sampled{Float64} -88.75:2.5:88.75 ForwardOrdered Regular Points,
↗ Ti  Sampled{DateTime} [1979-01-09T00:00:00, …, 2021-12-27T00:00:00] ForwardOrdered Irregular Points

julia> dims(rs.burnt_area)
↓ X  Mapped{Union{Missing, Float64}} [-178.75, -176.25, …, 176.25, 178.75] ForwardOrdered Regular Points,
→ Y  Mapped{Union{Missing, Float64}} [-88.75, -86.25, …, 86.25, 88.75] ForwardOrdered Regular Points,
↗ Ti Sampled{DateTime} [1979-01-09T00:00:00, …, 2021-12-27T00:00:00] ForwardOrdered Irregular Points

and for another example.:

julia> dims(ras)
↓ X  Mapped{Union{Missing, Float64}} [-15.998697916666666, -15.99609375, …, 10.662760416666666, 10.66536458333333] ForwardOrdered Explicit Intervals{Center},
→ Y  Mapped{Union{Missing, Float64}} [62.66536458333333, 62.662760416666664, …, 48.00390625, 48.001302083333336] ReverseOrdered Explicit Intervals{Center},
↗ Ti Sampled{Union{Missing, DateTime}} [DateTime("2017-06-05T12:00:00"), DateTime("2017-06-06T12:00:00"), DateTime("2017-06-07T12:00:00")] ForwardOrdered Explicit Intervals{Center}

julia> dims(yax)
↓ lon Sampled{Float64} -15.998697916666666:0.0026041666666666665:10.66536458333333 ForwardOrdered Regular Points,
→ lat Sampled{Float64} 62.66536458333333:-0.0026041666666666652:48.001302083333336 ReverseOrdered Regular Points,
↗ Ti  Sampled{DateTime} [2017-06-05T12:00:00, …, 2017-06-07T12:00:00] ForwardOrdered Irregular Points

I am not sure what are the expected dimensions for these datasets.

felixcremer · 2024-04-24T18:40:10Z

Before we could merge this, we need to register ZarrDatasets

rafaqz · 2024-04-29T16:14:35Z

The dimension name difference is because Rasters switches some common x/y/z names into explicit X/Y/Z to standardize algorithms (e.g. rasterize knows to automatically rasterize polygons into X/Y dimensions and not others).

There are some arguments for and against this, its a fairly random decision. But it definitely makes running the other alogorithms easier, and lets us check other things like broadcasts aren't mixing dimensions.

felixcremer · 2024-05-02T14:22:17Z

The name differences is something that I would be fine with, but others are a bit more hesitant to have. I got a lot of issues in YAXArrays because with the switch to DimensionalData we automatically converted Time axes to Ti.
Maybe this could be a keyword argument when people are very keen on keeping their axes names or we could make the detected dimension a subtype of XDim without changing the name?

The other thing that I saw is that YAXArrays makes points and Rasters makes Intervals. And YAXArrays is using more Ranges instead of vectors also the types of the Axes seem to be off with the Rasters version I would not expect to have Missing values in the dimensions.

I am going to explore this further, but I rebased this today on main and the overhaul keywords PR broke some things which I am going through at the moment.

felixcremer · 2024-05-02T15:15:06Z

It was not your PR but I made a dirty rebase. This is fixed now.

codecov · 2024-05-02T20:22:29Z

Codecov Report

Attention: Patch coverage is 44.44444% with 10 lines in your changes are missing coverage. Please review.

Project coverage is 82.43%. Comparing base (a15ebb1) to head (98d837c).
Report is 13 commits behind head on main.

❗ Current head 98d837c differs from pull request most recent head ff3fad8

Please upload reports for the commit ff3fad8 to get more accurate results.

Files	Patch %	Lines
ext/RastersZarrDatasetsExt/zarrdatasets_source.jl	0.00%	10 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #640      +/-   ##
==========================================
+ Coverage   82.32%   82.43%   +0.10%     
==========================================
  Files          60       61       +1     
  Lines        4357     4480     +123     
==========================================
+ Hits         3587     3693     +106     
- Misses        770      787      +17

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rafaqz · 2024-05-02T21:05:33Z

Missing values in the dimensions may be a CommonDataModel/NCDatsets bug, I think its forbidden in the CF standard too. Rasters would not introduce that.

Rasters will only give you Intervals if there is a bounds variable or some other metadata indicating intervals. My guess is YAX just doesn't check. But you really want intervals if you have them, so Contains and things that depend on it like extract work. Intervals let you assume the array value is the mean for the whole pixel, but you cant really do that for points.

We could use ranges instead of vectors for regular, but you do introduce the occasional floating point error.

rafaqz · 2024-05-02T21:17:32Z

Oh right my issue for the missing values was closed:
JuliaGeo/NCDatasets.jl#185

We could just run map(identity, lookup) over the vectors before using them to remove Missing.

@Alexander-Barth I think we really more control over missing value handling in CommonDataModel in general. Is there a way to load dim variables to force ignoring the _FillValue ?. It seems people don't want the Missing type in their CF dimension variables. And allocating a vector to add the Missing type, then another one to remove it again seems unnecessary to me.

@felixcremer also note the problem comes from incorrect files that have _FillValue properties for dim variables when CF disallows them.

rafaqz · 2024-05-02T21:39:25Z

#655 fixes the missing problem on the Rasters side for now

felixcremer · 2024-05-02T21:46:01Z

@felixcremer also note the problem comes from incorrect files that have _FillValue properties for dim variables when CF disallows them.

In Zarr the FillValue does not necessarily mean that it is missing. It is the default value that you would get when a chunk is not written. This is often used to mean missing and in Zarr.jl there is also a keyword fillvalueasmissing.

Yes I think, that YAX does not really check for the bounds variables and it only discards them.

I am going to look into properly testing this tomorrow.

rafaqz · 2024-05-02T22:36:04Z

Ok in that case we really need a way to stop CommonDataModel from replacing the FillValue with missing at all.

I'm going to overhaul missing handling and CF standards application soon, I don't like everything from CommonDataModel automatically replacing missing everywhere.

Alexander-Barth · 2024-05-03T09:40:11Z

In Zarr the FillValue does not necessarily mean that it is missing. It is the default value that you would get when a chunk is not written. This is often used to mean missing and in Zarr.jl there is also a keyword fillvalueasmissing.

Is that not analogous to NetCDF with unlimited dimensions? If time is an unlimited dimension and when you write only the second time slice, the first time will be treated with full of fill values.

You can ignore the defined FillValue by setting it to nothing:

fname = tempname()
ds = NCDataset(fname,"c")
defDim(ds,"lon",100)
v = defVar(ds,"lon",Float32,("lon",),fillvalue = 9999.)
close(ds)

ds = NCDataset(fname)
eltype(ds["lon"])
# Union{Missing, Float32}


ds = NCDataset(fname)
eltype(cfvariable(ds,"lon",fillvalue=nothing))
# Float32

(this code works with ZarrDatasets after this change)

As I understand, the Zarr spec, fill_value should not be set when it is not used:

fill_value
A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

However, it seems that the Zarr array here set the fill value for the coordinate dimensions (despite not being used).

rafaqz · 2024-05-03T09:44:01Z

You can ignore the defined FillValue by setting it to nothing:

We really need this at the point of loading the file or when getting a specific variable, for example by passing a keyword. Its not really practical to ask people to modify a file on disk to change how it loads.

Alexander-Barth · 2024-05-05T05:11:09Z

In my example, you do not need to modify the data on disk. fillvalue is a keyword argument of the function cfvariable. Setting it to nothing, makes NCDatasets to ignore its value on disk.

rafaqz · 2024-05-05T07:48:52Z

A right I have to stop reading these on mobile. We can allow setting that from Rasters fairly easily.

Alexander-Barth · 2024-05-07T16:55:29Z

In this commit JuliaGeo/ZarrDatasets.jl@a84ef0a
ZarrDatasets no longer considers Zarr fill_value the same as the CF _FillValue for coordinate variables. I think that this should help with the present issue.

(This issue was helpful to me to understand the differences between Zarr fill_value as CF _FillValue:
pydata/xarray#5475 (comment) )

rafaqz · 2024-05-07T17:16:14Z

Great! If also added fixes to Rasters to avoid Missing in lookups

This is a first draft for opening zarr data with ZarrDatasets in Rasters

felixcremer · 2024-05-13T13:58:24Z

The missing values in the lookups is fixed. But now I am a bit confused by the Locus that is detected. For the GLDAS dataset http://tinyurl.com/GLDAS-NOAH025-3H the time bounds are correctly recognized and the dimensions are set as intervals.
The locus of the time axis is set to Center but according to the metadata and also to the values in the time_bnds it should be the end of the interval.
Here gldas is the dataset opened as a raster and gldasz is the same data opened as a Zarr.Zarray.

How is the Locus set when opening a dataset?

julia> tmb = gldasz.arrays["time_bnds"][:,:]
2×6 Matrix{Float64}:
 1.15717e7  1.15718e7  1.1572e7   1.15722e7  1.15724e7  1.15726e7
 1.15718e7  1.1572e7   1.15722e7  1.15724e7  1.15726e7  1.15727e7

julia> tm = gldasz.arrays["time"][:]
6-element Vector{Float64}:
 1.157184e7
 1.157202e7
 1.15722e7
 1.157238e7
 1.157256e7
 1.157274e7

julia> d .+ Minute.(tmb[1, :])
6-element Vector{DateTime}:
 2022-01-01T00:00:00
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00

julia> d .+ Minute.(tm[:])
6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00

julia> d .+ Minute.(tmb[2, :])
6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00

julia> dims(gldas, Ti)
Ti Sampled{DateTime} ForwardOrdered Explicit DimensionalData.Dimensions.Lookups.Intervals{DimensionalData.Dimensions.Lookups.Center}
wrapping: 6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00
julia> gldasz.attrs["tavg definition:"]
"past 3-hour average"

rafaqz · 2024-05-13T14:06:15Z

Ok its always assumed to be the center in CF standards, but I guess when a bounds matrix is provided we need to actually check that (and currently we dont). I have no idea what to do if its some other fractional position besides start/center/end...

If you have time to add the check based on bounds matrix values that would help, see

Probably all of this logic needs some kind of overhaul.
https://github.com/rafaqz/Rasters.jl/blob/main/src/sources/commondatamodel.jl#L333-L350

One complication is we really don't want to get Explicit bounds matrix spans if we can have Regular because Regular is so much easier to work with.

CF says:

If bounds are not provided, an application might reasonably assume the gridpoints to be at the centers of the cells, but we do not require that in this standard.

So I've forgotten about the corrolary of that - when bounds are provided the gridpoints can be anywhere

felixcremer · 2024-05-13T14:35:40Z

The problem is, with Zarr data we will for now always get explicit, because there is no standard yet that defines how to save the Regular dimensions in the zarr format. See the discussion in the geozarr standard repository zarr-developers/geozarr-spec#17.

Or does Rasters try to convert an explicit Vector to a regular representation?
What is the locus information used for when the data bounds are explicit?

rafaqz · 2024-05-13T15:23:26Z

~~Oh that's the same with netcdf, bounds matrix is optional.~~ Oops I misread now for not. That is a pain, but we do try and check is Regular is ok anyway.

Yes we try to convert to Regular. But in hindsight I think it will be wrong in some edge cases that shouldn't exist, but could, and when the locus is not Center.

Butt when there is no bounds matrix according to CF we have to assume Center because as far as I know there is no metadata to indicate that position? Maybe zarr does specify that?

(I'm not sure that thread is relevant here, the bounds matrix is not the same as transformation variables... edit: because we already have this same problem in netcdf and just check for regularity... But also good to see the CF spec is as overwhelming for everyone else as it is for me...)

Otherwise the RasterStack code tries to load all zarr subfiles separately.

This always assumes a Center locus when the bounds are not explicitely hit. We might want to check for the actual center, but this might become problematic for time bounds.

felixcremer · 2024-05-15T10:14:23Z

I rebased the branch and add a check for the Locus to see whether the index is at one side of the boundaries and to set Start or End accordingly. I am not sure, whether we should really check whether the Center is in the middle between the boundaries, because what would we do with time axis where the index is at the 15th of the month and the end might be 30 or 31 and therefore the exact middle would always be off. I am not sure, how this is handled in other tools.
Also I am not sure whether we would need the same check for the regular span or whether we could assume Center there.

rafaqz · 2024-05-15T17:27:23Z

We always have to special case date time, so I think that's ok

src/sources/commondatamodel.jl

rafaqz · 2024-05-15T20:12:09Z

src/sources/commondatamodel.jl

    end
-    nondim = setdiff(keys(ds), toremove)
+    # Maybe this should be fixed in ZarrDatasets but it works with this patch.


Yes fine for now but these CMD functions should have consistent return values

This copied some of the ncdatasets tests and did them on a Zarr array.

This is necessary, because otherwise the documentation build fails with strange pkg errors.

I am trying to make the _writevar! function only depend on CDM functions to make it backend agnostic. This needs certain functions to make the NCD specific parts inside functions which can be specialized on the source type.

felixcremer · 2024-05-16T15:30:15Z

I am currently working on making the writevar! function backend agnostic so that we can use it for writing Zarr and NCDatasets this is currently broken.
In the future I would also like to add some round trip tests where we load a netcdf, save it to zarr and open it up again and then save it back to netcdf and this should give the same netcdf file.

The documenter build is failing because of download issues with WorldClim and CI gave some strange malloc failure but I am hoping, that this was just a github action glitch.

rafaqz · 2024-06-16T22:33:51Z

I'm writing some major changes to how cf standards, missing values etc are handled accross all the backends.

The will have pretty major clashes with what you are doing here - like in writevar. Any chance we can get this merged relatively soon before I do too much more of that PR?

rafaqz · 2024-08-02T07:24:59Z

Any updates? would be good to merge this

felixcremer · 2024-08-02T07:28:33Z

I'll have a look today. I am going to remove the write stuff, and focus just on reading zarr data for now, so that we can merge it soon.

felixcremer · 2024-08-02T15:42:40Z

This is currently failing because of this test:

            @testset "non allowed values" begin
                @test_throws ArgumentError write(filename, convert.(Union{Missing,Float16}, ncarray); force=true)
            end

What is the purpose of this test?
The Errror changed from an ArgumentError to a MethodError because there is no fillvalue(Float16) from CommonDataModel.
I am not sure, whether this is something that fails now because of my changes.
Apart from that I removed the write function for ZarrDatasets and we should try to make that in the future.

rafaqz · 2024-08-03T06:20:35Z

Its just to make sure there is a sensible argument error rather than a random method error users don't understand.

We need to say "can't write $T" explicitly

felixcremer · 2024-08-03T08:19:09Z

Than this is something, that I would address in this PR.

rafaqz · 2024-08-03T09:45:27Z

Yeah, I'm not sure where it threw the ArgumentError before but looks like it did

felixcremer · 2024-08-12T08:11:33Z

I've redone the current behaviour and hopefully this is now going through all tests.
This should be good to go.

rafaqz mentioned this pull request Apr 29, 2024

Registration? JuliaGeo/ZarrDatasets.jl#3

Closed

felixcremer force-pushed the fc/zarr branch from b9905ce to 4c2aee3 Compare May 2, 2024 13:44

felixcremer added 6 commits May 13, 2024 10:34

Add ZarrDataset extension

200d63d

This is a first draft for opening zarr data with ZarrDatasets in Rasters

Remove pre 1.9 check for extensions

33c2820

Fix dirty rebase

01bfa6c

Fix Zarrsource package link

e31d40f

Add ZarrDatasets as test dep

b468667

Start implementing write for Zarrsource

738b9f7

felixcremer added 2 commits May 14, 2024 15:46

Treat Zarr folder as single file in RasterStack

342d9a2

Otherwise the RasterStack code tries to load all zarr subfiles separately.

Add check for locus

0b2bc8d

This always assumes a Center locus when the bounds are not explicitely hit. We might want to check for the actual center, but this might become problematic for time bounds.

felixcremer force-pushed the fc/zarr branch from af4a1fa to 0b2bc8d Compare May 15, 2024 10:04

rafaqz reviewed May 15, 2024

View reviewed changes

felixcremer added 6 commits May 15, 2024 23:34

Replace all by mapreduce

0679efe

Add some tests using the ESDL as testfile

47d0980

Fix typo

2d7907a

Add more tests

40fbad3

This copied some of the ncdatasets tests and did them on a Zarr array.

Update registry in Documenter action

881c51b

This is necessary, because otherwise the documentation build fails with strange pkg errors.

Move _writevar! to Rasters

4b66180

I am trying to make the _writevar! function only depend on CDM functions to make it backend agnostic. This needs certain functions to make the NCD specific parts inside functions which can be specialized on the source type.

felixcremer marked this pull request as draft May 16, 2024 15:30

Move NCDataset specific error handling into NCDatasetsExt

ff3fad8

Remove write function

bb58cfb

felixcremer marked this pull request as ready for review August 2, 2024 14:37

Merge branch 'main' into fc/zarr

6380467

felixcremer added 2 commits August 12, 2024 10:08

Dispatch _check_allowed_type on sourctrait

534bfab

Remove show macro calls

47cbafc

rafaqz merged commit 7d1fa2c into rafaqz:main Aug 13, 2024
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ZarrDataset extension #640

Add ZarrDataset extension #640

felixcremer commented Apr 24, 2024

felixcremer commented Apr 24, 2024

rafaqz commented Apr 29, 2024

felixcremer commented May 2, 2024

felixcremer commented May 2, 2024

codecov bot commented May 2, 2024 •

edited

Loading

rafaqz commented May 2, 2024 •

edited

Loading

rafaqz commented May 2, 2024 •

edited

Loading

rafaqz commented May 2, 2024

felixcremer commented May 2, 2024

rafaqz commented May 2, 2024

Alexander-Barth commented May 3, 2024 •

edited by rafaqz

Loading

rafaqz commented May 3, 2024 •

edited

Loading

Alexander-Barth commented May 5, 2024

rafaqz commented May 5, 2024

Alexander-Barth commented May 7, 2024

rafaqz commented May 7, 2024

felixcremer commented May 13, 2024

rafaqz commented May 13, 2024 •

edited

Loading

felixcremer commented May 13, 2024

rafaqz commented May 13, 2024 •

edited

Loading

felixcremer commented May 15, 2024

rafaqz commented May 15, 2024

rafaqz May 15, 2024

felixcremer commented May 16, 2024

rafaqz commented Jun 16, 2024

rafaqz commented Aug 2, 2024

felixcremer commented Aug 2, 2024

felixcremer commented Aug 2, 2024

rafaqz commented Aug 3, 2024

felixcremer commented Aug 3, 2024

rafaqz commented Aug 3, 2024

felixcremer commented Aug 12, 2024

Add ZarrDataset extension #640

Add ZarrDataset extension #640

Conversation

felixcremer commented Apr 24, 2024

felixcremer commented Apr 24, 2024

rafaqz commented Apr 29, 2024

felixcremer commented May 2, 2024

felixcremer commented May 2, 2024

codecov bot commented May 2, 2024 • edited Loading

Codecov Report

rafaqz commented May 2, 2024 • edited Loading

rafaqz commented May 2, 2024 • edited Loading

rafaqz commented May 2, 2024

felixcremer commented May 2, 2024

rafaqz commented May 2, 2024

Alexander-Barth commented May 3, 2024 • edited by rafaqz Loading

rafaqz commented May 3, 2024 • edited Loading

Alexander-Barth commented May 5, 2024

rafaqz commented May 5, 2024

Alexander-Barth commented May 7, 2024

rafaqz commented May 7, 2024

felixcremer commented May 13, 2024

rafaqz commented May 13, 2024 • edited Loading

felixcremer commented May 13, 2024

rafaqz commented May 13, 2024 • edited Loading

felixcremer commented May 15, 2024

rafaqz commented May 15, 2024

rafaqz May 15, 2024

Choose a reason for hiding this comment

felixcremer commented May 16, 2024

rafaqz commented Jun 16, 2024

rafaqz commented Aug 2, 2024

felixcremer commented Aug 2, 2024

felixcremer commented Aug 2, 2024

rafaqz commented Aug 3, 2024

felixcremer commented Aug 3, 2024

rafaqz commented Aug 3, 2024

felixcremer commented Aug 12, 2024

codecov bot commented May 2, 2024 •

edited

Loading

rafaqz commented May 2, 2024 •

edited

Loading

rafaqz commented May 2, 2024 •

edited

Loading

Alexander-Barth commented May 3, 2024 •

edited by rafaqz

Loading

rafaqz commented May 3, 2024 •

edited

Loading

rafaqz commented May 13, 2024 •

edited

Loading

rafaqz commented May 13, 2024 •

edited

Loading