Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ZarrDataset extension #640

Merged
merged 19 commits into from
Aug 13, 2024
Merged

Add ZarrDataset extension #640

merged 19 commits into from
Aug 13, 2024

Conversation

felixcremer
Copy link
Contributor

This is a first draft for opening zarr data with ZarrDatasets in Rasters.
I just copied the approach in the GRIBDatasets extension into a new extension.

This would still need some tests.

When I load the same data with this branch or with YAXArrays I get the following dimensions, hereby ds is the data loaded with YAXArrays and rs with Rasters. This is the ESDC tiny cube which is available here "https://s3.bgc-jena.mpg.de:9000/esdl-esdc-v3.0.2/esdc-16d-2.5deg-46x72x1440-3.0.2.zarr":

julia> dims(ds.burnt_area)
 lon Sampled{Float64} -178.75:2.5:178.75 ForwardOrdered Regular Points,
 lat Sampled{Float64} -88.75:2.5:88.75 ForwardOrdered Regular Points,
↗ Ti  Sampled{DateTime} [1979-01-09T00:00:00, , 2021-12-27T00:00:00] ForwardOrdered Irregular Points

julia> dims(rs.burnt_area)
 X  Mapped{Union{Missing, Float64}} [-178.75, -176.25, , 176.25, 178.75] ForwardOrdered Regular Points,
 Y  Mapped{Union{Missing, Float64}} [-88.75, -86.25, , 86.25, 88.75] ForwardOrdered Regular Points,
↗ Ti Sampled{DateTime} [1979-01-09T00:00:00, , 2021-12-27T00:00:00] ForwardOrdered Irregular Points

and for another example.:

julia> dims(ras)
 X  Mapped{Union{Missing, Float64}} [-15.998697916666666, -15.99609375, , 10.662760416666666, 10.66536458333333] ForwardOrdered Explicit Intervals{Center},
 Y  Mapped{Union{Missing, Float64}} [62.66536458333333, 62.662760416666664, , 48.00390625, 48.001302083333336] ReverseOrdered Explicit Intervals{Center},
↗ Ti Sampled{Union{Missing, DateTime}} [DateTime("2017-06-05T12:00:00"), DateTime("2017-06-06T12:00:00"), DateTime("2017-06-07T12:00:00")] ForwardOrdered Explicit Intervals{Center}

julia> dims(yax)
 lon Sampled{Float64} -15.998697916666666:0.0026041666666666665:10.66536458333333 ForwardOrdered Regular Points,
 lat Sampled{Float64} 62.66536458333333:-0.0026041666666666652:48.001302083333336 ReverseOrdered Regular Points,
↗ Ti  Sampled{DateTime} [2017-06-05T12:00:00, , 2017-06-07T12:00:00] ForwardOrdered Irregular Points

I am not sure what are the expected dimensions for these datasets.

@felixcremer
Copy link
Contributor Author

Before we could merge this, we need to register ZarrDatasets

@rafaqz
Copy link
Owner

rafaqz commented Apr 29, 2024

The dimension name difference is because Rasters switches some common x/y/z names into explicit X/Y/Z to standardize algorithms (e.g. rasterize knows to automatically rasterize polygons into X/Y dimensions and not others).

There are some arguments for and against this, its a fairly random decision. But it definitely makes running the other alogorithms easier, and lets us check other things like broadcasts aren't mixing dimensions.

@felixcremer
Copy link
Contributor Author

The name differences is something that I would be fine with, but others are a bit more hesitant to have. I got a lot of issues in YAXArrays because with the switch to DimensionalData we automatically converted Time axes to Ti.
Maybe this could be a keyword argument when people are very keen on keeping their axes names or we could make the detected dimension a subtype of XDim without changing the name?

The other thing that I saw is that YAXArrays makes points and Rasters makes Intervals. And YAXArrays is using more Ranges instead of vectors also the types of the Axes seem to be off with the Rasters version I would not expect to have Missing values in the dimensions.

I am going to explore this further, but I rebased this today on main and the overhaul keywords PR broke some things which I am going through at the moment.

@felixcremer
Copy link
Contributor Author

It was not your PR but I made a dirty rebase. This is fixed now.

Copy link

codecov bot commented May 2, 2024

Codecov Report

Attention: Patch coverage is 44.44444% with 10 lines in your changes are missing coverage. Please review.

Project coverage is 82.43%. Comparing base (a15ebb1) to head (98d837c).
Report is 13 commits behind head on main.

Current head 98d837c differs from pull request most recent head ff3fad8

Please upload reports for the commit ff3fad8 to get more accurate results.

Files Patch % Lines
ext/RastersZarrDatasetsExt/zarrdatasets_source.jl 0.00% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #640      +/-   ##
==========================================
+ Coverage   82.32%   82.43%   +0.10%     
==========================================
  Files          60       61       +1     
  Lines        4357     4480     +123     
==========================================
+ Hits         3587     3693     +106     
- Misses        770      787      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rafaqz
Copy link
Owner

rafaqz commented May 2, 2024

Missing values in the dimensions may be a CommonDataModel/NCDatsets bug, I think its forbidden in the CF standard too. Rasters would not introduce that.

Rasters will only give you Intervals if there is a bounds variable or some other metadata indicating intervals. My guess is YAX just doesn't check. But you really want intervals if you have them, so Contains and things that depend on it like extract work. Intervals let you assume the array value is the mean for the whole pixel, but you cant really do that for points.

We could use ranges instead of vectors for regular, but you do introduce the occasional floating point error.

@rafaqz
Copy link
Owner

rafaqz commented May 2, 2024

Oh right my issue for the missing values was closed:
JuliaGeo/NCDatasets.jl#185

We could just run map(identity, lookup) over the vectors before using them to remove Missing.

@Alexander-Barth I think we really more control over missing value handling in CommonDataModel in general. Is there a way to load dim variables to force ignoring the _FillValue ?. It seems people don't want the Missing type in their CF dimension variables. And allocating a vector to add the Missing type, then another one to remove it again seems unnecessary to me.

@felixcremer also note the problem comes from incorrect files that have _FillValue properties for dim variables when CF disallows them.

@rafaqz
Copy link
Owner

rafaqz commented May 2, 2024

#655 fixes the missing problem on the Rasters side for now

@felixcremer
Copy link
Contributor Author

@felixcremer also note the problem comes from incorrect files that have _FillValue properties for dim variables when CF disallows them.

In Zarr the FillValue does not necessarily mean that it is missing. It is the default value that you would get when a chunk is not written. This is often used to mean missing and in Zarr.jl there is also a keyword fillvalueasmissing.

Yes I think, that YAX does not really check for the bounds variables and it only discards them.

I am going to look into properly testing this tomorrow.

@rafaqz
Copy link
Owner

rafaqz commented May 2, 2024

Ok in that case we really need a way to stop CommonDataModel from replacing the FillValue with missing at all.

I'm going to overhaul missing handling and CF standards application soon, I don't like everything from CommonDataModel automatically replacing missing everywhere.

@Alexander-Barth
Copy link

Alexander-Barth commented May 3, 2024

In Zarr the FillValue does not necessarily mean that it is missing. It is the default value that you would get when a chunk is not written. This is often used to mean missing and in Zarr.jl there is also a keyword fillvalueasmissing.

Is that not analogous to NetCDF with unlimited dimensions? If time is an unlimited dimension and when you write only the second time slice, the first time will be treated with full of fill values.

You can ignore the defined FillValue by setting it to nothing:

fname = tempname()
ds = NCDataset(fname,"c")
defDim(ds,"lon",100)
v = defVar(ds,"lon",Float32,("lon",),fillvalue = 9999.)
close(ds)

ds = NCDataset(fname)
eltype(ds["lon"])
# Union{Missing, Float32}


ds = NCDataset(fname)
eltype(cfvariable(ds,"lon",fillvalue=nothing))
# Float32

(this code works with ZarrDatasets after this change)

As I understand, the Zarr spec, fill_value should not be set when it is not used:

fill_value
A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

However, it seems that the Zarr array here set the fill value for the coordinate dimensions (despite not being used).

@rafaqz
Copy link
Owner

rafaqz commented May 3, 2024

You can ignore the defined FillValue by setting it to nothing:

We really need this at the point of loading the file or when getting a specific variable, for example by passing a keyword. Its not really practical to ask people to modify a file on disk to change how it loads.

@Alexander-Barth
Copy link

In my example, you do not need to modify the data on disk. fillvalue is a keyword argument of the function cfvariable. Setting it to nothing, makes NCDatasets to ignore its value on disk.

@rafaqz
Copy link
Owner

rafaqz commented May 5, 2024

A right I have to stop reading these on mobile. We can allow setting that from Rasters fairly easily.

@Alexander-Barth
Copy link

In this commit JuliaGeo/ZarrDatasets.jl@a84ef0a
ZarrDatasets no longer considers Zarr fill_value the same as the CF _FillValue for coordinate variables. I think that this should help with the present issue.

(This issue was helpful to me to understand the differences between Zarr fill_value as CF _FillValue:
pydata/xarray#5475 (comment) )

@rafaqz
Copy link
Owner

rafaqz commented May 7, 2024

Great! If also added fixes to Rasters to avoid Missing in lookups

@felixcremer
Copy link
Contributor Author

The missing values in the lookups is fixed. But now I am a bit confused by the Locus that is detected. For the GLDAS dataset http://tinyurl.com/GLDAS-NOAH025-3H the time bounds are correctly recognized and the dimensions are set as intervals.
The locus of the time axis is set to Center but according to the metadata and also to the values in the time_bnds it should be the end of the interval.
Here gldas is the dataset opened as a raster and gldasz is the same data opened as a Zarr.Zarray.

How is the Locus set when opening a dataset?

julia> tmb = gldasz.arrays["time_bnds"][:,:]
2×6 Matrix{Float64}:
 1.15717e7  1.15718e7  1.1572e7   1.15722e7  1.15724e7  1.15726e7
 1.15718e7  1.1572e7   1.15722e7  1.15724e7  1.15726e7  1.15727e7

julia> tm = gldasz.arrays["time"][:]
6-element Vector{Float64}:
 1.157184e7
 1.157202e7
 1.15722e7
 1.157238e7
 1.157256e7
 1.157274e7

julia> d .+ Minute.(tmb[1, :])
6-element Vector{DateTime}:
 2022-01-01T00:00:00
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00

julia> d .+ Minute.(tm[:])
6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00

julia> d .+ Minute.(tmb[2, :])
6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00

julia> dims(gldas, Ti)
Ti Sampled{DateTime} ForwardOrdered Explicit DimensionalData.Dimensions.Lookups.Intervals{DimensionalData.Dimensions.Lookups.Center}
wrapping: 6-element Vector{DateTime}:
 2022-01-01T03:00:00
 2022-01-01T06:00:00
 2022-01-01T09:00:00
 2022-01-01T12:00:00
 2022-01-01T15:00:00
 2022-01-01T18:00:00
julia> gldasz.attrs["tavg definition:"]
"past 3-hour average"

@rafaqz
Copy link
Owner

rafaqz commented May 13, 2024

Ok its always assumed to be the center in CF standards, but I guess when a bounds matrix is provided we need to actually check that (and currently we dont). I have no idea what to do if its some other fractional position besides start/center/end...

If you have time to add the check based on bounds matrix values that would help, see

Probably all of this logic needs some kind of overhaul.
https://github.com/rafaqz/Rasters.jl/blob/main/src/sources/commondatamodel.jl#L333-L350

One complication is we really don't want to get Explicit bounds matrix spans if we can have Regular because Regular is so much easier to work with.

CF says:

If bounds are not provided, an application might reasonably assume the gridpoints to be at the centers of the cells, but we do not require that in this standard.

So I've forgotten about the corrolary of that - when bounds are provided the gridpoints can be anywhere

@felixcremer
Copy link
Contributor Author

The problem is, with Zarr data we will for now always get explicit, because there is no standard yet that defines how to save the Regular dimensions in the zarr format. See the discussion in the geozarr standard repository zarr-developers/geozarr-spec#17.

Or does Rasters try to convert an explicit Vector to a regular representation?
What is the locus information used for when the data bounds are explicit?

@rafaqz
Copy link
Owner

rafaqz commented May 13, 2024

Oh that's the same with netcdf, bounds matrix is optional. Oops I misread now for not. That is a pain, but we do try and check is Regular is ok anyway.

Yes we try to convert to Regular. But in hindsight I think it will be wrong in some edge cases that shouldn't exist, but could, and when the locus is not Center.

Butt when there is no bounds matrix according to CF we have to assume Center because as far as I know there is no metadata to indicate that position? Maybe zarr does specify that?

(I'm not sure that thread is relevant here, the bounds matrix is not the same as transformation variables... edit: because we already have this same problem in netcdf and just check for regularity... But also good to see the CF spec is as overwhelming for everyone else as it is for me...)

Otherwise the RasterStack code tries to load all zarr subfiles separately.
This always assumes a Center locus when the bounds are not explicitely hit.
We might want to check for the actual center, but this might become problematic for time bounds.
@felixcremer
Copy link
Contributor Author

I rebased the branch and add a check for the Locus to see whether the index is at one side of the boundaries and to set Start or End accordingly. I am not sure, whether we should really check whether the Center is in the middle between the boundaries, because what would we do with time axis where the index is at the 15th of the month and the end might be 30 or 31 and therefore the exact middle would always be off. I am not sure, how this is handled in other tools.
Also I am not sure whether we would need the same check for the regular span or whether we could assume Center there.

@rafaqz
Copy link
Owner

rafaqz commented May 15, 2024

We always have to special case date time, so I think that's ok

src/sources/commondatamodel.jl Outdated Show resolved Hide resolved
end
nondim = setdiff(keys(ds), toremove)
# Maybe this should be fixed in ZarrDatasets but it works with this patch.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes fine for now but these CMD functions should have consistent return values

This copied some of the ncdatasets tests and did them on a Zarr array.
This is necessary, because otherwise the  documentation build fails with strange pkg errors.
I am trying to make the _writevar! function only depend on CDM functions to make it backend agnostic.
This needs certain functions to make the NCD specific parts inside functions which can be specialized on the source type.
@felixcremer
Copy link
Contributor Author

I am currently working on making the writevar! function backend agnostic so that we can use it for writing Zarr and NCDatasets this is currently broken.
In the future I would also like to add some round trip tests where we load a netcdf, save it to zarr and open it up again and then save it back to netcdf and this should give the same netcdf file.

The documenter build is failing because of download issues with WorldClim and CI gave some strange malloc failure but I am hoping, that this was just a github action glitch.

@felixcremer felixcremer marked this pull request as draft May 16, 2024 15:30
@rafaqz
Copy link
Owner

rafaqz commented Jun 16, 2024

I'm writing some major changes to how cf standards, missing values etc are handled accross all the backends.

The will have pretty major clashes with what you are doing here - like in writevar. Any chance we can get this merged relatively soon before I do too much more of that PR?

@rafaqz
Copy link
Owner

rafaqz commented Aug 2, 2024

Any updates? would be good to merge this

@felixcremer
Copy link
Contributor Author

I'll have a look today. I am going to remove the write stuff, and focus just on reading zarr data for now, so that we can merge it soon.

@felixcremer felixcremer marked this pull request as ready for review August 2, 2024 14:37
@felixcremer
Copy link
Contributor Author

This is currently failing because of this test:

            @testset "non allowed values" begin
                @test_throws ArgumentError write(filename, convert.(Union{Missing,Float16}, ncarray); force=true)
            end

What is the purpose of this test?
The Errror changed from an ArgumentError to a MethodError because there is no fillvalue(Float16) from CommonDataModel.
I am not sure, whether this is something that fails now because of my changes.
Apart from that I removed the write function for ZarrDatasets and we should try to make that in the future.

@rafaqz
Copy link
Owner

rafaqz commented Aug 3, 2024

Its just to make sure there is a sensible argument error rather than a random method error users don't understand.

We need to say "can't write $T" explicitly

@felixcremer
Copy link
Contributor Author

Than this is something, that I would address in this PR.

@rafaqz
Copy link
Owner

rafaqz commented Aug 3, 2024

Yeah, I'm not sure where it threw the ArgumentError before but looks like it did

@felixcremer
Copy link
Contributor Author

I've redone the current behaviour and hopefully this is now going through all tests.
This should be good to go.

@rafaqz rafaqz merged commit 7d1fa2c into rafaqz:main Aug 13, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants