Add ZarrDataset extension #640
Conversation
Before we can merge this, we need to register ZarrDatasets.
The dimension name difference is because Rasters switches some common x/y/z names into explicit X/Y/Z to standardize algorithms (e.g. rasterize knows to automatically rasterize polygons into X/Y dimensions and not others). There are some arguments for and against this; it's a fairly arbitrary decision. But it definitely makes running the other algorithms easier, and lets us check other things, like that broadcasts aren't mixing dimensions.
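For illustration, a minimal sketch of that kind of name standardization, assuming DimensionalData's canonical dimension types; `standardize_dim` is a hypothetical helper, not the actual Rasters code:

```julia
using DimensionalData

# Hypothetical helper: map common lowercase axis names to DimensionalData's
# canonical dimension types, leaving unknown names as generic Dim{name}.
function standardize_dim(name::Symbol)
    name in (:x, :lon, :longitude) && return X
    name in (:y, :lat, :latitude)  && return Y
    name in (:z, :height, :depth)  && return Z
    name in (:time, :t)            && return Ti
    return Dim{name}
end

standardize_dim(:lon)   # X
standardize_dim(:time)  # Ti
```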
The name differences are something that I would be fine with, but others are a bit more hesitant. I got a lot of issues in YAXArrays because with the switch to DimensionalData we automatically converted Time axes to Ti. The other thing that I saw is that YAXArrays makes Points and Rasters makes Intervals. YAXArrays also uses more Ranges instead of Vectors, and the types of the axes seem to be off in the Rasters version: I would not expect to have Missing values in the dimensions. I am going to explore this further, but I rebased this today on main and the overhaul-keywords PR broke some things, which I am going through at the moment.
It was not your PR, but I made a dirty rebase. This is fixed now.
Codecov Report

```diff
@@ Coverage Diff @@
##             main     #640      +/-   ##
==========================================
+ Coverage   82.32%   82.43%   +0.10%
==========================================
  Files          60       61       +1
  Lines        4357     4480     +123
==========================================
+ Hits         3587     3693     +106
- Misses        770      787      +17
```
Missing values in the dimensions may be a CommonDataModel/NCDatasets bug; I think it's forbidden in the CF standard too. Rasters would not introduce that. Rasters will only give you Intervals if there is a bounds variable or some other metadata indicating intervals. My guess is YAX just doesn't check, but you really want intervals if you have them. We could use ranges instead of vectors for regular lookups, but then you do introduce the occasional floating point error.
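As a small illustration of that floating point issue (not Rasters code): coordinates written by step-by-step accumulation need not compare equal to the equivalent range, which computes its values with extra internal precision:

```julia
# Coordinates accumulated step-by-step, as they might be stored in a file:
vals = accumulate(+, fill(0.1, 100); init=0.0)
# The "equivalent" range reconstructed from first value, step, and length:
rng = range(first(vals); step=0.1, length=100)
vals == collect(rng)  # false: accumulated rounding differs from range arithmetic
```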
Oh right, my issue for the missing values was closed. @Alexander-Barth I think we really need more control over missing value handling in CommonDataModel in general. Is there a way to load dim variables that forces ignoring the fill value? @felixcremer also note that the problem comes from incorrect files that have a fill value set on coordinate variables.
#655 fixes the missing-value problem on the Rasters side for now.
In Zarr the fill value does not necessarily mean that a value is missing. It is the default value that you would get when a chunk is not written, and it is often used that way deliberately.

Yes, I think YAX does not really check for the bounds variables and only discards them. I am going to look into properly testing this tomorrow.
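A quick illustration of what that means in practice, as a sketch with Zarr.jl (assuming `zcreate` keywords as I understand them):

```julia
using Zarr

# Create a chunked array with a fill value; only touch the first chunk.
z = zcreate(Float64, 4, 4; path=tempname(), chunks=(2, 2), fill_value=0.0)
z[1, 1] = 1.0
z[:, :]  # unwritten chunks read back as 0.0 -- a default, not a missing marker
```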
Ok, in that case we really need a way to stop CommonDataModel from replacing the fill value with missing at all. I'm going to overhaul this handling anyway.
Is that not analogous to NetCDF with unlimited dimensions? If time is an unlimited dimension and you write only the second time slice, the first time slice will be full of fill values. You can ignore the defined fill value like this:

```julia
using NCDatasets

fname = tempname()
ds = NCDataset(fname, "c")
defDim(ds, "lon", 100)
v = defVar(ds, "lon", Float32, ("lon",), fillvalue = 9999f0)
close(ds)

ds = NCDataset(fname)
eltype(ds["lon"])
# Union{Missing, Float32}

ds = NCDataset(fname)
eltype(cfvariable(ds, "lon", fillvalue = nothing))
# Float32
```

(This code works with ZarrDatasets after this change.)

As I understand the Zarr spec, the fill value is optional. However, it seems that the Zarr array here sets the fill value for the coordinate dimensions (despite it not being used).
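The analogous call through ZarrDatasets would then look something like this (a sketch; the `path` value is an assumption standing in for a Zarr store that contains a "lon" variable):

```julia
using ZarrDatasets
import CommonDataModel: cfvariable

path = "data.zarr"  # assumed existing Zarr store with a "lon" variable
ds = ZarrDataset(path)
# With fillvalue = nothing, fill values are left in place instead of being
# masked to missing, so the element type carries no Missing.
eltype(cfvariable(ds, "lon", fillvalue = nothing))
```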
We really need this at the point of loading the file, or when getting a specific variable, for example by passing a keyword. It's not really practical to ask people to modify a file on disk to change how it loads.
In my example, you do not need to modify the data on disk.
Ah right, I have to stop reading these on mobile. We can allow setting that from Rasters fairly easily.
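A minimal sketch of what that could look like (the helper name and keyword are hypothetical; the only real API used is CommonDataModel's cfvariable):

```julia
using NCDatasets

# Hypothetical pass-through: expose the fill value override as a keyword when
# opening a variable, instead of requiring changes to the file on disk.
function open_unmasked(ds, name; fillvalue=nothing)
    return cfvariable(ds, name; fillvalue=fillvalue)
end
```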
This should be addressed in commit JuliaGeo/ZarrDatasets.jl@a84ef0a. (This issue was helpful to me to understand the differences between Zarr fill_value and CF _FillValue.)
Great! I've also added fixes to Rasters to avoid Missing in lookups.
> This is a first draft for opening Zarr data with ZarrDatasets in Rasters.

The missing values in the lookups are fixed. But now I am a bit confused by the Locus that is detected. For the GLDAS dataset http://tinyurl.com/GLDAS-NOAH025-3H the time bounds are correctly recognized and the dimensions are set as Intervals. How is the Locus set when opening a dataset?
Ok, it's always assumed to be the center in CF standards, but I guess when a bounds matrix is provided we need to actually check that (and currently we don't). I have no idea what to do if it's some other fractional position besides start/center/end... If you have time to add the check based on the bounds matrix values, that would help. Probably all of this logic needs some kind of overhaul. One complication is we really don't want to get this wrong: CF says the coordinate points are not required to be at the centre of the cells when bounds are provided.

So I'd forgotten about the corollary of that: when bounds are provided, the gridpoints can be anywhere.
The problem is, with Zarr data we will for now always get Explicit spans, because there is no standard yet that defines how to save regular dimensions in the Zarr format. See the discussion in the geozarr standard repository: zarr-developers/geozarr-spec#17. Or does Rasters try to convert an Explicit vector to a Regular representation?
Yes, we try to convert to Regular. But in hindsight I think it will be wrong in some edge cases that shouldn't exist, but could, when the locus is not Center. But when there is no bounds matrix, according to CF we have to assume Center. (I'm not sure that thread is relevant here; the bounds matrix is not the same as transformation variables... edit: because we already have this same problem in NetCDF and just check for regularity... But also good to see the CF spec is as overwhelming for everyone else as it is for me...)
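For reference, the regularity check mentioned above can be as simple as this sketch (not the actual Rasters implementation):

```julia
# Treat an explicit coordinate vector as Regular only if all steps are
# (approximately) equal; otherwise it stays Irregular/Explicit.
function isregular(vals::AbstractVector{<:Real}; rtol=1e-6)
    steps = diff(vals)
    return all(s -> isapprox(s, first(steps); rtol), steps)
end

isregular([0.0, 0.5, 1.0, 1.5])  # true
isregular([0.0, 0.5, 1.1, 1.5])  # false
```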
Otherwise the RasterStack code tries to load all Zarr subfiles separately.
This always assumes a Center locus when the bounds are not explicitly set. We might want to check for the actual center, but this might become problematic for time bounds.
I rebased the branch and added a check for the Locus to see whether the index is at one side of the boundaries, setting Start or End accordingly. I am not sure whether we should really check that the Center is exactly in the middle between the boundaries, because what would we do with a time axis where the index is at the 15th of the month while the end might be the 30th or 31st, so the exact middle would always be off? I am not sure how this is handled in other tools.
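A condensed sketch of that check (assuming DimensionalData's Start/Center/End loci and a 2×n bounds matrix as read from the file; not the exact PR code):

```julia
using DimensionalData

# Pick a locus by comparing the index against the two bounds rows:
# exact match on the lower bounds means Start, on the upper bounds End,
# and anything else falls back to the CF default Center (avoiding fragile
# exact-midpoint checks, which break for calendar-aware time bounds).
function locus_from_bounds(index::AbstractVector, bounds::AbstractMatrix)
    index == bounds[1, :] && return Start()
    index == bounds[2, :] && return End()
    return Center()
end
```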
We always have to special-case datetimes, so I think that's ok.
```julia
end
nondim = setdiff(keys(ds), toremove)
# Maybe this should be fixed in ZarrDatasets but it works with this patch.
```
Yes, fine for now, but these CDM functions should have consistent return values.
This copies some of the NCDatasets tests and runs them on a Zarr array.
This is necessary because otherwise the documentation build fails with strange Pkg errors.
I am trying to make the _writevar! function depend only on CDM functions to make it backend agnostic. This needs the NCD-specific parts to be moved into functions that can be specialized on the source type.
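The pattern described is plain Julia dispatch; here is a toy sketch (all names hypothetical) of how a shared writer can call small hooks that each backend specializes:

```julia
# Toy backends standing in for NCDatasets / ZarrDatasets dataset types.
struct NCDSource end
struct ZarrSource end

# Backend-specific hook, specialized on the source type.
fillvalue_attrib(::NCDSource)  = "_FillValue"
fillvalue_attrib(::ZarrSource) = "fill_value"

# Backend-agnostic writer: only talks to the hooks, never to NCD directly.
function writevar(source, name, data)
    attrib = fillvalue_attrib(source)
    println("writing $name with $attrib via $(typeof(source))")
end

writevar(ZarrSource(), "precip", rand(3, 3))
```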
I am currently working on making the _writevar! function backend agnostic so that we can use it for writing both Zarr and NCDatasets; this is currently broken. The Documenter build is failing because of download issues with WorldClim, and CI gave some strange malloc failure, but I am hoping that was just a GitHub Actions glitch.
I'm writing some major changes to how CF standards, missing values etc. are handled across all the backends. They will have pretty major clashes with what you are doing here.
Any updates? It would be good to merge this.
I'll have a look today. I am going to remove the write stuff and focus just on reading Zarr data for now, so that we can merge it soon.
This is currently failing because of this test:

```julia
@testset "non allowed values" begin
    @test_throws ArgumentError write(filename, convert.(Union{Missing,Float16}, ncarray); force=true)
end
```

What is the purpose of this test?
It's just to make sure there is a sensible ArgumentError rather than a random MethodError users don't understand. We need to say "can't write $T" explicitly.
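A minimal sketch of that guard (the function name is hypothetical; Float16 stands in for whatever element types the backend can't store):

```julia
# Throw a descriptive ArgumentError for element types the format cannot
# store, instead of letting an obscure MethodError surface deep in a write.
function check_eltype(T::Type)
    S = nonmissingtype(T)  # Missing itself is handled via the fill value
    S === Float16 && throw(ArgumentError("can't write $S to this format"))
    return T
end

check_eltype(Union{Missing,Float16})  # throws ArgumentError
```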
Then this is something that I would address in this PR.
Yeah, I'm not sure where it threw the ArgumentError before, but it looks like it did.
I've redone the current behaviour, and hopefully this now goes through all tests.
This is a first draft for opening Zarr data with ZarrDatasets in Rasters.
I just copied the approach in the GRIBDatasets extension into a new extension.
This would still need some tests.
When I load the same data with this branch or with YAXArrays I get the following dimensions, where `ds` is the data loaded with YAXArrays and `rs` is the data loaded with Rasters. This is the ESDC tiny cube, which is available at https://s3.bgc-jena.mpg.de:9000/esdl-esdc-v3.0.2/esdc-16d-2.5deg-46x72x1440-3.0.2.zarr, and there is a second example as well. I am not sure what the expected dimensions for these datasets are.
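For anyone reproducing the comparison, a sketch of the two loading paths (YAXArrays' open_dataset is its documented entry point; calling RasterStack on a Zarr URL assumes this branch's ZarrDatasets extension):

```julia
using YAXArrays, Rasters

url = "https://s3.bgc-jena.mpg.de:9000/esdl-esdc-v3.0.2/esdc-16d-2.5deg-46x72x1440-3.0.2.zarr"
ds = open_dataset(url)  # YAXArrays dataset
rs = RasterStack(url)   # Rasters, via the new ZarrDatasets extension
```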