-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DiskArrays.jl compatability #8
Comments
Something like that happens in GRIBDatasets. The |
But say we want to use the properties of the Variable, like the CF standards it applies. Does broadcasting still work as if its a DiskArray? Or do we have to choose one or the other? (Note that lack of DiskArray compat in NCDatasets.jl will at some stage mean Rasters.jl will switch to using NetCDF.jl as a backend - I already get issues about this) We are planning on sharing more code in Rasters.jl and YAXarrays.jl - DiskArrays and DiskArrayEngine.jl compat is going to central to that. It would be best if all of these things could work together going forward, or we will have to work on the alternate CF standards implentation in DiskArrayTools.jl and wrap everything with that. |
Some variables in CommonDataModel.jl can actually exists only in memory (like the coordinate variable of GRIBDatasets) or a virtual array (whose elements are computed on the fly, using e.g. the coordinate transformation). I don't think that we should assume that all variables are based on disk. But I agree that it will be the most common case and that it should work nicely with DiskArrays. I still plan to implement DiskArrays in NCDatasets.jl, just want to have know from @meggart if the current behavior (related to unlimited dimensions) can be relied upon or if this will change as you suggested in the issue (but you know that issue :-)) I think that the approach in GRIBDatasets is good because only GRIBDatasets knows (and needs to know?) if an array is on disk or in memory (or virtually defined by a function). For example GRIBDatasets can also attach new methods to CommonDataModel.CFVariable which should not interfere to other packages: using GRIBDatasets
using CommonDataModel
function foobar(v::CommonDataModel.CFVariable{T,N,TV}) where TV <: GRIBDatasets.AbstractGRIBVariable where {T,N}
return 42
end
filename = joinpath(dirname(pathof(GRIBDatasets)),"..","test","sample-data","era5-levels-members.grib")
ds = GRIBDatasets.Dataset(filename)
foobar(ds["z"])
# 42 But maybe this leads to a lot of boilerplate code than should be better centralized?
Can you point me to the macros in questions? Would this result in a dependency to DiskArrays in CommonDataModel ? |
Currently I dont think GRIBDatasets.jl diskarrays will work on variables. (just from reading the code I may be wrong) The key problem is forwarding the iteration and broadcast machinery to the parent DiskArray. Otherwise variables will use julia fallbacks of calling I also saw that you were looking to share a In Rasters.jl/DimensionalData.jl there is extensive machinery to defer broadcast to the parent object, so it doesnt have to be a DiskArray to work. Thats another option here, but much more code. For the other problem of if a Variable is not actually holding a disk array: it can still work too, we can handle that behaviour in methods on the Variable object by checking the parent array type where needed. As for NCDatasets.jl and DiskArrays.jl, saying we can never do a bounds check on But, we could totally guarantee that you can always override Is that acceptable? |
Yes, I think that should be ok. After all NetCDF.jl handels unlimited dimension too. The bounds check will just be deactivated for unlimited dimension. As I am not too familiar with DiskArrays.jl, I think it would be best to first finish the integration with NCDatasets.jl to gain a bit of familiarity with it. |
I'm trying to catch up the previous discussions about implementing DiskArrays in NCDatasets, but I still need to see the big picture. In the meantime, I tried to make |
Ok great, I'm pretty excited to have it working everywhere. Feel free to make issues or ping me here or on slack if you have any questions at all, as it's a pretty complicated interface (and note the main docs are not building currently either but the dev docs work) . There are component macros to apply subsets of it or the whole interface, or you can inherit from |
I think we have been talking past each other to some extent, and it would be good to solidify our understanding of the requirements So I want to attempt to clarify a few outstanding questions that maybe I didn't answer clearly before: In #8 (comment)
That's actually fine - disk arrays don't have to be on disk, they just can be. In this case you would just set the chunking to
Any object that inherits from You don't lose control of anything, you just get fallback methods that work for disk based arrays. For growing arrays, a custom From #8 (comment)
It actually doesn't handel them, it just allows writes - the chunks are broken afterwards because they are not updated anywhere: So it also needs a custom Having broken chunks means broadcasting, iteration and other things can do strange things because we assume the chunks along an axis match the size of the array. |
Any hope we can make this happen? The DiskArrays compatiability in lower level GRIBDatasets.jl and NCDatasets.jl objects is not available to most users currently as it wont propagate through the variables defined here. |
Is there any progress on the issues mentioned here? Another thing that makes we wonder what could be the right approach is the comment from @meggart that DiskArrays.jl is tailored with Zarr in-mind and DiskArrays.jl make some choices that are not good when dealing with NetCDF files like loading a complete chunk when only a single element is requested. I guess we would run into the same issue when a variable is just a virtual array (whose element are computed on the fly, like in TIFFDatasets.jl) or in memory. I am wondering if it would make sense to separate the interface (like AbstractDiskArray.jl) from the implementation which can be tailored to Zarr files. |
Ok I will have a look at the PR. Your comments about zarr are not true as far as I know. ArchGDAL.jl and NetCDF.jl have used DiskArray.jl for years now, and HDF5.jl will likely also use it soon: JuliaIO/HDF5.jl#930 If there are specific things we should change to optimise netcdf please make an issue. We should definately fix those small data loads to take advantage of NetCDF caching. But also be aware you can already add any methods you want for you own types to change the implementation however you like, so I'm not sure that separation into a new package is needed. Also note there will always be some things that are worse in DiskArrays.jl than in your original implementation, its hard to make a perfect generalisaion. But broadcasts, reductions and iteration will actually work. There are some huge wins from having that. |
It would be good if DiskArrays.jl chunking, broadcasting, and iteration capabilities could propagate through any
AbstractVarable
here.They don't have to be
AbstractDiskArray
because there are macros in DiskArrays.jl that can be applied to the types.The text was updated successfully, but these errors were encountered: