We currently save the on-disk chunk sizes in the database, and tell xarray to use these when loading data. @jmunroe pointed out that these chunks can be quite small, leading to a large number of tasks in the graph for many computations. Some thoughts:
- There are many tasks, but this is less of a problem than it used to be: task overhead has been significantly reduced in dask.
- Perhaps this problem should be pushed up to the model layer, so that models don't write such small chunks.
- Perhaps we could defer to xarray with auto chunk sizes. (I'm not sure whether this takes the on-disk chunk boundaries into account; if it doesn't, it would lead to rechunking churn.)
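To put rough numbers on the task-count concern, here is a minimal sketch that counts how many chunks tile an array. For a simple elementwise operation, the number of tasks scales roughly with this count. The array shape and both chunk layouts are made-up illustrations, not values taken from the database.

```python
import math

def count_chunks(shape, chunks):
    """Number of blocks of size `chunks` needed to tile an array of `shape`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

# Hypothetical one-year daily field with dimensions (time, depth, lat, lon).
shape = (365, 75, 2700, 3600)
small_chunks = (1, 7, 300, 400)     # assumed small on-disk chunks
large_chunks = (1, 75, 2700, 3600)  # assumed one chunk per time step

n_small = count_chunks(shape, small_chunks)
n_large = count_chunks(shape, large_chunks)
print(f"small chunks: {n_small:,}  large chunks: {n_large:,}")
```

With these assumed numbers, the small layout produces several hundred times more chunks than the large one, which is the kind of graph growth described above.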
Discussing chunking with xarray and netCDF gets confusing because there are two kinds: on-disk (netCDF file level) chunking and xarray (dask) chunking. The netCDF chunk size will have some optimum based on Lustre, which I don't know or understand at pretty much any level beyond hand-waving.
The xarray (dask) chunking is what affects the size of the task graph. Currently the cookbook library automatically matches the dask chunking to the on-disk chunking, because it is difficult to think of a better default.
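One possible alternative default, sketched here purely as an illustration, is to grow the dask chunks in whole multiples of the on-disk chunks until they approach a target size. Each dask chunk then still lines up with on-disk chunk boundaries. The function name `aligned_chunks`, the 128 MiB target, and the example shape are all assumptions; this is not what the cookbook currently does.

```python
import math

def aligned_chunks(shape, disk_chunks, itemsize, target_bytes):
    """Grow chunks in whole multiples of the on-disk chunks, up to about target_bytes."""
    chunks = [min(c, s) for c, s in zip(disk_chunks, shape)]
    # Grow the last (fastest-varying) dimensions first.
    for dim in reversed(range(len(shape))):
        # Bytes in a slab one element thick along `dim`.
        slab_bytes = math.prod(chunks) // chunks[dim] * itemsize
        max_len = target_bytes // slab_bytes
        multiple = max(1, max_len // chunks[dim])
        # A chunk spanning the whole dimension is still aligned.
        chunks[dim] = min(shape[dim], multiple * chunks[dim])
    return tuple(chunks)

# Hypothetical float32 field with dimensions (time, depth, lat, lon).
shape = (365, 75, 2700, 3600)
disk_chunks = (1, 7, 300, 400)  # assumed on-disk chunking
dask_chunks = aligned_chunks(shape, disk_chunks, itemsize=4, target_bytes=128 * 2**20)
print(dask_chunks)
```

Growing the trailing dimensions first is an arbitrary choice here; an analysis that works along time would want the opposite order.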
It does make some sense for the netCDF chunk size to be smaller than what might be optimal for xarray calculations. Different calculations might prefer different xarray chunking along select dimensions (think time series analysis vs spatial analysis). If the netCDF chunk size were large in all dimensions, this would lead to a lot of unnecessary IO, unless dask/xarray does some nifty caching or can re-order ops based on IO access patterns. I know less than nothing about that.
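A back-of-the-envelope sketch of the unnecessary IO argument: if whole on-disk chunks have to be read, the bytes fetched depend on how many chunks a selection touches, not on how many values it needs. The shape, chunk layouts, and float32 item size below are made up, and the model ignores any caching or compression.

```python
import math

def bytes_read(chunks, selection, itemsize=4):
    """Bytes fetched from disk when whole chunks must be read.

    `selection` is the length of a contiguous selection along each
    dimension, assumed to start on a chunk boundary.
    """
    chunks_touched = math.prod(math.ceil(n / c) for n, c in zip(selection, chunks))
    return chunks_touched * math.prod(chunks) * itemsize

# Hypothetical (time, lat, lon) float32 array of shape (3650, 2700, 3600).
small_chunks = (1, 300, 400)      # small on-disk chunks
large_chunks = (365, 2700, 3600)  # large in all dimensions

point_series = (3650, 1, 1)  # full time series at one grid point
snapshot = (1, 2700, 3600)   # full map at one time step

for name, sel in [("point time series", point_series), ("single snapshot", snapshot)]:
    useful = math.prod(sel) * 4
    for label, chunks in [("small", small_chunks), ("large", large_chunks)]:
        total = bytes_read(chunks, sel)
        print(f"{name}, {label} chunks: read {total:,} bytes for {useful:,} useful bytes")
```

In this toy setup the large layout forces a read of the entire array for the point time series, while the small layout reads the snapshot with no waste at all.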
Looking into the interplay of on-disk/dask chunking and auto chunking is a great use case for standard analyses (cf. #210).