
Possibly non-optimal chunk sizes? #213

Open · angus-g opened this issue Sep 1, 2020 · 2 comments

@angus-g (Collaborator) commented Sep 1, 2020

We currently save the on-disk chunk sizes in the database and tell xarray to use these when loading data. @jmunroe pointed out that these chunks can be quite small, leading to a large number of tasks in the graph for many computations. Some thoughts:

  • there are many tasks, but this is less of a problem than it used to be -- task overhead has been significantly reduced in dask
  • perhaps this problem should be pushed up to the model layer, so the models don't write such small chunks
  • perhaps we could defer to xarray with auto chunk sizes (I'm not sure whether this takes the on-disk chunk boundaries into account; if it doesn't, it would lead to rechunking churn) -- see the sketch below
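For reference, here is what the two approaches look like when opening a file. This is a minimal sketch: the file name, variable name, and dimension names are all placeholders.

```python
# Minimal sketch: explicit chunks vs. auto chunking in xarray.
# "ocean.nc", "temp" and the dimension names are placeholders.
import xarray as xr

# Explicit chunks, e.g. the on-disk sizes stored in the database:
ds_explicit = xr.open_dataset(
    "ocean.nc", chunks={"time": 1, "yt_ocean": 300, "xt_ocean": 360}
)

# Let dask/xarray choose chunk sizes targeting a sensible size in bytes:
ds_auto = xr.open_dataset("ocean.nc", chunks="auto")

# Compare the resulting dask chunking and the number of chunks (= read tasks):
print(ds_explicit["temp"].chunks, ds_explicit["temp"].data.npartitions)
print(ds_auto["temp"].chunks, ds_auto["temp"].data.npartitions)
```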
@aidanheerdegen (Collaborator) commented
From an email thread with @jmunroe and @aekiss

Discussing chunking with xarray and netCDF is confusing because there are two levels: on-disk (netCDF file-level) chunking and xarray (dask) chunking. The netCDF chunk size presumably has some optimum determined by Lustre, which I don't understand at any level beyond hand-waving.

The xarray (dask) chunking is what affects the size of the task graph. Currently the cookbook library automatically matches the dask chunking to the on-disk chunking, because it is difficult to think of a better default.
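For concreteness, here is a sketch of what matching dask chunks to the on-disk chunking can look like. This is not the cookbook's actual implementation; the file and variable names are placeholders.

```python
# Sketch: derive dask chunks from a variable's on-disk (HDF5) chunking.
import netCDF4
import xarray as xr

path, varname = "ocean.nc", "temp"  # placeholders

with netCDF4.Dataset(path) as nc:
    var = nc.variables[varname]
    storage = var.chunking()  # "contiguous", or a list of chunk lengths
    if storage == "contiguous":
        chunks = None  # no on-disk chunking to match; load without dask
    else:
        chunks = dict(zip(var.dimensions, storage))

ds = xr.open_dataset(path, chunks=chunks)
print(ds[varname].chunks)
```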

It does make some sense for the netCDF chunk size to be smaller than the optimum size for xarray calculations: different calculations want different xarray chunking along particular dimensions (think time-series analysis vs. spatial analysis), and if the netCDF chunks were large in all dimensions then either access pattern would trigger a lot of unnecessary IO, unless dask/xarray does some nifty caching or can re-order operations based on IO access patterns. I know less than nothing about that.
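To illustrate the two access patterns, a sketch of rechunking the same dataset two ways (dimension and file names are again placeholders):

```python
# Sketch: rechunking one dataset for two different analyses.
import xarray as xr

ds = xr.open_dataset("ocean.nc", chunks={"time": 1})  # placeholder file

# Time-series analysis: each chunk spans the whole time axis, so
# operations along time stay within a single chunk per grid patch.
ts = ds.chunk({"time": -1, "yt_ocean": 100, "xt_ocean": 100})

# Spatial analysis: one time level per chunk, full horizontal extent,
# so maps and spatial reductions touch few chunks.
spatial = ds.chunk({"time": 1, "yt_ocean": -1, "xt_ocean": -1})
```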

Looking into the interplay of on-disk/dask chunking and auto chunking is a great use case for the standard analyses, cf. #210.

@aidanheerdegen (Collaborator) commented

Related #194
