Automatic Dask-Zarr chunk alignment on the to_zarr method #9914

Open
josephnowak opened this issue Dec 21, 2024 · 2 comments

josephnowak commented Dec 21, 2024

Is your feature request related to a problem?

In the time I have used Xarray, I have seen many issues related to the alignment between Dask and Zarr chunks. Most of them come from misunderstanding how to achieve a many-to-one relation between Zarr chunks and Dask chunks (each Zarr chunk written by a single Dask chunk). For that reason, I would like to propose a new feature that aligns the chunks automatically. This should significantly reduce the number of issues related to this problem and simplify how Xarray, Dask, and Zarr interact from the user's perspective.
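
To make the alignment condition concrete, here is a minimal sketch in plain Python. The helper name `is_aligned` is hypothetical (it is not part of Xarray, Dask, or Zarr), and it assumes a fresh write along one dimension that starts at offset 0, so region or append writes are not covered.

```python
from itertools import accumulate


def is_aligned(dask_chunks, zarr_chunk):
    """Return True if no Zarr chunk is touched by more than one Dask chunk.

    Hypothetical helper: assumes one dimension and a write starting at offset 0.
    """
    # Positions where one Dask chunk ends and the next begins (array edge excluded).
    boundaries = list(accumulate(dask_chunks))[:-1]
    # Every interior boundary must fall on the Zarr chunk grid.
    return all(b % zarr_chunk == 0 for b in boundaries)


print(is_aligned((4, 4, 2), 2))  # True: boundaries at 4 and 8
print(is_aligned((3, 3, 4), 2))  # False: the boundary at 3 splits a Zarr chunk
```

In the second case the Zarr chunk covering positions 2 to 3 would be written by two different Dask tasks, which is the situation that can corrupt data in a parallel write.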

Describe the solution you'd like

Option A: Add a new align chunks parameter.

Pros:

  • It would not break the current behavior of the method.
  • It gives the users control over whether they want the automatic alignment.

Cons:

  • It adds an extra parameter to a method that, from my perspective, is already complex.
  • We would have to add extra validations to the code and docs. For example, if automatic chunk alignment is enabled and a synchronizer is also present, does it make sense to realign the chunks? If safe_chunks is enabled together with automatic alignment, should we raise an error or give priority to the chunk alignment? This also makes me wonder what happens if someone passes a synchronizer while safe_chunks is enabled; I have not seen this specified in the docs.

Option B: Drop the safe_chunks parameter and always align the chunks when necessary.

Pros:

  • It reduces the number of parameters in the to_zarr method.
  • It simplifies the user experience: users could ignore the difference between Dask and Zarr chunks, which should reduce the number of issues reported on this topic. It is important to highlight that if the synchronizer is set, the automatic alignment of the chunks should be disabled, which means that a synchronizer would work like safe_chunks = False. I think this is a better workflow and prevents data corruption in all scenarios.
  • It removes the possibility of corrupting the data, which I think is more important than the performance impact.

Cons:

  • It would be a breaking change.
  • The automatic chunk alignment could increase the number of tasks in Dask or unexpectedly affect performance. A possible mitigation for this is to raise a warning indicating that the chunks were unaligned.
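
As a rough illustration of what option B could do, the sketch below snaps Dask chunk boundaries onto the Zarr chunk grid, skips the realignment when a synchronizer is given, and emits the warning suggested as a mitigation. Both function names (`align_chunks`, `chunks_for_write`) are hypothetical, and the snapping strategy is only one possible choice under the assumption of a single dimension written from offset 0; it is not taken from Xarray's code.

```python
import warnings


def align_chunks(dask_chunks, zarr_chunk):
    """Snap interior Dask chunk boundaries down to multiples of zarr_chunk."""
    total = sum(dask_chunks)
    aligned = []
    start = 0  # where the current aligned chunk begins
    pos = 0    # running end position of the original Dask chunks
    for size in dask_chunks:
        pos += size
        # The final boundary is the array edge, so it stays where it is.
        end = pos if pos == total else (pos // zarr_chunk) * zarr_chunk
        if end > start:
            aligned.append(end - start)
            start = end
    return tuple(aligned)


def chunks_for_write(dask_chunks, zarr_chunk, synchronizer=None):
    """Hypothetical decision logic for option B."""
    if synchronizer is not None:
        # The proposal treats a synchronizer like safe_chunks=False: no realignment.
        return tuple(dask_chunks)
    aligned = align_chunks(dask_chunks, zarr_chunk)
    if aligned != tuple(dask_chunks):
        warnings.warn("Dask chunks were realigned to the Zarr chunk grid")
    return aligned


print(align_chunks((3, 3, 4), 2))  # (2, 4, 4)
print(chunks_for_write((3, 3, 4), 2, synchronizer=object()))  # (3, 3, 4)
```

The realigned tuple keeps the same total length but can merge or resize chunks, which is where the extra Dask tasks or the performance change mentioned above would come from.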

I think option B is the best for a good part of the users: it prevents data corruption and simplifies the workflow. There are probably some use cases I'm not taking into consideration that would explain why the safe_chunks parameter was added in the first place.

Any feedback or different ideas are welcome.

Describe alternatives you've considered

No response

Additional context

No response

josephnowak commented

Hi @max-sixty, here is the proposal based on the discussion of this issue. If anyone has another approach, idea, or change, it would be good to discuss it so we can implement the best possible option.

josephnowak changed the title from "Automatic chunk alignment on the to_zarr method" to "Automatic Dask-Zarr chunk alignment on the to_zarr method" on Dec 21, 2024

pjpetersik commented

I really like this enhancement (either option A or B).

After upgrading to xarray>=2024.10, I also got the ValueErrors that are discussed in #9767. Only after digging through this and other issues for a while did I find out that using a synchronizer together with safe_chunks=False solves my problem. Speaking from the user's perspective, either of the proposed enhancements would probably have saved me quite some time and trouble.
