
Defer to merge_chunks in special cases of rechunk #282

Closed
TomNicholas opened this issue Jul 31, 2023 · 4 comments · Fixed by #612

Comments

@TomNicholas
Member

#221 introduced merge_chunks, a special case of rechunk that can be implemented using blockwise. I noticed that whilst reduction calls merge_chunks directly, inside ops.rechunk the primitive rechunk is always called. Shouldn't it be possible for ops.rechunk to check whether the user is asking it to perform that special case, and internally dispatch to merge_chunks?

This also makes me wonder whether there are any other special cases of rechunk that could be written using blockwise.
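Detecting the special case amounts to checking that every target chunk boundary already exists as a source chunk boundary, so the rechunk only merges neighbouring chunks and never splits one. A minimal pure-Python sketch of such a check (the helper name `is_pure_merge` is hypothetical, not part of the cubed API):

```python
from itertools import accumulate


def is_pure_merge(source_chunks, target_chunks):
    """Return True if the rechunk from source_chunks to target_chunks
    only merges neighbouring chunks, i.e. every target chunk boundary
    coincides with a source chunk boundary in every dimension."""
    for src, tgt in zip(source_chunks, target_chunks):
        src_bounds = set(accumulate(src))
        tgt_bounds = set(accumulate(tgt))
        if not tgt_bounds <= src_bounds:
            return False
    return True
```

Under this check, `ops.rechunk` could dispatch to `merge_chunks` when `is_pure_merge` returns True and fall back to the primitive rechunk otherwise.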

@tomwhite
Member

tomwhite commented Aug 1, 2023

This should be possible, but I'm not sure how much this occurs in practice. The only calls to rechunk in Cubed are in reshape and from_array (for some cases).

@TomNicholas
Member Author

TomNicholas commented Aug 1, 2023

It will also happen if an xarray user calls .chunk on an already-chunked array, because it dispatches to cubed's rechunk method.

I agree it's not a very common case (though I expect it to come up in the full pangeo vorticity example where we pad then rechunk to merge the padded values back in).

@dcherian

dcherian commented Aug 1, 2023

> there are any other special cases of rechunk that could be written using blockwise.

> It will also happen if an xarray user calls .chunk on an already-chunked array

This is very confusing to me.

Isn't a rechunk without an on-disk intermediate, by definition, not "blockwise" (since you are communicating across chunks)? I thought the optimization was effectively optimizing chunking when reading from an intermediate store by looking at the chunks needed for the succeeding operation. But perhaps I'm misunderstanding something.

@TomNicholas
Member Author

TomNicholas commented Aug 1, 2023

merge_chunks is implemented using map_direct, which

> works by creating an empty array that has the same shape and chunk structure as the output, and calling map_blocks on this empty array, passing in the input arrays as side inputs to the function, which may access them in whatever way is needed.

That might resolve it for you @dcherian?
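The blockwise view is that each output chunk can be produced independently by reading a contiguous run of whole input chunks, with no communication between tasks. A minimal pure-Python sketch of that idea (plain lists standing in for array blocks; `merge_chunks_1d` is an illustrative name, not the cubed API):

```python
def merge_chunks_1d(source_blocks, target_chunks):
    """Build each target block by concatenating whole neighbouring
    source blocks. Each iteration is independent, which is what makes
    the operation expressible as a blockwise/map_direct computation."""
    out_blocks = []
    i = 0
    for tgt in target_chunks:
        run = []
        while len(run) < tgt:
            # read only the source blocks this output block needs
            run.extend(source_blocks[i])
            i += 1
        # boundaries must line up: merging never splits a source block
        assert len(run) == tgt, "target boundaries must align with source boundaries"
        out_blocks.append(run)
    return out_blocks
```

Because every output block depends only on a known slice of the input, no intermediate store is needed for this special case.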

Alternatively, the way I have been thinking about this merge_chunks operation is as just one half of what rechunker does. In Tom A's original suggestion that led to rechunker, he breaks general rechunking into a split pass and a merge pass. If you can accomplish a specific rechunk by doing only the merge pass, you don't need the intermediate store.

(This also suggests that an equivalent split_chunks might also be possible to implement using map_direct)
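The split/merge decomposition can be made concrete: the intermediate chunking is the one whose boundaries are the union of the source and target boundaries, so source → intermediate only splits and intermediate → target only merges. A small sketch under that reading (the helper name `intermediate_chunks` is hypothetical):

```python
from itertools import accumulate


def intermediate_chunks(source_chunks, target_chunks):
    """Per dimension, return the chunking whose boundaries are the
    union of source and target boundaries. Going source -> intermediate
    only splits chunks; intermediate -> target only merges them."""
    out = []
    for src, tgt in zip(source_chunks, target_chunks):
        bounds = sorted(set(accumulate(src)) | set(accumulate(tgt)))
        prev = 0
        dim = []
        for b in bounds:
            dim.append(b - prev)
            prev = b
        out.append(tuple(dim))
    return tuple(out)
```

When the target boundaries are already a subset of the source boundaries, the intermediate chunking equals the source chunking, i.e. the split pass is a no-op and only the merge pass remains.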
