
Defer to merge_chunks in special cases of rechunk #282

Closed
TomNicholas opened this issue Jul 31, 2023 · 4 comments · Fixed by #612

Comments

@TomNicholas
Member

#221 introduced merge_chunks, a special case of rechunk that can be implemented using blockwise. I noticed that whilst reduction calls merge_chunks directly, inside ops.rechunk the primitive rechunk is always called. Shouldn't it be possible for ops.rechunk to check whether the user is asking it to perform that special case, and internally dispatch to merge_chunks?

This also makes me wonder whether there are any other special cases of rechunk that could be written using blockwise.
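Detecting the special case amounts to checking that every target chunk boundary already exists as a source chunk boundary, so the rechunk only merges neighbouring chunks and never splits one. A minimal pure-Python sketch of such a check (the helper name `is_pure_merge` is hypothetical, not part of the cubed API):

```python
from itertools import accumulate


def is_pure_merge(source_chunks, target_chunks):
    """Return True if the rechunk from source_chunks to target_chunks
    only merges neighbouring chunks, i.e. every target chunk boundary
    coincides with a source chunk boundary in every dimension."""
    for src, tgt in zip(source_chunks, target_chunks):
        src_bounds = set(accumulate(src))
        tgt_bounds = set(accumulate(tgt))
        if not tgt_bounds <= src_bounds:
            return False
    return True
```

Under this check, `ops.rechunk` could dispatch to `merge_chunks` when `is_pure_merge` returns True and fall back to the primitive rechunk otherwise.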

@tomwhite
Member

tomwhite commented Aug 1, 2023

This should be possible, but I'm not sure how much this occurs in practice. The only calls to rechunk in Cubed are in reshape and from_array (for some cases).

@TomNicholas
Member Author

TomNicholas commented Aug 1, 2023

It will also happen if an xarray user calls .chunk on an already-chunked array, because it dispatches to cubed's rechunk method.

I agree it's not a very common case (though I expect it to come up in the full pangeo vorticity example where we pad then rechunk to merge the padded values back in).

@dcherian

dcherian commented Aug 1, 2023

> there are any other special cases of rechunk that could be written using blockwise.

> It will also happen if an xarray user calls .chunk on an already-chunked array

This is very confusing to me.

Isn't a rechunk without an on-disk intermediate, by definition, not "blockwise" (since you are communicating across chunks)? I thought the optimization was effectively optimizing chunking when reading from an intermediate store by looking at the chunks needed for the succeeding operation. But perhaps I'm misunderstanding something.

@TomNicholas
Member Author

TomNicholas commented Aug 1, 2023

merge_chunks is implemented using map_direct, which

> works by creating an empty array that has the same shape and chunk structure as the output, and calling map_blocks on this empty array, passing in the input arrays as side inputs to the function, which may access them in whatever way is needed.

That might resolve it for you @dcherian?
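The blockwise view is that each output chunk can be produced independently by reading a contiguous run of whole input chunks, with no communication between tasks. A minimal pure-Python sketch of that idea (plain lists standing in for array blocks; `merge_chunks_1d` is an illustrative name, not the cubed API):

```python
def merge_chunks_1d(source_blocks, target_chunks):
    """Build each target block by concatenating whole neighbouring
    source blocks. Each iteration is independent, which is what makes
    the operation expressible as a blockwise/map_direct computation."""
    out_blocks = []
    i = 0
    for tgt in target_chunks:
        run = []
        while len(run) < tgt:
            # read only the source blocks this output block needs
            run.extend(source_blocks[i])
            i += 1
        # boundaries must line up: merging never splits a source block
        assert len(run) == tgt, "target boundaries must align with source boundaries"
        out_blocks.append(run)
    return out_blocks
```

Because every output block depends only on a known slice of the input, no intermediate store is needed for this special case.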

Alternatively, the way I have been thinking about this merge_chunks operation is as just one half of what rechunker does. In Tom A's original suggestion that led to rechunker, he breaks general rechunking into a split pass and a merge pass. If you can accomplish a specific rechunk by doing only the merge pass, you don't need the intermediate store.

(This also suggests that an equivalent split_chunks might also be possible to implement using map_direct)
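The split/merge decomposition can be made concrete: the intermediate chunking is the one whose boundaries are the union of the source and target boundaries, so source → intermediate only splits and intermediate → target only merges. A small sketch under that reading (the helper name `intermediate_chunks` is hypothetical):

```python
from itertools import accumulate


def intermediate_chunks(source_chunks, target_chunks):
    """Per dimension, return the chunking whose boundaries are the
    union of source and target boundaries. Going source -> intermediate
    only splits chunks; intermediate -> target only merges them."""
    out = []
    for src, tgt in zip(source_chunks, target_chunks):
        bounds = sorted(set(accumulate(src)) | set(accumulate(tgt)))
        prev = 0
        dim = []
        for b in bounds:
            dim.append(b - prev)
            prev = b
        out.append(tuple(dim))
    return tuple(out)
```

When the target boundaries are already a subset of the source boundaries, the intermediate chunking equals the source chunking, i.e. the split pass is a no-op and only the merge pass remains.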
