Option to prevent automatic rechunking? #7711

Closed · TomNicholas opened this issue May 26, 2021 · 6 comments

@TomNicholas

In xhistogram #57 I'm trying to test a blockwise-based algorithm for various chunk shapes, and I'm finding that in my test suite dask changes my tests by automatically rechunking my arrays and issuing a PerformanceWarning:

  /home/tegn500/Documents/Work/Code/xhistogram/xhistogram/core.py:334: 
  PerformanceWarning: Increasing number of chunks by factor of 100
    bin_counts = dsa.blockwise(

I would prefer that dask not override me like this: in a test suite I care much more that the tests run exactly as I specify than I care about performance.
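
To illustrate the kind of situation I mean (a contrived sketch, not the actual xhistogram code), aligning two arrays that are chunked along different axes blows up the chunk count and emits a similar warning while the graph is being built:

    import numpy as np
    import dask.array as da

    # Two arrays chunked along different axes, so aligning them
    # multiplies the number of chunks by roughly 100.
    x = da.ones((1000, 1000), chunks=(10, 1000))   # 100 chunks
    y = da.ones((1000, 1000), chunks=(1000, 10))   # 100 chunks

    # Constructing this graph triggers a PerformanceWarning like the one above.
    z = da.blockwise(np.add, 'ij', x, 'ij', y, 'ij', dtype=x.dtype)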

Is there a global option to prevent this? My dask.config.config dictionary looks like this:

{'version': 1,
 'temporary-directory': None,
 'dataframe': {'shuffle-compression': None},
 'array': {'svg': {'size': 120}, 'slicing': {'split-large-chunks': None}},
 'optimization': {'fuse': {'active': None,
   'ave-width': 1,
   'max-width': None,
   'max-height': inf,
   'max-depth-new-edges': None,
   'subgraphs': None,
   'rename-keys': True}}}

but I'm not sure if any of the options in the configuration reference will affect this.

It's hard for me to know whether my tests are failing because of this or not. Some of my tests are failing, and when dask automatically changes the test setup as it runs, I don't really know how to debug them. blockwise dispatches to code we wrote, so it's plausible that the automatic rechunking is causing my test failures by switching from a chunking pattern that passes to one that fails.

The only issue I've seen that seems related is #4763.

@jrbourbeau (Member) commented May 26, 2021

Thanks for raising an issue @TomNicholas. By default, blockwise will attempt to align the chunks of its input arrays:

    if align_arrays:
        chunkss, arrays = unify_chunks(*args)

This is one place where rechunking happens. To avoid it, you can pass align_arrays=False in your call to blockwise.
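
For example (a minimal sketch, not the xhistogram code), as long as the input chunks already line up, this skips the unification step entirely:

    import numpy as np
    import dask.array as da

    # Inputs whose chunk boundaries already match, so skipping
    # unification is safe here.
    x = da.ones(10, chunks=5)
    y = da.ones(10, chunks=5)

    # align_arrays=False means blockwise does not call unify_chunks on
    # its inputs, so no automatic rechunking (and no PerformanceWarning).
    z = da.blockwise(np.add, 'i', x, 'i', y, 'i',
                     dtype=x.dtype, align_arrays=False)
    z.compute()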

@mrocklin (Member) commented May 26, 2021 via email

@TomNicholas (Author)

Thanks both. align_arrays=False silences the warning for me, which is great.

> Given the phrasing of your comment my guess is that you think Dask is doing something extra / magical here. It isn't.

@mrocklin you're right - I completely misinterpreted the warning message to mean "[Dask is] Increasing number of chunks", rather than "[This operation will] Increase number of chunks". As a user it's not obvious to me that dask wouldn't magically do something like that, so I personally think the latter message would be a bit clearer, but I'll close this issue now.

@gjoseph92 (Collaborator)

Also to clarify, unify_chunks (align_arrays=True) is not the same thing as rechunk proper. It does an important job, and just turning it off without adding some other logic is likely not the right approach; in fact, that will probably cause more problems.

When the input arrays have different chunk patterns, to do a chunk-by-chunk operation, you have to do some sort of re-chunking to make the chunks line up with each other. Otherwise, you'll be trying to operate between NumPy arrays of different shapes. Consider two 1D dask arrays of the same shape, but with different chunk patterns:

A: |--------|----|--|
B: |--|--------|----|

(A, 0): |--------|
(B, 0): |--|
^ these NumPy arrays are different lengths! you can't use them together.

(A, 1): |----|
(B, 1): |--------|
...

When you do a blockwise operation between A and B, that means you want to use block 0 of A with block 0 of B, block 1 of A with block 1 of B, etc. So you need those corresponding blocks to match in shape.

What unify_chunks does is sort of pick a "least common denominator" chunk pattern, where at each point that any of the inputs have a chunk split, it adds the split for all of them. (@mrocklin or @jrbourbeau correct me if I'm wrong in this summary of the unify_chunks algorithm.) This is not necessarily the best algorithm performance-wise—in some cases, maybe it would be better to sometimes combine chunks rather than always splitting them—but it's simple and correct. Improving this is the idea in #4763.

After unification, corresponding chunks of the arrays will now always be the same length, and therefore interoperable:

             A : |--------|----|--|
             B : |--|--------|----|
unify_chunks(A): |--|-----|--|-|--|
unify_chunks(B): |--|-----|--|-|--|

(unify_chunks(A), 0): |--|
(unify_chunks(B), 0): |--|

(unify_chunks(A), 1): |-----|
(unify_chunks(B), 1): |-----|
...
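
You can see this directly with unify_chunks (a small sketch; the chunk sizes are made up to roughly match the diagram above):

    import dask.array as da
    from dask.array.core import unify_chunks

    a = da.ones(14, chunks=((8, 4, 2),))
    b = da.ones(14, chunks=((2, 8, 4),))

    # Every chunk boundary from either input shows up in the result.
    chunkss, (a2, b2) = unify_chunks(a, 'i', b, 'i')
    print(chunkss)    # {'i': (2, 6, 2, 2, 2)}
    print(a2.chunks)  # ((2, 6, 2, 2, 2),)
    print(b2.chunks)  # ((2, 6, 2, 2, 2),)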

So—if you're going to set align_arrays=False, it now becomes your job (instead of Dask's) to make sure all the chunks are the same length. You could implement a different algorithm for aligning the chunks that performs better for your use-case (likely involving calling .rechunk with some parameters you compute), but in the end, there's no getting around the fact that the chunks have to be aligned somehow, which entails some rechunking.
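
As a minimal sketch of that "do it yourself" route (just one possible strategy, not a recommendation for xhistogram specifically), you could rechunk one input to the other's pattern and then skip alignment:

    import numpy as np
    import dask.array as da

    a = da.ones(14, chunks=((8, 4, 2),))
    b = da.ones(14, chunks=((2, 8, 4),))

    # With align_arrays=False, lining the blocks up is the caller's job;
    # here we simply force b onto a's chunk pattern.
    b = b.rechunk(a.chunks)

    out = da.blockwise(np.add, 'i', a, 'i', b, 'i',
                       dtype=a.dtype, align_arrays=False)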

@TomNicholas (Author)

Thank you for that clarification @gjoseph92 , that's extremely helpful.

It does make me wonder how my test input ended up with unaligned chunks in the first place, but that's something to discuss in xgcm/xhistogram#57 rather than here, I guess.

@quasiben (Member) commented May 26, 2021

@gjoseph92 your comment is fantastic! I'm not sure where this should go immediately, but I think we should find a space in the docs to capture those clarifying thoughts long term.
