Option to prevent automatic rechunking? #7711

Closed · TomNicholas opened this issue May 26, 2021 · 6 comments

@TomNicholas

In xhistogram #57 I'm trying to test a blockwise-based algorithm for various chunk shapes, and I'm finding that in my test suite dask changes my tests by automatically rechunking my arrays and issuing a PerformanceWarning:

  /home/tegn500/Documents/Work/Code/xhistogram/xhistogram/core.py:334: 
  PerformanceWarning: Increasing number of chunks by factor of 100
    bin_counts = dsa.blockwise(

I would prefer that dask not override me like this: in a test suite I care much more that the tests run exactly as I specify than I care about performance.
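
To illustrate the kind of situation I mean (a contrived sketch, not the actual xhistogram code), aligning two arrays that are chunked along different axes blows up the chunk count and emits a similar warning while the graph is being built:

    import numpy as np
    import dask.array as da

    # Two arrays chunked along different axes, so aligning them
    # multiplies the number of chunks by roughly 100.
    x = da.ones((1000, 1000), chunks=(10, 1000))   # 100 chunks
    y = da.ones((1000, 1000), chunks=(1000, 10))   # 100 chunks

    # Constructing this graph triggers a PerformanceWarning like the one above.
    z = da.blockwise(np.add, 'ij', x, 'ij', y, 'ij', dtype=x.dtype)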

Is there a global option to prevent this? My dask.config.config dictionary looks like this:

{'version': 1,
 'temporary-directory': None,
 'dataframe': {'shuffle-compression': None},
 'array': {'svg': {'size': 120}, 'slicing': {'split-large-chunks': None}},
 'optimization': {'fuse': {'active': None,
   'ave-width': 1,
   'max-width': None,
   'max-height': inf,
   'max-depth-new-edges': None,
   'subgraphs': None,
   'rename-keys': True}}}

but I'm not sure if any of the options in the configuration reference will affect this.

It's hard for me to know whether my tests are failing because of this or not. Some of my tests are failing, and when dask automatically changes the test setup as it runs, I don't really know how to debug them. blockwise dispatches to code we wrote, so it's plausible that the automatic rechunking is causing my test failures by switching from a chunking pattern that passes to one that fails.

The only issue I've seen that seems related is #4763.

@jrbourbeau (Member) commented May 26, 2021

Thanks for raising an issue @TomNicholas. By default, blockwise will attempt to align the chunks of its input arrays:

    if align_arrays:
        chunkss, arrays = unify_chunks(*args)

This is one place where rechunking happens. To avoid it, you can pass align_arrays=False in your call to blockwise.
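
For example (a minimal sketch, not the xhistogram code), as long as the input chunks already line up, this skips the unification step entirely:

    import numpy as np
    import dask.array as da

    # Inputs whose chunk boundaries already match, so skipping
    # unification is safe here.
    x = da.ones(10, chunks=5)
    y = da.ones(10, chunks=5)

    # align_arrays=False means blockwise does not call unify_chunks on
    # its inputs, so no automatic rechunking (and no PerformanceWarning).
    z = da.blockwise(np.add, 'i', x, 'i', y, 'i',
                     dtype=x.dtype, align_arrays=False)
    z.compute()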

@mrocklin (Member) commented May 26, 2021 via email

@TomNicholas (Author)

Thanks both. align_arrays=False silences the warning for me, which is great.

> Given the phrasing of your comment my guess is that you think Dask is doing something extra / magical here. It isn't.

@mrocklin you're right - I completely misinterpreted the warning message to mean "[Dask is] Increasing number of chunks", rather than "[This operation will] Increase number of chunks". As a user it's not obvious to me that dask wouldn't magically do something like that, so I personally think the latter message would be a bit clearer, but I'll close this issue now.

@gjoseph92 (Collaborator)

Also to clarify, unify_chunks (align_arrays=True) is not the same thing as rechunk proper. It does an important job, and just turning it off without adding some other logic is likely not the right approach; in fact, that will probably cause more problems.

When the input arrays have different chunk patterns, to do a chunk-by-chunk operation, you have to do some sort of re-chunking to make the chunks line up with each other. Otherwise, you'll be trying to operate between NumPy arrays of different shapes. Consider two 1D dask arrays of the same shape, but with different chunk patterns:

A: |--------|----|--|
B: |--|--------|----|

(A, 0): |--------|
(B, 0): |--|
^ these NumPy arrays are different lengths! you can't use them together.

(A, 1): |----|
(B, 1): |--------|
...

When you do a blockwise operation between A and B, that means you want to use block 0 of A with block 0 of B, block 1 of A with block 1 of B, etc. So you need those corresponding blocks to match in shape.

What unify_chunks does is sort of pick a "least common denominator" chunk pattern, where at each point that any of the inputs have a chunk split, it adds the split for all of them. (@mrocklin or @jrbourbeau correct me if I'm wrong in this summary of the unify_chunks algorithm.) This is not necessarily the best algorithm performance-wise—in some cases, maybe it would be better to sometimes combine chunks rather than always splitting them—but it's simple and correct. Improving this is the idea in #4763.

After unification, corresponding chunks of the arrays will now always be the same length, and therefore interoperable:

             A : |--------|----|--|
             B : |--|--------|----|
unify_chunks(A): |--|-----|--|-|--|
unify_chunks(B): |--|-----|--|-|--|

(unify_chunks(A), 0): |--|
(unify_chunks(B), 0): |--|

(unify_chunks(A), 1): |-----|
(unify_chunks(B), 1): |-----|
...
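
You can see this directly with unify_chunks (a small sketch; the chunk sizes are made up to roughly match the diagram above):

    import dask.array as da
    from dask.array.core import unify_chunks

    a = da.ones(14, chunks=((8, 4, 2),))
    b = da.ones(14, chunks=((2, 8, 4),))

    # Every chunk boundary from either input shows up in the result.
    chunkss, (a2, b2) = unify_chunks(a, 'i', b, 'i')
    print(chunkss)    # {'i': (2, 6, 2, 2, 2)}
    print(a2.chunks)  # ((2, 6, 2, 2, 2),)
    print(b2.chunks)  # ((2, 6, 2, 2, 2),)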

So—if you're going to set align_arrays=False, it now becomes your job (instead of Dask's) to make sure all the chunks are the same length. You could implement a different algorithm for aligning the chunks that performs better for your use-case (likely involving calling .rechunk with some parameters you compute), but in the end, there's no getting around the fact that the chunks have to be aligned somehow, which entails some rechunking.
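
As a minimal sketch of that "do it yourself" route (just one possible strategy, not a recommendation for xhistogram specifically), you could rechunk one input to the other's pattern and then skip alignment:

    import numpy as np
    import dask.array as da

    a = da.ones(14, chunks=((8, 4, 2),))
    b = da.ones(14, chunks=((2, 8, 4),))

    # With align_arrays=False, lining the blocks up is the caller's job;
    # here we simply force b onto a's chunk pattern.
    b = b.rechunk(a.chunks)

    out = da.blockwise(np.add, 'i', a, 'i', b, 'i',
                       dtype=a.dtype, align_arrays=False)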

@TomNicholas (Author)

Thank you for that clarification @gjoseph92 , that's extremely helpful.

It does make me wonder how my test input ended up with unaligned chunks in the first place, but that's something to discuss in xgcm/xhistogram#57 rather than here, I guess.

@quasiben (Member) commented May 26, 2021

@gjoseph92 your comment is fantastic! I'm not sure where this should go immediately, but I think we should find a space in the docs to capture those clarifying thoughts long term.
