
Chunk control #3361

Merged: 19 commits, Aug 23, 2019

Conversation

pp-mo
Member

@pp-mo pp-mo commented Jul 25, 2019

Revised chunking policy, which will mostly affect loading from netcdf files.
Key change: allows multiplying chunks up as well as dividing them, mainly to fix #3357.

Closes #3357, #3362

@pp-mo
Member Author

pp-mo commented Jul 25, 2019

I think I've finally got what I want out of this now.
Good to go, @bjlittle? 🙏

@pp-mo pp-mo requested a review from bjlittle July 25, 2019 17:35
@pp-mo
Member Author

pp-mo commented Jul 26, 2019

Whoops. ⏰ ⚠️ 💣
Sorry guys, I didn't mean to do that ...

Hopefully fixed.

@pp-mo
Member Author

pp-mo commented Jul 26, 2019

Evidence that this fixes the original problem in #3357, running the sample test code from there.
"Before" timings (release v2.2.1):

Duration 25.450 s
Duration 26.748 s
Duration 27.222 s

"After" examples :

Duration 0.219 s
Duration 0.222 s
Duration 0.232 s

@tkarna

tkarna commented Jul 26, 2019

Thanks! I confirm that this fixes issue #3357.

@bjlittle
Member

@pp-mo This is also applicable to #3362

@@ -23,12 +23,14 @@
from __future__ import (absolute_import, division, print_function)
from six.moves import (filter, input, map, range, zip) # noqa

from collections import Iterable
Member

@pp-mo See #3320 for context.

Currently, in Python 3.7, you get the following DeprecationWarning:

>>> from collections import Iterable
__main__:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working

Could you adopt the following pattern:

try:  # Python 3
    from collections.abc import Iterable
except ImportError:  # Python 2
    from collections import Iterable

Member Author

Looks good, will do!

@lbdreyer lbdreyer self-assigned this Aug 19, 2019
@pp-mo
Member Author

pp-mo commented Aug 20, 2019

@lbdreyer I think this is only failing due to numpy 1.17, so following #3369 a rebase should get a clean pass.
Are you OK with that, or are you still writing review comments against the existing commits?

Member

@lbdreyer lbdreyer left a comment


Overall looks good!

Just a few questions...

# Fetch the default 'optimal' chunksize from the dask config.
limit = dask.config.get('array.chunk-size')
# Convert to bytes
limit = da.core.parse_bytes(limit)
Member

Why did you choose to get parse_bytes from da.core rather than from dask.utils?

Member Author

@pp-mo pp-mo Aug 20, 2019

I think I copied this from the dask source code somewhere.
I will fix it.
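For reference, a minimal sketch of the suggested change (assuming the same dask config lookup as in the diff above; dask.utils is the usual home of parse_bytes):

import dask.array as da   # ensures the 'array' config defaults are registered
import dask.config
from dask.utils import parse_bytes

# Fetch the default 'optimal' chunksize from the dask config (e.g. "128MiB"),
# then convert it to a number of bytes.
limit = parse_bytes(dask.config.get('array.chunk-size'))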


# Create result chunks, starting with a copy of the input.
result = list(chunks)
if shape is None:
Member

When would shape be None? I don't think we should be allowing shape=None. You also iterate through shape on line 105.

Member Author

I think I was trying to mimic the API of dask.array.core.normalize_chunks, in case we can use that in the future.
Actually I haven't achieved that, and we never use this option, so I will remove it.

@@ -511,7 +511,7 @@ def _get_cf_var_data(cf_var, filename):
proxy = NetCDFDataProxy(cf_var.shape, dtype, filename, cf_var.cf_name,
fill_value)
chunks = cf_var.cf_data.chunking()
- # Chunks can be an iterable, None, or `'contiguous'`.
+ # Chunks can be an iterable, or `'contiguous'`.
Member

I don't understand this change. You have removed None and yet two lines down it sets

chunks = None

Member Author

There are two different sets of conventions here.

  • The 'chunks' value that nc.Variable.data_chunking returns cannot be None, I believe.
    • I don't know quite why it ever said that: it just seems wrong to me.
  • The 'chunks' keyword we pass to as_lazy_data can be None, and it absolutely can't be 'contiguous', which is why we are converting here.

I will try to clarify in the comments.
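For clarity, a rough sketch of how _get_cf_var_data handles the two conventions (paraphrased from the diff above; the exact surrounding code, including the final as_lazy_data call, is an assumption):

chunks = cf_var.cf_data.chunking()
# The netCDF4 chunking() call returns either a list of chunk sizes or the
# string 'contiguous', but never None.
if chunks == 'contiguous':
    # as_lazy_data understands None as "no chunking hint", so translate.
    chunks = None
return as_lazy_data(proxy, chunks=chunks)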

lib/iris/_lazy_data.py
# Return chunks unchanged, for types of invocation we don't comprehend.
if (any(elem <= 0 for elem in shape) or
        not isinstance(chunks, Iterable) or
        len(chunks) != len(shape)):
Member

You don't have explicit tests for this check.
I'm not sure how thorough we want to be with testing this. It does seem like a bit of overkill to add tests for
_optimum_chunksize((200,300), (1,200,300))

Member Author

This was an attempt to allow alternative, dask-type chunks arguments in 'as_lazy_data'.
Obviously we don't use anything like that at present. The only immediate need is to skip shapes with a zero in them (see comment).

I now see this is wrong anyway, as that initial test clause assumes that shape is iterable!
I will simplify to just what we need, and add a testcase.

Member Author

... after some thought, I have moved this check out to the caller and documented the behaviour there.
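Roughly, the idea is something like the following (a sketch only, not the exact code in this PR; the function body shown here is an assumption):

import dask.array as da


def as_lazy_data(data, chunks=None, asarray=False):
    if chunks is None:
        # Default to the full data shape as the starting chunk.
        chunks = data.shape
    # Skip the chunk optimisation entirely for zero-length dimensions:
    # there is then no data to divide (or multiply) up.
    if all(dim > 0 for dim in data.shape):
        chunks = _optimum_chunksize(chunks, data.shape)
    return da.from_array(data, chunks=chunks, asarray=asarray)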

lib/iris/_lazy_data.py (outdated)
@pp-mo
Member Author

pp-mo commented Aug 20, 2019

Attempted to address review comments.
Please re-review @lbdreyer .

Note: I didn't make any changes to address the comment about the size of divided chunks.
We can still discuss that if wanted.

limitcall_patch = self.patch('iris._lazy_data._optimum_chunksize')
test_shape = (2, 1, 0, 2)
data = self._dummydata(test_shape)
result = as_lazy_data(data, chunks=test_shape)


F841 local variable 'result' is assigned to but never used
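One minimal way to address the flake8 warning (a hypothetical fix, not necessarily the change actually made) is to use the result in an assertion:

data = self._dummydata(test_shape)
result = as_lazy_data(data, chunks=test_shape)
# Use the result, so flake8 no longer reports F841 for it.
self.assertEqual(result.shape, test_shape)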

@pp-mo
Member Author

pp-mo commented Aug 21, 2019

Thanks @lbdreyer, this fixes the outstanding errors.

I'm happy that those other uses of as_lazy_data(.. chunks=X ..) in the tests are all now unnecessary, and that we don't need to support any more complex usages. We accept that this 'chunks' keyword is not much like the dask one, and that is OK: I hope the new docstring reflects this.

Regarding the "lost" test,
test_as_lazy_data.Test__optimised_chunks.test_large_specific_chunk_passthrough
I think that is now obsolete, because the earlier idea was that a specified chunks= should always pass through unchanged to the dask call, but that is definitely no longer the case.
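As a usage illustration (a hypothetical example, not taken from the Iris tests): the 'chunks' argument is only a starting-point hint, which the chunk optimisation may then enlarge or reduce.

import numpy as np
from iris._lazy_data import as_lazy_data

data = np.zeros((117, 300, 200), dtype=np.float32)
# A chunking hint of single slices over the first dimension ...
lazy = as_lazy_data(data, chunks=(1, 300, 200))
# ... may come back multiplied up into fewer, larger chunks.
print(lazy.chunks)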

@pp-mo
Member Author

pp-mo commented Aug 21, 2019

Meanwhile, though ...

I'm still wondering about your comment "This does end with an array that is unequally split into chunks"
There could still be an improvement to be made here, but I'm not yet clear exactly how it should work.

So far, a good practical example ...

>>> _optimum_chunksize((3, 300, 200), (117, 300, 2000))
(54, 300, 2000)

(given the current default dask chunk size, which is 128 MiB)
this will result in 3 chunks with first-dimension sizes (54, 54, 9),
but you "could have had" (39, 39, 39) instead.
I will investigate how that can be calculated, taking account of what happens in slightly-smaller and slightly-larger cases (i.e. ones that don't divide equally).
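For instance, one possible way of calculating the more balanced split (a sketch of the idea only, not the code eventually merged):

import math

# Given a dimension length and the largest chunk that fits the byte limit,
# pick the smallest number of chunks that stays under the limit, then divide
# the dimension between them as evenly as possible.
def balanced_chunk(dim_length, max_chunk):
    n_chunks = math.ceil(dim_length / max_chunk)
    return math.ceil(dim_length / n_chunks)

# The example above: a maximum of 54 along a dimension of length 117 ...
print(balanced_chunk(117, 54))   # --> 39, i.e. chunks of (39, 39, 39)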

To be continued ...

@pp-mo
Member Author

pp-mo commented Aug 21, 2019

Interested? @cpelley

@pp-mo
Member Author

pp-mo commented Aug 22, 2019

Hi @lbdreyer
Finally, I think I made sense of the point you raised above, regarding a better choice of chunk sizes.
Hope this makes sense; sorry it has proved so complicated to explain and test!
Please re-review ...

@pp-mo
Member Author

pp-mo commented Aug 23, 2019

Hi again @lbdreyer
As per our offline discussion, I have now simplified the testing a bit.

@lbdreyer
Member

This is a really great change! 💯

Thanks for persisting with it @pp-mo!

@lbdreyer lbdreyer merged commit f402a19 into SciTools:master Aug 23, 2019
@pp-mo pp-mo deleted the chunk_control branch October 17, 2019 10:49
Linked issue: Reading netcdf files is slow if there are unlimited dimensions