
better bootstrapping #285

Merged
merged 34 commits on Feb 4, 2020

Conversation

aaronspring
Collaborator

@aaronspring aaronspring commented Dec 29, 2019

Description

  • Speed-up in the bootstrap functions (bootstrap_func for stats and bootstrap_compute for skill):
    • xr.quantile exchanged for dask.map_blocks(np.percentile)
    • lazy results are now handled properly when inputs are chunked
    • the user gets warned when chunking is potentially unnecessary (or potentially needed)
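As a rough sketch of the idea (illustrative only, not the PR's exact code; the apply_ufunc wiring and the helper name here are assumptions), a quantile over one dimension that stays lazy on chunked inputs can be built like this:

```python
import numpy as np
import xarray as xr

def _percentile_last_axis(arr, q):
    # xr.apply_ufunc moves the core dimension to the last axis
    return np.percentile(arr, q=q, axis=-1)

def my_quantile(da, dim="bootstrap", q=0.95):
    """Quantile over ``dim``; stays lazy on dask-chunked inputs."""
    return xr.apply_ufunc(
        _percentile_last_axis,
        da,
        input_core_dims=[[dim]],
        kwargs={"q": q * 100},  # np.percentile expects 0-100
        dask="parallelized",
        output_dtypes=[float],
    )

da = xr.DataArray(
    np.random.default_rng(0).normal(size=(500, 10)), dims=("bootstrap", "x")
)
result = my_quantile(da, "bootstrap", 0.95)
# on a chunked input, my_quantile(da.chunk({"x": 2}), ...) stays lazy
# until .compute() is called
```

With dask="parallelized", each spatial chunk is reduced independently, which is what makes the operation embarrassingly parallel in space.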

How I got there - 2 working gists:

Still to do:

  • notebook demonstrating effective chunking
  • generalize my_quantile with dim argument

Waiting for PRs to merge first:

Closes #145

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue): the slow speed of compute_skill before can be considered a bug
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

pytest

Checklist (while developing)

  • I have added docstrings to all new functions.
  • I have commented my code, particularly in hard-to-understand areas
  • Tests added for pytest, if necessary.
  • I have updated the sphinx documentation, if necessary.

Pre-Merge Checklist (final steps)

  • I have rebased onto master or develop (wherever I am merging) and dealt with any conflicts.
  • I have squashed commits to a reasonable amount, and force-pushed the squashed commits.

@aaronspring
Collaborator Author

Performance increase on my 2-core MacBook Pro 2018:

asv continuous master AS_fix_bootstrapping

       before           after         ratio
     [f2e95592]       [21f41994]
     <master>         <AS_fix_bootstrapping>
-       3.74±0.2s         463±10ms     0.12  benchmarks_perfect_model.Compute.time_bootstrap_perfect_model

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

This roughly 8x performance increase (ratio 0.12) comes from removing the unnecessary .compute() calls.
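As a generic illustration of the fix (plain dask, not climpred code; assumes dask is installed): an intermediate .compute() materializes a NumPy array and splits the pipeline into eager steps, while deferring a single .compute() keeps everything in one parallel task graph.

```python
import numpy as np
import dask.array as da

x = da.random.random((1000, 100), chunks=(100, 100))

# Eager: an intermediate .compute() materializes a NumPy array,
# so the following step can no longer fuse into the same task graph.
intermediate = x.mean(axis=1).compute()
eager = intermediate.sum()

# Lazy: keep one graph and evaluate once at the end.
lazy = x.mean(axis=1).sum().compute()
```

Both styles give the same number, but the lazy one lets dask schedule the whole computation at once.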

@pytest.mark.parametrize('chunk', [True, False])
def test_dask_percentile_implemented_faster_xr_quantile(control3d, chunk):
    chunk_dim, dim = 'x', 'time'
    # chunk_dim, dim = 'time', 'x'  # fails, why?
Collaborator Author

@aaronspring aaronspring Jan 2, 2020


my_quantile works as expected in bootstrap_compute because it is applied along the first dimension (bootstrap there, time here), but it fails on axis != 0 when NaNs are present.

This small demo runs without assertion errors:

# a.dims == ('time', 'lon', 'lat')
q = .05

chunk_dim, dim = 'lon', 'time'
ac = a.chunk({chunk_dim: 2})
acp = a.quantile(dim=dim, q=q)
acp.shape
%time acp2 = my_quantile(ac, dim, q).compute()
acp2.shape
xr.testing.assert_allclose(acp, acp2)
acl = ac.load()
%time acp = my_quantile(acl, dim, q).compute()
acp.shape
xr.testing.assert_allclose(acp, acp2)

chunk_dim, dim = 'time', 'lon'
acp = a.quantile(dim=dim, q=q)
acp.shape
ac = a.chunk({chunk_dim: 2})
%time acp2 = my_quantile(ac, dim, q).compute()
acp2.shape
xr.testing.assert_allclose(acp, acp2)
acl = ac.load()
%time acp = my_quantile(acl, dim, q).compute()
acp.shape
xr.testing.assert_allclose(acp, acp2)

Review threads: climpred/bootstrap.py (×3)
return np.percentile(arr, axis=axis, q=q)


def my_quantile(ds, dim='bootstrap', q=0.95):
Collaborator Author


See the tests. This implementation works well if dim is axis=0 and there are no missing values. I don't really get why the tests fail.

Collaborator


I don't have time to do this right now. Hopefully you'll be able to work this out when responding to comments. I can look another time after you incorporate the comments.

Review threads: climpred/comparisons.py, climpred/utils.py
@aaronspring aaronspring requested a review from bradyrx January 4, 2020 21:45
Collaborator

@bradyrx bradyrx left a comment


A number of refactoring/docstring changes. Going to rebase this and pull it down locally to test timing and the breaking tests.

Review threads: CHANGELOG.rst (×2), asv_bench/asv.conf.json, asv_bench/benchmarks/benchmarks_perfect_model.py, climpred/bootstrap.py (×2), climpred/utils.py (×3)
@bradyrx
Collaborator

bradyrx commented Jan 12, 2020

So I've mentioned this in individual comments, but right now this just encourages/enables dask to be used to chunk the original object. That deals with the fact that it's embarrassingly parallel in space: we are computing the same quantities at each grid cell and they don't depend on each other. So for sufficiently large datasets, this speeds things up. That's why you get a speedup with compute_perfect_model with dask; it's one iteration, but we're chunking in space.

However, this doesn't address the embarrassingly parallel iteration problem at all. I think we need to leverage something more advanced like multiprocessing or something from dask, maybe map_blocks or something similar. @ahuang11 posted a suggested solution for how to parallelize computing multiple things at once, for instance.

Andrew asked this here... it may be subtle and we just need to wrap the iterations? I'll test on a small example locally with a different function. https://groups.google.com/forum/#!searchin/xarray/andrew$20huang%7Csort:date/xarray/KBwllWmMY30/v98z03lcAgAJ
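Wrapping the iterations could look something like this dask.delayed sketch (illustrative assumption only; one_iteration is a hypothetical stand-in for a single bootstrap member, not climpred's actual loop body):

```python
import numpy as np
import dask

def one_iteration(data, seed):
    # hypothetical single bootstrap member: resample with replacement, reduce
    rng = np.random.default_rng(seed)
    sample = rng.choice(data, size=data.size, replace=True)
    return sample.mean()

data = np.random.default_rng(0).normal(size=1_000)
tasks = [dask.delayed(one_iteration)(data, seed) for seed in range(100)]
results = np.array(dask.compute(*tasks))  # members evaluated in parallel
```

Each delayed task is independent, so a distributed Client (or the default threaded scheduler) can spread the 100 members across workers.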

But my stats were:

Compute function

Macbook Pro:

  • No chunking: 24.8s
  • Chunking (with Client): 11.4s

Casper with 36 cores/workers:

  • No chunking: 15.7s
  • Chunking: 17.9s

Bootstrapping

Macbook Pro:

  • Chunking: 2min29s

Casper:

  • Chunking: 2min57s

Collaborator

@bradyrx bradyrx left a comment


This looks like a lot of comments, but each of them should only take a few seconds. I was really nitpicky here because I think it's important for us to commit really clean code with good docstrings, etc., so we don't have to worry as much about refactoring later (and so it's easy on new contributors).

Review threads: HOWTOCONTRIBUTE.rst (×5), climpred/tests/test_bootstrap.py (×3), climpred/utils.py (×2)
@bradyrx
Collaborator

bradyrx commented Jan 29, 2020

This PR is also ready, except that I need to look into rebasing. Basically I implemented the requests. my_quantile works for dim='bootstrap' being axis=0, which is always the case after concat. I see my_quantile rather as a placeholder until xr.quantile is daskified, and therefore this is OK as is.

My benchmark tests are failing, but I think it's because it needs to be rebased. Is xr.quantile not daskified at the highest level? As in, the xarray folks haven't implemented this?

@aaronspring
Collaborator Author

nope

@aaronspring
Collaborator Author

Somehow docs fail in travis but not in rtd check

@bradyrx
Collaborator

bradyrx commented Jan 30, 2020

Somehow docs fail in travis but not in rtd check

@aaronspring , this is because RTD check just makes sure that they can build. The notebook that broke is pre-compiled so that RTD doesn't need to use the memory to compile it. Travis wipes the notebook clean and tries to compile it to see if we have any breaking code. Looks like there's some breaking code in there that needs to be fixed.

I can look back over this PR later today

@bradyrx
Collaborator

bradyrx commented Jan 30, 2020

Two things that need to be fixed for travis:

  1. flake8:

climpred/tests/test_checks.py:20:1: E303 too many blank lines (3)

  2. notebook:
CellExecutionError in examples/subseasonal/weekly-subx-example.ipynb:

------------------
fcstds=decode_cf(fcstds,'init')
fcstds['init']=pd.to_datetime(fcstds.init.values.astype(str))
fcstds['init']=pd.to_datetime(fcstds['init'].dt.strftime('%Y%m%d 00:00'))
------------------

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-b3735a2f029c> in <module>
      1 fcstds=decode_cf(fcstds,'init')
      2 fcstds['init']=pd.to_datetime(fcstds.init.values.astype(str))
----> 3 fcstds['init']=pd.to_datetime(fcstds['init'].dt.strftime('%Y%m%d 00:00'))

~/miniconda/envs/climpred-dev/lib/python3.6/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
    736             result = convert_listlike(arg, format)
    737     elif is_list_like(arg):
--> 738         cache_array = _maybe_cache(arg, format, cache, convert_listlike)
    739         if not cache_array.empty:
    740             result = _convert_and_box_cache(arg, cache_array)

~/miniconda/envs/climpred-dev/lib/python3.6/site-packages/pandas/core/tools/datetimes.py in _maybe_cache(arg, format, cache, convert_listlike)
    145     if cache:
    146         # Perform a quicker unique check
--> 147         if not should_cache(arg):
    148             return cache_array
    149 

~/miniconda/envs/climpred-dev/lib/python3.6/site-packages/pandas/core/tools/datetimes.py in should_cache(arg, unique_share, check_count)
    114     assert 0 < unique_share < 1, "unique_share must be in next bounds: (0; 1)"
    115 
--> 116     unique_elements = set(islice(arg, check_count))
    117     if len(unique_elements) > check_count * unique_share:
    118         do_caching = False

TypeError: unhashable type: 'DataArray'

@aaronspring
Collaborator Author

xr.quantile is now daskified in xarray 0.15.

Do you understand the notebook error? Haven’t touched that...

@bradyrx
Collaborator

bradyrx commented Jan 31, 2020

xr.quantile is now daskified in xarray 0.15.

Great.
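For reference, the built-in quantile can then be called directly (the array names here are just for illustration; per the comment above, xarray >= 0.15 also supports this on dask-backed arrays):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.random.default_rng(1).normal(size=(100, 8)), dims=("bootstrap", "x")
)
# with xarray >= 0.15 this also works lazily on da.chunk({"x": 2})
q95 = da.quantile(0.95, dim="bootstrap")
```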

Do you understand the notebook error? Haven’t touched that...

I just tried reproducing it locally and it ran fine, so I re-triggered the travis build to see if it was just a weird error. I'll follow up after the build is done...

@bradyrx
Collaborator

bradyrx commented Jan 31, 2020

That didn't work. @aaronspring, try replacing the cell:

fcstds=decode_cf(fcstds,'init')
fcstds['init']=pd.to_datetime(fcstds.init.values.astype(str))
fcstds['init']=pd.to_datetime(fcstds['init'].dt.strftime('%Y%m%d 00:00'))

with:

fcstds=decode_cf(fcstds,'init')

I'm not convinced those pd.to_datetime lines are needed.

@aaronspring
Collaborator Author

This now passes, fixed with pandas==0.25.3. But there is a bug in bootstrap_hindcast for probabilistic metrics, which I only found because of asv: #317

@bradyrx
Collaborator

bradyrx commented Feb 3, 2020

Thanks @aaronspring! We can keep #317 for another PR. Probably something we will talk about during OSM/your week here.

I'll review/merge later today or tomorrow. Have some deadlines to meet today.

@aaronspring
Collaborator Author

aaronspring commented Feb 3, 2020

The implementation of my_quantile is a bit dirty, but it is tested for the dim=first_axis case and gives an incredible speedup, see https://github.com/bradyrx/climpred/issues/316

I would rather implement this sneaky new my_quantile than not have this performance increase.

@bradyrx
Collaborator

bradyrx commented Feb 4, 2020

This is great @aaronspring, thanks so much. Merging now!

@bradyrx bradyrx merged commit 5310e82 into pangeo-data:master Feb 4, 2020
Development

Successfully merging this pull request may close these issues: parallelize bootstrap
3 participants