Extremely long task graph times for resampling with replacement #764
Note that we have discussed a way to reimplement this without constructing the full (…)
So eventually, we want to use this algorithm for bootstrapping more iterations (up to 4k) and on geospatial grids, e.g. at least a 5x5 degree (36 lat x 72 lon) indices grid for smoothed ESM output, and hopefully also for high-res prediction simulations for some heavy lifting. Here's a demo of climpred in the pangeo cloud: https://github.com/aaronspring/climpred-cloud-demo
Many thanks for the report regarding (2), @bradyrx, and the nice examples. I'm not surprised there are some performance issues with …
Thanks @spencerkclark! The …
I'm working on (1) mainly, i.e. the base speed of our solution. Here is an updated notebook where I bootstrap the uninitialized ensemble and then compute a Pearson r correlation relative to observations: https://nbviewer.jupyter.org/gist/bradyrx/4b55dc8587333d721e8477ce4afb0a69. I couldn't get into the queue fast enough for multi-node workers, so I'm just using a small problem size on 1 core.

In the "old way", I use the fastest implementation from the originally posted notebook, as we do it currently in … In the "new way", I just verify each bootstrap initialization without building up the full mock dataset (…).

The fundamental issue is that we still get 20 s graph-building times here. I get this with 500 bootstrap iterations and 4 nodes of 36 workers each. I imagine this is because we're using …
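For illustration, here is a toy sketch (not climpred's actual code; dimension names, sizes, and chunking are invented) of why the concat-per-iteration pattern makes graph construction itself expensive: every iteration appends its own selection and concatenation tasks, so the graph grows roughly linearly with the number of iterations before any computation starts.

```python
import numpy as np
import xarray as xr

# toy lazy ensemble; dims and chunking are illustrative only
ds = xr.DataArray(
    np.random.rand(10, 50, 5), dims=("member", "init", "lead")
).chunk({"init": 10})

iterations = 500
n_member = ds.sizes["member"]

# "old way": one isel + one concat contribution per bootstrap iteration
resampled = xr.concat(
    [
        ds.isel(member=np.random.randint(0, n_member, n_member))
        for _ in range(iterations)
    ],
    dim="iteration",
)

# number of tasks scheduled before anything is computed
print(len(resampled.data.dask))
```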
I did a little profiling regarding (2), and it seems like the primary issue lies in cftime rather than xarray; here are a couple of minimal examples that are relevant to …

Creating a new …
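The original snippets were lost in this copy of the thread; below is a rough, assumed reconstruction of that kind of profiling (index length, calendar, and frequency are arbitrary choices), comparing `CFTimeIndex` construction and `shift` against the pandas equivalent.

```python
import timeit

import pandas as pd
import xarray as xr

# cftime-backed index; calendar and frequency chosen arbitrarily
cf_idx = xr.cftime_range("2000-01-01", periods=1000, freq="MS", calendar="noleap")
pd_idx = pd.date_range("2000-01-01", periods=1000, freq="MS")

# CFTimeIndex operations run in pure Python over cftime objects,
# so they are much slower than the vectorized pandas equivalents
print(timeit.timeit(lambda: xr.CFTimeIndex(cf_idx.values), number=100))
print(timeit.timeit(lambda: cf_idx.shift(1, "MS"), number=100))
print(timeit.timeit(lambda: pd_idx.shift(1, "MS"), number=100))
```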
Thanks so much for your work on this, @spencerkclark. This is great. Do you have any sense of where to start with profiling/speeding up …
I think a large bottleneck is in the sequence of code used to compute the …
I think it could be worth migrating this discussion to an issue in the …
Part of me wonders whether those attributes (…)
I just moved this discussion over to …

FYI, I fixed the task-graph-building issue. We were running `shift` at essentially every single bootstrap iteration, which was building a monster graph. I refactored things to use indexing to construct a single Dataset with a … But I still think this is an interesting and useful problem to address with …
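In other words, the idea was roughly the following (a hedged sketch, not the actual climpred diff; the helper name, the `time` coordinate, and the `lead_freq` argument are assumptions):

```python
import numpy as np
import xarray as xr


def bootstrap_shift_once(ds, iterations, lead_freq="YS", dim="member"):
    """Hypothetical helper: do the expensive CFTimeIndex.shift a single time,
    then build all iterations via cheap integer indexing along ``dim``."""
    # expensive part, done once outside the bootstrap loop
    shifted = ds.indexes["time"].shift(1, lead_freq)
    ds = ds.assign_coords(time=shifted)

    # cheap part: one (iterations, n_members) index array, one vectorized isel
    n = ds.sizes[dim]
    idx = xr.DataArray(
        np.random.randint(0, n, (iterations, n)), dims=("iteration", dim)
    )
    return ds.isel({dim: idx})
```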
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
In case anyone wonders how we dealt with this issue:

1.) get rid of cftime in …

2.) for not-too-large lazy data and for eager data, @ahuang11 and @bradyrx developed the index-based resampling below (a usage sketch follows after the second function):

```python
import dask
import numpy as np
import xarray as xr


def _resample_iterations_idx(init, iterations, dim='member', replace=True):
"""Resample over ``dim`` by index ``iterations`` times.
.. note::
This is a much faster way to bootstrap than resampling each iteration
individually and applying the function to it. However, this will create a
DataArray with dimension ``iteration`` of size ``iterations``. It is probably
best to do this out-of-memory with ``dask`` if you are doing a large number
of iterations or using spatial output (i.e., not time series data).
Args:
init (xr.DataArray, xr.Dataset): Initialized prediction ensemble.
iterations (int): Number of bootstrapping iterations.
dim (str): Dimension name to bootstrap over. Defaults to ``'member'``.
replace (bool): Bootstrapping with or without replacement. Defaults to ``True``.
Returns:
xr.DataArray, xr.Dataset: Bootstrapped data with additional dim ```iteration```
"""
if dask.is_dask_collection(init):
init = init.chunk({'lead':-1,'member':-1})
init = init.copy(deep=True)
def select_bootstrap_indices_ufunc(x, idx):
"""Selects multi-level indices ``idx`` from xarray object ``x`` for all
iterations."""
# `apply_ufunc` sometimes adds a singleton dimension on the end, so we squeeze
# it out here. This leverages multi-level indexing from numpy, so we can
# select a different set of, e.g., ensemble members for each iteration and
# construct one large DataArray with ``iterations`` as a dimension.
return np.moveaxis(x.squeeze()[idx.squeeze().transpose()], 0, -1)
# resample with or without replacement
if replace:
idx = np.random.randint(
0, init[dim].size, (iterations, init[dim].size))
elif not replace:
# create 2d np.arange()
idx = np.linspace(
(np.arange(init[dim].size)),
(np.arange(init[dim].size)),
iterations,
dtype='int',
)
# shuffle each line
for ndx in np.arange(iterations):
np.random.shuffle(idx[ndx])
idx_da = xr.DataArray(
idx,
dims=('iteration', dim),
coords=({'iteration': range(iterations), dim: init[dim]}),
)
return xr.apply_ufunc(
select_bootstrap_indices_ufunc,
init.transpose(dim, ...),# transpose_coords=False),
idx_da,
dask='parallelized',
output_dtypes=[float],
    )
```

This multi-index selection in numpy gives a massive speed-up in task graph building. However, for larger lazy data I run into memory issues (maybe because of in-place selection when the function is used multiple times; I don't know. If anyone has an idea why https://gist.github.com/aaronspring/665d69c3099b1f646a94b93072a6dfdd fails, ping me). Because computation also takes more time for larger data anyway, here we use the safer, concat-based version:

```python
import numpy as np
import xarray as xr

from climpred.constants import CONCAT_KWARGS


def _resample_iterations(init, iterations, dim='member', replace=True):
if replace:
idx = np.random.randint(
0, init[dim].size, (iterations, init[dim].size))
elif not replace:
# create 2d np.arange()
idx = np.linspace(
(np.arange(init[dim].size)),
(np.arange(init[dim].size)),
iterations,
dtype='int',
)
# shuffle each line
for ndx in np.arange(iterations):
np.random.shuffle(idx[ndx])
idx_da = xr.DataArray(
idx,
dims=('iteration', dim),
coords=({'iteration': range(iterations), dim: init[dim]}),
)
init_smp = []
for i in np.arange(iterations):
idx = idx_da.sel(iteration=i).data
init_smp2 = init.isel({dim: idx}).assign_coords({dim: init[dim].data})
init_smp.append(init_smp2)
init_smp = xr.concat(init_smp, dim='iteration',**CONCAT_KWARGS)
    return init_smp
```

Comparison of the two methods: https://gist.github.com/aaronspring/ff8c4b649fbc7230ace98cfc9f1043c8

Don't create more chunks than needed in the input of … But in …
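A usage sketch for the two functions above, on toy data (dimension names and sizes are invented; `climpred` must be installed for `CONCAT_KWARGS`; a small iteration count keeps the loop version quick):

```python
import numpy as np
import xarray as xr

init = xr.DataArray(
    np.random.rand(10, 20, 5),
    dims=("member", "init", "lead"),
    coords={"member": np.arange(10)},
    name="var",
).to_dataset()

# index-based: one vectorized selection, small task graph
boot_idx = _resample_iterations_idx(init, iterations=50)

# concat-based: larger graph, but gentler on memory for big lazy data
boot_loop = _resample_iterations(init, iterations=50)

# both gain an ``iteration`` dimension of size 50
print(boot_idx.dims, boot_loop.dims)
```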
We are implementing 3.): first resample, then do the calculation/metric/heavy lifting on the new dataset with dim … (see the sketch below).

Maybe this is of help to someone for their challenges of resampling with/without replacement. Thanks, stale bot, for the reminder. I think this issue can be closed, while still being interesting for reference.
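A sketch of what 3.) can look like once the resampled object carries an `iteration` dimension (the function name, metric, and quantiles are illustrative, not climpred's API; `xr.corr` expects DataArray inputs):

```python
import xarray as xr


def bootstrap_skill(bootstrapped, obs, dim="init", ci=(0.025, 0.975)):
    """Compute a metric once over the resampled object, vectorized across
    ``iteration``, then summarize the spread across iterations."""
    # Pearson correlation for every iteration at once, instead of a Python loop
    skill = xr.corr(bootstrapped, obs, dim=dim)
    # confidence interval across bootstrap iterations
    return skill.quantile(list(ci), dim="iteration")
```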
Thanks for updating everyone here, @aaronspring! |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
We chatted about this today on the pangeo call, and @aaronspring and I were encouraged to post our issue here for some help. We've hit a huge speed bottleneck in `climpred` (https://climpred.readthedocs.io/) in our bootstrapping module and are looking for some guidance.

Here is a notebook demonstrating the timing issues: https://nbviewer.jupyter.org/gist/bradyrx/8d77a45dea26480ef863fa1ca2dd4cce?flush_cache=true.
Application:

Resample with replacement a dataset of dimensions (`time`, `member`) into (`init`, `member`, `lead`). You take a prediction ensemble (`init`, `member`, `lead`) and randomly resample with replacement the members from your control/CESM-LE dataset (`time`, `member`), and align them into an (`init`, `member`, `lead`) framework. Then one can compute a metric (e.g. ACC, MSE, Brier score) on the reference ensemble to compare the initialized ensemble against. The bootstrapping iterations give a range of skill.
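A shape-level sketch of that resampling (sizes, the window-slicing scheme, and the single bootstrap draw are simplifying assumptions; climpred's actual alignment uses CFTimeIndex machinery):

```python
import numpy as np
import xarray as xr

n_time, n_member, n_init, n_lead = 60, 10, 50, 5

# control / large-ensemble data with dims (time, member)
control = xr.DataArray(np.random.rand(n_time, n_member), dims=("time", "member"))

# one bootstrap draw: members with replacement, then a lead window per init
members = np.random.randint(0, n_member, n_member)
mock = xr.concat(
    [
        control.isel(time=slice(i, i + n_lead), member=members).rename(time="lead")
        for i in range(n_init)
    ],
    dim="init",
)
print(mock.sizes)  # init: 50, lead: 5, member: 10
```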
Problems:

(1) In the minimal example it takes 8 seconds to build the task graph for 500 bootstrapping iterations (some papers use many thousand iterations over a full grid; we're just doing a single time-series location in this demo). Line profiling shows that most of the time is spent on `xr.concat` and on the list comprehension where we create a list of many individual datasets to concatenate.

(2) In the second case we add `cftime` indices instead of just integer temporal dimensions. We need these for `climpred` for a number of reasons, to handle datetime alignment of forecasts. Now 500 iterations take >1 minute to set up the task graph, but only 1 second to compute. `CFTimeIndex.shift()` is the huge speed bottleneck. This is out-of-the-box from `xarray`, so I might need to profile that and see if there are some inefficiencies there.

Any thoughts and suggestions on this would be helpful! We use `CFTimeIndex().shift` throughout the code base to handle alignment in various locations, so we definitely need a solution for (2) moving forward for `climpred` to be scalable.