cannot use rechunker starting from 0.4.0
#92
Comments
Thanks for reporting this. Indeed it looks like 6cc0f26 (part of #77) caused this problem.

@TomAugspurger - this sounds very similar to the problems we are having in Pangeo Forge, which we are attempting to fix in pangeo-forge/pangeo-forge-recipes#153. The basic theory is that the way we are building dask graphs post #77 results in objects that are much too large.

Aurelien, if you wanted to help debug, you could run the same steps that Tom ran in pangeo-forge/pangeo-forge-recipes#116 (comment) and report the results back here.

From my point of view, we could simply revert #77. The main point of that PR was to have a representation of pipelines we could share between rechunker and pangeo-forge-recipes. But since we are likely abandoning the rechunker dependency in pangeo-forge-recipes, this no longer seems necessary.
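For reference, here is a minimal sketch (not from the original comment) of the kind of size measurement being asked for: build the rechunk plan on a toy source with many small chunks and see how large it is once serialized. Store paths, shapes, and chunk sizes are placeholders, and plain `pickle` may need to be swapped for `cloudpickle` depending on what the plan contains.

```python
# Sketch of measuring how large the serialized rechunk plan is, as a rough
# proxy for what the scheduler has to ingest before any tasks can run.
# Store paths, shapes, and chunk sizes are placeholders.
import pickle  # swap for cloudpickle if the plan does not pickle cleanly

import zarr
from rechunker import rechunk

# A source with many small chunks, similar to the problematic use cases
source = zarr.ones((4000, 4000), chunks=(1, 4000), store="source.zarr")

plan = rechunk(
    source,
    target_chunks=(4000, 1),
    max_mem="100MB",
    target_store="target.zarr",
    temp_store="temp.zarr",
)

# Compare this number between 0.3.3 and 0.4.x
print("serialized plan size (bytes):", len(pickle.dumps(plan)))
```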
Here is the result with version 0.3.3:
and with the latest version (73ef80b):
Please let me know if I need to adjust anything; you can find the full code in the following notebook.
Ok, that definitely confirms my suspicion that the way we are generating dask graphs is not efficient. 🤦 Looks like we will have to revert quite a few changes. @apatlpo - how urgent is this problem for you? Can you just keep using 0.3.0 for now?
Absolutely no rush on my side, I am perfectly happy with 0.3.3 (sorry, corrected typo in earlier post), thanks for your concern.
I just stumbled across this, having a similar use case. For now, using 0.3.3 seems to be fine for me as well.
I also wanted to second this - I was having problems rechunking a variable that had many small chunks. Dask would get stuck in the first few moments of trying to rechunk. Reverting to 0.3.3 solved this issue for me.
+1 on this issue. Reverting to 0.3.3 seems to solve it, though unrelatedly I'm getting workers using 2-3x the specified max_mem. That being said, anecdotally I seem to see workflows using more memory than expected with recent updates to dask/distributed (starting around 2021.6, I believe), so it might be unrelated to rechunker. Going to do some more investigating.
I haven't had a chance to look at 6cc0f26, but pangeo-forge/pangeo-forge-recipes#116 was determined to be some kind of serialization problem and was fixed by pangeo-forge/pangeo-forge-recipes#160. Just to confirm: when people say "using more memory", is that on the workers or the scheduler?
@TomAugspurger for me it's the workers, and it's usually just one or two workers that quickly balloon up to that size while the remainder seem to respect max_mem. Let me see if I can come up with an example (might put it in a different issue since it's separate from the main thread of this one).
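Not from the thread, but one quick way to separate the two cases Tom asks about would be something along these lines; the scheduler address is a placeholder and `psutil` is assumed to be available on the cluster.

```python
# Rough sketch (not from the thread): report resident memory for the
# scheduler and every worker, to tell which side is ballooning.
import psutil
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address


def rss_mb():
    """Resident set size of the current process, in MB."""
    return psutil.Process().memory_info().rss / 1e6


print("scheduler (MB):", client.run_on_scheduler(rss_mb))
print("workers (MB):  ", client.run(rss_mb))
```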
Apologies for the slow pace here. I've been on vacation much of the past month.

@TomAugspurger - I'd like to chart a path towards resolving this. As discussed in today's Pangeo Forge meeting, this issue has implications for Pangeo Forge architecture. We have to decide whether to maintain the Pipelines abstraction. I'd personally like to keep the Pipelines framework and even break it out into its own package. But that only makes sense if we can get it working properly within rechunker first.

I understand that you (Tom) probably don't have lots of time to dig into this right now. If that's the case, it would be great if you could at least help outline a work plan to get to the bottom of this issue. We have several other developers (e.g. me, @cisaacstern, @alxmrs, @TomNicholas) who could potentially be working on this. But your input would be really helpful to get started.

In pangeo-forge/pangeo-forge-recipes#116 (comment), Tom made a nice diagnosis of the serializability issues in Pangeo Forge recipes. Perhaps we could do the same here. We would need a minimum reproducible example for this problem; a sketch of what that could look like is below.
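Purely as an assumption of what such a reproducer might look like (not an agreed-upon example from the thread): scale the number of source chunks and time how long it takes to build and execute the plan, since the reported symptom is that nothing reaches the dashboard while the scheduler spins.

```python
# Hypothetical skeleton for a minimum reproducible example. Shapes, chunk
# counts, and store paths are placeholders; the point is to watch how plan
# construction and execution time scale with the number of chunks.
import time

import zarr
from dask.distributed import Client, LocalCluster
from rechunker import rechunk

client = Client(LocalCluster(n_workers=4))

for n in (100, 1_000, 10_000):
    source = zarr.ones((n, 1000), chunks=(1, 1000), store=f"source_{n}.zarr")

    t0 = time.time()
    plan = rechunk(
        source,
        target_chunks=(n, 1),
        max_mem="100MB",
        target_store=f"target_{n}.zarr",
        temp_store=f"temp_{n}.zarr",
    )
    t1 = time.time()
    plan.execute()
    t2 = time.time()
    print(f"{n:>6} chunks: build {t1 - t0:6.1f}s, execute {t2 - t1:6.1f}s")
```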
+1 & thanks!
@jmccreight - would you be able to provide any more details? Were you using Xarray or Zarr inputs? Which executor? Code would be even better. Thanks so much for your help.
Hi @rabernat! The quick overview:
I just hit this again -- one of my USGS colleagues had
Sorry for all the friction everyone! This package needs some maintenance. In the meantime, should we just pull the 0.4.2 release from PyPI?
I'm not the OP, but I had a similar issue. I tried the latest version on master and the rechunking went smoothly.
Was this fixed in recent releases / can we close?
I apologize in advance for posting an issue that may be incomplete.
After a recent library update I am no longer able to use rechunker for my use case.
This is on an HPC platform.
Symptoms are that nothing happens on the dask dashboard when launching the actual rechunking with `execute`. Using the `top` command on the scheduler node indicates 100% CPU usage and slowly increasing memory usage. On the other hand, action takes place right away on the dask dashboard with the older version of rechunker (0.3.3).
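The full notebook is linked later in the thread rather than reproduced here; purely as an illustration of the call pattern described above, something along these lines, with placeholder stores, chunk sizes, and scheduler address:

```python
# Illustrative sketch only -- not the reporter's actual notebook. Placeholder
# stores, chunk sizes, and scheduler address.
import zarr
from dask.distributed import Client
from rechunker import rechunk

client = Client("tcp://scheduler-address:8786")  # dask cluster on the HPC system

source = zarr.open("input.zarr", mode="r")  # existing source array

plan = rechunk(
    source,
    target_chunks=(8760, 100, 100),  # placeholder target chunking
    max_mem="2GB",
    target_store="output.zarr",
    temp_store="temp.zarr",
)

# With rechunker 0.3.3, tasks appear on the dashboard as soon as this runs;
# with 0.4.0 and later the scheduler reportedly sits at 100% CPU instead.
plan.execute()
```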
git bisecting versions indicates that the problem first appears with commit 6cc0f26.
I am not sure what I could do in order to investigate further and would welcome suggestions.
Output of xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.12.53-60.30-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4
xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.06.0
distributed: 2021.06.0
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: None
pytest: None
IPython: 7.24.1
sphinx: None