
cannot use rechunker starting from 0.4.0 #92

Open
apatlpo opened this issue Jun 8, 2021 · 18 comments · Fixed by #112

@apatlpo
Member

apatlpo commented Jun 8, 2021

I apologize in advance for posting an issue that may be incomplete.

After a recent library update I am no longer able to use rechunker for my use case.
This is on an HPC platform.

Symptoms are that nothing happens on the dask dashboard when launching the actual rechunking with execute(). Using the top command on the scheduler node indicates 100% CPU usage and slowly increasing memory usage. On the other hand, action takes place right away on the dask dashboard with an older version of rechunker (0.3.3).

git bisecting versions indicates:

6cc0f26720bfecb1eba00579a13d9b7c8004f652 is the first bad commit

I am not sure what I could do in order to investigate further and would welcome suggestions.
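For reference, here is a minimal sketch of the kind of call involved; the cluster address, store paths, and chunk sizes below are placeholders rather than the actual values from my run:

    import zarr
    from dask.distributed import Client
    from rechunker import rechunk

    # placeholder address; in practice the HPC dask cluster is already running
    client = Client("tcp://scheduler-address:8786")

    # placeholder source; the real array is much larger, with many more chunks
    source = zarr.open("source.zarr", mode="r")

    plan = rechunk(
        source,
        target_chunks=(500, 500, 500),  # placeholder target chunking
        max_mem="2GB",                  # memory budget for each copy task
        target_store="target.zarr",
        temp_store="temp.zarr",
    )

    # with 0.3.3, tasks show up on the dashboard immediately;
    # with >= 0.4.0, nothing happens and the scheduler pegs one CPU instead
    result = plan.execute()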

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.12.53-60.30-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.20.3
scipy: 1.6.3
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.06.0
distributed: 2021.06.0
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: 0.11.1
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.1.2
conda: None
pytest: None
IPython: 7.24.1
sphinx: None

@rabernat
Member

rabernat commented Jun 9, 2021

Thanks for reporting this. Indeed it looks like 6cc0f26 (part of #77) caused this problem.

@TomAugspurger - this sounds very similar to the problems we are having in Pangeo Forge, which we are attempting to fix in pangeo-forge/pangeo-forge-recipes#153.

The basic theory is that the way we are building dask graphs post #77 results in objects that are much too large.

Aurelien, if you wanted to help debug, you could run the same steps that Tom ran in pangeo-forge/pangeo-forge-recipes#116 (comment) and report the results back here.
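Roughly, those steps amount to walking the dask graph and tallying serialization time and size per high-level key. Tom's exact code is in the linked comment; the sketch below is only my approximation of it, using plain pickle and a crude key-prefix split:

    import pickle
    import time
    from collections import defaultdict

    def layer_stats(collection):
        """Tally pickle time and size of tasks in `collection`'s graph, grouped by key prefix."""
        # `collection` is whatever dask object the rechunk plan hands to the scheduler
        # (e.g. a dask array on 0.3.3, or the Delayed behind plan.execute() on 0.4.x)
        graph = dict(collection.__dask_graph__())
        stats = defaultdict(lambda: [0.0, 0])
        for key, task in graph.items():
            name = key[0] if isinstance(key, tuple) else key
            prefix = name.rsplit("-", 1)[0]   # rough high-level key, e.g. "from-zarr-store"
            tic = time.perf_counter()
            payload = pickle.dumps(task)      # distributed uses its own serialization;
            size = len(payload)               # pickle size is just a proxy here
            stats[prefix][0] += time.perf_counter() - tic
            stats[prefix][1] += size
        print(f"{'hl_key':<20}{'time':>15}{'size':>15}")
        for prefix, (t, s) in sorted(stats.items()):
            print(f"{prefix:<20}{t:>15.6f}{s:>15}")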

From my point of view, we could simply revert #77. The main point of that was to have a representation of pipelines we could share between rechunker and pangeo-forge-recipes. But since we are likely abandoning the rechunker dependency in pangeo-forge-recipes, this no longer seems necessary.

@apatlpo
Member Author

apatlpo commented Jun 14, 2021

Here is the result with version 0.3.3:

hl_key             time                      size
array              0.0001261560246348381     51329.0
barrier            4.3830834329128265e-05    594.0
from-zarr          0.00048235198482871056    25113.0
from-zarr-store    0.0008570537902414799     7104.0
getitem            0.09110138937830925       871471.0
open_dataset       0.00033695297315716743    1597.0
rechunked          7.380731403827667e-06     132.0
store              0.006245309021323919      224540.0

and with latest version (73ef80b):

hl_key        time                     size
copy_chunk    381.4043052890338        9747762599.0
merge         1.449882984161377e-05    304.0
stage         0.0010634679347276688    19195.0

Please let me know if I need to adjust anything; you can find the full code in the following notebook.

@rabernat
Member

Ok that definitely confirms my suspicion that the way we are generating dask graphs is not efficient. 🤦 Looks like we will have to revert quite a few changes.

@apatlpo - how urgent is this problem for you? Can you just keep using 0.3.0 for now?

@apatlpo
Member Author

apatlpo commented Jun 14, 2021

Absolutely no rush on my side, I am perfectly happy with 0.3.3 (sorry, corrected a typo in my earlier post), thanks for your concern.

@d70-t

d70-t commented Jun 16, 2021

I just stumbled across this, having a similar use case. For now, using 0.3.3 seems to be fine for me as well.

@rybchuk

rybchuk commented Jul 4, 2021

I also wanted to second this - I was having problems rechunking a variable that had many small chunks. Dask would get stuck in the first few moments of trying to rechunk. Reverting to 0.3.3 solved this issue for me.

@bolliger32

bolliger32 commented Jul 17, 2021

+1 on this issue. Reverting to 0.3.3 seems to solve it, though unrelatedly I'm getting workers using 2-3x the max_mem setting (possibly related to the issues in #54). This might be fixed by v0.4; I just haven't been able to test it.

That being said, anecdotally I seem to see workflows using more memory than expected with recent updates to dask/distributed (starting around 2021.6, I believe), so it might be unrelated to rechunker. Going to do some more investigating.
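For what it's worth, this is roughly how I'm wiring the two memory knobs together; the numbers are placeholders, and my (possibly wrong) understanding is that max_mem budgets each rechunker copy task while memory_limit caps a whole dask worker:

    import zarr
    from dask.distributed import Client, LocalCluster
    from rechunker import rechunk

    # cap each worker so distributed flags anything that balloons past the limit
    cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="8GB")
    client = Client(cluster)

    source = zarr.open("source.zarr", mode="r")   # placeholder source store
    plan = rechunk(
        source,
        target_chunks=(1000, 1000),   # placeholder
        max_mem="2GB",                # rechunker's per-copy budget, well under the worker cap
        target_store="target.zarr",
        temp_store="temp.zarr",
    )
    result = plan.execute()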

@TomAugspurger
Member

I haven't had a chance to look at 6cc0f26, but pangeo-forge/pangeo-forge-recipes#116 was determined to be some kind of serialization problem, and was fixed by pangeo-forge/pangeo-forge-recipes#160.

Just to confirm, when people say "using more memory" is that on the workers or the scheduler?

@bolliger32

@TomAugspurger for me it's the workers, and it's usually just one or two workers that quickly balloon up to that size while the remainder seem to respect max_mem. Let me see if I can come up with an example (might put it in a different issue since it's separate from the main thread of this one).
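One way to capture that for an example might be distributed's performance report, which records the task stream and per-worker memory over the run (the filename below is arbitrary):

    from dask.distributed import Client, performance_report

    client = Client()  # or however the cluster is actually started

    # `plan` is the object returned by rechunk(...); the report is an HTML file
    # with per-worker memory traces that could be attached to a new issue
    with performance_report(filename="rechunk-report.html"):
        plan.execute()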

@rabernat
Member

Apologies for the slow pace here. I've been on vacation much of the past month.

@TomAugspurger - I'd like to chart a path towards resolving this. As discussed in today's Pangeo Forge meeting, this issue has implications for Pangeo Forge architecture. We have to decide whether to maintain the Pipeline abstraction or deprecate it. The performance problems described in this issue are closely tied to the switch to the Pipeline executor model. However, as you noted today, the Pipeline model is not inherently different from what is now implemented directly in BaseRecipe.to_dask() (https://github.com/pangeo-forge/pangeo-forge-recipes/blob/49997cb52cff466bd394c1348ef23981e782a4d9/pangeo_forge_recipes/recipes/base.py#L113-L154). So really the difference comes down to the way the various functions and objects are partialized / serialized.

I'd personally like to keep the Pipelines framework and even break it out into its own package. But that only makes sense if we can get it working properly within rechunker first.

I understand that you (Tom) probably don't have lots of time to dig into this right now. If that's the case, it would be great if you could at least help outline a work plan to get to the bottom of this issue. We have several other developers (e.g. me, @cisaacstern, @alxmrs, @TomNicholas) who could potentially be working on this. But your input would be really helpful to get started.

In pangeo-forge/pangeo-forge-recipes#116 (comment), Tom made a nice diagnosis of the serializability issues in Pangeo Forge recipes. Perhaps we could do the same here. We would need a minimal reproducible example for this problem.
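Something along these lines could be a starting point for that reproducer; the sizes are arbitrary, the only important property being a source with very many small chunks, timed once against 0.3.3 and once against >=0.4.0:

    import time
    import zarr
    from rechunker import rechunk

    # source with ~20,000 small chunks -- the pattern that seems to trigger the slowdown
    source = zarr.open("mre-source.zarr", mode="w",
                       shape=(200, 1000, 1000), chunks=(1, 100, 100), dtype="f4")
    source[:] = 0.0   # contents don't matter for the timing

    plan = rechunk(
        source,
        target_chunks=(200, 100, 100),   # consolidate along the first axis
        max_mem="500MB",
        target_store="mre-target.zarr",
        temp_store="mre-temp.zarr",
    )

    # on 0.3.3 tasks are submitted right away; on >=0.4.0 this is where things reportedly stall
    tic = time.perf_counter()
    plan.execute()
    print(f"execute() took {time.perf_counter() - tic:.1f} s")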

@jmccreight

jmccreight commented Sep 10, 2021

+1 & thanks!
I'll note that reverting solved my issues. In case it helps anyone else, the symptoms were roughly that time_to_solution ~ e^(chunk_size). For my desired chunk size in time, I could not complete the rechunking within my cluster's walltime (6 hours); after reverting, it completes in about 90 seconds.

@rabernat
Member

@jmccreight - would you be able to provide any more details? Were you using Xarray or Zarr inputs? Which executor? Code would be even better. Thanks so much for your help.

@jmccreight

Hi @rabernat!
I'm basically doing what @rsignell-usgs did here:
https://nbviewer.jupyter.org/gist/rsignell-usgs/c0b87ed1fa5fc694e665fb789e8381bb

The quick overview (a runnable sketch of the loop follows the list):

  • bunch of output files, one for each timestep
  • loop on n_time_chunks to process, each time grabbing len(files) = n_files_in_time_chunk
    • ds = xr.open_mfdataset(files)
    • rechunked = rechunk(ds, chunk_plan, max_mem, step_file, temp_file)
    • result = rechunked.execute(retries=10) <-------- problem here
    • ds = xr.open_zarr(file_step, consolidated=False)
    • if not file_chunked.exists():
        ds.to_zarr(file_chunked, consolidated=True, mode='w')
      else:
        ds.to_zarr(file_chunked, consolidated=True, append_dim='time')
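Filled out with hypothetical paths and chunk sizes (the real values are in the linked notebook), the loop looks roughly like this:

    from pathlib import Path
    import xarray as xr
    from rechunker import rechunk

    # hypothetical names and sizes; see the linked notebook for the real ones
    all_files = sorted(Path("model_output").glob("*.nc"))    # one file per timestep
    n_files_in_time_chunk = 24
    chunk_plan = {"time": 24, "x": 350, "y": 350}            # exact form depends on the input type
    max_mem = "2GB"
    file_step = Path("step.zarr")                            # intermediate store
    temp_file = Path("temp.zarr")
    file_chunked = Path("chunked.zarr")                      # final, appended-to store

    for start in range(0, len(all_files), n_files_in_time_chunk):
        files = all_files[start:start + n_files_in_time_chunk]

        ds = xr.open_mfdataset(files)
        rechunked = rechunk(ds, chunk_plan, max_mem,
                            str(file_step), temp_store=str(temp_file))
        result = rechunked.execute(retries=10)   # <-- the step that hangs on >=0.4.0

        ds = xr.open_zarr(str(file_step), consolidated=False)
        if not file_chunked.exists():
            ds.to_zarr(file_chunked, consolidated=True, mode="w")
        else:
            ds.to_zarr(file_chunked, consolidated=True, append_dim="time")
        # (cleanup of the intermediate stores between iterations omitted)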

@rsignell-usgs
Member

I just hit this again -- one of my USGS colleagues had rechunker=0.4.2 and it was hanging, and after reverting to rechunker=0.3.3, it worked fine.

@rabernat
Member

Sorry for all the friction everyone! This package needs some maintenance. In the meantime, should we just pull the 0.4.2 release from pypi?

@rabernat
Member

Ok so #112 has been merged, and the current master should have a solution to this issue.

@apatlpo - any chance you would be able to try your test with the latest rechunker master branch to confirm it fixes your original problem?

@bzah

bzah commented Mar 29, 2022

I'm not the OP, but I had a similar issue. I tried the latest version on master and the rechunking went smoothly.
I tried it on a LocalCluster on my laptop, if that makes any difference.

@max-sixty

Was this fixed in recent releases / can we close?
