to_netcdf() doesn't work with multiprocessing scheduler #3781
Comments
I think I ran into a similar problem when combining dask-chunked DataSets (originating from …

MCVE Code Sample

```python
import xarray as xr
from multiprocessing import Pool
import os

if False:
    # Load data without using dask
    ds = xr.open_dataset("http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995.1960.nc")
else:
    # Load data using dask
    ds = xr.open_dataset("http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995.1960.nc", chunks={})

print(ds.nbytes / 1e6, 'MB')
print('chunks', ds.air.chunks)  # chunks is empty without dask

outdir = '/glade/scratch/lvank'  # change this to some temporary directory on your system

def do_work(n):
    print(n)
    ds.to_netcdf(os.path.join(outdir, f'{n}.nc'))

tasks = range(10)
with Pool(processes=2) as pool:
    pool.map(do_work, tasks)
print('done')
```

Expected Output

Problem Description

Output of xr.show_versions()

INSTALLED VERSIONS
commit: None
xarray: 0.16.1
I am also hitting the problem as described by @bcbnz.
@lvankampenhout, I ran into your problem; the OP's seems like it's actually in […]. In short, call ds.load(scheduler='sync') at some point. If it's outside […]
I'm currently studying this problem in depth, and I noticed that while the threaded scheduler uses a lock that is defined as a function of the file name (xarray/backends/locks.py, lines 24 to 32 at 8d23032),
the process-based scheduler throws away the key (xarray/backends/locks.py, lines 35 to 39 at 8d23032).
I'm not sure yet what the consequences and logical interpretation of that are, but I would like to re-raise @bcbnz's question above: should this scenario simply raise a NotImplementedError?
If I create a chunked lazily-computed array, writing it to disk with to_netcdf() computes and writes it with the threading and distributed schedulers, but not with the multiprocessing scheduler. The only reference I've found when searching for the exception message comes from this StackOverflow question.

MCVE Code Sample
Expected Output
Complete netCDF files should be created by all three schedulers.
Problem Description
The thread pool and distributed local cluster schedulers result in a complete output. The process pool scheduler fails when trying to write (note that test-process.nc is created with the header and coordinate information, but no actual data is written). The traceback is:
With a bit of editing of the system multiprocessing module I was able to determine that the lock being reported by this exception was the first lock created. I then added a breakpoint to the Lock constructor to get a traceback of what was creating it:
This last function creates the offending multiprocessing.Lock() object. Note that six Locks are constructed, so it's possible that the later-created ones would also cause an issue.
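For context (my addition, not part of the original report): a `multiprocessing.Lock` cannot be pickled, and pickling is exactly what the process-pool scheduler must do to ship the write task, and any locks it holds, to a worker process. A minimal standard-library demonstration:

```python
import multiprocessing
import pickle

lock = multiprocessing.Lock()
try:
    pickle.dumps(lock)
except RuntimeError as exc:
    # multiprocessing locks may only be shared with child processes
    # through inheritance (fork), never by pickling.
    print(type(exc).__name__, exc)
```

Any scheduler that moves tasks between processes by serialization will trip over such a lock, which matches the failure mode seen here.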
The h5netcdf backend has the same problem with Lock. However the SciPy backend gives a NotImplementedError for this:
I'm not sure how simple it would be to get this working with the multiprocessing scheduler, or how vital it is given that the distributed scheduler works. If nothing else, it would be good to get the same NotImplementedError as with the SciPy backend.
Output of xr.show_versions()
commit: None
python: 3.8.1 (default, Jan 22 2020, 06:38:00)
[GCC 9.2.0]
python-bits: 64
OS: Linux
OS-release: 5.5.4-arch1-1
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8
LOCALE: en_NZ.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.3
xarray: 0.15.0
pandas: 1.0.1
numpy: 1.18.1
scipy: 1.4.1
netCDF4: 1.5.3
pydap: None
h5netcdf: 0.7.4
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.1.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.10.1
distributed: 2.10.0
matplotlib: 3.1.3
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 45.2.0
pip: 19.3
conda: None
pytest: 5.3.5
IPython: 7.12.0
sphinx: 2.4.2