to_zarr with append or region mode and _FillValue doesn't work
#6329
Comments
I will try to reproduce the strange behaviour, but it was in a cloud environment (Google): the time steps were overwriting each other, and the number of "preserved" time steps varied with time.
If that's necessary to reproduce the problem, then yes. If it's possible to show the same thing with less "noise", then it's better to not use the tutorial dataset and to not use something like a cloud backend. But we can also try to iterate on this again, to progressively get down to a smaller example.
Sorry to add to the confusion, but I have actually had another kind of strange behaviour when deleting the fill_value with the
I can confirm that it also fails, with the same error, when precomputing a dataset and filling regions.
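For reference, "precomputing a dataset and filling regions" refers to writing the metadata of the full-size result first (with compute=False) and then writing each time step into its region. A minimal sketch of that pattern with the tutorial dataset; the store path and chunking here are illustrative assumptions, not taken from the thread:

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")

# 1) write only the metadata / coordinates of the full result
ds.chunk({"time": 1}).to_zarr("precomputed.zarr", compute=False, mode="w")

# 2) fill one time step at a time into its region; coordinates that do not
#    share a dimension with the region (lat, lon) have to be dropped
for i in range(len(ds.time)):
    step = ds.isel(time=slice(i, i + 1)).chunk({"time": 1})
    step.drop_vars(["lat", "lon"]).to_zarr(
        "precomputed.zarr", mode="r+", region={"time": slice(i, i + 1)}
    )

On the versions discussed in this issue, these region writes are reported to hit the same _FillValue conflict as the append case whenever the variable's encoding carries a _FillValue.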
The following will fail like the append case. I just tried to make some kind of realistic example, like reprojecting from a geographic to an orthogonal system. If you look at all the stages you need to go through... and I'm still not sure this is working as it should:

import xarray as xr
from rasterio.enums import Resampling
import numpy as np

def init_coord(ds):
    ''' To have the geometry right'''
    arr_r=some_processing(ds.isel(time=slice(0,1))
    return arr_r.x.values, arr_r.y.values

def some_processing(arr):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
    return arr_r

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
x,y=init_coord(ds)
ds_to_write=xr.Dataset({'coords':{'time':('time',ds.time.values),'x':('x', x),'y':('y',y)}})
ds_to_write.to_zarr(filename, compute =false, encoding={"time": {"chunks": [1]}})

for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1))
    agg_r_t= agg_r.drop(['spatial_ref']).expand_dims({'time':[ds.time.values[i]]})
    buff= xr.Dataset(({'air':agg_r_t}).chunk({'time':1,'x':250,'y':250})
    buff.drop(['x','y']).to_zarr(filename, , region={'time':slice(i,i+1)})

You would need to change the processing function to something like:

def some_processing(arr):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
    del arr_r.attrs["_FillValue"]
    return arr_r

Sorry, maybe I am repetitive, but I want to be sure that it is clearly illustrated. I have done another test on the cloud; I am checking the values at the moment.
Sorry, @Boorhin, but the code example you showed has many syntax errors:

(there are more, and I wasn't sure how to fix them in all places to match what you likely wanted to express)
OK, sorry for the different mistakes, I wrote that in a hurry. Strangely enough, this has a different behaviour, but it crashes too.

import xarray as xr
from rasterio.enums import Resampling
import numpy as np

def init_coord(ds):
    ''' To have the geometry right'''
    arr_r=some_processing(ds.isel(time=slice(0,1)))
    return arr_r.x.values, arr_r.y.values

def some_processing(arr):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
    return arr_r

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
x,y=init_coord(ds)
ds_to_write=xr.Dataset(coords={'time':('time',ds.time.values),'x':('x', x),'y':('y',y)})
ds_to_write.to_zarr(filename, compute=False, encoding={"time": {"chunks": [1]}})

for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1)))
    buff= arr_r.drop(['spatial_ref','x','y']).chunk({'time':1,'x':250,'y':250})
    buff.to_zarr(filename, mode='a', region={'time':slice(i,i+1)})

With error:

but the output of buff is:

i.e. it contains only floats
You've got the

print(buff.air.encoding)
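What this points at, roughly, is that the encoding attached to the variable (inherited from the source dataset) controls how the data is written to disk, independently of the in-memory values. A minimal sketch of checking and overriding it, assuming the buff dataset from the snippet above; it mirrors the fix applied in the next comment and is an illustration, not the only possible fix:

import numpy as np

# the inherited encoding (dtype, scale_factor, _FillValue, ...) decides how the
# variable is serialized to zarr, even if the in-memory values are already floats
print(buff.air.encoding)

# override the target dtype before writing
buff.air.encoding["dtype"] = np.dtype("float32")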
OK, that is easy to change; now you have the exact same error message as for the appending.

import xarray as xr
from rasterio.enums import Resampling
import numpy as np

def init_coord(ds):
    ''' To have the geometry right'''
    arr_r=some_processing(ds.isel(time=slice(0,1)))
    return arr_r.x.values, arr_r.y.values

def some_processing(arr):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
    return arr_r

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
x,y=init_coord(ds)
ds_to_write=xr.Dataset(coords={'time':('time',ds.time.values),'x':('x', x),'y':('y',y)})
ds_to_write.to_zarr(filename, compute=False, encoding={"time": {"chunks": [1]}})

for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1)))
    buff= arr_r.drop(['spatial_ref','x','y']).chunk({'time':1,'x':250,'y':250})
    buff.air.encoding['dtype']=np.dtype('float32')
    buff.to_zarr(filename, mode='a', region={'time':slice(i,i+1)})
Yes, that looks like the error described in the initial post, which is due to a mix of append mode (mode="a") and region writes.

Currently, I can't really imagine how a mix of both should behave. If you can't prepare the dataset for the final shape upfront (to use
Sorry, that's a mistake. I think append was suggested at some point by one of the error messages.
Sure, no problem.
So the difference between "a" and "r+" roughly codifies the intended behaviour for sequential access (it's ok to modify everything) and parallel access to independent chunks (where modifying metadata would be bad). So probably that message was suggesting that you have to use "a" if you want to modify metadata (e.g. by expanding the shape), which is true. But to me, it's unclear how one would do that safely with (potentially) parallel region writes, so it's kind of reasonable that region writes don't like to modify metadata.
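To make the two access patterns concrete, here is a minimal sketch using a synthetic dataset; the store paths, array sizes and variable names are illustrative assumptions, not taken from this thread:

import numpy as np
import xarray as xr

# small synthetic dataset, chunked one time step per chunk
ds = xr.Dataset(
    {"air": (("time", "y", "x"), np.zeros((4, 2, 3)))},
    coords={"time": np.arange(4)},
).chunk({"time": 1})

# sequential pattern: mode="a" + append_dim may grow the array and rewrite
# metadata (shape, attributes) on every call
ds.isel(time=slice(0, 1)).to_zarr("seq.zarr", mode="w")
ds.isel(time=slice(1, 2)).to_zarr("seq.zarr", mode="a", append_dim="time")

# parallel-friendly pattern: write the full-shape metadata once, then fill
# independent regions with mode="r+", which refuses to modify metadata
ds.to_zarr("par.zarr", compute=False, mode="w")
ds.isel(time=slice(0, 1)).to_zarr("par.zarr", mode="r+", region={"time": slice(0, 1)})

The error discussed in this issue shows up when a _FillValue travels along in the variable's attributes/encoding during either of these write paths.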
OK, changing to mode='r+'. I have found something that gives me satisfactory results. Why I have issues in the cloud, I still don't know; I am still investigating, and maybe it is unrelated. The following script kind of keeps the important stuff, but it is still not very clean, as some of the parameters are not included in the final file. I ended up doing the same kind of convoluted approach as I was using before. But hopefully it is helpful to someone looking for some sort of real-case example. It definitely clarified things in my head.

import xarray as xr
from rasterio.enums import Resampling
import numpy as np
import dask.array as da

def init_coord(ds, X,Y):
    ''' To have the geometry right'''
    arr_r=some_processing(ds.isel(time=slice(0,1)), X,Y)
    return arr_r.x.values, arr_r.y.values

def some_processing(arr, X,Y):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(Y,X), resampling=Resampling.bilinear, nodata=np.nan)
    return arr_r

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
ds.air.encoding['dtype']=np.dtype('float32')
X,Y=250, 250  # size of each final timestep
x,y=init_coord(ds, X,Y)
dummy=da.zeros((len(ds.time.values), Y, X))
ds_to_write=xr.Dataset({'air':(('time','y','x'), dummy)},
                       coords={'time':('time',ds.time.values),'x':('x', x),'y':('y',y)})
ds_to_write.to_zarr(filename, compute=False, encoding={"time": {"chunks": [1]}})

for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1)),X,Y)
    buff= arr_r.drop(['spatial_ref','x','y']).chunk({'time':1,'x':X,'y':Y})
    del buff.air.attrs["_FillValue"]
    buff.to_zarr(filename, mode='r+', region={'time':slice(i,i+1)})
Yes, this is kind of the behaviour I'd expect. And great that it helped clarify things.

in the previous comment. I think establishing and documenting good practices for this would help, but probably we also want to have better tools. In any case, this would probably be yet another issue.

Note that if you care about this particular example (e.g. appending in a single thread, in increasing order of timesteps), then it should also be possible to do this much more simply using append:

import os
# (reuses the xarray/numpy imports and some_processing(...) from the snippets above)

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
ds.air.encoding['dtype']=np.dtype('float32')
X,Y=250, 250  # size of each final timestep

for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1)),X,Y)
    del arr_r.air.attrs["_FillValue"]
    if os.path.exists(filename):
        arr_r.to_zarr(filename, append_dim='time')
    else:
        arr_r.to_zarr(filename)

If you find out more about the cloud case, please post a note; otherwise, we can assume that the original bug report is fine?
I think so, except that it affects both the append and region methods, not just append.
(The issue title was changed from "to_zarr with append-mode and _FillValue doesnt work" to "to_zarr with append or region mode and _FillValue doesnt work".)
Thanks for pointing that out.
What happened?
raises
What did you expect to happen?
I'd expect this to just work (effectively concatenating the dataset to itself).
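The snippet and traceback from the original report are not reproduced above; the pattern it describes is, roughly, writing a dataset to zarr once and then appending the same data along time. A sketch along those lines, using the tutorial dataset that also appears elsewhere in this thread (the store path is an illustrative assumption, and whether the error triggers depends on the variable's encoding carrying a _FillValue):

import xarray as xr

ds = xr.tutorial.open_dataset("air_temperature")

# initial write creates the store
ds.to_zarr("append_test.zarr", mode="w")

# appending the same dataset along time; on the affected versions this is the
# call reported to raise the _FillValue conflict
ds.to_zarr("append_test.zarr", mode="a", append_dim="time")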
Anything else we need to know?
The same issue also appears for region writes, as in:
raises
There's a workaround: deleting the _FillValue in subsequent writes seems to do the trick.
There are indications that the result might still be broken, but it's not yet clear how to reproduce them (see comments below).
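The workaround snippet itself is also not reproduced above; judging from the del ...attrs["_FillValue"] lines in the comments, it amounts to removing the inherited fill value before the second write, roughly (continuing the append sketch shown earlier):

# remove the inherited fill value so the second write does not try to
# overwrite that attribute on the existing store; depending on the dataset it
# may live in .attrs or in .encoding
ds.air.attrs.pop("_FillValue", None)
ds.air.encoding.pop("_FillValue", None)

ds.to_zarr("append_test.zarr", mode="a", append_dim="time")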
This issue has been split off from #6069
Environment
INSTALLED VERSIONS
commit: None
python: 3.9.10 (main, Jan 15 2022, 11:48:00)
[Clang 13.0.0 (clang-1300.0.29.3)]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4
xarray: 0.20.1
pandas: 1.2.0
numpy: 1.21.2
scipy: 1.6.2
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.11.0
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.1
distributed: 2021.11.1
matplotlib: 3.4.1
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 60.5.0
pip: 21.3.1
conda: None
pytest: 6.2.2
IPython: 8.0.0.dev
sphinx: 3.5.0