to_zarr with append or region mode and _FillValue doesn't work #6329

Open
d70-t opened this issue Mar 4, 2022 · 17 comments
Comments

@d70-t
Contributor

d70-t commented Mar 4, 2022

What happened?

import numpy as np
import xarray as xr
ds = xr.Dataset({"a": ("x", [3.], {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim="x")

raises

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

What did you expect to happen?

I'd expect this to just work (effectively concatenating the dataset to itself).
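Concretely, the expected store content would be equivalent to (a sketch using the ds from above):

expected = xr.concat([ds, ds], dim="x")  # a == [3., 3.]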

Anything else we need to know?

appears also for region writes

The same issue appears for region writes as in:

import numpy as np
import dask.array as da
import xarray as xr
ds = xr.Dataset({"a": ("x", da.array([3.,4.]), {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m, compute=False, encoding={"a": {"chunks": (1,)}})
ds.isel(x=slice(0,1)).to_zarr(m, region={"x": slice(0,1)})

raises

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

there's a workaround

The workaround (deleting the _FillValue in subsequent writes):

m = {}
ds.to_zarr(m)
del ds.a.attrs["_FillValue"]
ds.to_zarr(m, append_dim="x")

seems to do the trick.
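Reading the store back (a quick sanity check, not part of the original report) shows the expected concatenated values:

print(xr.open_zarr(m).a.values)  # array([3., 3.])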

There are indications that the result might still be broken, but it's not yet clear how to reproduce them (see comments below).

This issue has been split off from #6069

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.10 (main, Jan 15 2022, 11:48:00)
[Clang 13.0.0 (clang-1300.0.29.3)]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.20.1
pandas: 1.2.0
numpy: 1.21.2
scipy: 1.6.2
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.11.0
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.1
distributed: 2021.11.1
matplotlib: 3.4.1
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 60.5.0
pip: 21.3.1
conda: None
pytest: 6.2.2
IPython: 8.0.0.dev
sphinx: 3.5.0

@d70-t d70-t added bug needs triage Issue that has not been reviewed by xarray team member labels Mar 4, 2022
@dcherian dcherian added topic-CF conventions and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 4, 2022
@Boorhin

Boorhin commented Mar 4, 2022

I will try to reproduce the strange behaviour, but it was in a cloud environment (Google) where the time steps were writing over each other and the number of "preserved" time steps varied over time.
I suggest we use something closer to the original problem, such as the tutorial dataset?

@d70-t
Contributor Author

d70-t commented Mar 4, 2022

If that's necessary to reproduce the problem, then yes. If it's possible to show the same thing with less "noise", then it's better to not use the tutorial dataset and to not use something like a cloud backend. But we can also try to iterate on this again, to progressively get down to a smaller example.

@Boorhin

Boorhin commented Mar 5, 2022

Sorry to add to the confusion: I have actually hit another kind of strange behaviour after deleting the _FillValue with the region method. I thought the run worked, but it didn't. I am investigating...

@Boorhin

Boorhin commented Mar 5, 2022

Just to make clear what is weird: this was just a test to see whether the regions were written to file, and it seems they were written only randomly, most likely with regions overwriting other regions. I have no idea how that is possible. In theory everything should be written from i = 95 to 954. It could be in my code, so I am checking again, but that seems unlikely without any error being raised. I am showing this so that you can better understand what I am observing.
[screenshot: plot of the written time steps]
Just to say that in theory all the time steps were written, as I print a confirmation message at each iteration.

@Boorhin

Boorhin commented Mar 5, 2022

I can confirm that it also fails when precomputing a dataset and filling regions, with the same error:

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

@Boorhin

Boorhin commented Mar 7, 2022

This will fail like append does. I just tried to make a somewhat realistic example, reprojecting from a geographic to an orthogonal system. Look at all the stages you need to go through... and I am still not sure this works as it should.

import xarray as xr
from rasterio.enums import Resampling
import numpy as np

def init_coord(ds):
    ''' To have the geometry right'''
    arr_r=some_processing(ds.isel(time=slice(0,1))
    return arr_r.x.values, arr_r.y.values

def some_processing(arr):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
    return arr_r

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
x,y=init_coord(ds)
ds_to_write=xr.Dataset({'coords':{'time':('time',ds.time.values),'x':('x', x),'y':('y',y)}})
ds_to_write.to_zarr(filename, compute =false, encoding={"time": {"chunks": [1]}})
for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1))
    agg_r_t= agg_r.drop(['spatial_ref']).expand_dims({'time':[ds.time.values[i]]})
    buff= xr.Dataset(({'air':agg_r_t}).chunk({'time':1,'x':250,'y':250})
    buff.drop(['x','y']).to_zarr(filename, , region={'time':slice(i,i+1)})

You would need to change the processing function to something like:

def some_processing(arr):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
    del arr_r.attrs["_FillValue"]
    return arr_r

Sorry if I am being repetitive, but I want to be sure it is clearly illustrated. I have done another test on the cloud and am checking the values at the moment.

@d70-t
Contributor Author

d70-t commented Mar 7, 2022

Sorry, @Boorhin. But the code example you showed has many syntax errors:

$ python3 test.py
  File "test.py", line 8
    return arr_r.x.values, arr_r.y.values
    ^
SyntaxError: invalid syntax

(there are more and I wasn't sure how to fix them at all places to match what you likely wanted to express)

@Boorhin

Boorhin commented Mar 8, 2022

OK, sorry for the various mistakes; I wrote that in a hurry. Strangely enough, this version behaves differently, but it crashes too.

import xarray as xr
from rasterio.enums import Resampling
import numpy as np

def init_coord(ds):
    ''' To have the geometry right'''
    arr_r=some_processing(ds.isel(time=slice(0,1)))
    return arr_r.x.values, arr_r.y.values

def some_processing(arr):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
    return arr_r

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
x,y=init_coord(ds)
ds_to_write=xr.Dataset(coords={'time':('time',ds.time.values),'x':('x', x),'y':('y',y)})
ds_to_write.to_zarr(filename, compute=False, encoding={"time": {"chunks": [1]}})
for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1)))
    buff= arr_r.drop(['spatial_ref','x','y']).chunk({'time':1,'x':250,'y':250})
    buff.to_zarr(filename, mode='a', region={'time':slice(i,i+1)})

With error:

ValueError: fill_value nan is not valid for dtype int16; nested exception: cannot convert float NaN to integer

but the output of buff is:

[screenshot: the repr of buff, showing only float dtypes]

i.e. it contains only floats

@d70-t
Contributor Author

d70-t commented Mar 8, 2022

You've got the encoding of air set to int16:

print(buff.air.encoding)
{'source': '.../xarray_tutorial_data/69c68be1605878a6c8efdd34d85b4ca1-air_temperature.nc', 'original_shape': (2920, 25, 53), 'dtype': dtype('int16'), 'scale_factor': 0.01, 'grid_mapping': 'spatial_ref'}

@Boorhin

Boorhin commented Mar 9, 2022

OK, that is easy to change; now you get the exact same error message as for appending.
I have tried a lot of different ways and I am not getting anywhere with writing the data correctly into a store.

import xarray as xr
from rasterio.enums import Resampling
import numpy as np

def init_coord(ds):
    ''' To have the geometry right'''
    arr_r=some_processing(ds.isel(time=slice(0,1)))
    return arr_r.x.values, arr_r.y.values

def some_processing(arr):
    ''' A reprojection routine'''
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(250, 250), resampling=Resampling.bilinear, nodata=np.nan)
    return arr_r

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
x,y=init_coord(ds)
ds_to_write=xr.Dataset(coords={'time':('time',ds.time.values),'x':('x', x),'y':('y',y)})
ds_to_write.to_zarr(filename, compute=False, encoding={"time": {"chunks": [1]}})
for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1)))
    buff= arr_r.drop(['spatial_ref','x','y']).chunk({'time':1,'x':250,'y':250})
    buff.air.encoding['dtype']=np.dtype('float32')
    buff.to_zarr(filename, mode='a', region={'time':slice(i,i+1)})

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

@d70-t
Contributor Author

d70-t commented Mar 9, 2022

Yes, that looks like the error as described in the initial post.
Adding the described workaround (i.e. del buff.air.attrs["_FillValue"] in this case) leads to the next error message:

ValueError: variable 'air' already exists with different dimension sizes: {'time': 0, 'y': 250, 'x': 250} != {'time': 1, 'y': 250, 'x': 250}. to_zarr() only supports changing dimension sizes when explicitly appending, but append_dim=None.

This is due to mixing append mode (mode='a') with a region write (region={'time': slice(i,i+1)}), which is out of scope, as outlined in this comment. It may or may not be possible or intended to support this, but I'm not deep enough into the design of xarray to give a definitive answer here. For me, it's unclear how this should behave. My current point of view is:

  • append: may change structure-defining metadata, must be sequential, mode='a'
  • region: may not change structure-defining metadata, can be parallel, mode='r+'

Currently, I can't really imagine how a mix of both should behave. If you can't prepare the dataset for the final shape upfront (to use region) and you also can't use append_dim, then probably what's needed is a separate method of expanding the dataset (i.e. reshape) without filling in the data. If such a thing would be available, one could (as a user) ensure that all reshaping operations are properly sequenced with region operations, but region operations could be run in parallel. (I think this is possible with plain-zarr, but I'm not aware of a corresponding xarray API).
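With plain zarr, such a reshape step could look roughly like the following sketch (names and sizes are illustrative, borrowed from the example above; if the store uses consolidated metadata, it would need re-consolidating afterwards):

import zarr
g = zarr.open_group('processed_dataset.zarr', mode='a')
new_len = 1000                        # illustrative target length
g['air'].resize((new_len, 250, 250))  # grow along time only; new chunks read back as fill_value
g['time'].resize((new_len,))          # the coordinate array must grow as well
# zarr.consolidate_metadata('processed_dataset.zarr')  # if consolidated metadata is in use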

@Boorhin

Boorhin commented Mar 10, 2022

Sorry, that's a mistake. I think append was suggested at some point by one of the error messages.
I cannot remember 'r+' being described in the xarray docs. Would you mind detailing what it does?
Cheers

@d70-t
Contributor Author

d70-t commented Mar 10, 2022

Sure, no problem.
I believe this page has a good summary:

mode ({"w", "w-", "a", "r+", None}, optional) – Persistence mode: “w” means create (overwrite if exists); “w-” means create (fail if exists); “a” means override existing variables (create if does not exist); “r+” means modify existing array values only (raise an error if any metadata or shapes would change). The default mode is “a” if append_dim is set. Otherwise, it is “r+” if region is set and w- otherwise.

So the difference between "a" and "r+" roughly codifies the intended behaviour for sequential access (it's ok to modify everything) and parallel access to independent chunks (where modifying metadata would be bad).

So probably that message was suggesting that you have to use "a" if you want to modify metadata (e.g. by expanding the shape), which is true. But to me, it's unclear how one would do that safely with (potentially) parallel region writes, so it's kind of reasonable that region writes don't like to modify metadata.
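In code, the implied defaults are roughly (a minimal sketch):

import xarray as xr
ds = xr.Dataset({"a": ("x", [1.0, 2.0])})
m = {}
ds.to_zarr(m)                                  # default mode "w-": create, fail if exists
ds.to_zarr(m, append_dim="x")                  # implies mode "a": shapes may change
ds.isel(x=slice(0, 1)).to_zarr(m, region={"x": slice(0, 1)})  # implies mode "r+": metadata frozen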

@Boorhin

Boorhin commented Mar 10, 2022

OK, changing to 'r+' leads to the error suggesting to use 'a':
ValueError: dataset contains non-pre-existing variables ['air'], which is not allowed in ``xarray.Dataset.to_zarr()`` with mode='r+'. To allow writing new variables, set mode='a'.

I have found something that gives me satisfactory results. Why I have issues in the cloud I still don't know; I am still investigating, and maybe it is unrelated. The following script kind of keeps the important stuff, but it is still not very clean, as some of the parameters are not included in the final file. Note that the skeleton dataset now includes a dummy 'air' variable, so that mode='r+' writes into a pre-existing variable. I ended up with the same kind of convoluted approach as before, but hopefully it is helpful to someone looking for a real-case example. It definitely clarified things in my head.

import xarray as xr
from rasterio.enums import Resampling
import numpy as np
import dask.array as da

def init_coord(ds, X,Y):
    ''' To have the geometry right'''
    arr_r=some_processing(ds.isel(time=slice(0,1)), X,Y)
    return arr_r.x.values, arr_r.y.values

def some_processing(arr, X,Y):
    ''' A reprojection routine'''   
    arr = arr.rio.write_crs('EPSG:4326')
    arr_r = arr.rio.reproject('EPSG:3857', shape=(Y,X), resampling=Resampling.bilinear, nodata=np.nan)
    return arr_r

filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
ds.air.encoding['dtype']=np.dtype('float32')
X,Y=250, 250 #size of each final timestep
x,y=init_coord(ds, X,Y)
dummy=da.zeros((len(ds.time.values), Y, X))
ds_to_write=xr.Dataset({'air':(('time','y','x'), dummy)},
                       coords={'time':('time',ds.time.values),'x':('x', x),'y':('y',y)})
ds_to_write.to_zarr(filename, compute=False, encoding={"time": {"chunks": [1]}})
for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1)),X,Y)
    buff= arr_r.drop(['spatial_ref','x','y']).chunk({'time':1,'x':X,'y':Y})
    del buff.air.attrs["_FillValue"]
    buff.to_zarr(filename, mode='r+', region={'time':slice(i,i+1)})

@d70-t
Contributor Author

d70-t commented Mar 10, 2022

Yes, this is kind of the behaviour I'd expect. And great that it helped clarify things.
Still, building up the metadata nicely upfront (which is required for region writes) is quite convoluted... That's what I meant with

some better tooling for writing and updating zarr dataset metadata (I don't know if that would fit in the realm of xarray though, as it looks like handling Datasets without content. For "appending" metadata, I really don't know how I'd picture this properly in xarray world.)

in the previous comment. I think establishing and documenting good practices for this would help, but probably we also want better tools. In any case, this would probably be yet another issue.

Note that if you care about this particular example (e.g. appending in a single thread in increasing order of time steps), then it should also be possible to do this much more simply using append:

import os
import numpy as np
import xarray as xr

# some_processing as defined in the comments above
filename='processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
ds.air.encoding['dtype']=np.dtype('float32')
X,Y=250, 250 #size of each final timestep

for i in range(len(ds.time)):
    # some kind of heavy processing
    arr_r=some_processing(ds.isel(time=slice(i,i+1)),X,Y)
    del arr_r.air.attrs["_FillValue"]
    if os.path.exists(filename):
        arr_r.to_zarr(filename, append_dim='time')
    else:
        arr_r.to_zarr(filename)

If you find out more about the cloud case, please post a note; otherwise, we can assume that the original bug report is fine?

@Boorhin

Boorhin commented Mar 11, 2022

If you find out more about the cloud case, please post a note; otherwise, we can assume that the original bug report is fine?

I think so, except that it affects both the append and region methods, not just append.
Yes, the above case should work. I need to test all this better. Thanks

@d70-t d70-t changed the title from "to_zarr with append-mode and _FillValue doesn't work" to "to_zarr with append or region mode and _FillValue doesn't work" Mar 11, 2022
@d70-t
Contributor Author

d70-t commented Mar 11, 2022

Thanks for pointing out region again. I've updated the header and the initial comment.
