Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nan values appearing when saving and loading from netCDF due to encoding #7691

Closed
1 of 4 tasks
euronion opened this issue Mar 28, 2023 · 11 comments · Fixed by #8713
Closed
1 of 4 tasks

nan values appearing when saving and loading from netCDF due to encoding #7691

euronion opened this issue Mar 28, 2023 · 11 comments · Fixed by #8713

Comments

@euronion
Copy link

What happened?

When writing to and reading my dataset from netCDF using ds.to_netcdf() and xr.open_dataset(...), xarray creates nan values where previously number values (float32) where.

The issue seems related to the encoding used for the original dataset, which causes the data to be stored as short. During loading, the stored values then collide with _FillValue leading to the numbers being interpreted as nan.

What did you expect to happen?

Values after saving & loading should be the same as before saving.

Minimal Complete Verifiable Example

We had a back-and-forth on SO about this, I hope it's fine to just refer to it here:

https://stackoverflow.com/a/75806771/11318472

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

See the SO link above.

Anything else we need to know?

I'm not sure whether this should be considered a bug or just a combination of conflicting features. My current workaround is resetting the encoding and letting xarray decide to store as float instead of short (cf. #7686).

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.90.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.11.0
pandas: 1.5.2
numpy: 1.23.5
scipy: 1.10.0
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.3
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.02.1
distributed: 2022.2.1
matplotlib: 3.6.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: 7.2.0
IPython: 8.11.0
sphinx: None

@euronion euronion added bug needs triage Issue that has not been reviewed by xarray team member labels Mar 28, 2023
@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Mar 28, 2023

Thanks for all the details, @euronion.

From what I can tell, everything is OK with the original file. It's using packed data: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Packed-Data-Values. The only thing what might be a bit off is why they didn't choose -32768 as _FillValue

As both scale_factor and add_offset are of dtype float64 in the original file the data should be unpacked to float64 according to NetCDF-specs.

The reason why this isn't done is because the

if mask_and_scale:
for coder in [
variables.UnsignedIntegerCoder(),
variables.CFMaskCoder(),
variables.CFScaleOffsetCoder(),
]:

CFMaskCoder will promote int16 to float32 unconditionally. This happens in dtypes.maybe_promote():

elif np.issubdtype(dtype, np.integer):
dtype = np.float32 if dtype.itemsize <= 2 else np.float64
fill_value = np.nan
elif np.issubdtype(dtype, np.complexfloating):

The CFScaleOffsetCoder itself is able to correctly convert this to the wanted dtype float64:

def _choose_float_dtype(dtype: np.dtype, has_offset: bool) -> type[np.floating[Any]]:
"""Return a float dtype that can losslessly represent `dtype` values."""
# Keep float32 as-is. Upcast half-precision to single-precision,
# because float16 is "intended for storage but not computation"
if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating):
return np.float32
# float32 can exactly represent all integers up to 24 bits
if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer):
# A scale factor is entirely safe (vanishing into the mantissa),
# but a large integer offset could lead to loss of precision.
# Sensitivity analysis can be tricky, so we just use a float64
# if there's any offset at all - better unoptimised than wrong!
if not has_offset:
return np.float32
# For all other types and circumstances, we just use float64.
# (safe because eg. complex numbers are not supported in NetCDF)
return np.float64

As this doesn't surface that often it might just happen here by accident. If the _FillValue/missing_value would be -32768 then the issue would not manifest.

Update: corrected to maybe_promote()

@kmuehlbauer
Copy link
Contributor

As this doesn't surface that often it might just happen here by accident. If the _FillValue/missing_value would be -32768 then the issue would not manifest.

So for NetCDF the default fillvalue for NC_SHORT (int16) is -32767. That means the promotion to float32 instead the needed float64 is the problem here (floating point precision).

@kmuehlbauer
Copy link
Contributor

kmuehlbauer commented Mar 28, 2023

MCVE:

fname = "test-7691.nc"
import netCDF4 as nc
with nc.Dataset(fname, "w") as ds0:
    ds0.createDimension("t", 5)
    ds0.createVariable("x", "int16", ("t",), fill_value=-32767)
    v = ds0.variables["x"]
    v.set_auto_maskandscale(False)
    v.add_offset = 278.297319296597
    v.scale_factor = 1.16753614203674e-05
    v[:] = np.array([-32768, -32767, -32766, 32767, 0])
with nc.Dataset(fname) as ds1:
    x1 = ds1["x"][:]
    print("netCDF4-python:", x1.dtype, x1)
with xr.open_dataset(fname) as ds2:
    x2 = ds2["x"].values
    ds2.to_netcdf("test-7691-01.nc")
    print("xarray first read:", x2.dtype, x2)
with xr.open_dataset("test-7691-01.nc") as ds3:
    x3 = ds3["x"].values
    print("xarray roundtrip:", x3.dtype, x3)
netCDF4-python: float64 [277.9147410535744 -- 277.9147644042972 278.67988586425815
 278.297319296597]
xarray first read: float32 [277.91476       nan 277.91476 278.6799  278.29733]
xarray roundtrip: float32 [      nan       nan       nan 278.6799  278.29733]

I've confirmed that correctly promoting to float64 in CFMaskCoder solves this issue.

@dcherian
Copy link
Contributor

dcherian commented Mar 28, 2023

It would be good to merge some version of #6812. This seems to be pretty common

Review comments and PR remixes welcome!

@kmuehlbauer kmuehlbauer mentioned this issue Mar 31, 2023
4 tasks
@kmuehlbauer
Copy link
Contributor

@euronion There is a potential fix for your issue in #7654. It would be great, if you could have a closer look and test against that PR.

@euronion
Copy link
Author

@euronion There is a potential fix for your issue in #7654. It would be great, if you could have a closer look and test against that PR.

Yes, the PR seems to solve my specific issue without changing the encoding in the netCDF file. Great, thanks!

@kmuehlbauer
Copy link
Contributor

, the PR seems to solve my specific issue without changing the encoding

Great, thanks for testing.

@kmuehlbauer
Copy link
Contributor

#7654 got closed without adding several changes to other PR. #8713 takes over.

@kmuehlbauer
Copy link
Contributor

@euronion Sorry for letting go #7654 out of focus. Would you mind looking/testing #8713 and let us know your thoughts over there? Much appreciated!

@euronion
Copy link
Author

Hi @kmuehlbauer , thanks for looking into the issue again!
I'm moving between jobs right now and have to re-setup my coding / testing locally. No promises on a fast response, maybe in 1-2 weeks.

@kmuehlbauer
Copy link
Contributor

Thanks @euronion, no worries, just have a look when you can spare the time.

coroa pushed a commit to coroa/atlite that referenced this issue Nov 1, 2024
see reported issue pydata/xarray#7691 and pr
pydata/xarray#8713 which was included into xarray v2024.03.0
lkstrp pushed a commit to PyPSA/atlite that referenced this issue Nov 11, 2024
* fix: Skip previous encoding workaround for fixed xarray versions

see reported issue pydata/xarray#7691 and pr
pydata/xarray#8713 which was included into xarray v2024.03.0

* Replace workaround by a new xarray lower bound

---------

Co-authored-by: Jonas Hoersch <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants