`nan` values appearing when saving and loading from `netCDF` due to encoding #7691

euronion · 2023-03-28T07:58:21Z

What happened?

When writing to and reading my dataset from netCDF using ds.to_netcdf() and xr.open_dataset(...), xarray creates nan values where previously number values (float32) where.

The issue seems related to the encoding used for the original dataset, which causes the data to be stored as short. During loading, the stored values then collide with _FillValue leading to the numbers being interpreted as nan.

What did you expect to happen?

Values after saving & loading should be the same as before saving.

Minimal Complete Verifiable Example

We had a back-and-forth on SO about this, I hope it's fine to just refer to it here:

https://stackoverflow.com/a/75806771/11318472

MVCE confirmation

Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
Complete example — the example is self-contained, including all data and the text of any traceback.
Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

See the SO link above.

Anything else we need to know?

I'm not sure whether this should be considered a bug or just a combination of conflicting features. My current workaround is resetting the encoding and letting xarray decide to store as float instead of short (cf. #7686).

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.90.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1

xarray: 2022.11.0
pandas: 1.5.2
numpy: 1.23.5
scipy: 1.10.0
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.3
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.02.1
distributed: 2022.2.1
matplotlib: 3.6.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: 7.2.0
IPython: 8.11.0
sphinx: None

The text was updated successfully, but these errors were encountered:

kmuehlbauer · 2023-03-28T09:37:58Z

Thanks for all the details, @euronion.

From what I can tell, everything is OK with the original file. It's using packed data: https://docs.unidata.ucar.edu/nug/current/best_practices.html#bp_Packed-Data-Values. The only thing what might be a bit off is why they didn't choose -32768 as _FillValue

As both scale_factor and add_offset are of dtype float64 in the original file the data should be unpacked to float64 according to NetCDF-specs.

The reason why this isn't done is because the

xarray/xarray/conventions.py

Lines 379 to 384 in 020b4c0

    
           if mask_and_scale: 
        
               for coder in [ 
        
                   variables.UnsignedIntegerCoder(), 
        
                   variables.CFMaskCoder(), 
        
                   variables.CFScaleOffsetCoder(), 
        
               ]:

CFMaskCoder will promote int16 to float32 unconditionally. This happens in dtypes.maybe_promote():

xarray/xarray/core/dtypes.py

Lines 67 to 70 in 020b4c0

    
           elif np.issubdtype(dtype, np.integer): 
        
               dtype = np.float32 if dtype.itemsize <= 2 else np.float64 
        
               fill_value = np.nan 
        
           elif np.issubdtype(dtype, np.complexfloating):

The CFScaleOffsetCoder itself is able to correctly convert this to the wanted dtype float64:

xarray/xarray/coding/variables.py

Lines 235 to 251 in e79eaf5

    
           def _choose_float_dtype(dtype: np.dtype, has_offset: bool) -> type[np.floating[Any]]: 
        
               """Return a float dtype that can losslessly represent `dtype` values.""" 
        
               # Keep float32 as-is.  Upcast half-precision to single-precision, 
        
               # because float16 is "intended for storage but not computation" 
        
               if dtype.itemsize <= 4 and np.issubdtype(dtype, np.floating): 
        
                   return np.float32 
        
               # float32 can exactly represent all integers up to 24 bits 
        
               if dtype.itemsize <= 2 and np.issubdtype(dtype, np.integer): 
        
                   # A scale factor is entirely safe (vanishing into the mantissa), 
        
                   # but a large integer offset could lead to loss of precision. 
        
                   # Sensitivity analysis can be tricky, so we just use a float64 
        
                   # if there's any offset at all - better unoptimised than wrong! 
        
                   if not has_offset: 
        
                       return np.float32 
        
               # For all other types and circumstances, we just use float64. 
        
               # (safe because eg. complex numbers are not supported in NetCDF) 
        
               return np.float64

As this doesn't surface that often it might just happen here by accident. If the _FillValue/missing_value would be -32768 then the issue would not manifest.

Update: corrected to maybe_promote()

kmuehlbauer · 2023-03-28T12:41:43Z

As this doesn't surface that often it might just happen here by accident. If the _FillValue/missing_value would be -32768 then the issue would not manifest.

So for NetCDF the default fillvalue for NC_SHORT (int16) is -32767. That means the promotion to float32 instead the needed float64 is the problem here (floating point precision).

kmuehlbauer · 2023-03-28T13:16:31Z

MCVE:

fname = "test-7691.nc"
import netCDF4 as nc
with nc.Dataset(fname, "w") as ds0:
    ds0.createDimension("t", 5)
    ds0.createVariable("x", "int16", ("t",), fill_value=-32767)
    v = ds0.variables["x"]
    v.set_auto_maskandscale(False)
    v.add_offset = 278.297319296597
    v.scale_factor = 1.16753614203674e-05
    v[:] = np.array([-32768, -32767, -32766, 32767, 0])
with nc.Dataset(fname) as ds1:
    x1 = ds1["x"][:]
    print("netCDF4-python:", x1.dtype, x1)
with xr.open_dataset(fname) as ds2:
    x2 = ds2["x"].values
    ds2.to_netcdf("test-7691-01.nc")
    print("xarray first read:", x2.dtype, x2)
with xr.open_dataset("test-7691-01.nc") as ds3:
    x3 = ds3["x"].values
    print("xarray roundtrip:", x3.dtype, x3)

netCDF4-python: float64 [277.9147410535744 -- 277.9147644042972 278.67988586425815
 278.297319296597]
xarray first read: float32 [277.91476       nan 277.91476 278.6799  278.29733]
xarray roundtrip: float32 [      nan       nan       nan 278.6799  278.29733]

I've confirmed that correctly promoting to float64 in CFMaskCoder solves this issue.

dcherian · 2023-03-28T13:42:36Z

It would be good to merge some version of #6812. This seems to be pretty common

Review comments and PR remixes welcome!

kmuehlbauer · 2023-03-31T13:19:01Z

@euronion There is a potential fix for your issue in #7654. It would be great, if you could have a closer look and test against that PR.

euronion · 2023-03-31T14:23:49Z

@euronion There is a potential fix for your issue in #7654. It would be great, if you could have a closer look and test against that PR.

Yes, the PR seems to solve my specific issue without changing the encoding in the netCDF file. Great, thanks!

kmuehlbauer · 2023-03-31T15:05:17Z

, the PR seems to solve my specific issue without changing the encoding

Great, thanks for testing.

kmuehlbauer · 2024-02-06T08:55:58Z

#7654 got closed without adding several changes to other PR. #8713 takes over.

kmuehlbauer · 2024-02-07T21:27:25Z

@euronion Sorry for letting go #7654 out of focus. Would you mind looking/testing #8713 and let us know your thoughts over there? Much appreciated!

euronion · 2024-02-11T21:56:00Z

Hi @kmuehlbauer , thanks for looking into the issue again!
I'm moving between jobs right now and have to re-setup my coding / testing locally. No promises on a fast response, maybe in 1-2 weeks.

kmuehlbauer · 2024-02-12T08:09:10Z

Thanks @euronion, no worries, just have a look when you can spare the time.

see reported issue pydata/xarray#7691 and pr pydata/xarray#8713 which was included into xarray v2024.03.0

* fix: Skip previous encoding workaround for fixed xarray versions see reported issue pydata/xarray#7691 and pr pydata/xarray#8713 which was included into xarray v2024.03.0 * Replace workaround by a new xarray lower bound --------- Co-authored-by: Jonas Hoersch <[email protected]>

euronion added bug needs triage Issue that has not been reviewed by xarray team member labels Mar 28, 2023

kmuehlbauer mentioned this issue Mar 31, 2023

cf-coding #7654

Closed

4 tasks

kmuehlbauer added topic-backends topic-CF conventions and removed needs triage Issue that has not been reviewed by xarray team member labels Apr 28, 2023

kmuehlbauer mentioned this issue Feb 6, 2024

correctly encode/decode _FillValues/missing_values/dtypes for packed data #8713

Merged

5 tasks

dcherian closed this as completed in #8713 Mar 15, 2024

coroa pushed a commit to coroa/atlite that referenced this issue Nov 1, 2024

fix: Skip previous encoding workaround for fixed xarray versions

9bd3014

see reported issue pydata/xarray#7691 and pr pydata/xarray#8713 which was included into xarray v2024.03.0

coroa mentioned this issue Nov 1, 2024

fix: Skip previous encoding workaround for fixed xarray versions PyPSA/atlite#401

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`nan` values appearing when saving and loading from `netCDF` due to encoding #7691

`nan` values appearing when saving and loading from `netCDF` due to encoding #7691

euronion commented Mar 28, 2023

INSTALLED VERSIONS

kmuehlbauer commented Mar 28, 2023 •

edited

Loading

kmuehlbauer commented Mar 28, 2023

kmuehlbauer commented Mar 28, 2023 •

edited

Loading

dcherian commented Mar 28, 2023 •

edited

Loading

kmuehlbauer commented Mar 31, 2023

euronion commented Mar 31, 2023

kmuehlbauer commented Mar 31, 2023

kmuehlbauer commented Feb 6, 2024

kmuehlbauer commented Feb 7, 2024

euronion commented Feb 11, 2024

kmuehlbauer commented Feb 12, 2024

nan values appearing when saving and loading from netCDF due to encoding #7691

nan values appearing when saving and loading from netCDF due to encoding #7691

Comments

euronion commented Mar 28, 2023

What happened?

What did you expect to happen?

Minimal Complete Verifiable Example

MVCE confirmation

Relevant log output

Anything else we need to know?

Environment

INSTALLED VERSIONS

kmuehlbauer commented Mar 28, 2023 • edited Loading

kmuehlbauer commented Mar 28, 2023

kmuehlbauer commented Mar 28, 2023 • edited Loading

dcherian commented Mar 28, 2023 • edited Loading

kmuehlbauer commented Mar 31, 2023

euronion commented Mar 31, 2023

kmuehlbauer commented Mar 31, 2023

kmuehlbauer commented Feb 6, 2024

kmuehlbauer commented Feb 7, 2024

euronion commented Feb 11, 2024

kmuehlbauer commented Feb 12, 2024

`nan` values appearing when saving and loading from `netCDF` due to encoding #7691

`nan` values appearing when saving and loading from `netCDF` due to encoding #7691

kmuehlbauer commented Mar 28, 2023 •

edited

Loading

kmuehlbauer commented Mar 28, 2023 •

edited

Loading

dcherian commented Mar 28, 2023 •

edited

Loading