
Encoding error when saving netcdf #7039

Open · 2 of 4 tasks
etsmith14 opened this issue Sep 14, 2022 · 14 comments

@etsmith14

etsmith14 commented Sep 14, 2022

What happened?

When I select a single point (or regional subset) from a netcdf file and save as a new netcdf file, the newly saved file has an encoding issue that causes the data to be incorrect.

How the data should look:
[figure "Correct": temperature timeseries plotted from the original dataset]

How the newly saved file looks when opened:
[figure "wrong": temperature timeseries plotted from the newly saved netcdf]

Additional context copied from original discussion:
I am trying to save a regional subset of a netcdf file as a new netcdf file. I first open data with dimensions of time, latitude, and longitude, then slice it by latitude and longitude to produce a smaller subset. I save that smaller subset with the to_netcdf command. But when I open the new netcdf, the timeseries is definitely wrong (see figures). The figure named 'Correct' is what the temperature timeseries looks like when plotted directly from the original dataset. The figure named 'wrong' is what it looks like when plotted from the newly saved netcdf. This happens when I select just a single point and save it as a netcdf, and it also happens when I save as a zarr file. However, when I load a single netcdf with open_dataset (instead of open_mfdataset) and save it as a new netcdf, everything is correct, so the issue seems to be coming from open_mfdataset. I've also noticed that not all grid points are incorrect; only some grid points have this issue. It doesn't happen when I convert to a series and save as a CSV, only when saving as a netcdf or zarr.
Link to the original discussion: #7025 (comment)
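
A rough sketch of the workflow described above (the file pattern, coordinate names, and slice bounds are hypothetical placeholders, not the reporter's exact code):

import xarray as xr

# open several monthly ERA5 files at once (placeholder file pattern)
ds = xr.open_mfdataset("T2m_*.nc")

# slice out a regional subset (placeholder bounds; ERA5 latitudes are stored descending)
subset = ds.sel(latitude=slice(34.0, 33.0), longitude=slice(-113.0, -112.0))

# save the subset and reopen it
subset.to_netcdf("T2m_subset.nc")
reopened = xr.open_dataset("T2m_subset.nc")   # the timeseries read back here looks wrong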

What did you expect to happen?

The data should have looked exactly the same.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr
import pandas as pd
import matplotlib.pyplot as plt

# I took the encoding data from the original data and applied it to a dummy dataset to reproduce the issue
original_encoding = {
'original_shape': (744, 109, 245),
'missing_value': -32767,
'_FillValue': -32767,
'scale_factor': 0.0011997040993123216,
'add_offset': 269.40377331689564}


# create a dummy time index
times = pd.date_range(start='2000-01-01',freq='1H',periods=8760)

# create dataset
ds = xr.Dataset({
    't2m': xr.DataArray(
                data   = np.random.random(8760),   
                dims   = ['time'],
                coords = {'time': times},

                ),
    'tmax': xr.DataArray(
                data   = np.random.random(8760),   
                dims   = ['time'],
                coords = {'time': times},
                )
            },
    )

# apply original encoding
ds.t2m.encoding = original_encoding

# save dataset as netcdf
ds.to_netcdf(r"...\test_ds2.nc")

# load saved dataset
ds_test = xr.open_dataset(r'...\test_ds2.nc')

# Plot the difference between the two variables
plt.plot(ds.t2m - ds_test.t2m)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.8 (default, Apr 13 2021, 15:08:03) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 165 Stepping 5, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: ('English_United States', '1252')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 2022.6.0
pandas: 1.4.1
numpy: 1.21.5
scipy: 1.8.0
netCDF4: 1.6.0
pydap: None
h5netcdf: 0.13.1
h5py: 3.6.0
Nio: None
zarr: 2.8.1
cftime: 1.6.0
nc_time_axis: 1.4.1
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.10.1
iris: 3.1.0
bottleneck: 1.3.5
dask: 2021.08.1
distributed: 2021.08.1
matplotlib: 3.5.1
cartopy: 0.18.0
seaborn: 0.11.1
numbagg: None
fsspec: 2022.8.2
cupy: None
pint: 0.19.2
sparse: None
flox: None
numpy_groupies: None
setuptools: 56.0.0
pip: 21.0.1
conda: None
pytest: None
IPython: 7.22.0
sphinx: 3.5.3

etsmith14 added the 'bug' and 'needs triage' labels Sep 14, 2022
@rabernat
Contributor

Thanks so much for taking the time to write up this detailed bug report! 🙏

@rabernat
Contributor

I'm puzzled that I was not able to reproduce this error. I modified the end slightly as follows

# save dataset as netcdf
ds.to_netcdf("test.nc")
# load saved dataset
ds_test = xr.open_dataset('test.nc')
# verify that the two are equal within numerical precision
xr.testing.assert_allclose(ds, ds_test)

# plot
plt.plot(ds.t2m - ds_test.t2m)

In my case, the differences were just numerical noise (order 10^-14).
[image: difference plot]

I used the binder environment for this.

I'm pretty stumped.

@etsmith14
Author

etsmith14 commented Sep 15, 2022

That figure is basically what I am getting. Perhaps I designed the MRE poorly; however, I am curious what exactly in the encoding introduces the noise (I still need to read through the documentation more thoroughly). If I don't apply the original encoding, I get a straight line at 0 in the difference plot.

With that being said, if you are willing to try a test with the actual ERA5 data, I've attached it here via a box link. I went back and figured out I need at least several files to get large differences. Oddly enough, if I use only 2 files, the difference looks more like noise (+/- 0.0005). If I only open a single file, no difference. If I add a couple more files, the differences become quite large.

Data: https://epri.box.com/s/spw9plf77lrjj1xz2spmwd34b5ls9dea

import xarray as xr
import matplotlib.pyplot as plt

# Open original time series
ERA5_t2m = xr.open_mfdataset(r'...\Test\T2m_*' + '.nc') # open 4 files

# Save time series as netcdf
ERA5_t2m.to_netcdf(r"...\Test\Phx_Temperature_to_netcdf.nc") # save 4 files

# open bad netcdf
ERA5_t2m_bad = xr.open_dataset(r'...\Test\Phx_Temperature_to_netcdf.nc')

# Lat and lon for Phx
lats = [33.35]
lons = [-112.86]

# plot the difference between the same point from the two files
plt.plot(ERA5_t2m.t2m.sel(latitude = lats[0], longitude = lons[0], method='nearest')
         - ERA5_t2m_bad.t2m.sel(latitude = lats[0], longitude = lons[0], method='nearest'))

[figure "test_diff": difference between the original and re-saved data at the selected point]

@rabernat
Contributor

I am curious as to what exactly from the encoding introduces the noise (I still need to read through the documentation more thoroughly)?

The encoding says that your data should be encoded according to the following pseudocode formula:

encoded = int((original - offset) / scale_factor)
decoded = (scale_factor * float(encoded)) + offset

So the floating-point data are converted back and forth to a less precise type (integer) in order to save space. These numerical operations cannot preserve exact floating-point accuracy; that's just how floating-point arithmetic works. If you skip the encoding, the floating-point bytes are written directly to disk, with no loss of precision.
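
As a concrete illustration of that round trip (a hypothetical sketch using the scale_factor and add_offset from the encoding above with int16 packing, not necessarily xarray's exact implementation):

import numpy as np

scale_factor = 0.0011997040993123216
add_offset = 269.40377331689564

original = np.array([270.15, 285.37, 300.01])   # made-up temperatures in Kelvin
encoded = np.round((original - add_offset) / scale_factor).astype(np.int16)
decoded = encoded.astype(np.float64) * scale_factor + add_offset

print(decoded - original)   # small residuals, bounded by roughly scale_factor / 2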

This sort of encoding is a crude form of lossy compression that is unfortunately still in use, even though much better algorithms are available (and built into netcdf and zarr). Differences on the order of 10^-14 should not affect any real-world calculations.

However, this seems like a much, much smaller difference than the problem you originally reported. This suggests that the MRE does not actually reproduce the bug after all. How was the plot above (#7039 (comment)) generated? From your actual MRE code? Or from your earlier example with real data?

@etsmith14
Author

Thanks for the explanation. Makes a lot more sense now! All figures I've attached are from the real ERA5 data. The figure I attached in my most recent comment with the alternative MRE (with the ERA5 data) is what I get when I run that code with the data I provided in the test folder.

@etsmith14
Author

etsmith14 commented Sep 15, 2022

One last observation: if I just remove dtype from the original encoding and apply it to the dataset before writing to a netcdf, everything works fine. If I leave dtype in, the issue appears.

# This works
encoding = {'original_shape': (720, 109, 245),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 0.0009673806360857793,
 'add_offset': 282.08577424425226}
# This does not work
encoding = {'original_shape': (720, 109, 245),
 'missing_value': -32767,
'dtype': 'int16',  # the original encoding has dtype('int16'); that raises an error for me, but the string form works fine for changing the data type
 '_FillValue': -32767,
 'scale_factor': 0.0009673806360857793,
 'add_offset': 282.08577424425226}
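
For reference, a minimal sketch of that workaround applied programmatically (placeholder file names): dropping 'dtype' from each variable's encoding before writing makes the data be stored as floats rather than re-packed into int16.

import xarray as xr

ds = xr.open_mfdataset("T2m_*.nc")           # placeholder file pattern
for var in ds.data_vars:
    ds[var].encoding.pop("dtype", None)      # drop only 'dtype'; keep the rest of the encoding
ds.to_netcdf("T2m_resaved_float.nc")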

dcherian removed the 'bug' and 'needs triage' labels Jan 17, 2023
@veenstrajelmer
Contributor

I have also encountered an issue with reading ERA5 data with open_mfdataset, writing it with to_netcdf(), and reading it again (Deltares/dfm_tools#239). I was actually looking for a place to report this and found your issue.

My expectation is that this happens because the ERA5 data are saved as ints, but the files all have different offsets/scale factors. Upon opening with open_mfdataset(), the data are converted to floats using the offset/scale factor of the first file. That is fine, but I think the issue occurs (as you also mention) because {'dtype': 'int16'} stays in the encoding: the file is written back as ints, and this seems to mess up the data. (This is all a theory.)
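
A rough numerical illustration of that theory, using the scale_factor/add_offset from the encoding quoted earlier in this thread and a made-up value from a warmer month (how the out-of-range value is clipped or wrapped on write is not verified here):

import numpy as np

scale_factor = 0.0009673806360857793   # from the first file's encoding above
add_offset = 282.08577424425226

value_from_other_file = 330.0          # made-up temperature outside the first file's range
packed = round((value_from_other_file - add_offset) / scale_factor)

print(packed)                          # about 49530
print(packed > np.iinfo(np.int16).max) # True: it does not fit into int16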

A workaround is to remove the dtype from the encoding for all variables in the file (or update to float32), but that seems cumbersome.

@etsmith14
Author

Thanks for flagging the issue again. I've been using the same workaround of removing the dtype before writing to a zarr/netcdf. It's an extra step but has worked for me so far.

@veenstrajelmer
Contributor

veenstrajelmer commented Feb 21, 2023

I have been thinking about a desirable solution, but I am having a bit of trouble with it. Besides removing dtype from the encoding (resulting in floats being written), one could also increase the scale_factor (e.g. to 0.5). Writing to int then takes half the disk space compared to dropping the int restriction and writing float32. Whatever you do, the data are altered at least slightly.

Apparently, the data cannot be properly written to integers after reading them. This is a bit odd, I would say; would that mean the scaling+offset of ERA5 data is chosen so tightly that, when applied to another dataset/month, the data falls outside the integer range? It would be great if this would "just work". At the moment, reading and writing ERA5 data with xarray apparently results in incorrect netcdf files. I expected xarray to work off the shelf with this type of data; it feels like xarray is designed for doing exactly this kind of thing.

@veenstrajelmer
Contributor

veenstrajelmer commented Mar 7, 2023

@etsmith14: another workaround is removing the scale_factor instead of the dtype. This keeps the file size small. However, there are slight offsets between the source and destination datasets, which is to be expected, since the original scale_factor for the msl variable was in the range of 0.1–0.11 and removing it means it defaults to 1. For your variable, the scale_factor might be completely different. However, maybe the scale_factor (and add_offset) can be replaced by something that works for all ERA5 data instead of a value very specific to a single dataset/period.

@etsmith14
Author

Thanks for the alternative @veenstrajelmer. I'll give it a try on my end.

@veenstrajelmer
Contributor

Hi @etsmith14. The suggestion I made loses accuracy, and depending on the variable that may not be acceptable. However, recomputing scale_factor and add_offset is possible: ArcticSnow/TopoPyScale#60 (comment)
It is more complicated than dropping the dtype, but it does keep the file size small.
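
A minimal sketch of that idea (my own recipe, not necessarily the one from the linked comment; file names are placeholders): recompute scale_factor and add_offset from the combined data's actual range so that int16 packing covers every file.

import numpy as np
import xarray as xr

def compute_scale_and_offset(da, fill_value=-32767):
    """Choose scale/offset so the data range packs into int16 without touching the fill value."""
    vmin = float(da.min())
    vmax = float(da.max())
    imin, imax = fill_value + 1, np.iinfo(np.int16).max   # usable integer range
    scale_factor = (vmax - vmin) / (imax - imin)
    add_offset = vmin - imin * scale_factor
    return scale_factor, add_offset

ds = xr.open_mfdataset("T2m_*.nc")                        # placeholder file pattern
for var in ds.data_vars:
    scale, offset = compute_scale_and_offset(ds[var])
    ds[var].encoding.update({"dtype": "int16", "scale_factor": scale,
                             "add_offset": offset, "_FillValue": -32767,
                             "missing_value": -32767})
ds.to_netcdf("T2m_repacked.nc")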

@etsmith14
Author

etsmith14 commented Mar 8, 2023

Thanks for that note. I have a bunch of variables, like precipitation type, where that would be totally fine. Definitely looking to save on disk space, so may try to recompute the scale_factor and add_offset on other variables as suggested.

@rabernat
Contributor

rabernat commented Mar 8, 2023

Rather than using the scale_factor and add_offset approach, I would look into xbitinfo if you want to optimize your compression.
