
Inconsistent results when calculating sums on float32 arrays w/ bottleneck installed #2370

Closed
agoodm opened this issue Aug 15, 2018 · 6 comments

agoodm (Contributor) commented Aug 15, 2018

Code Sample, a copy-pastable example if possible

Data file used is here: test.nc.zip
Output from each statement is commented out.

import xarray as xr
ds = xr.open_dataset('test.nc')
ds.cold_rad_cnts.min()
#13038.
ds.cold_rad_cnts.max()
#13143.
ds.cold_rad_cnts.mean()
#12640.583984
ds.cold_rad_cnts.std()
#455.035156
ds.cold_rad_cnts.sum()
#4.472997e+10

Problem description

As you can see above, the mean falls outside the range of the data, and the standard deviation is nearly two orders of magnitude higher than it should be. This is because a significant loss of precision is occurring when using bottleneck's nansum() on data with a float32 dtype. I demonstrated this effect here: pydata/bottleneck#193.
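A minimal sketch of the effect on synthetic data (assumes bottleneck is installed; the running float32 accumulator in bottleneck's nansum() is the presumed cause, and the exact numbers depend on the data):

import numpy as np
import bottleneck as bn

x = np.full(3_000_000, 13100.0, dtype=np.float32)  # synthetic values of similar magnitude to the file above
np.nansum(x)  # numpy's pairwise summation keeps the float32 result close to the true ~3.93e10
bn.nansum(x)  # bottleneck's naive float32 accumulation can drift noticeably away from that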

Naturally, this means that converting the data to float64 or any int dtype gives the correct result, as does using numpy's built-in functions instead or uninstalling bottleneck. An example is shown below.

Expected Output

In [8]: import numpy as np

In [9]: np.nansum(ds.cold_rad_cnts)
Out[9]: 46357123000.0

In [10]: np.nanmean(ds.cold_rad_cnts)
Out[10]: 13100.413

In [11]: np.nanstd(ds.cold_rad_cnts)
Out[11]: 8.158843

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.0
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: None
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.5.0
sphinx: None

Unfortunately this will probably not be fixed downstream anytime soon, so I think it would be nice if xarray provided some sort of automatic workaround rather than requiring me to remember to manually convert my data whenever it is float32. Making float64 the default (as discussed in #2304) would be nice, but at a minimum it would be good to emit a warning whenever bottleneck's nansum() is used on float32 arrays.
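For reference, a minimal sketch of the manual workaround (casting to float64, or passing dtype= to the reduction, which forces the numpy path as noted later in this thread):

import numpy as np

ds.cold_rad_cnts.astype(np.float64).sum()  # cast first, then reduce in full precision
ds.cold_rad_cnts.sum(dtype=np.float64)     # or pass dtype=, which also avoids bottleneck's float32 nansum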

fujiisoup (Member) commented Aug 16, 2018

After #2236, sum(skipna=False) will use the numpy function even if bottleneck is installed.
But in the other cases, we still use bottleneck.

I am actually not sure that automatically casting to float64 or switching to the numpy function is the correct path.
Some people may want float32 to save memory, and others may want bn.nansum because it is more efficient than numpy's counterpart.
The best algorithm may depend on the use case.

My proposal is to make these methods more explicit, e.g. by supporting an engine='bottleneck' keyword.
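A purely hypothetical sketch of what such an explicit keyword could look like (engine= is not an existing reduction argument; the name is only illustrative):

ds.cold_rad_cnts.sum(engine='numpy')       # default: numpy's nansum, slower but accurate
ds.cold_rad_cnts.sum(engine='bottleneck')  # explicit opt-in to the faster bottleneck path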

agoodm (Contributor, Author) commented Aug 16, 2018

Perhaps we could make it possible to set the ops engine (either numpy or bottleneck) and dtype (float32 or float64) via set_options()? Right now bottleneck is automatically chosen if it is installed, which is rather annoying since the xarray recipe on conda-forge ships with bottleneck even though it should be an optional dependency. Maybe that's something I should take up with the feedstock maintainers, but I think xarray should at least make its inclusion less rigid in light of these issues.
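A hypothetical sketch of what such options could look like (none of these option names exist in xarray 0.10.8; they are purely illustrative):

import xarray as xr

with xr.set_options(ops_engine='numpy', reduce_dtype='float64'):
    ds.cold_rad_cnts.sum()  # would use numpy and accumulate in float64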

fujiisoup (Member) commented:

Right now bottleneck is automatically chosen if it is installed, which is rather annoying since the xarray recipe on conda-forge ships with bottleneck even though it should be an optional dependency.

I didn't notice that.
I also think that bottleneck should be an optional dependency.
@shoyer, can you check this?
Maybe this file defines the dependency?

Perhaps we could make it possible to set the ops engine (either numpy or bottleneck) and dtype (float32 or float64) via set_options()

I think this is a reasonable option.

Personally, I think we can consider stopping the use of bottleneck entirely, or making it completely optional.
With the dask backend, bottleneck is not used; we only use it in the nan-aggregation methods for the numpy backend and in the rolling operations.
After #1837, our rolling with pure numpy is not terribly slow compared with bottleneck.

shoyer (Member) commented Aug 17, 2018

There has been discussion about changing the conda-forge dependencies for xarray: conda-forge/xarray-feedstock#5. Bottleneck definitely isn't a true required dependency.

Does it work to simply specify an explicit dtype in the sum?

I also wonder if it's really worth the hassle of using bottleneck here, given these numerical precision issues and the fact that it can't be used with dask. But I do think it still probably offers a meaningful speedup in many cases....

fujiisoup (Member) commented:

Does it work to simply specify an explicit dtype in the sum?

Yes. If the original array is np.float32 and we specify dtype=np.float64, then the calculation will be performed with np.nansum, so we can avoid using bottleneck.
But if we set dtype=np.float32 (the same as the input dtype), then bottleneck will be used.
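A short sketch of the behaviour described above (assuming bottleneck is installed):

import numpy as np

ds.cold_rad_cnts.sum(dtype=np.float64)  # dtype differs from the input, so np.nansum is dispatched
ds.cold_rad_cnts.sum(dtype=np.float32)  # dtype matches the input, so bottleneck's nansum is used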

But I do think it still probably offers a meaningful speedup in many cases....

How about making the numpy function the default and using bottleneck only when it is specified explicitly?
It does not simplify our code, though...

stale bot commented Jul 18, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
