
Inconsistent results when calculating sums on float32 arrays w/ bottleneck installed #2370

Closed
agoodm opened this issue Aug 15, 2018 · 6 comments

agoodm (Contributor) commented Aug 15, 2018

Code Sample, a copy-pastable example if possible

Data file used is here: test.nc.zip
Output from each statement is commented out.

import xarray as xr
ds = xr.open_dataset('test.nc')
ds.cold_rad_cnts.min()
#13038.
ds.cold_rad_cnts.max()
#13143.
ds.cold_rad_cnts.mean()
#12640.583984
ds.cold_rad_cnts.std()
#455.035156
ds.cold_rad_cnts.sum()
#4.472997e+10

Problem description

As you can see above, the mean falls outside the range of the data, and the standard deviation is nearly two orders of magnitude higher than it should be. This is because a significant loss of precision is occurring when using bottleneck's nansum() on data with a float32 dtype. I demonstrated this effect here: pydata/bottleneck#193.
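A minimal sketch of the effect on synthetic data (assumes bottleneck is installed; the running float32 accumulator in bottleneck's nansum() is the presumed cause, and the exact numbers depend on the data):

import numpy as np
import bottleneck as bn

x = np.full(3_000_000, 13100.0, dtype=np.float32)  # synthetic values of similar magnitude to the file above
np.nansum(x)  # numpy's pairwise summation keeps the float32 result close to the true ~3.93e10
bn.nansum(x)  # bottleneck's naive float32 accumulation can drift noticeably away from that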

Naturally, this means that converting the data to float64 or any int dtype gives the correct result, as does using numpy's built-in functions instead or uninstalling bottleneck. An example is shown below.

Expected Output

In [8]: import numpy as np

In [9]: np.nansum(ds.cold_rad_cnts)
Out[9]: 46357123000.0

In [10]: np.nanmean(ds.cold_rad_cnts)
Out[10]: 13100.413

In [11]: np.nanstd(ds.cold_rad_cnts)
Out[11]: 8.158843

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.0
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.2
distributed: 1.22.1
matplotlib: None
cartopy: None
seaborn: None
setuptools: 40.0.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.5.0
sphinx: None

Unfortunately this will probably not be fixed downstream anytime soon, so I think it would be nice if xarray provided some sort of automatic workaround rather than requiring me to remember to manually convert my data whenever it is float32. Making float64 the default (as discussed in #2304) would be nice, but at a minimum it would be good to emit a warning whenever bottleneck's nansum() is used on float32 arrays.
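For reference, a minimal sketch of the manual workaround (casting to float64, or passing dtype= to the reduction, which forces the numpy path as noted later in this thread):

import numpy as np

ds.cold_rad_cnts.astype(np.float64).sum()  # cast first, then reduce in full precision
ds.cold_rad_cnts.sum(dtype=np.float64)     # or pass dtype=, which also avoids bottleneck's float32 nansum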

fujiisoup (Member) commented Aug 16, 2018

After #2236, sum(skipna=False) will use the numpy function even if bottleneck is installed.
But in the other cases, we still use bottleneck.

I am actually not sure that automatically casting to float64 or switching to the numpy function is the correct path.
Some people may want float32 to save memory, and others may want bn.nansum because it is more efficient than numpy's counterpart.
The best algorithm may depend on the use case.

My proposal is to make these methods more explicit, e.g. by supporting an engine='bottleneck' keyword.
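A purely hypothetical sketch of what such an explicit keyword could look like (engine= is not an existing reduction argument; the name is only illustrative):

ds.cold_rad_cnts.sum(engine='numpy')       # default: numpy's nansum, slower but accurate
ds.cold_rad_cnts.sum(engine='bottleneck')  # explicit opt-in to the faster bottleneck path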

agoodm (Contributor, Author) commented Aug 16, 2018

Perhaps we could make it possible to set the ops engine (either numpy or bottleneck) and dtype (float32 or float64) via set_options()? Right now bottleneck is automatically chosen if it is installed, which is rather annoying since the xarray recipe on conda-forge ships with bottleneck even though it should be an optional dependency. Maybe that's something I should take up with the feedstock maintainers, but I think xarray should at least make its inclusion less rigid in light of these issues.
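A hypothetical sketch of what such options could look like (none of these option names exist in xarray 0.10.8; they are purely illustrative):

import xarray as xr

with xr.set_options(ops_engine='numpy', reduce_dtype='float64'):
    ds.cold_rad_cnts.sum()  # would use numpy and accumulate in float64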

fujiisoup (Member) commented:

Right now bottleneck is automatically chosen if it is installed, which is rather annoying since the xarray recipe on conda-forge ships with bottleneck even though it should be an optional dependency.

I didn't notice that.
I also think that bottleneck should be an optional dependency.
@shoyer, can you check this?
Maybe this file defines the dependency?

Perhaps we could make it possible to set the ops engine (either numpy or bottleneck) and dtype (float32 or float64) via set_options()

I think this is a reasonable option.

Personally, I think we can consider stopping the use of bottleneck entirely, or making it completely optional.
With the dask backend, bottleneck is not used; we only use it in the nan-aggregation methods for the numpy backend and in the rolling operations.
After #1837, our rolling with pure numpy is not terribly slow compared with bottleneck.

shoyer (Member) commented Aug 17, 2018

There has been discussion about changing the conda-forge dependencies for xarray: conda-forge/xarray-feedstock#5. Bottleneck definitely isn't a true required dependency.

Does it work to simply specify an explicit dtype in the sum?

I also wonder if it's really worth the hassle of using bottleneck here, given these numerical precision issues and the fact that it can't be used with dask. But I do think it still probably offers a meaningful speedup in many cases....

fujiisoup (Member) commented:

Does it work to simply specify an explicit dtype in the sum?

Yes. If the original array is np.float32 and we specify dtype=np.float64, then the calculation will be performed with np.nansum, so we can avoid using bottleneck.
But if we set dtype=np.float32 (the same as the input dtype), then bottleneck will be used.
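A short sketch of the behaviour described above (assuming bottleneck is installed):

import numpy as np

ds.cold_rad_cnts.sum(dtype=np.float64)  # dtype differs from the input, so np.nansum is dispatched
ds.cold_rad_cnts.sum(dtype=np.float32)  # dtype matches the input, so bottleneck's nansum is used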

But I do think it still probably offers a meaningful speedup in many cases....

How about making the numpy function the default and using bottleneck only when it is specified explicitly?
It does not simplify our code, though...

stale bot commented Jul 18, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.
