Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable bottleneck by default? #7344

Open
dcherian opened this issue Dec 1, 2022 · 11 comments
Open

Disable bottleneck by default? #7344

dcherian opened this issue Dec 1, 2022 · 11 comments

Comments

@dcherian
Copy link
Contributor

dcherian commented Dec 1, 2022

What is your issue?

Our choice to enable bottleneck by default results in quite a few issues about numerical stability and funny dtype behaviour: #7336, #7128, #2370, #1346 (and probably more)

Shall we disable it by default?

@TomNicholas
Copy link
Member

I kinda think correctness by default is more important than performance, especially if the default performance isn't prohibitively slow.

@max-sixty
Copy link
Collaborator

I'd be fine with disabling, since bottleneck doesn't seem to be actively maintained.

Though I would say it's numerically unstable rather than incorrect! I would always want it enabled, but it does make sense to default to the conservative option.

I had dreams of making numbagg into a better bottleneck — it's just as fast, much more flexible, and integrates really well with xarray. But those dreams have not come to pass (yet!).

@shoyer
Copy link
Member

shoyer commented Dec 4, 2022

The case where Bottleneck really makes a difference was for moving window statistics, where it uses a smarter algorithm than our current NumPy implementation, which creating a moving window view.

Otherwise, I agree, it probably isn't worth the trouble.

That said -- we could also switch to smarter NumPy based algorithms to implement most moving window calculations, e.g,. using np.nancumsum for moving window means.

@markelg
Copy link
Contributor

markelg commented Dec 14, 2022

I fully agree that correctness is the priority. Note however that some functions now require bottleneck, like ffill and bfill (I am not sure if there are more cases). They may need to be modified so they can run without bottleneck.

@shoyer
Copy link
Member

shoyer commented Dec 14, 2022

I think it's OK to still require bottleneck for ffill and bfill:

  1. There are no numerical concerns: these functions simply repeat numbers forward (or backwards).
  2. There is no good alternative to using a loop, and writing the loop in NumPy would be probitively slow.

@headtr1ck
Copy link
Collaborator

Maybe it is just a problem of documenting it more clearly?

@riley-brady
Copy link

I want to add a +1 to disable it by default. It's pretty common to be using float32 precision arrays. I have a rolling mean operation early on in some code and the errors balloon over time in subsequent processes. This was a super obscure bug to track down as well.

@bqi343
Copy link

bqi343 commented Mar 26, 2024

Though I would say it's numerically unstable rather than incorrect! I would always want it enabled, but it does make sense to default to the conservative option.

I recently stumbled across a case where bottleneck gives a completely incorrect answer or segfaults when taking the max of a small array (similar to: pydata/bottleneck#381 - a fix was proposed but never merged).

I think it's OK to still require bottleneck for ffill and bfill:

As of #8389 bottleneck is no longer used for ffill and bfill by default. Given the issues w/ precision and correctness, is there any remaining reason to not disable it by default? If any remaining functions require bottleneck, it would make sense to enable bottleneck by default for those functions only.

@shoyer
Copy link
Member

shoyer commented Mar 26, 2024

I would support disabling bottleneck by default for now, and eventually permanently removing support for bottleneck. Numbagg seems to be a more than capable and much more maintainable replacement.

@max-sixty
Copy link
Collaborator

As an update — numbagg is in a pretty good place now, and xarray defaults to using numbagg over bottleneck for overlapping functions when both are installed.

(I didn't finish my fuzzing work in numbagg, so I'm not as confident as I'd hope to be, but also no one has reported anything for a long time, so the "fuzzing from the world" has been successful...)

Numbagg doesn't do all of bottlenecks functions — there are functions that aren't practical to write with numba, since they require specialized data structures, such as moving max.

...though I'm not sure if that updates us towards or away from disabling it by default — the substitute is better than it used to be, but also the substitute overrides many of its functions by default when installed...

@bqi343
Copy link

bqi343 commented Mar 27, 2024

...though I'm not sure if that updates us towards or away from disabling it by default — the substitute is better than it used to be, but also the substitute overrides many of its functions by default when installed...

If I don't have numbagg installed but happen to have bottleneck installed, I would prefer for the default behavior to be using np.nanmax instead of the incorrect bottleneck.nanmax from the issue I linked above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants