Use scipy.stats.bootstrap in pingouin.compute_bootci? #189

raphaelvallat · 2021-08-12T00:58:23Z

pingouin.compute_bootci could leverage the newly-added scipy.stats.bootstrap function (SciPy 1.7.0+) to calculate the bootstrapped confidence intervals.

Pros

SciPy implements the bias corrected and accelerated percentile method ("BCa"), which is also the default in the Matlab bootci function. However, it is currently only implemented for univariate statistics.
SciPy's implementation is overall fairly similar to Pingouin and it should not require a huge amount of work.

Cons

SciPy does not return the bootstrapped distribution, so we would need to deprecate the return_dist argument.
SciPy does not implement the normal approximation method ("norm").

The text was updated successfully, but these errors were encountered:

FlorinAndrei · 2022-06-19T03:16:09Z

I think return_dist could be simulated with a wrapper function that calls np.mean() or whatever, and appends the partial results to a global list at each step. The wrapper function is then passed as a value to statistic. At the end, the global list becomes the bootstrapped distribution. If that sounds right, could you add this suggestion to the docs please?

The BCa being only available for univariate statistics is not a deal breaker, but kinda annoying. I've hit that issue a few times. Maybe I'll stick a feature request into Scipy.

EDIT: scipy/scipy#16433

Thank you for all the improvements.

raphaelvallat · 2022-06-19T18:06:11Z

Thanks for opening the issue on SciPy! I think that scipy's bootstrap module is still under development so I'll stick to Pingouin's internal implementation for now.

I'm not sure to understand your method to get the bootstrapped distribution, could you please provide some example code?

Thanks,
Raphael

FlorinAndrei · 2022-06-20T21:22:19Z

Sure. I did this recently with Scipy:

Data set:

BirdPecks.csv

Code:

import pandas as pd
import numpy as np
from scipy import stats

BirdPecks = pd.read_csv("BirdPecks.csv")
Nboot = 10000

pg1 = BirdPecks[BirdPecks["group"] == 1]["pecks"].to_numpy()
pg2 = BirdPecks[BirdPecks["group"] == 2]["pecks"].to_numpy()

bdist = []
global bdist

def boot_mean(x, y):
    global bdist
    m = np.mean(x) - np.mean(y)
    bdist.append(m)
    return m

bs_out = stats.bootstrap(
    data=(pg1, pg2),
    statistic=boot_mean,
    vectorized=False,
    paired=False,
    confidence_level=0.9,
    n_resamples=Nboot,
    random_state=123,
    method="basic"
)

print(bdist)

FlorinAndrei · 2022-06-25T03:45:24Z

@raphaelvallat There are proposals for significant improvements to scipy.stats.bootstrap().

scipy/scipy#16454

scipy/scipy#16455

They're asking for extra comments, but right now I'm under a big time crunch, so I'm not sure I can help.

mdhaber · 2022-06-25T15:30:36Z

I'd be happy to add other enhancements to scipy.stats.bootstrap if it would make it more attractive!

As @FlorinAndrei mentioned, domain knowledge review of PRs would help - the main reason I haven't made enhancements already is shortage of reviewer time.

We can definitely return the bootstrap distribution in 1.10, and I now see no technical impediment to extending BCa to the multi-sample case.

I decided against returning a normal approximation confidence interval because Efron does not seem to recommended them. Of course, that would be super easy to add, either in a Pingouin wrapper or SciPy itself (if there's a good case made for it).

I'm currently writing a SciPy tutorial for all the resampling methods we offer (permutation_test and monte_carlo_test, too). Let me know if you'd like a preview!

raphaelvallat · 2022-06-26T23:04:17Z

Hi @mdhaber,

Thanks for all your work on the bootstrap module. I'd be happy to review a PR and/or tutorial 👍

I think that returning the bootstrapped distribution in SciPy would be great. If so, the current pingouin.compute_bootci would essentially become a wrapper around scipy.stats.bootstrap.

Two other potential ideas:

I think it might also be helpful to have public API for the lower-level confidence intervals estimation function, such that users can also calculate the CI from their own bootstrapped distribution. This is something that we can implement in pingouin.compute_bootci: the ability to pass as input either the raw array(s), or the bootstrapped distribution (e.g. pg.compute_bootci(bootdist, method="BCa")).
I do love the vectorized argument in scipy.stats.bootstrap, but I think for the (many) Python users that are not familiar with vectorization, this can be a little difficult to understand and use ( it took me a few trials to understand how to configure the axes, etc). Do you think that there might be a way to automatically detect if a function is vectorizable? Like if there is an axis keyword-argumnt, like in most numpy/scipy functions?

mdhaber · 2022-06-26T23:55:20Z

I'd be happy to review a PR and/or tutorial

Great! If you'd like to look at doc additions in scipy/scipy#16454, which returns the bootstrap distribution, I'd appreciate it! If you approve, be sure to select that radio button when you submit your review. As a maintainer, it certainly improves my confidence in merging when the PR has the approval of others.

I think it might also be helpful to have public API for the lower-level confidence intervals estimation function

OK, I'll be thinking about that.

Do you think that there might be a way to automatically detect if a function is vectorizable?

Yup, I can do that. Good idea. Update: see scipy/scipy#16651.

mdhaber · 2022-07-20T02:58:40Z

@raphaelvallat Now that scipy/scipy#16651 addresses your idea 2 above, here are some thoughts about your idea 1.

In retrospect, I would have made confidence_interval a method so that users could do this, but originally, bootstrap was going to return only confidence_interval. When we made it return a result object and standard_error, we didn't reconsider. Oops.

But perhaps it's for the best: a really flexible way of doing this is to add bootstrap_distribution as a parameter of bootstrap. The idea is that bootstrap would behave exactly as it does now, but it would append bootstrap_distribution to the bootstrap distribution it calculates. To calculate a differenct confidence interval, the user would just pass bootstrap_distribution and n_resamples=0 into bootstrap. But besides allowing the user to calculate different sorts of confidence intervals, they could also use this to iteratively increase the number of bootstrap replications. What do you think?

Another possibility is to make the new bootstrap_distribution an object with a method for doing the same sort of thing as above, but I think I'd rather not do this because it would be a new API to document (even if it is mostly a copy of the bootstrap API).

raphaelvallat · 2022-07-20T14:51:34Z

Awesome @mdhaber, I just reviewed scipy/scipy#16651!

a really flexible way of doing this is to add bootstrap_distribution as a parameter of bootstrap

I think that having a dedicated Bootstrap class, which can be initialized either with data + statistic, or bootstrap_distribution, and that contains methods such as confidence_intervals and standard_error is probably the cleanest. But what you suggest here is a good compromise. Will it also require passing data and statistic as well as bootstrap_distribution? I could see the case where the users only have the bootstrap distribution and not the underlying data or function.

Another possibility is to make the new bootstrap_distribution an object with a method for doing the same sort of thing as above, but I think I'd rather not do this because it would be a new API to document

I agree.

mdhaber · 2022-07-20T15:20:17Z

Will it also require passing data and statistic as well as bootstrap_distribution

Most CI methods need to know the observed value of the statistic to compute the confidence interval. I suppose I could allow data to default to None, in which case statistic could be the observed value instead of a function. In an initial PR, I would probably not implement this, but I'll think about it.

mdhaber · 2022-07-26T20:52:57Z

Just thought I'd mention that scipy/scipy#16455 would add BCa bootstrap for multi-sample statistics. I think it's correct now, and I'm working on unit tests for it. Reviews appreciated!

mdhaber · 2022-07-27T22:09:31Z

And scipy/scipy#16714 adds the functionality suggested in #189 (comment) to address @raphaelvallat's idea 1. With that, one can change confidence level, CI type, and add resamples (or not) like:

res = stats.bootstrap((x,), np.mean, confidence_level=0.95, method='percentile')
# without any new resampling, change the confidence level and method
res2 = stats.bootstrap((x,), np.mean, n_resamples=0, bootstrap_result=res, confidence_level=0.9, method='BCa')

raphaelvallat · 2022-07-31T20:40:21Z

@mdhaber I reviewed the second PR. Unfortunately, I don't think I have the stats skills nor time to do a full in-depth review of scipy/scipy#16455 — @FlorinAndrei might be better able to help here. But please let me know how I can be helping. Thanks!

mdhaber · 2022-07-31T20:45:30Z

No problem. Thanks for taking a look at scipy/scipy#16714!

mdhaber · 2022-08-23T15:03:59Z

The multi-sample BCa PR merged. Hope it helps!

raphaelvallat self-assigned this Aug 12, 2021

raphaelvallat added the feature request 🚧 New feature or request label Aug 12, 2021

raphaelvallat changed the title ~~Use scipy.stats.bootstrap in pingouin.compute_bootci~~ Use scipy.stats.bootstrap in pingouin.compute_bootci? Aug 12, 2021

This was referenced Jun 17, 2022

[Bugfix] compute_bootci should resample x and y separately when paired=False #276

Closed

Roadmap for release 0.5.2 #242

Closed

[bugfix] resample x and y separately in bootci when paired=False #281

Merged

mdhaber mentioned this issue Jul 20, 2022

ENH: stats.resampling: automatically detect whether statistic is vectorized scipy/scipy#16651

Merged

mdhaber mentioned this issue Jul 27, 2022

ENH: stats.bootstrap: extend previous bootstrap result scipy/scipy#16714

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use scipy.stats.bootstrap in pingouin.compute_bootci? #189

Use scipy.stats.bootstrap in pingouin.compute_bootci? #189

raphaelvallat commented Aug 12, 2021

FlorinAndrei commented Jun 19, 2022 •

edited

Loading

raphaelvallat commented Jun 19, 2022

FlorinAndrei commented Jun 20, 2022

FlorinAndrei commented Jun 25, 2022

mdhaber commented Jun 25, 2022 •

edited

Loading

raphaelvallat commented Jun 26, 2022

mdhaber commented Jun 26, 2022 •

edited

Loading

mdhaber commented Jul 20, 2022 •

edited

Loading

raphaelvallat commented Jul 20, 2022

mdhaber commented Jul 20, 2022 •

edited

Loading

mdhaber commented Jul 26, 2022

mdhaber commented Jul 27, 2022

raphaelvallat commented Jul 31, 2022

mdhaber commented Jul 31, 2022

mdhaber commented Aug 23, 2022

Use scipy.stats.bootstrap in pingouin.compute_bootci? #189

Use scipy.stats.bootstrap in pingouin.compute_bootci? #189

Comments

raphaelvallat commented Aug 12, 2021

FlorinAndrei commented Jun 19, 2022 • edited Loading

raphaelvallat commented Jun 19, 2022

FlorinAndrei commented Jun 20, 2022

FlorinAndrei commented Jun 25, 2022

mdhaber commented Jun 25, 2022 • edited Loading

raphaelvallat commented Jun 26, 2022

mdhaber commented Jun 26, 2022 • edited Loading

mdhaber commented Jul 20, 2022 • edited Loading

raphaelvallat commented Jul 20, 2022

mdhaber commented Jul 20, 2022 • edited Loading

mdhaber commented Jul 26, 2022

mdhaber commented Jul 27, 2022

raphaelvallat commented Jul 31, 2022

mdhaber commented Jul 31, 2022

mdhaber commented Aug 23, 2022

FlorinAndrei commented Jun 19, 2022 •

edited

Loading

mdhaber commented Jun 25, 2022 •

edited

Loading

mdhaber commented Jun 26, 2022 •

edited

Loading

mdhaber commented Jul 20, 2022 •

edited

Loading

mdhaber commented Jul 20, 2022 •

edited

Loading