Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use scipy.stats.bootstrap in pingouin.compute_bootci? #189

Open
Tracked by #242
raphaelvallat opened this issue Aug 12, 2021 · 15 comments
Open
Tracked by #242

Use scipy.stats.bootstrap in pingouin.compute_bootci? #189

raphaelvallat opened this issue Aug 12, 2021 · 15 comments
Assignees
Labels
feature request 🚧 New feature or request

Comments

@raphaelvallat
Copy link
Owner

pingouin.compute_bootci could leverage the newly-added scipy.stats.bootstrap function (SciPy 1.7.0+) to calculate the bootstrapped confidence intervals.

Pros

  • SciPy implements the bias corrected and accelerated percentile method ("BCa"), which is also the default in the Matlab bootci function. However, it is currently only implemented for univariate statistics.
  • SciPy's implementation is overall fairly similar to Pingouin and it should not require a huge amount of work.

Cons

  • SciPy does not return the bootstrapped distribution, so we would need to deprecate the return_dist argument.
  • SciPy does not implement the normal approximation method ("norm").
@raphaelvallat raphaelvallat self-assigned this Aug 12, 2021
@raphaelvallat raphaelvallat added the feature request 🚧 New feature or request label Aug 12, 2021
@raphaelvallat raphaelvallat changed the title Use scipy.stats.bootstrap in pingouin.compute_bootci Use scipy.stats.bootstrap in pingouin.compute_bootci? Aug 12, 2021
@FlorinAndrei
Copy link

FlorinAndrei commented Jun 19, 2022

I think return_dist could be simulated with a wrapper function that calls np.mean() or whatever, and appends the partial results to a global list at each step. The wrapper function is then passed as a value to statistic. At the end, the global list becomes the bootstrapped distribution. If that sounds right, could you add this suggestion to the docs please?

The BCa being only available for univariate statistics is not a deal breaker, but kinda annoying. I've hit that issue a few times. Maybe I'll stick a feature request into Scipy.

EDIT: scipy/scipy#16433

Thank you for all the improvements.

@raphaelvallat
Copy link
Owner Author

Thanks for opening the issue on SciPy! I think that scipy's bootstrap module is still under development so I'll stick to Pingouin's internal implementation for now.

I'm not sure to understand your method to get the bootstrapped distribution, could you please provide some example code?

Thanks,
Raphael

@FlorinAndrei
Copy link

Sure. I did this recently with Scipy:

Data set:

BirdPecks.csv

Code:

import pandas as pd
import numpy as np
from scipy import stats

BirdPecks = pd.read_csv("BirdPecks.csv")
Nboot = 10000

pg1 = BirdPecks[BirdPecks["group"] == 1]["pecks"].to_numpy()
pg2 = BirdPecks[BirdPecks["group"] == 2]["pecks"].to_numpy()

bdist = []
global bdist

def boot_mean(x, y):
    global bdist
    m = np.mean(x) - np.mean(y)
    bdist.append(m)
    return m

bs_out = stats.bootstrap(
    data=(pg1, pg2),
    statistic=boot_mean,
    vectorized=False,
    paired=False,
    confidence_level=0.9,
    n_resamples=Nboot,
    random_state=123,
    method="basic"
)

print(bdist)

@FlorinAndrei
Copy link

@raphaelvallat There are proposals for significant improvements to scipy.stats.bootstrap().

scipy/scipy#16454

scipy/scipy#16455

They're asking for extra comments, but right now I'm under a big time crunch, so I'm not sure I can help.

@mdhaber
Copy link

mdhaber commented Jun 25, 2022

I'd be happy to add other enhancements to scipy.stats.bootstrap if it would make it more attractive!

As @FlorinAndrei mentioned, domain knowledge review of PRs would help - the main reason I haven't made enhancements already is shortage of reviewer time.

We can definitely return the bootstrap distribution in 1.10, and I now see no technical impediment to extending BCa to the multi-sample case.

I decided against returning a normal approximation confidence interval because Efron does not seem to recommended them. Of course, that would be super easy to add, either in a Pingouin wrapper or SciPy itself (if there's a good case made for it).

I'm currently writing a SciPy tutorial for all the resampling methods we offer (permutation_test and monte_carlo_test, too). Let me know if you'd like a preview!

@raphaelvallat
Copy link
Owner Author

Hi @mdhaber,

Thanks for all your work on the bootstrap module. I'd be happy to review a PR and/or tutorial 👍

I think that returning the bootstrapped distribution in SciPy would be great. If so, the current pingouin.compute_bootci would essentially become a wrapper around scipy.stats.bootstrap.

Two other potential ideas:

  1. I think it might also be helpful to have public API for the lower-level confidence intervals estimation function, such that users can also calculate the CI from their own bootstrapped distribution. This is something that we can implement in pingouin.compute_bootci: the ability to pass as input either the raw array(s), or the bootstrapped distribution (e.g. pg.compute_bootci(bootdist, method="BCa")).
  2. I do love the vectorized argument in scipy.stats.bootstrap, but I think for the (many) Python users that are not familiar with vectorization, this can be a little difficult to understand and use ( it took me a few trials to understand how to configure the axes, etc). Do you think that there might be a way to automatically detect if a function is vectorizable? Like if there is an axis keyword-argumnt, like in most numpy/scipy functions?

@mdhaber
Copy link

mdhaber commented Jun 26, 2022

I'd be happy to review a PR and/or tutorial

Great! If you'd like to look at doc additions in scipy/scipy#16454, which returns the bootstrap distribution, I'd appreciate it! If you approve, be sure to select that radio button when you submit your review. As a maintainer, it certainly improves my confidence in merging when the PR has the approval of others.

I think it might also be helpful to have public API for the lower-level confidence intervals estimation function

OK, I'll be thinking about that.

Do you think that there might be a way to automatically detect if a function is vectorizable?

Yup, I can do that. Good idea. Update: see scipy/scipy#16651.

@mdhaber
Copy link

mdhaber commented Jul 20, 2022

@raphaelvallat Now that scipy/scipy#16651 addresses your idea 2 above, here are some thoughts about your idea 1.

In retrospect, I would have made confidence_interval a method so that users could do this, but originally, bootstrap was going to return only confidence_interval. When we made it return a result object and standard_error, we didn't reconsider. Oops.

But perhaps it's for the best: a really flexible way of doing this is to add bootstrap_distribution as a parameter of bootstrap. The idea is that bootstrap would behave exactly as it does now, but it would append bootstrap_distribution to the bootstrap distribution it calculates. To calculate a differenct confidence interval, the user would just pass bootstrap_distribution and n_resamples=0 into bootstrap. But besides allowing the user to calculate different sorts of confidence intervals, they could also use this to iteratively increase the number of bootstrap replications. What do you think?

Another possibility is to make the new bootstrap_distribution an object with a method for doing the same sort of thing as above, but I think I'd rather not do this because it would be a new API to document (even if it is mostly a copy of the bootstrap API).

@raphaelvallat
Copy link
Owner Author

Awesome @mdhaber, I just reviewed scipy/scipy#16651!

a really flexible way of doing this is to add bootstrap_distribution as a parameter of bootstrap

I think that having a dedicated Bootstrap class, which can be initialized either with data + statistic, or bootstrap_distribution, and that contains methods such as confidence_intervals and standard_error is probably the cleanest. But what you suggest here is a good compromise. Will it also require passing data and statistic as well as bootstrap_distribution? I could see the case where the users only have the bootstrap distribution and not the underlying data or function.

Another possibility is to make the new bootstrap_distribution an object with a method for doing the same sort of thing as above, but I think I'd rather not do this because it would be a new API to document

I agree.

@mdhaber
Copy link

mdhaber commented Jul 20, 2022

Will it also require passing data and statistic as well as bootstrap_distribution

Most CI methods need to know the observed value of the statistic to compute the confidence interval. I suppose I could allow data to default to None, in which case statistic could be the observed value instead of a function. In an initial PR, I would probably not implement this, but I'll think about it.

@mdhaber
Copy link

mdhaber commented Jul 26, 2022

Just thought I'd mention that scipy/scipy#16455 would add BCa bootstrap for multi-sample statistics. I think it's correct now, and I'm working on unit tests for it. Reviews appreciated!

@mdhaber
Copy link

mdhaber commented Jul 27, 2022

And scipy/scipy#16714 adds the functionality suggested in #189 (comment) to address @raphaelvallat's idea 1. With that, one can change confidence level, CI type, and add resamples (or not) like:

res = stats.bootstrap((x,), np.mean, confidence_level=0.95, method='percentile')
# without any new resampling, change the confidence level and method
res2 = stats.bootstrap((x,), np.mean, n_resamples=0, bootstrap_result=res, confidence_level=0.9, method='BCa')

@raphaelvallat
Copy link
Owner Author

@mdhaber I reviewed the second PR. Unfortunately, I don't think I have the stats skills nor time to do a full in-depth review of scipy/scipy#16455@FlorinAndrei might be better able to help here. But please let me know how I can be helping. Thanks!

@mdhaber
Copy link

mdhaber commented Jul 31, 2022

No problem. Thanks for taking a look at scipy/scipy#16714!

@mdhaber
Copy link

mdhaber commented Aug 23, 2022

The multi-sample BCa PR merged. Hope it helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request 🚧 New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants