Allow calculate() to take function as input #175
Comments
I think it might be best to just focus on generalizing response variable calculations like |
This is not hard to implement, but the question of whether it is reasonable remains. This functionality implies the possibility of testing whether a certain statistic equals a certain value based on the present data. Say we want to test whether the maximum value of mpg equals 40:

mtcars %>%
  specify(mpg ~ NULL) %>%
  hypothesize("point", value = 40) %>%
  generate(1000) %>%
  calculate(max)

However, this seems logically impossible, as hypothesis construction and sample generation are done without any knowledge of which statistic they should represent. One possible solution might be something like this:

mtcars %>%
  specify(mpg ~ NULL) %>%
  hypothesize("point", stat = max, value = 40) %>%
  generate(1000) %>%
  calculate()

Unfortunately, this approach also has problems because of the need to use The approach with the need to As I said before, the solution might be to merge I would really like to hear whether all this makes sense here. |
Whoofta. Yes. That does make sense. The struggles with getting the bootstrap centered in the right spot I don't believe can be generalized. It might just be the case that It might be useful to just focus on cleaning up the code internally to see if there are any lessons learned that can be taken to a more generic framework. |
I think we can get the general functionality working for confidence intervals though since there isn’t the dependence on the hypothesize step. |
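A minimal base R sketch of why the confidence interval case is easier: a percentile bootstrap interval needs only resamples of the response and a function applied to each resample, with no null value to center on. The helper name `boot_ci()` and its arguments are hypothetical and not part of {infer}.

```r
# Hypothetical helper: percentile bootstrap interval for an arbitrary statistic.
boot_ci <- function(x, stat, reps = 1000, level = 0.95) {
  # Resample with replacement and apply the statistic to each resample
  stats <- vapply(
    seq_len(reps),
    function(i) stat(sample(x, size = length(x), replace = TRUE)),
    numeric(1)
  )
  alpha <- 1 - level
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(101)
# Any one-argument function of the response works, built-in or user-defined
ci_max <- boot_ci(mtcars$mpg, max)
ci_combo <- boot_ci(mtcars$mpg, function(v) mean(v) + median(v) + sd(v))
```

Whether a percentile interval is statistically sensible for a statistic such as the maximum is a separate question; the sketch only shows that no hypothesize step is involved.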
Allowing function as input to Examples of functionality:
mtcars %>%
specify(mpg ~ NULL) %>%
hypothesize("independence") %>%
generate(1000) %>%
calculate(max)
#> Error: Function in `calculate()` shouldn't be used whyle testing hypothesis.
set.seed(101)
mtcars %>%
specify(mpg ~ NULL) %>%
generate(1000) %>%
calculate(max)

set.seed(101)
mtcars %>%
specify(mpg ~ NULL) %>%
generate(1000) %>%
dplyr::summarise(stat = max(mpg))

Note that |
Several notes:
|
Personally I think adding support for general functions is going to be too much of a mess in {infer}. This is unfortunate as we’d like “power users” to also use this framework. I think it best we focus on how one could take the ideas from {infer} and apply them in a more general sense, but that should probably be a vignette with references to another package/extension. |
I have the same feeling about this. It was just a reaction to a new "help wanted" label, to clarify what is considered to be "help" here. |
It appears I accidentally added the help wanted label. Sorry about that! |
Somehow I missed this whole discussion, but you guys covered lots of good ground. Let me see if I can outline the specific functionality that I had originally envisioned and then evaluate whether it collides with the real problems Evgeni brings up. 1. Where base R stat functions exist, I'd like to use them (e.g.
|
To my mind, your comment somewhat confirms Chester's "too much of a mess" opinion. My current chain of reasoning for this issue:
So, if this is to be implemented, my best idea for now is to create a new argument |
This also seems like beyond the scope of {infer} given current status. Would love to see this functionality in a different "power-user" package though. |
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
I created this issue to discuss what seems to be a desired upgrade of {infer}: let calculate() take an arbitrary function as input to compute a broader range of statistics.

For quite some time I've been thinking about this. For now, I have concluded that there are two major issues.
tapply() shouldn't be very familiar to modern beginners). Also, there is the problem of translating a common one-argument function to this interface.

specify() as those variables will be retyped in that input function.

specify()-hypothesize()-generate()-calculate() pipeline, the result will represent the distribution of the statistic under the null hypothesis. As the current scope of statistics is bounded, this is handled manually in generate() by recentering the response column. This seems to be enough for now (it is probably a good idea to recheck this claim).

However, allowing an arbitrary function as input to calculate() breaks the idea of a "statistic distribution under the null hypothesis", as generate() won't know about the statistic beforehand. Imagine the following hypothesis test: is mean(mpg) + median(mpg) + sd(mpg) equal to 35? As clumsy as it sounds, it should be a valid question to test with {infer}, as calculate() should take any function. However, there is currently no way to generate the distribution of this statistic under the null hypothesis that it equals 35, as generate() doesn't know about the function itself.

The solution might be to merge generate() with calculate() and partially with hypothesize() (to set the value in "point" tests); however, this will heavily undermine {infer}'s foundations.

TL;DR. I am afraid that the only way to allow calculate() to take a truly arbitrary function as input is to rewrite most of the package, and the result will almost surely be different from the current {infer} approach and pipeline. If you are interested in this, I have some vague ideas, but I don't want to work on something that will compete with the {infer} approach.
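The recentering problem can be illustrated with a rough base R sketch. It assumes a mean-based shift of the response as the recentering step, which is an illustration and not necessarily what generate() does internally; the helper names `combo()` and `boot_stat()` are hypothetical.

```r
# Statistic from the example in the issue: mean + median + sd of the response
combo <- function(v) mean(v) + median(v) + sd(v)

# Hypothetical helper: bootstrap distribution of an arbitrary statistic
boot_stat <- function(data, stat, reps = 1000) {
  vapply(
    seq_len(reps),
    function(i) stat(sample(data, size = length(data), replace = TRUE)),
    numeric(1)
  )
}

x <- mtcars$mpg
mu0 <- 35

# Recentering that suits a test about the mean (assumed for illustration):
# shift the data so that its mean equals the null value
x_shifted <- x - mean(x) + mu0

set.seed(101)
null_mean <- boot_stat(x_shifted, mean)
null_combo <- boot_stat(x_shifted, combo)

mean(null_mean)   # close to mu0, as a null distribution for the mean should be
mean(null_combo)  # far from mu0: the shift was chosen without knowing the statistic
```

The same resamples give a usable null distribution for the mean but not for the combined statistic, because the shift has to be chosen with the statistic in mind.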