
Allow calculate() to take function as input #175

Closed

echasnovski opened this issue Aug 6, 2018 · 13 comments

@echasnovski
Collaborator

I created this issue to discuss what seems to be a desired upgrade of {infer}: let calculate() take an arbitrary function as input to compute a broader range of statistics.

I've been thinking about this for quite some time. For now, I've concluded that there are two major issues.

  1. User interface, i.e. which argument(s) should the supplied function require?
    • If it is assumed to operate only on the response variable, then the scope of possible statistics is reduced drastically, as there will be no way to use the explanatory variable in it.
    • If it is assumed to take two arguments (response and explanatory), then there will be trouble with concise use of the explanatory variable in case of grouping (tapply() may not be very familiar to modern beginners). There is also the problem of translating a common one-argument function to this interface.
    • There is one more thing to keep in mind. For now, response and explanatory variables can only be represented by one column each. If generalizing {infer} to work with regression models is a desirable goal (in some distant future), then it will also raise the question of which arguments the input function should require.
    • Today I can see only one approach to solving this: let the input function operate on a data frame with all response and explanatory columns. Functions can then be written in the form of a magrittr functional sequence (kind of what I suggest in my {ruler} package); see the sketch after this list. However, this undermines the current package design of using specify(), as those variables will be retyped in the input function.
  2. Theoretical background. This is a more serious issue. The intention of {infer} is that, after using the whole specify()-hypothesize()-generate()-calculate() pipeline, the result will represent the distribution of the statistic under the null hypothesis. As the current scope of statistics is bounded, this is manually handled in generate() by recentering the response column. This seems to be enough for now (although it is probably a good idea to recheck this claim).
    However, allowing an arbitrary function as input to calculate() breaks the idea of "statistic distribution under null hypothesis", as generate() won't know about the statistic beforehand. Imagine wanting to do the following hypothesis test: does mean(mpg) + median(mpg) + sd(mpg) equal 35? As clumsy as it sounds, it should be a valid question to test with {infer} if calculate() can take any function. However, currently there is no way to generate the distribution of this statistic under the null that it equals 35, as generate() doesn't know about the function itself.
    The solution might be to merge generate() with calculate() and partially with hypothesize() (to set value in "point" tests); however, this would heavily undermine the foundations of {infer}.
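For concreteness, here is a minimal sketch of the data-frame interface mentioned in the last bullet of point 1, using the odd statistic from point 2. This is an assumption about one possible interface, not the {infer} API:

library(dplyr)

# Hypothetical interface: the statistic is a one-argument function of the
# data frame holding the response (and, if present, explanatory) columns.
odd_stat <- function(d) {
  d %>%
    summarise(stat = mean(mpg) + median(mpg) + sd(mpg)) %>%
    pull(stat)
}

odd_stat(mtcars)  # ~45.3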

TL;DR. I am afraid that the only way to let calculate() take a truly arbitrary function as input is to rewrite most of the package, and the result will almost surely be different from the current {infer} approach and pipeline. If you are interested in this, I have some vague ideas, but I don't want to work on something that would compete with the {infer} approach.

@ismayc
Collaborator

ismayc commented Aug 6, 2018

I think it might be best to just focus on generalizing response variable calculations like mean(), median(), max(), etc. for the moment. As you mentioned, anything more complicated than that is going to head down a rabbit hole quickly.

@echasnovski
Collaborator Author

This is not hard to implement but the question of whether it is reasonable remains.

This functionality implies the possibility of performing a hypothesis test of whether a certain statistic equals a certain value based on the present data. Say we want to test if the maximum value of mpg from mtcars equals 40 (the observed statistic is 33.9). Based on the current package design, sampling from the statistic's distribution under the null hypothesis (maximum equals 40) should be done in the following fashion:

mtcars %>%
  specify(mpg ~ NULL) %>%
  hypothesize("point", value = 40) %>%
  generate(1000) %>%
  calculate(max)

However, this seems logically impossible, as hypothesis construction and sample generation are done without any knowledge of which statistic they should represent. A solution one could think of might be something like this:

mtcars %>%
  specify(mpg ~ NULL) %>%
  hypothesize("point", stat = max, value = 40) %>%
  generate(1000) %>%
  calculate()

Unfortunately, this approach also has problems because of the need to use generate(). Basically, the result of generate() should be a set of population samples which, after applying the statistic to each sample, would give a sample from the statistic's distribution under the null hypothesis. Given the arbitrariness of the input statistic, this also seems impossible, as the function can be very non-linear and complicated. Take odd_stat = mean(mpg) + median(mpg) + sd(mpg), for example. In a point test, the goal would then be to produce a set of mpg samples that represent a world where their odd_stat is somewhat equal to a certain value.

The need to generate() introduces a very big restriction on the statistics that can be "point-tested". I think this is the root cause of #127: the current code doesn't actually allow performing a bootstrap hypothesis test of whether the standard deviation equals a certain value. This can be done by applying some variant of that commented solution. However, the possibility of doing this sort of manipulation for arbitrary statistics is questionable.
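For illustration, here is a rough base-R sketch of the kind of manual manipulation meant here (not necessarily the linked solution): bootstrap the statistic, then shift its bootstrap distribution so that it is centered at the hypothesized value.

set.seed(101)
obs_sd  <- sd(mtcars$mpg)
null_sd <- 5  # hypothesized value, chosen only for illustration

boot_sd   <- replicate(1000, sd(sample(mtcars$mpg, replace = TRUE)))
null_dist <- boot_sd - mean(boot_sd) + null_sd  # recenter at the null value

# Two-sided p-value relative to the shifted null distribution
mean(abs(null_dist - null_sd) >= abs(obs_sd - null_sd))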

As I said before, the solution might be to merge generate() with calculate() and partially with hypothesize() (to set value in "point" tests); however, this would heavily undermine the foundations of {infer}.

I would really like to hear whether all this makes sense here.

@ismayc
Collaborator

ismayc commented Aug 7, 2018

Whoofta. Yes, that does make sense. I don't believe the struggles with getting the bootstrap centered in the right spot can be generalized. It might just be the case that {infer} is best used for introductory statistics. It looks like generalizing without severely breaking the foundations is going to be mighty tricky.

It might be useful to just focus on cleaning up the code internally to see if there are any lessons learned that can be taken to a more generic framework.

@ismayc
Collaborator

ismayc commented Aug 8, 2018

I think we can get the general functionality working for confidence intervals though since there isn’t the dependence on the hypothesize step.

@echasnovski
Collaborator Author

Allowing a function as input to calculate() only outside the hypothesis testing framework is totally possible. However, for my taste it might be a little bit confusing for users.

Examples of functionality:

  1. Using hypothesize() in the pipeline results in an error:
mtcars %>%
  specify(mpg ~ NULL) %>%
  hypothesize("independence") %>%
  generate(1000) %>%
  calculate(max)
#> Error: Function in `calculate()` shouldn't be used while testing a hypothesis.
  2. Outside of hypothesis testing, these chunks of code should return identical results:
set.seed(101)
mtcars %>%
  specify(mpg ~ NULL) %>%
  generate(1000) %>%
  calculate(max)
set.seed(101)
mtcars %>%
  specify(mpg ~ NULL) %>%
  generate(1000) %>%
  dplyr::summarise(stat = max(mpg))

Note that the max() statistic is a bad candidate to use inside the bootstrap framework, as the result will never be greater than the actual max(mtcars$mpg). This fact introduces an educational problem: one can't just plug any function into calculate() and always obtain reasonable output.
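A quick base-R check of that fact (a resample can only contain observed values, so the bootstrapped maximum is bounded above by the observed maximum):

set.seed(101)
boot_max <- replicate(1000, max(sample(mtcars$mpg, replace = TRUE)))
all(boot_max <= max(mtcars$mpg))  # always TRUE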

@echasnovski
Collaborator Author

Several notes:

  • So are the requirements settled? Allow a function as input only outside the hypothesis testing framework (checked with is_nuat(x, "null"))? This might be confusing behavior: accepting a different type of input only if another function was called two steps earlier. Imagine reading the documentation of this as a new user.
  • If settled, this would be better implemented after "Converting to list-columns in generate()" (#208) is done.

@ismayc
Collaborator

ismayc commented Nov 12, 2018

Personally I think adding support for general functions is going to be too much of a mess in {infer}. This is unfortunate as we’d like “power users” to also use this framework. I think it best we focus on how one could take the ideas from {infer} and apply them in a more general sense, but that should probably be a vignette with references to another package/extension.

@echasnovski
Collaborator Author

I have the same feeling about this. It was just a reaction to a new "help wanted" label, to clarify what is considered to be "help" here.

@ismayc
Collaborator

ismayc commented Nov 12, 2018

It appears I accidentally added the help wanted label. Sorry about that!

@andrewpbray
Collaborator

andrewpbray commented Nov 12, 2018

Somehow I missed this whole discussion, but you guys covered lots of good ground.

Let me see if I can outline the specific functionality that I had originally envisioned and then evaluate whether it collides with the real problems Evgeni brings up.

1. Where base R stat functions exist, I'd like to use them (e.g. mean(), median())

To form a bootstrap interval, this would be straightforward enough:

mtcars %>%
  specify(mpg ~ NULL) %>%
  generate(1000) %>%
  calculate(mean)

The hypothesis test on a single mean (and median) would also work.

mtcars %>%
  specify(mpg ~ NULL) %>%
  hypothesize("point", mu = 30) %>%
  generate(1000) %>%
  calculate(mean)

The reason these would work is that we have already hard-coded the appropriate generate() behavior based on the input to hypothesize().
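For reference, a simplified sketch of what that hard-coded behavior amounts to for a point null on the mean (this is not the internal {infer} code): recenter the response so its mean equals mu, then bootstrap the statistic from the shifted values.

mu <- 30
shifted <- mtcars$mpg - mean(mtcars$mpg) + mu  # recenter at the null value
null_means <- replicate(1000, mean(sample(shifted, replace = TRUE)))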

2. I'd like to be able to easily riff off the chi-squared and F statistics (e.g. use a statistic that takes the absolute value instead of the square of the diffs in each sum).

In a setting like this:

mtcars %>%
  specify(mpg ~ cyl) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "F")

I'd like to be able to replace the F statistic with another arbitrary statistic that measures the ratio of the variability between groups to the variability within groups. Here's one (ignoring NSE issues):

F_1 <- function(data, response, explanatory) {
  # NSE issues ignored: `response` and `explanatory` are treated as bare column names.
  data %>%
    mutate(grand_mean = mean(response)) %>%
    group_by(explanatory) %>%
    mutate(group_mean = mean(response)) %>%
    ungroup() %>%
    mutate(within_diff  = abs(response - group_mean),
           between_diff = abs(group_mean - grand_mean)) %>%
    summarize(mean_diff_within  = mean(within_diff),
              mean_diff_between = mean(between_diff)) %>%
    mutate(F_1 = mean_diff_between / mean_diff_within) %>%
    pull()
}
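As a hedged usage sketch outside of {infer}, here is a hard-coded mpg/cyl version of the statistic above (sidestepping the NSE issues) used in a manual permutation test of independence:

library(dplyr)

f1_mpg_cyl <- function(d) {
  d %>%
    mutate(grand_mean = mean(mpg)) %>%
    group_by(cyl) %>%
    mutate(group_mean = mean(mpg)) %>%
    ungroup() %>%
    summarize(stat = mean(abs(group_mean - grand_mean)) /
                     mean(abs(mpg - group_mean))) %>%
    pull(stat)
}

obs <- f1_mpg_cyl(mtcars)
null_dist <- replicate(100, f1_mpg_cyl(mutate(mtcars, mpg = sample(mpg))))
mean(null_dist >= obs)  # one-sided permutation p-value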

Looking back at calculate(), there are certainly challenges to implementing this in a coherent way. It's possible that for statistics that require > 1 column, the functions will necessarily be difficult to write so that they work with NSE and with the S3 object system of routing the calculation.


Turning towards the broader points raised by Evgeni's original post, while an independence null can be generated regardless of the statistic, the same is not true for most point nulls because they don't fully describe the data generating process. We solve that by borrowing all of the other necessary information from the ecdf and using the bootstrap mechanism, which is coded into generate().

One simple solution would be to allow arbitrary functions only in the case of the independence null. That would solve (2). We could solve (1), even if crudely, just by matching the function to an S3 class in a similar way to what we currently do with strings. The question, though, would be whether this behavior - being able to pass mean but not my_mean - would do more harm than good.
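A rough sketch of that crude matching idea (an assumption for illustration, not {infer} code): recognize a small preset menu of functions and map them back to the existing string codes, so mean works but my_mean does not.

match_stat <- function(fn) {
  presets <- list(mean = mean, median = median, sd = sd)
  hit <- vapply(presets, identical, logical(1), y = fn)
  if (!any(hit)) stop("Only the preset statistic functions are supported.")
  names(presets)[hit][1]
}

match_stat(mean)
#> [1] "mean"
my_mean <- function(x) mean(x)
# match_stat(my_mean)  # error: not one of the presets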

One thing we could aim for: converting everything over so that the stat argument takes functions rather than strings. We'd have a preset menu of functions, as we currently do with strings, but with an asterisk saying that you can pass arbitrary functions when null = "independence".

@echasnovski
Collaborator Author

To my mind, your comment somewhat confirms Chester's "too much of a mess" opinion.

My current chain of reasoning for this issue:

  • Changing the limited set of string inputs (in calculate()) to a limited set of function inputs would break too much code (in fact, all "full" pipeline code). This is bad.
  • To preserve backward compatibility, at least three things might be done:
    • Allow stat in calculate() to be either a string or a function. Although this approach is seen in the tidyverse, it feels like "too much of a mess".
    • Add a new argument. This might be stat_f with a NULL default. If it is NULL, calculate() uses the stat argument; otherwise it uses stat_f directly.
    • Add a new function. This might be calculate_f(), which would take only a function as input. This feels like breaking the nice {infer} "only natural verbs" pipeline, although it might be a good approach.
  • If a user-defined function is to be allowed as input, what are {infer}'s responsibility boundaries for preventing its improper use?
    • Allow function input in all situations. This is bad not only from a methodological point of view (as discussed earlier), but also in terms of offered functionality (this can already be done straightforwardly with {dplyr}).
    • Allow function input in all situations except bootstrap generation for testing a hypothesis. That means allowing it outside the hypothesis testing framework (the very first example in a previous comment) and inside of it when generate() was called with "permute" or "simulate". In these cases the generated data seems to always have the "assumed" properties.

So, if this is to be implemented, for now my best idea is to create a new argument stat_f in calculate() which can't be used within bootstrap testing of a hypothesis. However, this still does seem like too much of a mess.
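A hedged sketch of what the stat_f idea could look like (hypothetical argument and attribute names, not the {infer} API); it assumes the generated data carries a replicate column and, as in the examples above, a response column named mpg.

library(dplyr)

calculate_sketch <- function(x, stat = NULL, stat_f = NULL) {
  if (is.null(stat_f)) {
    stop("The existing string-based `stat` path would run here, unchanged.")
  }
  # Hypothetical guard: refuse function input when bootstrapping under a null.
  if (!is.null(attr(x, "null")) && identical(attr(x, "type"), "bootstrap")) {
    stop("`stat_f` can't be used within bootstrap testing of a hypothesis.")
  }
  x %>%
    group_by(replicate) %>%
    summarise(stat = stat_f(mpg), .groups = "drop")
}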

@ismayc
Collaborator

ismayc commented Jun 17, 2020

This also seems beyond the scope of {infer} given its current status. Would love to see this functionality in a different "power-user" package, though.

@github-actions

github-actions bot commented Mar 8, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 8, 2021