Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checks on delay_density #89

Closed
pratikunterwegs opened this issue Oct 25, 2023 · 26 comments · Fixed by #94
Closed

Checks on delay_density #89

pratikunterwegs opened this issue Oct 25, 2023 · 26 comments · Fixed by #94
Assignees

Comments

@pratikunterwegs
Copy link
Collaborator

pratikunterwegs commented Oct 25, 2023

This issue is in reference to the function that is passed to delay_density. I would ideally like to resolve this issue by adding a fix to #88.

In #88, I have implemented a check on delay_density being a function with only argument. It could be good to constrain users in this way to avoid potential errors.

This is not really in line with other functionals as pointed out in the review comment below. I'm happy to remove this check, but it would be good to hear opinions on either side.

My initial preference would be to allow both to be in line with what other common functions are doing (apply(), map(), etc.). Could you open an issue for this please so others can weigh in and we can take some time to consider both alternatives?

Originally posted by @Bisaloo in #88 (comment)

@Bisaloo
Copy link
Member

Bisaloo commented Oct 25, 2023

From a very practical standpoint, this issue is about whether we want to also allow:

cfr_static(
  ebola1976,
  delay_density = mydensityfct
)

or only:

cfr_static(
  ebola1976,
  delay_density = function(x) mydensityfct(x)
)

@pratikunterwegs
Copy link
Collaborator Author

Yes, thanks for the clearer explanation.

@joshwlambert
Copy link
Member

I agree the check for a single argument (https://github.com/epiverse-trace/cfr/blob/dev/handle-dists/R/known_outcomes.R#L53-L58) is useful, but I think the first implementation Hugo listed is preferable given it will be more familiar to R users with knowledge of functional programming.

@pratikunterwegs
Copy link
Collaborator Author

Okay, thanks both - @adamkucharski is on leave for a few days but tagging here for future reference: I'll take the check away and users can follow the examples and/or know what they're doing.

@TimTaylor
Copy link

TimTaylor commented Oct 26, 2023

IIANM mistaken the way delay_density() is eventually used is in known_outcomes() where it is only given 1 input. Without the check, a mistake on the users part could potentially give confusing error messages. For this reason is it not best to leave the check in place? There may be scope to restructure a little to avoid duplicate checks.

@pratikunterwegs
Copy link
Collaborator Author

Thanks @TimTaylor - a related issue is that known_outcomes() is quasi-internal, but we do show how to use it in a vignette IIRC. Perhaps we can leave the check out for now, and revisit it once we get some user feedback.

@pratikunterwegs
Copy link
Collaborator Author

As follow up @TimTaylor - what restructure did you have in mind? If it's small, can be made here, or else in another PR

@TimTaylor
Copy link

This is more of a general thing.

Where you end up with external functions used internally you can sometimes avoid duplicating checks by having an external function just be a thin checking wrapper around an unchecked function. Then when you need to use the function internally you call the unchecked version. I often use the same name for the internal function but with a . prefix (e.g. myfun() and .myfun().

Of course this works best when the checks on one function are a subset of another (else you start needing to bubble up errors which quickly gets tricky).

@pratikunterwegs
Copy link
Collaborator Author

Thanks - so not related to the delay_density issue then? Will merge #88 for now

@TimTaylor
Copy link

TimTaylor commented Oct 26, 2023

Thanks - so not related to the delay_density issue then?

Well related in that checks on delay_density() are being duplicated.

Will merge #88 for now

It's hard to keep track of issues and PRs but my main concern was 08f870a. I think you want this check in place no? Again I may be missing something but if, for instance, someone passed delay_density = dgamma to the current PR would the resulting error message be helpful?

@TimTaylor
Copy link

I may be missing something but if, for instance, someone passed delay_density = dgamma to the current PR would the resulting error message be helpful?

Answering my own question. It would be ok because the underlying function called is delay_density() so the error message is still quite helpful even though it is dgamma erroring ...

Error in delay_density(seq(from = 0, to = nrow(data) - 1L)) : 
  argument "shape" is missing, with no default

@TimTaylor
Copy link

@pratikunterwegs - although I do still think this is more useful ...

R> cfr_static(ebola1976,  delay_density = dgamma)

Error in cfr_static(ebola1976, delay_density = dgamma) : 
  `delay_density` must be a distribution density function with 1 argument
    evaluating density at a vector of values and returning a numeric vector.
    E.g. function(x) stats::dgamma(x = x, shape = 5, scale = 1)

@pratikunterwegs
Copy link
Collaborator Author

Okay thanks - let me have a think about how we can balance the issue of informative erroring with @Bisaloo's point about not having to wrap functions with extra arguments

@Bisaloo
Copy link
Member

Bisaloo commented Oct 31, 2023

The immediate issue is that the nargs argument in test_function() checks the total number of arguments where we would like to have the number of required arguments / arguments without default value. This is illustrated in this reprex:

library(checkmate)

f1 <- function(x) {
  cat("Hello ", x)
}

f1("world")
#> Hello  world

checkmate::check_function(f1, nargs = 1)
#> [1] TRUE

f2 <- function(x, y = NULL) {
  cat("Hello ", x)
}

# f2 behaves the same as f1 because y has a default value
f2("world")
#> Hello  world

checkmate::check_function(f2, nargs = 1)
#> [1] "Must have exactly 1 formal arguments, but has 2"

Created on 2023-10-31 with reprex v2.0.2

This also means that my initial understanding of this issue was wrong and the real reason why #88 (comment) was failing was not because the function was passed as a symbol.

The immediate fix would be to write a test_function() variant and/or open a feature request in checkmate with a new required_nargs argument.

PR #94 addresses a distinct issue: the type and length of the return value. But it will not change anything to the error message mentioned in #89 (comment).

f <- function(delay_density) {
  
  stopifnot(
    "`delay_density` must be a function evaluating distribution density at a
    vector of values and returning a numeric vector of the same length.
    E.g. function(x) stats::dgamma(x = x, shape = 5, scale = 1)" =
      (checkmate::test_function(delay_density) &&
         checkmate::test_numeric(delay_density(seq(10)),
                                 lower = 0,
                                 any.missing = FALSE, finite = TRUE, len = 10L
         )) || is.null(delay_density)
  )

}

f(dgamma)
#> Error in delay_density(seq(10)): argument "shape" is missing, with no default

Created on 2023-10-31 with reprex v2.0.2

@pratikunterwegs
Copy link
Collaborator Author

Thanks - I had misunderstood what each of you had been asking to be fixed.

From @Bisaloo I understood that the change mentioned in #88 (comment) should work. This was IMO fixed in 08f870a, but because that function already has default arguments for shape and scale.

From @TimTaylor I understood that the error message resulting from removing a check on n_args in 08f870a might not be informative, but I didn't see a specific preference there.

I can implement a small test_function() variant for now as I'd prefer to push {cfr} soon, and optionally make a feature request to {checkmate} with this variant as an example.

@pratikunterwegs
Copy link
Collaborator Author

An alternative is a suggestion from @sbfnk on Slack to pass the function with distribution parameters contained in ..., which would be added to all functions cfr_*(). See e.g.

f = function(fn, n = seq(10), ...) {
  args = list(...)
  args = c(
    x = list(n),
    args
  )
  do.call(fn, args)
}

f(dgamma, shape = 5, scale = 1)
#>  [1] 0.01532831 0.09022352 0.16803136 0.19536681 0.17546737 0.13385262
#>  [7] 0.09122619 0.05725229 0.03373716 0.01891664

Created on 2023-10-31 with reprex v2.0.2

@Bisaloo
Copy link
Member

Bisaloo commented Oct 31, 2023

I can implement a small test_function() variant for now as I'd prefer to push {cfr} soon, and optionally make a feature request to {checkmate} with this variant as an example.

Yes, this seems like the best way forward. Note that this is not a blocker for release IMO as it's an enhancement that won't result in breaking changes. Happy to let others chip in if they feel otherwise though.

@pratikunterwegs
Copy link
Collaborator Author

Thanks - this has been added and used, see d0ef400 part of #94. Happy to hear alternatives as well.

@TimTaylor
Copy link

Using the ... will probably work well and would likely be simpler if people wanted to map over different parameters (e.g. shape and scale). You may (later) want to force them to be named and check they are valid within the function.

Is there a benefit to the do.call() approach versus fn(n, ...) (maybe - I'm trying to think at the moment)?

Think you would benefit from handling all of this within the underlying known_outcomes() (or whatever it's currently called) function. Then you could tryCatch() (or the comparable {rlang} call) in the wrapper function to bubble up the error messages. This is probably for later but would dry the input checking out a bit.

@pratikunterwegs
Copy link
Collaborator Author

Thanks @TimTaylor - I've gone with the test_function variant approach for now. It resolves the initial issues you and Hugo raised, and is not a user-facing change.

Using the ... will probably work well and would likely be simpler if people wanted to map over different parameters (e.g. shape and scale).

I don't understand this suggestion. Users can already pass a density function with different parameters. Could you show me an example of what you mean?

You may (later) want to force them to be named and check they are valid within the function.

No. This something I would prefer to leave to a dedicted distribution class instead. We do have an entire vignette on how to use this functionality so hopefully mant examples to follow.

Is there a benefit to the do.call() approach versus fn(n, ...) (maybe - I'm trying to think at the moment)?

That would work too, and looks cleaner, but we've decided to go with checking the delay_density arg instead.

Think you would benefit from handling all of this within the underlying known_outcomes() (or whatever it's currently called) function. Then you could tryCatch() (or the comparable {rlang} call) in the wrapper function to bubble up the error messages. This is probably for later but would dry the input checking out a bit.

Yes, that's fair, I'm happy to do that.

@TimTaylor
Copy link

I don't understand this suggestion. Users can already pass a density function with different parameters. Could you show me an example of what you mean?

.mapply(known_outcomes, dots = list(shape = 1:5, scale = 1:5), MoreArgs = list(data, dgamma))

I'm relatively indifferent to the approach you take but thought this was a pro for the ... approach as one less step I guess.

@pratikunterwegs
Copy link
Collaborator Author

pratikunterwegs commented Nov 1, 2023

I don't understand this suggestion. Users can already pass a density function with different parameters. Could you show me an example of what you mean?

.mapply(known_outcomes, dots = list(shape = 1:5, scale = 1:5), MoreArgs = list(data, dgamma))

Thanks. I do not want to support mapping over parameters. I think iteration over parameters should be a considered step where I do want users to slow down.

@pratikunterwegs
Copy link
Collaborator Author

I'm relatively indifferent to the approach you take but thought this was a pro for the ... approach as one less step I guess.

For clarity, I think the use case we have seen in the vignettes is about mapping cfr_*() over different datasets (e.g. countries), rather than over different delay distribution parameters. So I do not think that vectorising over the distribution parameters is likely to be a use case.

@pratikunterwegs
Copy link
Collaborator Author

Think you would benefit from handling all of this within the underlying known_outcomes() (or whatever it's currently called) function. Then you could tryCatch() (or the comparable {rlang} call) in the wrapper function to bubble up the error messages. This is probably for later but would dry the input checking out a bit.

I can take the check away from other cfr_*() functions as the error does bubble up as is.

library(cfr)
data = ebola1976
ddens = \(x) dgamma(x, shape = 2.40, scale = 3.33)

cfr_static(data, delay_density = ddens)
#>   severity_mean severity_low severity_high
#> 1         0.959        0.842             1

ddens = \(x, some) dgamma(x, shape = 2.40, scale = 3.33)

cfr_static(data, delay_density = ddens)
#> Error in known_outcomes(data = data, delay_density = delay_density): `delay_density` must be a function with a single required argument,
#>     and evaluating distribution density at a vector of values and returning a
#>     numeric vector of the same length.
#>     E.g. function(x) stats::dgamma(x = x, shape = 5, scale = 1)

Created on 2023-11-01 with reprex v2.0.2

@TimTaylor
Copy link

For clarity, I think the use case we have seen in the vignettes is about mapping cfr_*() over different datasets (e.g. countries), rather than over different delay distribution parameters. So I do not think that vectorising over the distribution parameters is likely to be a use case.

OK. But returning to:

Thanks. I do not want to support mapping over parameters. I think iteration over parameters should be a considered step where I do want users to slow down.

This is not about you performing the iteration for users but how your interface works with or against it. Users can still map over parameters just with a little more friction. I'm merely trying to highlight pros and cons of either approach. E.g.

funs <- mapply(
    function(shape, scale) \(x) dgamma(x, shape = shape, scale = scale),
    shape = 1:5,
    scale = 1:5
)

lapply(funs, known_outcomes, data = data)

Like I say, overall pretty I'm indifferent to the closure versus ellipsis approach.

@pratikunterwegs
Copy link
Collaborator Author

Okay thanks - I do want to work against such iteration. If users want to estimate CFRs based on different parameter sets, that is something I want them to pause and reconsider so this friction is inadvertent but IMO helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants