
Second full review cfr v0.1.0 #103

Closed
wants to merge 381 commits into from

Conversation

pratikunterwegs
Collaborator

@pratikunterwegs pratikunterwegs commented Nov 10, 2023

This PR is intended to be a second and ideally brief full package review before a CRAN release of cfr v0.1.0.

Please note that the threshold for considering major changes as part of this release is high. Anybody who would like to review is welcome to assign themselves in the reviewer panel. Two reviewers or fewer will help to keep this process brief.

The deadline for handing in reviews is Wednesday 15th November.

Note: this is a second attempt at this review after looking into options for comparing directly against packagetemplate, which led to #102 being closed. Please go ahead and review the PR even though the dependency change workflow is failing, as the base is not an R package.

Note: a release checklist is open as issue #101, and is being worked on in #104.

pratikunterwegs and others added 30 commits June 15, 2023 13:58
Correct vignette references

Corrections to vignette code

Further corrections to vignette
@pratikunterwegs
Collaborator Author

Thanks @sbfnk and @Bisaloo for your feedback. I'll go through it and see whether any fixes can be rolled into #104, this PR, or whether it would be better to open a fresh PR.

@pratikunterwegs
Collaborator Author

pratikunterwegs commented Nov 15, 2023

Thanks @sbfnk and @Bisaloo for your feedback. Going through it, I've offered fixes/options for the key issues that need to be tackled prior to release. Could you and @adamkucharski please leave your feedback on these next steps? I will attempt to address them by week's end, but it might be good to budget for another couple of weeks before they are resolved, especially if code review is required.

  1. Package scope: The package is agnostic to the severity estimate but is targeted at CFR calculation. As such I think it should be named {cfr}, and I can tone down the references to other severity estimates.

  2. Delay distributions: {cfr} is aimed at daily data. If the methods specifically require discretised distributions, we should limit delay_density to discretised distributions, e.g. from {distcrete}, and take this on as a hard dependency. The "delay distributions" vignette will be removed.

  3. Delay distribution parameter uncertainty: I'm not sure how to address that we don't really account for this, beyond making this caveat clear to users. Happy to hear suggestions.

  4. estimate_ascertainment() returns a single estimate with CI for both the "static" and "time_varying" methods. An alternative is to return a time series of ascertainment estimates similar to the time-varying CFR. In this case it would make sense to split the function into two separate functions mirroring cfr_static() and cfr_ascertainment().

  5. cfr_static() returns a subset of cfr_rolling(), and could be an exported function that calls cfr_rolling() under the hood. The main downside of this is that cfr_rolling() calls estimate_severity() for each row of the data, which could be slow for large datasets. I would suggest leaving it as is for now but planning to merge in future as Hugo suggests.

  6. Vignette scope: I have tried to make the vignettes as detailed as possible as that's the preference I have previously heard. Could I get an indication of which parts of the vignettes should be removed, and which parts should be retained?

@adamkucharski
Member

adamkucharski commented Nov 15, 2023

Thanks for the review comments and discussion. A few thoughts on above:

  1. Agree makes sense to keep naming and main focus on CFR, but make clear to users that conceptually they could calculate other delayed outcome risks too (and maybe point to EpiNow2 functionality in later versions, for users who want more advanced options for handling secondary outcomes).

  2. There are some options for discrete distributions in epiparameter, but for longer delays (e.g. the sort of range for most onset-to-death distributions), the error will be relatively small (for n days, error in the equivalent Riemann sum scales with 1/2n), plus we're aggregating over various cumulative periods to get expected values in the calculation, so this will reduce overall error further (because variance of the mean scales with 1/n too). In practice the current approximation will therefore be much more reasonable than it would be for an adjustment from, say, onset to reporting for incidence estimation (which may be a couple of days). Suggest avoiding a hard dependency (which will add barriers to usability) and instead allowing flexibility in input distributions, but making the limitation clear to the user. Also, a reminder of this comment on scope to expand for non-daily data: Methods for low-frequency reporting #76 - tagging @CarmenTamayo as it may be a nice one for a case study as a starting point, then we can discuss for a later CFR version.

  3. This links in with a larger discussion about how to propagate uncertainty in parameters across our packages (e.g. Passing uncertainty between functions/packages quickfit#10, and discussions on {epidemics} and {scenarios}). Suggest keeping the flag to users that we're very much just calculating the expectation currently (which is still far, far better than the common practice of just doing deaths/cases), and looking at integration across the Epiverse system of 'glue' functions that can pass parameter uncertainty in sensible ways. Tagging @joshwlambert for info.

  4. Things get quite fiddly with time-varying estimation of underascertainment, as ideally you want a smooth underlying process model to handle this – we implemented something along these lines for COVID (https://github.com/thimotei/CFR_calculation), but the computational complexity created a lot of barriers to reuse.

  5. cfr_rolling() is mostly to support conceptual understanding of the role of delays I think (i.e. answering 'how would it have changed in real-time as more data came in?') - in actual real-time use most users will want the 'best' estimate that uses all data, rather than a hindsight illustration. So I'd keep cfr_static() as clean and fast as possible.

  6. Haven't reviewed changes to new vignettes, but previous version and quick starts seemed pretty to-the-point currently?

@pratikunterwegs
Collaborator Author

Thanks @adamkucharski for the feedback, it's really useful. I'm laying out next steps for this release below the discussion points; please do (all) chime in if you feel they should be different.

1. Agree makes sense to keep naming and main focus on CFR, but make clear to users that conceptually they could calculate other delayed outcome risks too (and maybe point to EpiNow2 functionality in later versions, for users who want more advanced options for handling secondary outcomes).

Will emphasise CFR over other measures, and point to other packages as Seb has suggested.

2. There are some options for discrete distributions in [epiparameter](https://epiverse-trace.github.io/epiparameter/reference/discretise.html), but for longer delays (e.g. the sort of range for most onset-to-death distributions), the error will be relatively small (for n days, error in the equivalent Riemann sum scales with 1/2n), plus we're aggregating over various cumulative periods to get expected values in the calculation, so this will reduce overall error further (because variance of the mean scales with 1/n too). In practice the current approximation will therefore be much more reasonable than it would be for an adjustment from, say, onset to reporting for incidence estimation (which may be a couple of days). Suggest avoiding a hard dependency (which will add barriers to usability) and instead allowing flexibility in input distributions, but making the limitation clear to the user. Also, a reminder of this comment on scope to expand for non-daily data: [Methods for low-frequency reporting #76](https://github.com/epiverse-trace/cfr/issues/76) - tagging @CarmenTamayo as it may be a nice one for a case study as a starting point, then we can discuss for a later CFR version.

I will keep things as they are (passing a function evaluating the PDF), and update the README and vignettes with boxes recommending discrete distributions. Speaking with @joshwlambert about this yesterday, {epiparameter} has functionality to discretise distributions as well, so it would be a useful addition as a dependency in future releases (cutting user workload).
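For intuition on the discretisation point, here is a minimal base-R sketch (gamma parameters are illustrative, not values from {cfr} or {epiparameter}) comparing the continuous density evaluated at integer days with a simple daily discretisation via differences of the CDF:

```r
# Illustrative only: gamma parameters roughly in the range of an
# onset-to-death delay; not taken from any package defaults.
shape <- 2.4
scale <- 3.3
days <- 0:30

# continuous density at integer days, as delay_density currently allows
f_cont <- dgamma(days, shape = shape, scale = scale)

# daily-discretised PMF: probability mass in each one-day interval
f_disc <- diff(pgamma(c(days, 31), shape = shape, scale = scale))

# pointwise discrepancy is small when the delay is spread over many days
max(abs(f_cont - f_disc))
```

For long delays the two curves are close, which is the intuition behind the 1/2n error argument above; for short delays of a couple of days the discrepancy would matter much more.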

3. This links in with a larger discussion about how to propagate uncertainty in parameters across our packages (e.g. [Passing uncertainty between functions/packages quickfit#10](https://github.com/epiverse-trace/quickfit/issues/10), and discussions on {epidemics} and {scenarios}). Suggest keeping the flag to users that we're very much just calculating the expectation currently (which is still far, far better than the common practice of just doing deaths/cases), and looking at integration across the Epiverse system of 'glue' functions that can pass parameter uncertainty in sensible ways. Tagging @joshwlambert for info.

Will mention in vignettes etc. but leave as is for now.

4. Things get quite fiddly with time-varying estimation of underascertainment, as ideally you want a smooth underlying process model to handle this – we implemented something along these lines for COVID (https://github.com/thimotei/CFR_calculation), but the computational complexity created a lot of barriers to reuse.

Thanks for clarifying - happy to implement in a future version, but will leave as is for now.

5. `cfr_rolling()` is mostly to support conceptual understanding of the role of delays I think (i.e. answering 'how would it have changed in real-time as more data came in?') - in actual real-time use most users will want the 'best' estimate that uses all data, rather than a hindsight illustration. So I'd keep `cfr_static()` as clean and fast as possible.

Will add the conceptual understanding point to the function doc Details section.

6. Haven't reviewed changes to new vignettes, but previous version and quick starts seemed pretty to-the-point currently?

@Bisaloo I'm open to cutting down the vignettes (see also #65), but would be good to tag specific sections as you have done in the {incidence2} vignette (will remove that section).

@sbfnk
Contributor

sbfnk commented Nov 16, 2023

  2. There are some options for discrete distributions in epiparameter, but for longer delays (e.g. the sort of range for most onset-to-death distributions), the error will be relatively small (for n days, error in the equivalent Riemann sum scales with 1/2n), plus we're aggregating over various cumulative periods to get expected values in the calculation, so this will reduce overall error further (because variance of the mean scales with 1/n too). In practice the current approximation will therefore be much more reasonable than it would be for an adjustment from, say, onset to reporting for incidence estimation (which may be a couple of days). Suggest avoiding a hard dependency (which will add barriers to usability) and instead allowing flexibility in input distributions, but making the limitation clear to the user. Also, a reminder of this comment on scope to expand for non-daily data: Methods for low-frequency reporting #76 - tagging @CarmenTamayo as it may be a nice one for a case study as a starting point, then we can discuss for a later CFR version.

I agree on avoiding the hard dependency and instead focusing on demonstrating good practice (i.e. using discrete distributions in examples) and clarifying the limitation to the potentially unsuspecting user.

  3. This links in with a larger discussion about how to propagate uncertainty in parameters across our packages (e.g. Passing uncertainty between functions/packages quickfit#10, and discussions on {epidemics} and {scenarios}). Suggest keeping the flag to users that we're very much just calculating the expectation currently (which is still far, far better than the common practice of just doing deaths/cases), and looking at integration across the Epiverse system of 'glue' functions that can pass parameter uncertainty in sensible ways. Tagging @joshwlambert for info.

It's fine to keep as is if we don't have a good solution for the moment but I agree it needs to be made clear to the user in the vignettes that the uncertainty we are stating is less than the true uncertainty.

@pratikunterwegs
Collaborator Author

pratikunterwegs commented Nov 17, 2023

Thanks all for your reviews and for the discussion.

I have implemented fixes to the issues raised here in #104 including:

  1. Substantial cuts to the vignettes per @Bisaloo,
  2. Unifying the estimate_outcomes() and cfr_time_varying() cases-PMF convolution into a single function .calc_expected_outcomes() per @sbfnk,
  3. Adding caveats and warnings about discrete distributions and parameter uncertainty.

Checks on #104 are currently failing due to a single snapshot test that has changed following 8d7db4c, suggested by @sbfnk (and my implementation in 70647b8). The estimated ascertainment for Covid in the UK using this method jumps from ~45% to ~80%, which seems quite high. It would be great if somebody could check cfr_time_varying() and especially .calc_expected_outcomes() and confirm this change should be accepted.

I'll leave this PR open until #104 is closed, but it might be easier to move the discussion there already.

@adamkucharski
Member

The underascertainment calculation should be using the static CFR calculation, right? (Following on from the discussion above about the challenges of implementing estimation for time-varying reporting, which is beyond the scope of this simpler estimation package.)

I haven't dug into the new commits, but the time-varying calculation should just be a straight implementation of the convolution described in the vignette. The 'burn-in' is just to avoid returning the earliest estimates to users (because not enough cases will have accrued in the time series to give a sensible estimate of total outcomes for early dates). So it may be a bookkeeping issue if a commit has changed how the vector of daily known outcomes lines up with the vector of daily deaths.

@pratikunterwegs
Collaborator Author

Thanks @adamkucharski - that's cleared it up, I've reversed the change in 8d7db4c.

@adamkucharski
Member

Thanks @adamkucharski - that's cleared it up, I've reversed the change in 8d7db4c.

Just to check, is the conclusion that the original version of cfr_time_varying.R is the correct convolution implementation, or was @sbfnk's edit making an improvement (but perhaps one where we needed to double-check the bookkeeping on the vector alignment)?

@pratikunterwegs
Collaborator Author

Thanks @adamkucharski - that's cleared it up, I've reversed the change in 8d7db4c.

Just to check, is the conclusion that the original version of cfr_time_varying.R is the correct convolution implementation, or was @sbfnk's edit making an improvement (but perhaps one where we needed to double-check the bookkeeping on the vector alignment)?

My understanding is that the original implementation was correct, but the notation in Nishiura is a bit confusing. Seb's suggestion was to offset the cases by the burn_in argument, but I don't think that's necessary as those estimates are returned as NA by default anyway. Happy to make corrections.

@sbfnk
Contributor

sbfnk commented Nov 20, 2023

Can I suggest adding a test that

```r
cfr_time_varying(
  data = df_covid_uk,
  burn_in = 7L
)
```

yields the same result as

```r
cfr_time_varying(
  data = df_covid_uk,
  delay_density = \(x) ifelse(x == 0, 1, 0),
  burn_in = 7L
)
```

to make sure your implementation doesn't introduce any bias due to index shifting?

@pratikunterwegs
Collaborator Author

Thanks @sbfnk - implementing this test shows that your implementation in 8d7db4c is correct.

So I'm unsure why the ascertainment snapshot test fails when other snapshot tests pass fine.

@adamkucharski
Member

adamkucharski commented Nov 20, 2023

Is there a simple sense check that could provide some insight? E.g. fix cases to be flat at 500 a day, deaths at 10 per day, and CFR at 1% - which should give an ascertainment rate of 0.01/(10/500) = 50% if the method is working?

@pratikunterwegs
Collaborator Author

Is there a simple sense check that could provide some insight? E.g. fix cases to be flat at 500 a day, deaths at 10 per day, and CFR at 1% - which should give an ascertainment rate of 0.01/(10/500) = 50% if the method is working?

Thanks, added these tests and they pass fine.
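As a sketch, the flat-incidence sense check reduces to simple arithmetic (this is the back-of-envelope version, not the package's actual test code):

```r
# Flat epidemic: 500 cases/day, 10 deaths/day, assumed 'true' CFR of 1%.
cases <- rep(500, 60)
deaths <- rep(10, 60)
cfr_assumed <- 0.01

# With flat incidence, convolving cases with any delay PMF leaves the
# expected case series unchanged, so the observed CFR is just deaths/cases
cfr_observed <- sum(deaths) / sum(cases)  # 0.02

# ascertainment is the assumed CFR over the observed CFR
ascertainment <- cfr_assumed / cfr_observed  # 0.5
```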

The convolution suggested earlier would offset cases and PMF values like so:

```r
# here offset is the burn_in value
delay_pmf_eval <- pmf_vals[seq_len(i - offset)]

# estimate expected number of outcomes
expected_outcomes <- cases[seq(offset + 1, i)] * rev(delay_pmf_eval)
```

creating misaligned vectors: the PMF values span `1:(n - burn_in)` (e.g. `1:43` for `burn_in = 7` and 50 days), while the cases span `(burn_in + 1):n` (e.g. `8:50`).

I've shifted to a simpler function, and am setting the first burn_in and last smoothing_window estimates to NA in the body of cfr_time_varying() instead.

```r
.convolve_cases_pmfs <- function(cases, pmf_vals) {
  # no input checks as this is an internal function
  vapply(
    X = seq_along(cases),
    FUN = function(i) {
      # estimate expected number of outcomes
      sum(cases[seq_len(i)] * rev(pmf_vals[seq_len(i)]))
    },
    FUN.VALUE = numeric(1)
  )
}
```
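In the spirit of @sbfnk's suggested test, a quick standalone check of this helper (the function is copied here so the snippet runs on its own): convolving with a point-mass delay at day 0 should return the case series unchanged.

```r
.convolve_cases_pmfs <- function(cases, pmf_vals) {
  # copied from the comment above so this check is self-contained
  vapply(
    X = seq_along(cases),
    FUN = function(i) {
      sum(cases[seq_len(i)] * rev(pmf_vals[seq_len(i)]))
    },
    FUN.VALUE = numeric(1)
  )
}

cases <- c(10, 20, 30, 40, 50)
pmf_delta <- c(1, 0, 0, 0, 0)  # all delay mass at 0 days

# rev() aligns cases[t - j] with pmf_vals[j + 1]; with a delta PMF the
# expected outcomes equal the raw cases, so no index shift has crept in
out <- .convolve_cases_pmfs(cases, pmf_delta)
identical(out, c(10, 20, 30, 40, 50))  # TRUE
```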

@pratikunterwegs
Collaborator Author

Thanks again all, closing this PR with changes implemented in #104
