
Second full review cfr v0.1.0 #103

Closed
wants to merge 381 commits into from

Conversation

pratikunterwegs
Collaborator

@pratikunterwegs pratikunterwegs commented Nov 10, 2023

This PR is intended to be a second and ideally brief full package review before a CRAN release of cfr v0.1.0.

Please note that the threshold for considering major changes as part of this release is high. Anybody who would like to review is welcome to assign themselves in the reviewer panel. Two reviewers or fewer will help to keep this process brief.

The deadline for handing in reviews is Wednesday 15th November.

Note: this is a second attempt at this review after looking into options for comparing directly against packagetemplate, which led to #102 being closed. Please go ahead and review the PR even though the dependency change workflow is failing, as the base is not an R package.

Note: a release checklist is open as issue #101, and is being worked on in #104.

pratikunterwegs and others added 30 commits June 15, 2023 13:58
Correct vignette references

Corrections to vignette code

Further corrections to vignette
@pratikunterwegs
Collaborator Author

Thanks @sbfnk and @Bisaloo for your feedback. I'll go through it and see whether any fixes can be rolled into #104, this PR, or whether it would be better to open a fresh PR.

@pratikunterwegs
Collaborator Author

pratikunterwegs commented Nov 15, 2023

Thanks @sbfnk and @Bisaloo for your feedback. Going through it, I've offered fixes/options for the key issues that need to be tackled prior to release. Could you and @adamkucharski please leave your feedback on these next steps? I will attempt to address them by week's end, but it might be good to budget for another couple of weeks before they are resolved, especially if code review is required.

  1. Package scope: The package is agnostic to the severity estimate but is targeted at CFR calculation. As such I think it should be named {cfr}, and I can tone down the references to other severity estimates.

  2. Delay distributions: {cfr} is aimed at daily data. If the methods specifically require discretised distributions, we should limit delay_density to discretised distributions, e.g. from {distcrete}, and take this on as a hard dependency. The "delay distributions" vignette will be removed.

  3. Delay distribution parameter uncertainty: I'm not sure how to address that we don't really account for this, beyond making this caveat clear to users. Happy to hear suggestions.

  4. estimate_ascertainment() returns a single estimate with CI for both the "static" and "time_varying" methods. An alternative is to return a time series of ascertainment estimates similar to the time-varying CFR. In this case it would make sense to split the function into two separate functions mirroring cfr_static() and cfr_ascertainment().

  5. cfr_static() returns a subset of cfr_rolling(), and could be an exported function that calls cfr_rolling() under the hood. The main downside of this is that cfr_rolling() calls estimate_severity() for each row of the data, which could be slow for large datasets. I would suggest leaving it as is for now but planning to merge in future as Hugo suggests.

  6. Vignette scope: I have tried to make the vignettes as detailed as possible as that's the preference I have previously heard. Could I get an indication of which parts of the vignettes should be removed, and which parts should be retained?

@adamkucharski
Member

adamkucharski commented Nov 15, 2023

Thanks for the review comments and discussion. A few thoughts on above:

  1. Agree makes sense to keep naming and main focus on CFR, but make clear to users that conceptually they could calculate other delayed outcome risks too (and maybe point to EpiNow2 functionality in later versions, for users who want more advanced options for handling secondary outcomes).

  2. There are some options for discrete distributions in epiparameter, but for longer delays (e.g. the sort of range for most onset-to-death distributions), the error will be relatively small (for n days, error in the equivalent Riemann sum scales with 1/2n), plus we're aggregating over various cumulative periods to get expected values in the calculation, so this will reduce overall error further (because variance of the mean scales with 1/n too). In practice the current approximation will therefore be much more reasonable than it would be for an adjustment from, say, onset to reporting for incidence estimation (which may be a couple of days). Suggest avoiding a hard dependency (which will add barriers to usability) and instead allowing flexibility in input distributions, but making the limitation clear to the user. Also, a reminder of this comment on scope to expand for non-daily data: Methods for low-frequency reporting #76 - tagging @CarmenTamayo as it may be a nice one for a case study as a starting point, then we can discuss for a later CFR version.

  3. This links in with a larger discussion about how to propagate uncertainty in parameters across our packages (e.g. Passing uncertainty between functions/packages quickfit#10, and discussions on {epidemics} and {scenarios}). Suggest keeping the flag to users that we're very much just calculating the expectation currently (which is still far, far better than the common practice of just doing deaths/cases), and looking at integration across the Epiverse system of 'glue' functions that can pass parameter uncertainty in sensible ways. Tagging @joshwlambert for info.

  4. Things get quite fiddly with time-varying estimation of underascertainment, as ideally you want a smooth underlying process model to handle this – we implemented something along these lines for COVID (https://github.com/thimotei/CFR_calculation), but the computational complexity created a lot of barriers to reuse.

  5. cfr_rolling() is mostly to support conceptual understanding of the role of delays I think (i.e. answering 'how would it have changed in real-time as more data came in?') - in actual real-time use most users will want the 'best' estimate that uses all data, rather than a hindsight illustration. So I'd keep cfr_static() as clean and fast as possible.

  6. Haven't reviewed changes to new vignettes, but previous version and quick starts seemed pretty to-the-point currently?

@pratikunterwegs
Collaborator Author

Thanks @adamkucharski for the feedback, it's really useful. I'm laying out next steps for this release below the discussion points; please do (all) chime in if you feel they should be different.

1. Agree makes sense to keep naming and main focus on CFR, but make clear to users that conceptually they could calculate other delayed outcome risks too (and maybe point to EpiNow2 functionality in later versions, for users who want more advanced options for handling secondary outcomes).

Will emphasise CFR over other measures, and point to other packages as Seb has suggested.

2. There are some options for discrete distributions in [epiparameter](https://epiverse-trace.github.io/epiparameter/reference/discretise.html), but for longer delays (e.g. the sort of range for most onset-to-death distributions), the error will be relatively small (for n days, error in the equivalent Riemann sum scales with 1/2n), plus we're aggregating over various cumulative periods to get expected values in the calculation, so this will reduce overall error further (because variance of the mean scales with 1/n too). In practice the current approximation will therefore be much more reasonable than it would be for an adjustment from, say, onset to reporting for incidence estimation (which may be a couple of days). Suggest avoiding a hard dependency (which will add barriers to usability) and instead allowing flexibility in input distributions, but making the limitation clear to the user. Also, a reminder of this comment on scope to expand for non-daily data: [Methods for low-frequency reporting #76](https://github.com/epiverse-trace/cfr/issues/76) - tagging @CarmenTamayo as it may be a nice one for a case study as a starting point, then we can discuss for a later CFR version.

I will keep things as they are (passing a function evaluating the PDF), and update the README and vignettes with boxes recommending discrete distributions. Speaking with @joshwlambert about this yesterday, {epiparameter} has functionality to discretise distributions as well, so it would be a useful addition as a dependency in future releases (cutting user workload).
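For intuition on the discretisation point, here is a minimal base-R sketch (gamma parameters are illustrative, not values from {cfr} or {epiparameter}) comparing the continuous density evaluated at integer days with a simple daily discretisation via differences of the CDF:

```r
# Illustrative only: gamma parameters roughly in the range of an
# onset-to-death delay; not taken from any package defaults.
shape <- 2.4
scale <- 3.3
days <- 0:30

# continuous density at integer days, as delay_density currently allows
f_cont <- dgamma(days, shape = shape, scale = scale)

# daily-discretised PMF: probability mass in each one-day interval
f_disc <- diff(pgamma(c(days, 31), shape = shape, scale = scale))

# pointwise discrepancy is small when the delay is spread over many days
max(abs(f_cont - f_disc))
```

For long delays the two curves are close, which is the intuition behind the 1/2n error argument above; for short delays of a couple of days the discrepancy would matter much more.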

3. This links in with a larger discussion about how to propagate uncertainty in parameters across our packages (e.g. [Passing uncertainty between functions/packages quickfit#10](https://github.com/epiverse-trace/quickfit/issues/10), and discussions on {epidemics} and {scenarios}). Suggest keeping the flag to users that we're very much just calculating the expectation currently (which is still far, far better than the common practice of just doing deaths/cases), and looking at integration across the Epiverse system of 'glue' functions that can pass parameter uncertainty in sensible ways. Tagging @joshwlambert for info.

Will mention in vignettes etc. but leave as is for now.

4. Things get quite fiddly with time-varying estimation of underascertainment, as ideally you want a smooth underlying process model to handle this – we implemented something along these lines for COVID (https://github.com/thimotei/CFR_calculation), but the computational complexity created a lot of barriers to reuse.

Thanks for clarifying - happy to implement in a future version, but will leave as is for now.

5. `cfr_rolling()` is mostly to support conceptual understanding of the role of delays I think (i.e. answering 'how would it have changed in real-time as more data came in?') - in actual real-time use most users will want the 'best' estimate that uses all data, rather than a hindsight illustration. So I'd keep `cfr_static()` as clean and fast as possible.

Will add the conceptual understanding point to the function doc Details section.

6. Haven't reviewed changes to new vignettes, but previous version and quick starts seemed pretty to-the-point currently?

@Bisaloo I'm open to cutting down the vignettes (see also #65), but would be good to tag specific sections as you have done in the {incidence2} vignette (will remove that section).

@sbfnk
Contributor

sbfnk commented Nov 16, 2023

  2. There are some options for discrete distributions in epiparameter, but for longer delays (e.g. the sort of range for most onset-to-death distributions), the error will be relatively small (for n days, error in the equivalent Riemann sum scales with 1/2n), plus we're aggregating over various cumulative periods to get expected values in the calculation, so this will reduce overall error further (because variance of the mean scales with 1/n too). In practice the current approximation will therefore be much more reasonable than it would be for an adjustment from, say, onset to reporting for incidence estimation (which may be a couple of days). Suggest avoiding a hard dependency (which will add barriers to usability) and instead allowing flexibility in input distributions, but making the limitation clear to the user. Also, a reminder of this comment on scope to expand for non-daily data: Methods for low-frequency reporting #76 - tagging @CarmenTamayo as it may be a nice one for a case study as a starting point, then we can discuss for a later CFR version.

I agree on avoiding the hard dependency and instead focusing on demonstrating good practice (i.e. using discrete distributions in examples) and clarifying the limitation to the potentially unsuspecting user.

  3. This links in with a larger discussion about how to propagate uncertainty in parameters across our packages (e.g. Passing uncertainty between functions/packages quickfit#10, and discussions on {epidemics} and {scenarios}). Suggest keeping the flag to users that we're very much just calculating the expectation currently (which is still far, far better than the common practice of just doing deaths/cases), and looking at integration across the Epiverse system of 'glue' functions that can pass parameter uncertainty in sensible ways. Tagging @joshwlambert for info.

It's fine to keep as is if we don't have a good solution for the moment but I agree it needs to be made clear to the user in the vignettes that the uncertainty we are stating is less than the true uncertainty.

@pratikunterwegs
Collaborator Author

pratikunterwegs commented Nov 17, 2023

Thanks all for your reviews and for the discussion.

I have implemented fixes to the issues raised here in #104 including:

  1. Substantial cuts to the vignettes per @Bisaloo,
  2. Unifying the estimate_outcomes() and cfr_time_varying() cases-PMF convolution into a single function .calc_expected_outcomes() per @sbfnk,
  3. Adding caveats and warnings about discrete distributions and parameter uncertainty.

Checks on #104 are currently failing due to a single snapshot test that has changed following 8d7db4c, suggested by @sbfnk (and my implementation in 70647b8). The estimated ascertainment for Covid in the UK using this method jumps from ~45% to ~80%, which seems quite high. It would be great if somebody could check cfr_time_varying() and especially .calc_expected_outcomes() and confirm this change should be accepted.

I'll leave this PR open until #104 is closed, but it might be easier to move the discussion there already.

@adamkucharski
Member

The underascertainment calculation should be using the static CFR calculation, right? (Following on from the discussion above about the challenges of implementing estimation for time-varying reporting, which is beyond the scope of this simpler estimation package.)

I haven't dug into the new commits, but the time-varying calculation should just be a straight implementation of the convolution described in the vignette. The 'burn-in' is just to avoid returning the earliest estimates to users (because not enough cases will have accrued in the time series to give a sensible estimate of total outcomes for early dates). So it may be a bookkeeping issue if a commit has changed how the vector of daily known outcomes lines up with the vector of daily deaths.

@pratikunterwegs
Collaborator Author

Thanks @adamkucharski - that's cleared it up, I've reversed the change in 8d7db4c.

@adamkucharski
Member

Thanks @adamkucharski - that's cleared it up, I've reversed the change in 8d7db4c.

Just to check, is the conclusion that the original version of cfr_time_varying.R is the correct convolution implementation, or was @sbfnk's edit making an improvement (but perhaps one where we needed to double-check the bookkeeping on the vector alignment)?

@pratikunterwegs
Collaborator Author

Thanks @adamkucharski - that's cleared it up, I've reversed the change in 8d7db4c.

Just to check, is the conclusion that the original version of cfr_time_varying.R is the correct convolution implementation, or was @sbfnk's edit making an improvement (but perhaps one where we needed to double-check the bookkeeping on the vector alignment)?

My understanding is that the original implementation was correct, but the notation in Nishiura is a bit confusing. Seb's suggestion was to offset the cases by the burn_in argument, but I don't think that's necessary as those estimates are returned as NA by default anyway. Happy to make corrections.

@sbfnk
Contributor

sbfnk commented Nov 20, 2023

Can I suggest adding a test that

```r
cfr_time_varying(
  data = df_covid_uk,
  burn_in = 7L
)
```

yields the same result as

```r
cfr_time_varying(
  data = df_covid_uk,
  delay_density = \(x) ifelse(x == 0, 1, 0),
  burn_in = 7L
)
```

to make sure your implementation doesn't introduce any bias due to index shifting?

@pratikunterwegs
Collaborator Author

Thanks @sbfnk - implementing this test shows that your implementation in 8d7db4c is correct.

So I'm unsure why the ascertainment snapshot test fails when other snapshot tests pass fine.

@adamkucharski
Member

adamkucharski commented Nov 20, 2023

Is there a simple sense check that could provide some insight? E.g. fix cases to be flat at 500 a day, deaths at 10 per day, and CFR at 1% - which should give an ascertainment rate of 0.01/(10/500) = 50% if the method is working?

@pratikunterwegs
Collaborator Author

Is there a simple sense check that could provide some insight? E.g. fix cases to be flat at 500 a day, deaths at 10 per day, and CFR at 1% - which should give an ascertainment rate of 0.01/(10/500) = 50% if the method is working?

Thanks, added these tests and they pass fine.
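As a sketch, the flat-incidence sense check reduces to simple arithmetic (this is the back-of-envelope version, not the package's actual test code):

```r
# Flat epidemic: 500 cases/day, 10 deaths/day, assumed 'true' CFR of 1%.
cases <- rep(500, 60)
deaths <- rep(10, 60)
cfr_assumed <- 0.01

# With flat incidence, convolving cases with any delay PMF leaves the
# expected case series unchanged, so the observed CFR is just deaths/cases
cfr_observed <- sum(deaths) / sum(cases)  # 0.02

# ascertainment is the assumed CFR over the observed CFR
ascertainment <- cfr_assumed / cfr_observed  # 0.5
```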

The convolution suggested earlier would offset cases and PMF values like so:

```r
# here offset is the burn_in value
delay_pmf_eval <- pmf_vals[seq_len(i - offset)]

# estimate expected number of outcomes
expected_outcomes <- cases[seq(offset + 1, i)] * rev(delay_pmf_eval)
```

creating misaligned vectors: the PMF values span `1:(n - burn_in)` (e.g. `1:43` for `burn_in = 7` and 50 days), while the cases span `(burn_in + 1):n` (e.g. `8:50`).

I've shifted to a simpler function, and am setting the first burn_in and last smoothing_window estimates to NA in the body of cfr_time_varying() instead.

```r
.convolve_cases_pmfs <- function(cases, pmf_vals) {
  # no input checks as this is an internal function
  vapply(
    X = seq_along(cases),
    FUN = function(i) {
      # estimate expected number of outcomes
      sum(cases[seq_len(i)] * rev(pmf_vals[seq_len(i)]))
    },
    FUN.VALUE = numeric(1)
  )
}
```
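In the spirit of @sbfnk's suggested test, a quick standalone check of this helper (the function is copied here so the snippet runs on its own): convolving with a point-mass delay at day 0 should return the case series unchanged.

```r
.convolve_cases_pmfs <- function(cases, pmf_vals) {
  # copied from the comment above so this check is self-contained
  vapply(
    X = seq_along(cases),
    FUN = function(i) {
      sum(cases[seq_len(i)] * rev(pmf_vals[seq_len(i)]))
    },
    FUN.VALUE = numeric(1)
  )
}

cases <- c(10, 20, 30, 40, 50)
pmf_delta <- c(1, 0, 0, 0, 0)  # all delay mass at 0 days

# rev() aligns cases[t - j] with pmf_vals[j + 1]; with a delta PMF the
# expected outcomes equal the raw cases, so no index shift has crept in
out <- .convolve_cases_pmfs(cases, pmf_delta)
identical(out, c(10, 20, 30, 40, 50))  # TRUE
```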

@pratikunterwegs
Collaborator Author

Thanks again all, closing this PR with changes implemented in #104
