From 41038d2c556ed56a649226f6e29285c1bf516eb1 Mon Sep 17 00:00:00 2001 From: athowes Date: Tue, 14 May 2024 11:07:07 +0100 Subject: [PATCH 01/19] Rewriting and scoping out getting started vignette --- vignettes/epidist.Rmd | 197 ++++++++---------------------------------- 1 file changed, 35 insertions(+), 162 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 83a50fb22..a2d19490d 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -1,7 +1,6 @@ --- -title: "Getting Started with Epidist" -description: "A quick start guide to using the Epidist package" -author: Epidist Team +title: "Getting started with epidist" +description: "A quick start guide to using the epidist R package" output: bookdown::html_vignette2: fig_caption: yes @@ -11,7 +10,7 @@ pkgdown: csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl link-citations: true vignette: > - %\VignetteIndexEntry{Getting Started with Epidist} + %\VignetteIndexEntry{Getting started with epidist} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- @@ -20,20 +19,30 @@ vignette: > # exclude compile warnings from cmdstanr knitr::opts_chunk$set( fig.path = "figures/getting-started-nowcasting-", - cache = TRUE, dpi = 330, - collapse = TRUE, comment = "#>", out.width = "100%", - message = FALSE, warning = FALSE, error = FALSE, - eval = FALSE + cache = TRUE, + dpi = 330, + collapse = TRUE, + comment = "#>", + out.width = "100%", + message = FALSE, + warning = FALSE, + error = FALSE, + eval = TRUE ) ``` -# Quick start +In epidemiology, many important quantities can be thought of as the time between two events, or as delays. +Examples include: -In this quick start, we demonstrate using `epidist` to ... +* the incubation period (time between infection and symptom onset) +* serial interval (time between symptom onset of infectee and symptom onset of infected), and +* generation interval (time between infection of infectee and infection of infected). -# Package +Unfortunately, estimating these quantities accurately from observational data is usually difficult. +In our experience, the two main issues are interval censoring and right truncation. +The `epidist` package makes it easy to estimate delays using a statistical model which properly accounts for these two issues. -In this quick start, we also use `data.table`, `purrr`, and `ggplot2` packages. These are both installed as dependencies when `epidist` is installed. Note that all output from `epidist` is readily useable with other tools, including `tidyverse` packages (see [here](https://mgimond.github.io/rug_2019_12/Index.html) for a comparison). +## Example data ```{r load-requirements} library(epidist) @@ -42,160 +51,24 @@ library(purrr) library(ggplot2) ``` -### Simulate the data +Here we quickly generate some simulated example data. +We focus on the format that the data needs to take for it to work with the fitting function. +Eventually, we discuss the tools available in the package to get your data into the right format, or we link to a vignette about it. -Simulate data from an outbreak. +## Fit the model -```{r simulate-outbreak} -outbreak <- simulate_gillespie(seed = 101) -``` - -Define the secondary distribution to use in the simulation - -```{r secondary-dist} -secondary_dist <- data.table( - meanlog = 1.8, sdlog = 0.5 -) |> - add_natural_scale_mean_sd() -``` - -Simulate an observation process during the growth phase for a secondary event using a lognormal distribution, and finally simulate observing this event. - -```{r simulate-data} -obs <- outbreak |> - simulate_secondary( - meanlog = secondary_dist$meanlog[[1]], - sdlog = secondary_dist$sdlog[[1]] - ) |> - observe_process() -``` - -Observe the outbreak after 25 days and take 100 samples. - -```{r observe-data} -truncated_obs <- obs |> - filter_obs_by_obs_time(obs_time = 25) -truncated_obs <- truncated_obs[sample(1:.N, 200, replace = FALSE)] -``` - -Plot primary cases (columns), and secondary cases (dots) by the observation time of their secondary events. This reflects would could be observed in real-time. - -```{r observed-cases} -truncated_cases <- construct_cases_by_obs_window( - obs, windows = c(25), obs_type = "stime" -) - -plot_cases_by_obs_window(truncated_cases) -``` - -We can alternatively plot the observed data based on primary events. This corresponds to a retrospective cohort based view of the data. - -```{r cohort-observed-cases} -truncated_cases <- construct_cases_by_obs_window( - obs, windows = c(25), obs_type = "ptime" -) - -plot_cases_by_obs_window(truncated_cases) -``` - -Plot the true continuous delay distribution and the empirical observed distribution for each observation window. - -```{r empirical-dist} -combined_obs <- combine_obs(truncated_obs, obs) - -plot_empirical_delay( - combined_obs, meanlog = secondary_dist$meanlog[[1]], - sdlog = secondary_dist$sdlog[[1]] -) -``` - -Plot the mean difference between continuous and discrete event time: - -```{r censor_delay} -censor_delay <- calculate_censor_delay(truncated_obs) -plot_censor_delay(censor_delay) -``` - -### Model - -Adjust for right truncation and date censoring using a latent variable approach. - -```{r} -latent_truncation_censoring_fit <- latent_truncation_censoring_adjusted_delay( - data = truncated_obs, cores = 4, refresh = 0 -) -``` - -### Summarise model posterior and compare to known truth - -Combine models into a named list. +We fit the model adjusting for right truncation and date censoring using a latent variable approach. +We briefly describe the model, and the inference methodology. +Eventually, we link to a vignette about common issues with model fitting and HMC. ```{r} -models <- list( - "Latent variable truncation and censoring adjusted" = latent_truncation_censoring_fit -) -``` - -Extract and summarise lognormal posterior estimates. - -```{r lognormal-draws} -draws <- models |> - map(extract_lognormal_draws) |> - rbindlist(idcol = "model") |> - rbind(epinowcast_draws, use.names = TRUE) - -draws <- draws[, - model := factor( - model, levels = c("Joint incidence and forward delay", rev(names(models))) - )] - -summarised_draws <- draws |> - draws_to_long() |> - summarise_draws(sf = 3) - -knitr::kable(summarised_draws[parameter %in% c("meanlog", "sdlog")]) -``` - -Plot summarised posterior estimates from each model compared to the ground truth. - -```{r, fig.width = 9, fig.height = 4} -draws |> - draws_to_long() |> - make_relative_to_truth(draws_to_long(secondary_dist)) |> - plot_relative_recovery(y = model, fill = model) + - facet_wrap(vars(parameter), nrow = 1, scales = "free_x") + - scale_fill_brewer(palette = "Dark2") + - guides(fill = guide_none()) + - labs( - y = "Model", x = "Relative to ground truth" - ) +# latent_truncation_censoring_fit <- latent_truncation_censoring_adjusted_delay( +# data = truncated_obs, cores = 4, refresh = 0 +# ) ``` -Finally, check the mean posterior predictions for each model against the observed daily cohort mean. -```{r postcheck} -truncated_draws <- draws |> - calculate_truncated_means( - obs_at = max(truncated_obs$stime_daily), - ptime = range(truncated_obs$ptime_daily) - ) |> - summarise_variable(variable = "trunc_mean", by = c("obs_horizon", "model")) - -truncated_draws[, model := factor( - model, levels = c("Joint incidence and forward delay", rev(names(models))) - )] +## Compare estimates to reality -truncated_draws |> - plot_mean_posterior_pred( - truncated_obs |> - calculate_cohort_mean( - type = "cohort", obs_at = max(truncated_obs$stime_daily) - ), - col = model, fill = model, mean = TRUE, ribbon = TRUE - ) + - guides( - fill = guide_legend(title = "Model", nrow = 4), - col = guide_legend(title = "Model", nrow = 4) - ) + - scale_fill_brewer(palette = "Dark2", aesthetics = c("fill", "colour")) + - theme(legend.direction = "vertical") -``` +Compare modelled estimate to underlying truth here to convince user that it works. +Perhaps as an exercise, we could show the "really simple" estimate being wrong here otherwise it won't be so impressive that the model is right. +This section should also show the user how to get objects that they might want out of the fitted object. From 1817865e18708ebb55305c1954a9d5e32426da11 Mon Sep 17 00:00:00 2001 From: athowes Date: Tue, 14 May 2024 11:49:19 +0100 Subject: [PATCH 02/19] Basic version of code functionality --- vignettes/epidist.Rmd | 28 +++++++++++++++++++++++++--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index a2d19490d..0679e2fa3 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -55,6 +55,22 @@ Here we quickly generate some simulated example data. We focus on the format that the data needs to take for it to work with the fitting function. Eventually, we discuss the tools available in the package to get your data into the right format, or we link to a vignette about it. +```{r} +outbreak <- simulate_gillespie(seed = 101) + +secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |> + add_natural_scale_mean_sd() + +truncated_obs <- outbreak |> + simulate_secondary( + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ) |> + observe_process() |> + filter_obs_by_obs_time(obs_time = 25) %>% + .[sample(1:.N, 200, replace = FALSE)] +``` + ## Fit the model We fit the model adjusting for right truncation and date censoring using a latent variable approach. @@ -62,9 +78,9 @@ We briefly describe the model, and the inference methodology. Eventually, we link to a vignette about common issues with model fitting and HMC. ```{r} -# latent_truncation_censoring_fit <- latent_truncation_censoring_adjusted_delay( -# data = truncated_obs, cores = 4, refresh = 0 -# ) +fit <- latent_truncation_censoring_adjusted_delay( + data = truncated_obs, cores = 4, refresh = 0 +) ``` ## Compare estimates to reality @@ -72,3 +88,9 @@ Eventually, we link to a vignette about common issues with model fitting and HMC Compare modelled estimate to underlying truth here to convince user that it works. Perhaps as an exercise, we could show the "really simple" estimate being wrong here otherwise it won't be so impressive that the model is right. This section should also show the user how to get objects that they might want out of the fitted object. + +```{r} +draws <- extract_lognormal_draws(fit) +draws_long <- draws_to_long(draws) +summarise_draws(draws_long, sf = 2) +``` \ No newline at end of file From 7cd6dcc851a5df5673b916315e45509cea66f45f Mon Sep 17 00:00:00 2001 From: athowes Date: Tue, 14 May 2024 16:45:45 +0100 Subject: [PATCH 03/19] Rewrite data simulation section --- vignettes/epidist.Rmd | 58 +++++++++++++++++++++++++++++++------------ 1 file changed, 42 insertions(+), 16 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 0679e2fa3..e89e18ee7 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -31,44 +31,70 @@ knitr::opts_chunk$set( ) ``` -In epidemiology, many important quantities can be thought of as the time between two events, or as delays. +```{r load-requirements} +library(epidist) +library(data.table) +library(purrr) +library(ggplot2) +``` + +Many important quantities in epidemiology can be thought of as the time between two events, or "delays". Examples include: * the incubation period (time between infection and symptom onset) * serial interval (time between symptom onset of infectee and symptom onset of infected), and * generation interval (time between infection of infectee and infection of infected). -Unfortunately, estimating these quantities accurately from observational data is usually difficult. +Unfortunately, estimating delays accurately from observational data is usually difficult. In our experience, the two main issues are interval censoring and right truncation. The `epidist` package makes it easy to estimate delays using a statistical model which properly accounts for these two issues. ## Example data -```{r load-requirements} -library(epidist) -library(data.table) -library(purrr) -library(ggplot2) -``` +Data must be in a particular format to be used within the `epidist` package. +In this section, we use package functions to simulate data of the appropriate format. + -Here we quickly generate some simulated example data. -We focus on the format that the data needs to take for it to work with the fitting function. -Eventually, we discuss the tools available in the package to get your data into the right format, or we link to a vignette about it. +We start by using the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data from a stochastic compartmental model: ```{r} outbreak <- simulate_gillespie(seed = 101) +``` +`outbreak` is a `data.table` object, with two columns: the case number `case` and the time of primary event `ptime`. + +For the delay between primary and secondary event, we will use a lognormal distribution: + +```{r} secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |> add_natural_scale_mean_sd() -truncated_obs <- outbreak |> +obs <- outbreak |> simulate_secondary( meanlog = secondary_dist[["meanlog"]], sdlog = secondary_dist[["sdlog"]] - ) |> + ) + +head(obs) +``` + +Now we are ready to simulate observation of the epidemic process at time `obs_time`: + + + +```{r} +obs_time <- 25 + +obs <- obs |> observe_process() |> - filter_obs_by_obs_time(obs_time = 25) %>% - .[sample(1:.N, 200, replace = FALSE)] + filter_obs_by_obs_time(obs_time = obs_time) +``` + +Finally, we suppose that only a random sample of individuals of size `sample_size` are observed: + +```{r} +sample_size <- 200 +obs <- obs[sample(1:.N, sample_size, replace = FALSE)] ``` ## Fit the model @@ -79,7 +105,7 @@ Eventually, we link to a vignette about common issues with model fitting and HMC ```{r} fit <- latent_truncation_censoring_adjusted_delay( - data = truncated_obs, cores = 4, refresh = 0 + data = obs, cores = 4, refresh = 0 ) ``` From 8d4a2c4fbd30086248eb9c5c6576b068d9c99323 Mon Sep 17 00:00:00 2001 From: athowes Date: Tue, 14 May 2024 17:08:06 +0100 Subject: [PATCH 04/19] Split up data observation process writing --- vignettes/epidist.Rmd | 26 ++++++++++++++------------ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index e89e18ee7..04a099849 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -78,34 +78,36 @@ obs <- outbreak |> head(obs) ``` -Now we are ready to simulate observation of the epidemic process at time `obs_time`: - - +Now we are ready to simulate observation of the epidemic process with both interval censoring and right truncation. +Using `observe_process` we censor primary and secondary event times with a daily interval: ```{r} -obs_time <- 25 +obs_cens <- obs |> observe_process() +``` + +Using `filter_obs_by_obs_time` we suppose that only individuals with a secondary event time before `obs_time` are observed: -obs <- obs |> - observe_process() |> - filter_obs_by_obs_time(obs_time = obs_time) +```{r} +obs_time <- 25 +obs_cens_trunc <- filter_obs_by_obs_time(obs_cens_trunc, obs_time = obs_time) ``` Finally, we suppose that only a random sample of individuals of size `sample_size` are observed: ```{r} sample_size <- 200 -obs <- obs[sample(1:.N, sample_size, replace = FALSE)] +obs_cens_trunc_samp <- obs_cens_trunc[sample(1:.N, sample_size, replace = FALSE)] ``` ## Fit the model -We fit the model adjusting for right truncation and date censoring using a latent variable approach. -We briefly describe the model, and the inference methodology. -Eventually, we link to a vignette about common issues with model fitting and HMC. +If we had access to the data `obs` it would be simple to estimate the delay distribution. +However, with only access to the censored, truncated, sampled data `obs_cens_trunc_samp` it is not so easy... +Fortunately we solve it! ```{r} fit <- latent_truncation_censoring_adjusted_delay( - data = obs, cores = 4, refresh = 0 + data = obs_cens_trunc_samp, cores = 4, refresh = 0 ) ``` From 44b74f7d7e92465f16f02556ea8641f77324d64a Mon Sep 17 00:00:00 2001 From: athowes Date: Wed, 15 May 2024 10:46:16 +0100 Subject: [PATCH 05/19] Rewrite Section 1 on data simulation --- vignettes/epidist.Rmd | 69 +++++++++++++++++++++++++++++++++---------- 1 file changed, 53 insertions(+), 16 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 04a099849..ea5d97bc6 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -5,6 +5,7 @@ output: bookdown::html_vignette2: fig_caption: yes code_folding: show + number_sections: true pkgdown: as_is: true csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl @@ -38,8 +39,8 @@ library(purrr) library(ggplot2) ``` -Many important quantities in epidemiology can be thought of as the time between two events, or "delays". -Examples include: +Many quantities in epidemiology can be thought of as the time between two events, or "delays". +Important examples include: * the incubation period (time between infection and symptom onset) * serial interval (time between symptom onset of infectee and symptom onset of infected), and @@ -47,23 +48,26 @@ Examples include: Unfortunately, estimating delays accurately from observational data is usually difficult. In our experience, the two main issues are interval censoring and right truncation. -The `epidist` package makes it easy to estimate delays using a statistical model which properly accounts for these two issues. +Don't worry if you've not come across these terms before. +By simulating example data, we will explain what they mean in Section \@ref(data)! +Next, in Section \@ref(fit), we show how `epidist` can be used to estimate delays using a statistical model which properly accounts for these two issues. +Finally, in Section \@ref(compare), we demonstrate that the fitted delay distribution accurately recovers the underlying truth. -## Example data +# Example data {#data} -Data must be in a particular format to be used within the `epidist` package. -In this section, we use package functions to simulate data of the appropriate format. +Data should be in a certain format for use within the `epidist` package. +In this section, we simulate data of the appropriate format, and in doing so explain the two main issues with observational delay data. -We start by using the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data from a stochastic compartmental model: +First, we use the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data from a stochastic compartmental model: ```{r} outbreak <- simulate_gillespie(seed = 101) ``` -`outbreak` is a `data.table` object, with two columns: the case number `case` and the time of primary event `ptime`. +`outbreak` is a `data.table` with columns for the case number `case` and the time of primary event `ptime`. -For the delay between primary and secondary event, we will use a lognormal distribution: +Now, to generate secondary events, we will use a lognormal distribution ror the delay between primary and secondary events: ```{r} secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |> @@ -78,28 +82,61 @@ obs <- outbreak |> head(obs) ``` -Now we are ready to simulate observation of the epidemic process with both interval censoring and right truncation. -Using `observe_process` we censor primary and secondary event times with a daily interval: +`obs` is a `data.table` object with additional columns for the delay `delay` and time of secondary event `stime`. +The secondary event time is simply the primary event time plus the delay: ```{r} +all(obs$ptime + obs$delay == obs$stime) +``` + +If we received the data `obs` as above then estimating the delay distribution would be easy, and this package wouldn't need to exist. +However, in reality, we will almost never receive the data as above. + +First, the times of primary and secondary events will usually be censored. +This means that rather than exact event times, we observe event times within an interval. +Here we suppose that the interval is daily, resulting in an obscuring of the exact delay data (Figure \@ref(fig:cens)): + +```{r} +# observe_process() should be renamed and include a "daily" i.e. 1 argument obs_cens <- obs |> observe_process() ``` -Using `filter_obs_by_obs_time` we suppose that only individuals with a secondary event time before `obs_time` are observed: +(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. + +```{r cens, fig.cap="(ref:cens)"} +ggplot(obs_cens, aes(x = delay, y = delay_daily)) + + geom_point(alpha = 0.25) + + geom_abline(slope = 1, intercept = 0, linetype = "dashed") + + theme_minimal() + + coord_fixed() + + labs(x = "Exact delay time (days)", y = "Censored delay time (days)") +``` + +Next, during an outbreak we will usually be estimating delays in real time. +The result is that only those cases with a secondary event occuring before some time will be observed. +This is called (right) truncation, and biases the observation process towards shorter delays: ```{r} obs_time <- 25 -obs_cens_trunc <- filter_obs_by_obs_time(obs_cens_trunc, obs_time = obs_time) +# filter_obs_by_obs_time() should be renamed to refer to stime +obs_cens_trunc <- filter_obs_by_obs_time(obs_cens, obs_time = obs_time) ``` -Finally, we suppose that only a random sample of individuals of size `sample_size` are observed: +Finally, in reality, it's not possible to observe every case. We suppose that a random sample of individuals of size `sample_size` are observed: ```{r} sample_size <- 200 +``` + +This sample size corresponds to `r 100 * round(sample_size / nrow(obs_cens_trunc), 3)`% of the data. + +```{r} obs_cens_trunc_samp <- obs_cens_trunc[sample(1:.N, sample_size, replace = FALSE)] ``` -## Fit the model +With our censored, truncated, and sampled data, we are now ready to attempt to recover the underlying delay distribution using `epidist`. + +# Fit the model {#fit} If we had access to the data `obs` it would be simple to estimate the delay distribution. However, with only access to the censored, truncated, sampled data `obs_cens_trunc_samp` it is not so easy... @@ -111,7 +148,7 @@ fit <- latent_truncation_censoring_adjusted_delay( ) ``` -## Compare estimates to reality +# Compare estimates {#compare} Compare modelled estimate to underlying truth here to convince user that it works. Perhaps as an exercise, we could show the "really simple" estimate being wrong here otherwise it won't be so impressive that the model is right. From 037afbbc178d45ef6eae8da3bd78f89f7adb13c0 Mon Sep 17 00:00:00 2001 From: athowes Date: Wed, 15 May 2024 11:40:54 +0100 Subject: [PATCH 06/19] Start writing the fitting section, and edits to Section 1 --- vignettes/epidist.Rmd | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index ea5d97bc6..2f6075ab4 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -42,14 +42,18 @@ library(ggplot2) Many quantities in epidemiology can be thought of as the time between two events, or "delays". Important examples include: -* the incubation period (time between infection and symptom onset) +* the incubation period (time between infection and symptom onset), * serial interval (time between symptom onset of infectee and symptom onset of infected), and * generation interval (time between infection of infectee and infection of infected). Unfortunately, estimating delays accurately from observational data is usually difficult. -In our experience, the two main issues are interval censoring and right truncation. -Don't worry if you've not come across these terms before. -By simulating example data, we will explain what they mean in Section \@ref(data)! +In our experience, the two main issues are: + +1. interval censoring, and +2. right truncation. + +Don't worry if you've not come across these terms before! +In Section \@ref(data), we will explain what they mean by simulating example data analogous to what we might observe during an ongoing infectious disease outbreak. Next, in Section \@ref(fit), we show how `epidist` can be used to estimate delays using a statistical model which properly accounts for these two issues. Finally, in Section \@ref(compare), we demonstrate that the fitted delay distribution accurately recovers the underlying truth. @@ -138,14 +142,19 @@ With our censored, truncated, and sampled data, we are now ready to attempt to r # Fit the model {#fit} -If we had access to the data `obs` it would be simple to estimate the delay distribution. -However, with only access to the censored, truncated, sampled data `obs_cens_trunc_samp` it is not so easy... -Fortunately we solve it! +It would be simple to estimate the delay distribution if we had access to `obs`. +However, with only access to `obs_cens_trunc_samp` we must use a more sophisticated statistical model to ensure that our estimates are not biased. + +The primary modelling function in `epidist` to do this is `latent_truncation_censoring_adjusted_delay`. +In a future vignette, we will explain in more detail the structure of the model. +We infer the parameters of the model using Bayesian inference via the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm. ```{r} fit <- latent_truncation_censoring_adjusted_delay( data = obs_cens_trunc_samp, cores = 4, refresh = 0 ) + +summary(fit) ``` # Compare estimates {#compare} @@ -158,4 +167,4 @@ This section should also show the user how to get objects that they might want o draws <- extract_lognormal_draws(fit) draws_long <- draws_to_long(draws) summarise_draws(draws_long, sf = 2) -``` \ No newline at end of file +``` From b1cc428e6ffed48c7c626a211e1cdf49d24c9da5 Mon Sep 17 00:00:00 2001 From: athowes Date: Thu, 16 May 2024 12:08:14 +0100 Subject: [PATCH 07/19] Add epidemic growth figure --- vignettes/epidist.Rmd | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 2f6075ab4..fb1d866d9 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -63,12 +63,21 @@ Data should be in a certain format for use within the `epidist` package. In this section, we simulate data of the appropriate format, and in doing so explain the two main issues with observational delay data. -First, we use the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data from a stochastic compartmental model: +First, we use the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model: ```{r} outbreak <- simulate_gillespie(seed = 101) ``` +(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. + +```{r outbreak, fig.cap="(ref:outbreak)"} +ggplot(outbreak, aes(x = ptime, y = case)) + + geom_point() + + labs(x = "Primary event time (day)", y = "Case number") + + theme_minimal() +``` + `outbreak` is a `data.table` with columns for the case number `case` and the time of primary event `ptime`. Now, to generate secondary events, we will use a lognormal distribution ror the delay between primary and secondary events: @@ -82,8 +91,6 @@ obs <- outbreak |> meanlog = secondary_dist[["meanlog"]], sdlog = secondary_dist[["sdlog"]] ) - -head(obs) ``` `obs` is a `data.table` object with additional columns for the delay `delay` and time of secondary event `stime`. From 50fc64de7c1273908ae981c16e75173c2c8abdaf Mon Sep 17 00:00:00 2001 From: athowes Date: Thu, 16 May 2024 13:22:38 +0100 Subject: [PATCH 08/19] Add plot of lognormal --- vignettes/epidist.Rmd | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index fb1d866d9..907a57df5 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -21,14 +21,11 @@ vignette: > knitr::opts_chunk$set( fig.path = "figures/getting-started-nowcasting-", cache = TRUE, - dpi = 330, collapse = TRUE, comment = "#>", - out.width = "100%", message = FALSE, warning = FALSE, - error = FALSE, - eval = TRUE + error = FALSE ) ``` @@ -79,13 +76,23 @@ ggplot(outbreak, aes(x = ptime, y = case)) + ``` `outbreak` is a `data.table` with columns for the case number `case` and the time of primary event `ptime`. - -Now, to generate secondary events, we will use a lognormal distribution ror the delay between primary and secondary events: +Now, to generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events: ```{r} secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |> add_natural_scale_mean_sd() +``` +(ref:lognormal) The lognormal distribution is skewed to the right. Long delay times still have some probability. + +```{r lognormal, fig.cap="(ref:lognormal)"} +ggplot(data.frame(x = c(0, 8)), aes(x = x)) + + geom_function(fun = dlnorm, meanlog = 1.8, sdlog = 0.5) + + theme_minimal() + + labs(x = "Delay between primary and secondary event (days)", y = "Probability density") +``` + +```{r} obs <- outbreak |> simulate_secondary( meanlog = secondary_dist[["meanlog"]], @@ -124,7 +131,7 @@ ggplot(obs_cens, aes(x = delay, y = delay_daily)) + ``` Next, during an outbreak we will usually be estimating delays in real time. -The result is that only those cases with a secondary event occuring before some time will be observed. +The result is that only those cases with a secondary event occurring before some time will be observed. This is called (right) truncation, and biases the observation process towards shorter delays: ```{r} From cbe5903ca9a524860c29ac667f52c602065e9e6e Mon Sep 17 00:00:00 2001 From: athowes Date: Thu, 16 May 2024 13:47:30 +0100 Subject: [PATCH 09/19] Add delays figure (and move to every 50th case) --- vignettes/epidist.Rmd | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 907a57df5..c8a253052 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -66,16 +66,18 @@ First, we use the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_ outbreak <- simulate_gillespie(seed = 101) ``` -(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. +(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (To make this figure easier to read, only every 50th case is shown.) ```{r outbreak, fig.cap="(ref:outbreak)"} -ggplot(outbreak, aes(x = ptime, y = case)) + +outbreak[case %% 50 == 0, ] |> + ggplot(aes(x = ptime, y = case)) + geom_point() + labs(x = "Primary event time (day)", y = "Case number") + theme_minimal() ``` `outbreak` is a `data.table` with columns for the case number `case` and the time of primary event `ptime`. + Now, to generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events: ```{r} @@ -87,7 +89,7 @@ secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |> ```{r lognormal, fig.cap="(ref:lognormal)"} ggplot(data.frame(x = c(0, 8)), aes(x = x)) + - geom_function(fun = dlnorm, meanlog = 1.8, sdlog = 0.5) + + geom_function(fun = dlnorm, meanlog = secondary_dist[["meanlog"]], sdlog = secondary_dist[["sdlog"]]) + theme_minimal() + labs(x = "Delay between primary and secondary event (days)", y = "Probability density") ``` @@ -100,6 +102,18 @@ obs <- outbreak |> ) ``` +(ref:delay) Delay (To make this figure easier to read, only every 50th case is shown.) + +```{r delay, fig.cap="(ref:delay)"} +obs[case %% 50 == 0, ] |> + ggplot(aes(y = case)) + + geom_point(aes(x = ptime)) + + geom_point(aes(x = stime)) + + geom_segment(aes(x = ptime, xend = stime, y = case, yend = case)) + + labs(x = "Event time (day)", y = "Case number") + + theme_minimal() +``` + `obs` is a `data.table` object with additional columns for the delay `delay` and time of secondary event `stime`. The secondary event time is simply the primary event time plus the delay: From d58459b35f84fd56b5e895531c0b4e3b01510171 Mon Sep 17 00:00:00 2001 From: athowes Date: Fri, 17 May 2024 11:59:08 +0100 Subject: [PATCH 10/19] Improve figures here --- vignettes/epidist.Rmd | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index c8a253052..ec61ce406 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -71,7 +71,7 @@ outbreak <- simulate_gillespie(seed = 101) ```{r outbreak, fig.cap="(ref:outbreak)"} outbreak[case %% 50 == 0, ] |> ggplot(aes(x = ptime, y = case)) + - geom_point() + + geom_point(col = "#56B4E9") + labs(x = "Primary event time (day)", y = "Case number") + theme_minimal() ``` @@ -102,14 +102,14 @@ obs <- outbreak |> ) ``` -(ref:delay) Delay (To make this figure easier to read, only every 50th case is shown.) +(ref:delay) Secondary events (in green) occur at random (As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown.) ```{r delay, fig.cap="(ref:delay)"} obs[case %% 50 == 0, ] |> ggplot(aes(y = case)) + - geom_point(aes(x = ptime)) + - geom_point(aes(x = stime)) + - geom_segment(aes(x = ptime, xend = stime, y = case, yend = case)) + + geom_segment(aes(x = ptime, xend = stime, y = case, yend = case), col = "grey") + + geom_point(aes(x = ptime), col = "#56B4E9") + + geom_point(aes(x = stime), col = "#009E73") + labs(x = "Event time (day)", y = "Case number") + theme_minimal() ``` @@ -137,8 +137,8 @@ obs_cens <- obs |> observe_process() ```{r cens, fig.cap="(ref:cens)"} ggplot(obs_cens, aes(x = delay, y = delay_daily)) + - geom_point(alpha = 0.25) + - geom_abline(slope = 1, intercept = 0, linetype = "dashed") + + geom_point(col = "#E69F00") + + geom_abline(slope = 1, intercept = 0, linetype = "dashed", col = "grey") + theme_minimal() + coord_fixed() + labs(x = "Exact delay time (days)", y = "Censored delay time (days)") @@ -147,6 +147,7 @@ ggplot(obs_cens, aes(x = delay, y = delay_daily)) + Next, during an outbreak we will usually be estimating delays in real time. The result is that only those cases with a secondary event occurring before some time will be observed. This is called (right) truncation, and biases the observation process towards shorter delays: + ```{r} obs_time <- 25 @@ -170,12 +171,14 @@ With our censored, truncated, and sampled data, we are now ready to attempt to r # Fit the model {#fit} -It would be simple to estimate the delay distribution if we had access to `obs`. -However, with only access to `obs_cens_trunc_samp` we must use a more sophisticated statistical model to ensure that our estimates are not biased. +If we had access to `obs`, then it would be simple to estimate the delay distribution +However, with only access to `obs_cens_trunc_samp`, we must use a more sophisticated statistical model to make sure that our estimates are not severely biased! -The primary modelling function in `epidist` to do this is `latent_truncation_censoring_adjusted_delay`. +The primary modelling function in `epidist` is `latent_truncation_censoring_adjusted_delay`. In a future vignette, we will explain in more detail the structure of the model. -We infer the parameters of the model using Bayesian inference via the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm. + +The parameters of the model are inferred using Bayesian inference. +In particular, we use the the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the `brms` R package. ```{r} fit <- latent_truncation_censoring_adjusted_delay( From 033f36ff67894b45e2518632e3a44388418a7903 Mon Sep 17 00:00:00 2001 From: athowes Date: Fri, 17 May 2024 12:25:54 +0100 Subject: [PATCH 11/19] Fix lint issues --- vignettes/epidist.Rmd | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index ec61ce406..c7bee7c33 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -89,9 +89,16 @@ secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |> ```{r lognormal, fig.cap="(ref:lognormal)"} ggplot(data.frame(x = c(0, 8)), aes(x = x)) + - geom_function(fun = dlnorm, meanlog = secondary_dist[["meanlog"]], sdlog = secondary_dist[["sdlog"]]) + + geom_function( + fun = dlnorm, + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ) + theme_minimal() + - labs(x = "Delay between primary and secondary event (days)", y = "Probability density") + labs( + x = "Delay between primary and secondary event (days)", + y = "Probability density" + ) ``` ```{r} @@ -107,7 +114,10 @@ obs <- outbreak |> ```{r delay, fig.cap="(ref:delay)"} obs[case %% 50 == 0, ] |> ggplot(aes(y = case)) + - geom_segment(aes(x = ptime, xend = stime, y = case, yend = case), col = "grey") + + geom_segment( + aes(x = ptime, xend = stime, y = case, yend = case), + col = "grey" + ) + geom_point(aes(x = ptime), col = "#56B4E9") + geom_point(aes(x = stime), col = "#009E73") + labs(x = "Event time (day)", y = "Case number") + @@ -164,7 +174,8 @@ sample_size <- 200 This sample size corresponds to `r 100 * round(sample_size / nrow(obs_cens_trunc), 3)`% of the data. ```{r} -obs_cens_trunc_samp <- obs_cens_trunc[sample(1:.N, sample_size, replace = FALSE)] +obs_cens_trunc_samp <- + obs_cens_trunc[sample(seq_len(.N), sample_size, replace = FALSE)] ``` With our censored, truncated, and sampled data, we are now ready to attempt to recover the underlying delay distribution using `epidist`. From bb58792a54a02a0432112248115f5084e16116cd Mon Sep 17 00:00:00 2001 From: athowes Date: Mon, 20 May 2024 16:15:33 +0100 Subject: [PATCH 12/19] Change from html_vignette2 to html_document2 --- vignettes/epidist.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index c7bee7c33..14800a603 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -2,7 +2,7 @@ title: "Getting started with epidist" description: "A quick start guide to using the epidist R package" output: - bookdown::html_vignette2: + bookdown::html_document2: fig_caption: yes code_folding: show number_sections: true From d7e331ee25e7a4fbcb067045c1311d0b22d8652c Mon Sep 17 00:00:00 2001 From: athowes Date: Mon, 20 May 2024 16:52:53 +0100 Subject: [PATCH 13/19] Change to fig.path --- vignettes/epidist.Rmd | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 14800a603..c00107c63 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -19,7 +19,7 @@ vignette: > ```{r setup, include=FALSE} # exclude compile warnings from cmdstanr knitr::opts_chunk$set( - fig.path = "figures/getting-started-nowcasting-", + fig.path = "figures/epidist-", cache = TRUE, collapse = TRUE, comment = "#>", @@ -182,13 +182,12 @@ With our censored, truncated, and sampled data, we are now ready to attempt to r # Fit the model {#fit} -If we had access to `obs`, then it would be simple to estimate the delay distribution -However, with only access to `obs_cens_trunc_samp`, we must use a more sophisticated statistical model to make sure that our estimates are not severely biased! +If we had access to `obs`, then it would be simple to estimate the delay distribution. +However, with only access to `obs_cens_trunc_samp`, we must use a statistical model to avoid severely biased estimates! -The primary modelling function in `epidist` is `latent_truncation_censoring_adjusted_delay`. -In a future vignette, we will explain in more detail the structure of the model. - -The parameters of the model are inferred using Bayesian inference. +The main model function in `epidist` is `latent_truncation_censoring_adjusted_delay`. + +The parameters of this model are inferred using Bayesian inference. In particular, we use the the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the `brms` R package. ```{r} From 0c3d7f7371e471eda6f34ec9cf0f64de0bf5236f Mon Sep 17 00:00:00 2001 From: athowes Date: Tue, 21 May 2024 11:10:26 +0100 Subject: [PATCH 14/19] Fixes based on Katie comments --- vignettes/epidist.Rmd | 39 +++++++++++++++++++++++++-------------- vignettes/references.bib | 20 ++++++++++++++++++++ 2 files changed, 45 insertions(+), 14 deletions(-) create mode 100644 vignettes/references.bib diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index c00107c63..1f61f1918 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -8,12 +8,13 @@ output: number_sections: true pkgdown: as_is: true -csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +# csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl link-citations: true vignette: > %\VignetteIndexEntry{Getting started with epidist} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} +bibliography: references.bib --- ```{r setup, include=FALSE} @@ -49,15 +50,22 @@ In our experience, the two main issues are: 1. interval censoring, and 2. right truncation. -Don't worry if you've not come across these terms before! -In Section \@ref(data), we will explain what they mean by simulating example data analogous to what we might observe during an ongoing infectious disease outbreak. +Don't worry if you've not come across these terms before. +In Section \@ref(data), we will explain what they mean by simulating data like that we might observe during an ongoing infectious disease outbreak. Next, in Section \@ref(fit), we show how `epidist` can be used to estimate delays using a statistical model which properly accounts for these two issues. Finally, in Section \@ref(compare), we demonstrate that the fitted delay distribution accurately recovers the underlying truth. +If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best. + # Example data {#data} -Data should be in a certain format for use within the `epidist` package. -In this section, we simulate data of the appropriate format, and in doing so explain the two main issues with observational delay data. +Data should be formatted as a [`data.table`]() with the following columns for use within the `epidist` package: + +* `case`: +* `ptime`: +* `stime`: + +Here we simulate data in this format, and in doing so explain the two main issues with observational delay data. First, we use the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model: @@ -66,7 +74,7 @@ First, we use the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_ outbreak <- simulate_gillespie(seed = 101) ``` -(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (To make this figure easier to read, only every 50th case is shown.) +(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (Only every 50th case is shown to avoid over-plotting.) ```{r outbreak, fig.cap="(ref:outbreak)"} outbreak[case %% 50 == 0, ] |> @@ -76,7 +84,7 @@ outbreak[case %% 50 == 0, ] |> theme_minimal() ``` -`outbreak` is a `data.table` with columns for the case number `case` and the time of primary event `ptime`. +`outbreak` is a [`data.table`]() with the columns `case` and `ptime`. Now, to generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events: @@ -109,7 +117,7 @@ obs <- outbreak |> ) ``` -(ref:delay) Secondary events (in green) occur at random (As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown.) +(ref:delay) Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure \@ref(fig:lognormal)). As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown. ```{r delay, fig.cap="(ref:delay)"} obs[case %% 50 == 0, ] |> @@ -124,19 +132,19 @@ obs[case %% 50 == 0, ] |> theme_minimal() ``` -`obs` is a `data.table` object with additional columns for the delay `delay` and time of secondary event `stime`. +`obs` is now a [`data.table`]() object with further columns for `delay` and `stime`. The secondary event time is simply the primary event time plus the delay: ```{r} all(obs$ptime + obs$delay == obs$stime) ``` -If we received the data `obs` as above then estimating the delay distribution would be easy, and this package wouldn't need to exist. -However, in reality, we will almost never receive the data as above. +If we were to receive the data `obs` as above then estimating the delay distribution would be easy, and the `epidist` package wouldn't need to exist. +However, in reality, during an outbreak we almost never receive the data as above. First, the times of primary and secondary events will usually be censored. This means that rather than exact event times, we observe event times within an interval. -Here we suppose that the interval is daily, resulting in an obscuring of the exact delay data (Figure \@ref(fig:cens)): +Here we suppose that the interval is daily, meaning that only the date of the primary or secondary event, not the exact event time, is reported (Figure \@ref(fig:cens)): ```{r} # observe_process() should be renamed and include a "daily" i.e. 1 argument @@ -165,7 +173,8 @@ obs_time <- 25 obs_cens_trunc <- filter_obs_by_obs_time(obs_cens, obs_time = obs_time) ``` -Finally, in reality, it's not possible to observe every case. We suppose that a random sample of individuals of size `sample_size` are observed: +Finally, in reality, it's not possible to observe every case. +We suppose that a sample of individuals of size `sample_size` are observed: ```{r} sample_size <- 200 @@ -178,7 +187,7 @@ obs_cens_trunc_samp <- obs_cens_trunc[sample(seq_len(.N), sample_size, replace = FALSE)] ``` -With our censored, truncated, and sampled data, we are now ready to attempt to recover the underlying delay distribution using `epidist`. +With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using `epidist`. # Fit the model {#fit} @@ -209,3 +218,5 @@ draws <- extract_lognormal_draws(fit) draws_long <- draws_to_long(draws) summarise_draws(draws_long, sf = 2) ``` + +## Bibliography {-} diff --git a/vignettes/references.bib b/vignettes/references.bib new file mode 100644 index 000000000..983e428d0 --- /dev/null +++ b/vignettes/references.bib @@ -0,0 +1,20 @@ +@article {park2024estimating, + author = {Park, Sang Woo and Akhmetzhanov, Andrei R. and Charniga, Kelly and Cori, Anne and Davies, Nicholas G. and Dushoff, Jonathan and Funk, Sebastian and Gostic, Katie and Grenfell, Bryan and Linton, Natalie M. and Lipsitch, Marc and Lison, Adrian and Overton, Christopher E. and Ward, Thomas and Abbott, Sam}, + title = {Estimating epidemiological delay distributions for infectious diseases}, + elocation-id = {2024.01.12.24301247}, + year = {2024}, + doi = {10.1101/2024.01.12.24301247}, + publisher = {Cold Spring Harbor Laboratory Press}, + URL = {https://www.medrxiv.org/content/early/2024/01/13/2024.01.12.24301247}, + eprint = {https://www.medrxiv.org/content/early/2024/01/13/2024.01.12.24301247.full.pdf}, + journal = {medRxiv} +} + +@article{charniga2024best, + title={Best practices for estimating and reporting epidemiological delay distributions of infectious diseases using public health surveillance and healthcare data}, + author={Kelly Charniga and Sang Woo Park and Andrei R Akhmetzhanov and Anne Cori and Jonathan Dushoff and Sebastian Funk and Katelyn M Gostic and Natalie M Linton and Adrian Lison and Christopher E Overton and Juliet R C Pulliam and Thomas Ward and Simon Cauchemez and Sam Abbott}, + year={2024}, + eprint={2405.08841}, + archivePrefix={arXiv}, + primaryClass={stat.ME} +} \ No newline at end of file From e289d8c4ac44026d032763692bfd3fe06e86784f Mon Sep 17 00:00:00 2001 From: athowes Date: Tue, 21 May 2024 12:29:56 +0100 Subject: [PATCH 15/19] Work on the modelling section writing --- vignettes/epidist.Rmd | 44 ++++++++++++++++++++++++++++++++-------- vignettes/references.bib | 13 ++++++++++++ 2 files changed, 49 insertions(+), 8 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 1f61f1918..0fce61e92 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -59,7 +59,7 @@ If you would like more technical details, the `epidist` package implements model # Example data {#data} -Data should be formatted as a [`data.table`]() with the following columns for use within the `epidist` package: +Data should be formatted as a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the following columns for use within the `epidist` package: * `case`: * `ptime`: @@ -84,7 +84,7 @@ outbreak[case %% 50 == 0, ] |> theme_minimal() ``` -`outbreak` is a [`data.table`]() with the columns `case` and `ptime`. +`outbreak` is a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the columns `case` and `ptime`. Now, to generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events: @@ -132,7 +132,7 @@ obs[case %% 50 == 0, ] |> theme_minimal() ``` -`obs` is now a [`data.table`]() object with further columns for `delay` and `stime`. +`obs` is now a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) object with further columns for `delay` and `stime`. The secondary event time is simply the primary event time plus the delay: ```{r} @@ -191,22 +191,50 @@ With our censored, truncated, and sampled data, we are now ready to try to recov # Fit the model {#fit} -If we had access to `obs`, then it would be simple to estimate the delay distribution. +If we had access to the data `obs`, then it would be simple to estimate the delay distribution (Figure \@ref(fig:obs-est)). + +(ref:obs-est) The histogram of delays matches closely the underlying distribution. Edit: at the moment something is wrong here. + +```{r obs-est, fig.cap="(ref:obs-est)"} +ggplot() + + geom_histogram(data = obs, aes(x = delay, y = stat(count / sum(count)))) + + geom_function( + data = data.frame(x = c(0, 30)), + aes(x = x), + fun = dlnorm, + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]], + ) + + labs(x = "Delay between primary and secondary event (days)", y = "Probability density") + + theme_minimal() +``` + However, with only access to `obs_cens_trunc_samp`, we must use a statistical model to avoid severely biased estimates! -The main model function in `epidist` is `latent_truncation_censoring_adjusted_delay`. +The main `epidist` model function is `latent_truncation_censoring_adjusted_delay`. -The parameters of this model are inferred using Bayesian inference. -In particular, we use the the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the `brms` R package. +The parameters of the model are inferred using Bayesian inference. +In particular, we use the the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the [`brms`](https://paul-buerkner.github.io/brms/) R package [@brms]. ```{r} fit <- latent_truncation_censoring_adjusted_delay( data = obs_cens_trunc_samp, cores = 4, refresh = 0 ) +``` -summary(fit) +The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters (Table \@ref(tab:pars)) in the model. +Many of these are so-called "latent variables". + +```{r pars} +pars <- fit$fit@par_dims |> + map(.f = function(x) if(identical(x, integer(0))) return(1) else return(x)) + +data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |> + gt::gt(caption = "All of the parameters that are included in the model") ``` +Although some users may wish to work with `fit` directly, we will see in the following section that `epidist` provides functions to make common post-processing tasks easy. + # Compare estimates {#compare} Compare modelled estimate to underlying truth here to convince user that it works. diff --git a/vignettes/references.bib b/vignettes/references.bib index 983e428d0..d07f00314 100644 --- a/vignettes/references.bib +++ b/vignettes/references.bib @@ -17,4 +17,17 @@ @article{charniga2024best eprint={2405.08841}, archivePrefix={arXiv}, primaryClass={stat.ME} +} + +@Article{brms, + title = {{brms}: An {R} Package for {Bayesian} Multilevel Models + Using {Stan}}, + author = {Paul-Christian Bürkner}, + journal = {Journal of Statistical Software}, + year = {2017}, + volume = {80}, + number = {1}, + pages = {1--28}, + doi = {10.18637/jss.v080.i01}, + encoding = {UTF-8}, } \ No newline at end of file From c921b312a39bc155b08b4a69085f17ce6e19e0e4 Mon Sep 17 00:00:00 2001 From: athowes Date: Wed, 22 May 2024 16:51:08 +0100 Subject: [PATCH 16/19] Improvements to first vignette based on Sam feedback --- DESCRIPTION | 3 ++- vignettes/epidist.Rmd | 38 ++++++++++++++++++++++++++++---------- 2 files changed, 30 insertions(+), 11 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 1bea3aae6..15f2b8d36 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -34,7 +34,8 @@ Suggests: epinowcast, testthat (>= 3.0.0), readxl, - janitor + janitor, + gt Remotes: stan-dev/cmdstanr, Rdatatable/data.table, diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 0fce61e92..269ac5258 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -151,7 +151,7 @@ Here we suppose that the interval is daily, meaning that only the date of the pr obs_cens <- obs |> observe_process() ``` -(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. +(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals. ```{r cens, fig.cap="(ref:cens)"} ggplot(obs_cens, aes(x = delay, y = delay_daily)) + @@ -191,13 +191,26 @@ With our censored, truncated, and sampled data, we are now ready to try to recov # Fit the model {#fit} -If we had access to the data `obs`, then it would be simple to estimate the delay distribution (Figure \@ref(fig:obs-est)). +If we had access to the data `obs`, then it would be simple to estimate the delay distribution. +However, with only access to `obs_cens_trunc_samp`, the delay distribution we observe is biased (Figure \@ref(fig:obs-est)) and we must use a statistical model. -(ref:obs-est) The histogram of delays matches closely the underlying distribution. Edit: at the moment something is wrong here. +(ref:obs-est) The histogram of delays from `obs` matches closely the underlying distribution, whereas those from `obs_cens_trunc_samp` are systematically biased. ```{r obs-est, fig.cap="(ref:obs-est)"} -ggplot() + - geom_histogram(data = obs, aes(x = delay, y = stat(count / sum(count)))) + +dplyr::bind_rows( + obs |> + dplyr::mutate(type = "Full data") |> + dplyr::select(delay, type), + obs_cens_trunc |> + dplyr::mutate(type = "Censored, truncated,\nsampled data") |> + dplyr::select(delay, type) +) |> + ggplot() + + geom_histogram( + aes(x = delay, y = ..density.., fill = type, group = type), + position = "dodge" + ) + + scale_fill_manual(values = c("#0072B2", "#D55E00")) + geom_function( data = data.frame(x = c(0, 30)), aes(x = x), @@ -205,12 +218,15 @@ ggplot() + meanlog = secondary_dist[["meanlog"]], sdlog = secondary_dist[["sdlog"]], ) + - labs(x = "Delay between primary and secondary event (days)", y = "Probability density") + - theme_minimal() + labs( + x = "Delay between primary and secondary event (days)", + y = "Probability density", + type = "" + ) + + theme_minimal() + + theme(legend.position = "bottom") ``` -However, with only access to `obs_cens_trunc_samp`, we must use a statistical model to avoid severely biased estimates! - The main `epidist` model function is `latent_truncation_censoring_adjusted_delay`. The parameters of the model are inferred using Bayesian inference. @@ -233,7 +249,9 @@ data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |> gt::gt(caption = "All of the parameters that are included in the model") ``` -Although some users may wish to work with `fit` directly, we will see in the following section that `epidist` provides functions to make common post-processing tasks easy. +In the following section, we show that `epidist` provides functions to make common post-processing tasks easy. +Additionally, for users familiar with Stan and `brms`, they may like to work with `fit` directly. +For example, any tool that supports `brms` fitted model objects will be compatible with `fit`. # Compare estimates {#compare} From edee9759942928bb75c153d7464c1c27da611142 Mon Sep 17 00:00:00 2001 From: athowes Date: Thu, 23 May 2024 11:56:46 +0100 Subject: [PATCH 17/19] Fix args problem, first go at results in at the end --- vignettes/epidist.Rmd | 71 +++++++++++++++++++++++++++++++------------ 1 file changed, 51 insertions(+), 20 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 269ac5258..216cf75c1 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -96,11 +96,13 @@ secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |> (ref:lognormal) The lognormal distribution is skewed to the right. Long delay times still have some probability. ```{r lognormal, fig.cap="(ref:lognormal)"} -ggplot(data.frame(x = c(0, 8)), aes(x = x)) + +ggplot(data.frame(x = c(0, 30)), aes(x = x)) + geom_function( fun = dlnorm, - meanlog = secondary_dist[["meanlog"]], - sdlog = secondary_dist[["sdlog"]] + args = list( + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ) ) + theme_minimal() + labs( @@ -189,7 +191,7 @@ obs_cens_trunc_samp <- With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using `epidist`. -# Fit the model {#fit} +# Fit the model and compare estimates {#fit} If we had access to the data `obs`, then it would be simple to estimate the delay distribution. However, with only access to `obs_cens_trunc_samp`, the delay distribution we observe is biased (Figure \@ref(fig:obs-est)) and we must use a statistical model. @@ -215,13 +217,15 @@ dplyr::bind_rows( data = data.frame(x = c(0, 30)), aes(x = x), fun = dlnorm, - meanlog = secondary_dist[["meanlog"]], - sdlog = secondary_dist[["sdlog"]], + args = list( + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ), ) + labs( x = "Delay between primary and secondary event (days)", y = "Probability density", - type = "" + fill = "" ) + theme_minimal() + theme(legend.position = "bottom") @@ -239,30 +243,57 @@ fit <- latent_truncation_censoring_adjusted_delay( ``` The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters (Table \@ref(tab:pars)) in the model. -Many of these are so-called "latent variables". +Users familiar with Stan and `brms`, can work with `fit` directly. +Any tool that supports `brms` fitted model objects will be compatible with `fit`. ```{r pars} pars <- fit$fit@par_dims |> map(.f = function(x) if(identical(x, integer(0))) return(1) else return(x)) data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |> - gt::gt(caption = "All of the parameters that are included in the model") + gt::gt(caption = + "All of the parameters that are included in the model. + Many of these parameters are the so called latent variables in the model." + ) ``` -In the following section, we show that `epidist` provides functions to make common post-processing tasks easy. -Additionally, for users familiar with Stan and `brms`, they may like to work with `fit` directly. -For example, any tool that supports `brms` fitted model objects will be compatible with `fit`. - -# Compare estimates {#compare} - -Compare modelled estimate to underlying truth here to convince user that it works. -Perhaps as an exercise, we could show the "really simple" estimate being wrong here otherwise it won't be so impressive that the model is right. -This section should also show the user how to get objects that they might want out of the fitted object. +The `epidist` package also provides functions to make common post-processing tasks easy. +For example, samples of the fitted lognormal `meanlog` and `sdlog` parameter can be extracted using: ```{r} draws <- extract_lognormal_draws(fit) -draws_long <- draws_to_long(draws) -summarise_draws(draws_long, sf = 2) +``` + +Figure \@ref(fig:fitted-lognormal) shows... + +(ref:fitted-lognormal) Figure caption. + +```{r fitted-lognormal, fig.cap="(ref:fitted-lognormal)"} +ggplot() + + geom_function( + data = data.frame(x = c(0, 30)), + aes(x = x), + fun = dlnorm, + args = list( + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ), + ) + + geom_function( + data = data.frame(x = c(0, 30)), + aes(x = x), + fun = dlnorm, + args = list( + meanlog = mean(draws$meanlog), + sdlog = mean(draws$sdlog) + ), + col = "#D55E00" + ) + + labs( + x = "Delay between primary and secondary event (days)", + y = "Probability density" + ) + + theme_minimal() ``` ## Bibliography {-} From 0e47b3f5087d83d8dced9baed6bfbcf159c14200 Mon Sep 17 00:00:00 2001 From: athowes Date: Thu, 23 May 2024 16:42:32 +0100 Subject: [PATCH 18/19] Resolve some remaining issues --- vignettes/epidist.Rmd | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 216cf75c1..b2aa2e166 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -57,13 +57,16 @@ Finally, in Section \@ref(compare), we demonstrate that the fitted delay distrib If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best. +Finally, to run this vignette yourself, you will need the `data.table`, `purrr` and `ggplot2` packages installed. +Note that to work with outputs from `epidist` you do not need to use `data.table`: any tool of your preference is suitable. + # Example data {#data} -Data should be formatted as a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the following columns for use within the `epidist` package: +Data should be formatted as a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the following columns for use within `epidist`: -* `case`: -* `ptime`: -* `stime`: +* `case`: The unique case ID. +* `ptime`: The time of the primary event. +* `stime`: The time of the secondary event. Here we simulate data in this format, and in doing so explain the two main issues with observational delay data. @@ -189,6 +192,11 @@ obs_cens_trunc_samp <- obs_cens_trunc[sample(seq_len(.N), sample_size, replace = FALSE)] ``` +Another issue, which `epidist` currently does not account for, is that sometimes only the secondary event might be observed, and not the primary event. +For example, symptom onset may be reported, but start of infection unknown. +Discarding events of this type leads to what are called ascertainment biases. +Whereas each case is equally likely to appear in the sample above, under ascertainment bias some cases are more likely to appear in the data than others. + With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using `epidist`. # Fit the model and compare estimates {#fit} From 858eeb21164563c7b704cbbd8ce85ff8a0d66b57 Mon Sep 17 00:00:00 2001 From: athowes Date: Thu, 23 May 2024 18:01:50 +0100 Subject: [PATCH 19/19] Fix lint issues --- vignettes/epidist.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index b2aa2e166..3ca84ce1b 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -256,12 +256,12 @@ Any tool that supports `brms` fitted model objects will be compatible with `fit` ```{r pars} pars <- fit$fit@par_dims |> - map(.f = function(x) if(identical(x, integer(0))) return(1) else return(x)) + map(.f = function(x) if (identical(x, integer(0))) return(1) else return(x)) data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |> gt::gt(caption = - "All of the parameters that are included in the model. - Many of these parameters are the so called latent variables in the model." + "All of the parameters that are included in the model. + Many of these parameters are the so called latent variables in the model." ) ```