diff --git a/DESCRIPTION b/DESCRIPTION index 1bea3aae6..15f2b8d36 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -34,7 +34,8 @@ Suggests: epinowcast, testthat (>= 3.0.0), readxl, - janitor + janitor, + gt Remotes: stan-dev/cmdstanr, Rdatatable/data.table, diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd index 83a50fb22..3ca84ce1b 100644 --- a/vignettes/epidist.Rmd +++ b/vignettes/epidist.Rmd @@ -1,40 +1,35 @@ --- -title: "Getting Started with Epidist" -description: "A quick start guide to using the Epidist package" -author: Epidist Team +title: "Getting started with epidist" +description: "A quick start guide to using the epidist R package" output: - bookdown::html_vignette2: + bookdown::html_document2: fig_caption: yes code_folding: show + number_sections: true pkgdown: as_is: true -csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl +# csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl link-citations: true vignette: > - %\VignetteIndexEntry{Getting Started with Epidist} + %\VignetteIndexEntry{Getting started with epidist} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} +bibliography: references.bib --- ```{r setup, include=FALSE} # exclude compile warnings from cmdstanr knitr::opts_chunk$set( - fig.path = "figures/getting-started-nowcasting-", - cache = TRUE, dpi = 330, - collapse = TRUE, comment = "#>", out.width = "100%", - message = FALSE, warning = FALSE, error = FALSE, - eval = FALSE + fig.path = "figures/epidist-", + cache = TRUE, + collapse = TRUE, + comment = "#>", + message = FALSE, + warning = FALSE, + error = FALSE ) ``` -# Quick start - -In this quick start, we demonstrate using `epidist` to ... - -# Package - -In this quick start, we also use `data.table`, `purrr`, and `ggplot2` packages. These are both installed as dependencies when `epidist` is installed. Note that all output from `epidist` is readily useable with other tools, including `tidyverse` packages (see [here](https://mgimond.github.io/rug_2019_12/Index.html) for a comparison). - ```{r load-requirements} library(epidist) library(data.table) @@ -42,160 +37,271 @@ library(purrr) library(ggplot2) ``` -### Simulate the data +Many quantities in epidemiology can be thought of as the time between two events, or "delays". +Important examples include: + +* the incubation period (time between infection and symptom onset), +* serial interval (time between symptom onset of infectee and symptom onset of infected), and +* generation interval (time between infection of infectee and infection of infected). + +Unfortunately, estimating delays accurately from observational data is usually difficult. +In our experience, the two main issues are: + +1. interval censoring, and +2. right truncation. + +Don't worry if you've not come across these terms before. +In Section \@ref(data), we will explain what they mean by simulating data like that we might observe during an ongoing infectious disease outbreak. +Next, in Section \@ref(fit), we show how `epidist` can be used to estimate delays using a statistical model which properly accounts for these two issues. +Finally, in Section \@ref(compare), we demonstrate that the fitted delay distribution accurately recovers the underlying truth. -Simulate data from an outbreak. +If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best. -```{r simulate-outbreak} +Finally, to run this vignette yourself, you will need the `data.table`, `purrr` and `ggplot2` packages installed. +Note that to work with outputs from `epidist` you do not need to use `data.table`: any tool of your preference is suitable. + +# Example data {#data} + +Data should be formatted as a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the following columns for use within `epidist`: + +* `case`: The unique case ID. +* `ptime`: The time of the primary event. +* `stime`: The time of the secondary event. + +Here we simulate data in this format, and in doing so explain the two main issues with observational delay data. + + +First, we use the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model: + +```{r} outbreak <- simulate_gillespie(seed = 101) ``` -Define the secondary distribution to use in the simulation +(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (Only every 50th case is shown to avoid over-plotting.) -```{r secondary-dist} -secondary_dist <- data.table( - meanlog = 1.8, sdlog = 0.5 -) |> +```{r outbreak, fig.cap="(ref:outbreak)"} +outbreak[case %% 50 == 0, ] |> + ggplot(aes(x = ptime, y = case)) + + geom_point(col = "#56B4E9") + + labs(x = "Primary event time (day)", y = "Case number") + + theme_minimal() +``` + +`outbreak` is a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the columns `case` and `ptime`. + +Now, to generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events: + +```{r} +secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |> add_natural_scale_mean_sd() ``` -Simulate an observation process during the growth phase for a secondary event using a lognormal distribution, and finally simulate observing this event. +(ref:lognormal) The lognormal distribution is skewed to the right. Long delay times still have some probability. -```{r simulate-data} +```{r lognormal, fig.cap="(ref:lognormal)"} +ggplot(data.frame(x = c(0, 30)), aes(x = x)) + + geom_function( + fun = dlnorm, + args = list( + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ) + ) + + theme_minimal() + + labs( + x = "Delay between primary and secondary event (days)", + y = "Probability density" + ) +``` + +```{r} obs <- outbreak |> simulate_secondary( - meanlog = secondary_dist$meanlog[[1]], - sdlog = secondary_dist$sdlog[[1]] - ) |> - observe_process() + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ) ``` -Observe the outbreak after 25 days and take 100 samples. +(ref:delay) Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure \@ref(fig:lognormal)). As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown. -```{r observe-data} -truncated_obs <- obs |> - filter_obs_by_obs_time(obs_time = 25) -truncated_obs <- truncated_obs[sample(1:.N, 200, replace = FALSE)] +```{r delay, fig.cap="(ref:delay)"} +obs[case %% 50 == 0, ] |> + ggplot(aes(y = case)) + + geom_segment( + aes(x = ptime, xend = stime, y = case, yend = case), + col = "grey" + ) + + geom_point(aes(x = ptime), col = "#56B4E9") + + geom_point(aes(x = stime), col = "#009E73") + + labs(x = "Event time (day)", y = "Case number") + + theme_minimal() ``` -Plot primary cases (columns), and secondary cases (dots) by the observation time of their secondary events. This reflects would could be observed in real-time. - -```{r observed-cases} -truncated_cases <- construct_cases_by_obs_window( - obs, windows = c(25), obs_type = "stime" -) +`obs` is now a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) object with further columns for `delay` and `stime`. +The secondary event time is simply the primary event time plus the delay: -plot_cases_by_obs_window(truncated_cases) +```{r} +all(obs$ptime + obs$delay == obs$stime) ``` -We can alternatively plot the observed data based on primary events. This corresponds to a retrospective cohort based view of the data. +If we were to receive the data `obs` as above then estimating the delay distribution would be easy, and the `epidist` package wouldn't need to exist. +However, in reality, during an outbreak we almost never receive the data as above. -```{r cohort-observed-cases} -truncated_cases <- construct_cases_by_obs_window( - obs, windows = c(25), obs_type = "ptime" -) +First, the times of primary and secondary events will usually be censored. +This means that rather than exact event times, we observe event times within an interval. +Here we suppose that the interval is daily, meaning that only the date of the primary or secondary event, not the exact event time, is reported (Figure \@ref(fig:cens)): -plot_cases_by_obs_window(truncated_cases) +```{r} +# observe_process() should be renamed and include a "daily" i.e. 1 argument +obs_cens <- obs |> observe_process() ``` -Plot the true continuous delay distribution and the empirical observed distribution for each observation window. - -```{r empirical-dist} -combined_obs <- combine_obs(truncated_obs, obs) +(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals. -plot_empirical_delay( - combined_obs, meanlog = secondary_dist$meanlog[[1]], - sdlog = secondary_dist$sdlog[[1]] -) +```{r cens, fig.cap="(ref:cens)"} +ggplot(obs_cens, aes(x = delay, y = delay_daily)) + + geom_point(col = "#E69F00") + + geom_abline(slope = 1, intercept = 0, linetype = "dashed", col = "grey") + + theme_minimal() + + coord_fixed() + + labs(x = "Exact delay time (days)", y = "Censored delay time (days)") ``` -Plot the mean difference between continuous and discrete event time: +Next, during an outbreak we will usually be estimating delays in real time. +The result is that only those cases with a secondary event occurring before some time will be observed. +This is called (right) truncation, and biases the observation process towards shorter delays: + -```{r censor_delay} -censor_delay <- calculate_censor_delay(truncated_obs) -plot_censor_delay(censor_delay) +```{r} +obs_time <- 25 +# filter_obs_by_obs_time() should be renamed to refer to stime +obs_cens_trunc <- filter_obs_by_obs_time(obs_cens, obs_time = obs_time) ``` -### Model +Finally, in reality, it's not possible to observe every case. +We suppose that a sample of individuals of size `sample_size` are observed: -Adjust for right truncation and date censoring using a latent variable approach. +```{r} +sample_size <- 200 +``` + +This sample size corresponds to `r 100 * round(sample_size / nrow(obs_cens_trunc), 3)`% of the data. ```{r} -latent_truncation_censoring_fit <- latent_truncation_censoring_adjusted_delay( - data = truncated_obs, cores = 4, refresh = 0 -) +obs_cens_trunc_samp <- + obs_cens_trunc[sample(seq_len(.N), sample_size, replace = FALSE)] ``` -### Summarise model posterior and compare to known truth +Another issue, which `epidist` currently does not account for, is that sometimes only the secondary event might be observed, and not the primary event. +For example, symptom onset may be reported, but start of infection unknown. +Discarding events of this type leads to what are called ascertainment biases. +Whereas each case is equally likely to appear in the sample above, under ascertainment bias some cases are more likely to appear in the data than others. + +With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using `epidist`. + +# Fit the model and compare estimates {#fit} + +If we had access to the data `obs`, then it would be simple to estimate the delay distribution. +However, with only access to `obs_cens_trunc_samp`, the delay distribution we observe is biased (Figure \@ref(fig:obs-est)) and we must use a statistical model. + +(ref:obs-est) The histogram of delays from `obs` matches closely the underlying distribution, whereas those from `obs_cens_trunc_samp` are systematically biased. + +```{r obs-est, fig.cap="(ref:obs-est)"} +dplyr::bind_rows( + obs |> + dplyr::mutate(type = "Full data") |> + dplyr::select(delay, type), + obs_cens_trunc |> + dplyr::mutate(type = "Censored, truncated,\nsampled data") |> + dplyr::select(delay, type) +) |> + ggplot() + + geom_histogram( + aes(x = delay, y = ..density.., fill = type, group = type), + position = "dodge" + ) + + scale_fill_manual(values = c("#0072B2", "#D55E00")) + + geom_function( + data = data.frame(x = c(0, 30)), + aes(x = x), + fun = dlnorm, + args = list( + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ), + ) + + labs( + x = "Delay between primary and secondary event (days)", + y = "Probability density", + fill = "" + ) + + theme_minimal() + + theme(legend.position = "bottom") +``` -Combine models into a named list. +The main `epidist` model function is `latent_truncation_censoring_adjusted_delay`. + +The parameters of the model are inferred using Bayesian inference. +In particular, we use the the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the [`brms`](https://paul-buerkner.github.io/brms/) R package [@brms]. ```{r} -models <- list( - "Latent variable truncation and censoring adjusted" = latent_truncation_censoring_fit +fit <- latent_truncation_censoring_adjusted_delay( + data = obs_cens_trunc_samp, cores = 4, refresh = 0 ) ``` -Extract and summarise lognormal posterior estimates. +The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters (Table \@ref(tab:pars)) in the model. +Users familiar with Stan and `brms`, can work with `fit` directly. +Any tool that supports `brms` fitted model objects will be compatible with `fit`. -```{r lognormal-draws} -draws <- models |> - map(extract_lognormal_draws) |> - rbindlist(idcol = "model") |> - rbind(epinowcast_draws, use.names = TRUE) +```{r pars} +pars <- fit$fit@par_dims |> + map(.f = function(x) if (identical(x, integer(0))) return(1) else return(x)) -draws <- draws[, - model := factor( - model, levels = c("Joint incidence and forward delay", rev(names(models))) - )] +data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |> + gt::gt(caption = + "All of the parameters that are included in the model. + Many of these parameters are the so called latent variables in the model." + ) +``` -summarised_draws <- draws |> - draws_to_long() |> - summarise_draws(sf = 3) +The `epidist` package also provides functions to make common post-processing tasks easy. +For example, samples of the fitted lognormal `meanlog` and `sdlog` parameter can be extracted using: -knitr::kable(summarised_draws[parameter %in% c("meanlog", "sdlog")]) +```{r} +draws <- extract_lognormal_draws(fit) ``` -Plot summarised posterior estimates from each model compared to the ground truth. +Figure \@ref(fig:fitted-lognormal) shows... -```{r, fig.width = 9, fig.height = 4} -draws |> - draws_to_long() |> - make_relative_to_truth(draws_to_long(secondary_dist)) |> - plot_relative_recovery(y = model, fill = model) + - facet_wrap(vars(parameter), nrow = 1, scales = "free_x") + - scale_fill_brewer(palette = "Dark2") + - guides(fill = guide_none()) + - labs( - y = "Model", x = "Relative to ground truth" - ) -``` +(ref:fitted-lognormal) Figure caption. -Finally, check the mean posterior predictions for each model against the observed daily cohort mean. -```{r postcheck} -truncated_draws <- draws |> - calculate_truncated_means( - obs_at = max(truncated_obs$stime_daily), - ptime = range(truncated_obs$ptime_daily) - ) |> - summarise_variable(variable = "trunc_mean", by = c("obs_horizon", "model")) - -truncated_draws[, model := factor( - model, levels = c("Joint incidence and forward delay", rev(names(models))) - )] - -truncated_draws |> - plot_mean_posterior_pred( - truncated_obs |> - calculate_cohort_mean( - type = "cohort", obs_at = max(truncated_obs$stime_daily) - ), - col = model, fill = model, mean = TRUE, ribbon = TRUE +```{r fitted-lognormal, fig.cap="(ref:fitted-lognormal)"} +ggplot() + + geom_function( + data = data.frame(x = c(0, 30)), + aes(x = x), + fun = dlnorm, + args = list( + meanlog = secondary_dist[["meanlog"]], + sdlog = secondary_dist[["sdlog"]] + ), ) + - guides( - fill = guide_legend(title = "Model", nrow = 4), - col = guide_legend(title = "Model", nrow = 4) + geom_function( + data = data.frame(x = c(0, 30)), + aes(x = x), + fun = dlnorm, + args = list( + meanlog = mean(draws$meanlog), + sdlog = mean(draws$sdlog) + ), + col = "#D55E00" ) + - scale_fill_brewer(palette = "Dark2", aesthetics = c("fill", "colour")) + - theme(legend.direction = "vertical") + labs( + x = "Delay between primary and secondary event (days)", + y = "Probability density" + ) + + theme_minimal() ``` + +## Bibliography {-} diff --git a/vignettes/references.bib b/vignettes/references.bib new file mode 100644 index 000000000..d07f00314 --- /dev/null +++ b/vignettes/references.bib @@ -0,0 +1,33 @@ +@article {park2024estimating, + author = {Park, Sang Woo and Akhmetzhanov, Andrei R. and Charniga, Kelly and Cori, Anne and Davies, Nicholas G. and Dushoff, Jonathan and Funk, Sebastian and Gostic, Katie and Grenfell, Bryan and Linton, Natalie M. and Lipsitch, Marc and Lison, Adrian and Overton, Christopher E. and Ward, Thomas and Abbott, Sam}, + title = {Estimating epidemiological delay distributions for infectious diseases}, + elocation-id = {2024.01.12.24301247}, + year = {2024}, + doi = {10.1101/2024.01.12.24301247}, + publisher = {Cold Spring Harbor Laboratory Press}, + URL = {https://www.medrxiv.org/content/early/2024/01/13/2024.01.12.24301247}, + eprint = {https://www.medrxiv.org/content/early/2024/01/13/2024.01.12.24301247.full.pdf}, + journal = {medRxiv} +} + +@article{charniga2024best, + title={Best practices for estimating and reporting epidemiological delay distributions of infectious diseases using public health surveillance and healthcare data}, + author={Kelly Charniga and Sang Woo Park and Andrei R Akhmetzhanov and Anne Cori and Jonathan Dushoff and Sebastian Funk and Katelyn M Gostic and Natalie M Linton and Adrian Lison and Christopher E Overton and Juliet R C Pulliam and Thomas Ward and Simon Cauchemez and Sam Abbott}, + year={2024}, + eprint={2405.08841}, + archivePrefix={arXiv}, + primaryClass={stat.ME} +} + +@Article{brms, + title = {{brms}: An {R} Package for {Bayesian} Multilevel Models + Using {Stan}}, + author = {Paul-Christian Bürkner}, + journal = {Journal of Statistical Software}, + year = {2017}, + volume = {80}, + number = {1}, + pages = {1--28}, + doi = {10.18637/jss.v080.i01}, + encoding = {UTF-8}, +} \ No newline at end of file