diff --git a/vignettes/approx-inference.Rmd b/vignettes/approx-inference.Rmd
index 14a816e4e..819992611 100644
--- a/vignettes/approx-inference.Rmd
+++ b/vignettes/approx-inference.Rmd
@@ -102,7 +102,9 @@ library(tidybayes)
 library(cmdstanr) # nolint
 ```
 
-To access the approximate inference methods used in this vignette we will need to use the `cmdstanr` backend for `brms` (we generally recommend using this backend for fitting models). To do this, we first need to install CmdStan (see the README for more details). We can check we have everything we need as follows:
+To access the approximate inference methods used in this vignette we will need to use the `cmdstanr` backend for `brms` (we generally recommend using this backend for fitting models).
+To do this, we first need to install CmdStan (see the README for more details).
+We can check we have everything we need as follows:
 
 ```{r}
 cmdstanr::cmdstan_version()
@@ -262,7 +264,7 @@ pmap_df(
 
 ## Comparison of time taken
 
-In this example, HMC took much longer than the other methods, with Pathfinder being the fastest method.
+In this example, HMC took much longer to run than the other methods, and Pathfinder was the fastest.
 That said, even for HMC the computation time in this case is unlikely to be prohibitive.
 
 ```{r}
diff --git a/vignettes/ebola.Rmd b/vignettes/ebola.Rmd
index 536ebd4f5..0e3592597 100644
--- a/vignettes/ebola.Rmd
+++ b/vignettes/ebola.Rmd
@@ -163,7 +163,7 @@ That is, $\mu$ and $\sigma$ such that when $x \sim \mathcal{N}(\mu, \sigma)$ the
 
 ## Data preparation
 
-To prepare the data, we begin by filtering for the relevant columns and converting the date columns to `Date` objects:
+To prepare the data, we begin by converting the date columns to `Date` objects and selecting the relevant columns:
 
 ```{r}
 obs_cens <- sierra_leone_ebola_data |>
@@ -193,7 +193,7 @@ subsample <- 0.2
 ```
 
 Additionally, to speed up computation, we take a random `r 100 * subsample`% subsample of the complete data.
-(In a real analysis, we'd recommend using all the available data).
+(In a real analysis, we'd recommend using all the available data.)
 
 ```{r}
 obs_cens <- obs_cens |>
@@ -210,7 +210,9 @@ linelist_data <- obs_cens |>
   )
 ```
 
-Note that this has made some assumptions about the data in that it has assumed that as we did not supply upper bounds for the primary and secondary events, that the upper bounds are one day after the lower bounds. It has also assumed that the observation time is the maximum of the secondary event upper bound as we also did not supply an observation time column.
+This call to `as_epidist_linelist_data()` has made some assumptions about the data.
+First, because we did not supply upper bounds for the primary and secondary events (`pdate_upr` and `sdate_upr`), it has assumed that the upper bounds are one day after the lower bounds.
+Second, because we also did not supply an observation time column (`obs_date`), it has assumed that the observation time is the maximum of the secondary event upper bounds.
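+To make these defaults explicit, the same bounds could be constructed by hand before calling `as_epidist_linelist_data()`.
+The sketch below (not run) illustrates the one-day upper bounds and latest-date observation time; the lower bound column names `pdate_lwr` and `sdate_lwr` are placeholders for whichever columns hold the lower bounds in your data:
+
+```{r, eval=FALSE}
+obs_cens |>
+  dplyr::mutate(
+    pdate_upr = pdate_lwr + 1, # upper bound one day after the lower bound
+    sdate_upr = sdate_lwr + 1,
+    obs_date = max(sdate_upr) # observation time: latest secondary upper bound
+  )
+```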
 
 ## Model fitting
diff --git a/vignettes/epidist.Rmd b/vignettes/epidist.Rmd
index b4cd727c0..86fa9b254 100644
--- a/vignettes/epidist.Rmd
+++ b/vignettes/epidist.Rmd
@@ -49,8 +49,9 @@
 First, in Section \@ref(data), we explain interval censoring and right truncation.
 Then, in Section \@ref(fit), we show how `epidist` can be used to accurately estimate delay distributions by using a statistical model which properly accounts for these two issues.
 If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best.
+We also recommend the introduction provided by the [nowcasting and forecasting infectious disease dynamics](https://nfidd.github.io/nfidd/) course.
 
-To run this vignette yourself, along with the `epidist` package, you will need the following packages:
+To run this vignette yourself, you will need the following packages in addition to the `epidist` package:
 
 ```{r load-requirements}
 library(epidist)
@@ -62,16 +63,7 @@ library(dplyr)
 
 # Example data {#data}
 
-Data should be formatted as a `data.frame` with the following columns for use within `epidist`:
-
-* `case`: The unique case ID.
-* `ptime`: The time of the primary event.
-* `stime`: The time of the secondary event.
-
-Here we simulate data in this format, and in doing so explain the two main issues with observational delay data.
-
-
-First, we use the [Gillepsie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model:
+First, we use the [Gillespie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model.
 
 ```{r}
 outbreak <- simulate_gillespie(seed = 101)
@@ -88,14 +80,26 @@ outbreak |>
   theme_minimal()
 ```
 
-`outbreak` is a `data.frame` with the columns `case` and `ptime`.
+`outbreak` is a `data.frame` with two columns: `case` and `ptime`.
+Here `ptime` is a numeric column giving the time of infection.
+In reality, it is more common to receive primary event times as dates rather than as numeric values.
+
+```{r}
+head(outbreak)
+```
 
-Now, to generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events:
+To generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events:
 
 ```{r}
 secondary_dist <- data.frame(mu = 1.8, sigma = 0.5)
 class(secondary_dist) <- c("lognormal_samples", class(secondary_dist))
 secondary_dist <- add_mean_sd(secondary_dist)
+
+obs <- outbreak |>
+  simulate_secondary(
+    meanlog = secondary_dist[["mu"]],
+    sdlog = secondary_dist[["sigma"]]
+  )
 ```
 
 (ref:lognormal) The lognormal distribution is skewed to the right. Long delay times still have some probability.
 
@@ -116,14 +120,6 @@ ggplot(data.frame(x = c(0, 30)), aes(x = x)) +
   )
 ```
 
-```{r}
-obs <- outbreak |>
-  simulate_secondary(
-    meanlog = secondary_dist[["mu"]],
-    sdlog = secondary_dist[["sigma"]]
-  )
-```
-
 (ref:delay) Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure \@ref(fig:lognormal)). As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown.
 
 ```{r delay, fig.cap="(ref:delay)"}
@@ -147,7 +143,7 @@ The secondary event time is simply the primary event time plus the delay:
 all(obs$ptime + obs$delay == obs$stime)
 ```
 
-If we were to receive the data `obs` as above then estimating the delay distribution would be easy, and the `epidist` package wouldn't need to exist.
+If we were to receive the complete data `obs` as above, then it would be simple to accurately estimate the delay distribution.
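+For instance, simple estimates of the lognormal parameters can be read directly off the log delays (this works only because `obs` here is complete and fully observed):
+
+```{r}
+c(mu = mean(log(obs$delay)), sigma = sd(log(obs$delay)))
+```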
 However, in reality, during an outbreak we almost never receive the data as above.
 First, the times of primary and secondary events will usually be censored.
@@ -168,15 +164,30 @@ obs_cens <- obs |>
 
 (ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. A common example of this is when events are reported as daily aggregates. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals.
 
 ```{r cens, fig.cap="(ref:cens)"}
-ggplot(obs_cens, aes(x = delay, y = delay_daily)) +
-  geom_point(col = "#E69F00") +
-  geom_abline(slope = 1, intercept = 0, linetype = "dashed", col = "grey") +
-  theme_minimal() +
-  coord_fixed() +
-  labs(x = "Exact delay time (days)", y = "Censored delay time (days)")
+obs_cens |>
+  filter(case %% 50 == 0, case <= 250) |>
+  ggplot(aes(y = case)) +
+  geom_segment(
+    aes(x = ptime, xend = stime, y = case, yend = case),
+    col = "grey"
+  ) +
+  # The primary event censoring intervals
+  geom_errorbarh(
+    aes(xmin = ptime_lwr, xmax = ptime_upr, y = case),
+    col = "#56B4E9", height = 5
+  ) +
+  # The secondary event censoring intervals
+  geom_errorbarh(
+    aes(xmin = stime_lwr, xmax = stime_upr, y = case),
+    col = "#009E73", height = 5
+  ) +
+  geom_point(aes(x = ptime), fill = "white", col = "#56B4E9", shape = 21) +
+  geom_point(aes(x = stime), fill = "white", col = "#009E73", shape = 21) +
+  labs(x = "Event time (day)", y = "Case number") +
+  theme_minimal()
 ```
 
-Next, during an outbreak we will usually be estimating delays in real time.
+During an outbreak we will usually be estimating delays in real time.
 The result is that only those cases with a secondary event occurring before some time will be observed.
 This is called (right) truncation, and biases the observation process towards shorter delays:
@@ -277,13 +288,8 @@ In particular, we use the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo
 fit <- epidist(data = data, chains = 2, cores = 2, refresh = 0)
 ```
 
-The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters (Table \@ref(tab:pars)) in the model.
-Users familiar with Stan and `brms`, can work with `fit` directly.
-Any tool that supports `brms` fitted model objects will be compatible with `fit`.
-
-**Note that here we use the default `rstan` backend but we generally recommend using the `cmdstanr` backend for faster sampling and additional features. This can be set using `backend = "cmdstanr"` after following the installing `cmdstan` instructions in the README.**
-
-(ref:pars) All of the parameters that are included in the model. Many of these parameters (e.g. `swindow` and `pwindow`) are the so called latent variables in the model, and have lengths corresponding to the `sample_size`. We extracted the model parameters using `brms::variables()` and removed the indices.
+The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters in the model, shown in the table below.
+Many of these parameters (e.g. `swindow` and `pwindow`) are so-called latent variables, and have lengths corresponding to the `sample_size`.
 
 ```{r pars}
 pars <- fit |>
@@ -293,10 +299,15 @@ pars <- fit |>
 data.frame(
   Parameter = unique(pars), Length = table(pars)
 ) |>
-  gt() |>
-  tab_caption("(ref:pars)")
+  gt()
 ```
 
+Users familiar with Stan and `brms` can work with `fit` directly.
+Any tool that supports `brms` fitted model objects will be compatible with `fit`.
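+For example, the standard `summary()` method for `brmsfit` objects works on `fit` as usual (a sketch, not run here; packages such as `bayesplot` or `tidybayes` can be used in the same way):
+
+```{r, eval=FALSE}
+summary(fit)
+```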
+
+**Note that here we use the default `rstan` backend, but we generally recommend using the `cmdstanr` backend for faster sampling and additional features.**
+**This can be set using `backend = "cmdstanr"` after following the CmdStan installation instructions in the README.**
+
 The `epidist` package also provides functions to make common post-processing tasks easy.
 For example, individual predictions of the lognormal delay parameters can be extracted using:
diff --git a/vignettes/faq.Rmd b/vignettes/faq.Rmd
index d7b3ce6a4..82f028cdb 100644
--- a/vignettes/faq.Rmd
+++ b/vignettes/faq.Rmd
@@ -17,10 +17,11 @@ vignette: >
 bibliography: references.bib
 ---
 
-Here we provide tips for working with the `epidist` package, and answers to "frequently" asked questions.
-If you have a question about using the package, please create an [issue](https://github.com/epinowcast/epidist/issues) and we will endeavour to get back to you soon!
+Here we provide tips for working with the `epidist` package, and answers to frequently asked questions.
+If you have a question about using the package, please create an [issue](https://github.com/epinowcast/epidist/issues).
+We will endeavour to respond promptly!
 
-The code block below is provided to facilitate reproduction of this script, if required!
+The code block below facilitates reproduction of this script.
 
 ```{r message=FALSE, results='hide', class.source='fold-hide'}
 library(epidist)
@@ -81,7 +82,7 @@ draws <- as_draws_df(fit, variable = c("Intercept", "Intercept_sigma"))
 head(draws)
 ```
 
-## How can I assess if sampling has converged?
+## How can I assess whether sampling has converged?
 
 The output of a call to `epidist` is compatible with typical Stan workflows.
 We recommend use of the [`bayesplot`](http://mc-stan.org/bayesplot/) package for sampling diagnostic plots.
@@ -109,7 +110,7 @@ We particularly highlight two functions which might be useful:
 
 For an example use of these functions, have a look at the [`epidist-paper`](https://github.com/parksw3/epidist-paper) repository containing the code for @park2024estimating.
 (Note that in that codebase, we use `map` as a part of a [`targets`](https://books.ropensci.org/targets/) pipeline.)
 
-## How did you choose the default priors for `epidist`?
+## How are the default priors for `epidist` chosen?
 
 [`brms`](http://paulbuerkner.com/brms/) provides default priors for all parameters.
 However, some of those priors do not make sense in the context of our application.
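+The priors actually used by a fitted model can be inspected with standard `brms` tooling, for example (a sketch, assuming a fitted object `fit` as in the earlier questions):
+
+```{r, eval=FALSE}
+brms::prior_summary(fit)
+```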