Issue 9: Getting started vignette #32

Merged: 19 commits, May 23, 2024 (diff shows changes from 15 commits)
303 changes: 176 additions & 127 deletions vignettes/epidist.Rmd
@@ -1,201 +1,250 @@
---
title: "Getting started with epidist"
description: "A quick start guide to using the epidist R package"
output:
  bookdown::html_document2:
    fig_caption: yes
    code_folding: show
    number_sections: true
pkgdown:
  as_is: true
# csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl
link-citations: true
vignette: >
  %\VignetteIndexEntry{Getting started with epidist}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r setup, include=FALSE}
# exclude compile warnings from cmdstanr
knitr::opts_chunk$set(
  fig.path = "figures/epidist-",
  cache = TRUE,
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  error = FALSE
)
```

# Quick start

In this quick start, we demonstrate using `epidist` to estimate a delay distribution from simulated outbreak data that are interval censored and right truncated.

# Package

In this quick start, we also use the `data.table`, `purrr`, and `ggplot2` packages. These are all installed as dependencies when `epidist` is installed. Note that all output from `epidist` is readily usable with other tools, including `tidyverse` packages (see [here](https://mgimond.github.io/rug_2019_12/Index.html) for a comparison).

```{r load-requirements}
library(epidist)
library(data.table)
library(purrr)
library(ggplot2)
```

Many quantities in epidemiology can be thought of as the time between two events, or "delays".
Important examples include:

* the incubation period (time between infection and symptom onset),
* the serial interval (time between symptom onset of the infector and symptom onset of the infectee), and
* the generation interval (time between infection of the infector and infection of the infectee).

Unfortunately, estimating delays accurately from observational data is usually difficult.
In our experience, the two main issues are:

1. interval censoring, and
2. right truncation.

Don't worry if you've not come across these terms before.
In Section \@ref(data), we will explain what they mean by simulating data like those we might observe during an ongoing infectious disease outbreak.
Next, in Section \@ref(fit), we show how `epidist` can be used to estimate delays using a statistical model which properly accounts for these two issues.
Finally, in Section \@ref(compare), we demonstrate that the fitted delay distribution accurately recovers the underlying truth.

If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best.

# Example data {#data}

Data should be formatted as a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the following columns for use within the `epidist` package:

* `case`: an integer case identifier,
* `ptime`: the primary event time (e.g. the time of infection),
* `stime`: the secondary event time (e.g. the time of symptom onset).

Here we simulate data in this format, and in doing so explain the two main issues with observational delay data.
<!-- Please see as of yet unwritten vignette for information about how to transform your data of other formats to the right format. -->
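
As a purely hypothetical illustration (the values here are invented), a table in this format might look like:

```{r}
# Hypothetical example of the expected format (values invented):
# one row per case, with primary and secondary event times in days
data.table(
  case = 1:3,
  ptime = c(0.0, 1.2, 1.9),
  stime = c(2.1, 4.0, 3.4)
)
```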

First, we use the [Gillespie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model:

```{r}
outbreak <- simulate_gillespie(seed = 101)
```
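
We can take a quick look at the first few rows of the simulated data:

```{r}
head(outbreak)
```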

(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (Only every 50th case is shown to avoid over-plotting.)

```{r outbreak, fig.cap="(ref:outbreak)"}
outbreak[case %% 50 == 0, ] |>
  ggplot(aes(x = ptime, y = case)) +
  geom_point(col = "#56B4E9") +
  labs(x = "Primary event time (day)", y = "Case number") +
  theme_minimal()
```

`outbreak` is a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the columns `case` and `ptime`.

Now, to generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events:

```{r}
secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |>
  add_natural_scale_mean_sd()
```
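
For reference, the natural scale moments of a lognormal distribution follow directly from `meanlog` and `sdlog`, so we can check them by hand (this computes the standard formulas directly, independent of whatever columns `add_natural_scale_mean_sd()` adds):

```{r}
# Natural scale moments of the lognormal delay distribution:
# mean = exp(meanlog + sdlog^2 / 2), sd = mean * sqrt(exp(sdlog^2) - 1)
mean_natural <- exp(1.8 + 0.5^2 / 2)
sd_natural <- mean_natural * sqrt(exp(0.5^2) - 1)
round(c(mean = mean_natural, sd = sd_natural), 2)
```

So the mean delay is roughly 6.9 days, with a standard deviation of roughly 3.7 days.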

(ref:lognormal) The lognormal distribution is skewed to the right, so long delays still occur with some probability.

```{r lognormal, fig.cap="(ref:lognormal)"}
ggplot(data.frame(x = c(0, 8)), aes(x = x)) +
  geom_function(
    fun = dlnorm,
    args = list(
      meanlog = secondary_dist[["meanlog"]],
      sdlog = secondary_dist[["sdlog"]]
    )
  ) +
  theme_minimal() +
  labs(
    x = "Delay between primary and secondary event (days)",
    y = "Probability density"
  )
```
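
For intuition, quantiles of this distribution are available in closed form; for example, the median and 95th percentile of the delay:

```{r}
# Median and 95th percentile of the simulated delay distribution
qlnorm(
  c(0.5, 0.95),
  meanlog = secondary_dist[["meanlog"]],
  sdlog = secondary_dist[["sdlog"]]
)
```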

```{r}
obs <- outbreak |>
  simulate_secondary(
    meanlog = secondary_dist[["meanlog"]],
    sdlog = secondary_dist[["sdlog"]]
  )
```

(ref:delay) Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure \@ref(fig:lognormal)). As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown.

```{r delay, fig.cap="(ref:delay)"}
obs[case %% 50 == 0, ] |>
  ggplot(aes(y = case)) +
  geom_segment(
    aes(x = ptime, xend = stime, y = case, yend = case),
    col = "grey"
  ) +
  geom_point(aes(x = ptime), col = "#56B4E9") +
  geom_point(aes(x = stime), col = "#009E73") +
  labs(x = "Event time (day)", y = "Case number") +
  theme_minimal()
```

`obs` is now a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) object with further columns for `delay` and `stime`.
The secondary event time is simply the primary event time plus the delay:

```{r}
all(obs$ptime + obs$delay == obs$stime)
```

If we were to receive the data `obs` as above then estimating the delay distribution would be easy, and the `epidist` package wouldn't need to exist.
However, in reality, during an outbreak we almost never receive the data as above.

First, the times of primary and secondary events will usually be censored.
This means that rather than exact event times, we observe event times within an interval.
Here we suppose that the interval is daily, meaning that only the date of the primary or secondary event, not the exact event time, is reported (Figure \@ref(fig:cens)):

```{r}
# observe_process() should be renamed and include a "daily" i.e. 1 argument
obs_cens <- obs |> observe_process()
```
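
To build intuition for what daily censoring does, consider a single case with invented event times (an illustration of the idea, not necessarily the exact implementation used by `observe_process()`):

```{r}
# Illustration (invented times): reporting only the day of each event
# means the censored delay can differ from the exact delay
ptime_exact <- 3.7
stime_exact <- 8.2
delay_exact <- stime_exact - ptime_exact # 4.5 days
delay_censored <- floor(stime_exact) - floor(ptime_exact) # 5 days
c(exact = delay_exact, censored = delay_censored)
```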

(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times.

```{r cens, fig.cap="(ref:cens)"}
ggplot(obs_cens, aes(x = delay, y = delay_daily)) +
  geom_point(col = "#E69F00") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", col = "grey") +
  theme_minimal() +
  coord_fixed() +
  labs(x = "Exact delay time (days)", y = "Censored delay time (days)")
```

Next, during an outbreak we will usually be estimating delays in real time.
The result is that only those cases with a secondary event occurring before the current observation time will be observed.
This is called (right) truncation, and biases the observation process towards shorter delays:
<!-- I think possible to do a better job with showing what right truncation is with a figure here! -->

```{r}
obs_time <- 25
# filter_obs_by_obs_time() should be renamed to refer to stime
obs_cens_trunc <- filter_obs_by_obs_time(obs_cens, obs_time = obs_time)
```
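
The bias is easy to see directly: the mean delay among cases observable by `obs_time` is shorter than the mean delay over all cases (a quick check, assuming `obs_cens` retains the exact `delay` and `stime` columns):

```{r}
# Right truncation discards long delays, biasing the mean downwards
mean(obs_cens$delay)
mean(obs_cens[stime <= obs_time]$delay)
```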

Finally, in reality, it's not possible to observe every case.
We suppose that a sample of individuals of size `sample_size` is observed:

```{r}
sample_size <- 200
```

This sample size corresponds to `r 100 * round(sample_size / nrow(obs_cens_trunc), 3)`% of the data.

```{r}
obs_cens_trunc_samp <-
  obs_cens_trunc[sample(seq_len(.N), sample_size, replace = FALSE)]
```

With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using `epidist`.

# Fit the model {#fit}

If we had access to the data `obs`, then it would be simple to estimate the delay distribution (Figure \@ref(fig:obs-est)).

(ref:obs-est) The histogram of delays closely matches the underlying distribution.

```{r obs-est, fig.cap="(ref:obs-est)"}
ggplot() +
  geom_histogram(
    data = obs, aes(x = delay, y = after_stat(density))
  ) +
  geom_function(
    data = data.frame(x = c(0, 30)),
    aes(x = x),
    fun = dlnorm,
    args = list(
      meanlog = secondary_dist[["meanlog"]],
      sdlog = secondary_dist[["sdlog"]]
    )
  ) +
  labs(
    x = "Delay between primary and secondary event (days)",
    y = "Probability density"
  ) +
  theme_minimal()
```

However, with access only to `obs_cens_trunc_samp`, we must use a statistical model to avoid severely biased estimates!

The main `epidist` model function is `latent_truncation_censoring_adjusted_delay`.
<!-- In a future vignette, we will explain in more detail the structure of the model. -->
The parameters of the model are inferred using Bayesian inference.
In particular, we use the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the [`brms`](https://paul-buerkner.github.io/brms/) R package [@brms].

```{r}
fit <- latent_truncation_censoring_adjusted_delay(
  data = obs_cens_trunc_samp, cores = 4, refresh = 0
)
```

The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters (Table \@ref(tab:pars)) in the model.
Many of these are so-called "latent variables".

```{r pars}
pars <- fit$fit@par_dims |>
  map(.f = function(x) if (identical(x, integer(0))) 1 else x)

data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |>
  gt::gt(caption = "All of the parameters that are included in the model")
```

Although some users may wish to work with `fit` directly, we will see in the following section that `epidist` provides functions to make common post-processing tasks easy.
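
Since `fit` is a `brmsfit` object, standard `brms` post-processing also applies; for example (output not shown):

```{r, eval=FALSE}
# Posterior summaries and MCMC diagnostics via brms
summary(fit)
```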
# Compare estimates {#compare}

Here we compare the modelled estimates to the underlying truth, to convince the user that the model works.
Perhaps, as an exercise, we could show the "really simple" estimate being wrong here, otherwise it won't be so impressive that the model is right.
This section should also show the user how to get objects that they might want out of the fitted object.

```{r}
draws <- extract_lognormal_draws(fit)
draws_long <- draws_to_long(draws)
summarise_draws(draws_long, sf = 2)
```
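
As suggested above, a naive estimate makes the contrast concrete. Ignoring the censoring and truncation entirely, the maximum likelihood lognormal fit is just the mean and standard deviation of the log delays (a quick sketch, assuming the exact `delay` column is retained in `obs_cens_trunc_samp`); these naive estimates should be biased towards shorter delays relative to the true `meanlog = 1.8` and `sdlog = 0.5`, which the model-based estimates above should recover:

```{r}
# Naive estimate ignoring censoring and truncation: biased downwards
round(
  c(
    meanlog = mean(log(obs_cens_trunc_samp$delay)),
    sdlog = sd(log(obs_cens_trunc_samp$delay))
  ),
  2
)
```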

## Bibliography {-}