Issue 9: Getting started vignette #32

Merged: 19 commits, May 23, 2024 (diff shows changes from 15 commits)
303 changes: 176 additions & 127 deletions vignettes/epidist.Rmd
@@ -1,201 +1,250 @@
---
title: "Getting started with epidist"
description: "A quick start guide to using the epidist R package"
output:
  bookdown::html_document2:
    fig_caption: yes
    code_folding: show
    number_sections: true
pkgdown:
  as_is: true
# csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl
link-citations: true
vignette: >
  %\VignetteIndexEntry{Getting started with epidist}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r setup, include=FALSE}
# exclude compile warnings from cmdstanr
knitr::opts_chunk$set(
  fig.path = "figures/epidist-",
  cache = TRUE,
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  error = FALSE
)
```

# Quick start

In this quick start, we demonstrate using `epidist` to estimate a delay distribution from simulated outbreak data that are interval censored and right truncated.

# Package

In this quick start, we also use the `data.table`, `purrr`, and `ggplot2` packages. These are all installed as dependencies when `epidist` is installed. Note that all output from `epidist` is readily usable with other tools, including `tidyverse` packages (see [here](https://mgimond.github.io/rug_2019_12/Index.html) for a comparison).

```{r load-requirements}
library(epidist)
library(data.table)
library(purrr)
library(ggplot2)
```

Many quantities in epidemiology can be thought of as the time between two events, or "delays".
Important examples include:

* the incubation period (time between infection and symptom onset),
* the serial interval (time between symptom onset of the infector and symptom onset of the infectee), and
* the generation interval (time between infection of the infector and infection of the infectee).

Unfortunately, estimating delays accurately from observational data is usually difficult.
In our experience, the two main issues are:

1. interval censoring, and
2. right truncation.

Don't worry if you've not come across these terms before.
In Section \@ref(data), we will explain what they mean by simulating data like those we might observe during an ongoing infectious disease outbreak.
Next, in Section \@ref(fit), we show how `epidist` can be used to estimate delays using a statistical model which properly accounts for these two issues.
Finally, in Section \@ref(compare), we demonstrate that the fitted delay distribution accurately recovers the underlying truth.

If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best.

# Example data {#data}

Data should be formatted as a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the following columns for use within the `epidist` package:

* `case`: an integer case identifier,
* `ptime`: the primary event time (e.g. the time of infection),
* `stime`: the secondary event time (e.g. the time of symptom onset).

Here we simulate data in this format, and in doing so explain the two main issues with observational delay data.
<!-- Please see as of yet unwritten vignette for information about how to transform your data of other formats to the right format. -->
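
As a purely hypothetical illustration (the values here are invented), a table in this format might look like:

```{r}
# Hypothetical example of the expected format (values invented):
# one row per case, with primary and secondary event times in days
data.table(
  case = 1:3,
  ptime = c(0.0, 1.2, 1.9),
  stime = c(2.1, 4.0, 3.4)
)
```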

First, we use the [Gillespie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model:

```{r}
outbreak <- simulate_gillespie(seed = 101)
```
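
We can take a quick look at the first few rows of the simulated data:

```{r}
head(outbreak)
```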

(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (Only every 50th case is shown to avoid over-plotting.)

```{r outbreak, fig.cap="(ref:outbreak)"}
outbreak[case %% 50 == 0, ] |>
  ggplot(aes(x = ptime, y = case)) +
  geom_point(col = "#56B4E9") +
  labs(x = "Primary event time (day)", y = "Case number") +
  theme_minimal()
```

`outbreak` is a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) with the columns `case` and `ptime`.

Now, to generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events:

```{r}
secondary_dist <- data.table(meanlog = 1.8, sdlog = 0.5) |>
  add_natural_scale_mean_sd()
```
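
For reference, the natural scale moments of a lognormal distribution follow directly from `meanlog` and `sdlog`, so we can check them by hand (this computes the standard formulas directly, independent of whatever columns `add_natural_scale_mean_sd()` adds):

```{r}
# Natural scale moments of the lognormal delay distribution:
# mean = exp(meanlog + sdlog^2 / 2), sd = mean * sqrt(exp(sdlog^2) - 1)
mean_natural <- exp(1.8 + 0.5^2 / 2)
sd_natural <- mean_natural * sqrt(exp(0.5^2) - 1)
round(c(mean = mean_natural, sd = sd_natural), 2)
```

So the mean delay is roughly 6.9 days, with a standard deviation of roughly 3.7 days.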

(ref:lognormal) The lognormal distribution is skewed to the right, so long delays still occur with some probability.

```{r lognormal, fig.cap="(ref:lognormal)"}
ggplot(data.frame(x = c(0, 8)), aes(x = x)) +
  geom_function(
    fun = dlnorm,
    args = list(
      meanlog = secondary_dist[["meanlog"]],
      sdlog = secondary_dist[["sdlog"]]
    )
  ) +
  theme_minimal() +
  labs(
    x = "Delay between primary and secondary event (days)",
    y = "Probability density"
  )
```
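
For intuition, quantiles of this distribution are available in closed form; for example, the median and 95th percentile of the delay:

```{r}
# Median and 95th percentile of the simulated delay distribution
qlnorm(
  c(0.5, 0.95),
  meanlog = secondary_dist[["meanlog"]],
  sdlog = secondary_dist[["sdlog"]]
)
```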

```{r}
obs <- outbreak |>
  simulate_secondary(
    meanlog = secondary_dist[["meanlog"]],
    sdlog = secondary_dist[["sdlog"]]
  )
```

(ref:delay) Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure \@ref(fig:lognormal)). As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown.

```{r delay, fig.cap="(ref:delay)"}
obs[case %% 50 == 0, ] |>
  ggplot(aes(y = case)) +
  geom_segment(
    aes(x = ptime, xend = stime, y = case, yend = case),
    col = "grey"
  ) +
  geom_point(aes(x = ptime), col = "#56B4E9") +
  geom_point(aes(x = stime), col = "#009E73") +
  labs(x = "Event time (day)", y = "Case number") +
  theme_minimal()
```

`obs` is now a [`data.table`](https://cran.r-project.org/web/packages/data.table/index.html) object with further columns for `delay` and `stime`.
The secondary event time is simply the primary event time plus the delay:

```{r}
all(obs$ptime + obs$delay == obs$stime)
```

If we were to receive the data `obs` as above then estimating the delay distribution would be easy, and the `epidist` package wouldn't need to exist.
However, in reality, during an outbreak we almost never receive the data as above.

First, the times of primary and secondary events will usually be censored.
This means that rather than exact event times, we observe event times within an interval.
Here we suppose that the interval is daily, meaning that only the date of the primary or secondary event, not the exact event time, is reported (Figure \@ref(fig:cens)):

```{r}
# observe_process() should be renamed and include a "daily" i.e. 1 argument
obs_cens <- obs |> observe_process()
```
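
To build intuition for what daily censoring does, consider a single case with invented event times (an illustration of the idea, not necessarily the exact implementation used by `observe_process()`):

```{r}
# Illustration (invented times): reporting only the day of each event
# means the censored delay can differ from the exact delay
ptime_exact <- 3.7
stime_exact <- 8.2
delay_exact <- stime_exact - ptime_exact # 4.5 days
delay_censored <- floor(stime_exact) - floor(ptime_exact) # 5 days
c(exact = delay_exact, censored = delay_censored)
```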

(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times.

```{r cens, fig.cap="(ref:cens)"}
ggplot(obs_cens, aes(x = delay, y = delay_daily)) +
  geom_point(col = "#E69F00") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", col = "grey") +
  theme_minimal() +
  coord_fixed() +
  labs(x = "Exact delay time (days)", y = "Censored delay time (days)")
```

Next, during an outbreak we will usually be estimating delays in real time.
The result is that only those cases with a secondary event occurring before the current observation time will be observed.
This is called (right) truncation, and biases the observation process towards shorter delays:
<!-- I think possible to do a better job with showing what right truncation is with a figure here! -->

```{r}
obs_time <- 25
# filter_obs_by_obs_time() should be renamed to refer to stime
obs_cens_trunc <- filter_obs_by_obs_time(obs_cens, obs_time = obs_time)
```
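
The bias is easy to see directly: the mean delay among cases observable by `obs_time` is shorter than the mean delay over all cases (a quick check, assuming `obs_cens` retains the exact `delay` and `stime` columns):

```{r}
# Right truncation discards long delays, biasing the mean downwards
mean(obs_cens$delay)
mean(obs_cens[stime <= obs_time]$delay)
```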

Finally, in reality, it's not possible to observe every case.
We suppose that a sample of individuals of size `sample_size` is observed:

```{r}
sample_size <- 200
```

This sample size corresponds to `r 100 * round(sample_size / nrow(obs_cens_trunc), 3)`% of the data.

```{r}
obs_cens_trunc_samp <-
  obs_cens_trunc[sample(seq_len(.N), sample_size, replace = FALSE)]
```

With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using `epidist`.

# Fit the model {#fit}

If we had access to the data `obs`, then it would be simple to estimate the delay distribution (Figure \@ref(fig:obs-est)).

(ref:obs-est) The histogram of delays closely matches the underlying distribution.

```{r obs-est, fig.cap="(ref:obs-est)"}
ggplot() +
  geom_histogram(
    data = obs, aes(x = delay, y = after_stat(density))
  ) +
  geom_function(
    data = data.frame(x = c(0, 30)),
    aes(x = x),
    fun = dlnorm,
    args = list(
      meanlog = secondary_dist[["meanlog"]],
      sdlog = secondary_dist[["sdlog"]]
    )
  ) +
  labs(
    x = "Delay between primary and secondary event (days)",
    y = "Probability density"
  ) +
  theme_minimal()
```

However, with access only to `obs_cens_trunc_samp`, we must use a statistical model to avoid severely biased estimates!

The main `epidist` model function is `latent_truncation_censoring_adjusted_delay`.
<!-- In a future vignette, we will explain in more detail the structure of the model. -->
The parameters of the model are inferred using Bayesian inference.
In particular, we use the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the [`brms`](https://paul-buerkner.github.io/brms/) R package [@brms].

```{r}
fit <- latent_truncation_censoring_adjusted_delay(
  data = obs_cens_trunc_samp, cores = 4, refresh = 0
)
```

The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters (Table \@ref(tab:pars)) in the model.
Many of these are so-called "latent variables".

```{r pars}
pars <- fit$fit@par_dims |>
  map(.f = function(x) if (identical(x, integer(0))) 1 else x)

data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |>
  gt::gt(caption = "All of the parameters that are included in the model")
```

Although some users may wish to work with `fit` directly, we will see in the following section that `epidist` provides functions to make common post-processing tasks easy.
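
Since `fit` is a `brmsfit` object, standard `brms` post-processing also applies; for example (output not shown):

```{r, eval=FALSE}
# Posterior summaries and MCMC diagnostics via brms
summary(fit)
```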
# Compare estimates {#compare}

Here we compare the modelled estimates to the underlying truth, to convince the user that the model works.
Perhaps, as an exercise, we could show the "really simple" estimate being wrong here, otherwise it won't be so impressive that the model is right.
This section should also show the user how to get objects that they might want out of the fitted object.

```{r}
draws <- extract_lognormal_draws(fit)
draws_long <- draws_to_long(draws)
summarise_draws(draws_long, sf = 2)
```
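
As suggested above, a naive estimate makes the contrast concrete. Ignoring the censoring and truncation entirely, the maximum likelihood lognormal fit is just the mean and standard deviation of the log delays (a quick sketch, assuming the exact `delay` column is retained in `obs_cens_trunc_samp`); these naive estimates should be biased towards shorter delays relative to the true `meanlog = 1.8` and `sdlog = 0.5`, which the model-based estimates above should recover:

```{r}
# Naive estimate ignoring censoring and truncation: biased downwards
round(
  c(
    meanlog = mean(log(obs_cens_trunc_samp$delay)),
    sdlog = sd(log(obs_cens_trunc_samp$delay))
  ),
  2
)
```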

## Bibliography {-}