---
title: "Getting started with epidist"
description: "A quick start guide to using the epidist R package"
output:
bookdown::html_document2:
fig_caption: yes
code_folding: show
number_sections: true
pkgdown:
as_is: true
# csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-numeric-superscript-brackets.csl
link-citations: true
vignette: >
%\VignetteIndexEntry{Getting started with epidist}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: references.bib
---
```{r setup, include=FALSE}
# exclude compile warnings from cmdstanr
knitr::opts_chunk$set(
fig.path = file.path("figures", "epidist-"),
cache = TRUE,
collapse = TRUE,
comment = "#>",
message = FALSE,
warning = FALSE,
error = FALSE
)
```
Many quantities in epidemiology can be thought of as the time between two events or "delays".
Important examples include:
* the incubation period (time between infection and symptom onset),
* the serial interval (time between symptom onset of the infector and symptom onset of the infectee), and
* the generation interval (time between infection of the infector and infection of the infectee).
We refer to each of these delays as the time between a "primary event" and a "secondary event".
Unfortunately, estimating delays accurately from observational data is usually difficult.
In our experience, the two main issues are:
1. interval censoring, and
2. right truncation.
Don't worry if you've not come across these terms before.
First, in Section \@ref(data), we explain interval censoring and right truncation by simulating data similar to what we might observe during an ongoing infectious disease outbreak.
Then, in Section \@ref(fit), we show how `epidist` can be used to accurately estimate delay distributions by using a statistical model which properly accounts for these two issues.
If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best.
We also recommend the introduction provided by the [nowcasting and forecasting infectious disease dynamics](https://nfidd.github.io/nfidd/) course.
To run this vignette yourself, you will need the following packages:
```{r load-requirements}
library(epidist)
library(brms)
library(ggplot2)
library(gt)
library(dplyr)
```
# Example data {#data}
First, we use the [Gillespie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model.
```{r}
outbreak <- simulate_gillespie(seed = 101)
```
(ref:outbreak) Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (Only every 50th case is shown to avoid over-plotting.)
```{r outbreak, fig.cap="(ref:outbreak)"}
outbreak |>
filter(case %% 50 == 0) |>
ggplot(aes(x = ptime, y = case)) +
geom_point(col = "#56B4E9") +
labs(x = "Primary event time (day)", y = "Case number") +
theme_minimal()
```
`outbreak` is a `data.frame` with two columns: `case` and `ptime`.
Here `ptime` is a numeric column giving the time of infection.
In reality, it is more common to receive primary event times as dates rather than numeric values.
```{r}
head(outbreak)
```
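If your data records dates rather than numeric times, one option is to convert them to days since the start of the outbreak before proceeding. The sketch below is illustrative only: the `linelist` data frame and its `pdate` column are hypothetical.
```{r, eval = FALSE}
# Hypothetical sketch: `linelist` and its date column `pdate` are assumed for
# illustration. Convert dates to numeric times in days since the first event.
linelist <- linelist |>
  mutate(ptime = as.numeric(pdate - min(pdate)))
```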
To generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events:
```{r}
secondary_dist <- data.frame(mu = 1.8, sigma = 0.5)
class(secondary_dist) <- c("lognormal_samples", class(secondary_dist))
secondary_dist <- add_mean_sd(secondary_dist)
obs <- outbreak |>
simulate_secondary(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
```
(ref:lognormal) The lognormal distribution is skewed to the right. Long delay times still have some probability.
```{r lognormal, fig.cap="(ref:lognormal)"}
ggplot(data.frame(x = c(0, 30)), aes(x = x)) +
geom_function(
fun = dlnorm,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
theme_minimal() +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density"
)
```
(ref:delay) Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure \@ref(fig:lognormal)). As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown.
```{r delay, fig.cap="(ref:delay)"}
obs |>
filter(case %% 50 == 0) |>
ggplot(aes(y = case)) +
geom_segment(
aes(x = ptime, xend = stime, y = case, yend = case),
col = "grey"
) +
geom_point(aes(x = ptime), col = "#56B4E9") +
geom_point(aes(x = stime), col = "#009E73") +
labs(x = "Event time (day)", y = "Case number") +
theme_minimal()
```
`obs` is now a `data.frame` with further columns for `delay` and `stime`.
The secondary event time is simply the primary event time plus the delay:
```{r}
all(obs$ptime + obs$delay == obs$stime)
```
If we were to receive the complete data `obs` as above then it would be simple to accurately estimate the delay distribution.
However, in reality, during an outbreak we almost never receive the data as above.
First, the times of primary and secondary events will usually be censored.
This means that rather than exact event times, we observe event times within an interval.
Here we suppose that the interval is daily, meaning that only the date of the primary or secondary event, not the exact event time, is reported (Figure \@ref(fig:cens)):
```{r}
obs_cens <- obs |>
mutate(
ptime_lwr = floor(.data$ptime),
ptime_upr = .data$ptime_lwr + 1,
stime_lwr = floor(.data$stime),
stime_upr = .data$stime_lwr + 1,
delay_daily = stime_lwr - ptime_lwr
)
```
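As a quick numerical check of the effect of censoring, we can compare the daily (censored) delay with the true delay for each case; because both event times are rounded down to the nearest day, the two can differ by at most one day in either direction:
```{r}
# The censored daily delay differs from the true delay by less than one day
# in either direction
summary(obs_cens$delay - obs_cens$delay_daily)
```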
(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. A common example of this is when events are reported as daily aggregates. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals.
```{r cens, fig.cap="(ref:cens)"}
obs_cens |>
filter(case %% 50 == 0, case <= 250) |>
ggplot(aes(y = case)) +
geom_segment(
aes(x = ptime, xend = stime, y = case, yend = case),
col = "grey"
) +
# The primary event censoring intervals
geom_errorbarh(
aes(xmin = ptime_lwr, xmax = ptime_upr, y = case),
col = "#56B4E9", height = 5
) +
# The secondary event censoring intervals
geom_errorbarh(
aes(xmin = stime_lwr, xmax = stime_upr, y = case),
col = "#009E73", height = 5
) +
geom_point(aes(x = ptime), fill = "white", col = "#56B4E9", shape = 21) +
geom_point(aes(x = stime), fill = "white", col = "#009E73", shape = 21) +
labs(x = "Event time (day)", y = "Case number") +
theme_minimal()
```
During an outbreak we will usually be estimating delays in real time.
The result is that only those cases with a secondary event occurring before some time will be observed.
This is called (right) truncation, and biases the observation process towards shorter delays:
```{r}
obs_time <- 25
obs_cens_trunc <- obs_cens |>
mutate(obs_time = obs_time) |>
filter(.data$stime_upr <= .data$obs_time)
```
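To see this bias directly, we can compare the mean delay among all cases with the mean delay among the cases observed by `obs_time`; the truncated data over-represent short delays:
```{r}
# Right truncation biases the observed delays towards shorter values
c(
  "mean delay (complete)" = mean(obs_cens$delay),
  "mean delay (truncated)" = mean(obs_cens_trunc$delay)
)
```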
Finally, in reality, it's not possible to observe every case.
We suppose that a sample of `sample_size` individuals is observed:
```{r}
sample_size <- 200
```
This sample size corresponds to `r 100 * round(sample_size / nrow(obs_cens_trunc), 3)`% of the data.
```{r}
obs_cens_trunc_samp <- obs_cens_trunc |>
slice_sample(n = sample_size, replace = FALSE)
```
Another issue, which `epidist` currently does not account for, is that sometimes only the secondary event might be observed, and not the primary event.
For example, symptom onset may be reported, but start of infection unknown.
Discarding events of this type leads to what are called ascertainment biases.
Whereas each case is equally likely to appear in the sample above, under ascertainment bias some cases are more likely to appear in the data than others.
With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using `epidist`.
# Fit the model and compare estimates {#fit}
If we had access to the complete and unaltered `obs`, it would be simple to estimate the delay distribution.
However, with only access to `obs_cens_trunc_samp`, the delay distribution we observe is biased (Figure \@ref(fig:obs-est)) and we must use a statistical model.
(ref:obs-est) The histogram of delays from the complete, retrospective data `obs_cens` matches the underlying distribution quite closely, whereas the histogram from `obs_cens_trunc_samp` shows substantial systematic bias. In this instance the bias caused by censoring is smaller than that caused by right truncation. Nonetheless, we always recommend adjusting for censoring when it is present [@charniga2024best, Table 2].
```{r obs-est, fig.cap="(ref:obs-est)"}
bind_rows(
obs_cens |>
mutate(type = "Complete, retrospective data") |>
select(delay = delay_daily, type),
obs_cens_trunc_samp |>
mutate(type = "Censored, truncated,\nsampled data") |>
select(delay = delay_daily, type)
) |>
group_by(type, delay, .drop = FALSE) |>
summarise(n = n()) |>
mutate(p = n / sum(n)) |>
ggplot() +
geom_col(
aes(x = delay, y = p, fill = type, group = type),
position = position_dodge2(preserve = "single")
) +
scale_fill_manual(values = c("#CC79A7", "#0072B2")) +
geom_function(
data = data.frame(x = c(0, 30)), aes(x = x),
fun = dlnorm,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density",
fill = ""
) +
theme_minimal() +
theme(legend.position = "bottom")
```
The main function you will use for modelling is called `epidist()`^[Technically, `epidist()` is an [S3 generic](http://adv-r.had.co.nz/S3.html) which allows it to work differently for inputs of different classes. This is in part why inputs must be prepared first via `as_epidist_latent_model()` so that they are of the appropriate class!].
We will fit the model `"epidist_latent_model"` which uses latent variables for the time of primary and secondary event of each individual^[In a future vignette, we will explain in more detail the structure of the model!].
To do so, we first prepare the `data` using `as_epidist_latent_model()`:
```{r}
linelist_data <- as_epidist_linelist_data(
obs_cens_trunc_samp$ptime_lwr,
obs_cens_trunc_samp$ptime_upr,
obs_cens_trunc_samp$stime_lwr,
obs_cens_trunc_samp$stime_upr,
obs_time = obs_cens_trunc_samp$obs_time
)
data <- as_epidist_latent_model(linelist_data)
class(data)
```
The `data` object now has the class `epidist_latent_model`.
Using this `data`, we now call `epidist()` to fit the model.
The parameters of the model are inferred using Bayesian inference.
In particular, we use the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the [`brms`](https://paul-buerkner.github.io/brms/) R package [@brms].
```{r}
fit <- epidist(
data = data, chains = 2, cores = 2, refresh = ifelse(interactive(), 250, 0)
)
```
**Note that here we use the default `rstan` backend, but we generally recommend the `cmdstanr` backend for faster sampling and additional features.**
**This can be set using `backend = "cmdstanr"` after following the CmdStan installation instructions in the README.**
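As a sketch (not run here, and assuming CmdStan is installed), the equivalent call with the `cmdstanr` backend would be:
```{r, eval = FALSE}
# Not run: requires the cmdstanr package and a working CmdStan installation
fit <- epidist(
  data = data, chains = 2, cores = 2, backend = "cmdstanr",
  refresh = ifelse(interactive(), 250, 0)
)
```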
The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters in the model, shown in the table below.
Many of these parameters (e.g. `swindow` and `pwindow`) are the so-called latent variables, and have lengths corresponding to `sample_size`.
```{r pars}
pars <- fit |>
variables(reserved = FALSE) |>
gsub(pattern = "\\[\\d+\\]", replacement = "")
data.frame(
Parameter = unique(pars), Length = table(pars)
) |>
gt()
```
Users familiar with Stan and `brms` can work with `fit` directly.
Any tool that supports `brms` fitted model objects will be compatible with `fit`.
For example, we can use the built-in `summary()` function
```{r}
summary(fit)
```
to see that the model has converged and recovered the delay distribution parameters reasonably well.
The `epidist` package also provides functions to make common post-processing tasks easy.
For example, individual predictions of the lognormal delay parameters can be extracted using:
```{r}
pred <- predict_delay_parameters(fit)
```
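Each row of `pred` corresponds to one posterior draw and includes the lognormal `mu` and `sigma` parameters used below:
```{r}
head(pred)
```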
Figure \@ref(fig:fitted-lognormal) shows the lognormal delay distribution obtained for 100 draws from the posterior distribution of the `mu` and `sigma` parameters.
Whereas in Figure \@ref(fig:obs-est) the histogram of the censored, truncated, sampled data was substantially different from the underlying delay distribution, using `epidist()` we have obtained a much closer match to the truth.
(ref:fitted-lognormal) The fitted delay distribution (pink lines, 100 draws from the posterior) compared to the true underlying delay distribution (black line). Each pink line represents one possible delay distribution based on the fitted model parameters.
```{r fitted-lognormal, fig.cap="(ref:fitted-lognormal)"}
# Sample 100 random draws from the posterior
set.seed(123)
sample_draws <- sample(nrow(pred), 100)
ggplot() +
# Plot the true distribution
geom_function(
data = data.frame(x = c(0, 30)),
aes(x = x),
fun = dlnorm,
linewidth = 1.5,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
# Plot 100 sampled posterior distributions
lapply(sample_draws, \(i) {
geom_function(
data = data.frame(x = c(0, 30)),
aes(x = x),
fun = dlnorm,
args = list(
meanlog = pred$mu[i],
sdlog = pred$sigma[i]
),
alpha = 0.1,
color = "#CC79A7"
)
}) +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density"
) +
theme_minimal()
```
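As a final, informal check, we can compare the posterior means of `mu` and `sigma` with the values used to simulate the data:
```{r}
# Compare posterior means of the lognormal parameters with the true values
data.frame(
  parameter = c("mu", "sigma"),
  true_value = c(secondary_dist[["mu"]], secondary_dist[["sigma"]]),
  posterior_mean = c(mean(pred$mu), mean(pred$sigma))
) |>
  gt()
```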
## Bibliography {-}