Issue #430: Proofread and update vignettes for release #455

Merged: 6 commits, Nov 21, 2024
Changes from all commits
6 changes: 4 additions & 2 deletions vignettes/approx-inference.Rmd

```{r}
library(tidybayes)
library(cmdstanr) # nolint
```

To access the approximate inference methods used in this vignette, we will need to use the `cmdstanr` backend for `brms` (we generally recommend using this backend for fitting models).
To do this, we first need to install CmdStan (see the README for more details).
We can check we have everything we need as follows:

```{r}
cmdstanr::cmdstan_version()
```

## Comparison of time taken

In this example, HMC took much longer to run than the other methods, with Pathfinder being the fastest.
That said, even for HMC the computation time in this case is unlikely to be prohibitive.

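One way such a comparison could be run is sketched below (assuming `linelist_data` is the prepared data object and that `epidist()` forwards `algorithm` to `brms`; this is not the vignette's own code, which is collapsed in this view):

```{r}
# A hedged sketch: compare wall-clock fitting time across algorithms
time_fit <- function(algorithm) {
  system.time(
    epidist(
      data = linelist_data,
      backend = "cmdstanr",
      algorithm = algorithm,
      refresh = 0
    )
  )[["elapsed"]]
}
sapply(c("sampling", "laplace", "meanfield", "pathfinder"), time_fit)
```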
8 changes: 5 additions & 3 deletions vignettes/ebola.Rmd

That is, $\mu$ and $\sigma$ such that when $x \sim \mathcal{N}(\mu, \sigma)$ then $\exp(x)$ follows a lognormal distribution with parameters $\mu$ and $\sigma$.
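A quick numerical check of this relationship (the parameter values here are illustrative only):

```{r}
# If x ~ N(mu, sigma) then exp(x) is lognormal with meanlog = mu and
# sdlog = sigma, so the sample median of exp(x) should be close to exp(mu)
set.seed(1)
x <- rnorm(1e5, mean = 1.8, sd = 0.5)
c(sample_median = median(exp(x)), theoretical_median = exp(1.8))
```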

## Data preparation

To prepare the data, we begin by converting the date columns to `Date` objects and selecting the relevant columns:

```{r}
obs_cens <- sierra_leone_ebola_data |>
  # The rest of this pipeline is collapsed in the diff view; an
  # illustrative completion, with assumed column names
  mutate(
    date_of_symptom_onset = as.Date(date_of_symptom_onset),
    date_of_sample_tested = as.Date(date_of_sample_tested)
  ) |>
  select(id, date_of_symptom_onset, date_of_sample_tested)
```

```{r}
subsample <- 0.2
```

Additionally, to speed up computation, we take a random `r 100 * subsample`% subsample of the complete data.
(In a real analysis, we'd recommend using all the available data.)

```{r}
obs_cens <- obs_cens |>
  # Collapsed in the diff view; a plausible completion taking a random
  # subsample of the data (slice_sample is a dplyr function)
  slice_sample(prop = subsample)
```

We then convert the data into a linelist object for use with `epidist`:

```{r}
linelist_data <- obs_cens |>
  as_epidist_linelist_data(
    # argument values are assumed column names, for illustration; only
    # the lower bounds are supplied
    pdate_lwr = "date_of_symptom_onset",
    sdate_lwr = "date_of_sample_tested"
)
```

In this call to `as_epidist_linelist_data()`, some assumptions have been made about the data.
First, because we did not supply upper bounds for the primary and secondary events (`pdate_upr` and `sdate_upr`), it has assumed that the upper bounds are one day after the lower bounds.
Second, because we also did not supply an observation time column (`obs_date`), it has assumed that the observation time is the maximum of the secondary event upper bounds.
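For concreteness, a sketch of the equivalent explicit call (the derived columns and argument usage here are illustrative assumptions, not the package's internal code):

```{r}
linelist_explicit <- obs_cens |>
  mutate(
    # upper bounds assumed to be one day after the lower bounds
    date_onset_upr = date_of_symptom_onset + 1,
    date_test_upr = date_of_sample_tested + 1,
    # observation time assumed to be the latest secondary upper bound
    date_obs = max(date_test_upr)
  ) |>
  as_epidist_linelist_data(
    pdate_lwr = "date_of_symptom_onset",
    pdate_upr = "date_onset_upr",
    sdate_lwr = "date_of_sample_tested",
    sdate_upr = "date_test_upr",
    obs_date = "date_obs"
  )
```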

## Model fitting

87 changes: 49 additions & 38 deletions vignettes/epidist.Rmd

First, in Section \@ref(data), we explain interval censoring and right truncation.
Then, in Section \@ref(fit), we show how `epidist` can be used to accurately estimate delay distributions by using a statistical model which properly accounts for these two issues.

If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best.
We also recommend the introduction provided by the [nowcasting and forecasting infectious disease dynamics](https://nfidd.github.io/nfidd/) course.

To run this vignette yourself, you will need the `epidist` package as well as the following packages:

```{r load-requirements}
library(epidist)
# The remaining packages in this block are collapsed in the diff view;
# ggplot2 and gt are assumed here from their use later in the vignette
library(ggplot2)
library(gt)
library(dplyr)
```

# Example data {#data}

First, we use the [Gillespie algorithm](https://en.wikipedia.org/wiki/Gillespie_algorithm) to generate infectious disease outbreak data (Figure \@ref(fig:outbreak)) from a stochastic compartmental model.

```{r}
outbreak <- simulate_gillespie(seed = 101)
```

(ref:outbreak) Simulated outbreak data. To make the figure easier to read, only every 50th case is shown.

```{r outbreak, fig.cap="(ref:outbreak)"}
outbreak |>
  # The body of this block is collapsed in the diff view; an illustrative
  # reconstruction plotting each shown case's infection time
  filter(case %% 50 == 0) |>
  ggplot(aes(x = ptime, y = case)) +
  geom_point(col = "#56B4E9") +
  labs(x = "Infection time (day)", y = "Case number") +
  theme_minimal()
```

`outbreak` is a `data.frame` with two columns: `case` and `ptime`.
Here `ptime` is a numeric column giving the time of infection.
In reality, it is more common to receive primary event times as dates rather than numeric values.

```{r}
head(outbreak)
```
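If event times do arrive as dates, they can be converted to the numeric times used here; a small sketch with hypothetical dates:

```{r}
# Convert date-based event times to numeric days since the first event
ptime_date <- as.Date(c("2024-06-01", "2024-06-03", "2024-06-07"))
as.numeric(ptime_date - min(ptime_date))
```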

To generate secondary events, we will use a lognormal distribution (Figure \@ref(fig:lognormal)) for the delay between primary and secondary events:

```{r}
secondary_dist <- data.frame(mu = 1.8, sigma = 0.5)
class(secondary_dist) <- c("lognormal_samples", class(secondary_dist))
secondary_dist <- add_mean_sd(secondary_dist)

obs <- outbreak |>
simulate_secondary(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
```

(ref:lognormal) The lognormal distribution is skewed to the right. Long delay times still have some probability.

```{r lognormal, fig.cap="(ref:lognormal)"}
ggplot(data.frame(x = c(0, 30)), aes(x = x)) +
  # The body of this block is collapsed in the diff view; an illustrative
  # reconstruction of the lognormal density with the parameters above
  stat_function(
    fun = dlnorm,
    args = list(
      meanlog = secondary_dist[["mu"]],
      sdlog = secondary_dist[["sigma"]]
    )
  ) +
  labs(x = "Delay (days)", y = "Density") +
  theme_minimal()
```


(ref:delay) Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure \@ref(fig:lognormal)). As with Figure \@ref(fig:outbreak), to make this figure easier to read, only every 50th case is shown.

```{r delay, fig.cap="(ref:delay)"}
obs |>
  # The body of this block is collapsed in the diff view; an illustrative
  # reconstruction showing primary (blue) and secondary (green) events
  filter(case %% 50 == 0) |>
  ggplot(aes(y = case)) +
  geom_segment(
    aes(x = ptime, xend = stime, y = case, yend = case),
    col = "grey"
  ) +
  geom_point(aes(x = ptime), col = "#56B4E9") +
  geom_point(aes(x = stime), col = "#009E73") +
  labs(x = "Event time (day)", y = "Case number") +
  theme_minimal()
```

The secondary event time is simply the primary event time plus the delay:

```{r}
all(obs$ptime + obs$delay == obs$stime)
```

If we were to receive the complete data `obs` as above, then it would be simple to accurately estimate the delay distribution.
However, in reality, during an outbreak we almost never receive the data as above.
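As a quick check, with the complete data the lognormal parameters can be recovered directly from the observed delays:

```{r}
# Direct estimates of meanlog and sdlog from the complete delays; these
# should be close to the mu = 1.8 and sigma = 0.5 used in the simulation
c(mu_hat = mean(log(obs$delay)), sigma_hat = sd(log(obs$delay)))
```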

First, the times of primary and secondary events will usually be censored.
```{r}
obs_cens <- obs |>
  # The body of this block is collapsed in the diff view; an illustrative
  # reconstruction of daily interval censoring of both event times
  mutate(
    ptime_lwr = floor(ptime),
    ptime_upr = ptime_lwr + 1,
    stime_lwr = floor(stime),
    stime_upr = stime_lwr + 1,
    delay_daily = stime_lwr - ptime_lwr
  )
```

(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. A common example of this is when events are reported as daily aggregates. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals.

```{r cens, fig.cap="(ref:cens)"}
obs_cens |>
filter(case %% 50 == 0, case <= 250) |>
ggplot(aes(y = case)) +
geom_segment(
aes(x = ptime, xend = stime, y = case, yend = case),
col = "grey"
) +
# The primary event censoring intervals
geom_errorbarh(
aes(xmin = ptime_lwr, xmax = ptime_upr, y = case),
col = "#56B4E9", height = 5
) +
# The secondary event censoring intervals
geom_errorbarh(
aes(xmin = stime_lwr, xmax = stime_upr, y = case),
col = "#009E73", height = 5
) +
geom_point(aes(x = ptime), fill = "white", col = "#56B4E9", shape = 21) +
geom_point(aes(x = stime), fill = "white", col = "#009E73", shape = 21) +
labs(x = "Event time (day)", y = "Case number") +
theme_minimal()
```

During an outbreak we will usually be estimating delays in real time.
The result is that only those cases with a secondary event occurring before some time will be observed.
This is called (right) truncation, and biases the observation process towards shorter delays:
<!-- I think possible to do a better job with showing what right truncation is with a figure here! -->
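A sketch of this observation process (`obs_time` is an illustrative cut-off; the vignette's own code is collapsed in this view):

```{r}
# Right truncation: at the observation time, only cases whose secondary
# event has already occurred are observed, under-representing long delays
obs_time <- 25
obs_cens_trunc <- obs_cens |>
  filter(stime_upr <= obs_time)
```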
# Model fitting {#fit}

We now fit the model to the prepared data. In particular, we use the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo algorithm:

```{r}
fit <- epidist(data = data, chains = 2, cores = 2, refresh = 0)
```

The `fit` object is a `brmsfit` object containing MCMC samples from each of the parameters in the model, shown in the table below.
Many of these parameters (e.g. `swindow` and `pwindow`) are the so-called latent variables, and have lengths corresponding to the `sample_size`.

```{r pars}
pars <- fit |>
  brms::variables() |>
  # Collapsed in the diff view; illustratively, strip indices such as
  # "[1]" so that parameters can be tabulated by name
  (\(x) sub("\\[[0-9]+\\]", "", x))()
data.frame(
  Parameter = names(table(pars)), Length = as.numeric(table(pars))
) |>
gt()
```

Users familiar with Stan and `brms` can work with `fit` directly.
Any tool that supports `brms` fitted model objects will be compatible with `fit`.
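For example, the standard `brms` summary method applies directly:

```{r}
summary(fit)
```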

**Note that here we use the default `rstan` backend but we generally recommend using the `cmdstanr` backend for faster sampling and additional features.**
**This can be set using `backend = "cmdstanr"` after following the CmdStan installation instructions in the README.**

The `epidist` package also provides functions to make common post-processing tasks easy.
For example, individual predictions of the lognormal delay parameters can be extracted using:

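(The vignette's own code is collapsed in this view; the sketch below assumes `predict_delay_parameters()` is the relevant `epidist` helper.)

```{r}
# Posterior predictions of the lognormal delay parameters per individual
pred <- predict_delay_parameters(fit)
head(pred)
```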
11 changes: 6 additions & 5 deletions vignettes/faq.Rmd

vignette: >
bibliography: references.bib
---

Here we provide tips for working with the `epidist` package, and answers to frequently asked questions.
If you have a question about using the package, please create an [issue](https://github.com/epinowcast/epidist/issues).
We will endeavour to respond promptly!

The code block below facilitates reproduction of this script.

```{r message=FALSE, results='hide', class.source='fold-hide'}
library(epidist)
# The rest of the setup (data simulation and model fitting) is collapsed
# in the diff view
library(posterior) # assumed: provides as_draws_df(), used below
library(bayesplot) # assumed: used for diagnostics below
```

## How can I access the posterior draws?

```{r}
draws <- as_draws_df(fit, variable = c("Intercept", "Intercept_sigma"))
head(draws)
```

## How can I assess whether sampling has converged?

The output of a call to `epidist` is compatible with typical Stan workflows.
We recommend use of the [`bayesplot`](http://mc-stan.org/bayesplot/) package for sampling diagnostic plots.
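As a minimal sketch, using the `draws` object extracted above:

```{r}
# Trace plots for the two intercept parameters
bayesplot::mcmc_trace(draws, pars = c("Intercept", "Intercept_sigma"))
```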
We particularly highlight two functions which might be useful:
For an example use of these functions, have a look at the [`epidist-paper`](https://github.com/parksw3/epidist-paper) repository containing the code for @park2024estimating.
(Note that in that codebase, we use `map` as a part of a [`targets`](https://books.ropensci.org/targets/) pipeline.)

## How are the default priors for `epidist` chosen?

[`brms`](http://paulbuerkner.com/brms/) provides default priors for all parameters.
However, some of those priors do not make sense in the context of our application.
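A minimal sketch of inspecting the priors a fitted model actually used (`prior_summary()` is a `brms` function; `fit` is the model fitted above):

```{r}
brms::prior_summary(fit)
```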