Skip to content

Commit

Permalink
Issue 14: Fitting delay estimates using primarycensoreddist (#33)
Browse files Browse the repository at this point in the history
* start by improving the stan vignette

* add data simulatio

* first working version

* complete first draft

* update plot theme

* solve lint issue

* resolve duplicate chunk label

* resolve duplicate chunk label

* resolve duplicate chunk label

* add in missing dplyr

* tidy up new vignette
  • Loading branch information
seabbs authored Sep 4, 2024
1 parent a088846 commit ded8bb9
Show file tree
Hide file tree
Showing 6 changed files with 276 additions and 3 deletions.
1 change: 1 addition & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ Depends:
Suggests:
bookdown,
cmdstanr,
dplyr,
knitr,
ggplot2,
rmarkdown,
Expand Down
1 change: 1 addition & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@

* Added a getting started vignette.
* Added a vignette showcasing how to use the package Stan code with `cmdstanr`.
* Added a vignette showcasing how to fit distributions using the `cmdstanr` package.

# primarycensoreddist 0.1.0

Expand Down
3 changes: 3 additions & 0 deletions inst/stan/primary_censored_dist.stan
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,9 @@ real primary_censored_integrand(real x, real xc, array[] real theta,

// Compute adjusted delay
real d_adj = d - x;
if (d_adj <= 0) {
return 0;
}

// Compute log probabilities
real log_cdf = dist_lcdf(d_adj | params, dist_id);
Expand Down
2 changes: 2 additions & 0 deletions vignettes/fitting-dists-with-fitdistr.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -16,3 +16,5 @@ vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

*Under construction*
256 changes: 256 additions & 0 deletions vignettes/fitting-dists-with-stan.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -16,3 +16,259 @@ vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

# Introduction

## What are we going to do in this Vignette

In this vignette, we'll demonstrate how to use `primarycensoreddist` in conjunction with Stan for Bayesian inference of epidemiological delay distributions. We'll cover the following key points:

1. Simulating censored delay distribution data
2. Fitting a naive model using cmdstan
3. Evaluating the naive model's performance
4. Fitting an improved model using `primarycensoreddist` functionality
5. Comparing the `primarycensoreddist` model's performance to the naive model

## What might I need to know before starting

This vignette builds on the concepts introduced in the [Getting Started with primarycensoreddist](primarycensoreddist.html) vignette and assumes familiarity with using Stan tools as covered in the [How to use primarycensoreddist with Stan](using-stan-tools.html) vignette.

## Packages used in this vignette

Alongside the `primarycensoreddist` package we will use the `cmdstanr` package for interfacing with cmdstananalysis. We will also use the `ggplot2` package for plotting and `dplyr` for data manipulation.

```{r setup, message = FALSE}
library(primarycensoreddist)
library(cmdstanr)
library(ggplot2)
library(dplyr)
```

# Simulating censored and truncated delay distribution data

We'll start by simulating some censored and truncated delay distribution data. We'll use the `rprimarycensoreddist` function (actually we will use the `rpcens ` alias for brevity).

```{r sample-lognormal}
set.seed(123) # For reproducibility
# Define the true distribution parameters
n <- 1000
meanlog <- 1.5
sdlog <- 0.75
# Generate varying pwindow, swindow, and obs_time lengths
pwindows <- sample(1, n, replace = TRUE)
swindows <- sample(1, n, replace = TRUE)
obs_times <- 10
# Function to generate a single sample
generate_sample <- function(pwindow, swindow, obs_time) {
rpcens(
1, rlnorm,
meanlog = meanlog, sdlog = sdlog,
pwindow = pwindow, swindow = swindow, D = obs_time
)
}
# Generate samples
samples <- mapply(generate_sample, pwindows, swindows, obs_times)
# Create initial data frame
delay_data <- data.frame(
pwindow = pwindows,
swindow = swindows,
obs_time = obs_times,
observed_delay = samples
)
head(delay_data)
# Aggregate to unique combinations and count occurrences
# Aggregate to unique combinations and count occurrences
delay_counts <- delay_data |>
summarise(n = n(), .by = c(pwindow, swindow, obs_time, observed_delay))
head(delay_counts)
# Compare the samples with and without secondary censoring to the true
# distribution
# Calculate empirical CDF
empirical_cdf <- ecdf(samples)
# Create a sequence of x values for the theoretical CDF
x_seq <- seq(min(samples), max(samples), length.out = 100)
# Calculate theoretical CDF
theoretical_cdf <- plnorm(x_seq, meanlog = meanlog, sdlog = sdlog)
# Create a long format data frame for plotting
cdf_data <- data.frame(
x = rep(x_seq, 2),
probability = c(empirical_cdf(x_seq), theoretical_cdf),
type = rep(c("Observed", "Theoretical"), each = length(x_seq)),
stringsAsFactors = FALSE
)
# Plot
ggplot(cdf_data, aes(x = x, y = probability, color = type)) +
geom_step(linewidth = 1) +
scale_color_manual(
values = c(Observed = "#4292C6", Theoretical = "#252525")
) +
geom_vline(
aes(xintercept = mean(samples), color = "Observed"),
linetype = "dashed", linewidth = 1
) +
geom_vline(
aes(xintercept = exp(meanlog + sdlog^2 / 2), color = "Theoretical"),
linetype = "dashed", linewidth = 1
) +
labs(
title = "Comparison of Observed vs Theoretical CDF",
x = "Delay",
y = "Cumulative Probability",
color = "CDF Type"
) +
theme_minimal() +
theme(
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.position = "bottom"
)
```

We've aggregated the data to unique combinations of `pwindow`, `swindow`, and `obs_time` and counted the number of occurrences of each `observed_delay` for each combination. This is the data we will use to fit our model.

# Fitting a naive model using cmdstan

We'll start by fitting a naive model using cmdstan. We'll use the `cmdstanr` package to interface with cmdstan. We define the model in a string and then write it to a file as in the [How to use primarycensoreddist with Stan](using-stan-tools.html) vignette.

```{r naive-model}
writeLines(
"data {
int<lower=0> N; // number of observations
vector[N] y; // observed delays
vector[N] n; // number of occurrences for each delay
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
// Priors
mu ~ normal(1, 1);
sigma ~ normal(0.5, 1);
// Likelihood
target += n .* lognormal_lpdf(y | mu, sigma);
}",
con = file.path(tempdir(), "naive_lognormal.stan")
)
```

Now let's compile the model

```{r compile-naive-model}
naive_model <- cmdstan_model(file.path(tempdir(), "naive_lognormal.stan"))
```

aand now let's fit the compiled model.

```{r fit-naive-model, message = FALSE}
naive_fit <- naive_model$sample(
data = list(
# Add a small constant to avoid log(0)
y = delay_counts$observed_delay + 1e-6,
n = delay_counts$n,
N = nrow(delay_counts)
),
chains = 4,
parallel_chains = 4,
show_messages = FALSE,
refresh = 0
)
naive_fit
```

We see that the model has converged and the diagnostics look good. However, just from the model posterior summary we see that we might not be very happy with the fit. `mu` is smaller than the target `r meanlog` and `sigma` is larger than the target `r sdlog`.

# Fitting an improved model using primarycensoreddist

We'll now fit an improved model using the `primarycensoreddist` package. The main improvement is that we will use the `primary_censored_dist_lpdf` function to fit the model. This is the Stan version of the `pcens()` function and accounts for the primary and secondary censoring windows as well as the truncation. We encode that our primary distribution is a lognormal distribution by passing 1 as the `dist_id` parameter and that our primary event distribution is uniform by passing 1 as the `primary_dist_id` parameter. See the [Stan documentation](https://primarycensoreddist.epinowcast.org/stan/primary__censored__dist_8stan.html#acc97240dee1bc19e4f02013118f3857d) for more details on the `primary_censored_dist_lpdf` function.

```{r pimarycensoreddist-model}
writeLines(
"
functions {
#include primary_censored_dist.stan
#include expgrowth.stan
}
data {
int<lower=0> N; // number of observations
array[N] int<lower=0> y; // observed delays
array[N] int<lower=0> n; // number of occurrences for each delay
array[N] int<lower=0> pwindow; // primary censoring window
array[N] int<lower=0> swindow; // secondary censoring window
array[N] int<lower=0> D; // maximum delay
}
transformed data {
array[0] real primary_params;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
// Priors
mu ~ normal(1, 1);
sigma ~ normal(0.5, 0.5);
// Likelihood
for (i in 1:N) {
target += n[i] * primary_censored_dist_lpmf(
y[i] | 1, {mu, sigma},
pwindow[i], swindow[i], D[i],
1, primary_params
);
}
}",
con = file.path(tempdir(), "primarycensoreddist_lognormal.stan")
)
```

Now let's compile the model

```{r compile-primarycensoreddist-model, message = FALSE}
primarycensoreddist_model <- cmdstan_model(
file.path(tempdir(), "primarycensoreddist_lognormal.stan"),
include_paths = pcd_stan_path()
)
```

Now let's fit the compiled model.

```{r fit-primarycensoreddist-model, message = FALSE}
primarycensoreddist_fit <- primarycensoreddist_model$sample(
data = list(
y = delay_counts$observed_delay,
n = delay_counts$n,
pwindow = delay_counts$pwindow,
swindow = delay_counts$swindow,
D = delay_counts$obs_time,
N = nrow(delay_counts)
),
chains = 4,
parallel_chains = 4,
init = list( # we use this to resolve initialisation issues
list(mu = 1.5, sigma = 0.6),
list(mu = 1.5, sigma = 0.4),
list(mu = 1.5, sigma = 0.3),
list(mu = 1.5, sigma = 0.55)
),
refresh = 0,
show_messages = FALSE
)
primarycensoreddist_fit
```

We see that the model has converged and the diagnostics look good. We also see that the posterior means are very near the true parameters and the 90% credible intervals include the true parameters.
16 changes: 13 additions & 3 deletions vignettes/using-stan-tools.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ In this vignette, we'll explore how to use the `primarycensoreddist` package in
Stan is a probabilistic programming language for statistical inference. It provides a flexible and efficient platform for Bayesian modeling and is widely used in various fields of data science and statistics. In this vignette, we'll use Stan in conjunction with `primarycensoreddist` to perform Bayesian inference on censored data.

For more information on Stan:

- [Stan's official website](https://mc-stan.org/)
- [Stan documentation](https://mc-stan.org/users/documentation/)
- [Stan forums](https://discourse.mc-stan.org/) for community support and discussions
Expand Down Expand Up @@ -108,25 +109,34 @@ writeLines(
text = c(
"functions {",
"#include expgrowth.stan",
"}",
"generated quantities {",
" real y = expgrowth_rng(0, 1, 0.4);",
"}"
),
con = expgrowth_stan_file
)
```
}
```

We can now use this file to compile a model. **Note** that we need to include the path to the `primarycensoreddist` Stan functions using the `include_paths` argument to `cmdstan_model()`.

```{r}
model <- cmdstan_model(expgrowth_stan_file, include_paths = pcd_stan_path())
model
```

We can then sample from the model (we set `fixed_param = TRUE` here as our toy example doesn't require MCMC sampling).

```{r}
samples <- model$sample(chains = 1, fixed_param = TRUE)
samples
```

## Using Stan functions directly in R

Whilst it is possible to use Stan functions directly in R it is not recommended for most use cases (use the R functions in `primarycensoreddist` instead). However, it can be useful to understand what is going on under the hood or for exploration (indeed we use this internally in `primarycensoreddist` to check our functions against the R implementations). To do this we use the `expose_functions()` method on our already compiled model.

```{r}
```{r, message = FALSE}
model$expose_functions(global = TRUE)
```

Expand Down

0 comments on commit ded8bb9

Please sign in to comment.