Issue 14: Fitting delay estimates using primarycensoreddist (#33)

* start by improving the stan vignette * add data simulatio * first working version * complete first draft * update plot theme * solve lint issue * resolve duplicate chunk label * resolve duplicate chunk label * resolve duplicate chunk label * add in missing dplyr * tidy up new vignette
epinowcast · Sep 4, 2024 · ded8bb9 · ded8bb9
1 parent a088846
commit ded8bb9
Show file tree

Hide file tree

Showing 6 changed files with 276 additions and 3 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -20,6 +20,7 @@ Depends:
 Suggests:
     bookdown,
     cmdstanr,
+    dplyr,
     knitr,
     ggplot2,
     rmarkdown,

diff --git a/NEWS.md b/NEWS.md
@@ -10,6 +10,7 @@
 
 * Added a getting started vignette.
 * Added a vignette showcasing how to use the package Stan code with `cmdstanr`.
+* Added a vignette showcasing how to fit distributions using the `cmdstanr` package.
 
 # primarycensoreddist 0.1.0
 

diff --git a/inst/stan/primary_censored_dist.stan b/inst/stan/primary_censored_dist.stan
@@ -126,6 +126,9 @@ real primary_censored_integrand(real x, real xc, array[] real theta,
 
   // Compute adjusted delay
   real d_adj = d - x;
+  if (d_adj <= 0) {
+    return 0;
+  }
 
   // Compute log probabilities
   real log_cdf = dist_lcdf(d_adj | params, dist_id);

diff --git a/vignettes/fitting-dists-with-fitdistr.Rmd b/vignettes/fitting-dists-with-fitdistr.Rmd
@@ -16,3 +16,5 @@ vignette: >
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
+
+*Under construction*
diff --git a/vignettes/fitting-dists-with-stan.Rmd b/vignettes/fitting-dists-with-stan.Rmd
@@ -16,3 +16,259 @@ vignette: >
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
+
+# Introduction
+
+## What are we going to do in this Vignette
+
+In this vignette, we'll demonstrate how to use `primarycensoreddist` in conjunction with Stan for Bayesian inference of epidemiological delay distributions. We'll cover the following key points:
+
+1. Simulating censored delay distribution data
+2. Fitting a naive model using cmdstan
+3. Evaluating the naive model's performance
+4. Fitting an improved model using `primarycensoreddist` functionality
+5. Comparing the `primarycensoreddist` model's performance to the naive model
+
+## What might I need to know before starting
+
+This vignette builds on the concepts introduced in the [Getting Started with primarycensoreddist](primarycensoreddist.html) vignette and assumes familiarity with using Stan tools as covered in the [How to use primarycensoreddist with Stan](using-stan-tools.html) vignette.
+
+## Packages used in this vignette
+
+Alongside the `primarycensoreddist` package we will use the `cmdstanr` package for interfacing with cmdstananalysis. We will also use the `ggplot2` package for plotting and `dplyr` for data manipulation.
+
+```{r setup, message = FALSE}
+library(primarycensoreddist)
+library(cmdstanr)
+library(ggplot2)
+library(dplyr)
+```
+
+# Simulating censored and truncated delay distribution data
+
+We'll start by simulating some censored and truncated delay distribution data. We'll use the `rprimarycensoreddist` function (actually we will use the `rpcens ` alias for brevity).
+
+```{r sample-lognormal}
+set.seed(123) # For reproducibility
+
+# Define the true distribution parameters
+n <- 1000
+meanlog <- 1.5
+sdlog <- 0.75
+
+# Generate varying pwindow, swindow, and obs_time lengths
+pwindows <- sample(1, n, replace = TRUE)
+swindows <- sample(1, n, replace = TRUE)
+obs_times <- 10
+
+# Function to generate a single sample
+generate_sample <- function(pwindow, swindow, obs_time) {
+  rpcens(
+    1, rlnorm,
+    meanlog = meanlog, sdlog = sdlog,
+    pwindow = pwindow, swindow = swindow, D = obs_time
+  )
+}
+
+# Generate samples
+samples <- mapply(generate_sample, pwindows, swindows, obs_times)
+
+# Create initial data frame
+delay_data <- data.frame(
+  pwindow = pwindows,
+  swindow = swindows,
+  obs_time = obs_times,
+  observed_delay = samples
+)
+
+head(delay_data)
+
+# Aggregate to unique combinations and count occurrences
+# Aggregate to unique combinations and count occurrences
+delay_counts <- delay_data |>
+  summarise(n = n(), .by = c(pwindow, swindow, obs_time, observed_delay))
+
+head(delay_counts)
+
+# Compare the samples with and without secondary censoring to the true
+# distribution
+# Calculate empirical CDF
+empirical_cdf <- ecdf(samples)
+
+# Create a sequence of x values for the theoretical CDF
+x_seq <- seq(min(samples), max(samples), length.out = 100)
+
+# Calculate theoretical CDF
+theoretical_cdf <- plnorm(x_seq, meanlog = meanlog, sdlog = sdlog)
+
+# Create a long format data frame for plotting
+cdf_data <- data.frame(
+  x = rep(x_seq, 2),
+  probability = c(empirical_cdf(x_seq), theoretical_cdf),
+  type = rep(c("Observed", "Theoretical"), each = length(x_seq)),
+  stringsAsFactors = FALSE
+)
+
+# Plot
+ggplot(cdf_data, aes(x = x, y = probability, color = type)) +
+  geom_step(linewidth = 1) +
+  scale_color_manual(
+    values = c(Observed = "#4292C6", Theoretical = "#252525")
+  ) +
+  geom_vline(
+    aes(xintercept = mean(samples), color = "Observed"),
+    linetype = "dashed", linewidth = 1
+  ) +
+  geom_vline(
+    aes(xintercept = exp(meanlog + sdlog^2 / 2), color = "Theoretical"),
+    linetype = "dashed", linewidth = 1
+  ) +
+  labs(
+    title = "Comparison of Observed vs Theoretical CDF",
+    x = "Delay",
+    y = "Cumulative Probability",
+    color = "CDF Type"
+  ) +
+  theme_minimal() +
+  theme(
+    panel.grid.minor = element_blank(),
+    plot.title = element_text(hjust = 0.5),
+    legend.position = "bottom"
+  )
+```
+
+We've aggregated the data to unique combinations of `pwindow`, `swindow`, and `obs_time` and counted the number of occurrences of each `observed_delay` for each combination. This is the data we will use to fit our model.
+
+# Fitting a naive model using cmdstan
+
+We'll start by fitting a naive model using cmdstan. We'll use the `cmdstanr` package to interface with cmdstan. We define the model in a string and then write it to a file as in the [How to use primarycensoreddist with Stan](using-stan-tools.html) vignette.
+
+```{r naive-model}
+writeLines(
+  "data {
+    int<lower=0> N;  // number of observations
+    vector[N] y;     // observed delays
+    vector[N] n;     // number of occurrences for each delay
+  }
+  parameters {
+    real mu;
+    real<lower=0> sigma;
+  }
+  model {
+    // Priors
+    mu ~ normal(1, 1);
+    sigma ~ normal(0.5, 1);
+
+    // Likelihood
+    target += n .* lognormal_lpdf(y | mu, sigma);
+  }",
+  con = file.path(tempdir(), "naive_lognormal.stan")
+)
+```
+
+Now let's compile the model
+
+```{r compile-naive-model}
+naive_model <- cmdstan_model(file.path(tempdir(), "naive_lognormal.stan"))
+```
+
+aand now let's fit the compiled model.
+
+```{r fit-naive-model, message = FALSE}
+naive_fit <- naive_model$sample(
+  data = list(
+    # Add a small constant to avoid log(0)
+    y = delay_counts$observed_delay + 1e-6,
+    n = delay_counts$n,
+    N = nrow(delay_counts)
+  ),
+  chains = 4,
+  parallel_chains = 4,
+  show_messages = FALSE,
+  refresh = 0
+)
+naive_fit
+```
+
+We see that the model has converged and the diagnostics look good. However, just from the model posterior summary we see that we might not be very happy with the fit. `mu` is smaller than the target `r meanlog` and `sigma` is larger than the target `r sdlog`.
+
+# Fitting an improved model using primarycensoreddist
+
+We'll now fit an improved model using the `primarycensoreddist` package. The main improvement is that we will use the `primary_censored_dist_lpdf` function to fit the model. This is the Stan version of the `pcens()` function and accounts for the primary and secondary censoring windows as well as the truncation. We encode that our primary distribution is a lognormal distribution by passing 1 as the `dist_id` parameter and that our primary event distribution is uniform by passing 1 as the `primary_dist_id` parameter. See the [Stan documentation](https://primarycensoreddist.epinowcast.org/stan/primary__censored__dist_8stan.html#acc97240dee1bc19e4f02013118f3857d) for more details on the `primary_censored_dist_lpdf` function.
+
+```{r pimarycensoreddist-model}
+writeLines(
+  "
+  functions {
+    #include primary_censored_dist.stan
+    #include expgrowth.stan
+  }
+  data {
+    int<lower=0> N;  // number of observations
+    array[N] int<lower=0> y;     // observed delays
+    array[N] int<lower=0> n;     // number of occurrences for each delay
+    array[N] int<lower=0> pwindow; // primary censoring window
+    array[N] int<lower=0> swindow; // secondary censoring window
+    array[N] int<lower=0> D; // maximum delay
+  }
+  transformed data {
+    array[0] real primary_params;
+  }
+  parameters {
+    real mu;
+    real<lower=0> sigma;
+  }
+  model {
+    // Priors
+    mu ~ normal(1, 1);
+    sigma ~ normal(0.5, 0.5);
+
+    // Likelihood
+    for (i in 1:N) {
+      target += n[i] * primary_censored_dist_lpmf(
+        y[i] | 1, {mu, sigma},
+        pwindow[i], swindow[i], D[i],
+        1, primary_params
+      );
+    }
+  }",
+  con = file.path(tempdir(), "primarycensoreddist_lognormal.stan")
+)
+```
+
+Now let's compile the model
+
+```{r compile-primarycensoreddist-model, message = FALSE}
+primarycensoreddist_model <- cmdstan_model(
+  file.path(tempdir(), "primarycensoreddist_lognormal.stan"),
+  include_paths = pcd_stan_path()
+)
+```
+
+Now let's fit the compiled model.
+
+```{r fit-primarycensoreddist-model, message = FALSE}
+primarycensoreddist_fit <- primarycensoreddist_model$sample(
+  data = list(
+    y = delay_counts$observed_delay,
+    n = delay_counts$n,
+    pwindow = delay_counts$pwindow,
+    swindow = delay_counts$swindow,
+    D = delay_counts$obs_time,
+    N = nrow(delay_counts)
+  ),
+  chains = 4,
+  parallel_chains = 4,
+  init = list( # we use this to resolve initialisation issues
+    list(mu = 1.5, sigma = 0.6),
+    list(mu = 1.5, sigma = 0.4),
+    list(mu = 1.5, sigma = 0.3),
+    list(mu = 1.5, sigma = 0.55)
+  ),
+  refresh = 0,
+  show_messages = FALSE
+)
+primarycensoreddist_fit
+```
+
+We see that the model has converged and the diagnostics look good. We also see that the posterior means are very near the true parameters and the 90% credible intervals include the true parameters.
diff --git a/vignettes/using-stan-tools.Rmd b/vignettes/using-stan-tools.Rmd
@@ -34,6 +34,7 @@ In this vignette, we'll explore how to use the `primarycensoreddist` package in
 Stan is a probabilistic programming language for statistical inference. It provides a flexible and efficient platform for Bayesian modeling and is widely used in various fields of data science and statistics. In this vignette, we'll use Stan in conjunction with `primarycensoreddist` to perform Bayesian inference on censored data.
 
 For more information on Stan:
+
 - [Stan's official website](https://mc-stan.org/)
 - [Stan documentation](https://mc-stan.org/users/documentation/)
 - [Stan forums](https://discourse.mc-stan.org/) for community support and discussions
@@ -108,25 +109,34 @@ writeLines(
   text = c(
     "functions {",
     "#include expgrowth.stan",
+    "}",
+    "generated quantities {",
+    "  real y = expgrowth_rng(0, 1, 0.4);",
     "}"
   ),
   con = expgrowth_stan_file
 )
 ```
-}
-```
+
 We can now use this file to compile a model. **Note** that we need to include the path to the `primarycensoreddist` Stan functions using the `include_paths` argument to `cmdstan_model()`.
 
 ```{r}
 model <- cmdstan_model(expgrowth_stan_file, include_paths = pcd_stan_path())
 model
 ```
 
+We can then sample from the model (we set `fixed_param = TRUE` here as our toy example doesn't require MCMC sampling).
+
+```{r}
+samples <- model$sample(chains = 1, fixed_param = TRUE)
+samples
+```
+
 ## Using Stan functions directly in R
 
 Whilst it is possible to use Stan functions directly in R it is not recommended for most use cases (use the R functions in `primarycensoreddist` instead). However, it can be useful to understand what is going on under the hood or for exploration (indeed we use this internally in `primarycensoreddist` to check our functions against the R implementations). To do this we use the `expose_functions()` method on our already compiled model.
 
-```{r}
+```{r, message = FALSE}
 model$expose_functions(global = TRUE)
 ```