Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 14: Fitting delay estimates using primarycensoreddist #33

Merged
merged 11 commits into from
Sep 4, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ Depends:
Suggests:
bookdown,
cmdstanr,
dplyr,
knitr,
ggplot2,
rmarkdown,
Expand Down
1 change: 1 addition & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@

* Added a getting started vignette.
* Added a vignette showcasing how to use the package Stan code with `cmdstanr`.
* Added a vignette showcasing how to fit distributions using the `cmdstanr` package.

# primarycensoreddist 0.1.0

Expand Down
3 changes: 3 additions & 0 deletions inst/stan/primary_censored_dist.stan
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,9 @@ real primary_censored_integrand(real x, real xc, array[] real theta,

// Compute adjusted delay
real d_adj = d - x;
if (d_adj <= 0) {
return 0;
}

// Compute log probabilities
real log_cdf = dist_lcdf(d_adj | params, dist_id);
Expand Down
2 changes: 2 additions & 0 deletions vignettes/fitting-dists-with-fitdistr.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -16,3 +16,5 @@ vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

*Under construction*
256 changes: 256 additions & 0 deletions vignettes/fitting-dists-with-stan.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -16,3 +16,259 @@ vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

# Introduction

## What are we going to do in this Vignette

In this vignette, we'll demonstrate how to use `primarycensoreddist` in conjunction with Stan for Bayesian inference of epidemiological delay distributions. We'll cover the following key points:

1. Simulating censored delay distribution data
2. Fitting a naive model using cmdstan
3. Evaluating the naive model's performance
4. Fitting an improved model using `primarycensoreddist` functionality
5. Comparing the `primarycensoreddist` model's performance to the naive model

## What might I need to know before starting

This vignette builds on the concepts introduced in the [Getting Started with primarycensoreddist](primarycensoreddist.html) vignette and assumes familiarity with using Stan tools as covered in the [How to use primarycensoreddist with Stan](using-stan-tools.html) vignette.

## Packages used in this vignette

Alongside the `primarycensoreddist` package we will use the `cmdstanr` package for interfacing with cmdstananalysis. We will also use the `ggplot2` package for plotting and `dplyr` for data manipulation.

```{r setup, message = FALSE}
library(primarycensoreddist)
library(cmdstanr)
library(ggplot2)
library(dplyr)
```

# Simulating censored and truncated delay distribution data

We'll start by simulating some censored and truncated delay distribution data. We'll use the `rprimarycensoreddist` function (actually we will use the `rpcens ` alias for brevity).

```{r sample-lognormal}
set.seed(123) # For reproducibility

# Define the true distribution parameters
n <- 1000
meanlog <- 1.5
sdlog <- 0.75

# Generate varying pwindow, swindow, and obs_time lengths
pwindows <- sample(1, n, replace = TRUE)
swindows <- sample(1, n, replace = TRUE)
obs_times <- 10

# Function to generate a single sample
generate_sample <- function(pwindow, swindow, obs_time) {
rpcens(
1, rlnorm,
meanlog = meanlog, sdlog = sdlog,
pwindow = pwindow, swindow = swindow, D = obs_time
)
}

# Generate samples
samples <- mapply(generate_sample, pwindows, swindows, obs_times)

# Create initial data frame
delay_data <- data.frame(
pwindow = pwindows,
swindow = swindows,
obs_time = obs_times,
observed_delay = samples
)

head(delay_data)

# Aggregate to unique combinations and count occurrences
# Aggregate to unique combinations and count occurrences
delay_counts <- delay_data |>
summarise(n = n(), .by = c(pwindow, swindow, obs_time, observed_delay))

head(delay_counts)

# Compare the samples with and without secondary censoring to the true
# distribution
# Calculate empirical CDF
empirical_cdf <- ecdf(samples)

# Create a sequence of x values for the theoretical CDF
x_seq <- seq(min(samples), max(samples), length.out = 100)

# Calculate theoretical CDF
theoretical_cdf <- plnorm(x_seq, meanlog = meanlog, sdlog = sdlog)

# Create a long format data frame for plotting
cdf_data <- data.frame(
x = rep(x_seq, 2),
probability = c(empirical_cdf(x_seq), theoretical_cdf),
type = rep(c("Observed", "Theoretical"), each = length(x_seq)),
stringsAsFactors = FALSE
)

# Plot
ggplot(cdf_data, aes(x = x, y = probability, color = type)) +
geom_step(linewidth = 1) +
scale_color_manual(
values = c(Observed = "#4292C6", Theoretical = "#252525")
) +
geom_vline(
aes(xintercept = mean(samples), color = "Observed"),
linetype = "dashed", linewidth = 1
) +
geom_vline(
aes(xintercept = exp(meanlog + sdlog^2 / 2), color = "Theoretical"),
linetype = "dashed", linewidth = 1
) +
labs(
title = "Comparison of Observed vs Theoretical CDF",
x = "Delay",
y = "Cumulative Probability",
color = "CDF Type"
) +
theme_minimal() +
theme(
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.position = "bottom"
)
```

We've aggregated the data to unique combinations of `pwindow`, `swindow`, and `obs_time` and counted the number of occurrences of each `observed_delay` for each combination. This is the data we will use to fit our model.

# Fitting a naive model using cmdstan

We'll start by fitting a naive model using cmdstan. We'll use the `cmdstanr` package to interface with cmdstan. We define the model in a string and then write it to a file as in the [How to use primarycensoreddist with Stan](using-stan-tools.html) vignette.

```{r naive-model}
writeLines(
"data {
int<lower=0> N; // number of observations
vector[N] y; // observed delays
vector[N] n; // number of occurrences for each delay
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
// Priors
mu ~ normal(1, 1);
sigma ~ normal(0.5, 1);

// Likelihood
target += n .* lognormal_lpdf(y | mu, sigma);
}",
con = file.path(tempdir(), "naive_lognormal.stan")
)
```

Now let's compile the model

```{r compile-naive-model}
naive_model <- cmdstan_model(file.path(tempdir(), "naive_lognormal.stan"))
```

aand now let's fit the compiled model.

```{r fit-naive-model, message = FALSE}
naive_fit <- naive_model$sample(
data = list(
# Add a small constant to avoid log(0)
y = delay_counts$observed_delay + 1e-6,
n = delay_counts$n,
N = nrow(delay_counts)
),
chains = 4,
parallel_chains = 4,
show_messages = FALSE,
refresh = 0
)
naive_fit
```

We see that the model has converged and the diagnostics look good. However, just from the model posterior summary we see that we might not be very happy with the fit. `mu` is smaller than the target `r meanlog` and `sigma` is larger than the target `r sdlog`.

# Fitting an improved model using primarycensoreddist

We'll now fit an improved model using the `primarycensoreddist` package. The main improvement is that we will use the `primary_censored_dist_lpdf` function to fit the model. This is the Stan version of the `pcens()` function and accounts for the primary and secondary censoring windows as well as the truncation. We encode that our primary distribution is a lognormal distribution by passing 1 as the `dist_id` parameter and that our primary event distribution is uniform by passing 1 as the `primary_dist_id` parameter. See the [Stan documentation](https://primarycensoreddist.epinowcast.org/stan/primary__censored__dist_8stan.html#acc97240dee1bc19e4f02013118f3857d) for more details on the `primary_censored_dist_lpdf` function.

```{r pimarycensoreddist-model}
writeLines(
"
functions {
#include primary_censored_dist.stan
#include expgrowth.stan
}
data {
int<lower=0> N; // number of observations
array[N] int<lower=0> y; // observed delays
array[N] int<lower=0> n; // number of occurrences for each delay
array[N] int<lower=0> pwindow; // primary censoring window
array[N] int<lower=0> swindow; // secondary censoring window
array[N] int<lower=0> D; // maximum delay
}
transformed data {
array[0] real primary_params;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
// Priors
mu ~ normal(1, 1);
sigma ~ normal(0.5, 0.5);

// Likelihood
for (i in 1:N) {
target += n[i] * primary_censored_dist_lpmf(
y[i] | 1, {mu, sigma},
pwindow[i], swindow[i], D[i],
1, primary_params
);
}
}",
con = file.path(tempdir(), "primarycensoreddist_lognormal.stan")
)
```

Now let's compile the model

```{r compile-primarycensoreddist-model, message = FALSE}
primarycensoreddist_model <- cmdstan_model(
file.path(tempdir(), "primarycensoreddist_lognormal.stan"),
include_paths = pcd_stan_path()
)
```

Now let's fit the compiled model.

```{r fit-primarycensoreddist-model, message = FALSE}
primarycensoreddist_fit <- primarycensoreddist_model$sample(
data = list(
y = delay_counts$observed_delay,
n = delay_counts$n,
pwindow = delay_counts$pwindow,
swindow = delay_counts$swindow,
D = delay_counts$obs_time,
N = nrow(delay_counts)
),
chains = 4,
parallel_chains = 4,
init = list( # we use this to resolve initialisation issues
list(mu = 1.5, sigma = 0.6),
list(mu = 1.5, sigma = 0.4),
list(mu = 1.5, sigma = 0.3),
list(mu = 1.5, sigma = 0.55)
),
refresh = 0,
show_messages = FALSE
)
primarycensoreddist_fit
```

We see that the model has converged and the diagnostics look good. We also see that the posterior means are very near the true parameters and the 90% credible intervals include the true parameters.
16 changes: 13 additions & 3 deletions vignettes/using-stan-tools.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@ In this vignette, we'll explore how to use the `primarycensoreddist` package in
Stan is a probabilistic programming language for statistical inference. It provides a flexible and efficient platform for Bayesian modeling and is widely used in various fields of data science and statistics. In this vignette, we'll use Stan in conjunction with `primarycensoreddist` to perform Bayesian inference on censored data.

For more information on Stan:

- [Stan's official website](https://mc-stan.org/)
- [Stan documentation](https://mc-stan.org/users/documentation/)
- [Stan forums](https://discourse.mc-stan.org/) for community support and discussions
Expand Down Expand Up @@ -108,25 +109,34 @@ writeLines(
text = c(
"functions {",
"#include expgrowth.stan",
"}",
"generated quantities {",
" real y = expgrowth_rng(0, 1, 0.4);",
"}"
),
con = expgrowth_stan_file
)
```
}
```

We can now use this file to compile a model. **Note** that we need to include the path to the `primarycensoreddist` Stan functions using the `include_paths` argument to `cmdstan_model()`.

```{r}
model <- cmdstan_model(expgrowth_stan_file, include_paths = pcd_stan_path())
model
```

We can then sample from the model (we set `fixed_param = TRUE` here as our toy example doesn't require MCMC sampling).

```{r}
samples <- model$sample(chains = 1, fixed_param = TRUE)
samples
```

## Using Stan functions directly in R

Whilst it is possible to use Stan functions directly in R it is not recommended for most use cases (use the R functions in `primarycensoreddist` instead). However, it can be useful to understand what is going on under the hood or for exploration (indeed we use this internally in `primarycensoreddist` to check our functions against the R implementations). To do this we use the `expose_functions()` method on our already compiled model.

```{r}
```{r, message = FALSE}
model$expose_functions(global = TRUE)
```

Expand Down