Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 37 draft: Why it works vignette with analytic solutions #68

Merged
merged 35 commits into from
Sep 18, 2024
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
4b7a02c
Add Leung ref
SamuelBrand1 Sep 13, 2024
dc94f53
Create why-it-works.Rmd
SamuelBrand1 Sep 13, 2024
b060945
add cori et al ref
SamuelBrand1 Sep 16, 2024
0f591af
Add analytic solutions: Gamma
SamuelBrand1 Sep 16, 2024
5733e25
slight eq tweak
SamuelBrand1 Sep 16, 2024
37e7f18
linting
SamuelBrand1 Sep 16, 2024
ca16908
Add Leung ref
SamuelBrand1 Sep 13, 2024
c47c2d6
Create why-it-works.Rmd
SamuelBrand1 Sep 13, 2024
7bd8cf4
add cori et al ref
SamuelBrand1 Sep 16, 2024
fd2ac18
Add analytic solutions: Gamma
SamuelBrand1 Sep 16, 2024
9d039aa
slight eq tweak
SamuelBrand1 Sep 16, 2024
b05b238
linting
SamuelBrand1 Sep 16, 2024
06e1a16
Update vignettes/why-it-works.Rmd
seabbs Sep 16, 2024
e8a1d78
Update vignettes/why-it-works.Rmd
seabbs Sep 16, 2024
85c8667
Merge branch 'why-it-works-vignette' of https://github.com/SamuelBran…
SamuelBrand1 Sep 16, 2024
ea269fa
add self as author
SamuelBrand1 Sep 16, 2024
308141b
Update vignettes/why-it-works.Rmd
SamuelBrand1 Sep 17, 2024
ddd5c64
Update vignettes/why-it-works.Rmd
SamuelBrand1 Sep 17, 2024
dd79913
Update vignettes/why-it-works.Rmd
SamuelBrand1 Sep 17, 2024
8917703
add self as author in DESCRIPTION
SamuelBrand1 Sep 17, 2024
cba034b
change title
SamuelBrand1 Sep 17, 2024
e66a0ac
changed notation so that window width is lowercase
SamuelBrand1 Sep 17, 2024
dcfd976
ongoing rewrite
SamuelBrand1 Sep 17, 2024
694e16b
extra ref
SamuelBrand1 Sep 17, 2024
dd29b60
seqn updates
SamuelBrand1 Sep 18, 2024
991407a
reorganise equations
SamuelBrand1 Sep 18, 2024
3db8aeb
fix lognormal discrete
SamuelBrand1 Sep 18, 2024
27c98ed
add note to pkgdown.yml
SamuelBrand1 Sep 18, 2024
5fa6b9b
extra refs
SamuelBrand1 Sep 18, 2024
917ab7e
Update WORDLIST
SamuelBrand1 Sep 18, 2024
6120812
Merge branch 'main' into why-it-works-vignette
seabbs Sep 18, 2024
62ec9ed
add a "pre-intro" section
SamuelBrand1 Sep 18, 2024
05daf84
Merge branch 'why-it-works-vignette' of https://github.com/SamuelBran…
SamuelBrand1 Sep 18, 2024
9df87c0
Update NEWS.md
SamuelBrand1 Sep 18, 2024
8059534
Apply suggestions from code review
seabbs Sep 18, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions vignettes/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
/.quarto/
22 changes: 22 additions & 0 deletions vignettes/library.bib
Original file line number Diff line number Diff line change
Expand Up @@ -28,3 +28,25 @@ @ARTICLE{Park2024
year = 2024,
doi = "10.1101/2024.01.12.24301247"
}

@article{leung1997censoring,
title={Censoring issues in survival analysis},
author={Leung, Kwan-Moon and Elashoff, Robert M and Afifi, Abdelmonem A},
journal={Annual review of public health},
volume={18},
number={1},
pages={83--104},
year={1997},
publisher={Annual Reviews 4139 El Camino Way, PO Box 10139, Palo Alto, CA 94303-0139, USA}
}

@article{cori2013new,
title={A new framework and software to estimate time-varying reproduction numbers during epidemics},
author={Cori, Anne and Ferguson, Neil M and Fraser, Christophe and Cauchemez, Simon},
journal={American journal of epidemiology},
volume={178},
number={9},
pages={1505--1512},
year={2013},
publisher={Oxford University Press}
}
191 changes: 191 additions & 0 deletions vignettes/why-it-works.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,191 @@
---
title: "Interval censoring and truncation in epidemiological delay modelling"
author: "Samuel Brand"
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved
format: html
bibliography: library.bib
---

# Introduction {#introduction}

Time to event analysis, also known as survival analysis, concerns estimating the distribution of delay times between events. A distinctive feature of the field are the methodological techniques used to deal with the missing data problems common in data sets of delay times [@leung1997censoring]. In `primarycensoreddist` we focus on two particular missing data problems:
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved

- **Interval censoring**. The primary (start) time and/or the secondary (end) time of the delay is unobserved but known to be within an interval.
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved
- **Truncation**. Delay data items are only contained in the data set conditional on aspect of the data point, for example if the secondary event is the observation time which add the data item to the data set.
seabbs marked this conversation as resolved.
Show resolved Hide resolved

In statistical epidemiology, these missing data problems occur frequently in both data analysis and theoretical modelling.

In data analysis, events in epidemiology are commonly reported as occurring in a particular day or week (*interval censoring*). Also, in an emergency situation data analysis can be happening on datasets that are streaming in real time, which introduces biases around which data items have been observed (*truncation*).
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved

In theoretical epidemiological modelling, it is often appropriate to model the evolution of an infectious disease as occurring in discrete time, for example in the [`EpiNow2`](https://epiforecasts.io/EpiNow2/) modelling package. This means appropriately discretising continuous distributions, such as the generation interval distribution. In `primarycensoreddist` we treat the discretisation of intrinsically continuous distributions as an *interval censoring* problem.
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved

# Statistical model used in `primarycensoreddist`

As described in [Introduction](#introduction), `primarycensoreddist` focuses on a subset of methods from time to event analysis that address data missingness problems commonly found in epidemiological datasets. We present the statistical problem as a double interval censoring problem, where both the primary event time and the secondary event times are interval censored. We can recover single interval censoring problems by reducing one of the intervals to a point.

## Survival function of secondary event time after primary event window

A key assumption we make is that the censoring window for events is known and independent of the event time. This is known as non-informative censoring. It is useful to consider the time from the end (right) point of the primary censoring interval to the secondary time as a random variable,
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved

$$
S_+ = T - C_P.
$$

Where $T$ is the delay distribution of interest and $C_P$ is interval between the end (right) point of the primary censoring window and the primary event time; note that by definition $C_P$ is not observed.

With non-informative censoring, it is possible to derive the upper distribution function of $S_+$, or _survival function_ of $S_+$, from the distribution of $T$ and the distribution of $C_P$.
seabbs marked this conversation as resolved.
Show resolved Hide resolved

$$
Q_{S_+}(t) = \mathbb{E}_{C_P} \Big[Q_T(t + C_P)\Big] = \int_0^{W_P} Q_T(t + p) c(p) dp.
$$

Using integration by parts gives:
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved
$$
Q_{S_+}(t) = Q_T(t + W_P) + \int_0^{W_P} f_T(t+p) C(p) dp.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing f_T def

$$

Where $c$ is the density function of delay from the primary event to the end (right) point of the primary censoring window with distribution function $C$, $Q_T$ is the survival function of the actual delay distribution of interest and $W_P$ is the length of the primary censoring window.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To link across with the docs I think here we should connect it to the formulation give in the getting started clearly and the same goes for the pmf section below


## Probability mass of secondary event time within a secondary censoring window

The probability of the secondary event time falling within a secondary censoring window of length $W_S$ that begins at time $n$ _after_ the primary censoring window is then given by:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say something like with this in hand, all very standard blah blah (and link to something - maybe stan docs on censoring)


$$
Pr(S_+ \in [n, n + W_S)) = Q_{S_+}(n) - Q_{S_+}(n + W_S).
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved
$$

Which can in general be calculated by numerical quadrature.

Note that the secondary event time can also occur within the primary censoring window. This happens with probability,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link to getting started

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this currently feels like somewhat of a random aside. If it is can it go after the main thrust of your point and some text added to explain why we care.

$$
Q_{S_+}(-W_P) - Q_{S_+}(0) = 1 - Q_T(W_P) - \int_0^{W_P} f_T(p) C(p) dp = Pr(T< C).
$$

Using a difference operator defined as:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this gain you apart from 74? line 74 doesn't seem that useful?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It pops up a lot in the analytic solutions, I wasn't going to add new notation but then I realised that for the uniform primary censor window to basic delay pmf comes in a standard form.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe introduce it in the analytical section then (perhaps even as a helpful ideas section or similar)


$$
\Delta_{W}f(t) = f(t + W) - f(t).
$$

The common case where every primary and secondary event window is a single time unit, $W_P = W_S = 1$, simplifies the above expressions to:

$$
\begin{aligned}
f_n = Pr(S_+ \in [n, n + 1)) &= Q_{S_+}(n) - Q_{S_+}(n + 1), \\
& = - \Delta_1 Q_{S_+}(n),\qquad n = -1, 0,\dots.
\end{aligned}
$$
Which is a discrete probability mass function of the secondary event time relative to the primary event time in units of the event window.
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved

seabbs marked this conversation as resolved.
Show resolved Hide resolved
## Analytic solutions for exponentially tilted primary event times
SamuelBrand1 marked this conversation as resolved.
Show resolved Hide resolved

In epidemiological analysis, it is common for primary events to be arriving at exponentially increasing or decreasing rates, for example, incidence of new infections in a growing epidemic. In this case, the distribution of the primary event time within its censoring window is biased by the exponential growth or decay. If we assume a reference uniform distribution within a primary censoring window $[k, k + W_P)$ then the distribution of the primary event time within the censoring window is the exponential tilted uniform distribution:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

link to our implementation? (i.e code docs [rexpgrowth()].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add ref to exponetial titled uniform

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit more on what this means (i.e no circadian pattern etc)


$$
f_P(t) \propto \exp(r t) \mathbb{1}_{[k, k + W_P]}(t).
$$

In this case, the distribution function for $C_P$, that is the length of time left in the primary censor window _after_ the primary event time, is given by:

$$
C(p) = { 1 - \exp(-r p) \over 1 - \exp(-r W_P)}, \qquad p \in [0, W_P].
$$

We consider two situations: 1) uniform primary event time ($r=0$) and 2) non-uniform primary event time ($r\neq0$).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 is still a WIP I would assume (and fair enough) For now shall we drop just so we can merge this sooner rather than later and add in 2. via a new issue and PR

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. That makes sense. Mostly, its not hard to extend solutions to situation 2) except that for log-normal dist where the maths is really helped by cancelling the $\frac{1}{z}$ factor in the pdf.


### Uniform primary event time ($r=0$)

In this case, the primary event time is uniformly distributed within the primary censoring window. The survival function of $S_+$ is then:

$$
Q_{S_+}(t) = Q_T(t + W_P) + { 1 \over W_P} \int_0^{W_P} f_T(t+p) p~ dp.
$$
This is analytically solvable whenever the upper distribution function of $T$ is known and the mean of $T$ is analytically solvable from its integral definition.

In each case it is easier to change the integration variable:

$$
\begin{aligned}
Q_{S_+}(t) &= Q_T(t + W_P) + { 1 \over W_P} \int_t^{t+W_P} f_T(z) (z-t)~ dz \\
&= Q_T(t + W_P) + { 1 \over W_P} \Big[ \int_t^{t+W_P} f_T(z) z~ dz - t \Delta_{W_P}F_T(t) \Big].
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by parts blah blah

\end{aligned}
$$


seabbs marked this conversation as resolved.
Show resolved Hide resolved
#### Gamma distribution

The Gamma distribution has the density function:

$$
f_T(z;k, \theta) = {1 \over \Gamma(k) \theta^k} z^{k-1} \exp(-z/\theta).
seabbs marked this conversation as resolved.
Show resolved Hide resolved
$$

This gives a solution to the survival function of $S_+$ in terms of analytically available functions:

$$
\begin{aligned}
Q_{S_+}(t; k, \theta) &= Q_T(t + W_P; k, \theta) + { 1 \over W_P} \left( {1 \over \Gamma(k) \theta^k} \int_t^{t+W_P} z^{k-1} \exp(-z/\theta) (z-t)~ dz \right), \\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not start from the last derived section here?

&= Q_T(t + W_P; k, \theta) + { 1 \over W_P} \big[ k \theta \Delta_{W_P}F_T(t; k+1, \theta) - t \Delta_{W_P}F_T(t; k, \theta) \big].
\end{aligned}
$$

Where we have used the standard Gamma integration trick of rewriting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref (wikipedia)

$$
{z z^{k-1} \over \Gamma(k) \theta^k} = {k \theta z^k \over\Gamma(k+1) \theta^{k+1} }.
$$

In the special case of the equal event window then the discrete delay distribution is:

$$
\begin{aligned}
f_n &= (n+2) F_T(n+2; k, \theta) + n F_T(n; k, \theta) - 2 (n+1) F_T(n+1; k, \theta) \\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a few more steps here for people would be great.

&+ k \theta \Big( 2 F_T(n+1; k+1, \theta) - F_T(n; k+1, \theta) - F_T(n+2; k+1,\theta) \Big).
\end{aligned}
$$

Which was also found by Cori _et al_ [@cori2013new].

#### Log-Normal distribution

The Log-Normal distribution has the density function:

$$
f_T(z;\mu, \sigma) = {1 \over z \sigma \sqrt{2\pi}} \exp\left( - {(\log(z) - \mu)^2 \over 2 \sigma^2} \right).
$$
And distribution function:

$$
F_T(z;\mu, \sigma) = \Phi\left( {\log(z) - \mu \over \sigma} \right).
seabbs marked this conversation as resolved.
Show resolved Hide resolved
$$
Where $\Phi$ is the standard normal distribution function.

This gives a solution to the survival function of $S_+$ in terms of analytically available functions:

$$
\begin{aligned}
Q_{S+}(t ;\mu, \sigma) &= Q_T(t + W_P;\mu, \sigma) + { 1 \over W_P} \int_t^{t+W_P} {1 \over z \sigma \sqrt{2\pi}} \exp\left( - {(\log(z) - \mu)^2 \over 2 \sigma^2} \right) (z-t)~ dz, \\
&= Q_T(t + W_P;\mu, \sigma) + { 1 \over W_P} \Big[ e^{\mu + \frac{1}{2} \sigma^2} \Delta_{W_P}F_T(t; \mu + \sigma^2, \sigma) - t\Delta_{W_P}F_T(t; \mu, \sigma) \Big]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just go directly to this line and say subbing into blah blah previous equation

\end{aligned}
$$

Where we have used the Log-Normal integration trick of making a further substitution $y = (\ln z - \mu) / \sigma$ and using the shift property of the standard normal distribution function. This gives:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. cite

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also subbing into to equation blah blah blah

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't actually know a citation for this. I put "The log-normal integration trick" as a bit of a placeholder. I'll try and find a ref, its pretty obvs once you spot it.


$$
\begin{aligned}
\int_t^{t+W_P} z~ f_T(z; \mu, \sigma) dz &= {1 \over \sigma \sqrt{2\pi}} \int_{(\ln t - \mu)/\sigma}^{(\ln(t+W_P) - \mu)/\sigma} e^{\sigma y + \mu} e^{-y^2/2} dy\\
&= e^{\mu + \frac{1}{2} \sigma^2} \int_{(\ln t - \mu)/\sigma}^{(\ln(t+W_P) - \mu)/\sigma} e^{-(y- \sigma)^2/2} dy \\
&= e^{\mu + \frac{1}{2} \sigma^2} \Big[\Phi\Big({\ln(t+W_P) - \mu \over \sigma} - \sigma\Big) - \Phi\Big({\ln(t) - \mu \over \sigma} - \sigma\Big) \Big]\\
&= e^{\mu + \frac{1}{2} \sigma^2} \Delta_{W_P}F_T(t; \mu + \sigma^2, \sigma).
\end{aligned}
$$

In the special case of the equal event window then the discrete delay distribution is:

$$
\begin{aligned}
f_n &= (n+2) F_T(n+2; \mu, \sigma) + n F_T(n; \mu, \sigma) - 2 (n+1) F_T(n+1; \mu, \sigma) \\
&+ e^{\mu + \frac{1}{2} \sigma^2} \Big( 2 F_T(n+1; \mu + \sigma^2, \sigma) - F_T(n; \mu + \sigma^2, \sigma) - F_T(n+2; \mu + \sigma^2, \sigma) \Big).
\end{aligned}
$$


## References
Loading