CaVEMan Algorithm

At each site of the genome CaVEMan calculates the probability of there being a somatic mutation given a pre-specified normal contamination, $c$ , of the tumour sample, a site specific copy number for both germline, $N_g$ , and tumour, $N_t$ , and also prior probabilities that the site has a somatic variant, $P(Som)$ , or germline variant, $P(SNP)$ .

At each of these sites the model considers the set of possible germline, $G_g$ , and tumour genotypes, $G_t$ , constructed from the reference allele $A$ and a single variant allele $B$ . We have three disjoint sets of joint genotypes:

${\Omega_s = { (G_g, G_t) : G_g=\underbrace{A \ldots A}{\text{$N_g$ copies.}},G_t\in{\underbrace{A \ldots A}{\text{$N_t-k$}}\underbrace{B \ldots B}_{\text{$k$}}:N_t\geq k > 0}} {Somatic Genotypes}}$

${\Omega_g = { (G_g, G_t) : G_g\in{\underbrace{A \ldots A}{\text{$N_g-k$}}\underbrace{B \ldots B}{\text{$k$}}:N_g\geq k > 0},G_t\in{\underbrace{A \ldots A}{\text{$N_t-k$}}\underbrace{B \ldots B}{\text{$k$}}:N_t\geq k \geq 0}} {SNP Genotypes}}$

${\Omega_r = { (G_g, G_t) : G_g=\underbrace{A \ldots A}{\text{$N_g$ copies.}},G_g=\underbrace{A \ldots A}{\text{$N_t$ copies.}}} {Reference Genotype}}$

The union of the above 3 sets, $\Omega$ , gives the set of possible genotypes considered by the model. We can then express prior probability of set membership probabilities in terms of our priors (assuming independence):

$P((G_g,G_t)\in\Omega_s)=P(Som)(1-P(SNP))$

$P((G_g,G_t)\in \Omega_g)=P(SNP)$

$P((G_g,G_t)\in\Omega_s)=(1-P(Som))(1-P(SNP))$

$P(G_g,G_t)&=&P(\Omega(G_g,G_t))P((G_g,G_t)|(G_g,G_t)\in \Omega(G_g,G_t))$

$[$ = \frac{P(\Omega(G_g,G_t))}{\left | \Omega(G_g,G_t) \right |}$]$

The central equation in the Caveman algorithm represents the probability of each joint genotype given the data $(D)$ and is calculated using the Bayes’rule:

$[$P(G_g,G_t|D)=\frac{P(D|G_g,G_t)P(G_g,G_t)}{\sum_{(G_g,G_t)\in\Omega}P(D|G_g,G_t)P(G_g,G_t)}$]$ (1)

We have already discussed how the prior is calculated, it now remains to calculate $(P(D|G_g,G_t))$ . We assume that for a given mapped read at position $(p)$ the true base $[$\hat{R}_p=x$]$ is called as base $(C)$ with probability $[$P(R_p=C|\hat{R}_p=x,\theta)$]$ given covariates $(\theta)$ . Caveman considers the following covariates:

lane/read group
read order
strand
mapping quality
base quality
read position (position within read)

It is worth noting that the first 4 covariates are defined at the level of the read and the last 2 covariates at the base call level.

The relevant data, $(D)$ , is assumed to be the pileup of reads at the specified genomic position for both the normal sample $[$D_{p}^{(n)}$]$ and the tumour sample $[$D_{p}^{(t)}$]$ . If we assume that errors in the called base are conditionally independent given the covariates then we can write:

$[P(D|G_g,G_t)=\prod_{R\in D_{p}^{(n)}}P^{(n)}(R_p=C|\hat{R}p,\theta)\prod{R\in D_{p}^{(t)}}P^{(t)}(R_p=C|\hat{R}_p,\theta)]$ (2)

For genotype, $(G)$ , let $[$\Psi(G)=\frac{B(G)}{A(G)+B(G)}=P(\hat{R}_p=B|G)$]$ then:

$[{P^{(n)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&P(R_p=C|\hat{R}_p=A,\theta)P(\hat{R}_p=A|G_g)+P(R_p=C|\hat{R}_p=B,\theta)P(\hat{R}_p=B|G_g)}]$

$[{\nonumber\ &=&P(R_p=C|\hat{R}_p=A,\theta)(1-\Psi(G_g))+P(R_p=C|\hat{R}_p=B,\theta)\Psi(G_g)\nonumber }]$

We can calculate the $[$P^{(t)}$]$ similarly, but need to take into account contamination of the tumour with normal cells.

$[{P^{(t)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&P(R_p=C|\hat{R}_p=A,\theta)P(\hat{R}_p=A|G_g,G_t)+P(R_p=C|\hat{R}_p=B,\theta)P(\hat{R}_p=B|G_g,G_t)}]$ (3)

Now the adjusted contamination, $[$\hat{c}=\frac{c N_g}{c N_t +(1-c) N_g}$]$ , represents the probability that a read from the tumour sample comes from a normal cell. So, for the reads (R) coming from the tumour sample we have:

$[{P(\hat{R}_p=A|G_g,G_t)&=&P(\hat{R}_p=A| R\in Normal,G_g)P(R \in Normal)+P(\hat{R}_p=A| R\in Tumour)P(R \in Tumour,G_t)}]$

$[&=&\hat{c}P(\hat{R}_p=A| R\in Normal,G_g)+(1-\hat{c})P(\hat{R}_p=A| R\in Tumour,G_t)]$

$[&=&\hat{c}(1-\Psi(G_g))+(1-\hat{c})(1-\Psi(G_t))]$

and similarly:

$[P(\hat{R}_p=B|G_g,G_t)&=&\hat{c}\Psi(G_g)+(1-\hat{c})\Psi(G_t)\nonumber]$

Substituting this into (3) and rearranging we have:

$[{P^{(t)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&\hat{c}P^{(n)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)+ \nonumber \ & &(1-\hat{c})((1-\Psi(G_t))P(R_p=C|\hat{R}_p=A,\theta)+\Psi(G_t)P(R_p=C|\hat{R}_p=B,\theta))}]$

We now have all the information required to calculate (2) and therefore (1) except we still need an expression for the so called profile probabilities $[$P(R_p=C|\hat{R}_p=x,\theta)$]$ .

Recall that at each position $(p)$ in the genome we have a set of overlapping reads $[$D_{p}^{(n)}$]$ and $[$D_{p}^{(t)}$]$ for normal and tumour respectively. Now each read $(R)$ at position $(p)$ has covariates $[$\theta(R,p)$]$ and we form an overall count $\chi$ of the number of times that each covariate appears across the genome:

$[\chi(R_p=C,\hat{R}p=R,\theta)=\sum{p\in Genome}{(\sum_{R\in D_{p}^{(n)}}|{R:\theta(R,p)=\theta}|)+(\sum_{R\in D_{p}^{(t)}}|{R:\theta(R,p)=\theta}|)}]$

This counts matrix is then converted into an empirical estimate of the profile probabilities:

$[P(R_p=C|\hat{R}_p=x,\theta)=\frac{\chi(R_p=C,\hat{R}p=x,\theta)}{\sum{C\in {a,c,g,t}}\chi(R_p=C,\hat{R}_p=x,\theta)}]$

Credit for this statistical writeup goes to Nick Williams

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CaVEMan Algorithm

Clone this wiki locally