CaVEMan Algorithm

At each site of the genome CaVEMan calculates the probability of there being a somatic mutation given a pre-specified normal contamination, , of the tumour sample, a site specific copy number for both germline, N_g , and tumour, N_t , and also prior probabilities that the site has a somatic variant, P(Som) , or germline variant, P(SNP) .

At each of these sites the model considers the set of possible germline, G_g , and tumour genotypes, G_t , constructed from the reference allele and a single variant allele . We have three disjoint sets of joint genotypes:

${\Omega_s = { (G_g, G_t) : G_g=\underbrace{A \ldots A}{\text{$N_g$ copies.}},G_t\in{\underbrace{A \ldots A}{\text{$N_t-k$}}\underbrace{B \ldots B}_{\text{$k$}}:N_t\geq k > 0}} {Somatic Genotypes}}$

${\Omega_g = { (G_g, G_t) : G_g\in{\underbrace{A \ldots A}{\text{$N_g-k$}}\underbrace{B \ldots B}{\text{$k$}}:N_g\geq k > 0},G_t\in{\underbrace{A \ldots A}{\text{$N_t-k$}}\underbrace{B \ldots B}{\text{$k$}}:N_t\geq k \geq 0}} {SNP Genotypes}}$

${\Omega_r = { (G_g, G_t) : G_g=\underbrace{A \ldots A}{\text{$N_g$ copies.}},G_g=\underbrace{A \ldots A}{\text{$N_t$ copies.}}} {Reference Genotype}}$

The union of the above 3 sets, Omega , gives the set of possible genotypes considered by the model. We can then express prior probability of set membership probabilities in terms of our priors:

$P((G_g,G_t)\in\Omega_s)=P(Som)(1-P(SNP))$

$P((G_g,G_t)\in \Omega_g)=P(SNP)$

$P((G_g,G_t)\in\Omega_s)=(1-P(Som))(1-P(SNP))$

$P(G_g,G_t)&=&P(\Omega(G_g,G_t))P((G_g,G_t)|(G_g,G_t)\in \Omega(G_g,G_t))$

$[ = \frac{P(\Omega(G_g,G_t))}{\left | \Omega(G_g,G_t) \right |}]$

The central equation in the Caveman algorithm represents the probability of each joint genotype given the data (D) and is calculated using the Bayes’ rule:

$[$P(G_g,G_t|D)=\frac{P(D|G_g,G_t)P(G_g,G_t)}{\sum_{(G_g,G_t)\in\Omega}P(D|G_g,G_t)P(G_g,G_t)}$]$ (1)

We have already discussed how the prior is calculated, it now remains to calculate [P(D|G_g,G_t)] . We assume that for a given mapped read at position (p) the true base $[\hat{R}_p=x]$ is called as base (C) with probability $[$P(R_p=C|\hat{R}_p=x,\theta)$]$ given covariates $(\theta)$ . Caveman considers the following covariates:

lane/read group
read order
strand
mapping quality
base quality
read position (position within read)

It is worth noting that the first 4 covariates are defined at the level of the read and the last 2 covariates at the base call level.

The relevant data, (D) , is assumed to be the pileup of reads at the specified genomic position for both the normal sample $[$D_{p}^{(n)}$]$ and the tumour sample $[$D_{p}^{(t)}$]$ . If we assume that errors in the called base are conditionally independent given the covariates then we can write:

$[P(D|G_g,G_t)=\prod_{R\in D_{p}^{(n)}}P^{(n)}(R_p=C|\hat{R}p,\theta)\prod{R\in D_{p}^{(t)}}P^{(t)}(R_p=C|\hat{R}_p,\theta)]$ (2)

For genotype, (G) , let $[$\Psi(G)=\frac{B(G)}{A(G)+B(G)}=P(\hat{R}_p=B|G)$]$ then:

$[{P^{(n)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&P(R_p=C|\hat{R}_p=A,\theta)P(\hat{R}_p=A|G_g)+P(R_p=C|\hat{R}_p=B,\theta)P(\hat{R}_p=B|G_g)}]$

$[{\nonumber\ &=&P(R_p=C|\hat{R}_p=A,\theta)(1-\Psi(G_g))+P(R_p=C|\hat{R}_p=B,\theta)\Psi(G_g)\nonumber }]$

We can calculate the $[$P^{(t)}$]$ similarly, but need to take into account contamination of the tumour with normal cells.

$[{P^{(t)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&P(R_p=C|\hat{R}_p=A,\theta)P(\hat{R}_p=A|G_g,G_t)+P(R_p=C|\hat{R}_p=B,\theta)P(\hat{R}_p=B|G_g,G_t)}]$ (3)

Now the adjusted contamination, $[$\hat{c}=\frac{c N_g}{c N_g +(1-c) N_t}$]$ , represents the probability that a read from the tumour sample comes from a normal cell. So, for the reads (R) coming from the tumour sample we have:

$[{P(\hat{R}_p=A|G_g,G_t)&=&P(\hat{R}_p=A| R\in Normal,G_g)P(R \in Normal)+P(\hat{R}_p=A| R\in Tumour)P(R \in Tumour,G_t)}]$

$[&=&\hat{c}P(\hat{R}_p=A| R\in Normal,G_g)+(1-\hat{c})P(\hat{R}_p=A| R\in Tumour,G_t)]$

$[&=&\hat{c}(1-\Psi(G_g))+(1-\hat{c})(1-\Psi(G_t))]$

and similarly:

$[P(\hat{R}_p=B|G_g,G_t)&=&\hat{c}\Psi(G_g)+(1-\hat{c})\Psi(G_t)\nonumber]$

Substituting this into (3) and rearranging we have:

$[{P^{(t)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&\hat{c}P^{(n)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)+ \nonumber \ & &(1-\hat{c})((1-\Psi(G_t))P(R_p=C|\hat{R}_p=A,\theta)+\Psi(G_t)P(R_p=C|\hat{R}_p=B,\theta))}]$

We now have all the information required to calculate (2) and therefore (1) except we still need an expression for the so called profile probabilities $[$P(R_p=C|\hat{R}_p=x,\theta)$]$ .

Recall that at each position (p) in the genome we have a set of overlapping reads $[D_{p}^{(n)}]$ and $[D_{p}^{(t)}]$ for normal and tumour respectively. Now each read (R) at position (p) has covariates $[\theta(R,p)]$ and we form an overall count $\chi$ of the number of times that each covariate appears across the genome:

$[\chi(R_p=C,\hat{R}p=R,\theta)=\sum{p\in Genome}{(\sum_{R\in D_{p}^{(n)}}|{R:\theta(R,p)=\theta}|)+(\sum_{R\in D_{p}^{(t)}}|{R:\theta(R,p)=\theta}|)}]$

This counts matrix is then converted into an empirical estimate of the profile probabilities:

$[P(R_p=C|\hat{R}_p=x,\theta)=\frac{\chi(R_p=C,\hat{R}p=x,\theta)}{\sum{C\in {a,c,g,t}}\chi(R_p=C,\hat{R}_p=x,\theta)}]$

Credit for this statistical writeup goes to Nick Williams

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CaVEMan Algorithm

Clone this wiki locally