Skip to content
David Jones edited this page Jul 31, 2018 · 7 revisions

At each site of the genome CaVEMan calculates the probability of there being a somatic mutation given a pre-specified normal contamination, $c$, of the tumour sample, a site specific copy number for both germline, $N_g$, and tumour, $N_t$, and also prior probabilities that the site has a somatic variant, $P(Som)$, or germline variant, $P(SNP)$.

At each of these sites the model considers the set of possible germline, $G_g$, and tumour genotypes, $G_t$, constructed from the reference allele $A$ and a single variant allele $B$. We have three disjoint sets of joint genotypes:

${\Omega_s = { (G_g, G_t) : G_g=\underbrace{A \ldots A}{\text{$N_g$ copies.}},G_t\in{\underbrace{A \ldots A}{\text{$N_t-k$}}\underbrace{B \ldots B}_{\text{$k$}}:N_t\geq k > 0}} {Somatic Genotypes}}$

${\Omega_g = { (G_g, G_t) : G_g\in{\underbrace{A \ldots A}{\text{$N_g-k$}}\underbrace{B \ldots B}{\text{$k$}}:N_g\geq k > 0},G_t\in{\underbrace{A \ldots A}{\text{$N_t-k$}}\underbrace{B \ldots B}{\text{$k$}}:N_t\geq k \geq 0}} {SNP Genotypes}}$

${\Omega_r = { (G_g, G_t) : G_g=\underbrace{A \ldots A}{\text{$N_g$ copies.}},G_g=\underbrace{A \ldots A}{\text{$N_t$ copies.}}} {Reference Genotype}}$

The union of the above 3 sets, $\Omega$, gives the set of possible genotypes considered by the model. We can then express prior probability of set membership probabilities in terms of our priors (assuming independence):

$P((G_g,G_t)\in\Omega_s)=P(Som)(1-P(SNP))$

$P((G_g,G_t)\in \Omega_g)=P(SNP)$

$P((G_g,G_t)\in\Omega_s)=(1-P(Som))(1-P(SNP))$

$P(G_g,G_t)&=&P(\Omega(G_g,G_t))P((G_g,G_t)|(G_g,G_t)\in \Omega(G_g,G_t))$

[$ = \frac{P(\Omega(G_g,G_t))}{\left | \Omega(G_g,G_t) \right |}$]

The central equation in the Caveman algorithm represents the probability of each joint genotype given the data $(D)$ and is calculated using the Bayes’rule:

[$P(G_g,G_t|D)=\frac{P(D|G_g,G_t)P(G_g,G_t)}{\sum_{(G_g,G_t)\in\Omega}P(D|G_g,G_t)P(G_g,G_t)}$] (1)

We have already discussed how the prior is calculated, it now remains to calculate $(P(D|G_g,G_t))$. We assume that for a given mapped read at position $(p)$ the true base [$\hat{R}_p=x$] is called as base $(C)$ with probability [$P(R_p=C|\hat{R}_p=x,\theta)$] given covariates $(\theta)$. Caveman considers the following covariates:

  1. lane/read group
  2. read order
  3. strand
  4. mapping quality
  5. base quality
  6. read position (position within read)

It is worth noting that the first 4 covariates are defined at the level of the read and the last 2 covariates at the base call level.

The relevant data, $(D)$, is assumed to be the pileup of reads at the specified genomic position for both the normal sample [$D_{p}^{(n)}$] and the tumour sample [$D_{p}^{(t)}$]. If we assume that errors in the called base are conditionally independent given the covariates then we can write:

[P(D|G_g,G_t)=\prod_{R\in D_{p}^{(n)}}P^{(n)}(R_p=C|\hat{R}p,\theta)\prod{R\in D_{p}^{(t)}}P^{(t)}(R_p=C|\hat{R}_p,\theta)] (2)

For genotype, $(G)$, let [$\Psi(G)=\frac{B(G)}{A(G)+B(G)}=P(\hat{R}_p=B|G)$] then:

[{P^{(n)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&P(R_p=C|\hat{R}_p=A,\theta)P(\hat{R}_p=A|G_g)+P(R_p=C|\hat{R}_p=B,\theta)P(\hat{R}_p=B|G_g)}]

[{\nonumber\ &=&P(R_p=C|\hat{R}_p=A,\theta)(1-\Psi(G_g))+P(R_p=C|\hat{R}_p=B,\theta)\Psi(G_g)\nonumber }]

We can calculate the [$P^{(t)}$] similarly, but need to take into account contamination of the tumour with normal cells.

[{P^{(t)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&P(R_p=C|\hat{R}_p=A,\theta)P(\hat{R}_p=A|G_g,G_t)+P(R_p=C|\hat{R}_p=B,\theta)P(\hat{R}_p=B|G_g,G_t)}] (3)

Now the adjusted contamination, [$\hat{c}=\frac{c N_g}{c N_t +(1-c) N_g}$] , represents the probability that a read from the tumour sample comes from a normal cell. So, for the reads (R) coming from the tumour sample we have:

[{P(\hat{R}_p=A|G_g,G_t)&=&P(\hat{R}_p=A| R\in Normal,G_g)P(R \in Normal)+P(\hat{R}_p=A| R\in Tumour)P(R \in Tumour,G_t)}]

[&=&\hat{c}P(\hat{R}_p=A| R\in Normal,G_g)+(1-\hat{c})P(\hat{R}_p=A| R\in Tumour,G_t)]

[&=&\hat{c}(1-\Psi(G_g))+(1-\hat{c})(1-\Psi(G_t))]

and similarly:

[P(\hat{R}_p=B|G_g,G_t)&=&\hat{c}\Psi(G_g)+(1-\hat{c})\Psi(G_t)\nonumber]

Substituting this into (3) and rearranging we have:

[{P^{(t)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&\hat{c}P^{(n)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)+ \nonumber \ & &(1-\hat{c})((1-\Psi(G_t))P(R_p=C|\hat{R}_p=A,\theta)+\Psi(G_t)P(R_p=C|\hat{R}_p=B,\theta))}]

We now have all the information required to calculate (2) and therefore (1) except we still need an expression for the so called profile probabilities [$P(R_p=C|\hat{R}_p=x,\theta)$].

Recall that at each position $(p)$ in the genome we have a set of overlapping reads [$D_{p}^{(n)}$] and [$D_{p}^{(t)}$] for normal and tumour respectively. Now each read $(R)$ at position $(p)$ has covariates [$\theta(R,p)$] and we form an overall count $\chi$ of the number of times that each covariate appears across the genome:

[\chi(R_p=C,\hat{R}p=R,\theta)=\sum{p\in Genome}{(\sum_{R\in D_{p}^{(n)}}|{R:\theta(R,p)=\theta}|)+(\sum_{R\in D_{p}^{(t)}}|{R:\theta(R,p)=\theta}|)}]

This counts matrix is then converted into an empirical estimate of the profile probabilities:

[P(R_p=C|\hat{R}_p=x,\theta)=\frac{\chi(R_p=C,\hat{R}p=x,\theta)}{\sum{C\in {a,c,g,t}}\chi(R_p=C,\hat{R}_p=x,\theta)}]

Credit for this statistical writeup goes to Nick Williams

Clone this wiki locally