Skip to content
David Jones edited this page Mar 24, 2022 · 7 revisions

At each site of the genome CaVEMan calculates the probability of there being a somatic mutation given a pre-specified normal contamination, c, of the tumour sample, a site specific copy number for both germline, N_g, and tumour, N_t, and also prior probabilities that the site has a somatic variant, P(Som), or germline variant, P(SNP).

At each of these sites the model considers the set of possible germline, G_g, and tumour genotypes, G_t, constructed from the reference allele A and a single variant allele B. We have three disjoint sets of joint genotypes:

${\Omega_s = { (G_g, G_t) : G_g=\underbrace{A \ldots A}{\text{$N_g$ copies.}},G_t\in{\underbrace{A \ldots A}{\text{$N_t-k$}}\underbrace{B \ldots B}_{\text{$k$}}:N_t\geq k > 0}} {Somatic Genotypes}}$

${\Omega_g = { (G_g, G_t) : G_g\in{\underbrace{A \ldots A}{\text{$N_g-k$}}\underbrace{B \ldots B}{\text{$k$}}:N_g\geq k > 0},G_t\in{\underbrace{A \ldots A}{\text{$N_t-k$}}\underbrace{B \ldots B}{\text{$k$}}:N_t\geq k \geq 0}} {SNP Genotypes}}$

{\Omega_r = { (G_g, G_t) : G_g=\underbrace{A \ldots A}{\text{$N_g$ copies.}},G_g=\underbrace{A \ldots A}{\text{$N_t$ copies.}}} {Reference Genotype}}

The union of the above 3 sets, Omega, gives the set of possible genotypes considered by the model. We can then express prior probability of set membership probabilities in terms of our priors:

P((G_g,G_t)\in\Omega_s)=P(Som)(1-P(SNP))

P((G_g,G_t)\in \Omega_g)=P(SNP)

P((G_g,G_t)\in\Omega_s)=(1-P(Som))(1-P(SNP))

P(G_g,G_t)&=&P(\Omega(G_g,G_t))P((G_g,G_t)|(G_g,G_t)\in \Omega(G_g,G_t))

[ = \frac{P(\Omega(G_g,G_t))}{\left | \Omega(G_g,G_t) \right |}]

The central equation in the Caveman algorithm represents the probability of each joint genotype given the data (D) and is calculated using the Bayes’ rule:

[$P(G_g,G_t|D)=\frac{P(D|G_g,G_t)P(G_g,G_t)}{\sum_{(G_g,G_t)\in\Omega}P(D|G_g,G_t)P(G_g,G_t)}$] (1)

We have already discussed how the prior is calculated, it now remains to calculate [P(D|G_g,G_t)]. We assume that for a given mapped read at position (p) the true base [\hat{R}_p=x] is called as base (C) with probability [$P(R_p=C|\hat{R}_p=x,\theta)$] given covariates (\theta). Caveman considers the following covariates:

  1. lane/read group
  2. read order
  3. strand
  4. mapping quality
  5. base quality
  6. read position (position within read)

It is worth noting that the first 4 covariates are defined at the level of the read and the last 2 covariates at the base call level.

The relevant data, (D), is assumed to be the pileup of reads at the specified genomic position for both the normal sample [$D_{p}^{(n)}$] and the tumour sample [$D_{p}^{(t)}$]. If we assume that errors in the called base are conditionally independent given the covariates then we can write:

[P(D|G_g,G_t)=\prod_{R\in D_{p}^{(n)}}P^{(n)}(R_p=C|\hat{R}p,\theta)\prod{R\in D_{p}^{(t)}}P^{(t)}(R_p=C|\hat{R}_p,\theta)] (2)

For genotype, (G), let [$\Psi(G)=\frac{B(G)}{A(G)+B(G)}=P(\hat{R}_p=B|G)$] then:

[{P^{(n)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&P(R_p=C|\hat{R}_p=A,\theta)P(\hat{R}_p=A|G_g)+P(R_p=C|\hat{R}_p=B,\theta)P(\hat{R}_p=B|G_g)}]

[{\nonumber\ &=&P(R_p=C|\hat{R}_p=A,\theta)(1-\Psi(G_g))+P(R_p=C|\hat{R}_p=B,\theta)\Psi(G_g)\nonumber }]

We can calculate the [$P^{(t)}$] similarly, but need to take into account contamination of the tumour with normal cells.

[{P^{(t)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&P(R_p=C|\hat{R}_p=A,\theta)P(\hat{R}_p=A|G_g,G_t)+P(R_p=C|\hat{R}_p=B,\theta)P(\hat{R}_p=B|G_g,G_t)}] (3)

Now the adjusted contamination, [$\hat{c}=\frac{c N_g}{c N_g +(1-c) N_t}$] , represents the probability that a read from the tumour sample comes from a normal cell. So, for the reads (R) coming from the tumour sample we have:

[{P(\hat{R}_p=A|G_g,G_t)&=&P(\hat{R}_p=A| R\in Normal,G_g)P(R \in Normal)+P(\hat{R}_p=A| R\in Tumour)P(R \in Tumour,G_t)}]

[&=&\hat{c}P(\hat{R}_p=A| R\in Normal,G_g)+(1-\hat{c})P(\hat{R}_p=A| R\in Tumour,G_t)]

[&=&\hat{c}(1-\Psi(G_g))+(1-\hat{c})(1-\Psi(G_t))]

and similarly:

[P(\hat{R}_p=B|G_g,G_t)&=&\hat{c}\Psi(G_g)+(1-\hat{c})\Psi(G_t)\nonumber]

Substituting this into (3) and rearranging we have:

[{P^{(t)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)&=&\hat{c}P^{(n)}(R_p=C|\hat{R}_p,\theta,G_g,G_t)+ \nonumber \ & &(1-\hat{c})((1-\Psi(G_t))P(R_p=C|\hat{R}_p=A,\theta)+\Psi(G_t)P(R_p=C|\hat{R}_p=B,\theta))}]

We now have all the information required to calculate (2) and therefore (1) except we still need an expression for the so called profile probabilities [$P(R_p=C|\hat{R}_p=x,\theta)$].

Recall that at each position (p) in the genome we have a set of overlapping reads [D_{p}^{(n)}] and [D_{p}^{(t)}] for normal and tumour respectively. Now each read (R) at position (p) has covariates [\theta(R,p)] and we form an overall count \chi of the number of times that each covariate appears across the genome:

[\chi(R_p=C,\hat{R}p=R,\theta)=\sum{p\in Genome}{(\sum_{R\in D_{p}^{(n)}}|{R:\theta(R,p)=\theta}|)+(\sum_{R\in D_{p}^{(t)}}|{R:\theta(R,p)=\theta}|)}]

This counts matrix is then converted into an empirical estimate of the profile probabilities:

[P(R_p=C|\hat{R}_p=x,\theta)=\frac{\chi(R_p=C,\hat{R}p=x,\theta)}{\sum{C\in {a,c,g,t}}\chi(R_p=C,\hat{R}_p=x,\theta)}]

Credit for this statistical writeup goes to Nick Williams

Clone this wiki locally