CaVEMan Algorithm
At each site of the genome, CaVEMan calculates the probability of there being a somatic mutation given a pre-specified normal contamination of the tumour sample, a site-specific copy number for both the germline and the tumour, and prior probabilities that the site has a somatic variant or a germline variant.
At each of these sites, the model considers the sets of possible germline and tumour genotypes constructed from the reference allele and a single variant allele. We have three disjoint sets of joint genotypes:
The union of these three sets gives the set of joint genotypes considered by the model. We can then express the prior probability of membership in each set in terms of our priors:
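The central equation in the CaVEMan algorithm gives the probability of each joint genotype given the data and is calculated using Bayes' rule. Writing $G$ for a joint genotype and $D$ for the data:

$$P(G \mid D) = \frac{P(D \mid G)\,P(G)}{\sum_{G'} P(D \mid G')\,P(G')} \qquad (1)$$

where the sum runs over all joint genotypes considered by the model.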
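The normalisation in (1) can be sketched in a few lines of Python. This is only an illustration, not CaVEMan's actual code: the function name `posterior`, the dictionary layout and the genotype labels in the usage note are all hypothetical. Working in log space is a choice made here to avoid underflow when many reads contribute to the likelihood.

```python
import math

def posterior(log_prior, log_likelihood):
    """Posterior probability of each joint genotype via Bayes' rule.

    Both arguments are dicts keyed by a joint-genotype label (hypothetical
    layout): log_prior[g] = log P(g), log_likelihood[g] = log P(data | g).
    """
    # Unnormalised log posterior: log prior + log likelihood.
    log_joint = {g: log_prior[g] + log_likelihood[g] for g in log_prior}
    # Log-sum-exp trick: subtract the maximum before exponentiating.
    m = max(log_joint.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {g: math.exp(v - log_norm) for g, v in log_joint.items()}
```

For example, `posterior({"ref": math.log(0.9), "som": math.log(0.1)}, {"ref": -50.0, "som": -40.0})` returns a dictionary whose values sum to one, with most of the mass on `"som"` because its likelihood is far higher.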
We have already discussed how the prior is calculated; it now remains to calculate the likelihood of the data given the joint genotype. We assume that, for a given mapped read at a given position, the true base is called as a particular base with a probability that depends on a set of covariates. CaVEMan considers the following covariates:
- lane/read group
- read order
- strand
- mapping quality
- base quality
- read position (position within read)
It is worth noting that the first 4 covariates are defined at the level of the read and the last 2 covariates at the base call level.
The relevant data is assumed to be the pileup of reads at the specified genomic position for both the normal sample and the tumour sample. If we assume that errors in the called base are conditionally independent given the covariates, then we can write:
(2)
For genotype, , let then:
We can calculate the corresponding probability for the tumour reads similarly, but need to take into account contamination of the tumour with normal cells.
(3)
The adjusted contamination represents the probability that a read from the tumour sample comes from a normal cell. So, for the reads coming from the tumour sample, we have:
and similarly:
Substituting this into (3) and rearranging we have:
We now have all the information required to calculate (2), and therefore (1), except that we still need an expression for the so-called profile probabilities.
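As a rough illustration of how the independence assumption and the contamination mixture combine, the sketch below computes a log-likelihood for one joint genotype. It is not CaVEMan's implementation. Several things are assumptions made for the example: a genotype is represented as a dictionary of allele fractions (for instance `{"A": 0.5, "C": 0.5}` for a heterozygous site at copy number 2), the adjusted contamination is passed in directly, and `toy_profile` is a stand-in error model that ignores the covariates entirely.

```python
import math

def toy_profile(called, true_base, covariates, error=0.01):
    """Stand-in profile probability P(called base | true base, covariates).

    Assumption for this sketch: a flat error rate, covariates ignored.
    """
    return 1.0 - error if called == true_base else error / 3.0

def base_prob(called, covariates, allele_fractions, profile=toy_profile):
    """P(called base | genotype), averaging over the allele the read came from."""
    return sum(frac * profile(called, allele, covariates)
               for allele, frac in allele_fractions.items())

def log_likelihood(normal_reads, tumour_reads, germline, tumour,
                   adj_contamination, profile=toy_profile):
    """Log P(data | joint genotype), with reads treated as independent.

    Each read is a (called_base, covariates) pair. Reads from the tumour
    sample come from a normal cell with probability adj_contamination,
    so their probability is a two-component mixture.
    """
    total = 0.0
    for called, cov in normal_reads:
        total += math.log(base_prob(called, cov, germline, profile))
    for called, cov in tumour_reads:
        from_normal = base_prob(called, cov, germline, profile)
        from_tumour = base_prob(called, cov, tumour, profile)
        total += math.log(adj_contamination * from_normal
                          + (1.0 - adj_contamination) * from_tumour)
    return total
```

Evaluating `log_likelihood` once per joint genotype gives the likelihood terms that Bayes' rule in (1) combines with the priors.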
Recall that at each position in the genome we have a set of overlapping reads for each of the normal and tumour samples. Each read at a given position has covariates, and we form an overall count of the number of times that each covariate appears across the genome:
This counts matrix is then converted into an empirical estimate of the profile probabilities:
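A minimal sketch of such a conversion is shown below. The exact normalisation CaVEMan uses is not reproduced here, so the details are assumptions: counts are grouped by a covariate tuple together with the reference base (used as a proxy for the true base) and normalised over the four possible called bases, with an optional pseudocount. The function name and the input layout are hypothetical.

```python
from collections import defaultdict

BASES = "ACGT"

def profile_probabilities(observations, pseudocount=0.0):
    """Estimate P(called base | reference base, covariates) from counts.

    observations: iterable of (covariates, reference_base, called_base),
    where covariates is a hashable tuple such as
    (read_group, read_order, strand, mapping_quality, base_quality, read_position).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for covariates, ref_base, called in observations:
        if called not in BASES:
            continue  # skip ambiguous calls such as "N"
        counts[(covariates, ref_base)][called] += 1.0

    probs = {}
    for key, row in counts.items():
        # Normalise each row of the counts matrix over the called bases.
        total = sum(row.values()) + pseudocount * len(BASES)
        probs[key] = {b: (row.get(b, 0.0) + pseudocount) / total for b in BASES}
    return probs
```

A non-zero `pseudocount` keeps rarely seen covariate combinations from producing probabilities of exactly zero, which would otherwise make the log-likelihood undefined.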
Credit for this statistical writeup goes to Nick Williams