
Overhaul of CalculateContamination's model for determining hom alt sites #5413

Merged: 5 commits from db_contam into master on Nov 21, 2018

Conversation

@davidbenjamin
Contributor

@takutosato This dramatically improves CalculateContamination by taking more care to distinguish hom alts from hets. It makes an especially big difference in our tumor-only HCC1143 validations, where the accuracy is now very good (and BTW, ContEst gets these all completely wrong even with a matched normal).

It also makes the tool work better on targeted panels, where there might not be enough hom alt sites, by adding a backup hom ref mode that is triggered automatically. This is based on a Broadie request.

I will file a separate issue to update the docs.

@davidbenjamin
Contributor Author

BTW, I wouldn't bother looking at the diff. It's a complete rewrite.

@codecov-io commented Nov 14, 2018

Codecov Report

Merging #5413 into master will increase coverage by 0.06%.
The diff coverage is 93.57%.


@@             Coverage Diff              @@
##             master    #5413      +/-   ##
============================================
+ Coverage     87.02%   87.08%   +0.06%     
- Complexity    30454    30511      +57     
============================================
  Files          1853     1856       +3     
  Lines        140995   141484     +489     
  Branches      15518    15536      +18     
============================================
+ Hits         122706   123217     +511     
+ Misses        12648    12617      -31     
- Partials       5641     5650       +9
Impacted Files Coverage Δ Complexity Δ
...der/tools/walkers/contamination/PileupSummary.java 90% <ø> (+1.42%) 14 <0> (+1) ⬆️
...dinstitute/hellbender/utils/OptimizationUtils.java 42.1% <100%> (+3.21%) 4 <1> (+1) ⬆️
...ination/CalculateContaminationIntegrationTest.java 100% <100%> (ø) 14 <1> (+2) ⬆️
...ols/walkers/contamination/ContaminationRecord.java 88.23% <100%> (-2.88%) 5 <1> (-1)
.../walkers/contamination/CalculateContamination.java 96.29% <88.88%> (+0.77%) 10 <4> (-31) ⬇️
...ools/walkers/contamination/ContaminationModel.java 92.39% <92.39%> (ø) 39 <39> (?)
.../walkers/contamination/ContaminationSegmenter.java 96.42% <96.42%> (ø) 9 <9> (?)
...lbender/utils/read/SAMRecordToGATKReadAdapter.java 91.6% <0%> (-2.1%) 144% <0%> (+6%)
...nder/tools/funcotator/TranscriptSelectionMode.java 89.71% <0%> (-1.87%) 1% <0%> (ø)
...tools/funcotator/DataSourceFuncotationFactory.java 86.95% <0%> (-1.68%) 17% <0%> (ø)
... and 23 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 864b180...1183b3d. Read the comment docs.

@takutosato
Contributor

I'm on it

@takutosato (Contributor) left a comment

Very minor comments. Adding a test or two on the hom-ref mode would be nice, but I'm fine either way.

I couldn't derive the error bar on my own, so my review of the related code is incomplete. Please self-review it before you merge.

\begin{equation}
\prod_{{\rm hets~}s} \frac{1}{2} \left[ {\rm Binom}(a_s | d_s, f) + {\rm Binom}(a_s | d_s, 1 - f) \right] \label{het-maf}
\end{equation}

\begin{equation}
P(\{ a\}| \{f\}, \chi, \{d\} \prod_{{\rm segments~}n}\prod_{{\rm sites~}s} \sum_{{\rm genotype~}g} P(g|f_s) {\rm Binom}(a_s | d_s, (1 - \chi) \phi(g, \mu_n, \epsilon) + \chi f_s)
\end{equation}
Contributor

closing paren and = missing after P(\{ a\}| \{f\}, \chi, \{d\}

Contributor Author

done
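
For concreteness, the per-site term in the corrected likelihood above (the sum over genotypes $g$) can be evaluated as in the following minimal sketch. The class and method names are invented, commons-math3 is assumed, and the error rate is illustrative; this shows the formula, not the PR's actual code.

import org.apache.commons.math3.distribution.BinomialDistribution;

// Sketch of sum_g P(g|f_s) Binom(a_s | d_s, (1 - chi) phi(g, mu, eps) + chi f_s).
// Genotype encoding: 0 = hom ref, 1 = alt minor, 2 = alt major, 3 = hom alt.
public class ContaminationLikelihoodSketch {
    private static final double EPSILON = 1e-3;  // illustrative base error rate (assumption)

    // phi(g, mu, epsilon): alt allele fraction of the uncontaminated sample
    private static double phi(final int g, final double mu) {
        switch (g) {
            case 0: return EPSILON;       // hom ref
            case 1: return mu;            // alt minor
            case 2: return 1 - mu;        // alt major
            default: return 1 - EPSILON;  // hom alt
        }
    }

    // Hardy-Weinberg prior P(g | f) given population allele frequency f
    private static double prior(final int g, final double f) {
        switch (g) {
            case 0: return (1 - f) * (1 - f);
            case 1:
            case 2: return f * (1 - f);
            default: return f * f;
        }
    }

    public static double siteLikelihood(final int altCount, final int depth,
                                        final double f, final double chi, final double mu) {
        double likelihood = 0;
        for (int g = 0; g < 4; g++) {
            final double altFraction = (1 - chi) * phi(g, mu) + chi * f;
            likelihood += prior(g, f) * new BinomialDistribution(depth, altFraction).probability(altCount);
        }
        return likelihood;
    }
}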

Contributor Author

Added a test for hom ref mode and caught/fixed a bug in the process. Thanks!

* could be derived from the tumor itself or a matched normal.
* @return
*/
public Pair<Double, Double> calculateContaminationFromHoms(List<PileupSummary> tumorSites) {
Contributor

I think tumorSites can be final. Happens a few times in this file.

Contributor Author

fixed

@davidbenjamin
Contributor Author

Put in docs for the error bars. Now this PR has no sloppy loose ends.

@davidbenjamin
Contributor Author

@takutosato back to you for another round of review.

@takutosato (Contributor) left a comment

Just a few typos but looks great!

=& (1 - f_s) \left( {\rm var}(X) + E[X]^2 \right) - (1 - f_s)^2E[X]^2 \\
=& (1 - f_s) d_s \chi(1 - \chi) + f_s(1 - f_s) d_s^2 \chi^2
\end{align}
And therefore the standard error on $\hat{\chi}$ comes out to the square root of the sum of these per-site variances, divided by the denominator of Eq. Eq. \ref{contamination_estimate}, that is,
Contributor

Eq. Eq.

Contributor Author

done
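
Written out, the derivation above gives the following, where $D$ stands for the denominator of Eq. \ref{contamination_estimate} (not shown in this hunk). This is a sketch of the quoted text, not a verbatim excerpt from the docs:

\begin{equation}
{\rm SE}(\hat{\chi}) = \frac{1}{D} \sqrt{ \sum_s \left[ (1 - f_s) d_s \chi (1 - \chi) + f_s (1 - f_s) d_s^2 \chi^2 \right] }
\end{equation}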


Finally, we note that the same calculation can be reverse by using alt reads in hom ref sites as the signal and replacing $f$ by $1-f$ everywhere above. We use the estimate from hom refs as a backup when the hom alt estimate has too great an error, as can occur in the case of targeted panels with few sites. We do not use this as our primary estimate because it is much more affected by uncertainty in the population allele frequencies and is thus susceptible to systematic bias.
Contributor

can be *reversed

Contributor Author

fixed
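
In code terms, the backup described above behaves roughly like the following sketch. The class name and threshold parameter are invented; each pair is (contamination, standard error), matching the Pair<Double, Double> returned by calculateContaminationFromHoms quoted earlier.

import org.apache.commons.lang3.tuple.Pair;

public class BackupModeSketch {
    // Prefer the hom alt estimate; fall back to the hom ref estimate only when
    // the hom alt error bar is too wide, as on targeted panels with few sites.
    public static Pair<Double, Double> chooseEstimate(final Pair<Double, Double> fromHomAlts,
                                                      final Pair<Double, Double> fromHomRefs,
                                                      final double maxStandardError) {
        return fromHomAlts.getRight() <= maxStandardError ? fromHomAlts : fromHomRefs;
    }
}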

where $a_s$ and $d_s$ are the alt and total counts at site $s$ and $f$ is the minor allele fraction. This approach might seem to break down in the case of loss of heterozygosity when many hets are outside of the heuristic range, but if this happens the likelihood will be maximized at the end of the range and we will still detect loss of heterozygosity. Also, due to contamination the true distributions are overdispersed relative to a pure binomial. However, the important thing is inferring $f$ and not obtaining a perfect model fit. Overdispersion worsens the fit but does not significantly affect the estimate of $f$.
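
A minimal sketch of maximizing the het likelihood of Eq. \ref{het-maf} over $f$ by grid search follows; the names and grid bounds are invented for illustration, and this is not the PR's actual optimization code:

import org.apache.commons.math3.distribution.BinomialDistribution;

public class HetMinorAlleleFractionSketch {
    // log of the 50/50 binomial mixture likelihood over putative het sites
    public static double logLikelihood(final int[] altCounts, final int[] depths, final double f) {
        double result = 0;
        for (int s = 0; s < altCounts.length; s++) {
            result += Math.log(0.5 * new BinomialDistribution(depths[s], f).probability(altCounts[s])
                    + 0.5 * new BinomialDistribution(depths[s], 1 - f).probability(altCounts[s]));
        }
        return result;
    }

    // maximize over the minor allele fraction f in (0, 1/2] on a coarse grid
    public static double fitMinorAlleleFraction(final int[] altCounts, final int[] depths) {
        double bestF = 0.5;
        double bestLogLikelihood = Double.NEGATIVE_INFINITY;
        for (double f = 0.05; f <= 0.5; f += 0.005) {
            final double ll = logLikelihood(altCounts, depths, f);
            if (ll > bestLogLikelihood) {
                bestLogLikelihood = ll;
                bestF = f;
            }
        }
        return bestF;
    }
}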

Once the minor allele fraction is determined we compare the posterior probabilities of hom alt and het. For hets we use the above likelihood, Equation \ref{het-maf}, with a prior of $2 q (1 - q)$, where $q$ is the population allele frequency. For hom alts we use a binomial likelihood ${\rm Binom}(a_s | d_s, 1 - \alpha)$, where $\alpha$ is the contamination, and a prior of $q^2$. These priors are not normalized since at this point we have thrown away obvious hom ref sites, but only their ratio matters. The fact that this likelihood requires the contamination means that after initializing with some guess for the contamination we iterate the entire procedure described above\footnote{This excludes segmentation, which is independent.} to obtain successive estimates of the contamination. These likelihoods are crude, as they must be because we do not know how many contaminant samples there are, or what their copy numbers are at each site. This generally doesn't matter because for most segments hom alts are easily distinguished from hets. Since low minor allele fractions can lead to confusion, we only use hom alt sites from segments with minor allele fraction as close to 1/2 as possible. Specifically, we use the largest cutoff of minor allele fraction that gives us at least 100 hom alt sites.
where $a_s$ and $d_s$ are the alt and total read counts at site $s$, allelic CNV genotypes $g$ run over hom ref, alt minor, alt major, and hom alt with priors $P({\rm hom~ref}) = (1-f_s)^2$, $P({\rm alt~minor}) = P({\rm alt~major}) = f_s(1-f_s)$, and $P({\rm hom~alt}) = f_s^2$. $\phi(g, \mu, \epsilon)$ is the alt allele fraction of the uncontaminated sample: $\phi({\rm hom~ref}) = \epsilon$, $\phi({\rm alt~minor}) = \mu$, $\phi({\rm alt~major}) = 1 - \mu$, $\phi({\rm hom~alt}) = 1 - \epsilon$. The binomial is the weighted average of the uncontaminated sample and sample reads drawn independently from allele frequency $f$. This is inconsistent with a single diploid contaminant sample, or indeed with any finite number of contaminants, which is why we do not the the MLE estimate in the final output of the tool. The model also assumes that the uncontaminated and contaminating samples have the same overall depth distribution at each site, which is inconsistent with any differences in copy-number. We perform the MLE by brute force, alternately maximizing with respect to $\chi$ with $\mu$ fixed and vice versa. In order to make the solution more robust, we exclude segments with low $\mu$ from the maximization over $\chi$ by taking the highest possible threshold (up to 0.5, of course) for $\mu$ that retains at least $1/4$ of all sites.
Contributor

which is why we do not *the the

Contributor Author

fixed
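
Putting the priors and likelihoods above together, the per-site hom alt vs. het comparison looks roughly like this sketch (invented names, commons-math3 assumed; not the tool's actual code):

import org.apache.commons.math3.distribution.BinomialDistribution;

public class HomAltVsHetSketch {
    // Unnormalized posteriors: 2q(1 - q) times the het mixture likelihood versus
    // q^2 times Binom(a | d, 1 - alpha), where alpha is the current contamination guess.
    public static boolean isHomAlt(final int altCount, final int depth, final double q,
                                   final double minorAlleleFraction, final double alpha) {
        final double f = minorAlleleFraction;
        final double hetLikelihood = 0.5 * new BinomialDistribution(depth, f).probability(altCount)
                + 0.5 * new BinomialDistribution(depth, 1 - f).probability(altCount);
        final double hetPosterior = 2 * q * (1 - q) * hetLikelihood;
        final double homAltPosterior = q * q * new BinomialDistribution(depth, 1 - alpha).probability(altCount);
        return homAltPosterior > hetPosterior;
    }
}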

@davidbenjamin merged commit 986551c into master on Nov 21, 2018
@davidbenjamin deleted the db_contam branch on November 21, 2018 at 00:52