
Overhaul of CalculateContamination's model for determining hom alt sites #5413

Merged: 5 commits from db_contam into master on Nov 21, 2018

Conversation

@davidbenjamin
Contributor

@takutosato This dramatically improves CalculateContamination by taking more care to distinguish hom alts from hets. It makes an especially big difference in our tumor-only HCC1143 validations, where the accuracy is now very good (and BTW, ContEst gets these all completely wrong even with a matched normal).

It also makes the tool work better on targeted panels, where there might not be enough hom alt sites, by adding a backup hom ref mode that is triggered automatically. This is based on a Broadie request.

I will file a separate issue to update the docs.

@davidbenjamin
Contributor Author

BTW, I wouldn't bother looking at the diff. It's a complete rewrite.

@codecov-io commented Nov 14, 2018

Codecov Report

Merging #5413 into master will increase coverage by 0.06%.
The diff coverage is 93.57%.


@@             Coverage Diff              @@
##             master    #5413      +/-   ##
============================================
+ Coverage     87.02%   87.08%   +0.06%     
- Complexity    30454    30511      +57     
============================================
  Files          1853     1856       +3     
  Lines        140995   141484     +489     
  Branches      15518    15536      +18     
============================================
+ Hits         122706   123217     +511     
+ Misses        12648    12617      -31     
- Partials       5641     5650       +9
Impacted Files Coverage Δ Complexity Δ
...der/tools/walkers/contamination/PileupSummary.java 90% <ø> (+1.42%) 14 <0> (+1) ⬆️
...dinstitute/hellbender/utils/OptimizationUtils.java 42.1% <100%> (+3.21%) 4 <1> (+1) ⬆️
...ination/CalculateContaminationIntegrationTest.java 100% <100%> (ø) 14 <1> (+2) ⬆️
...ols/walkers/contamination/ContaminationRecord.java 88.23% <100%> (-2.88%) 5 <1> (-1)
.../walkers/contamination/CalculateContamination.java 96.29% <88.88%> (+0.77%) 10 <4> (-31) ⬇️
...ools/walkers/contamination/ContaminationModel.java 92.39% <92.39%> (ø) 39 <39> (?)
.../walkers/contamination/ContaminationSegmenter.java 96.42% <96.42%> (ø) 9 <9> (?)
...lbender/utils/read/SAMRecordToGATKReadAdapter.java 91.6% <0%> (-2.1%) 144% <0%> (+6%)
...nder/tools/funcotator/TranscriptSelectionMode.java 89.71% <0%> (-1.87%) 1% <0%> (ø)
...tools/funcotator/DataSourceFuncotationFactory.java 86.95% <0%> (-1.68%) 17% <0%> (ø)
... and 23 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 864b180...1183b3d. Read the comment docs.

@takutosato
Contributor

I'm on it

@takutosato (Contributor) left a comment

Very minor comments. Adding a test or two on the hom-ref mode would be nice, but I'm fine either way.

I couldn't derive the error bar on my own, so my review of the related code is incomplete. Please self-review it before you merge.

\begin{equation}
\prod_{{\rm hets~}s} \frac{1}{2} \left[ {\rm Binom}(a_s | d_s, f) + {\rm Binom}(a_s | d_s, 1 - f) \right] \label{het-maf}
\end{equation}

\begin{equation}
P(\{ a\}| \{f\}, \chi, \{d\} \prod_{{\rm segments~}n}\prod_{{\rm sites~}s} \sum_{{\rm genotype~}g} P(g|f_s) {\rm Binom}(a_s | d_s, (1 - \chi) \phi(g, \mu_n, \epsilon) + \chi f_s)
\end{equation}
Contributor

closing paren and = missing after P(\{ a\}| \{f\}, \chi, \{d\}

Contributor Author

done
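
For concreteness, the per-site term in the corrected likelihood above (the sum over genotypes $g$) can be evaluated as in the following minimal sketch. The class and method names are invented, commons-math3 is assumed, and the error rate is illustrative; this shows the formula, not the PR's actual code.

import org.apache.commons.math3.distribution.BinomialDistribution;

// Sketch of sum_g P(g|f_s) Binom(a_s | d_s, (1 - chi) phi(g, mu, eps) + chi f_s).
// Genotype encoding: 0 = hom ref, 1 = alt minor, 2 = alt major, 3 = hom alt.
public class ContaminationLikelihoodSketch {
    private static final double EPSILON = 1e-3;  // illustrative base error rate (assumption)

    // phi(g, mu, epsilon): alt allele fraction of the uncontaminated sample
    private static double phi(final int g, final double mu) {
        switch (g) {
            case 0: return EPSILON;       // hom ref
            case 1: return mu;            // alt minor
            case 2: return 1 - mu;        // alt major
            default: return 1 - EPSILON;  // hom alt
        }
    }

    // Hardy-Weinberg prior P(g | f) given population allele frequency f
    private static double prior(final int g, final double f) {
        switch (g) {
            case 0: return (1 - f) * (1 - f);
            case 1:
            case 2: return f * (1 - f);
            default: return f * f;
        }
    }

    public static double siteLikelihood(final int altCount, final int depth,
                                        final double f, final double chi, final double mu) {
        double likelihood = 0;
        for (int g = 0; g < 4; g++) {
            final double altFraction = (1 - chi) * phi(g, mu) + chi * f;
            likelihood += prior(g, f) * new BinomialDistribution(depth, altFraction).probability(altCount);
        }
        return likelihood;
    }
}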

Contributor Author

Added a test for hom ref mode and caught/fixed a bug in the process. Thanks!

* could be derived from the tumor itself or a matched normal.
* @return
*/
public Pair<Double, Double> calculateContaminationFromHoms(List<PileupSummary> tumorSites) {
Contributor

I think tumorSites can be final. Happens a few times in this file.

Contributor Author

fixed

@davidbenjamin
Contributor Author

Put in docs for the error bars. Now this PR has no sloppy loose ends.

@davidbenjamin
Contributor Author

@takutosato back to you for another round of review.

@takutosato (Contributor) left a comment

Just a few typos but looks great!

=& (1 - f_s) \left( {\rm var}(X) + E[X]^2 \right) - (1 - f_s)^2E[X]^2 \\
=& (1 - f_s) d_s \chi(1 - \chi) + f_s(1 - f_s) d_s^2 \chi^2
\end{align}
And therefore the standard error on $\hat{\chi}$ comes out to the square root of the sum of these per-site variances, divided by the denominator of Eq. Eq. \ref{contamination_estimate}, that is,
Contributor

Eq. Eq.

Contributor Author

done
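
Written out, the derivation above gives the following, where $D$ stands for the denominator of Eq. \ref{contamination_estimate} (not shown in this hunk). This is a sketch of the quoted text, not a verbatim excerpt from the docs:

\begin{equation}
{\rm SE}(\hat{\chi}) = \frac{1}{D} \sqrt{ \sum_s \left[ (1 - f_s) d_s \chi (1 - \chi) + f_s (1 - f_s) d_s^2 \chi^2 \right] }
\end{equation}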


Finally, we note that the same calculation can be reverse by using alt reads in hom ref sites as the signal and replacing $f$ by $1-f$ everywhere above. We use the estimate from hom refs as a backup when the hom alt estimate has too great an error, as can occur in the case of targeted panels with few sites. We do not use this as our primary estimate because it is much more affected by uncertainty in the population allele frequencies and is thus susceptible to systematic bias.
Contributor

can be *reversed

Contributor Author

fixed
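
In code terms, the backup described above behaves roughly like the following sketch. The class name and threshold parameter are invented; each pair is (contamination, standard error), matching the Pair<Double, Double> returned by calculateContaminationFromHoms quoted earlier.

import org.apache.commons.lang3.tuple.Pair;

public class BackupModeSketch {
    // Prefer the hom alt estimate; fall back to the hom ref estimate only when
    // the hom alt error bar is too wide, as on targeted panels with few sites.
    public static Pair<Double, Double> chooseEstimate(final Pair<Double, Double> fromHomAlts,
                                                      final Pair<Double, Double> fromHomRefs,
                                                      final double maxStandardError) {
        return fromHomAlts.getRight() <= maxStandardError ? fromHomAlts : fromHomRefs;
    }
}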

where $a_s$ and $d_s$ are the alt and total counts at site $s$ and $f$ is the minor allele fraction. This approach might seem to break down in the case of loss of heterozygosity when many hets are outside of the heuristic range, but if this happens the likelihood will be maximized at the end of the range and we will still detect loss of heterozygosity. Also, due to contamination the true distributions are overdispersed relative to a pure binomial. However, the important thing is inferring $f$ and not obtaining a perfect model fit. Overdispersion worsens the fit but does not significantly affect the estimate of $f$.
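
A minimal sketch of maximizing the het likelihood of Eq. \ref{het-maf} over $f$ by grid search follows; the names and grid bounds are invented for illustration, and this is not the PR's actual optimization code:

import org.apache.commons.math3.distribution.BinomialDistribution;

public class HetMinorAlleleFractionSketch {
    // log of the 50/50 binomial mixture likelihood over putative het sites
    public static double logLikelihood(final int[] altCounts, final int[] depths, final double f) {
        double result = 0;
        for (int s = 0; s < altCounts.length; s++) {
            result += Math.log(0.5 * new BinomialDistribution(depths[s], f).probability(altCounts[s])
                    + 0.5 * new BinomialDistribution(depths[s], 1 - f).probability(altCounts[s]));
        }
        return result;
    }

    // maximize over the minor allele fraction f in (0, 1/2] on a coarse grid
    public static double fitMinorAlleleFraction(final int[] altCounts, final int[] depths) {
        double bestF = 0.5;
        double bestLogLikelihood = Double.NEGATIVE_INFINITY;
        for (double f = 0.05; f <= 0.5; f += 0.005) {
            final double ll = logLikelihood(altCounts, depths, f);
            if (ll > bestLogLikelihood) {
                bestLogLikelihood = ll;
                bestF = f;
            }
        }
        return bestF;
    }
}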

Once the minor allele fraction is determined we compare the posterior probabilities of hom alt and het. For hets we use the above likelihood, Equation \ref{het-maf}, with a prior of $2 q (1 - q)$, where $q$ is the population allele frequency. For hom alts we use a binomial likelihood ${\rm Binom}(a_s | d_s, 1 - \alpha)$, where $\alpha$ is the contamination, and a prior of $q^2$. These priors are not normalized since at this point we have thrown away obvious hom ref sites, but only their ratio matters. The fact that this likelihood requires the contamination means that after initializing with some guess for the contamination we iterate the entire procedure described above\footnote{This excludes segmentation, which is independent.} to obtain successive estimates of the contamination. These likelihoods are crude, as they must be because we do not know how many contaminant samples there are, or what their copy numbers are at each site. This generally doesn't matter because for most segments hom alts are easily distinguished from hets. Since low minor allele fractions can lead to confusion, we only use hom alt sites from segments with minor allele fraction as close to 1/2 as possible. Specifically, we use the largest cutoff of minor allele fraction that gives us at least 100 hom alt sites.
where $a_s$ and $d_s$ are the alt and total read counts at site $s$, allelic CNV genotypes $g$ run over hom ref, alt minor, alt major, and hom alt with priors $P({\rm hom~ref}) = (1-f_s)^2$, $P({\rm alt~minor}) = P({\rm alt~major}) = f_s(1-f_s)$, and $P({\rm hom~alt}) = f_s^2$. $\phi(g, \mu, \epsilon)$ is the alt allele fraction of the uncontaminated sample: $\phi({\rm hom~ref}) = \epsilon$, $\phi({\rm alt~minor}) = \mu$, $\phi({\rm alt~major}) = 1 - \mu$, $\phi({\rm hom~alt}) = 1 - \epsilon$. The binomial is the weighted average of the uncontaminated sample and sample reads drawn independently from allele frequency $f$. This is inconsistent with a single diploid contaminant sample, or indeed with any finite number of contaminants, which is why we do not the the MLE estimate in the final output of the tool. The model also assumes that the uncontaminated and contaminating samples have the same overall depth distribution at each site, which is inconsistent with any differences in copy-number. We perform the MLE by brute force, alternately maximizing with respect to $\chi$ with $\mu$ fixed and vice versa. In order to make the solution more robust, we exclude segments with low $\mu$ from the maximization over $\chi$ by taking the highest possible threshold (up to 0.5, of course) for $\mu$ that retains at least $1/4$ of all sites.
Contributor

which is why we do not *the the

Contributor Author

fixed
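
Putting the priors and likelihoods above together, the per-site hom alt vs. het comparison looks roughly like this sketch (invented names, commons-math3 assumed; not the tool's actual code):

import org.apache.commons.math3.distribution.BinomialDistribution;

public class HomAltVsHetSketch {
    // Unnormalized posteriors: 2q(1 - q) times the het mixture likelihood versus
    // q^2 times Binom(a | d, 1 - alpha), where alpha is the current contamination guess.
    public static boolean isHomAlt(final int altCount, final int depth, final double q,
                                   final double minorAlleleFraction, final double alpha) {
        final double f = minorAlleleFraction;
        final double hetLikelihood = 0.5 * new BinomialDistribution(depth, f).probability(altCount)
                + 0.5 * new BinomialDistribution(depth, 1 - f).probability(altCount);
        final double hetPosterior = 2 * q * (1 - q) * hetLikelihood;
        final double homAltPosterior = q * q * new BinomialDistribution(depth, 1 - alpha).probability(altCount);
        return homAltPosterior > hetPosterior;
    }
}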

@davidbenjamin merged commit 986551c into master on Nov 21, 2018
@davidbenjamin deleted the db_contam branch on November 21, 2018 at 00:52