Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul of Mutect2 filtering #5688

Merged
merged 15 commits into from
Mar 11, 2019
Merged

Overhaul of Mutect2 filtering #5688

merged 15 commits into from
Mar 11, 2019

Conversation

davidbenjamin
Copy link
Contributor

@davidbenjamin davidbenjamin commented Feb 19, 2019

Closes #4893. Closes #5086. Closes #5684. Closes #4500. Makes #4933, #4958, and #5085 possible.

@takutosato Failing tests are superficial. You can begin reviewing.

This is a big PR:

  • Refactor of all M2 filtering. Each filter has its own class, and the filtering engine ties it all together.
  • Learn allele fraction clustering and somatic SNV and indel priors.
  • More probabilistic filters.
  • All filters have a common probabilistic threshold.
  • M2 determines threshold automatically.
  • Rewrite of all M2 documentation.
  • Several filters, including strand bias and normal artifact, learn their own parameters.

@LeeTL1220 M2 validations look really, really good.

@meganshand Once this goes in mitochondria best practices will need to be tweaked again. We can merge the dangling tails homoplasmic fix before merging this.

@codecov-io
Copy link

codecov-io commented Feb 19, 2019

Codecov Report

Merging #5688 into master will decrease coverage by 51.14%.
The diff coverage is 86.938%.

@@               Coverage Diff               @@
##              master     #5688       +/-   ##
===============================================
- Coverage     86.999%   35.859%   -51.14%     
+ Complexity     31873     17519    -14354     
===============================================
  Files           1942      1975       +33     
  Lines         146789    147184      +395     
  Branches       16216     16228       +12     
===============================================
- Hits          127705     52779    -74926     
- Misses         13174     89595    +76421     
+ Partials        5910      4810     -1100
Impacted Files Coverage Δ Complexity Δ
...titute/hellbender/tools/walkers/GenotypeGVCFs.java 89.352% <ø> (-0.971%) 93 <0> (-1)
...lkers/ReferenceConfidenceVariantContextMerger.java 90.544% <ø> (-4.298%) 108 <0> (-2)
...stitute/hellbender/tools/walkers/CombineGVCFs.java 94.479% <ø> (ø) 70 <0> (ø) ⬇️
...ellbender/tools/exome/FilterByOrientationBias.java 83.019% <ø> (ø) 14 <0> (ø) ⬇️
.../walkers/contamination/CalculateContamination.java 96.552% <ø> (ø) 10 <0> (ø) ⬇️
...ender/tools/walkers/annotator/InbreedingCoeff.java 72.414% <ø> (-13.793%) 7 <0> (-3)
...der/tools/walkers/CombineGVCFsIntegrationTest.java 1.262% <ø> (-88.675%) 2 <0> (-36)
...s/walkers/validation/CalculateMixingFractions.java 78.431% <ø> (ø) 15 <0> (ø) ⬇️
...s/walkers/validation/MergeMutect2CallsWithMC3.java 81.25% <ø> (ø) 13 <0> (ø) ⬇️
...tools/walkers/mutect/SomaticLikelihoodsEngine.java 86.667% <ø> (-2.222%) 17 <0> (-1)
... and 1382 more

@davidbenjamin davidbenjamin added this to the Mutect 3 milestone Feb 26, 2019
@davidbenjamin davidbenjamin force-pushed the db_m2_filtering_refactor branch 2 times, most recently from 5a74c30 to 654ac5f Compare March 3, 2019 05:06
Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed everything but the ThresholdCalculator (and wdls and tests). Just wanted to give you review comments I have thus far so you don't have to wait until I'm done done.

\usepackage{listings}
\lstset{basicstyle=\ttfamily,
showstringspaces=false,
commentstyle=\color{red},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a matter of taste, but I think red is a little distracting. gray looked good when I tried. I'm sure other colors would work too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think captions look better at the bottom. Adding captionpos=b here will do it, if you choose to move them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done x 2 -- looks much better

[-I tumor2.bam -I tumor3.bam . . .] \
# Mutect2 may input matched normals from the same individual
[-I normal1.bam -I normal2.bam . . .] \
# For most purposes Mutect2 should be supplied with gnomad
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gnomad -> gnomAD

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

where $\psi$ is the digamma function and $N$ is the number of reads. To obtain $q(\vz)$ and $q(\vf)$ we iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. A very reasonable initialization is to set $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for read $r$, 0 otherwise. Having obtained the mean field of $\vz$, we would like to plug it into Eq \ref{evidence}. We can't do this directly, of course, because Eq \ref{evidence} says nothing about our mean field factorization. Rather, we need the variational approximation (Bishop's Eq 10.3) to the model evidence, which is
We want to marginalize the latent variables to obtain the evidence $P(\mathbb{R} | \mathbb{A})$, which we make tractable via a mean-field approximation $P(\mathbb{R}, \vz, \vf | \mathbb{A}) \approx q(\vz) q(\vf)$, which is exact in two limits. First, if there are many reads, each allele is associated with many reads and therefore the Law of Large Numbers causes $\vf$ and $\vz$ to become uncorrelated. Second, if the allele assignments of reads are obvious $\vz_r$ is effectively determinate, hence uncorrelated with $\vf$. In the variational Bayesian mean-field formalism we have
\begin{align}
q(\vf) \propto& E_{q(\vz)} \left[ P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] \propto {\rm Dir}(\vf | \valpha + \sum_r \bar{\vz}_r) \equiv {\rm Dir}(\vf | \vbeta), \quad \vbeta = \valpha + \sum_r \bar{\vz}_r \label{z_mean_field} \\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because expectation and log don't commute, I think the first of these proportional-to relationships is not true i.e. q(f) ~ E_z [P(X,Z,f)] isn't the same as ln q(f) ~ E_z [ln P(X,Z,f)] (aka Bishop 10.9)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed -- fortunately the equations to the right were correct

\end{align}
Before we proceed, let's introduce some notation. First, from Eq \ref{qf} the posterior $q(\vf)$ is
are easily obtained from the categorical distribution $q(\vz)$ and the Dirichlet distribution $q(\vf)$\footnote{Note that we didn't \textit{impose} this in any way. It simply falls out of the mean field equations.} We initialize $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for read $r$, 0 otherwise and iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. Having obtained the mean fields of $q(\vz)$ and $q(\vf)$, we use the variational approximation (Bishop's Eq 10.3) to the model evidence:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$q(\vz)$ -> $q(\vz_r)$

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And missing a period before footnote

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is correct as long as you overload \vz in that without a subscript it's the set of all \vz_r. How do you feel about it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the missing period

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup \vz as the set of all \vz_r sounds good to me

\begin{equation}
E_{\rm Dir(\vf | \vomega)} \left[ \ln f_a \right] = \psi(\omega_a) - \psi(\sum_{a^\prime} \omega_{a^\prime}) \equiv h_a(\vomega).
P({\rm error}) = 1 - (1 - {\rm max~artifact~error~prob})(1 - {\rm max~non-somatic~prob})(1 - {\rm sequencing~error~prob}).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"non-somatic" looks like "non minus somatic"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, and now I know that \textrm is better than \rm and doesn't need those tildes.

.filter(eStep -> eStep.getArtifactProbability() > 0.1).collect(Collectors.toList());
final double totalArtifacts = potentialArtifacts.stream().mapToDouble(EStep::getArtifactProbability).sum();
final double totalNonArtifacts = eSteps.stream().mapToDouble(e -> 1 - e.getArtifactProbability()).sum();
strandArtifactPrior = (totalArtifacts + ARTIFACT_PSEUDOCOUNT) / (totalNonArtifacts + NON_ARTIFACT_PSEUDOCOUNT);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to my calculation the denominator should be the total count N = totalArtifacts + totalNonArtifacts + alpha, instead of the effectiveCount of non artifacts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, and this fixes a bug with this branch that showed up in mitochondria


@Override
protected void learnParameters() {
final List<EStep> potentialArtifacts = eSteps.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why filter the list by prob > 0.1? I'm sure it wouldn't hurt but also seems unnecessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partly speed in the case of large vcfs, partly to be robust to a bad initialization where we don't want a lot of questionable strand artifacts to cause us to learn bad parameters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see thanks

\end{align}
where ${\rm D}_j$ denotes the $j$th existing Dirichlet Process cluster, ${\rm Dirichlet}_{\rm new}$ denotes a newly-created Dirichlet cluster, $N_j$ is the number of variants assigned to cluster $j$, and $N_D$ is the total number of variants assigned to Dirichlet clusters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

${\rm D}_j$ -> ${\rm Dirichlet}_j$ to be consistent with the equations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

P({\rm Sequencing~error}) \propto& 1 - \pi_{\rm real} \\
P({\rm High~AF}) \propto& \pi_{\rm real} \pi_H \\
P({\rm Background}) \propto& \pi_{\rm real} \pi_B \\
P({\rm Dirichlet}_j) \propto& \pi_{\rm real} \pi_D \frac{ N_j}{N_D + \alpha} \\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the superscript -i or equivalent on N_j and N_D to signify the fact that you call popDatum() before each Gibbs update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think we can replace \propto with =

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done x 2

pruneEmptyClusters();

final List<List<Datum>> dataByCluster = clusters.stream().map(c -> new ArrayList<Datum>()).collect(Collectors.toList());
for (final MutableInt datumIndex = new MutableInt(0); datumIndex.getValue() < clusterAssignments.size(); datumIndex.increment()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use an int instead of MutableInt here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lambda but I rewrote it with an IndexRange.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that's inconvenient but sounds good

@davidbenjamin davidbenjamin force-pushed the db_m2_filtering_refactor branch from 49d4895 to 2ef54c6 Compare March 7, 2019 19:53
Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished review. A couple more comments.

IntStream.range(OFFSET, clusters.size()).boxed()
.sorted(Comparator.comparingDouble(c -> -log10CRPWeight(c)))
.forEach(c -> result.add(ImmutablePair.of("Binomial cluster " + clusterIndex.toString(),
String.format("weight = %.4f, %s", Math.pow(10, log10CRPWeight(c)), clusters.get(c).toString()))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log10CRPWeight(c) + log10SparseClustersWeight might be more informative than just log10CRPWeight(c), but I can see the benefit of the conditional probability too. I'm just bringing this up in case it was unintended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended the conditional but it was a dilemma between the two options.

});

// filter if there is no alt evidence in the forward or reverse strand
return Math.min(altForwardCount.getValue(), altReverseCount.getValue()) >= minReadsOnEachStrand;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to replace >= with <.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. That would have been quite a present for Maddy and Mark.

@davidbenjamin
Copy link
Contributor Author

@takutosato Back to you!

@davidbenjamin davidbenjamin force-pushed the db_m2_filtering_refactor branch from 2921ce5 to 48e464d Compare March 8, 2019 01:40
@takutosato
Copy link
Contributor

Looks good! @davidbenjamin

@davidbenjamin davidbenjamin force-pushed the db_m2_filtering_refactor branch from f2aa08e to a131d56 Compare March 10, 2019 02:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants