-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul of Mutect2 filtering #5688
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5688 +/- ##
===============================================
- Coverage 86.999% 35.859% -51.14%
+ Complexity 31873 17519 -14354
===============================================
Files 1942 1975 +33
Lines 146789 147184 +395
Branches 16216 16228 +12
===============================================
- Hits 127705 52779 -74926
- Misses 13174 89595 +76421
+ Partials 5910 4810 -1100
|
5a74c30
to
654ac5f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed everything but the ThresholdCalculator (and wdls and tests). Just wanted to give you review comments I have thus far so you don't have to wait until I'm done done.
docs/mutect/mutect.tex
Outdated
\usepackage{listings} | ||
\lstset{basicstyle=\ttfamily, | ||
showstringspaces=false, | ||
commentstyle=\color{red}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a matter of taste, but I think red is a little distracting. gray
looked good when I tried. I'm sure other colors would work too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think captions look better at the bottom. Adding captionpos=b
here will do it, if you choose to move them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done x 2 -- looks much better
docs/mutect/mutect.tex
Outdated
[-I tumor2.bam -I tumor3.bam . . .] \ | ||
# Mutect2 may input matched normals from the same individual | ||
[-I normal1.bam -I normal2.bam . . .] \ | ||
# For most purposes Mutect2 should be supplied with gnomad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gnomad
-> gnomAD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docs/mutect/mutect.tex
Outdated
where $\psi$ is the digamma function and $N$ is the number of reads. To obtain $q(\vz)$ and $q(\vf)$ we iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. A very reasonable initialization is to set $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for read $r$, 0 otherwise. Having obtained the mean field of $\vz$, we would like to plug it into Eq \ref{evidence}. We can't do this directly, of course, because Eq \ref{evidence} says nothing about our mean field factorization. Rather, we need the variational approximation (Bishop's Eq 10.3) to the model evidence, which is | ||
We want to marginalize the latent variables to obtain the evidence $P(\mathbb{R} | \mathbb{A})$, which we make tractable via a mean-field approximation $P(\mathbb{R}, \vz, \vf | \mathbb{A}) \approx q(\vz) q(\vf)$, which is exact in two limits. First, if there are many reads, each allele is associated with many reads and therefore the Law of Large Numbers causes $\vf$ and $\vz$ to become uncorrelated. Second, if the allele assignments of reads are obvious $\vz_r$ is effectively determinate, hence uncorrelated with $\vf$. In the variational Bayesian mean-field formalism we have | ||
\begin{align} | ||
q(\vf) \propto& E_{q(\vz)} \left[ P(\mathbb{R}, \vz, \vf | \mathbb{A}) \right] \propto {\rm Dir}(\vf | \valpha + \sum_r \bar{\vz}_r) \equiv {\rm Dir}(\vf | \vbeta), \quad \vbeta = \valpha + \sum_r \bar{\vz}_r \label{z_mean_field} \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because expectation and log don't commute, I think the first of these proportional-to relationships is not true i.e. q(f) ~ E_z [P(X,Z,f)]
isn't the same as ln q(f) ~ E_z [ln P(X,Z,f)]
(aka Bishop 10.9)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed -- fortunately the equations to the right were correct
docs/mutect/mutect.tex
Outdated
\end{align} | ||
Before we proceed, let's introduce some notation. First, from Eq \ref{qf} the posterior $q(\vf)$ is | ||
are easily obtained from the categorical distribution $q(\vz)$ and the Dirichlet distribution $q(\vf)$\footnote{Note that we didn't \textit{impose} this in any way. It simply falls out of the mean field equations.} We initialize $\bar{z}_{ra} = 1$ if $a$ is the most likely allele for read $r$, 0 otherwise and iterate Equations \ref{z_mean_field} and \ref{f_mean_field} until convergence. Having obtained the mean fields of $q(\vz)$ and $q(\vf)$, we use the variational approximation (Bishop's Eq 10.3) to the model evidence: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$q(\vz)$
-> $q(\vz_r)$
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And missing a period before footnote
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is correct as long as you overload \vz
in that without a subscript it's the set of all \vz_r
. How do you feel about it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed the missing period
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup \vz
as the set of all \vz_r
sounds good to me
docs/mutect/mutect.tex
Outdated
\begin{equation} | ||
E_{\rm Dir(\vf | \vomega)} \left[ \ln f_a \right] = \psi(\omega_a) - \psi(\sum_{a^\prime} \omega_{a^\prime}) \equiv h_a(\vomega). | ||
P({\rm error}) = 1 - (1 - {\rm max~artifact~error~prob})(1 - {\rm max~non-somatic~prob})(1 - {\rm sequencing~error~prob}). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"non-somatic" looks like "non minus somatic"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, and now I know that \textrm
is better than \rm
and doesn't need those tildes.
.filter(eStep -> eStep.getArtifactProbability() > 0.1).collect(Collectors.toList()); | ||
final double totalArtifacts = potentialArtifacts.stream().mapToDouble(EStep::getArtifactProbability).sum(); | ||
final double totalNonArtifacts = eSteps.stream().mapToDouble(e -> 1 - e.getArtifactProbability()).sum(); | ||
strandArtifactPrior = (totalArtifacts + ARTIFACT_PSEUDOCOUNT) / (totalNonArtifacts + NON_ARTIFACT_PSEUDOCOUNT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to my calculation the denominator should be the total count N = totalArtifacts + totalNonArtifacts + alpha
, instead of the effectiveCount of non artifacts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, and this fixes a bug with this branch that showed up in mitochondria
|
||
@Override | ||
protected void learnParameters() { | ||
final List<EStep> potentialArtifacts = eSteps.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why filter the list by prob > 0.1
? I'm sure it wouldn't hurt but also seems unnecessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partly speed in the case of large vcfs, partly to be robust to a bad initialization where we don't want a lot of questionable strand artifacts to cause us to learn bad parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see thanks
docs/mutect/mutect.tex
Outdated
\end{align} | ||
where ${\rm D}_j$ denotes the $j$th existing Dirichlet Process cluster, ${\rm Dirichlet}_{\rm new}$ denotes a newly-created Dirichlet cluster, $N_j$ is the number of variants assigned to cluster $j$, and $N_D$ is the total number of variants assigned to Dirichlet clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
${\rm D}_j$
-> ${\rm Dirichlet}_j$
to be consistent with the equations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
docs/mutect/mutect.tex
Outdated
P({\rm Sequencing~error}) \propto& 1 - \pi_{\rm real} \\ | ||
P({\rm High~AF}) \propto& \pi_{\rm real} \pi_H \\ | ||
P({\rm Background}) \propto& \pi_{\rm real} \pi_B \\ | ||
P({\rm Dirichlet}_j) \propto& \pi_{\rm real} \pi_D \frac{ N_j}{N_D + \alpha} \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add the superscript -i
or equivalent on N_j
and N_D
to signify the fact that you call popDatum()
before each Gibbs update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think we can replace \propto
with =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done x 2
pruneEmptyClusters(); | ||
|
||
final List<List<Datum>> dataByCluster = clusters.stream().map(c -> new ArrayList<Datum>()).collect(Collectors.toList()); | ||
for (final MutableInt datumIndex = new MutableInt(0); datumIndex.getValue() < clusterAssignments.size(); datumIndex.increment()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use an int
instead of MutableInt
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lambda but I rewrote it with an IndexRange
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that's inconvenient but sounds good
49d4895
to
2ef54c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished review. A couple more comments.
IntStream.range(OFFSET, clusters.size()).boxed() | ||
.sorted(Comparator.comparingDouble(c -> -log10CRPWeight(c))) | ||
.forEach(c -> result.add(ImmutablePair.of("Binomial cluster " + clusterIndex.toString(), | ||
String.format("weight = %.4f, %s", Math.pow(10, log10CRPWeight(c)), clusters.get(c).toString())))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log10CRPWeight(c) + log10SparseClustersWeight
might be more informative than just log10CRPWeight(c)
, but I can see the benefit of the conditional probability too. I'm just bringing this up in case it was unintended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended the conditional but it was a dilemma between the two options.
}); | ||
|
||
// filter if there is no alt evidence in the forward or reverse strand | ||
return Math.min(altForwardCount.getValue(), altReverseCount.getValue()) >= minReadsOnEachStrand; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to replace >=
with <
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. That would have been quite a present for Maddy and Mark.
@takutosato Back to you! |
2921ce5
to
48e464d
Compare
Looks good! @davidbenjamin |
f2aa08e
to
a131d56
Compare
Closes #4893. Closes #5086. Closes #5684. Closes #4500. Makes #4933, #4958, and #5085 possible.
@takutosato Failing tests are superficial. You can begin reviewing.
This is a big PR:
@LeeTL1220 M2 validations look really, really good.
@meganshand Once this goes in mitochondria best practices will need to be tweaked again. We can merge the dangling tails homoplasmic fix before merging this.