Make adaptive pruner smarter in complex graphs #6520
Conversation
There are some post-rebase failures, but nothing serious. You can go ahead with review and the rug won't be pulled out from under you.
}

private Collection<Path<V,E>> likelyErrorChains(final List<Path<V, E>> chains, final BaseGraph<V,E> graph, final double errorRate) {
    final Map<Path<V,E>, Double> chainLogOdds = chains.stream()

    // start from set of really obvious good chains, defined as those with an edge with multiplicity above some fraction of the
So my understanding is: we now save all chains whose maximum edge weight is above (graph max / 10) and then grow outwards from there based on the same log odds ratio we had before, correct? Except now we demand that at least one edge of a junction stems from an already blessed "good" chain, instead of both edges needing to satisfy the log odds test. This seems reasonable, I think, but I want to think a little more on it. I have some comments on things I noticed. I will give this a try on HaplotypeCaller.
    // maximum multiplicity edge. Then grow the set of good chains by looking at the log odds of junctions between
    // good chains and unknown chains.
    final int maxEdgeWeight = graph.edgeSet().stream().mapToInt(BaseEdge::getMultiplicity).max().orElse(0);
    final int thresholdEdgeWeight = maxEdgeWeight / 10;
Parameterize the "10". Also I would make the comment that the maxEdgeWeight / 10 will probably falter somewhat around homopolymers in the graph, as a 40 base homopolomyer at 25mer weight without any issues in coverage should have (40 - 25) or 15x the coverage of any of the sequence around it. This doesn't matter for the old graph code but it might for the new graph code and I suspect this will call everything "bad" and either be very slow or possibly throw almost everything away.
    final int maxEdgeWeight = graph.edgeSet().stream().mapToInt(BaseEdge::getMultiplicity).max().orElse(0);
    final int thresholdEdgeWeight = maxEdgeWeight / 10;
    final Set<Path<V, E>> definiteGoodChains = chains.stream()
            .filter(chain -> chain.getEdges().stream().mapToInt(BaseEdge::getMultiplicity).max().orElse(0) > thresholdEdgeWeight)
Why not recover reference chains here?
            .collect(Collectors.toMap(c -> c, c -> chainLogOdds(c, graph, errorRate)));

    final Set<Path<V,E>> result = new HashSet<>(chains.size());
    while (true) {
Put a cap on the number of iterations here. I'm worried this might loop functionally forever on very pathological/crazy high coverage sites with lots of chains and a high disparity of depths.
On second thought, I think there might be a more efficient way to do this. If you preprocess every chain into a map of end vertex -> list of adjacent chains and then poll vertices from a queue, looking at each vertex exactly once (or adding a vertex to the queue when we discover it during processing), I think that should avoid the problem of iterating over the same gazillion error chains forever while only making progress on one or two vertices at a time. Indeed, it seems to me that once a vertex is "good" we need only check its incoming and outgoing neighbors exactly once, and then we can skip ever looking at its neighboring vertices again.
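A minimal sketch of the vertex-queue traversal being suggested here, reusing the generic V/E and Path types from the surrounding code (java.util imports omitted); the index map, the seedVertices starting set, the passesLogOddsTest predicate, and getLastVertex() are assumptions for illustration, not the PR's actual fields:

```java
// Index chains by their endpoint vertices, then grow outward from seed vertices,
// visiting each vertex at most once instead of re-scanning every chain per iteration.
final Map<V, List<Path<V, E>>> chainsByVertex = new HashMap<>();
for (final Path<V, E> chain : chains) {
    chainsByVertex.computeIfAbsent(chain.getFirstVertex(), v -> new ArrayList<>()).add(chain);
    chainsByVertex.computeIfAbsent(chain.getLastVertex(), v -> new ArrayList<>()).add(chain);   // getLastVertex() assumed
}

final Deque<V> verticesToProcess = new ArrayDeque<>(seedVertices);  // seedVertices: assumed starting set of known-good vertices
final Set<V> processedVertices = new HashSet<>();
final Set<Path<V, E>> goodChains = new LinkedHashSet<>();

while (!verticesToProcess.isEmpty()) {
    final V vertex = verticesToProcess.poll();
    if (!processedVertices.add(vertex)) {
        continue;                                   // each vertex is examined exactly once
    }
    for (final Path<V, E> chain : chainsByVertex.getOrDefault(vertex, Collections.emptyList())) {
        if (passesLogOddsTest(chain)) {             // hypothetical log-odds check at the junction
            goodChains.add(chain);
            final V other = chain.getFirstVertex().equals(vertex) ? chain.getLastVertex() : chain.getFirstVertex();
            if (!processedVertices.contains(other)) {
                verticesToProcess.add(other);       // newly discovered vertex gets queued exactly once
            }
        }
    }
}
```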
            .sorted(Comparator.comparingDouble((ToDoubleFunction<Path<V, E>>) chainLogOdds::get)
                    .reversed().thenComparingInt(Path::length))
    // add non-error chains to error chains if maximum number of variants has been exceeded
    chains.stream()
I'm worried about this sort; it seems to me that the log odds would have discrete values (since they are based on comparing counts that are discrete), so it seems likely there would be a tie here. Without a tiebreaker this seems prone to non-deterministic behavior. I would recommend either putting a relatively bulletproof set of tiebreakers in here or making all of the underlying sets Linked.
done using your vertex lexicographical ordering suggestion as below
    // such that when bad edges meet the multiplicities do not indicate an error - does not harm pruning.
    // we test with and without a true variant path A -> E -> C
    @Test
    public void testAdaptivePruningWithAdjacentBadEdges() {
I would recommend including a test where the reference has "badMultiplicity" and making sure that we aren't accidentally pruning that and causing potential problems.
Done by modifying testAdaptivePruning and its data provider.
Force-pushed 5806cef to 5f39967.
@jamesemery Back to you, at long last. I adopted your suggestion of a proper search that doesn't revisit already-seen vertices and came up with a better way of seeding the "good" subgraph that is safe from your STR concern. As far as code is concerned it's a total rewrite; you can pretend the first PR commit doesn't exist. The new criterion for seeding the search is chains with good log odds on both ends which are incident on a vertex with multiple good out-edges or multiple good in-edges. The rationale is that the adjacency of two bad edges may have good log odds (suppose a bad edge comes in and two bad edges come out: one is a new error on top of the original error and one is the continuation of the original error), but to have two outgoing edges with good log odds requires an actual real variant. On our M2 validations this has essentially no effect on sensitivity and a mild reduction in false positives. I will leave it to you (or to me when I don't have to work like a vampire) to investigate how well it interacts with junction trees. As a first step I wrote a basic unit test for the basic pathology of the old method.
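Here is a condensed sketch of how I read that seeding criterion (illustrative only, using a plain JDK map where the PR presumably keeps a multimap; logOddsOfBothEnds and the exact threshold wiring are my assumptions, not the PR's code, which appears in the diff fragments below):

```java
// A chain is "seed-worthy" if its log odds pass the stricter seeding threshold on both of its ends.
// A vertex seeds the good subgraph only if 3 or more seed-worthy chains touch it, which by pigeonhole
// means it has multiple good outgoing chains or multiple good incoming chains.
final Map<V, List<Path<V, E>>> vertexToSeedableChains = new HashMap<>();
for (final Path<V, E> chain : chains) {
    if (logOddsOfBothEnds(chain) > seedingLogOddsThreshold) {   // hypothetical both-ends test
        vertexToSeedableChains.computeIfAbsent(chain.getFirstVertex(), v -> new ArrayList<>()).add(chain);
        vertexToSeedableChains.computeIfAbsent(chain.getLastVertex(), v -> new ArrayList<>()).add(chain);
    }
}
final List<V> seedVertices = vertexToSeedableChains.keySet().stream()
        .filter(v -> vertexToSeedableChains.get(v).size() > 2)  // 3+ seed-worthy chains at this vertex
        .collect(Collectors.toList());
```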
    Collection<Path<V,E>> probableErrorChains = likelyErrorChains(chains, graph, initialErrorProbability);
    final int errorCount = probableErrorChains.stream().mapToInt(c -> c.getLastEdge().getMultiplicity()).sum();
    final int totalBases = chains.stream().mapToInt(c -> c.getEdges().stream().mapToInt(E::getMultiplicity).sum()).sum();
    final double errorRate = (double) errorCount / totalBases;

    return likelyErrorChains(chains, graph, errorRate);
    return likelyErrorChains(chains, graph, errorRate).stream().filter(c -> !c.getEdges().stream().anyMatch(BaseEdge::isRef)).collect(Collectors.toList());
This is good, I think this resolves my previous concern about potentially pruning ref edges. I still think it would be prudent to adapt the test you have below to include a reference path that could/should otherwise be pruned, just to be sure.
Done by checking that the ref path is assembled in testAdaptivePruning and by giving its data provider a case where altFraction is 1.0.
I think these changes were positive, and the new approach to choosing seed edges is more robust. I would say that my objections are relatively minor (though the comments about determinism should, I think, be addressed). I am curious how the specific LOD score you chose interacts with pathological edge cases in the graph, as it still seems possible to construct circumstances in which this might under-prune.
        vertexToGoodOutgoingChains.put(chain.getFirstVertex(), chain);
    }

    // seed-worthy chains must pass the more stringent seeding log odds threshold on both sides
Interesting, I do think you are right that this solves the problem of drastically differing edge weights being commonplace. I am a little concerned, though: it seems that having no tie whatsoever to the depth of good/reference chains runs the risk of graphs that look something like this rescuing edges:
              / (1x coverage) B ---
             / (1x coverage) A ----\
            / (1x coverage)
REF SOURCE ---- (100x coverage) C ------ REF SINK
Under your last proposal we would only have selected C or ref sinks as seeds and then we would have thrown away every other path, whereas now it's possible (certain?) that both A and B would initially be marked as seed vertices, and consequently we would still end up recovering those paths. That seems problematic, as I'm sure there are other strange edge cases where low-weight error kmers "rescue" each other. Still, I think this fixes a bunch of the cases.
I guess part of the question is about the LOD score and how it's computed; perhaps we could harden ourselves against some of these edge cases if the score didn't treat a bubble whose two ends are anchored by 1 base the same as an alternative path that is also anchored by 1 base. Perhaps by weighting against edges with coverage below some minimum, anyway.
This danger is averted by the seedingLogOddsThreshold. If your incoming and outgoing weights are both very low, the log odds will be close to zero. That is, not enough evidence to decisively reject the edge with the regular chain threshold, but also not enough evidence to count it as a seed.
    return chains.stream()
            .max(Comparator.comparingInt((Path<V, E> chain) -> chain.getEdges().stream().mapToInt(BaseEdge::getMultiplicity).max().orElse(0))
                    .thenComparingInt(Path::length)
                    .thenComparingInt(Path::hashCode)).get();
I wouldn't use Path::hashCode to break ties here, as the underlying Path objects are LinkedLists using the default Java Object hash function, which is based on the object's memory address and consequently could still vary from run to run. If you want to do something quick and dirty you could use the lexicographic ordering of the first kmer in the chain, which should never match (well, maybe it could in the edge case where we duplicate repeated kmers...).
Or some other such unique feature of the chains (perhaps compare more edge weights?)
done (the lexicographic version)
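For reference, a sketch of that lexicographic tiebreak in place of Path::hashCode (assuming V is bounded by BaseVertex so that something like getSequenceString() is available; the accessor name is my assumption, not necessarily the merged code):

```java
// Deterministic tiebreak: after max edge multiplicity and chain length, compare the
// first vertex's kmer sequence lexicographically rather than an identity-based hash.
return chains.stream()
        .max(Comparator.comparingInt((Path<V, E> chain) -> chain.getEdges().stream()
                        .mapToInt(BaseEdge::getMultiplicity).max().orElse(0))
                .thenComparingInt(Path::length)
                .thenComparing(chain -> chain.getFirstVertex().getSequenceString()))
        .get();
```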
        graph.addEdges(() -> new BaseEdge(false, variantMultiplicity), A, E, C);
    }

    final ChainPruner<SeqVertex, BaseEdge> pruner = new AdaptiveChainPruner<>(0.01, 2.0, 2.0, 50);
Is there a reason we are using a flat chain/seed pruning log-odds here? Why not test with the defaults laid out in the mutect2 argument collection?
nope. Changed it.
        }
    }
}
You should add a test along the lines of the example I laid out above (where there is a high-confidence reference path and a bubble with bad connections to the root of the tree) and demonstrate that a strict seed LOD ratio will prevent that case from being recovered.
done
    // look for vertices with two incoming or two outgoing chains (plus one outgoing or incoming for a total of 3 or more) with good log odds to seed the subgraph of good vertices
    // the logic here is that a high-multiplicity error chain A that branches into a second error chain B and a continuation-of-the-original-error chain A'
    // may have a high log odds for A'. However, only in the case of true variation will multiple branches leaving the same vertex have good log odds.
Interesting. I still think this can be fooled if there is a low-weight bubble in an error branch in which two low-weight edges of equal weight split and merge and are ultimately connected to the reference only by bogus chains.
This is also averted by the stricter seeding log odds threshold. I wrote a unit test for it.
@@ -18,6 +18,7 @@
    private static final long serialVersionUID = 1L;

    public static final double DEFAULT_PRUNING_LOG_ODDS_THRESHOLD = MathUtils.log10ToLog(1.0);
    public static final double DEFAULT_PRUNING_SEEDING_LOG_ODDS_THRESHOLD = MathUtils.log10ToLog(2.0);
I would advocate that this threshold be made very strict and that we do some reasoning about how probable it is for this to come up. It should be fairly unlikely, I think, for the seeding vertices to ever fall off of good variation by accident due to low coverage.
You're right, and I'll reason it out right now. I don't know what conclusions this exercise will lead to. . .
The seeding method is fooled when an error chain A splits into two chains B and C, where B is the continuation of the original error and C is an additional error on top of the first, such that the log odds of outgoing coverage N_C given incoming coverage N_A exceed the threshold.
So let's first tabulate some values of N_A and N_C along with their log odds (from Mutect2Engine.logLikelihoodRatio(N_A, N_C, errorRate), as in the pruning code). We'll use an error rate of 0.001. Note that the current threshold is 4.605.
| N_A | N_C | LOD  |
|-----|-----|------|
| 30  | 2   | 4.3  |
| 30  | 3   | 8.9  |
| 15  | 2   | 6.4  |
| 15  | 3   | 11.8 |
| 10  | 1   | 2.2  |
| 10  | 2   | 7.6  |
| 10  | 3   | 13.5 |
| 5   | 1   | 3.5  |
| 5   | 2   | 9.7  |
| 3   | 1   | 4.4  |
| 2   | 1   | 5.5  |
Without further calculation, this already suggests that the current threshold is too lenient and that raising it significantly would not cause many real variants to fail. Furthermore, even if a real variant with, say, 15 reads coming in and 2 coming out doesn't get counted as a seeding chain, it is overwhelmingly likely that the chain from which that 15-read chain emerges will be a seeding chain. Thus I think this threshold can be raised to at least 10 with no harm whatsoever to sensitivity. [To be continued. . .]
So now let's think about false positive rates — that is, the probability of an error chain being used for seeding. Note also that this isn't the end of the world. It is not true that an error chain used for seeding starts a chain reaction of every error chain remaining unpruned, since chains still must pass the non-seeding log odds threshold. Nonetheless, it's undesirable.
Suppose we have a coverage of N at some correct chain representing a real haplotype. To get incorrect seeding we must have two errors sharing a kmer. Otherwise the first error's chain will have returned to the non-error chain before the second error's chain begins.
If we have a seeding threshold of 10, the table above shows that we need a coincidence of errors somewhere in the ballpark of 5 for the first and two for the second. At a coverage of 100 and an error rate of 0.001, the chance of this is 1 in ten million per locus. The chance of two or more errors at any given locus is about 5 in 1000, so the chance of the second error site occurring within k = 25 bases is about one in ten. If we have a 500-base window we would then expect a false seeding event one out of every 200 thousand assemblies. Add to that the fairly small chance that the false seeding actually does any harm and I think we can safely say a seeding log odds threshold of 10 (this is all in log space, by the way) incurs no risk of under-pruning.
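For concreteness, here is a rough back-of-envelope version of that arithmetic (my own sketch, treating error counts at a base as binomial with depth 100 and error rate 0.001; the exact numbers depend on the real logLikelihoodRatio model):

```latex
P(\geq 5 \text{ errors at one base}) \approx \binom{100}{5}(0.001)^5 \approx 7.5\times10^{-8} \quad (\sim 1 \text{ in } 10^7)
P(\geq 2 \text{ errors at a nearby base}) \approx \binom{100}{2}(0.001)^2 \approx 5\times10^{-3}
P(\text{such a base within } k = 25 \text{ bases}) \approx 25 \times 5\times10^{-3} \approx 0.1
\text{expected false seedings per 500-base window} \approx 500 \times 10^{-7} \times 0.1 = 5\times10^{-6} \approx 1 \text{ in } 200{,}000
```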
            .filter(v -> vertexToSeedableChains.get(v).size() > 2)
            .forEach(verticesToProcess::add);

    final Set<V> processedVertices = new HashSet<>();
I actually do not think this matters based on how you use these, but for extra safety and as defense against future alterations to this code I would make these LinkedHashSets, especially since you are still iterating over the first one.
done
            .skip(maxUnprunedVariants)
            .forEach(result::add);
            .forEach(errorChains::add);
You know, it occurs to me that the maxUnprunedVariants limit here probably doesn't make sense. If you think about it, there is not a 1:1 mapping of chains to variants after pruning. It's very possible for a "variant" to be a single chain in the simple case, or possibly to correspond to a few distinct "chains", because there could be forked paths with low LOD that get pruned and end up splitting that one good "chain" into a few smaller ones. Furthermore, it seems possible that a perfectly good variant has part of its chain removed at this stage as a result, leaving 1 or potentially 2 dangling ends after pruning and deleting the connection. I don't suspect this happens often enough to worry about here, though it's worth keeping in mind for future changes.
You have a good point, and it's worth fixing now. The number of variants associated with a vertex is the number of good outgoing chains minus one. This works even where there is one good incoming chain, one good outgoing chain, and one bad outgoing chain. Thus we have
number of variants in graph = sum over vertices of max(number of good outgoing chains - 1, 0)
After computing this number of variants we can prune good chains, starting from those with the worst log odds, until the number of variants is low enough. Every time we prune a good chain whose first vertex has at least one other good outgoing chain, we have eliminated one variant from the graph.
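A minimal sketch of that counting, assuming vertexToGoodOutgoingChains is the vertex-to-good-outgoing-chains multimap visible in the diff fragments (the loop itself is just an illustration, not the merged code):

```java
// number of variants = sum over vertices of max(#good outgoing chains - 1, 0)
int numberOfVariantsInGraph = 0;
for (final V vertex : vertexToGoodOutgoingChains.keySet()) {
    numberOfVariantsInGraph += Math.max(vertexToGoodOutgoingChains.get(vertex).size() - 1, 0);
}
```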
Done.
Also, it looks like you missed a few test failures in your last branch.
Force-pushed cfa1b7b to 2f0079d.
Force-pushed e10dada to e0113bb.
Back to @jamesemery.
I have some very small comments and one question that might have a sufficient answer as it stands. I don't want to hold up this branch for much longer, so once you have fixed the few small code cleanup issues and answered (or fixed, if you think it's real) my one question, feel free to merge.
(Additional review comments on ChainPrunerUnitTest.java and AdaptiveChainPruner.java were marked outdated and resolved.)
        final Path<V,E> worstGoodChain = excessGoodChainsToPrune.poll();
        errorChains.add(worstGoodChain);
        // if removing this chain pops a bubble, we have pruned a variant
        if (vertexToGoodOutgoingChains.get(worstGoodChain.getFirstVertex()).size() > 1) {
Hmm... I still think this code carries a risk: it seems possible to me that at this stage you could end up leaving a damaged graph by pruning good chains that are part of a bubble but were not zipped together and appear in different orders.
Imagine you have two variants A and B and they are both "good"; that means 2 bubbles in the graph after pruning, but both variants had a second bubble that was pruned, and thus they are separated into chains A1, A2, B1, and B2. Now say the LOD scores shake out so that the best scores, in order, are A1, B1, B2, A2, and we only allow 1 maxUnprunedVariant. In this circumstance it seems possible to me that we might over-prune the graph in a destructive manner. The code will look at A2 first and remove it (without decrementing numberOfVariantsInGraph), then B2 (again without decrementing numberOfVariantsInGraph), then hit B1 and remove it and arrive at the "correct" number of variants left in the graph (1), with the only chain remaining being that A1 chain. In this situation the pruning code will delete the chains A2, B1, and B2, which could leave the A1 bubble as a dangling end, potentially without enough information to accurately recover it. The absolute safest thing to do in this circumstance would be to "zip" the chains in some capacity before calling this good-chain bubble pruning. This could be done either by reprocessing everything before pruning good chains, so that a bubble is never split across multiple chains that could be zipped, or by fixing the LOD scores when you prune down to a single path.
You're right; this is possible. I changed the logic so that we remove all chains within a broken bubble at once. That is, when we delete A2 we also delete A1. Essentially we are zipping on the fly within the subgraph of chains that have not already been pruned.
Force-pushed 03b1efe to 507e23d.
@jamesemery This fixes the case you found, hopefully bringing us closer to turning on linked de Bruijn graphs. I will start testing M2. If you test in HC, continue to keep in mind that adaptive pruning is not default.
This change will be most important for rare complex graphs and in combination with junction trees, but I did see modest improvements to indel sensitivity even with the old assembly.