Skip to content

Commit

Permalink
Adaptive pruning option for local assembly (#5473)
Browse files Browse the repository at this point in the history
  • Loading branch information
davidbenjamin authored Dec 5, 2018
1 parent 6096004 commit 079d34a
Show file tree
Hide file tree
Showing 24 changed files with 676 additions and 317 deletions.
Binary file modified docs/local_assembly.pdf
Binary file not shown.
7 changes: 6 additions & 1 deletion docs/local_assembly.tex
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,12 @@ \section{Building the graph} \label{graph-assembly}
\section{Cleaning the Graph} \label{graph-cleaning}
Before deciding on candidate haplotypes, the assembler simplifies the graph with the following heuristics to remove spurious paths and to merge variant paths that diverge from the reference.
\begin{itemize}
\item pruning: The assembler finds all maximal non-branching subgraphs and removes those that 1) do not share an edge with the reference path and 2) contain no edges with sufficient multiplicity\footnote{By default 2. This is controlled by the \code{minPruning} argument.} While the default multiplicity threshold of 2 is quite permissive, it \textit{does} cause \Mutect~ to lose sensitivity for deletions occurring in a single read\footnote{While a SNV occurring on a single read would not yield a confident somatic variant call, a long deletion in a non-STR context could easily be supported by a single read be due to the tiny probability of its arising from sequencing error.}.
\item pruning: The assembler finds all maximal non-branching subgraphs (``chains") and removes those that 1) do not share an edge with the reference path and 2) contain no edges with sufficient multiplicity\footnote{By default 2. This is controlled by the \code{minPruning} argument.} While the default multiplicity threshold of 2 is quite permissive, it \textit{does} cause \Mutect~ to lose sensitivity for deletions occurring in a single read\footnote{While a SNV occurring on a single read would not yield a confident somatic variant call, a long deletion in a non-STR context could easily be supported by a single read be due to the tiny probability of its arising from sequencing error.}.

There is a command line flag \code{--adaptive-pruning} to turn on an adaptive pruning algorithm that adjusts itself to both the local depth of coverage and the observed sequencing error rate and removes chains based on a likelihood score. The score of a chain is the maximum of a left score and a right score, where the score on left (right) end of the chain is the active region determination log likelihood from the \code{Mutect2Engine}, treating the first (last) edge of the chain as a potential variant reads and all other outgoing (incoming) edges of the first (last) vertex in the chain as ref reads. The adaptive algorithm does this in two passes, where the first pass is used to determine likely errors from which to determine an empirical guess of the error rate.

The adaptive pruning option is extremely useful for samples with high coverage, such as mitochondria and targeted panels, and for samples with variable coverage, such as exomes and RNA.

\item dangling tails: The assembler only outputs haplotypes that start and end with a reference kmer, so it attempts to rescue paths in the graph that do not. To rescue a ``dangling tails" -- a path that ends in a non-reference kmer vertex -- the assembler first traverses the graph backwards from this vertex to a reference vertex. If during traversal it encounters a vertex with more than one incoming edge it gives up\footnote{as opposed to doing eg depth-first search of all possible paths back to the reference.} It also gives up if it encounters a vertex with more than one outgoing edge, that is, if the path branches again after diverging from the reference\footnote{It seems like this could be changed to increase sensitivity.}. Then it generates the Smith-Waterman alignment of the branching path versus the reference path after the vertex at which they diverge. If the alignment's CIGAR contains three or fewer elements, that is, if the alignment has at most one indel, the assembly engine attempts to merge the dangling tail back into the reference.

To merge the dangling tail back into the reference path, the assembler finds the beginning of the maximal common suffix of the dangling path and the reference path, that is, the point at which the sequences coverges\footnote{this is \textit{not} where the \textit{paths in the graph} converge (they don't) because kmers in the suffix disagree with the ref at upstream bases.} and adds an edge between the dangling path's vertex and the reference path's vertex at this position. This means that the graph is no longer a valid de Bruijn graph because the dangling vertex kmer and its succeeding reference vertex kmer do not overlap by $k - 1$ bases. Nonetheless, this graph yields valid haplotypes when we later ``zip'' the graph's chains (see below) by accumulating the last base of each kmer.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,11 @@ public abstract class AssemblyBasedCallerArgumentCollection extends StandardCall
@ArgumentCollection
public AssemblyRegionTrimmerArgumentCollection assemblyRegionTrimmerArgs = new AssemblyRegionTrimmerArgumentCollection();

protected boolean useMutectAssemblerArgumentCollection() { return false; }

@ArgumentCollection
public ReadThreadingAssemblerArgumentCollection assemblerArgs = new ReadThreadingAssemblerArgumentCollection();
public ReadThreadingAssemblerArgumentCollection assemblerArgs = useMutectAssemblerArgumentCollection() ?
new MutectReadThreadingAssemblerArgumentCollection() : new HaplotypeCallerReadThreadingAssemblerArgumentCollection();

@ArgumentCollection
public LikelihoodEngineArgumentCollection likelihoodArgs = new LikelihoodEngineArgumentCollection();
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -189,19 +189,10 @@ public static ReadLikelihoodCalculationEngine createLikelihoodCalculationEngine(

public static ReadThreadingAssembler createReadThreadingAssembler(final AssemblyBasedCallerArgumentCollection args) {
final ReadThreadingAssemblerArgumentCollection rtaac = args.assemblerArgs;
final ReadThreadingAssembler assemblyEngine = new ReadThreadingAssembler(rtaac.maxNumHaplotypesInPopulation, rtaac.kmerSizes, rtaac.dontIncreaseKmerSizesForCycles, rtaac.allowNonUniqueKmersInRef, rtaac.numPruningSamples);
assemblyEngine.setErrorCorrectKmers(rtaac.errorCorrectKmers);
assemblyEngine.setPruneFactor(rtaac.minPruneFactor);
final ReadThreadingAssembler assemblyEngine = rtaac.makeReadThreadingAssembler();
assemblyEngine.setDebug(args.debug);
assemblyEngine.setDebugGraphTransformations(rtaac.debugGraphTransformations);
assemblyEngine.setRecoverDanglingBranches(!rtaac.doNotRecoverDanglingBranches);
assemblyEngine.setMinDanglingBranchLength(rtaac.minDanglingBranchLength);
assemblyEngine.setMinBaseQualityToUseInAssembly(args.minBaseQualityScore);

if ( rtaac.graphOutput != null ) {
assemblyEngine.setGraphWriter(new File(rtaac.graphOutput));
}

return assemblyEngine;
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -263,7 +263,7 @@ private void validateAndInitializeArgs() {
hcArgs.setSampleContamination(AlleleBiasedDownsamplingUtils.loadContaminationFile(hcArgs.CONTAMINATION_FRACTION_FILE, hcArgs.CONTAMINATION_FRACTION, sampleSet, logger));
}

if ( hcArgs.genotypingOutputMode == GenotypingOutputMode.GENOTYPE_GIVEN_ALLELES && hcArgs.assemblerArgs.consensusMode ) {
if ( hcArgs.genotypingOutputMode == GenotypingOutputMode.GENOTYPE_GIVEN_ALLELES && hcArgs.assemblerArgs.consensusMode() ) {
throw new UserException("HaplotypeCaller cannot be run in both GENOTYPE_GIVEN_ALLELES mode and in consensus mode at the same time. Please choose one or the other.");
}

Expand Down Expand Up @@ -604,7 +604,7 @@ public List<VariantContext> callRegion(final AssemblyRegion region, final Featur
assemblyResult.getPaddedReferenceLoc(),
regionForGenotyping.getSpan(),
features,
(hcArgs.assemblerArgs.consensusMode ? Collections.<VariantContext>emptyList() : givenAlleles),
(hcArgs.assemblerArgs.consensusMode() ? Collections.<VariantContext>emptyList() : givenAlleles),
emitReferenceConfidence(),
hcArgs.maxMnpDistance,
readsHeader,
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
package org.broadinstitute.hellbender.tools.walkers.haplotypecaller;

import org.broadinstitute.barclay.argparser.Advanced;
import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.Hidden;
import org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler;

import java.io.File;

public class HaplotypeCallerReadThreadingAssemblerArgumentCollection extends ReadThreadingAssemblerArgumentCollection {
private static final long serialVersionUID = 6520834L;
/**
* A single edge multiplicity cutoff for pruning doesn't work in samples with variable depths, for example exomes
* and RNA. This parameter enables the probabilistic algorithm for pruning the assembly graph that considers the
* likelihood that each chain in the graph comes from real variation.
*/
@Advanced
@Argument(fullName="adaptive-pruning", doc = "Use Mutect2's adaptive graph pruning algorithm", optional = true)
public boolean useAdaptivePruning = false;

/**
* By default, the read threading assembler will attempt to recover dangling heads and tails. See the `minDanglingBranchLength` argument documentation for more details.
*/
@Hidden
@Argument(fullName="do-not-recover-dangling-branches", doc="Disable dangling head and tail recovery", optional = true)
public boolean doNotRecoverDanglingBranches = false;

/**
* As of version 3.3, this argument is no longer needed because dangling end recovery is now the default behavior. See GATK 3.3 release notes for more details.
*/
@Deprecated
@Argument(fullName="recover-dangling-heads", doc="This argument is deprecated since version 3.3", optional = true)
public boolean DEPRECATED_RecoverDanglingHeads = false;

/**
* This argument is specifically intended for 1000G consensus analysis mode. Setting this flag will inject all
* provided alleles to the assembly graph but will not forcibly genotype all of them.
*/
@Advanced
@Argument(fullName="consensus", doc="1000G consensus mode", optional = true)
public boolean consensusMode = false;

@Override
public ReadThreadingAssembler makeReadThreadingAssembler() {
final ReadThreadingAssembler assemblyEngine = new ReadThreadingAssembler(maxNumHaplotypesInPopulation, kmerSizes,
dontIncreaseKmerSizesForCycles, allowNonUniqueKmersInRef, numPruningSamples, useAdaptivePruning ? 0 : minPruneFactor,
useAdaptivePruning, initialErrorRateForPruning, pruningLog10OddsThreshold, maxUnprunedVariants);
assemblyEngine.setDebugGraphTransformations(debugGraphTransformations);
assemblyEngine.setRecoverDanglingBranches(!doNotRecoverDanglingBranches);
assemblyEngine.setMinDanglingBranchLength(minDanglingBranchLength);

if ( graphOutput != null ) {
assemblyEngine.setGraphWriter(new File(graphOutput));
}

return assemblyEngine;
}

@Override
public boolean consensusMode() { return consensusMode; }
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
package org.broadinstitute.hellbender.tools.walkers.haplotypecaller;

import org.broadinstitute.barclay.argparser.Advanced;
import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler;

import java.io.File;

public class MutectReadThreadingAssemblerArgumentCollection extends ReadThreadingAssemblerArgumentCollection {
private static final long serialVersionUID = 5304L;

/**
* A single edge multiplicity cutoff for pruning doesn't work in samples with variable depths, for example exomes
* and RNA. This parameter disables the probabilistic algorithm for pruning the assembly graph that considers the
* likelihood that each chain in the graph comes from real variation, and instead uses a simple multiplicity cutoff.
*/
@Advanced
@Argument(fullName="disable-adaptive-pruning", doc = "Disable the adaptive algorithm for pruning paths in the graph", optional = true)
public boolean disableAdaptivePruning = false;

@Override
public ReadThreadingAssembler makeReadThreadingAssembler() {
final ReadThreadingAssembler assemblyEngine = new ReadThreadingAssembler(maxNumHaplotypesInPopulation, kmerSizes,
dontIncreaseKmerSizesForCycles, allowNonUniqueKmersInRef, numPruningSamples, disableAdaptivePruning ? minPruneFactor : 0,
!disableAdaptivePruning, initialErrorRateForPruning, pruningLog10OddsThreshold, maxUnprunedVariants);
assemblyEngine.setDebugGraphTransformations(debugGraphTransformations);
assemblyEngine.setRecoverDanglingBranches(true);
assemblyEngine.setMinDanglingBranchLength(minDanglingBranchLength);

if ( graphOutput != null ) {
assemblyEngine.setGraphWriter(new File(graphOutput));
}

return assemblyEngine;
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,15 @@
import org.broadinstitute.barclay.argparser.Advanced;
import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.Hidden;
import org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler;

import java.io.Serializable;
import java.util.List;

/**
* Set of arguments related to the {@link org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler}
*/
public final class ReadThreadingAssemblerArgumentCollection implements Serializable {
public abstract class ReadThreadingAssemblerArgumentCollection implements Serializable {
private static final long serialVersionUID = 1L;

// -----------------------------------------------------------------------------------------------
Expand Down Expand Up @@ -48,20 +49,6 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
@Argument(fullName="num-pruning-samples", doc="Number of samples that must pass the minPruning threshold", optional = true)
public int numPruningSamples = 1;

/**
* As of version 3.3, this argument is no longer needed because dangling end recovery is now the default behavior. See GATK 3.3 release notes for more details.
*/
@Deprecated
@Argument(fullName="recover-dangling-heads", doc="This argument is deprecated since version 3.3", optional = true)
public boolean DEPRECATED_RecoverDanglingHeads = false;

/**
* By default, the read threading assembler will attempt to recover dangling heads and tails. See the `minDanglingBranchLength` argument documentation for more details.
*/
@Hidden
@Argument(fullName="do-not-recover-dangling-branches", doc="Disable dangling head and tail recovery", optional = true)
public boolean doNotRecoverDanglingBranches = false;

/**
* When constructing the assembly graph we are often left with "dangling" branches. The assembly engine attempts to rescue these branches
* by merging them back into the main graph. This argument describes the minimum length of a dangling branch needed for the engine to
Expand All @@ -71,13 +58,7 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
@Argument(fullName="min-dangling-branch-length", doc="Minimum length of a dangling branch to attempt recovery", optional = true)
public int minDanglingBranchLength = 4;

/**
* This argument is specifically intended for 1000G consensus analysis mode. Setting this flag will inject all
* provided alleles to the assembly graph but will not forcibly genotype all of them.
*/
@Advanced
@Argument(fullName="consensus", doc="1000G consensus mode", optional = true)
public boolean consensusMode = false;


/**
* The assembly graph can be quite complex, and could imply a very large number of possible haplotypes. Each haplotype
Expand All @@ -91,13 +72,6 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
@Argument(fullName="max-num-haplotypes-in-population", doc="Maximum number of haplotypes to consider for your population", optional = true)
public int maxNumHaplotypesInPopulation = 128;

/**
* Enabling this argument may cause fundamental problems with the assembly graph itself.
*/
@Hidden
@Argument(fullName="error-correct-kmers", doc = "Use an exploratory algorithm to error correct the kmers used during assembly", optional = true)
public boolean errorCorrectKmers = false;

/**
* Paths with fewer supporting kmers than the specified threshold will be pruned from the graph.
*
Expand All @@ -111,6 +85,28 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
@Argument(fullName="min-pruning", doc = "Minimum support to not prune paths in the graph", optional = true)
public int minPruneFactor = 2;

/**
* Initial base error rate guess for the probabilistic adaptive pruning model. Results are not very sensitive to this
* parameter because it is only a starting point from which the algorithm discovers the true error rate.
*/
@Advanced
@Argument(fullName="adaptive-pruning-initial-error-rate", doc = "Initial base error rate estimate for adaptive pruning", optional = true)
public double initialErrorRateForPruning = 0.001;

/**
* Log-10 likelihood ratio threshold for adaptive pruning algorithm.
*/
@Advanced
@Argument(fullName="pruning-lod-threshold", doc = "Log-10 likelihood ratio threshold for adaptive pruning algorithm", optional = true)
public double pruningLog10OddsThreshold = 1.0;

/**
* The maximum number of variants in graph the adaptive pruner will allow
*/
@Advanced
@Argument(fullName="max-unpruned-variants", doc = "Maximum number of variants in graph the adaptive pruner will allow", optional = true)
public int maxUnprunedVariants = 100;

@Hidden
@Argument(fullName="debug-graph-transformations", doc="Write DOT formatted graph files out of the assembler for only this graph size", optional = true)
public boolean debugGraphTransformations = false;
Expand All @@ -137,4 +133,8 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
@Hidden
@Argument(fullName="min-observations-for-kmer-to-be-solid", doc = "A k-mer must be seen at least these times for it considered to be solid", optional = true)
public int minObservationsForKmerToBeSolid = 20;

public abstract ReadThreadingAssembler makeReadThreadingAssembler();

public boolean consensusMode() { return false; }
}
Loading

0 comments on commit 079d34a

Please sign in to comment.