4.1.1.0
Highlights of the 4.1.1.0 release:
- A substantial (~33%) speedup to the HaplotypeCaller in GVCF mode (-ERC GVCF)
- Major updates to Mutect2, including completely overhauled filtering and smarter handling of overlapping read pairs.
- A tensorflow update for CNNScoreVariants that speeds up the tool by roughly 2X when using the 2D model.
- Important updates to the mitochondrial calling pipeline, and improved memory usage in the CNV pipeline.
- Important bug fixes to Funcotator, VariantEval, GenomicsDBImport, and other tools, as well as to the --pedigree argument for annotations.
Docker image: https://hub.docker.com/r/broadinstitute/gatk/
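
For readers who want to try the GVCF-mode speedup highlighted above, the following is a minimal sketch of how a HaplotypeCaller command line in GVCF mode might be assembled from a script. It only builds the argument list and does not run anything. The helper name `haplotypecaller_gvcf_command` and all file names are hypothetical placeholders, and the sketch assumes the `gatk` wrapper script is on the PATH (as it is inside the Docker image). The `-ERC GVCF` and `--force-active` arguments are the ones named in these notes.

```python
import shlex
from typing import List


def haplotypecaller_gvcf_command(
    reference: str,
    input_bam: str,
    output_gvcf: str,
    force_active: bool = False,
    gatk_executable: str = "gatk",  # assumption: the gatk wrapper is on the PATH
) -> List[str]:
    """Build (but do not run) a HaplotypeCaller command line in GVCF mode."""
    command = [
        gatk_executable, "HaplotypeCaller",
        "-R", reference,      # reference FASTA
        "-I", input_bam,      # input reads
        "-O", output_gvcf,    # output GVCF
        "-ERC", "GVCF",       # emit reference confidence in GVCF mode
    ]
    if force_active:
        # Marks all regions as active; intended for debugging/diagnostics.
        command.append("--force-active")
    return command


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = haplotypecaller_gvcf_command(
        "reference.fasta", "sample.bam", "sample.g.vcf.gz"
    )
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` in an environment where GATK is installed. Building the arguments as a list rather than a single string avoids shell-quoting problems with paths that contain spaces.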
Full list of changes:
- HaplotypeCaller
  - Greatly improved the performance of the ReferenceConfidenceModel using dynamic programming and caching (#5607)
    - This speeds up whole-genome GVCF mode calling (-ERC GVCF) by ~33% in our tests!
  - Optimized some additional performance hotspots in the ReferenceConfidenceModel (#5616) (#5469) (#5652)
  - Can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
  - Don't output variants with no ALT allele if the * (spanning deletion) allele gets dropped (#5844)
  - Added a --force-active argument that marks all regions as active. Useful for debugging/diagnostics. (#5635)
  - HaplotypeCallerSpark: made performance improvements to allow the tool to run on WGS in strict mode (#5721)
  - Fixed rare infinite recursion bug in KBestHaplotypeFinder (also affects Mutect2) (#5786)
- Mutect2
  - Overhaul of FilterMutectCalls, which now applies a single threshold to an overall error probability (#5688)
    - FilterMutectCalls automatically determines the optimal threshold.
    - The new somatic clustering model learns tumors' allele fraction spectra and overall SNV and indel mutation rates in order to improve filtering.
    - Includes a rewrite of the Mutect2 documentation, with better organization and command line examples in addition to math.
  - Mutect2 now modifies base and indel qualities of overlapping paired reads to account for PCR error rather than discarding reads (#5794)
    - This especially improves indel sensitivity.
  - Optimized Mutect2 read orientation filtering by collecting F1R2 counts from within Mutect2 itself, greatly reducing wall-clock and CPU time (#5840)
  - New Mutect2 panel of normals workflow using GenomicsDB for scalability (#5675)
    - Panel of normals removes germline variants in order to contain only technical artifacts, and contains information about artifact prevalence
  - Rewrote Mutect2 active region likelihood as a special case of the full somatic likelihoods model, which reduces runtime by 5% (#5814)
  - Funcotator updates in Mutect2 WDL (#5742) (#5735)
  - Prune assembly graph before checking for cycles (#5562)
  - Refactor Mutect2 inheritance so that it doesn't have inactive arguments (#5758)
  - Added CRAM support to the Mutect2 WDL (#5668)
  - Split MNPs in Mutect2 PON WDL, fixing a potential bug (#5706)
  - Handle negative infinity log likelihoods from PairHMM in Mutect2 (#5736)
  - Fixed overfiltering in Mutect2 in GGA alleles mode with no reads (#5743)
  - Correct some Mutect2 VCF header lines (#5792)
  - Handle unmarked duplicates with mate MQ = 0 in Mutect2 (#5734)
  - Output sample names in Mutect2 PON header (#5733)
  - Avoid error due to finite precision error in Mutect2 PON creation (#5797)
  - Update Mutect2 javadoc to reflect v4.1 changes. (#5769)
  - Renamed the OxoGReadCounts annotation to OrientationBiasReadCounts (#5840)
- CNNScoreVariants
  - We now use the latest Intel-optimized tensorflow (#5725)
    - This speeds up the 2D CNN by roughly 2X in our tests!
  - FilterVariantTranches is out of beta (#5628)
  - Fixed CNNScoreVariants hanging when the conda environment is not set up (#5819)
    - We now make sure that the GATK tool Python package is present before executing streaming Python commands.
  - Extensive updates to the CNN WDLs (#5251)
- Mitochondrial Calling Pipeline
  - Added an option to recover all dangling branches, on by default for MT calling (#5693)
    - Fixes a large number of missed calls
  - Use adaptive pruning in the mitochondria pipeline (#5669)
  - Changed defaults in mitochondria mode in response to Mutect2 filtering overhaul (#5827)
  - Allowed the MT pipeline to work on bams with a mix of single and paired-end reads (#5818)
  - Added a hard filter to M2 for polymorphic NuMTs and low VAF sites (#5842)
  - Updated the haplochecker version to 0.1.2 to fix a bug with flipping the major and minor hg headers in its output (#5760)
  - Added the rest of the mitochondria joint-calling pipeline (#5673)
    - Merging and genotyping "somatic" GVCFs from Mutect2
  - Added a read filter for unmapped reads and their mates (#5826)
  - Refactored the MT WDL to make validations easier (#5708)
  - Updated a variable name in MT WDL to match gatk-workflows version (#5694)
- GenotypeGVCFs
  - Added an option to merge intervals for better GenotypeGVCFs performance on GenomicsDB exome input (#5741)
  - Trim per-allele FORMAT annotations and optionally retain raw AS annotations (#5833)
    - GenotypeGVCFs now uses the header info to determine if FORMAT lists need to be subset when alleles are dropped
    - Fixes "F1R2 and F2R2 annotations not updated by GenotypeGvcfs" (#5704)
- Funcotator
  - Non-locatable data sources can create funcotations again (#5774)
    - Fixes a bug where Funcotator was not adding funcotations from non-locatable data sources
  - Fixed handling of symbolic alleles when determining best transcript for GencodeFuncotation creation. (#5834)
  - FilterFuncotations: support for multi-allelic variants (#5588)
  - FilterFuncotations: support for gnomAD for allele frequency in ClinVarFilter and LofFilter, with a new argument telling it which dataset of gnomAD or ExAC to use (#5691)
  - Added # as a character to be sanitized by VCFOutputRenderer (#5817)
  - Added in Markdown files for Funcotator forum posts (#5630)
  - Updated Funcotator documentation with a FAQ section to respond to user comments (#5755)
- CNV Tools
  - Improved memory usage in gCNV (#5781)
  - Improved memory requirements of CollectReadCounts (#5715)
  - Added some fixes for minor CNV issues (#5699)
  - Added io_commons.read_csv to address issues with formatting of sample names in gCNV (#5811)
  - Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation (#5732)
- Miscellaneous Changes
  - SelectVariants can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
  - VariantEval bug fix: don't require the output file to already exist (#5681)
  - Fixed the --pedigree argument in the PossibleDeNovo annotation (#5663)
  - GenomicsDBImport: fixed a core dump when querying overlapping deletions (#5799)
  - GatherPileupSummaries: a new tool that combines the output of GetPileupSummaries from disjoint scatter jobs (#5599)
  - VariantsToTable: add splitting for allele-specific annotations and ADs (#5697)
  - CalculateGenotypePosteriors: fix reported bug where no-call genotypes with no reads get genotype posterior probabilities and calls (#5667)
  - Added a new argument to Spark tools enabling the user to control whether to sort the reads on output (#4874)
  - ReadsPipelineSpark: fixed an "Interval not within the bounds of a contig" error (#5645)
  - Concordance: fixed the tool to allow for no variation alleles in the truth data. (#5718)
  - ReblockGVCF: fix sites with zero AD to actually use SITE-level DP value as intended (#5835)
  - Change UpdateVCFSequenceDictionary to use the specified dictionary uniformly (#5093)
  - Fixed gatk-nightly Docker builds (https://hub.docker.com/r/broadinstitute/gatk-nightly/) (#5759)
  - Print the Picard/HTSJDK versions in addition to the GATK version when running with --version (#5757)
  - IndexFeatureFile: fixed a crash on VCFs with 0 records (#5795)
  - PrintBGZFBlockInformation: removed the file extension check so that we can accept bams (#5801)
  - Added a new read filter: IntervalOverlapReadFilter (#5656)
  - Add NIO Path support to TableReader and TableWriter (#5785)
  - Replaced IntervalsSkipList with OverlapDetector (#4154)
  - Removed some unused arguments in VCF merging code (#5745)
  - Kebab-case some arguments in LocusWalker and LocusWalkerSpark (#5770)
  - Removed an unnecessary IllegalArgumentException in PairHMM (#5705)
  - Removed accidental uses of log4j v1 (#5682)
  - Improvements to Spark evaluation scripts (#5815)
  - Extract tests from PrintReadsIntegrationTest to share with the Spark version. (#5689)
- Documentation
  - Improved the documentation for the StrandOddsRatio annotation (#5703)
  - Fixed the descriptions of some HaplotypeCaller arguments (#5658)
  - Update VariantRecalibrator example code to reflect new tagged argument syntax (#5710)
  - Corrected javadoc for the InbreedingCoeff annotation (#5768)
  - CalculateGenotypePosteriors: minor updates to javadoc and logger type (#5601)
  - Added and updated javadoc for SortSamSpark and MarkDuplicatesSpark (#5672)
  - Added a link to a "GitHub basics for researchers" article at top of the GATK README (#5643)
  - Updated the main GATK README to remove outdated references to the Intel conda environment (#5753)
  - Trimmed overly-long tool one-line summaries to shorten --list display width. (#5551)
- Dependencies