4.1.1.0

@droazen released this 28 Mar 23:06
· 896 commits to master since this release
ea3032d

Highlights of the 4.1.1.0 release:

  • A substantial (~33%) speedup to the HaplotypeCaller in GVCF mode (-ERC GVCF)
  • Major updates to Mutect2, including completely overhauled filtering and smarter handling of overlapping read pairs.
  • A tensorflow update for CNNScoreVariants that speeds up the tool by roughly 2X when using the 2D model.
  • Important updates to the mitochondrial calling pipeline, and improved memory usage in the CNV pipeline.
  • Important bug fixes to Funcotator, VariantEval, GenomicsDBImport, and other tools, as well as to the --pedigree argument for annotations.

Docker image: https://hub.docker.com/r/broadinstitute/gatk/

Full list of changes:

  • HaplotypeCaller

    • Greatly improved the performance of the ReferenceConfidenceModel using dynamic programming and caching (#5607)
      • This speeds up whole-genome GVCF mode calling (-ERC GVCF) by ~33% in our tests!
    • Optimized some additional performance hotspots in the ReferenceConfidenceModel (#5616) (#5469) (#5652)
    • Can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
    • Don't output variants with no ALT allele if the * (spanning deletion) allele gets dropped (#5844)
    • Added a --force-active argument that marks all regions as active. Useful for debugging/diagnostics. (#5635)
    • HaplotypeCallerSpark: made performance improvements to allow the tool to run on WGS in strict mode (#5721)
    • Fixed a rare infinite recursion bug in KBestHaplotypeFinder (also affects Mutect2) (#5786)
  • Mutect2

    • Overhaul of FilterMutectCalls, which now applies a single threshold to an overall error probability (#5688)
      • FilterMutectCalls automatically determines the optimal threshold.
      • The new somatic clustering model learns tumors' allele fraction spectra and overall SNV and indel mutation rates in order to improve filtering.
      • Includes a rewrite of the Mutect2 documentation, which is better organized and now includes command-line examples in addition to the math.
    • Mutect2 now modifies base and indel qualities of overlapping paired reads to account for PCR error rather than discarding reads (#5794)
      • This especially improves indel sensitivity.
    • Optimized Mutect2 read orientation filtering by collecting F1R2 counts from within Mutect2 itself, greatly reducing wall-clock and CPU time (#5840)
    • New Mutect2 panel of normals workflow using GenomicsDB for scalability (#5675)
      • Panel of normals removes germline variants in order to contain only technical artifacts, and contains information about artifact prevalence
    • Rewrote the Mutect2 active region likelihood as a special case of the full somatic likelihoods model, which reduces runtime by 5% (#5814)
    • Funcotator updates in Mutect2 WDL (#5742) (#5735)
    • Prune the assembly graph before checking for cycles (#5562)
    • Refactor Mutect2 inheritance so that it doesn't have inactive arguments (#5758)
    • Added CRAM support to the Mutect2 WDL (#5668)
    • Split MNPs in Mutect2 PON WDL, fixing a potential bug (#5706)
    • Handle negative infinity log likelihoods from PairHMM in Mutect2 (#5736)
    • Fixed overfiltering in Mutect2 in GGA alleles mode with no reads (#5743)
    • Correct some Mutect2 VCF header lines (#5792)
    • Handle unmarked duplicates with mate MQ = 0 in Mutect2 (#5734)
    • Output sample names in Mutect2 PON header (#5733)
    • Avoid an error caused by finite precision in Mutect2 PON creation (#5797)
    • Update Mutect2 javadoc to reflect v4.1 changes. (#5769)
    • Renamed the OxoGReadCounts annotation to OrientationBiasReadCounts (#5840)
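
To illustrate the idea behind the FilterMutectCalls overhaul noted above (a single threshold applied to an overall error probability, chosen automatically), here is a minimal Python sketch. It is not GATK's implementation: the function name `pick_error_threshold` and the choice of expected F1 as the objective are assumptions made for illustration only. Given one error probability per candidate call, it keeps calls at or below each candidate threshold and picks the threshold with the best expected F1.

```python
from typing import Sequence, Tuple


def pick_error_threshold(error_probs: Sequence[float]) -> Tuple[float, float]:
    """Return (threshold, expected_f1) for a list of per-call error probabilities.

    Hypothetical helper for illustration; not part of GATK.
    A call is kept when its error probability is <= threshold.
    """
    probs = sorted(error_probs)
    # Expected number of real variants among all candidates.
    total_true = sum(1.0 - p for p in probs)

    best_threshold, best_f1 = 0.0, 0.0
    expected_tp = 0.0  # expected true positives among kept calls
    expected_fp = 0.0  # expected false positives among kept calls

    for i, p in enumerate(probs):
        expected_tp += 1.0 - p
        expected_fp += p
        # Only evaluate once all calls sharing this probability are included.
        if i + 1 < len(probs) and probs[i + 1] == p:
            continue
        expected_fn = total_true - expected_tp
        denom = 2.0 * expected_tp + expected_fp + expected_fn
        f1 = 2.0 * expected_tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = p, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    threshold, f1 = pick_error_threshold([0.01, 0.02, 0.05, 0.9, 0.95])
    print(f"threshold={threshold}, expected F1={f1:.3f}")
```

In this toy input the three low-probability calls are kept and the two likely artifacts are filtered, because including either of them costs more expected precision than it gains in expected recall.
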
  • CNNScoreVariants

    • We now use the latest Intel-optimized tensorflow (#5725)
      • This speeds up the 2D CNN by roughly 2X in our tests!
    • FilterVariantTranches is out of beta (#5628)
    • Fixed CNNScoreVariants hanging when the conda environment is not set up (#5819)
      • We now make sure that the GATK tool Python package is present before executing streaming Python commands.
    • Extensive updates to the CNN WDLs (#5251)
  • Mitochondrial Calling Pipeline

    • Added an option to recover all dangling branches, on by default for MT calling (#5693)
      • Fixes a large number of missed calls
    • Use adaptive pruning in the mitochondria pipeline (#5669)
    • Changed defaults in mitochondria mode in response to Mutect2 filtering overhaul (#5827)
    • Allowed the MT pipeline to work on bams with a mix of single and paired-end reads (#5818)
    • Added a hard filter to M2 for polymorphic NuMTs and low VAF sites (#5842)
    • Updated the haplochecker version to 0.1.2 to fix a bug with flipping the major and minor hg headers in its output (#5760)
    • Added the rest of the mitochondria joint-calling pipeline (#5673)
      • Merging and genotyping "somatic" GVCFs from Mutect2
    • Added a read filter for unmapped reads and their mates (#5826)
    • Refactored the MT WDL to make validations easier (#5708)
    • Updated a variable name in MT WDL to match gatk-workflows version (#5694)
  • GenotypeGVCFs

    • Added an option to merge intervals for better GenotypeGVCFs performance on GenomicsDB exome input (#5741)
    • Trim per-allele FORMAT annotations and optionally retain raw AS annotations (#5833)
      • GenotypeGVCFs now uses the header info to determine if FORMAT lists need to be subset when alleles are dropped
      • Fixes "F1R2 and F2R2 annotations not updated by GenotypeGvcfs" (#5704)
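
The FORMAT trimming described above can be pictured with a small Python sketch. It relies on the VCF convention that a header line's `Number` field is `A` for one value per ALT allele and `R` for one value per allele including REF. The function `subset_per_allele` and its parameters are hypothetical names for illustration; this is not the GenotypeGVCFs code.

```python
from typing import List, Sequence


def subset_per_allele(values: Sequence, number: str,
                      kept_alt_indices: Sequence[int]) -> List:
    """Subset a per-allele FORMAT value list after ALT alleles are dropped.

    Hypothetical helper for illustration; not part of GATK.
    number: the Number field from the FORMAT header line ("A", "R", or other).
    kept_alt_indices: 0-based positions of the surviving ALT alleles.
    """
    if number == "A":
        # One value per ALT allele.
        return [values[i] for i in kept_alt_indices]
    if number == "R":
        # One value per allele, with REF first; ALT i sits at position i + 1.
        return [values[0]] + [values[i + 1] for i in kept_alt_indices]
    # Any other Number is not per-allele, so it is left unchanged.
    return list(values)


if __name__ == "__main__":
    # Original alleles: REF=A, ALT=[T, *]; the spanning deletion (*) is dropped.
    print(subset_per_allele([10, 4, 0], "R", [0]))
    print(subset_per_allele([0.3, 0.0], "A", [0]))
```

Reading `Number` from the header, rather than hard-coding annotation names, is what lets the same logic cover annotations such as F1R2 without listing each one.
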
  • Funcotator

    • Non-locatable data sources can create funcotations again (#5774)
      • Fixes a bug where Funcotator was not adding funcotations from non-locatable data sources
    • Fixed handling of symbolic alleles when determining the best transcript for GencodeFuncotation creation. (#5834)
    • FilterFuncotations: support for multi-allelic variants (#5588)
    • FilterFuncotations: support for gnomAD allele frequencies in ClinVarFilter and LofFilter, with a new argument for choosing whether to use gnomAD or ExAC (#5691)
    • Added # as a character to be sanitized by VCFOutputRenderer (#5817)
    • Added Markdown files for Funcotator forum posts (#5630)
    • Updated Funcotator documentation with a FAQ section to respond to user comments (#5755)
  • CNV Tools

    • Improved memory usage in gCNV (#5781)
    • Improved memory requirements of CollectReadCounts (#5715)
    • Added some fixes for minor CNV issues (#5699)
    • Added io_commons.read_csv to address issues with formatting of sample names in gCNV (#5811)
    • Added gCNV PROBPROG 2018 extended abstract, archived notes on CNV methods, and deleted some legacy documentation (#5732)
  • Miscellaneous Changes

    • SelectVariants can now write VCF outputs to Google Cloud Storage (GCS) (#5378)
    • VariantEval bug fix: don't require the output file to already exist (#5681)
    • Fixed the --pedigree argument in the PossibleDeNovo annotation (#5663)
    • GenomicsDBImport: fixed a core dump when querying overlapping deletions (#5799)
    • GatherPileupSummaries: a new tool that combines the output of GetPileupSummaries from disjoint scatter jobs (#5599)
    • VariantsToTable: add splitting for allele-specific annotations and ADs (#5697)
    • CalculateGenotypePosteriors: fixed a reported bug where no-call genotypes with no reads were given genotype posterior probabilities and calls (#5667)
    • Added a new argument to Spark tools enabling the user to control whether to sort the reads on output (#4874)
    • ReadsPipelineSpark: fixed an "Interval not within the bounds of a contig" error (#5645)
    • Concordance: fixed the tool to allow for no variation alleles in the truth data. (#5718)
    • ReblockGVCF: fixed sites with zero AD to use the site-level DP value as intended (#5835)
    • Change UpdateVCFSequenceDictionary to use the specified dictionary uniformly (#5093)
    • Fixed gatk-nightly Docker builds (https://hub.docker.com/r/broadinstitute/gatk-nightly/) (#5759)
    • Print the Picard/HTSJDK versions in addition to the GATK version when running with --version (#5757)
    • IndexFeatureFile: fixed a crash on VCFs with 0 records (#5795)
    • PrintBGZFBlockInformation: removed the file extension check so that we can accept bams (#5801)
    • Added a new read filter: IntervalOverlapReadFilter (#5656)
    • Add NIO Path support to TableReader and TableWriter (#5785)
    • Replaced IntervalsSkipList with OverlapDetector (#4154)
    • Removed some unused arguments in VCF merging code (#5745)
    • Kebab-case some arguments in LocusWalker and LocusWalkerSpark (#5770)
    • Removed an unnecessary IllegalArgumentException in PairHMM (#5705)
    • Removed accidental uses of log4j v1 (#5682)
    • Improvements to Spark evaluation scripts (#5815)
    • Extract tests from PrintReadsIntegrationTest to share with the Spark version. (#5689)
  • Documentation

    • Improved the documentation for the StrandOddsRatio annotation (#5703)
    • Fixed the descriptions of some HaplotypeCaller arguments (#5658)
    • Update VariantRecalibrator example code to reflect new tagged argument syntax (#5710)
    • Corrected javadoc for the InbreedingCoeff annotation (#5768)
    • CalculateGenotypePosteriors: minor updates to javadoc and logger type (#5601)
    • Added and updated javadoc for SortSamSpark and MarkDuplicatesSpark (#5672)
    • Added a link to a "GitHub basics for researchers" article at top of the GATK README (#5643)
    • Updated the main GATK README to remove outdated references to the Intel conda environment (#5753)
    • Trimmed overly-long tool one-line summaries to shorten --list display width. (#5551)
  • Dependencies

    • Updated HTSJDK to 2.19.0 (#5812)
    • Updated Picard to 2.19.0 (#5812)
    • Updated Disq to 0.3.0 (#5812)
    • Updated google-cloud-nio to 0.81.0 (#5752)