Updating SimpleGermlineTagger and somatic CNV experimental post-processing workflow (#5252)

Several experimental changes that improve precision results, and expand possible evaluations, of GATK CNV:

- `combine_tracks.wdl` for post-processing somatic CNV calls. This WDL performs two operations:
  - Increase precision by removing:
    - Germline segments. As a result, the WDL requires the matched normal segments.
    - Areas of common germline activity or error from other cancer studies.
  - Convert the tumor model seg file to the same format as AllelicCapSeg, which can be read by ABSOLUTE. This is currently done inline in the WDL.
    - This is not a trivial conversion, since each segment must be called as balanced or not (MAF =? 0.5). The current algorithm relies on hard filtering and may need updating pending evaluation.
    - For more information about AllelicCapSeg and ABSOLUTE, see:
      - Carter et al. *Absolute quantification of somatic DNA alterations in human cancer*, Nat Biotechnol. 2012 May; 30(5): 413–421
      - https://software.broadinstitute.org/cancer/cga/absolute
      - Brastianos, P.K., Carter S.L., et al. *Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets* (2015) Cancer Discovery PMID:26410082
- Changes to GATK tools to support the above:
  - `SimpleGermlineTagger` now uses reciprocal overlap in addition to breakpoint matching when determining a possible germline event. This greatly improved results in areas near centromeres.
  - Added tool `MergeAnnotatedRegionsByAnnotation`. This simple tool merges genomic regions (specified in a tsv) when the given annotations (columns) have identical values in neighboring segments and the segments are within a specified maximum genomic distance.
- `multi_combine_tracks.wdl` and `aggregate_combine_tracks.wdl`, which run `combine_tracks.wdl` on multiple pairs and combine the results into one seg file for easy consumption by IGV.
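
The reciprocal-overlap criterion mentioned for `SimpleGermlineTagger` can be illustrated with a minimal Python sketch. This is not the GATK implementation: the function names, the `(contig, start, end)` tuple layout, the thresholds `min_reciprocal_overlap=0.75` and `max_breakpoint_distance=10`, and the choice to accept a match on either criterion are all assumptions made for illustration.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Reciprocal overlap of two closed intervals on the same contig.

    This is the overlap length divided by the longer interval's length,
    i.e. the smaller of the two overlap fractions.
    """
    overlap = min(end_a, end_b) - max(start_a, start_b) + 1
    if overlap <= 0:
        return 0.0
    len_a = end_a - start_a + 1
    len_b = end_b - start_b + 1
    return min(overlap / len_a, overlap / len_b)


def is_possible_germline(tumor_seg, normal_seg,
                         min_reciprocal_overlap=0.75,   # hypothetical threshold
                         max_breakpoint_distance=10):   # hypothetical padding, in bp
    """Tag a tumor segment as a possible germline event (illustrative only).

    Segments are (contig, start, end) tuples. A match is declared when both
    breakpoints agree within the padding, or when the reciprocal overlap
    reaches the threshold.
    """
    t_contig, t_start, t_end = tumor_seg
    n_contig, n_start, n_end = normal_seg
    if t_contig != n_contig:
        return False
    breakpoints_match = (abs(t_start - n_start) <= max_breakpoint_distance
                         and abs(t_end - n_end) <= max_breakpoint_distance)
    overlap_ok = (reciprocal_overlap(t_start, t_end, n_start, n_end)
                  >= min_reciprocal_overlap)
    return breakpoints_match or overlap_ok
```

Under these assumptions, a tumor segment whose start breakpoint is shifted by 100 bp relative to the normal segment fails breakpoint matching but can still be tagged through reciprocal overlap, which is one way segmentation that differs slightly between tumor and normal could still be caught.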
broadinstitute · Oct 5, 2018 · 82d1d82
1 parent 930a0dd
commit 82d1d82
Showing 22 changed files with 1,500 additions and 185 deletions.
diff --git a/scripts/cnv_wdl/germline/cnv_germline_case_workflow.wdl b/scripts/cnv_wdl/germline/cnv_germline_case_workflow.wdl
@@ -5,8 +5,8 @@
 #
 # - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
 #   GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
-#   These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
-#   and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
+#   These intervals will be padded on both sides by the amount specified by padding (default 250)
+#   and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
 #   e.g., for WES).  For WGS, the intervals should simply cover the chromosomes of interest.
 #
 # - Intervals can be blacklisted from coverage collection and all downstream steps by using the blacklist_intervals

diff --git a/scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl b/scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl
@@ -4,8 +4,8 @@
 #
 # - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
 #   GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
-#   These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
-#   and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
+#   These intervals will be padded on both sides by the amount specified by padding (default 250)
+#   and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
 #   e.g., for WES).  For WGS, the intervals should simply cover the chromosomes of interest.
 #
 # - Intervals can be blacklisted from coverage collection and all downstream steps by using the blacklist_intervals

diff --git a/scripts/cnv_wdl/somatic/cnv_somatic_pair_workflow.wdl b/scripts/cnv_wdl/somatic/cnv_somatic_pair_workflow.wdl
@@ -4,8 +4,8 @@
 #
 # - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
 #   GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
-#   These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
-#   and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
+#   These intervals will be padded on both sides by the amount specified by padding (default 250)
+#   and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
 #   e.g., for WES).  For WGS, the intervals should simply cover the autosomal chromosomes (sex chromosomes may be
 #   included, but care should be taken to 1) avoid creating panels of mixed sex, and 2) denoise case samples only
 #   with panels containing only individuals of the same sex as the case samples).

diff --git a/scripts/cnv_wdl/somatic/cnv_somatic_panel_workflow.wdl b/scripts/cnv_wdl/somatic/cnv_somatic_panel_workflow.wdl
@@ -4,8 +4,8 @@
 #
 # - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
 #   GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
-#   These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
-#   and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
+#   These intervals will be padded on both sides by the amount specified by padding (default 250)
+#   and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
 #   e.g., for WES).  For WGS, the intervals should simply cover the autosomal chromosomes (sex chromosomes may be
 #   included, but care should be taken to 1) avoid creating panels of mixed sex, and 2) denoise case samples only
 #   with panels containing only individuals of the same sex as the case samples).

diff --git a/scripts/unsupported/combine_tracks_postprocessing_cnv/aggregate_combined_tracks.wdl b/scripts/unsupported/combine_tracks_postprocessing_cnv/aggregate_combined_tracks.wdl
@@ -0,0 +1,60 @@
+# Unsupported workflow that concatenates the IGV compatible files generated by multiple runs of combine_tracks.wdl
+workflow AggregateCombinedTracksWorkflow {
+    String group_id
+    Array[File] tumor_with_germline_filtered_segs
+    Array[File] normals_igv_compat
+    Array[File] tumors_igv_compat
+
+    call TsvCat as TsvCatTumorGermlinePruned {
+        input:
+            input_files = tumor_with_germline_filtered_segs,
+            id = group_id + "_TumorGermlinePruned"
+    }
+
+    call TsvCat as TsvCatTumor {
+            input:
+                input_files = tumors_igv_compat,
+                id = group_id + "_Tumor"
+    }
+
+    call TsvCat as TsvCatNormal {
+            input:
+                input_files = normals_igv_compat,
+                id = group_id + "_Normal"
+    }
+
+    output {
+        File cnv_postprocessing_aggregated_tumors_pre = TsvCatTumor.aggregated_tsv
+        File cnv_postprocessing_aggregated_tumors_post = TsvCatTumorGermlinePruned.aggregated_tsv
+        File cnv_postprocessing_aggregated_normals = TsvCatNormal.aggregated_tsv
+    }
+}
+
+
+task TsvCat {
+
+	String id
+	Array[File] input_files
+
+	command <<<
+    set -e
+
+    head -1 ${input_files[0]} > ${id}.aggregated.seg
+
+    for FILE in ${sep=" " input_files}
+    do
+        egrep -v "CONTIG|Chromosome" $FILE >> ${id}.aggregated.seg
+    done
+	>>>
+
+	output {
+		File aggregated_tsv="${id}.aggregated.seg"
+	}
+
+	runtime {
+		docker: "ubuntu:16.04"
+		memory: "2 GB"
+        cpu: "1"
+		disks: "local-disk 100 HDD"
+	}
+}
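
For readers less familiar with the shell idiom in `TsvCat`, the same concatenation logic can be sketched in Python. This is an illustrative equivalent, not part of the commit: the function name `aggregate_seg_lines` and the in-memory list-of-lines input are assumptions. It mirrors the task's behavior of taking the first line of the first file as the header and then dropping every line that contains `CONTIG` or `Chromosome`.

```python
# Substrings that the TsvCat task treats as marking a header line.
HEADER_MARKERS = ("CONTIG", "Chromosome")


def aggregate_seg_lines(tables):
    """Concatenate seg tables, keeping a single header line.

    `tables` is a list of tables, each given as a list of lines without
    trailing newlines (hypothetical layout chosen for this sketch).
    """
    if not tables or not tables[0]:
        raise ValueError("at least one non-empty table is required")
    # Equivalent of `head -1` on the first input file.
    aggregated = [tables[0][0]]
    for lines in tables:
        # Equivalent of `egrep -v "CONTIG|Chromosome"`: drop any line that
        # contains either marker anywhere, not only at the line start.
        aggregated.extend(
            line for line in lines
            if not any(marker in line for marker in HEADER_MARKERS)
        )
    return aggregated
```

As in the shell version, the filter is a substring match, so a data row that happens to contain either marker string would also be dropped.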