Updating SimpleGermlineTagger and somatic CNV experimental post-processing workflow (#5252)

Several experimental changes that improve the precision of GATK CNV results and expand the evaluations that can be run on them:

- `combine_tracks.wdl` for post-processing somatic CNV calls. This WDL performs two operations:
  - Increase precision by removing:
    - Germline segments. As a result, the WDL requires the matched normal segments.
    - Areas of common germline activity or error from other cancer studies.
  - Convert the tumor model seg file to the same format as AllelicCapSeg, which can be read by ABSOLUTE. This is currently done inline in the WDL.
    - This is not a trivial conversion, since each segment must be called as balanced or not (i.e., whether MAF = 0.5). The current algorithm relies on hard filtering and may need updating pending evaluation.
    - For more information about AllelicCapSeg and ABSOLUTE, see: 
      - Carter et al. *Absolute quantification of somatic DNA alterations in human cancer*, Nat Biotechnol. 2012 May; 30(5): 413–421 
      - https://software.broadinstitute.org/cancer/cga/absolute 
      - Brastianos, P.K., Carter S.L., et al. *Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets* (2015) Cancer Discovery PMID:26410082

- Changes to GATK tools to support the above:
  - `SimpleGermlineTagger` now uses reciprocal overlap in addition to breakpoint matching when determining a possible germline event. This greatly improved results in areas near centromeres.
  - Added tool `MergeAnnotatedRegionsByAnnotation`. This simple tool merges genomic regions (specified in a TSV) when the given annotations (columns) have exactly matching values in neighboring segments and the segments are within a specified maximum genomic distance.

- `multi_combine_tracks.wdl` and `aggregate_combine_tracks.wdl`, which run `combine_tracks.wdl` on multiple pairs and combine the results into one seg file for easy consumption by IGV.
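
The two tool changes described above can be illustrated with a small Python sketch. This is not the GATK implementation: the `Segment` type, the function names, the 1-based inclusive coordinates, and the definition of genomic distance as `next.start - prev.end` are all assumptions made for illustration. Reciprocal overlap is taken here in its usual sense, the overlap length divided by each segment's length, keeping the smaller fraction.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    """Hypothetical segment record; coordinates are assumed 1-based and inclusive."""
    contig: str
    start: int
    end: int
    annotations: Dict[str, str]


def reciprocal_overlap(a: Segment, b: Segment) -> float:
    """Return the smaller of (overlap / length of a) and (overlap / length of b)."""
    if a.contig != b.contig:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start) + 1
    if overlap <= 0:
        return 0.0
    len_a = a.end - a.start + 1
    len_b = b.end - b.start + 1
    return min(overlap / len_a, overlap / len_b)


def merge_by_annotation(segments: List[Segment], keys: List[str],
                        max_distance: int) -> List[Segment]:
    """Merge neighboring segments whose `keys` annotations match exactly.

    Two neighbors are merged only if they share a contig and the gap between
    them (assumed here to be next.start - prev.end) is at most max_distance.
    The merged segment keeps the annotations of the first segment in the run.
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: (s.contig, s.start)):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.contig == seg.contig
                and seg.start - prev.end <= max_distance
                and all(prev.annotations.get(k) == seg.annotations.get(k) for k in keys)):
            # Extend the previous segment instead of starting a new one.
            merged[-1] = prev._replace(end=max(prev.end, seg.end))
        else:
            merged.append(seg)
    return merged
```

A tumor segment would then be flagged as a possible germline event when `reciprocal_overlap(tumor_seg, normal_seg)` meets some chosen threshold; the threshold value is left to the caller because the commit message does not state one.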
LeeTL1220 authored Oct 5, 2018
1 parent 930a0dd commit 82d1d82
Showing 22 changed files with 1,500 additions and 185 deletions.
4 changes: 2 additions & 2 deletions scripts/cnv_wdl/germline/cnv_germline_case_workflow.wdl
@@ -5,8 +5,8 @@
#
# - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
# GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
-# These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
-# and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
+# These intervals will be padded on both sides by the amount specified by padding (default 250)
+# and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
# e.g., for WES). For WGS, the intervals should simply cover the chromosomes of interest.
#
# - Intervals can be blacklisted from coverage collection and all downstream steps by using the blacklist_intervals
4 changes: 2 additions & 2 deletions scripts/cnv_wdl/germline/cnv_germline_cohort_workflow.wdl
@@ -4,8 +4,8 @@
#
# - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
# GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
-# These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
-# and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
+# These intervals will be padded on both sides by the amount specified by padding (default 250)
+# and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
# e.g., for WES). For WGS, the intervals should simply cover the chromosomes of interest.
#
# - Intervals can be blacklisted from coverage collection and all downstream steps by using the blacklist_intervals
4 changes: 2 additions & 2 deletions scripts/cnv_wdl/somatic/cnv_somatic_pair_workflow.wdl
@@ -4,8 +4,8 @@
#
# - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
# GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
-# These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
-# and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
+# These intervals will be padded on both sides by the amount specified by padding (default 250)
+# and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
# e.g., for WES). For WGS, the intervals should simply cover the autosomal chromosomes (sex chromosomes may be
# included, but care should be taken to 1) avoid creating panels of mixed sex, and 2) denoise case samples only
# with panels containing only individuals of the same sex as the case samples).
4 changes: 2 additions & 2 deletions scripts/cnv_wdl/somatic/cnv_somatic_panel_workflow.wdl
@@ -4,8 +4,8 @@
#
# - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
# GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
-# These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
-# and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
+# These intervals will be padded on both sides by the amount specified by padding (default 250)
+# and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
# e.g., for WES). For WGS, the intervals should simply cover the autosomal chromosomes (sex chromosomes may be
# included, but care should be taken to 1) avoid creating panels of mixed sex, and 2) denoise case samples only
# with panels containing only individuals of the same sex as the case samples).
@@ -0,0 +1,60 @@
# Unsupported workflow that concatenates the IGV compatible files generated by multiple runs of combine_tracks.wdl
workflow AggregateCombinedTracksWorkflow {
    String group_id
    Array[File] tumor_with_germline_filtered_segs
    Array[File] normals_igv_compat
    Array[File] tumors_igv_compat

    call TsvCat as TsvCatTumorGermlinePruned {
        input:
            input_files = tumor_with_germline_filtered_segs,
            id = group_id + "_TumorGermlinePruned"
    }

    call TsvCat as TsvCatTumor {
        input:
            input_files = tumors_igv_compat,
            id = group_id + "_Tumor"
    }

    call TsvCat as TsvCatNormal {
        input:
            input_files = normals_igv_compat,
            id = group_id + "_Normal"
    }

    output {
        File cnv_postprocessing_aggregated_tumors_pre = TsvCatTumor.aggregated_tsv
        File cnv_postprocessing_aggregated_tumors_post = TsvCatTumorGermlinePruned.aggregated_tsv
        File cnv_postprocessing_aggregated_normals = TsvCatNormal.aggregated_tsv
    }
}


task TsvCat {

    String id
    Array[File] input_files

    command <<<
        set -e

        head -1 ${input_files[0]} > ${id}.aggregated.seg

        for FILE in ${sep=" " input_files}
        do
            egrep -v "CONTIG|Chromosome" $FILE >> ${id}.aggregated.seg
        done
    >>>

    output {
        File aggregated_tsv="${id}.aggregated.seg"
    }

    runtime {
        docker: "ubuntu:16.04"
        memory: "2 GB"
        cpu: "1"
        disks: "local-disk 100 HDD"
    }
}
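
For readers who want to try the `TsvCat` logic outside of a WDL runner, the following Python sketch mirrors the shell command above: it copies the first line of the first file as the header, then appends every line from every file that does not contain `CONTIG` or `Chromosome`. The function name `tsv_cat` and its return value are assumptions made for illustration and are not part of the commit.

```python
from pathlib import Path
from typing import Iterable, List

# Same tokens that the egrep -v pattern in the task uses to drop header lines.
HEADER_TOKENS = ("CONTIG", "Chromosome")


def tsv_cat(input_files: Iterable[str], output_path: str) -> int:
    """Concatenate seg files under a single header; return the number of data rows written."""
    paths: List[Path] = [Path(p) for p in input_files]
    if not paths:
        raise ValueError("at least one input file is required")

    rows_written = 0
    with open(output_path, "w") as out:
        # Equivalent of `head -1`: take the header from the first file only.
        with open(paths[0]) as first:
            out.write(first.readline())

        # Equivalent of the egrep -v loop: skip any line containing a header token.
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    if any(token in line for token in HEADER_TOKENS):
                        continue
                    out.write(line)
                    rows_written += 1
    return rows_written
```

One behavioral difference is worth noting. In the shell version, `egrep -v` exits with status 1 when it selects no lines, so under `set -e` an input that contains only a header line would abort the task. The sketch simply writes nothing for such a file.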