Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating SimpleGermlineTagger and somatic CNV experimental post-processing workflow #5252

Merged
merged 8 commits into from
Oct 5, 2018

Conversation

LeeTL1220
Copy link
Contributor

@LeeTL1220 LeeTL1220 commented Oct 3, 2018

Several experimental changes that improve precision results, and expand possible evaluations, of GATK CNV:

  • combine_tracks.wdl for post-processing somatic CNV calls. This wdl will perform two operations:

    • Increase precision by removing:
      • germline segments. As a result, the WDL requires the matched normal segments.
      • Areas of common germline activity or error from other cancer studies.
    • Convert the tumor model seg file to the same format as AllelicCapSeg, which can be read by ABSOLUTE. This is currently done inline in the WDL.
      • This is not a trivial conversion, since each segment must be called whether it is balanced or not (MAF =? 0.5). The current algorithm relies on hard filtering and may need updating pending evaluation.
      • For more information about AllelicCapSeg and ABSOLUTE, see:
        • Carter et al. Absolute quantification of somatic DNA alterations in human cancer, Nat Biotechnol. 2012 May; 30(5): 413–421
        • https://software.broadinstitute.org/cancer/cga/absolute
        • Brastianos, P.K., Carter S.L., et al. Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets (2015) Cancer Discovery PMID:26410082
  • Changes to GATK tools to support the above:

    • SimpleGermlineTagger now uses reciprocal overlap to in addition to breakpoint matching when determining a possible germline event. This greatly improved results in areas near centromeres.
    • Added tool MergeAnnotatedRegionsByAnnotation. This simple tool will merge genomic regions (specified in a tsv) when given annotations (columns) contain exact values in neighboring segments and the segments are within a specified maximum genomic distance.
  • multi_combine_tracks.wdl and aggregate_combine_tracks.wdl which run combine_tracks.wdl on multiple pairs and combine the results into one seg file for easy consumption by IGV.

Prototype WDL and first cut at a workflow.

Making the MergeAnnotatedRegionsByAnnotation tool spit out a GATK seg file.

More automated testing

A little more automated testing.

Adding a simple WDL to convert IGV and concatenate.

Lots of TODO addressed.

More WDL changes.

More WDL changes.

Adding more docs

Adding more docs

More documentation updates.

More documentation updates.

More documentation updates.

More documentation updates.

Some more last minute doc updates.
@LeeTL1220 LeeTL1220 force-pushed the ll_germline_pruning_and_acs_conversion branch from 760ea8c to 3513e38 Compare October 3, 2018 01:16
@LeeTL1220 LeeTL1220 changed the title Updating SimpleGermlineTagger and somatic CNV experimental post-processing workflow (DO NOT MERGE YET) Updating SimpleGermlineTagger and somatic CNV experimental post-processing workflow Oct 3, 2018
@LeeTL1220 LeeTL1220 requested a review from samuelklee October 3, 2018 01:58
@LeeTL1220
Copy link
Contributor Author

@samuelklee As a reminder, everything here is unsupported and/or experimental.

@codecov-io
Copy link

codecov-io commented Oct 3, 2018

Codecov Report

Merging #5252 into master will decrease coverage by 6.127%.
The diff coverage is 89.756%.

@@               Coverage Diff               @@
##              master     #5252       +/-   ##
===============================================
- Coverage     86.757%   80.631%   -6.127%     
+ Complexity     29763     28463     -1300     
===============================================
  Files           1825      1830        +5     
  Lines         137699    138368      +669     
  Branches       15176     15237       +61     
===============================================
- Hits          119464    111567     -7897     
- Misses         12721     21438     +8717     
+ Partials        5514      5363      -151
Impacted Files Coverage Δ Complexity Δ
...bender/tools/copynumber/CallCopyRatioSegments.java 94.118% <ø> (ø) 6 <0> (ø) ⬇️
...nterval/SimpleAnnotatedIntervalWriterUnitTest.java 98.148% <ø> (ø) 17 <0> (ø) ⬇️
.../tools/copynumber/utils/MergeAnnotatedRegions.java 100% <ø> (ø) 3 <0> (ø) ⬇️
...notatedinterval/SimpleAnnotatedIntervalWriter.java 77.778% <ø> (+1.852%) 7 <0> (+1) ⬆️
...number/utils/TagGermlineEventsIntegrationTest.java 7.759% <0%> (-92.241%) 2 <0> (-7)
...titute/hellbender/utils/IntervalUtilsUnitTest.java 91.916% <100%> (+0.204%) 144 <2> (+2) ⬆️
...broadinstitute/hellbender/utils/IntervalUtils.java 91.474% <100%> (+0.123%) 185 <5> (+5) ⬆️
...nder/tools/copynumber/utils/TagGermlineEvents.java 100% <100%> (ø) 3 <0> (ø) ⬇️
.../germlinetagging/SimpleGermlineTaggerUnitTest.java 100% <100%> (ø) 4 <0> (ø) ⬇️
...geAnnotatedRegionsByAnnotationIntegrationTest.java 16.667% <16.667%> (ø) 2 <2> (?)
... and 183 more

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this. A few minor questions/comments on the WDL and Docker images. Will leave it up to you if you want to keep the IGV stuff, but I think it's worth cleaning up now. Didn't look at the Java too closely, but I'll trust that everything is OK there. Back to you, @LeeTL1220!

@@ -5,7 +5,7 @@
# - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the
# GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists).
# These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250)
# and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning,
# and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. Can you do a find-and-replace on other instances? Should be a few more in the other somatic/germline WDLs as well as the docs. Also may be some instances of PreprocessIntervals.padding.

BTW, where are we with Cromwell exposing task-level parameters automatically for subworkflows? Do we still need to expose all parameters in this way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and done for padding as well.

Your comment here is related to broadinstitute/cromwell#2912 ... I have not heard about any update on this.

# Unsupported workflow that concatenates the IGV compatible files generated by multiple runs of combine_tracks.wdl
workflow AggregateCombinedTracksWorkflow {
String group_id
Array[File] tumor_with_germline_pruned_segs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand, this is the seg file with segments that have been tagged as germline removed? If so, then I think "filtered" is better than "pruned", since the latter suggests a tree structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tumor_with_germline_filtered_segs is the new name.

Done


task TsvCat {

String id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some white space funkiness here and throughout the other WDLs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it here. Will be on lookout in other files.


for FILE in ${sep=" " input_files}
do
egrep -v "CONTIG|Chromo" $FILE >> ${id}.aggregated.seg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just go ahead and put Chromosome to avoid any confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

#
# UNSUPPORTED and we may modify this WDL before we fully support it.
#
# This will also generate absolute (https://software.broadinstitute.org/cancer/cga/absolute) compatible files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolute -> ABSOLUTE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

>>>

runtime {
docker: "continuumio/anaconda"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This repo seems a little more legit, but perhaps we should still use our own GCR image if appropriate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I do not want to put our GCR image into anything that is seen outside the Broad.

No action.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth just exposing the Docker images as parameters, then?

Copy link
Contributor Author

@LeeTL1220 LeeTL1220 Oct 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but only on the tasks. These will run fine with the GATK docker image (I have tested), so the workflow will feed that docker image.

More versatile if broadinstitute/cromwell#2912 gets addressed.

sep='\t', comment='@', na_values='NA')
model_segments_af_param_pd = pd.read_csv(model_segments_af_param_input_file, sep='\t', comment='@')

def simple_determine_allelic_fraction(model_segments_seg_pd):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened to the beta fitting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that it did very little on the test data that I had on hand. I do not yet have a good evaluation set up for the ABSOLUTE conversion, which I would like to do before making more changes. We can discuss offline -- I think I have an approach.

No action.

final Map<AnnotatedInterval,CalledCopyRatioSegment.Call> result = new HashMap<>();
for (final AnnotatedInterval normalSeg : nonZeroMergedNormalSegmentsToTumorSegments.keySet()) {
final List<AnnotatedInterval> overlappingTumorSegments = nonZeroMergedNormalSegmentsToTumorSegments.get(normalSeg);

final List<AnnotatedInterval> mergedTumorSegments = mergedRegionsByAnnotation(callAnnotation, overlappingTumorSegments);
final boolean isReciprocolOverlapSeen = mergedTumorSegments.stream()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reciprocol -> Reciprocal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import java.util.stream.Collectors;

public class MergeAnnotatedRegionsByAnnotationIntegrationTest extends CommandLineProgramTest {
private static final String TEST_SUB_DIR = publicTestDir + "org/broadinstitute/hellbender/tools/copynumber/utils/";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use toolsTestDir to clean this up a bit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


final List<AnnotatedInterval> mergedTumorSegments = mergedRegionsByAnnotation(callAnnotation, overlappingTumorSegments);
final boolean isReciprocolOverlapSeen = mergedTumorSegments.stream()
.anyMatch(s -> (normalSeg.getInterval().intersect(s).size() > (s.getInterval().size() * reciprocalThreshold)) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably cleaner to extract a test for reciprocal overlap, but up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done... Added utility to IntervalUtils

@samuelklee
Copy link
Contributor

Thanks for adding that method. But shouldn’t the threshold be in [0, 1]?

@LeeTL1220
Copy link
Contributor Author

@samuelklee I added the range check in reciprocal overlap method. Please take a look at the rest, as I did not directly address all comments (I filed issues).

@LeeTL1220
Copy link
Contributor Author

@samuelklee Back to you... tell me if I am okay to merge.

@LeeTL1220
Copy link
Contributor Author

@samuelklee The repo requires an explicit approval of the review.

@LeeTL1220 LeeTL1220 merged commit 82d1d82 into master Oct 5, 2018
@LeeTL1220 LeeTL1220 deleted the ll_germline_pruning_and_acs_conversion branch October 5, 2018 17:48
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
…ssing workflow (broadinstitute#5252)

Several experimental changes that improve precision results, and expand possible evaluations, of GATK CNV:

- `combine_tracks.wdl` for post-processing somatic CNV calls.  This wdl will perform two operations:
  - Increase precision by removing:
    - germline segments.  As a result, the WDL requires the matched normal segments.
    - Areas of common germline activity or error from other cancer studies.
  - Convert the tumor model seg file to the same format as AllelicCapSeg, which can be read by ABSOLUTE.  This is currently done inline in the WDL.  
    - This is not a trivial conversion, since each segment must be called whether it is balanced or not (MAF =? 0.5).  The current algorithm relies on hard filtering and may need updating pending evaluation.
    - For more information about AllelicCapSeg and ABSOLUTE, see: 
      - Carter et al. *Absolute quantification of somatic DNA alterations in human cancer*, Nat Biotechnol. 2012 May; 30(5): 413–421 
      - https://software.broadinstitute.org/cancer/cga/absolute 
      - Brastianos, P.K., Carter S.L., et al. *Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets* (2015) Cancer Discovery PMID:26410082

- Changes to GATK tools to support the above:
  - `SimpleGermlineTagger` now uses reciprocal overlap to in addition to breakpoint matching when determining a possible germline event.  This greatly improved results in areas near centromeres.
  - Added tool `MergeAnnotatedRegionsByAnnotation`.  This simple tool will merge genomic regions (specified in a tsv) when given annotations (columns) contain exact values in neighboring segments and the segments are within a specified maximum genomic distance.  

- `multi_combine_tracks.wdl` and `aggregate_combine_tracks.wdl` which run `combine_tracks.wdl` on multiple pairs and combine the results into one seg file for easy consumption by IGV.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants