Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline. #5307
Conversation
@lucidtronix looks like I got an intermittent 10-minute timeout failure in the CNN WDL test.
Going to go ahead and restart that test.
Note that we could probably extract some code for reading and subsetting read counts for both DetermineGermlineContigPloidy and GermlineCNVCaller; see related issue #4004. There is also some duplication in the integration-test code, which is probably not worth cleaning up. @ldgauthier would you mind reviewing?
Filtering code and tests look good. I have some questions about how the different interval lists are used.
"CNVGermlineCohortWorkflow.minimum_mappability": 0.0,
"CNVGermlineCohortWorkflow.maximum_mappability": 1.0,
"CNVGermlineCohortWorkflow.minimum_segmental_duplication_content": 0.0,
"CNVGermlineCohortWorkflow.maximum_segmental_duplication_content": 1.0,
Why are these more general than the defaults in cnv_common_tasks.wdl? Is the goal to have this JSON run the WDL with the filtering effectively turned off?
Yup, the WDL test data is so small (to allow Travis tests to complete in a reasonable time) that it's hard to test realistic filters.
@@ -161,7 +162,7 @@ workflow CNVGermlineCaseWorkflow {

    call CNVTasks.ScatterIntervals {
        input:
-           interval_list = PreprocessIntervals.preprocessed_intervals,
+           interval_list = filtered_intervals,
Just to make sure I have this right -- in the case workflow you're collecting read counts over the full set of intervals (intervals), but then you're running the caller over the filtered intervals? Is that because you want the full data for contig ploidy determination?
It looks like the task GermlineCNVCallerCaseMode gets intervals that it never uses. Ditto annotated_intervals. Is the scatter accomplished by only passing one shard of the model?
> Just to make sure I have this right -- in the case workflow you're collecting read counts over the full set of intervals (intervals), but then you're running the caller over the filtered intervals? Is that because you want the full data for contig ploidy determination?
Coverage collection is performed over all intervals, but ploidy determination and gCNV are run over the filtered intervals. So changing the filtering parameters effectively masks the data for model fitting.
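To make the masking idea concrete, here is a minimal, hypothetical sketch of annotation- and count-based interval filtering. The function name, data shapes, and thresholds are illustrative assumptions, not the actual FilterIntervals implementation or its defaults:

```python
# Hypothetical sketch: keep intervals that pass both annotation thresholds
# (e.g., mappability, segmental-duplication content) and a count-based
# criterion (not too many low-count samples). Illustrative only.

def filter_intervals(intervals, annotations, counts,
                     min_mappability=0.9, max_segdup=0.5,
                     low_count=5, max_low_count_fraction=0.1):
    """Return indices of intervals that pass all filters.

    intervals:   list of interval identifiers
    annotations: dict interval -> {"mappability": float, "segdup": float}
    counts:      dict interval -> list of per-sample read counts
    """
    kept = []
    for i, interval in enumerate(intervals):
        ann = annotations[interval]
        # Annotation-based filtering.
        if ann["mappability"] < min_mappability or ann["segdup"] > max_segdup:
            continue
        # Count-based filtering: drop intervals where too large a fraction
        # of samples have low coverage.
        sample_counts = counts[interval]
        low = sum(1 for c in sample_counts if c < low_count)
        if low / len(sample_counts) > max_low_count_fraction:
            continue
        kept.append(i)
    return kept
```

Because the full counts are kept, tightening or loosening these thresholds changes only which columns of the counts matrix the model sees, exactly the masking behavior described above.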
> It looks like the task GermlineCNVCallerCaseMode gets intervals that it never uses. Ditto annotated_intervals. Is the scatter accomplished by only passing one shard of the model?
Oops, good catch. At some point, I think we switched things so that the intervals for each shard are contained in the corresponding model tar, rather than passing them through -L. I think I removed some vestigial documentation elsewhere in this PR, but forgot to clean up the WDL.
Technically, I guess we don't have to run ScatterIntervals in case mode, since we could just scatter over the model tars. However, this step is cheap, and I think that having the scatter parameters explicit in the case mode JSON is not too much hassle. The user just has to make sure that they use the same parameters for cohort and case mode. Same logic goes for the PreprocessIntervals step. However, we can always change this later so that all necessary results produced in cohort mode are simply taken as input to case mode, if desired. There are pros and cons to both.
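For intuition, the contiguous sharding that a scatter step produces can be sketched as below. This is a rough illustration under stated assumptions: the real ScatterIntervals tool scatters by genomic position and base count, so the function here is not its actual behavior:

```python
# Rough sketch of splitting a filtered interval list into contiguous shards,
# as a scatter step would, so each shard can be processed independently.
# Illustrative only; the actual tool weights shards by genomic content.

def scatter_intervals(intervals, num_shards):
    """Split intervals into at most num_shards roughly equal contiguous chunks."""
    shard_size = -(-len(intervals) // num_shards)  # ceiling division
    return [intervals[i:i + shard_size]
            for i in range(0, len(intervals), shard_size)]
```

The key point from the discussion above is that this split must be reproducible between cohort and case mode (same parameters), since each case-mode shard is matched to a cohort-mode model tar.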
@@ -148,6 +148,18 @@
 *      --output-prefix normal_cohort
 * </pre>
 *
 * <pre>
Can you put a little "with optional intervals" or something before this? I'd hate for users to think they have to run it twice.
Done.
Codecov Report
@@             Coverage Diff              @@
##             master     #5307     +/-   ##
=============================================
+ Coverage     86.78%    86.821%    +0.04%
- Complexity    30001      30197      +196
=============================================
  Files          1838       1845        +7
  Lines        139051     139825      +774
  Branches      15329      15412       +83
=============================================
+ Hits         120669     121397      +728
- Misses        12809      12819       +10
- Partials       5573       5609       +36
Two last (non-blocking) questions:
In case mode you already have the filtered intervals, so why collect coverage over the full set?
Also, in cnv_germline_case_workflow.wdl it really doesn't look like ploidy determination uses the filtered intervals. Are the intervals wrapped into the model tar there too? If that's true, then that should be documented somewhere.
Since coverage collection is relatively expensive, it's better to collect coverage over the same set of intervals for all samples just once and call it a day. Then we can run these samples in whatever mode we please, adjust filtering parameters, etc. in subsequent analyses without having to go back and recollect coverage at any point.
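The collect-once, subset-later design can be sketched as follows. This is a minimal illustration under assumed data shapes (a plain samples-by-intervals matrix), not the actual read-count file handling in the pipeline:

```python
# Illustrative sketch: the expensive counts matrix is collected once over the
# full interval set, then subset to any filtered interval list afterwards,
# so filters can change without re-collecting coverage.

def subset_counts(all_intervals, counts_matrix, filtered_intervals):
    """counts_matrix: rows are samples, columns align with all_intervals."""
    keep = set(filtered_intervals)
    cols = [j for j, iv in enumerate(all_intervals) if iv in keep]
    return [[row[j] for j in cols] for row in counts_matrix]
```

Any later change to the filtering parameters only changes which columns are selected; the original collection step never needs to be rerun.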
Yup, that's correct. Added some more docs.
…tering in the gCNV pipeline. (broadinstitute#5307)

* Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline.
* Addressed PR comments.
* Added some documentation.
Closes #2992.
Closes #4558.