
Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline. #5307

Merged

samuelklee merged 3 commits into master from sl_filter on Oct 22, 2018

Conversation

samuelklee
Contributor

Closes #2992.
Closes #4558.

@samuelklee
Contributor Author

@lucidtronix looks like I got an intermittent 10-minute timeout failure in the CNN WDL test.

@samuelklee
Contributor Author

Going to go ahead and restart that test.

@samuelklee
Contributor Author

samuelklee commented Oct 17, 2018

Note that we could probably extract some shared code for reading and subsetting read counts in both DetermineGermlineContigPloidy and GermlineCNVCaller (see related issue #4004). There is also some duplication in the integration-test code, which is probably not worth cleaning up.

@ldgauthier would you mind reviewing?

@samuelklee requested a review from ldgauthier on October 17, 2018 13:31
Contributor

@ldgauthier left a comment

Filtering code and tests look good. I have some questions about how the different interval lists are used.

"CNVGermlineCohortWorkflow.minimum_mappability": 0.0,
"CNVGermlineCohortWorkflow.maximum_mappability": 1.0,
"CNVGermlineCohortWorkflow.minimum_segmental_duplication_content": 0.0,
"CNVGermlineCohortWorkflow.maximum_segmental_duplication_content": 1.0,
Contributor

Why are these more general than the defaults in cnv_common_tasks.wdl? Is the goal to have this json run the WDL with the filtering effectively turned off?

Contributor Author

Yup, the WDL test data is so small (to allow Travis tests to complete in a reasonable time) that it's hard to test realistic filters.
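For readers skimming this thread: with the fully permissive bounds above (0.0 to 1.0), no interval can fail the annotation checks, so the filtering is effectively a no-op. A minimal Python sketch of this kind of threshold filtering (hypothetical field names and data layout, not GATK's actual implementation):

```python
def passes_annotation_filters(interval,
                              minimum_mappability=0.0, maximum_mappability=1.0,
                              minimum_segdup=0.0, maximum_segdup=1.0):
    """Keep an interval only if its annotations fall within the given bounds.

    With the permissive defaults (0.0 to 1.0), every interval passes,
    i.e. annotation-based filtering is effectively turned off."""
    return (minimum_mappability <= interval["mappability"] <= maximum_mappability
            and minimum_segdup <= interval["segdup_content"] <= maximum_segdup)

intervals = [
    {"name": "1:1000-2000", "mappability": 0.95, "segdup_content": 0.10},
    {"name": "1:3000-4000", "mappability": 0.40, "segdup_content": 0.85},
]

# Permissive bounds: nothing is filtered out.
kept_all = [iv for iv in intervals if passes_annotation_filters(iv)]

# Realistic bounds: the low-mappability, high-segdup interval is dropped.
kept_strict = [iv for iv in intervals
               if passes_annotation_filters(iv, minimum_mappability=0.9,
                                            maximum_segdup=0.5)]
```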

@@ -161,7 +162,7 @@ workflow CNVGermlineCaseWorkflow {

call CNVTasks.ScatterIntervals {
input:
-            interval_list = PreprocessIntervals.preprocessed_intervals,
+            interval_list = filtered_intervals,
Contributor

Just to make sure I have this right -- in the case workflow you're collecting read counts over the full set of intervals (intervals), but then you're running the caller over the filtered intervals? Is that because you want the full data for contig ploidy determination?

Contributor

It looks like the task GermlineCNVCallerCaseMode gets intervals that it never uses. Ditto annotated_intervals. Is the scatter accomplished by only passing one shard of the model?

Contributor Author

> Just to make sure I have this right -- in the case workflow you're collecting read counts over the full set of intervals (intervals), but then you're running the caller over the filtered intervals? Is that because you want the full data for contig ploidy determination?

Coverage collection is performed over all intervals, but ploidy determination and gCNV are run over the filtered intervals. So changing the filtering parameters effectively masks the data for model fitting.
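A toy sketch of this masking idea (hypothetical interval names and data layout, not the actual tool internals): coverage is collected once over all intervals, and filtering simply selects which rows feed the models:

```python
# Coverage is collected once over ALL preprocessed intervals ...
all_counts = {
    "1:1000-2000": 130,
    "1:3000-4000": 5,     # e.g. fails a count-based filter
    "1:5000-6000": 122,
}

# ... and filtering produces a subset of intervals ...
filtered_intervals = ["1:1000-2000", "1:5000-6000"]

# ... so ploidy determination and gCNV model fitting just mask the full
# data to that subset; changing the filtering parameters never requires
# recollecting coverage.
masked_counts = {iv: all_counts[iv] for iv in filtered_intervals}
```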

> It looks like the task GermlineCNVCallerCaseMode gets intervals that it never uses. Ditto annotated_intervals. Is the scatter accomplished by only passing one shard of the model?

Oops, good catch. At some point, I think we switched things so that the intervals for each shard are contained in the corresponding model tar, rather than passing them through -L. I think I removed some vestigial documentation elsewhere in this PR, but forgot to clean up the WDL.

Technically, I guess we don't have to run ScatterIntervals in case mode, since we could just scatter over the model tars. However, this step is cheap, and I think that having the scatter parameters explicit in the case mode JSON is not too much hassle. The user just has to make sure that they use the same parameters for cohort and case mode. Same logic goes for the PreprocessIntervals step. However, we can always change this later so that all necessary results produced in cohort mode are simply taken as input to case mode, if desired. There are pros and cons to both.
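As a rough illustration of what a ScatterIntervals-style step does (a hypothetical sketch, not the actual tool implementation), splitting an interval list into near-equal contiguous shards might look like:

```python
def scatter_intervals(intervals, num_shards):
    """Split an interval list into num_shards near-equal contiguous shards,
    mirroring the kind of sharding a scatter step performs."""
    base, extra = divmod(len(intervals), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # Spread any remainder over the first few shards.
        size = base + (1 if i < extra else 0)
        shards.append(intervals[start:start + size])
        start += size
    return shards

# Ten toy intervals scattered into three shards.
shards = scatter_intervals(
    [f"1:{i * 1000}-{i * 1000 + 999}" for i in range(10)], 3)
```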

@@ -148,6 +148,18 @@
* --output-prefix normal_cohort
* </pre>
*
* <pre>
Contributor

Can you put a little "with optional intervals" or something before this? I'd hate for users to think they have to run it twice.

Contributor Author

Done.

@codecov-io

Codecov Report

Merging #5307 into master will increase coverage by 0.04%.
The diff coverage is 94.709%.

@@              Coverage Diff              @@
##             master     #5307      +/-   ##
=============================================
+ Coverage     86.78%   86.821%   +0.04%     
- Complexity    30001     30197     +196     
=============================================
  Files          1838      1845       +7     
  Lines        139051    139825     +774     
  Branches      15329     15412      +83     
=============================================
+ Hits         120669    121397     +728     
- Misses        12809     12819      +10     
- Partials       5573      5609      +36
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...s/copynumber/DenoiseReadCountsIntegrationTest.java | 97.674% <ø> (-0.053%) | 15 <0> (ø) |
| ...ols/copynumber/FilterIntervalsIntegrationTest.java | 100% <100%> (ø) | 23 <23> (?) |
| ...hellbender/tools/copynumber/GermlineCNVCaller.java | 81.373% <77.778%> (-3.031%) | 10 <1> (ø) |
| ...ools/copynumber/DetermineGermlineContigPloidy.java | 91.209% <85%> (-2.839%) | 15 <1> (+1) |
| ...e/hellbender/tools/copynumber/FilterIntervals.java | 89.855% <89.855%> (ø) | 44 <44> (?) |
| ...r/arguments/CopyNumberArgumentValidationUtils.java | 77.778% <95.455%> (+18.357%) ⬆️ | 20 <3> (+3) |
| ...der/tools/walkers/mutect/M2ArgumentCollection.java | 87.5% <0%> (-7.237%) | 8% <0%> (+5%) |
| ...itute/hellbender/tools/walkers/mutect/Mutect2.java | 81.395% <0%> (-6.483%) | 23% <0%> (+4%) |
| ...bender/tools/walkers/mutect/FilterMutectCalls.java | 92.5% <0%> (-4.722%) | 14% <0%> (+2%) |
| ...nstitute/hellbender/utils/gcs/BucketUtilsTest.java | 56.303% <0%> (-3.103%) | 13% <0%> (+1%) |
| ... and 36 more | | |

Contributor

@ldgauthier left a comment

Two last (non-blocking) questions:

1. In case mode you already have the filtered intervals, so why collect coverage over the full set?
2. Also, in cnv_germline_case_workflow.wdl it really doesn't look like ploidy determination uses the filtered intervals. Are the intervals wrapped into the model tar there too? If that's true, then that should be documented somewhere.

@samuelklee
Contributor Author

> In case mode you already have the filtered intervals, so why collect coverage over the full set?

Since coverage collection is relatively expensive, it's better to collect coverage over the same set of intervals for all samples just once and call it a day. Then we can run these samples in whatever mode we please, adjust filtering parameters, etc. in subsequent analyses without having to go back and recollect coverage at any point.

> Also, in cnv_germline_case_workflow.wdl it really doesn't look like ploidy determination uses the filtered intervals. Are the intervals wrapped into the model tar there too? If that's true, then that should be documented somewhere.

Yup, that's correct. Added some more docs.

@samuelklee merged commit 4618323 into master on Oct 22, 2018
@samuelklee deleted the sl_filter branch on October 22, 2018 13:29
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
…tering in the gCNV pipeline. (broadinstitute#5307)

* Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline.

* Addressed PR comments.

* Added some documentation.