Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline. #5307
Conversation
@lucidtronix looks like I got an intermittent 10-minute timeout failure in the CNN WDL test.
Going to go ahead and restart that test.
Note that we could probably extract some code for reading and subsetting read counts for both DetermineGermlineContigPloidy and GermlineCNVCaller; see related issue #4004. There is also some duplication in the integration-test code, which is probably not worth cleaning up. @ldgauthier would you mind reviewing?
Filtering code and tests look good. I have some questions about how the different interval lists are used.
"CNVGermlineCohortWorkflow.minimum_mappability": 0.0,
"CNVGermlineCohortWorkflow.maximum_mappability": 1.0,
"CNVGermlineCohortWorkflow.minimum_segmental_duplication_content": 0.0,
"CNVGermlineCohortWorkflow.maximum_segmental_duplication_content": 1.0,
Why are these more general than the defaults in cnv_common_tasks.wdl? Is the goal to have this JSON run the WDL with the filtering effectively turned off?
Yup, the WDL test data is so small (to allow Travis tests to complete in a reasonable time) that it's hard to test realistic filters.
@@ -161,7 +162,7 @@ workflow CNVGermlineCaseWorkflow {

    call CNVTasks.ScatterIntervals {
        input:
-           interval_list = PreprocessIntervals.preprocessed_intervals,
+           interval_list = filtered_intervals,
Just to make sure I have this right -- in the case workflow you're collecting read counts over the full set of intervals (intervals), but then you're running the caller over the filtered intervals? Is that because you want the full data for contig ploidy determination?
It looks like the task GermlineCNVCallerCaseMode gets intervals that it never uses. Ditto annotated_intervals. Is the scatter accomplished by only passing one shard of the model?
> Just to make sure I have this right -- in the case workflow you're collecting read counts over the full set of intervals (intervals), but then you're running the caller over the filtered intervals? Is that because you want the full data for contig ploidy determination?
Coverage collection is performed over all intervals, but ploidy determination and gCNV are run over the filtered intervals. So changing the filtering parameters effectively masks the data for model fitting.
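To make the masking idea concrete, here is a minimal, hypothetical sketch of annotation- and count-based interval filtering. The function name, data shapes, and thresholds are illustrative assumptions, not the actual FilterIntervals implementation or its defaults:

```python
# Hypothetical sketch: keep intervals that pass both annotation thresholds
# (e.g., mappability, segmental-duplication content) and a count-based
# criterion (not too many low-count samples). Illustrative only.

def filter_intervals(intervals, annotations, counts,
                     min_mappability=0.9, max_segdup=0.5,
                     low_count=5, max_low_count_fraction=0.1):
    """Return indices of intervals that pass all filters.

    intervals:   list of interval identifiers
    annotations: dict interval -> {"mappability": float, "segdup": float}
    counts:      dict interval -> list of per-sample read counts
    """
    kept = []
    for i, interval in enumerate(intervals):
        ann = annotations[interval]
        # Annotation-based filtering.
        if ann["mappability"] < min_mappability or ann["segdup"] > max_segdup:
            continue
        # Count-based filtering: drop intervals where too large a fraction
        # of samples have low coverage.
        sample_counts = counts[interval]
        low = sum(1 for c in sample_counts if c < low_count)
        if low / len(sample_counts) > max_low_count_fraction:
            continue
        kept.append(i)
    return kept
```

Because the full counts are kept, tightening or loosening these thresholds changes only which columns of the counts matrix the model sees, exactly the masking behavior described above.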
> It looks like the task GermlineCNVCallerCaseMode gets intervals that it never uses. Ditto annotated_intervals. Is the scatter accomplished by only passing one shard of the model?
Oops, good catch. At some point, I think we switched things so that the intervals for each shard are contained in the corresponding model tar, rather than passing them through -L. I think I removed some vestigial documentation elsewhere in this PR, but forgot to clean up the WDL.
Technically, I guess we don't have to run ScatterIntervals in case mode, since we could just scatter over the model tars. However, this step is cheap, and I think that having the scatter parameters explicit in the case mode JSON is not too much hassle. The user just has to make sure that they use the same parameters for cohort and case mode. Same logic goes for the PreprocessIntervals step. However, we can always change this later so that all necessary results produced in cohort mode are simply taken as input to case mode, if desired. There are pros and cons to both.
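For intuition, the contiguous sharding that a scatter step produces can be sketched as below. This is a rough illustration under stated assumptions: the real ScatterIntervals tool scatters by genomic position and base count, so the function here is not its actual behavior:

```python
# Rough sketch of splitting a filtered interval list into contiguous shards,
# as a scatter step would, so each shard can be processed independently.
# Illustrative only; the actual tool weights shards by genomic content.

def scatter_intervals(intervals, num_shards):
    """Split intervals into at most num_shards roughly equal contiguous chunks."""
    shard_size = -(-len(intervals) // num_shards)  # ceiling division
    return [intervals[i:i + shard_size]
            for i in range(0, len(intervals), shard_size)]
```

The key point from the discussion above is that this split must be reproducible between cohort and case mode (same parameters), since each case-mode shard is matched to a cohort-mode model tar.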
@@ -148,6 +148,18 @@
 *      --output-prefix normal_cohort
 * </pre>
 *
 * <pre>
Can you put a little "with optional intervals" or something before this? I'd hate for users to think they have to run it twice.
Done.
Codecov Report
@@             Coverage Diff              @@
##             master     #5307     +/-   ##
=============================================
+ Coverage     86.78%    86.821%    +0.04%
- Complexity    30001      30197      +196
=============================================
  Files          1838       1845        +7
  Lines        139051     139825      +774
  Branches      15329      15412       +83
=============================================
+ Hits         120669     121397      +728
- Misses        12809      12819       +10
- Partials       5573       5609       +36
Two last (non-blocking) questions:
In case mode you already have the filtered intervals, so why collect coverage over the full set?
Also, in cnv_germline_case_workflow.wdl it really doesn't look like ploidy determination uses the filtered intervals. Are the intervals wrapped into the model tar there too? If that's true, then that should be documented somewhere.
Since coverage collection is relatively expensive, it's better to collect coverage over the same set of intervals for all samples just once and call it a day. Then we can run these samples in whatever mode we please, adjust filtering parameters, etc. in subsequent analyses without having to go back and recollect coverage at any point.
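The collect-once, subset-later design can be sketched as follows. This is a minimal illustration under assumed data shapes (a plain samples-by-intervals matrix), not the actual read-count file handling in the pipeline:

```python
# Illustrative sketch: the expensive counts matrix is collected once over the
# full interval set, then subset to any filtered interval list afterwards,
# so filters can change without re-collecting coverage.

def subset_counts(all_intervals, counts_matrix, filtered_intervals):
    """counts_matrix: rows are samples, columns align with all_intervals."""
    keep = set(filtered_intervals)
    cols = [j for j, iv in enumerate(all_intervals) if iv in keep]
    return [[row[j] for j in cols] for row in counts_matrix]
```

Any later change to the filtering parameters only changes which columns are selected; the original collection step never needs to be rerun.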
Yup, that's correct. Added some more docs.
…tering in the gCNV pipeline. (broadinstitute#5307)

* Added FilterIntervals to perform annotation-based and count-based filtering in the gCNV pipeline.
* Addressed PR comments.
* Added some documentation.
Closes #2992.
Closes #4558.