Refactored JointVcfFiltering WDL and expanded tests. #8074
Conversation
Codecov Report
Coverage Diff (master vs. #8074):
  Coverage:   86.593% -> 86.644% (+0.051%)
  Complexity: 38899 -> 38963 (+64)
  Files:      2336 (unchanged)
  Lines:      182709 -> 182730 (+21)
  Branches:   20060 -> 20066 (+6)
  Hits:       158213 -> 158325 (+112)
  Misses:     17441 -> 17365 (-76)
  Partials:   7055 -> 7040 (-15)
I've added a few commits that clean up some of the code inherited from VQSR regarding the use of labeled resources when using allele-specific annotations. This should be ready for review and/or experimentation with importing into the WARP repo, @meganshand. There are a few unrelated failing tests, which I think others are seeing in their branches as well.
My test case WDL ran! I just have a few comments about usability in the WDL.
# will output the same number of shards that are input.
# This portion of the filtering pipeline will assign a SCORE INFO field annotation to each site, but does not yet apply
# the filtering threshold to the final VCF.
# Workflow for scoring and optionally filtering a VCF based on site-level annotations using the
It's not super clear from the WDL how to get a filtered VCF rather than just a scored VCF (I think you need to go to the tool doc to see the optional parameter that you'd add to extra_args). This feels important enough to elevate to its own argument rather than just being folded into extra_args, so it's clear how to get this feature straight from the WDL.
Actually, I think I feel the same way about the allele-specific argument. These feel important enough to the core of the tool to call out explicitly in the WDL.
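To illustrate the suggestion, here is a hypothetical sketch of what elevating these options to explicit workflow inputs might look like, as opposed to folding them into the free-form extra_args string; the input names and the flag shown are illustrative, not the tool's actual interface:

```wdl
version 1.0

# Hypothetical sketch only: expose hard filtering and allele-specific mode as explicit
# workflow inputs instead of passing them through extra_args. Names are illustrative.
workflow JointVcfFilteringElevatedInputsSketch {
    input {
        Boolean apply_hard_filter = false
        Boolean use_allele_specific_annotations = false
        # Current alternative: pass tool flags through a free-form string, e.g.
        # extra_args = "--some-hard-filtering-flag" (flag name is hypothetical).
        String extra_args = ""
    }
}
```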
I was thinking we could make allele-specific mode automatic---we just need to detect whether any annotations have Number=A. So this would get rid of the corresponding parameter. Added a TODO to the meta issue for this.
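A minimal sketch of how that detection might work as a standalone helper task (the task name, regex, and docker image are placeholders, not part of this PR):

```wdl
version 1.0

# Hypothetical helper: scan the VCF header for INFO annotations declared with Number=A
# and report whether allele-specific mode should be enabled.
task DetectAlleleSpecificAnnotations {
    input {
        File input_vcf
    }

    command <<<
        set -euo pipefail
        # zgrep handles both gzipped and plain-text VCFs; -q exits 0 on the first match.
        if zgrep -q '^##INFO=<ID=[^>]*Number=A' ~{input_vcf}; then
            echo true
        else
            echo false
        fi > is_allele_specific.txt
    >>>

    output {
        Boolean is_allele_specific = read_boolean("is_allele_specific.txt")
    }

    runtime {
        docker: "ubuntu:22.04"
    }
}
```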
As for hard filtering, the fact that we need to specify this via extra_args is alluded to in the parameter meta documentation. I agree that it would be nice to elevate certain parameters, but then this raises the question of why those parameters were singled out. I think it's better from a development perspective to treat everything as equitably as possible---i.e., elevate everything or elevate nothing.
Currently, we do "elevate" resource_args, since this is shared by the extract and score tools and we want to ensure the same values are passed. We also elevate any file arguments that might need to be localized (e.g., python_script, which in turn suggests that we elevate the corresponding model_backend argument). But then everything else is relegated to extra_args. This at least has the benefit of giving a more generic interface (in that using extra_args provides maximal freedom to change arguments in the future, etc.).
As for whether we should replicate documentation for e.g. hard filtering here---in a perfect world, yes, but in a world where replicated documentation quickly falls out of sync, I'd rather refer users to the original documentation. Hopefully reading those docs and looking at example JSONs where hard filtering is turned on should make it clear how this works.
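For reference, a rough sketch of the interface described above, assuming the input names mentioned in this thread (resource_args, python_script, model_backend, extra_args); types and defaults are illustrative, and this is not the actual workflow definition:

```wdl
version 1.0

# Sketch of the described interface: shared or localized arguments are explicit inputs,
# while everything else flows through extra_args.
workflow JointVcfFilteringInterfaceSketch {
    input {
        # Shared by the extract and score tools, elevated to guarantee identical values.
        String resource_args
        # File arguments that may need to be localized, plus the backend they imply.
        File? python_script
        String? model_backend
        # Everything else is passed straight through to the underlying tools.
        String extra_args = ""
    }
}
```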
I agree we shouldn't be replicating documentation here, but I disagree that if we elevate some inputs we must elevate them all. But perhaps for this use case of an example WDL nothing needs to be elevated. We can leave that for the WDLs that will run a specific use case.
docker: gatk_docker
cpu: select_first([runtime_attributes.cpu, 1])
memory: select_first([runtime_attributes.machine_mem_gb, 7]) + " GB"
disks: "local-disk " + select_first([runtime_attributes.disk_size_gb, 100]) + if select_first([runtime_attributes.use_ssd, false]) then " SSD" else " HDD"
Can the default disk sizes be generated dynamically from the size of the inputs instead of hardcoded? This could be done for all of the tasks.
I started an attempt at this, but a lot of the big files (e.g., the input VCFs) are only optionally localized, and I'm not sure how best to use size() in these cases. I think it's also likely that e.g. large annotation files will cause memory issues before they cause disk issues, and for these and other output files, there's not a good way to estimate their resulting disk size. Will file something in the meta issue and punt for now.
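For completeness, a minimal sketch of what size-based disk defaults could look like for a task with optional inputs; the input names and sizing constants here are hypothetical:

```wdl
version 1.0

# Hypothetical sketch: derive a default disk size from whichever optional inputs are
# actually defined, with an override input preserving the current hardcoded behavior.
task DynamicDiskSketch {
    input {
        File? input_vcf
        File? input_vcf_index
        Int? disk_size_gb_override
    }

    # select_all drops undefined optionals, so size() only counts files that will be localized.
    Int default_disk_size_gb = ceil(2.0 * size(select_all([input_vcf, input_vcf_index]), "GiB")) + 20
    Int disk_size_gb = select_first([disk_size_gb_override, default_disk_size_gb])

    command <<<
        echo "Requested ~{disk_size_gb} GB of local disk."
    >>>

    output {
        String message = read_string(stdout())
    }

    runtime {
        docker: "ubuntu:22.04"
        disks: "local-disk " + disk_size_gb + " HDD"
    }
}
```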
runtime {
docker: gatk_docker
cpu: select_first([runtime_attributes.cpu, 1])
memory: select_first([runtime_attributes.machine_mem_gb, 7]) + " GB"
For what it's worth, the default memory requirements were too small for both the Extract and the Train tasks for a callset of ~200 samples. Not sure if that means it's worth it to increase the defaults or not for this example WDL.
However, as we discussed in Slack, I would personally vote for making the command_mem_gb related to machine_mem_gb (machine_mem_gb = command_mem_gb + 1). It seems much simpler for the user to only have to worry about one of those arguments. If you're concerned that they should be untied due to needing extra memory on the machine outside of the Java options, then maybe add an additional_machine_mem_gb or something similar (machine_mem_gb = command_mem_gb + additional_machine_mem_gb)? That way you'd still only have two memory arguments (command_mem_gb and additional_machine_mem_gb), but in most use cases the user only needs to update command_mem_gb.
Went with command_mem_gb and additional_mem_gb, thanks for the suggestion! I just left the default values as is, though---they're just meant to be an example.
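A small sketch of the resulting memory scheme, assuming placeholder default values and docker tag (not the defaults used in this PR): machine memory is derived from command_mem_gb plus additional_mem_gb, so in most cases only command_mem_gb needs adjusting.

```wdl
version 1.0

# Sketch of the adopted memory scheme; defaults and image tag are placeholders.
task MemorySchemeSketch {
    input {
        Int command_mem_gb = 13
        Int additional_mem_gb = 2
    }

    Int machine_mem_gb = command_mem_gb + additional_mem_gb

    command <<<
        # The Java heap gets command_mem_gb; the remainder covers off-heap overhead.
        gatk --java-options "-Xmx~{command_mem_gb}G" --list
    >>>

    runtime {
        docker: "broadinstitute/gatk:4.3.0.0"
        memory: machine_mem_gb + " GB"
        cpu: 1
    }
}
```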
…e-specific mode and cleaned up unlabeled alleles in labeled output.
…l for edge cases with insufficient negative training data.
Thanks for the review @meganshand, and sorry for the delayed response. I addressed one of your comments and punted/semi-punted on the others. Tests should also be passing after a rebase. Unless you have strong objections, I'll go ahead and merge---want to get this in before the break!
@samuelklee Sounds good, I think you can merge!
Adds a WDL that replaces the "serial" SnpThenIndel joint filtering workflow added in #7932. This simplified replacement only runs one iteration of the extract-train-score toolchain, rather than running one iteration for SNPs followed by another for INDELs. The original SnpThenIndel workflow (used for Ultima) will be updated and moved to the WARP repo. (EDIT: I was originally confused here; the WDL that was replaced in this PR simply ran SNPs and indels separately, rather than serially. Curious that things still tied out, but I'm not sure it's worth looking into at this point.)
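For orientation, a structural sketch of the simplified single-pass layout; task bodies are stubbed out, and the names and file formats are illustrative rather than the actual task definitions:

```wdl
version 1.0

# Structural sketch only: one pass of extract-train-score over all sites, with scoring
# scattered over the input shards so the output shard count matches the input.
workflow JointVcfFilteringSketch {
    input {
        Array[File] input_vcfs
    }

    call ExtractStub { input: vcfs = input_vcfs }
    call TrainStub { input: annotations = ExtractStub.annotations }

    scatter (vcf in input_vcfs) {
        call ScoreStub { input: vcf = vcf, model = TrainStub.model }
    }

    output {
        Array[File] scored_vcfs = ScoreStub.scored_vcf
    }
}

task ExtractStub {
    input { Array[File] vcfs }
    command <<< touch extracted.annot >>>
    output { File annotations = "extracted.annot" }
    runtime { docker: "ubuntu:22.04" }
}

task TrainStub {
    input { File annotations }
    command <<< touch model.tar.gz >>>
    output { File model = "model.tar.gz" }
    runtime { docker: "ubuntu:22.04" }
}

task ScoreStub {
    input {
        File vcf
        File model
    }
    command <<< touch scored.vcf.gz >>>
    output { File scored_vcf = "scored.vcf.gz" }
    runtime { docker: "ubuntu:22.04" }
}
```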
Test files have also been subset to chr21-22 and slimmed down. A test for the positive-negative training approach was also added, as well as tests with an empty shard.
The first commit contains the original workflow (JointVcfFilteringOriginal.wdl), as well as a reimplementation (JointVcfFilteringSnpThenIndel.wdl) that calls the simplified workflow (JointVcfFiltering.wdl). I've verified that both the original and reimplemented SnpThenIndel workflows tie out on the original test data.
The second commit then removes the original and the reimplementation, leaving only the simplified workflow. It may thus be easier to review the first commit, second commit, or the overall changes, depending on what you are looking at.
@meganshand can you take a look and let me know if there's any missing functionality, or if this otherwise won't work for Ultima and/or importing in WARP? Apologies for the delay!