-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VS-693] Add support for VQSR Lite to GvsCreateFilterSet #8157
Conversation
…annotations. (#7954) * Adds wdl that tests joint VCF filtering tools (#7932) * adding filtering wdl * renaming pipeline * addressing comments * added bash * renaming json * adding glob to extract for extra files * changing dollar signs * small comments * Added changes for specifying model backend and other tweaks to WDLs and environment. * Added classes for representing a collection of labeled variant annotations. * Added interfaces for modeling and scoring backends. * Added a new suite of tools for variant filtering based on site-level annotations. * Added integration tests. * Added test resources and expected results. * Miscellaneous changes. * Removed non-ASCII characters. * Added documentation for TrainVariantAnnotationsModel and addressed review comments. Co-authored-by: meganshand <[email protected]>
…eous minor fixes to new annotation-based filtering tools. (#8049)
…input in JointVcfFiltering WDL (#8027) * Small changes to JointVCFFiltering WDL * making default for use_allele_specific_annotations * addressing comments
… including Java changes necessary to produce the different tsv files
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## ah_var_store #8157 +/- ##
================================================
Coverage ? 86.184%
Complexity ? 35511
================================================
Files ? 2191
Lines ? 166339
Branches ? 17900
================================================
Hits ? 143358
Misses ? 16593
Partials ? 6388 |
Github actions tests reported job failures from actions build 3903475240
|
cleaning up verbiage
Github actions tests reported job failures from actions build 3942175794
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot in this PR. Should we be reviewing these changes in their entirety or just the most recent changes our team has made to integrate VQSR-Lite? If the former (which might be appropriate if we are going to be co-owners of this code), we might want to get a walkthrough by the authors.
…ut by gCNV and updated sklearn. (#7261)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Only thing that gives me pause is keeping everything to create the filter set in one WDL. GvsCreateFilterSet.wdl already had one fork (are the number of samples over a certain threshold?) and this adds another so there's a lot of code that isn't going to get run in any given submission. I would prefer (not a dealbreaker) two additional WDLs that can be run as sub-workflows of GvsCreateFilterSet.wdl to generate the filter set depending on what use_classic_VQSR
is set to (this will also make it easier if we decide to ditch one).
scripts/vcf_site_level_filtering_cromwell_tests/run_vcf_site_level_filtering_wdl.sh
Show resolved
Hide resolved
sites_only_vcf = MergeVCFs.output_vcf, | ||
sites_only_vcf_index = MergeVCFs.output_vcf_index, | ||
basename = filter_set_name, | ||
gatk_docker = "us.gcr.io/broad-gatk/gatk:4.3.0.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why the GATK image is hardcoded here? Alternatively it could be an input with this as the default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. It looks as though that was necessary when Bec yanked the code from master (which is at GATK 4.3.0) into our branch. It may have been dependent on the latest version of GATK? Maybe we leave it as an input with that as the default value and see if we can remove it if/when our branch catches up to master.
} | ||
|
||
call Utils.MergeVCFs as MergeRecalibrationFiles { | ||
call PopulateFilterSetInfo as PopulateFilterSetInfoCLassic { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
references will need to be updated as well obvs
call PopulateFilterSetInfo as PopulateFilterSetInfoCLassic { | |
call PopulateFilterSetInfo as PopulateFilterSetInfoClassic { |
src/main/java/org/broadinstitute/hellbender/tools/gvs/filtering/CreateFilteringFiles.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but I'm confused as to why vqslod is entirely null in vqsr_lite_filter_set_info.
Would like to understand prior to merging.
Hi all, thanks again for working to integrate this code! Saw some confusion in the comments above and just wanted to clarify: if you take a look at the VQSR-lite PR https://github.com/broadinstitute/gatk/pull/7954/commits that the current branch is rebased upon, you'll see that it contains a version of the Joint Genotyping WDL (which was put together by Megan for Ultima) along with Java code for the tools (which was written by me). Both the WDL and the code have been updated in subsequent PRs. The WDL was rewritten by me in #8074; the main difference is that we no longer run SNPs and indels filtering in "series", but instead run them in a single step. However, this requires that you use the same annotations for both SNPs and indels; GVS might not be ready for that just yet, since the default WARP implementation uses different annotations. (But see also the comment here: #8074 (comment). The gist is we can easily reimplement Megan's/WARP's "serial" SNP-then-indel workflow using the simpler single-step workflow.) (EDIT: I was originally confused here, Megan’s WDL simply runs SNPs and indels separately—thanks to George for correcting me here!) Note also that test infrastructure was moved from Travis to Github Actions between these PRs, so the Travis references above have already been cleaned up. There have also been a few additional minor PRs merged in the interim, with a couple more incoming. These PRs do not fundamentally change the interfaces of the tools/WDL, however, so I think you can update to them when you're ready. Punchline: this branch should suffice for a first cut of a VQSR/VQSR-lite bakeoff, and although it is already slightly out of date, it shouldn't be too much work to get things updated after the first cut is done. |
Added a new argument
use_classic_VQSR
to the WDL to determine whether we are using the old VQSR or the new VQSR-Lite. It defaults to true, so it should not change the default behavior of other code.Ran the code with it set to true for the old behavior (https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/adbbc5dc-6168-47a6-8f76-a232b8b4b9db)
and false for the new behavior (https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/e5f46969-f80d-411a-90b0-309ea46753b3)
and it appears to behave as expected for both cases. Screenshot of the table created in BQ
This branch is based upon Bec's original PoC branch, so will be pulling in that code to ah_var_store as well
Edit: This is a large PR, so it may be helpful to think of it as 3 separate pieces.