Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VS-693] Add support for VQSR Lite to GvsCreateFilterSet #8157

Merged
merged 24 commits into from
Feb 2, 2023

Conversation

koncheto-broad
Copy link

@koncheto-broad koncheto-broad commented Jan 12, 2023

Added a new argument use_classic_VQSR to the WDL to determine whether we are using the old VQSR or the new VQSR-Lite. It defaults to true, so it should not change the default behavior of other code.

Ran the code with it set to true for the old behavior (https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/adbbc5dc-6168-47a6-8f76-a232b8b4b9db)

and false for the new behavior (https://app.terra.bio/#workspaces/gvs-dev/GVS%20Tiny%20Quickstart%20hatcher/job_history/e5f46969-f80d-411a-90b0-309ea46753b3)

and it appears to behave as expected for both cases. Screenshot of the table created in BQ
Screen Shot 2023-01-12 at 10 03 49 AM

This branch is based upon Bec's original PoC branch, so will be pulling in that code to ah_var_store as well

Edit: This is a large PR, so it may be helpful to think of it as 3 separate pieces.

  1. There was Sam's code, which was verified and code reviewed and merged into master. Afterwards, it was cherrypicked into an offshoot of ah_var_store. I do not believe we have to review that. If you want to see those though, you can see them here (with the caveat that this is a comparison against the current ah_var_store, which may be slightly misleading): ah_var_store...c95945c
  2. There's Bec's initial poc code. This was part of our previous spike, and what I built upon. A link to just those changes is here: c95945c...3c19b50
  3. There's my code, which was specifically to add in the pathway to choose between VQSR and VQSR-Lite. A link to just those changes is here: 3c19b50...3320837

samuelklee and others added 20 commits October 17, 2022 13:50
…annotations. (#7954)

* Adds wdl that tests joint VCF filtering tools (#7932)

* adding filtering wdl

* renaming pipeline

* addressing comments

* added bash

* renaming json

* adding glob to extract for extra files

* changing dollar signs

* small comments

* Added changes for specifying model backend and other tweaks to WDLs and environment.

* Added classes for representing a collection of labeled variant annotations.

* Added interfaces for modeling and scoring backends.

* Added a new suite of tools for variant filtering based on site-level annotations.

* Added integration tests.

* Added test resources and expected results.

* Miscellaneous changes.

* Removed non-ASCII characters.

* Added documentation for TrainVariantAnnotationsModel and addressed review comments.

Co-authored-by: meganshand <[email protected]>
…eous minor fixes to new annotation-based filtering tools. (#8049)
…input in JointVcfFiltering WDL (#8027)

* Small changes to JointVCFFiltering WDL

* making default for use_allele_specific_annotations

* addressing comments
… including Java changes necessary to produce the different tsv files
@codecov
Copy link

codecov bot commented Jan 12, 2023

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@9cf8021). Click here to learn what that means.
The diff coverage is n/a.

Additional details and impacted files
@@               Coverage Diff                @@
##             ah_var_store     #8157   +/-   ##
================================================
  Coverage                ?   86.184%           
  Complexity              ?     35511           
================================================
  Files                   ?      2191           
  Lines                   ?    166339           
  Branches                ?     17900           
================================================
  Hits                    ?    143358           
  Misses                  ?     16593           
  Partials                ?      6388           

@gatk-bot
Copy link

Github actions tests reported job failures from actions build 3903475240
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 8 3903475240.3 logs

cleaning up verbiage
@gatk-bot
Copy link

Github actions tests reported job failures from actions build 3942175794
Failures in the following jobs:

Test Type JDK Job ID Logs
conda 8 3942175794.3 logs

Copy link
Collaborator

@mcovarr mcovarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot in this PR. Should we be reviewing these changes in their entirety or just the most recent changes our team has made to integrate VQSR-Lite? If the former (which might be appropriate if we are going to be co-owners of this code), we might want to get a walkthrough by the authors.

Copy link

@rsasch rsasch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Only thing that gives me pause is keeping everything to create the filter set in one WDL. GvsCreateFilterSet.wdl already had one fork (are the number of samples over a certain threshold?) and this adds another so there's a lot of code that isn't going to get run in any given submission. I would prefer (not a dealbreaker) two additional WDLs that can be run as sub-workflows of GvsCreateFilterSet.wdl to generate the filter set depending on what use_classic_VQSR is set to (this will also make it easier if we decide to ditch one).

sites_only_vcf = MergeVCFs.output_vcf,
sites_only_vcf_index = MergeVCFs.output_vcf_index,
basename = filter_set_name,
gatk_docker = "us.gcr.io/broad-gatk/gatk:4.3.0.0",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why the GATK image is hardcoded here? Alternatively it could be an input with this as the default value.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. It looks as though that was necessary when Bec yanked the code from master (which is at GATK 4.3.0) into our branch. It may have been dependent on the latest version of GATK? Maybe we leave it as an input with that as the default value and see if we can remove it if/when our branch catches up to master.

}

call Utils.MergeVCFs as MergeRecalibrationFiles {
call PopulateFilterSetInfo as PopulateFilterSetInfoCLassic {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

references will need to be updated as well obvs

Suggested change
call PopulateFilterSetInfo as PopulateFilterSetInfoCLassic {
call PopulateFilterSetInfo as PopulateFilterSetInfoClassic {

scripts/variantstore/wdl/GvsCreateFilterSet.wdl Outdated Show resolved Hide resolved
Copy link
Collaborator

@gbggrant gbggrant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I'm confused as to why vqslod is entirely null in vqsr_lite_filter_set_info.
Would like to understand prior to merging.

@samuelklee
Copy link
Contributor

samuelklee commented Feb 1, 2023

Hi all, thanks again for working to integrate this code!

Saw some confusion in the comments above and just wanted to clarify: if you take a look at the VQSR-lite PR https://github.com/broadinstitute/gatk/pull/7954/commits that the current branch is rebased upon, you'll see that it contains a version of the Joint Genotyping WDL (which was put together by Megan for Ultima) along with Java code for the tools (which was written by me).

Both the WDL and the code have been updated in subsequent PRs. The WDL was rewritten by me in #8074; the main difference is that we no longer run SNPs and indels filtering in "series", but instead run them in a single step. However, this requires that you use the same annotations for both SNPs and indels; GVS might not be ready for that just yet, since the default WARP implementation uses different annotations. (But see also the comment here: #8074 (comment). The gist is we can easily reimplement Megan's/WARP's "serial" SNP-then-indel workflow using the simpler single-step workflow.) (EDIT: I was originally confused here, Megan’s WDL simply runs SNPs and indels separately—thanks to George for correcting me here!)

Note also that test infrastructure was moved from Travis to Github Actions between these PRs, so the Travis references above have already been cleaned up. There have also been a few additional minor PRs merged in the interim, with a couple more incoming. These PRs do not fundamentally change the interfaces of the tools/WDL, however, so I think you can update to them when you're ready.

Punchline: this branch should suffice for a first cut of a VQSR/VQSR-lite bakeoff, and although it is already slightly out of date, it shouldn't be too much work to get things updated after the first cut is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants