-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. #8049
Conversation
…eous minor fixes to new annotation-based filtering tools.
@@ -46,7 +48,7 @@ | |||
* walker, performing the operations: | |||
* | |||
* - nthPassApply(n = 0) | |||
* - if variant/alleles pass filters and variant-type/overlapping-resource checks, then: | |||
* - if variant/alleles pass filters and variant-type/resource-match checks, then: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I just changed "overlapping" to "resource-matching" everywhere to distinguish what we're doing here from straightforward genomic overlapping.
@@ -283,7 +304,9 @@ VCFHeader constructVCFHeader(final List<String> sortedLabels) { | |||
.collect(Collectors.toCollection(TreeSet::new)); | |||
hInfo.add(GATKVCFHeaderLines.getFilterLine(VCFConstants.PASSES_FILTERS_v4)); | |||
final SAMSequenceDictionary sequenceDictionary = getBestAvailableSequenceDictionary(); | |||
hInfo = VcfUtils.updateHeaderContigLines(hInfo, null, sequenceDictionary, true); | |||
if (sequenceDictionary != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avoids a complaint raised if the VCF is missing contig lines in the header.
switch (resourceMatchingStrategy) { | ||
case START_POSITION: | ||
return true; | ||
case START_POSITION_AND_GIVEN_REPRESENTATION: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if these strategies add any noticeable overhead if there are a lot of multiallelics, but I haven't noticed anything out of the ordinary on my runs so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice if you could add some tests for this (even a unit test of isMatchingVariant
), but you can make that in a separate PR if you're in a hurry to get this in now.
Even START_POSITION_AND_MINIMAL_RERPRESENTATION
is not as thorough as RTG VCFEval, right? That does a more complete adjustment to resolve different representations? It would be nice to use their code somehow if it makes a big enough difference. Maybe that's easy enough to achieve through pipelining. Sorry, these are musings not requests. 👍
Thanks for the lightning-quick review! And good thought to add tests---probably just adding them to the current suite of exact-match tests would hopefully suffice, since we rely on existing/library methods (which presumably already have correctness tests) to do the matching. I'll add it to the straggler issues 😆 And yes, I don't think the matching here is as sophisticated as that done by VCFEval, but it's probably good enough for the purposes of identifying training variants. I am actually curious how often the start-position strategy gets us into trouble (e.g., we hit an artifact at a multiallelic site, or annotations at multiallelic sites are somehow distributed differently even if all alleles are real, etc.) |
…eous minor fixes to new annotation-based filtering tools. (#8049)
* Added a new suite of tools for variant filtering based on site-level annotations. (#7954) * Adds wdl that tests joint VCF filtering tools (#7932) * adding filtering wdl * renaming pipeline * addressing comments * added bash * renaming json * adding glob to extract for extra files * changing dollar signs * small comments * Added changes for specifying model backend and other tweaks to WDLs and environment. * Added classes for representing a collection of labeled variant annotations. * Added interfaces for modeling and scoring backends. * Added a new suite of tools for variant filtering based on site-level annotations. * Added integration tests. * Added test resources and expected results. * Miscellaneous changes. * Removed non-ASCII characters. * Added documentation for TrainVariantAnnotationsModel and addressed review comments. Co-authored-by: meganshand <[email protected]> * Added toggle for selecting resource-matching strategies and miscellaneous minor fixes to new annotation-based filtering tools. (#8049) * Adding use_allele_specific_annotation arg and fixing task with empty input in JointVcfFiltering WDL (#8027) * Small changes to JointVCFFiltering WDL * making default for use_allele_specific_annotations * addressing comments * first stab * wire through WDL changes * fixed typo * set model_backend input value * add gatk_override to JointVcfFiltering call * typo in indel_annotations * make model_backend optional * tabs and spaces * make all model_backends optional * use gatk 4.3.0 * no point in changing the table names as this is a POC * adding new branch to dockstore * adding in branching logic for classic VQSR vs VQSR-Lite * implementing the separate schemas for the VQSR vs VQSR-Lite branches, including Java changes necessary to produce the different tsv files * passing classic flag to indel run of CreateFilteringFiles * Update GvsCreateFilterSet.wdl cleaning up verbiage * Removed mapping error rate from estimate of denoised copy ratios output by gCNV and updated sklearn. (#7261) * cleanup up sloppy comment --------- Co-authored-by: samuelklee <[email protected]> Co-authored-by: meganshand <[email protected]> Co-authored-by: Rebecca Asch <[email protected]>
Ticks off a few straggler issues noted in #7724.
@meganshand mind reviewing? Hopefully should be quick and we can get it in before @droazen cuts the next release. Note that this shouldn't change behavior in the Ultima pipeline, as the default toggle is still the same start-position resource-matching strategy inherited from VQSR, but we might want to explore the effect of choosing another strategy there.