Refactored JointVcfFiltering WDL and expanded tests. #8074
Conversation
Codecov Report
Coverage Diff (master vs. #8074):
  Coverage:   86.593% -> 86.644% (+0.051%)
  Complexity: 38899 -> 38963 (+64)
  Files:      2336 (unchanged)
  Lines:      182709 -> 182730 (+21)
  Branches:   20060 -> 20066 (+6)
  Hits:       158213 -> 158325 (+112)
  Misses:     17441 -> 17365 (-76)
  Partials:   7055 -> 7040 (-15)
I've added a few commits that clean up some of the code inherited from VQSR regarding the use of labeled resources when using allele-specific annotations. This should be ready for review and/or experimentation with importing into the WARP repo, @meganshand. There are a few unrelated failing tests, which I think others are seeing in their branches as well.
My test case WDL ran! I just have a few comments about usability in the WDL.
# will output the same number of shards that are input.
# This portion of the filtering pipeline will assign a SCORE INFO field annotation to each site, but does not yet apply
# the filtering threshold to the final VCF.
# Workflow for scoring and optionally filtering a VCF based on site-level annotations using the
It's not super clear from the WDL how to get a filtered VCF rather than just a scored VCF (I think you need to go to the tool doc to see the optional parameter that you'd add to extra_args). This feels important enough to elevate to its own argument rather than just being folded into extra_args, so it's clear how to get this feature straight from the WDL.
Actually, I think I feel the same way about the allele-specific argument. These feel important enough to the core of the tool to call out explicitly in the WDL.
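To illustrate the suggestion, here is a hypothetical sketch of what elevating these options to explicit workflow inputs might look like, as opposed to folding them into the free-form extra_args string; the input names and the flag shown are illustrative, not the tool's actual interface:

```wdl
version 1.0

# Hypothetical sketch only: expose hard filtering and allele-specific mode as explicit
# workflow inputs instead of passing them through extra_args. Names are illustrative.
workflow JointVcfFilteringElevatedInputsSketch {
    input {
        Boolean apply_hard_filter = false
        Boolean use_allele_specific_annotations = false
        # Current alternative: pass tool flags through a free-form string, e.g.
        # extra_args = "--some-hard-filtering-flag" (flag name is hypothetical).
        String extra_args = ""
    }
}
```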
I was thinking we could make allele-specific mode automatic---we just need to detect whether any annotations have Number=A. So this would get rid of the corresponding parameter. Added a TODO to the meta issue for this.
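A minimal sketch of how that detection might work as a standalone helper task (the task name, regex, and docker image are placeholders, not part of this PR):

```wdl
version 1.0

# Hypothetical helper: scan the VCF header for INFO annotations declared with Number=A
# and report whether allele-specific mode should be enabled.
task DetectAlleleSpecificAnnotations {
    input {
        File input_vcf
    }

    command <<<
        set -euo pipefail
        # zgrep handles both gzipped and plain-text VCFs; -q exits 0 on the first match.
        if zgrep -q '^##INFO=<ID=[^>]*Number=A' ~{input_vcf}; then
            echo true
        else
            echo false
        fi > is_allele_specific.txt
    >>>

    output {
        Boolean is_allele_specific = read_boolean("is_allele_specific.txt")
    }

    runtime {
        docker: "ubuntu:22.04"
    }
}
```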
As for hard filtering, the fact that we need to specify this via extra_args is alluded to in the parameter meta documentation. I agree that it would be nice to elevate certain parameters, but then this raises the question of why those parameters were singled out. I think it's better from a development perspective to treat everything as equitably as possible---i.e., elevate everything or elevate nothing.
Currently, we do "elevate" resource_args, since this is shared by the extract and score tools and we want to ensure the same values are passed. We also elevate any file arguments that might need to be localized (e.g., python_script, which in turn suggests that we elevate the corresponding model_backend argument). But then everything else is relegated to extra_args. This at least has the benefit of giving a more generic interface (in that using extra_args provides maximal freedom to change arguments in the future, etc.).
As for whether we should replicate documentation for e.g. hard filtering here---in a perfect world, yes, but in a world where replicated documentation quickly falls out of sync, I'd rather refer users to the original documentation. Hopefully reading those docs and looking at example JSONs where hard filtering is turned on should make it clear how this works.
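For reference, a rough sketch of the interface described above, assuming the input names mentioned in this thread (resource_args, python_script, model_backend, extra_args); types and defaults are illustrative, and this is not the actual workflow definition:

```wdl
version 1.0

# Sketch of the described interface: shared or localized arguments are explicit inputs,
# while everything else flows through extra_args.
workflow JointVcfFilteringInterfaceSketch {
    input {
        # Shared by the extract and score tools, elevated to guarantee identical values.
        String resource_args
        # File arguments that may need to be localized, plus the backend they imply.
        File? python_script
        String? model_backend
        # Everything else is passed straight through to the underlying tools.
        String extra_args = ""
    }
}
```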
I agree we shouldn't be replicating documentation here, but I disagree that if we elevate some inputs we must elevate them all. But perhaps for this use case of an example WDL nothing needs to be elevated. We can leave that for the WDLs that will run a specific use case.
docker: gatk_docker
cpu: select_first([runtime_attributes.cpu, 1])
memory: select_first([runtime_attributes.machine_mem_gb, 7]) + " GB"
disks: "local-disk " + select_first([runtime_attributes.disk_size_gb, 100]) + if select_first([runtime_attributes.use_ssd, false]) then " SSD" else " HDD"
Can the default disk sizes be generated dynamically from the size of the inputs instead of hardcoded? This could be done for all of the tasks.
I started an attempt at this, but a lot of the big files (e.g., the input VCFs) are only optionally localized, and I'm not sure how best to use size() in these cases. I think it's also likely that e.g. large annotation files will cause memory issues before they cause disk issues, and for these and other output files, there's not a good way to estimate their resulting disk size. Will file something in the meta issue and punt for now.
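For completeness, a minimal sketch of what size-based disk defaults could look like for a task with optional inputs; the input names and sizing constants here are hypothetical:

```wdl
version 1.0

# Hypothetical sketch: derive a default disk size from whichever optional inputs are
# actually defined, with an override input preserving the current hardcoded behavior.
task DynamicDiskSketch {
    input {
        File? input_vcf
        File? input_vcf_index
        Int? disk_size_gb_override
    }

    # select_all drops undefined optionals, so size() only counts files that will be localized.
    Int default_disk_size_gb = ceil(2.0 * size(select_all([input_vcf, input_vcf_index]), "GiB")) + 20
    Int disk_size_gb = select_first([disk_size_gb_override, default_disk_size_gb])

    command <<<
        echo "Requested ~{disk_size_gb} GB of local disk."
    >>>

    output {
        String message = read_string(stdout())
    }

    runtime {
        docker: "ubuntu:22.04"
        disks: "local-disk " + disk_size_gb + " HDD"
    }
}
```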
runtime {
docker: gatk_docker
cpu: select_first([runtime_attributes.cpu, 1])
memory: select_first([runtime_attributes.machine_mem_gb, 7]) + " GB"
For what it's worth, the default memory requirements were too small for both the Extract and the Train tasks for a callset of ~200 samples. Not sure if that means it's worth it to increase the defaults or not for this example WDL.
However, as we discussed in Slack, I would personally vote for making the command_mem_gb related to machine_mem_gb (machine_mem_gb = command_mem_gb + 1). It seems much simpler for the user to only have to worry about one of those arguments. If you're concerned that they should be untied due to needing extra memory on the machine outside of the Java options, then maybe add an additional_machine_mem_gb or something similar (machine_mem_gb = command_mem_gb + additional_machine_mem_gb)? That way you'd still only have two memory arguments (command_mem_gb and additional_machine_mem_gb), but in most use cases the user only needs to update command_mem_gb.
Went with command_mem_gb and additional_mem_gb, thanks for the suggestion! I just left the default values as is, though---they're just meant to be an example.
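A small sketch of the resulting memory scheme, assuming placeholder default values and docker tag (not the defaults used in this PR): machine memory is derived from command_mem_gb plus additional_mem_gb, so in most cases only command_mem_gb needs adjusting.

```wdl
version 1.0

# Sketch of the adopted memory scheme; defaults and image tag are placeholders.
task MemorySchemeSketch {
    input {
        Int command_mem_gb = 13
        Int additional_mem_gb = 2
    }

    Int machine_mem_gb = command_mem_gb + additional_mem_gb

    command <<<
        # The Java heap gets command_mem_gb; the remainder covers off-heap overhead.
        gatk --java-options "-Xmx~{command_mem_gb}G" --list
    >>>

    runtime {
        docker: "broadinstitute/gatk:4.3.0.0"
        memory: machine_mem_gb + " GB"
        cpu: 1
    }
}
```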
…e-specific mode and cleaned up unlabeled alleles in labeled output.
…l for edge cases with insufficient negative training data.
Thanks for the review @meganshand, and sorry for the delayed response. I addressed one of your comments and punted/semi-punted on the others. Tests should also be passing after a rebase. Unless you have strong objections, I'll go ahead and merge---want to get this in before the break!
@samuelklee Sounds good, I think you can merge!
Adds a WDL that replaces the "serial" SnpThenIndel joint filtering workflow added in #7932. This simplified replacement only runs one iteration of the extract-train-score toolchain, rather than running one iteration for SNPs followed by another for INDELs. The original SnpThenIndel workflow (used for Ultima) will be updated and moved to the WARP repo. (EDIT: I was originally confused here; the WDL that was replaced in this PR simply ran SNPs and indels separately, rather than serially. Curious that things still tied out, but I'm not sure it's worth looking into at this point.)
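For orientation, a structural sketch of the simplified single-pass layout; task bodies are stubbed out, and the names and file formats are illustrative rather than the actual task definitions:

```wdl
version 1.0

# Structural sketch only: one pass of extract-train-score over all sites, with scoring
# scattered over the input shards so the output shard count matches the input.
workflow JointVcfFilteringSketch {
    input {
        Array[File] input_vcfs
    }

    call ExtractStub { input: vcfs = input_vcfs }
    call TrainStub { input: annotations = ExtractStub.annotations }

    scatter (vcf in input_vcfs) {
        call ScoreStub { input: vcf = vcf, model = TrainStub.model }
    }

    output {
        Array[File] scored_vcfs = ScoreStub.scored_vcf
    }
}

task ExtractStub {
    input { Array[File] vcfs }
    command <<< touch extracted.annot >>>
    output { File annotations = "extracted.annot" }
    runtime { docker: "ubuntu:22.04" }
}

task TrainStub {
    input { File annotations }
    command <<< touch model.tar.gz >>>
    output { File model = "model.tar.gz" }
    runtime { docker: "ubuntu:22.04" }
}

task ScoreStub {
    input {
        File vcf
        File model
    }
    command <<< touch scored.vcf.gz >>>
    output { File scored_vcf = "scored.vcf.gz" }
    runtime { docker: "ubuntu:22.04" }
}
```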
Test files have also been subset to chr21-22 and slimmed down. A test for the positive-negative training approach was also added, as well as tests with an empty shard.
The first commit contains the original workflow (JointVcfFilteringOriginal.wdl), as well as a reimplementation (JointVcfFilteringSnpThenIndel.wdl) that calls the simplified workflow (JointVcfFiltering.wdl). I've verified that both the original and reimplemented SnpThenIndel workflows tie out on the original test data.
The second commit then removes the original and the reimplementation, leaving only the simplified workflow. It may thus be easier to review the first commit, second commit, or the overall changes, depending on what you are looking at.
@meganshand can you take a look and let me know if there's any missing functionality, or if this otherwise won't work for Ultima and/or importing in WARP? Apologies for the delay!