Rewrite the SplitVariants task command in TasksGenotypeBatch.wdl to call svtk only once #618

kirtanav98 · 2023-11-20T15:02:23Z

This pull request was created to improve the efficiency of the SplitVariants task command in the TasksGenotypeBatch.wdl. The SplitVariants task took an input bed file and a number of splits, called svtk to produce multiple bed files and then used awk to filter and split each bed file based on the number of required splits, proving inefficient with both memory and time and cost.
Instead, this request is to convert the actions of the task into a python script which is called on the input vcf file with the input number of lines per file as well as the variable indicating if bca's should be examined. Svtk is called only once on the vcf to produce a bed file. Each line of the bed file is read only once; the appropriate line is then added to a corresponding text file with the appropriate prefix based on which of the conditions it matches: gt5kb (greater than 5kb in length), lt5kb (less than 5kb in length), and if the bca's are indicated to be examined, if it is a bca or insertion. Once each file has reached the maximum capacity, it is closed an a new file with the same prefix and a new suffix is opened.
This implementation is an improvement in memory and time since svtk is called on the vcf only once, and each line in the bed file is only parsed once as well.
The docker images were updated and implemented.
The appropriate changes were made to the TasksGenotypeBatch. wdl where the SplitVariants task is defined, as well as in GenotypeBatch.wdl. All changes passed validation with womtool and cromwell. Testing involved changing the parameters of bca to True and False with a combination of different input values for n_per_split.
Using a standard bed file with the default parameter of n_per_split=5000 and checking for BCAs, the runtime was about 1.5 times faster than the previous task using the awk commands. The time increases as the n_per_split increases as well but not drastically.

…ions

inputs/values/dockers.json

epiercehoffman

Thanks for all your work on this, it has the potential to really speed up this task! And the code has matured a lot over the time you've been working on it.

I've left specific comments on functionality and style throughout the code, and I also have some more general comments and questions here:

Can you add more comments to your script to summarize the functionality of each section and clarify any details that might be hard for a new reader to understand at first glance?
Vahid's comment is correct - you do need to build your own docker image for testing (best stored under broad-dsde-methods) but we don't commit those changes to the main branch because dockers are handled automatically, so you'll need to revert the changes to dockers.json before this PR can be merged. You can make a local copy of the dockers.json outside of your local clone of the repo to save your changes, then revert the dockers.json changes on your branch before pushing.
Can you also give more details on the testing you performed? SplitVariants is called multiple times with multiple combinations of parameters so it's necessary to test all of those cases.
What was the difference in runtime/memory between the version of SplitVariants in the main branch and your branch? This would be good to benchmark (it may only be more efficient for larger cohorts so it may not show improvements in the reference panel, but we at least don't want to see a performance reduction).

wdl/TasksGenotypeBatch.wdl

src/sv-pipeline/scripts/SplitVariants.py

…ing comments to code.

src/sv-pipeline/04_variant_resolution/scripts/split_variants.py

src/sv-pipeline/scripts/SplitVariants.py

src/sv-pipeline/04_variant_resolution/scripts/split_variants.py

wdl/TasksGenotypeBatch.wdl

epiercehoffman

Looks good to merge! Thanks for your patience finalizing all those details! I'm looking forward to seeing the performance improvement for this task :)

kirtanav98 added 3 commits November 13, 2023 14:36

add python file for SplitVariants task

68015ed

edited TasksGenotype.wdl command to call SplitVariants.py

8d7ca52

changed command in TasksGenotypeBatch.wdl

b4f8414

kirtanav98 requested a review from mwalker174 November 20, 2023 15:02

kirtanav98 added 3 commits November 20, 2023 14:53

changed docker to include correct tag for sv-pipeline

d7ee19e

reformatted python script to match github lint8 formatting specificat…

2b0f5da

…ions

reformatted python script to match github lint8 formatting specificat…

2368800

…ions

kirtanav98 requested a review from shadizaheri November 20, 2023 22:00

VJalili reviewed Nov 21, 2023

View reviewed changes

inputs/values/dockers.json Outdated Show resolved Hide resolved

kirtanav98 requested a review from epiercehoffman December 4, 2023 18:24

epiercehoffman requested changes Dec 5, 2023

View reviewed changes

kirtanav98 added 4 commits December 18, 2023 15:07

made changes based on first review

c5be22d

made edit to python script to lint correctly

70e6a80

made edit to python script to lint correctly, and added extra clarify…

0b68f62

…ing comments to code.

made edit to python script to lint correctly, and added extra clarify…

e61c128

…ing comments to code.

epiercehoffman reviewed Dec 19, 2023

View reviewed changes

src/sv-pipeline/04_variant_resolution/scripts/split_variants.py Outdated Show resolved Hide resolved

src/sv-pipeline/scripts/SplitVariants.py Outdated Show resolved Hide resolved

epiercehoffman reviewed Dec 19, 2023

View reviewed changes

src/sv-pipeline/04_variant_resolution/scripts/split_variants.py Outdated Show resolved Hide resolved

wdl/TasksGenotypeBatch.wdl Outdated Show resolved Hide resolved

kirtanav98 added 5 commits December 20, 2023 08:08

made edits based on second review.

f57b37c

made edits based on second review.

32550f1

made edits based on second review.

c590e1e

made edits based on second review.

a1bb4fe

addressed changes in the last review

47c3da3

epiercehoffman approved these changes Jan 5, 2024

View reviewed changes

kirtanav98 merged commit 310eb7b into main Jan 5, 2024
5 checks passed

kirtanav98 deleted the kv_split_variants branch January 5, 2024 19:43

mwalker174 mentioned this pull request Aug 27, 2024

Update GATK to 4.6.0.0 in sv_pipeline_docker #719

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite the SplitVariants task command in TasksGenotypeBatch.wdl to call svtk only once #618

Rewrite the SplitVariants task command in TasksGenotypeBatch.wdl to call svtk only once #618

kirtanav98 commented Nov 20, 2023 •

edited

Loading

epiercehoffman left a comment

epiercehoffman left a comment

Rewrite the SplitVariants task command in TasksGenotypeBatch.wdl to call svtk only once #618

Rewrite the SplitVariants task command in TasksGenotypeBatch.wdl to call svtk only once #618

Conversation

kirtanav98 commented Nov 20, 2023 • edited Loading

epiercehoffman left a comment

Choose a reason for hiding this comment

epiercehoffman left a comment

Choose a reason for hiding this comment

kirtanav98 commented Nov 20, 2023 •

edited

Loading