-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite the SplitVariants task command in TasksGenotypeBatch.wdl to call svtk only once #618
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all your work on this, it has the potential to really speed up this task! And the code has matured a lot over the time you've been working on it.
I've left specific comments on functionality and style throughout the code, and I also have some more general comments and questions here:
-
Can you add more comments to your script to summarize the functionality of each section and clarify any details that might be hard for a new reader to understand at first glance?
-
Vahid's comment is correct - you do need to build your own docker image for testing (best stored under broad-dsde-methods) but we don't commit those changes to the main branch because dockers are handled automatically, so you'll need to revert the changes to dockers.json before this PR can be merged. You can make a local copy of the dockers.json outside of your local clone of the repo to save your changes, then revert the dockers.json changes on your branch before pushing.
-
Can you also give more details on the testing you performed? SplitVariants is called multiple times with multiple combinations of parameters so it's necessary to test all of those cases.
-
What was the difference in runtime/memory between the version of SplitVariants in the main branch and your branch? This would be good to benchmark (it may only be more efficient for larger cohorts so it may not show improvements in the reference panel, but we at least don't want to see a performance reduction).
src/sv-pipeline/04_variant_resolution/scripts/split_variants.py
Outdated
Show resolved
Hide resolved
src/sv-pipeline/04_variant_resolution/scripts/split_variants.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to merge! Thanks for your patience finalizing all those details! I'm looking forward to seeing the performance improvement for this task :)
This pull request was created to improve the efficiency of the SplitVariants task command in the TasksGenotypeBatch.wdl. The SplitVariants task took an input bed file and a number of splits, called svtk to produce multiple bed files and then used awk to filter and split each bed file based on the number of required splits, proving inefficient with both memory and time and cost.
Instead, this request is to convert the actions of the task into a python script which is called on the input vcf file with the input number of lines per file as well as the variable indicating if bca's should be examined. Svtk is called only once on the vcf to produce a bed file. Each line of the bed file is read only once; the appropriate line is then added to a corresponding text file with the appropriate prefix based on which of the conditions it matches: gt5kb (greater than 5kb in length), lt5kb (less than 5kb in length), and if the bca's are indicated to be examined, if it is a bca or insertion. Once each file has reached the maximum capacity, it is closed an a new file with the same prefix and a new suffix is opened.
This implementation is an improvement in memory and time since svtk is called on the vcf only once, and each line in the bed file is only parsed once as well.
The docker images were updated and implemented.
The appropriate changes were made to the TasksGenotypeBatch. wdl where the SplitVariants task is defined, as well as in GenotypeBatch.wdl. All changes passed validation with womtool and cromwell. Testing involved changing the parameters of bca to True and False with a combination of different input values for n_per_split.
Using a standard bed file with the default parameter of n_per_split=5000 and checking for BCAs, the runtime was about 1.5 times faster than the previous task using the awk commands. The time increases as the n_per_split increases as well but not drastically.