Document how multi-threading support works in GATK4 #2345
In GATK4, the way to make a tool multithreaded is to implement it as a Spark tool. All Spark tools can be trivially parallelized across multiple threads using the local runner, and across a cluster using spark-submit or gcloud. We wanted to avoid the complexities of implementing our own map/reduce framework, as was done in previous versions of the GATK, and instead rely on a standard, third-party framework to keep the GATK4 engine as simple as possible.
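To make the idea above concrete, here is a minimal sketch using the plain Spark Java API (not actual GATK engine code, and the class name is made up): the per-record work is written once as a map/reduce over an RDD, and Spark decides how many threads or executors run it based on the configured master.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class SparkStyleToolSketch {
    public static void main(String[] args) {
        // "local[4]" runs in-process with 4 worker threads; under spark-submit the
        // same code runs unchanged on whatever master the cluster provides.
        SparkConf conf = new SparkConf()
                .setAppName("SparkStyleToolSketch")
                .setIfMissing("spark.master", "local[4]");
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            List<String> records = Arrays.asList("read1", "read2", "read3", "read4");
            long processed = ctx.parallelize(records)
                                .map(r -> 1L)        // per-record work, executed in parallel
                                .reduce(Long::sum);  // combine partial results
            System.out.println("Processed " + processed + " records");
        }
    }
}
```

The only dependency is spark-core; the exact Spark version GATK4 builds against may differ from whatever is on your classpath.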
I don't know too much about Spark, so maybe I have a stupid question: how can a Spark tool be run in multiple threads on a single computer? I mean, that requires some setup of the local computer, doesn't it?
And thank you very much for the quick answer, @droazen!
@magicDGS You can run a spark tool on a single computer with N threads by specifying `local[N]` as the Spark master.
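For reference, `local[N]` is plain Spark's way of asking for an in-process run with N worker threads (and `local[*]` uses one thread per available core). A minimal sketch of setting it programmatically, with a hypothetical helper name:

```java
import org.apache.spark.SparkConf;

public class LocalMasterSketch {
    /** Builds a Spark config that runs the job in the current JVM with nThreads workers. */
    static SparkConf localConf(String appName, int nThreads) {
        // "local[N]" = single machine, N threads; "local[*]" = all available cores.
        return new SparkConf().setAppName(appName).setMaster("local[" + nThreads + "]");
    }
}
```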
Thanks @lbergelson. But does that require the gatk-launch script, or just the shadow jar?
Ah, yes, that's kind of confusing actually... You don't need to use gatk-launch, but since gatk-launch also applies some default arguments when running on Spark, you may have to supply them yourself if you're not using it.
Thanks a lot for this explanation. For the moment I'm more interested in the toolkit that I'm implementing, which does not have a master script for running. It is good to know that if I implement a Spark tool it could run locally with a shadow jar. I will study a bit more about Spark in the near future. So just one last question, and I will close this: why do some tools have both a Spark version and a normal version in GATK4? If the Spark version can run locally, is there any performance issue related to running it without Spark? Thanks a lot for all your answers, it is very informative :)
Well, you're welcome to use gatk-launch as a launch script if you'd like (and feel free to rename it to whatever you like...). There are a few reasons we have Spark and non-Spark versions of the tools.
We don't yet fully understand Spark performance and its caveats; we're looking into that actively now. We hope that we'll be able to optimize our tools so that a Spark pipeline of several tools in series is faster than running the individual non-Spark versions, since it lets us avoid doing things like loading the bam file multiple times from disk. Whether or not we can achieve this is still an open question though.
Thank you very much for the detailed answer, @lbergelson. I understand points 1 and 3, but regarding 2: is there also a cost to running a Spark tool with 1 thread? I would love to use the framework to "sparkify" my tools, but I would like to be sure that there is no cost to running them locally without multi-threading...
We've found that, generally speaking, you do pay a penalty on single-core performance when a tool becomes a Spark tool, but you gain the ability to easily scale to multiple cores and get the job done quickly. This is why we've been maintaining both Spark and non-Spark versions of important tools. Whether this will be the case for your tools as well can only be determined by profiling. If you extract the logic of your tool into a separate class, it's usually possible to call that shared code from both the Spark and walker frameworks without much or any code duplication.
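A sketch of the "extract the shared logic" pattern described above, using made-up class and method names rather than real GATK classes: the per-record computation lives in one place, and both a single-threaded walker-style loop and a Spark RDD call it.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class SharedLogicSketch {
    /** Pure, thread-safe per-record logic shared by both code paths. */
    static long scoreRecord(String record) {
        return record.length();
    }

    /** Walker-style path: process records one at a time in a single thread. */
    static long runAsWalker(List<String> records) {
        long total = 0;
        for (String r : records) {
            total += scoreRecord(r);   // same shared logic
        }
        return total;
    }

    /** Spark path: the same logic applied inside an RDD map. */
    static long runAsSpark(List<String> records) {
        SparkConf conf = new SparkConf().setAppName("SharedLogicSketch").setMaster("local[2]");
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            return ctx.parallelize(records)
                      .map(SharedLogicSketch::scoreRecord)   // same shared logic
                      .reduce(Long::sum);
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("ACGT", "ACGTACGT", "AC");
        System.out.println("walker: " + runAsWalker(records));
        System.out.println("spark:  " + runAsSpark(records));
    }
}
```

Keeping the shared method free of engine state is what makes it safe to call from Spark workers and from a plain loop alike.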
Thanks a lot for all your feedback about this @lbergelson and @droazen. From my side this could be closed now, although it may be useful to have some of this information in the Wiki to avoid confusion. Thank you very much again!
👍 to archiving this content on the wiki -- plenty of great information in here (I've been lurking) |
Changing this into a documentation ticket. |
Hi @droazen, magicDGS, I hope you're doing well.
Hi @irgp2019 |
thank you @gokalpcelik |
Exactly. |
thanks a lot |
In the classic GATK, walkers had the option to be multi-threaded in two different ways:

- `NanoSchedulable` for thread-safe `map()` calls.
- `TreeReducible` for thread-safe `map()` and `reduce()` calls.

Because the new framework's walkers have only one `apply()` function, the previous design may not be applicable. Nevertheless, it would be useful to have a way for a tool to apply that function in a multi-threaded way. Is there any plan to implement something similar in GATK4?