Document how multi-threading support works in GATK4 #2345
In GATK4, the way to make a tool multithreaded is to implement it as a Spark tool. All Spark tools can be trivially parallelized across multiple threads using the local runner, and across a cluster using spark-submit or gcloud. We wanted to avoid the complexities of implementing our own map/reduce framework, as was done in previous versions of the GATK, and instead rely on a standard, third-party framework to keep the GATK4 engine as simple as possible.
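To make the idea above concrete, here is a minimal sketch using the plain Spark Java API (not actual GATK engine code, and the class name is made up): the per-record work is written once as a map/reduce over an RDD, and Spark decides how many threads or executors run it based on the configured master.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class SparkStyleToolSketch {
    public static void main(String[] args) {
        // "local[4]" runs in-process with 4 worker threads; under spark-submit the
        // same code runs unchanged on whatever master the cluster provides.
        SparkConf conf = new SparkConf()
                .setAppName("SparkStyleToolSketch")
                .setIfMissing("spark.master", "local[4]");
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            List<String> records = Arrays.asList("read1", "read2", "read3", "read4");
            long processed = ctx.parallelize(records)
                                .map(r -> 1L)        // per-record work, executed in parallel
                                .reduce(Long::sum);  // combine partial results
            System.out.println("Processed " + processed + " records");
        }
    }
}
```

The only dependency is spark-core; the exact Spark version GATK4 builds against may differ from whatever is on your classpath.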
I don't know too much about Spark, so maybe I have a stupid question: how can a Spark tool be run in multiple threads on a single computer? I mean, that requires some setup of the local computer, doesn't it?
And thank you very much for the quick answer, @droazen!
@magicDGS You can run a spark tool on a single computer with N threads by specifying `local[N]` as the Spark master.
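For reference, `local[N]` is plain Spark's way of asking for an in-process run with N worker threads (and `local[*]` uses one thread per available core). A minimal sketch of setting it programmatically, with a hypothetical helper name:

```java
import org.apache.spark.SparkConf;

public class LocalMasterSketch {
    /** Builds a Spark config that runs the job in the current JVM with nThreads workers. */
    static SparkConf localConf(String appName, int nThreads) {
        // "local[N]" = single machine, N threads; "local[*]" = all available cores.
        return new SparkConf().setAppName(appName).setMaster("local[" + nThreads + "]");
    }
}
```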
Thanks @lbergelson. But does that require the gatk-launch script, or just the shadow jar?
Ah, yes, that's kind of confusing actually... You don't need to use gatk-launch, but since gatk-launch also applies some default arguments when running on Spark, you may have to supply them yourself if you're not using it.
Thanks a lot for this explanation. For the moment I'm more interested in the toolkit that I'm implementing, which does not have a master script for running. It is good to know that if I implement a Spark tool it could run locally with a shadow jar. I will study a bit more about Spark in the near future. So just one last question, and I will close this: why do some tools have both a Spark version and a normal version in GATK4? If the Spark version can run locally, is there any performance issue related to running it without Spark? Thanks a lot for all your answers, it is very informative :)
Well, you're welcome to use gatk-launch as a launch script if you'd like (and feel free to rename it to whatever you like...). There are a few reasons we have Spark and non-Spark versions of the tools.
We don't yet fully understand Spark performance and its caveats; we're looking into that actively now. We hope that we'll be able to optimize our tools so that a Spark pipeline of several tools in series is faster than running the individual non-Spark versions, since it lets us avoid doing things like loading the bam file multiple times from disk. Whether or not we can achieve this is still an open question though.
Thank you very much for the detailed answer, @lbergelson. I understand points 1 and 3, but regarding 2: is there also a cost to running a Spark tool with 1 thread? I would love to use the framework to "sparkify" my tools, but I would like to be sure that there is no cost to running them locally without multi-threading...
We've found that, generally speaking, you do pay a penalty on single-core performance when a tool becomes a Spark tool, but you gain the ability to easily scale to multiple cores and get the job done quickly. This is why we've been maintaining both Spark and non-Spark versions of important tools. Whether this will be the case for your tools as well can only be determined by profiling. If you extract the logic of your tool into a separate class, it's usually possible to call that shared code from both the Spark and walker frameworks without much or any code duplication.
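A sketch of the "extract the shared logic" pattern described above, using made-up class and method names rather than real GATK classes: the per-record computation lives in one place, and both a single-threaded walker-style loop and a Spark RDD call it.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class SharedLogicSketch {
    /** Pure, thread-safe per-record logic shared by both code paths. */
    static long scoreRecord(String record) {
        return record.length();
    }

    /** Walker-style path: process records one at a time in a single thread. */
    static long runAsWalker(List<String> records) {
        long total = 0;
        for (String r : records) {
            total += scoreRecord(r);   // same shared logic
        }
        return total;
    }

    /** Spark path: the same logic applied inside an RDD map. */
    static long runAsSpark(List<String> records) {
        SparkConf conf = new SparkConf().setAppName("SharedLogicSketch").setMaster("local[2]");
        try (JavaSparkContext ctx = new JavaSparkContext(conf)) {
            return ctx.parallelize(records)
                      .map(SharedLogicSketch::scoreRecord)   // same shared logic
                      .reduce(Long::sum);
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("ACGT", "ACGTACGT", "AC");
        System.out.println("walker: " + runAsWalker(records));
        System.out.println("spark:  " + runAsSpark(records));
    }
}
```

Keeping the shared method free of engine state is what makes it safe to call from Spark workers and from a plain loop alike.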
Thanks a lot for all your feedback about this @lbergelson and @droazen. From my side this could be closed now, although it may be useful to have some of this information in the Wiki to avoid confusion. Thank you very much again!
👍 to archiving this content on the wiki -- plenty of great information in here (I've been lurking) |
Changing this into a documentation ticket. |
Hi @droazen, magicDGS, I hope you're doing well.
Hi @irgp2019 |
thank you @gokalpcelik |
Exactly. |
thanks a lot |
In the classic GATK, walkers had the option to be multi-threaded in two different ways:

- `NanoSchedulable` for thread-safe `map()` calls.
- `TreeReducible` for thread-safe `map()` and `reduce()` calls.

Because the new framework's walkers have only one `apply()` function, the previous design may not be applicable. Nevertheless, it would be useful to have a way for a tool to apply that function in a multi-threaded way. Is there any plan to implement something similar in GATK4?