Document how multi-threading support works in GATK4 #2345

Open
magicDGS opened this issue Jan 17, 2017 · 18 comments

@magicDGS
Contributor

In the classic GATK, walkers had the option to be multi-threaded in two different ways:

  • NanoSchedulable for thread-safe map() calls.
  • TreeReducible for thread-safe map() and reduce() calls.

Because the new framework's walkers have only one apply() function, the previous design may no longer be applicable. Nevertheless, it would be useful to have a way for a tool to apply that function in a multi-threaded way. Is there any plan to implement something similar in GATK4?

@droazen
Contributor

droazen commented Jan 17, 2017

In GATK4, the way to make a tool multithreaded is to implement it as a Spark tool. All Spark tools can be trivially parallelized across multiple threads using the local runner, and across a cluster using spark-submit or gcloud.

We wanted to avoid the complexities of implementing our own map/reduce framework, as was done in previous versions of the GATK, and instead rely on a standard, third-party framework to keep the GATK4 engine as simple as possible.

@magicDGS
Contributor Author

I don't know much about Spark, so this may be a stupid question: how can a Spark tool be run in multiple threads on a single computer? I mean, that requires some setup of the local computer, doesn't it?

@magicDGS
Contributor Author

And thank you very much for the quick answer, @droazen!

@lbergelson
Member

@magicDGS You can run a spark tool on a single computer with N threads by specifying --sparkMaster 'local[N]' where N is the number of threads you want to use. If that number is big (>8ish) you might want to consider setting up a spark master and using yarn, which is a bit more complicated but not very difficult. If that's the case let me know and I can point you to some resources. For use on the average laptop it makes sense to just run with sparkMaster local.
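As a minimal sketch of the invocation described above: the helper `local_spark_master` and the tool name `SomeSparkTool` are hypothetical placeholders, and the `--sparkMaster` spelling is taken from this comment (other GATK versions may spell the flag differently). The command is printed rather than executed, so the snippet runs without GATK installed.

```shell
# Hypothetical helper (not part of GATK): builds the value for --sparkMaster.
local_spark_master() {
    threads="$1"
    # Reject empty or non-numeric input.
    case "$threads" in
        ''|*[!0-9]*) echo "thread count must be a positive integer" >&2; return 1 ;;
    esac
    # Reject zero threads.
    if [ "$threads" -lt 1 ]; then
        echo "thread count must be a positive integer" >&2
        return 1
    fi
    printf 'local[%s]\n' "$threads"
}

# SomeSparkTool is a placeholder; tool-specific arguments are omitted.
# The command is only printed, not run.
echo "./gatk-launch SomeSparkTool --sparkMaster '$(local_spark_master 4)'"
```

The value is kept in quotes because `[` and `]` are glob characters in most shells, so an unquoted `local[4]` could be rewritten by the shell before the tool sees it.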

@magicDGS
Contributor Author

Thanks @lbergelson. But does that require the gatk-launch script, or just the shadow jar?

@lbergelson
Member

Ah, yes, that's kind of confusing actually...
The shadow jar includes a copy of Spark and all of its dependencies, so if you want to run Spark tools locally you can use the shadow jar. If you want to use an existing Spark cluster, which may have slightly different versions of Spark or its dependencies, then you need to use the spark jar. The spark jar doesn't include its own copy of Spark and expects the cluster to provide the necessary dependencies. This avoids conflicts between different dependency versions.

You don't need to use gatk-launch ever, but it can make it easier if you want to potentially run your code in different environments. It knows about 3 different potential ways to invoke spark, 1) running in local mode with --sparkMaster local, 2) running on a cluster using spark-submit and 3) running on a google dataproc cluster using gcloud. Gatk-launch knows which environment needs which jar and will prompt you to create one if you don't have it.

gatk-launch also applies some default arguments when running on Spark; you may have to supply them yourself if you're not using it.
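The jar choice described above can be summarized in a small lookup. This is only a sketch: `jar_for_environment` is a hypothetical helper, not how gatk-launch is actually implemented, and it covers only the two cases this comment spells out. Which jar a Dataproc run needs is not stated here, so that case is deliberately left out.

```shell
# Hypothetical helper (not part of GATK): maps a run environment to the jar
# type named in the comment above.
jar_for_environment() {
    case "$1" in
        local)   echo "shadow" ;;  # shadow jar bundles Spark and its dependencies
        cluster) echo "spark" ;;   # spark jar expects the cluster to provide Spark
        *)       echo "unknown environment: $1" >&2; return 1 ;;
    esac
}

for env in local cluster; do
    echo "$env -> $(jar_for_environment "$env") jar"
done
```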

@magicDGS
Contributor Author

Thanks a lot for this explanation. For the moment I'm more interested in the toolkit that I'm implementing, which does not have a master script for running. It is good to know that if I implement a Spark tool it could run locally with a shadow jar. I will study a bit more about spark in the near future.

So just one last question, and then I will close this: why are there some tools that have both a Spark version and a normal version in GATK4? If the Spark version can run locally, is there any performance issue related to running it without Spark?

Thanks a lot for all your answers, it is very informative :)

@lbergelson
Member

Well, you're welcome to use gatk-launch as a launch script if you'd like (and feel free to rename it to whatever you like...)

There are a few reasons we have spark and non-spark versions of the tools.

  1. We wanted to port and validate certain tools as quickly as possible and doing a direct port from gatk3 -> gatk4 was easier than making them sparkified at the same time.

  2. There's a tradeoff in using spark where you end up spending more total cpu hours in order to finish a job faster. Ideally this would be 1:1, double the number of cores and you halve the time to finish a job. It never scales perfectly though, there's always some overhead for being parallel. Our production pipelines are extremely sensitive to cost and not very sensitive to runtime, so they prefer we have a version that's optimized to use the least cpu hours even if that means a longer runtime. Other users prefer to be able to finish a job quickly and are willing to pay slightly more to do so, so we also have a spark version.

  3. Some tools are complicated to make work well on Spark. Spark works best when you can divide the input data into independent shards and then process them separately. This is complicated for things like the AssemblyRegion walker, where you need context around each location of interest. We had to do things like add extra overlapping padding to avoid boundary issues at shard divisions.
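To make the padding idea in point 3 concrete, here is a rough sketch of cutting a contig into shards that each carry extra overlapping context. It is not GATK's actual sharding code; the function name, the 1-based inclusive coordinates, and the example sizes are all assumptions chosen for illustration.

```shell
# Hypothetical sketch: print "start end padded_start padded_end" for each shard.
# Coordinates are 1-based and inclusive; padding is clipped to the contig bounds.
padded_shards() {
    length="$1"; shard_size="$2"; padding="$3"
    if [ "$shard_size" -lt 1 ]; then
        echo "shard size must be at least 1" >&2
        return 1
    fi
    start=1
    while [ "$start" -le "$length" ]; do
        end=$((start + shard_size - 1))
        if [ "$end" -gt "$length" ]; then end="$length"; fi
        padded_start=$((start - padding))
        if [ "$padded_start" -lt 1 ]; then padded_start=1; fi
        padded_end=$((end + padding))
        if [ "$padded_end" -gt "$length" ]; then padded_end="$length"; fi
        echo "$start $end $padded_start $padded_end"
        start=$((end + 1))
    done
}

# A 1000-base contig, 400-base shards, 50 bases of padding on each side.
padded_shards 1000 400 50
```

Each shard is responsible only for its unpadded interval, while the padded interval supplies the surrounding context, so neighbouring shards read some of the same data.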

We don't yet fully understand Spark performance and its caveats; we're actively looking into that now. We hope that we'll be able to optimize our tools so that a Spark pipeline of several tools in series is faster than running the individual non-Spark versions, since it lets us avoid doing things like loading the bam file multiple times from disk. Whether or not we can achieve this is still an open question, though.
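The trade-off in point 2 can be put into numbers with a simple model. The figures below are invented for illustration and are not measurements of any GATK tool; the model assumes a single "parallel efficiency" percentage, which is a simplification.

```shell
# Hypothetical model: t = single-core runtime in minutes, n = cores,
# e = parallel efficiency in percent (100 means perfect scaling).
spark_tradeoff() {
    awk -v t="$1" -v n="$2" -v e="$3" 'BEGIN {
        wall = t / (n * e / 100)   # elapsed time shrinks with more cores
        cpu  = n * wall            # total CPU time paid for
        printf "wall_minutes=%.1f cpu_minutes=%.1f\n", wall, cpu
    }'
}

spark_tradeoff 600 1 100   # one core, no parallel overhead
spark_tradeoff 600 8 75    # eight cores at an assumed 75% efficiency
```

Under these assumed numbers the eight-core run finishes in 100 minutes instead of 600, but consumes 800 CPU-minutes instead of 600, which is the cost-versus-runtime choice described above.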

@magicDGS
Contributor Author

Thank you very much for the detailed answer, @lbergelson. I understand points 1 and 3, but regarding 2: is there also a cost to running a Spark tool with 1 thread? I would love to use the framework to "sparkify" my tools, but I would like to be sure that there is no cost to running them locally without multi-threading...

@droazen
Contributor

droazen commented Jan 18, 2017

We've found that, generally speaking, you do pay a penalty on single-core performance when becoming a Spark tool, but gain the ability to easily scale to multiple cores and get the job done quickly. This is why we've been maintaining both Spark and non-Spark versions of important tools. Whether this will be the case for your tools as well can only be determined by profiling.

If you extract the logic of your tool into a separate class, it's usually possible to call that shared code from both the Spark and walker frameworks without much or any code duplication. See BaseRecalibrator and BaseRecalibratorSpark for an example of this.

@magicDGS
Contributor Author

Thanks a lot for all your feedback about this, @lbergelson and @droazen. From my side this could be closed now, although it may be useful to have some of this information in the Wiki to avoid confusion.

Thank you very much again!

@vdauwera
Contributor

👍 to archiving this content on the wiki -- plenty of great information in here (I've been lurking)

@droazen changed the title from "Question: multi-thread support in GATK4" to "Document how multi-thread support works in GATK4" on Mar 22, 2017
@droazen added this to the 4.0 release milestone on Mar 22, 2017
@droazen
Contributor

droazen commented Mar 22, 2017

Changing this into a documentation ticket.

@droazen changed the title from "Document how multi-thread support works in GATK4" to "Document how multi-threading support works in GATK4" on Mar 22, 2017
@droazen assigned vdauwera and unassigned lbergelson on Oct 17, 2017
@droazen modified the milestones: Engine-4.0, Engine-4.1 on Jan 16, 2018
@droazen removed this from the Engine-2Q2018 milestone on Oct 4, 2018
@irgp2019

irgp2019 commented Nov 11, 2024

Hi @droazen and @magicDGS, I hope you are doing well.
Could you please guide me on how to install GATK Spark on a local machine?
I want to run this command with multiple threads, but it runs on a single core:
gatk CombineGVCFs -R hg19.fa --variant combined.g.vcf.gz --variant comb2.g.vcf.gz -O output.g.vcf.gz

@gokalpcelik
Contributor

Hi @irgp2019
Only tools with "Spark" in their name can be run on a local or cluster Spark instance. Other tools are still single-threaded. CombineGVCFs is still a single-threaded tool.
Regards.
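The naming rule above can be expressed as a quick check. This is a heuristic sketch based only on the convention described in this reply: `is_spark_tool` is a hypothetical helper, not a GATK command, and it inspects nothing but the tool name.

```shell
# Hypothetical helper: succeeds when the tool name ends in "Spark".
is_spark_tool() {
    case "$1" in
        *Spark) return 0 ;;
        *)      return 1 ;;
    esac
}

# Both tool names appear earlier in this thread.
for tool in CombineGVCFs BaseRecalibratorSpark; do
    if is_spark_tool "$tool"; then
        echo "$tool: Spark tool, can use a local or cluster Spark instance"
    else
        echo "$tool: not a Spark tool, runs single-threaded"
    fi
done
```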

@irgp2019

Thank you, @gokalpcelik.
So you mean there is no way to run CombineGVCFs with multiple threads?

@gokalpcelik
Contributor

Exactly.

@irgp2019

Thanks a lot.

7 participants