NCCL multi-GPU reductions #449
Conversation
…used to destroy NCCL communicator
Codecov Report
@@            Coverage Diff             @@
##           master     #449      +/-   ##
==========================================
- Coverage   88.08%   87.21%   -0.88%
==========================================
  Files          78       78
  Lines       16824    17038     +214
==========================================
+ Hits        14820    14860      +40
- Misses       2004     2178     +174
Continue to review full report at Codecov.
The approach seems sound and, as far as I can tell, it should be alright (I haven't checked every implementation detail).
As hoped, the step from #447 to multi-GPU training was pretty small. Basically, we re-use the code from #316, which allows the generation of merged host code (so big models don't end up with hundreds of lines of NCCL calls), to generate NCCL reductions after the batch reduction kernels. These operations use standard CUDA stream semantics, so they won't cause any synchronisation unless they're followed by a synchronisation or memcpy operation. The only slightly gnarly bit is that we don't want platform-specific CUDA (or, in this case, NCCL) types to bleed out of the generated code, so the unique id (which identifies the 'clique' of nodes that are communicating) is exposed as a raw pointer, which PyGeNN then wraps in a numpy view.
You need to combine the code this PR generates with something that can spawn multiple processes and communicate between them. I have used mpi4py on JUWELS Booster, where you can do something like:
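In outline: rank 0 generates the NCCL unique id, it is broadcast over MPI via the numpy view described above, and then every rank initialises its communicator. A minimal sketch, assuming hypothetical PyGeNN accessor names (nccl_generate_unique_id, nccl_unique_id and nccl_init_communicator on the model's shared library model) and an already-built and loaded model; the real names in this PR may differ:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
num_ranks = comm.Get_size()

# ... build and load the GeNN model as usual, one process per GPU ...

# Rank 0 generates the unique id identifying the NCCL 'clique'
if rank == 0:
    model._slm.nccl_generate_unique_id()            # assumed accessor

# The id is exposed as a numpy view over the generated code's buffer,
# so it can be broadcast in-place to the other ranks
unique_id = np.asarray(model._slm.nccl_unique_id)   # assumed accessor
comm.Bcast(unique_id, root=0)

# Each rank joins the communicator with its rank and the clique size
model._slm.nccl_init_communicator(rank, num_ranks)  # assumed accessor
```

Launched with e.g. srun or mpirun, one process per GPU.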
But you could also use the Python multiprocessing module, e.g. on a multi-GPU workstation where you might not want to use MPI (I envisage this abstraction being handled by mlGeNN, as you're likely to also want to use MPI/multiprocessing to share e.g. the number of correct classifications across all nodes; a sketch of that follows below).
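For instance, combining per-rank classification counts is a one-liner with a standard mpi4py collective (a sketch; evaluate_shard is a hypothetical helper standing in for whatever your evaluation loop does on each rank):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

# This rank's count of correct classifications on its shard of the data
local_num_correct = evaluate_shard()  # hypothetical helper

# Sum the counts across all ranks
total_num_correct = comm.allreduce(local_num_correct, op=MPI.SUM)
```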
Results from JUWELS Booster (each node has 4× A100 GPUs) and JADE2 (each node has 8× V100 GPUs):