
NCCL multi-GPU reductions #449

Merged · 18 commits · Sep 6, 2021

Conversation


@neworderofjamie (Contributor) commented Aug 11, 2021

As hoped, the step from #447 to multi-GPU training was pretty small. Basically, we re-use the code from #316, which allows the generation of merged host code (so big models don't end up with hundreds of lines of NCCL calls), to generate NCCL reductions after the batch reduction kernels. These operations use standard CUDA stream semantics, so they won't cause any synchronisation unless they're followed by a synchronisation or memcpy operation. The only slightly gnarly bit is that we don't want platform-specific CUDA (or, in this case, NCCL) types to bleed out of the generated code, so the unique ID (which identifies the 'clique' of nodes that are communicating) is exposed as a pointer, which is in turn exposed to PyGeNN as a numpy view.

You need to combine the code this PR generates with something that can spawn multiple processes and communicate between them. I have used mpi4py on the JUWELS Booster, where you can do something like:

from mpi4py import MPI

# Get communicator
comm = MPI.COMM_WORLD

# Get our rank and number of ranks
rank = comm.Get_rank()
num_ranks = comm.Get_size()

# Generate unique ID for our NCCL 'clique' on first rank
if rank == 0:
    model._slm.nccl_generate_unique_id()

# Get a numpy view of the NCCL clique ID and broadcast it from the
# first rank so all ranks end up with the same ID
nccl_unique_id_view = model._slm.nccl_assign_external_unique_id()
comm.Bcast(nccl_unique_id_view, root=0)

# Initialise NCCL communicator
model._slm.nccl_init_communicator(rank, num_ranks)
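Each rank is then launched as a separate process (typically one per GPU) by your MPI launcher, e.g. with something like `srun -n 4 python train.py` (script name hypothetical); after this bootstrap, training proceeds as normal and the generated NCCL reductions keep the ranks' gradients in sync.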

But you could also use the Python multiprocessing module, e.g. on a multi-GPU workstation where you might not want to use MPI (see the sketch below). I envisage this abstraction being handled by mlGeNN, as you're likely to also want to use MPI/multiprocessing to share e.g. the number of correct classifications across all nodes.
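As a rough illustration, a minimal multiprocessing sketch might look like the following. It mirrors the MPI bootstrap above but passes the unique ID between processes over queues; build_model() is a hypothetical placeholder for building/loading the compiled model in each worker process:

import multiprocessing as mp

import numpy as np

NUM_RANKS = 4  # e.g. one process per GPU on a workstation

def worker(rank, id_queues):
    # Hypothetical helper: build/load the compiled GeNN model in this
    # process (each rank needs its own model instance and GPU)
    model = build_model(rank)

    # Generate unique ID for our NCCL 'clique' on first rank
    if rank == 0:
        model._slm.nccl_generate_unique_id()

    # Get a numpy view of the NCCL clique ID storage
    nccl_unique_id_view = model._slm.nccl_assign_external_unique_id()

    if rank == 0:
        # Send a copy of the ID's bytes to every other rank
        for q in id_queues[1:]:
            q.put(bytes(nccl_unique_id_view))
    else:
        # Copy the received bytes into our own view
        received = np.frombuffer(id_queues[rank].get(),
                                 dtype=nccl_unique_id_view.dtype)
        nccl_unique_id_view[:] = received

    # Initialise NCCL communicator
    model._slm.nccl_init_communicator(rank, NUM_RANKS)

if __name__ == "__main__":
    queues = [mp.Queue() for _ in range(NUM_RANKS)]
    processes = [mp.Process(target=worker, args=(r, queues))
                 for r in range(NUM_RANKS)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()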

Results from JUWELS Booster (each node has 4× A100 GPUs) and JADE2 (each node has 8× V100 GPUs):
[Figure 1: multi-GPU scaling results on JUWELS Booster and JADE2]

@neworderofjamie added this to the GeNN 4.6.0 milestone Aug 11, 2021

codecov bot commented Aug 11, 2021

Codecov Report

Merging #449 (e0d5e43) into master (36d3e83) will decrease coverage by 0.87%.
The diff coverage is 22.66%.


@@            Coverage Diff             @@
##           master     #449      +/-   ##
==========================================
- Coverage   88.08%   87.21%   -0.88%     
==========================================
  Files          78       78              
  Lines       16824    17038     +214     
==========================================
+ Hits        14820    14860      +40     
- Misses       2004     2178     +174     
Impacted Files                                       Coverage Δ
include/genn/genn/code_generator/backendBase.h       91.52% <ø> (ø)
include/genn/genn/code_generator/groupMerged.h       86.65% <0.00%> (-3.03%) ⬇️
src/genn/genn/code_generator/modelSpecMerged.cc      96.31% <10.00%> (-2.18%) ⬇️
src/genn/genn/code_generator/groupMerged.cc          88.80% <10.71%> (-1.27%) ⬇️
src/genn/backends/cuda/backend.cc                    82.45% <17.51%> (-5.77%) ⬇️
include/genn/genn/code_generator/modelSpecMerged.h   97.40% <60.00%> (-1.27%) ⬇️
src/genn/genn/code_generator/generateRunner.cc       95.85% <71.42%> (-0.25%) ⬇️
src/genn/backends/single_threaded_cpu/backend.cc     56.97% <75.00%> (+0.02%) ⬆️
include/genn/backends/cuda/backend.h                 96.29% <100.00%> (+0.14%) ⬆️
include/genn/backends/opencl/backend.h               98.86% <100.00%> (+0.01%) ⬆️
... and 2 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@tnowotny (Member) left a comment


The approach seems sound and, as far as I can tell, it should be alright (I haven't checked every implementation detail).
