NCCL multi-GPU reductions #449
Conversation
…used to destroy NCCL communicator
Codecov Report
@@            Coverage Diff             @@
##           master     #449      +/-   ##
==========================================
- Coverage   88.08%   87.21%   -0.88%
==========================================
  Files          78       78
  Lines       16824    17038     +214
==========================================
+ Hits        14820    14860      +40
- Misses       2004     2178     +174
Continue to review full report at Codecov.
The approach seems sound and, as far as I can tell, it should be alright (I haven't checked every implementation detail).
As hoped, the step from #447 to multi-GPU training was pretty small. Basically, we re-use the code from #316, which allows the generation of merged host code (so big models don't end up with hundreds of lines of NCCL calls), to generate NCCL reductions after the batch reduction kernels. These operations use standard CUDA stream semantics, so they won't cause any synchronisation unless they're followed by a synchronisation or memcpy operation. The only slightly gnarly bit is that we don't want platform-specific CUDA (or, in this case, NCCL) types to bleed out of the generated code, so the unique id (which identifies the 'clique' of nodes that are communicating) is exposed as a raw pointer, which PyGeNN then wraps in a numpy view.
You need to combine the code this PR generates with something that can spawn multiple processes and communicate between them. I have used mpi4py on JUWELS Booster, where you can do something like:
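In outline: rank 0 generates the NCCL unique id, it is broadcast over MPI via the numpy view described above, and then every rank initialises its communicator. A minimal sketch, assuming hypothetical PyGeNN accessor names (nccl_generate_unique_id, nccl_unique_id and nccl_init_communicator on the model's shared library model) and an already-built and loaded model; the real names in this PR may differ:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
num_ranks = comm.Get_size()

# ... build and load the GeNN model as usual, one process per GPU ...

# Rank 0 generates the unique id identifying the NCCL 'clique'
if rank == 0:
    model._slm.nccl_generate_unique_id()            # assumed accessor

# The id is exposed as a numpy view over the generated code's buffer,
# so it can be broadcast in-place to the other ranks
unique_id = np.asarray(model._slm.nccl_unique_id)   # assumed accessor
comm.Bcast(unique_id, root=0)

# Each rank joins the communicator with its rank and the clique size
model._slm.nccl_init_communicator(rank, num_ranks)  # assumed accessor
```

Launched with e.g. srun or mpirun, one process per GPU.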
But you could also use the Python multiprocessing module, e.g. on a multi-GPU workstation where you might not want to use MPI (I envisage this abstraction being handled by mlGeNN, as you're likely to also want to use MPI/multiprocessing to share e.g. the number of correct classifications across all nodes; a sketch of that follows below).
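For instance, combining per-rank classification counts is a one-liner with a standard mpi4py collective (a sketch; evaluate_shard is a hypothetical helper standing in for whatever your evaluation loop does on each rank):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

# This rank's count of correct classifications on its shard of the data
local_num_correct = evaluate_shard()  # hypothetical helper

# Sum the counts across all ranks
total_num_correct = comm.allreduce(local_num_correct, op=MPI.SUM)
```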
Results from JUWELS Booster (each node has 4× A100 GPUs) and JADE2 (each node has 8× V100 GPUs):