Data parallel training support #79

Merged: 15 commits into master on Dec 7, 2023
Conversation

neworderofjamie (Contributor) commented:

Pleasingly, this was actually very easy to do! Basically:

  • There's a new class of things called 'communicators', which let you query the rank, number of ranks etc. and perform basic communications. I've made an mpi4py implementation for now, as that's what my old code used (see the sketch after this list)
  • CompiledNetwork does some basic things if a communicator is provided:
    • Only building on the first rank
    • Waiting on a barrier before loading
    • Doing the NCCL initialisation
  • The compiler subdivides batches across ranks if a communicator is provided and turns on the magic NCCL flag so GeNN generates the additional bits of code (NCCL multi-GPU reductions genn#449)
  • Metrics like SparseCategoricalAccuracy get passed the communicator and use it to combine their results across ranks (see the second sketch below)
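
For concreteness, here's a minimal sketch of what such a communicator might look like when backed by mpi4py. The class and method names (`MPICommunicator`, `reduce_sum`, etc.) are illustrative assumptions, not mlGeNN's actual API:

```python
# Hypothetical sketch of a communicator backed by mpi4py; the class
# and method names are illustrative, not mlGeNN's actual API.
from mpi4py import MPI


class MPICommunicator:
    """Exposes rank/size queries and basic collective operations."""

    def __init__(self):
        self._comm = MPI.COMM_WORLD

    @property
    def rank(self):
        # Index of this process within the job (0 .. num_ranks - 1)
        return self._comm.Get_rank()

    @property
    def num_ranks(self):
        # Total number of processes taking part in training
        return self._comm.Get_size()

    def barrier(self):
        # Block until every rank has reached this point
        self._comm.Barrier()

    def broadcast(self, data, root=0):
        # Send a picklable Python object from root to all ranks
        return self._comm.bcast(data, root=root)

    def reduce_sum(self, value):
        # Sum a value across all ranks; every rank receives the result
        return self._comm.allreduce(value, op=MPI.SUM)
```

Given something like this, the build-on-rank-0-then-barrier behaviour described above is just a few lines (`compile_and_build` and `load_network` are hypothetical placeholders):

```python
comm = MPICommunicator()

if comm.rank == 0:
    compile_and_build()  # hypothetical: only the first rank builds
comm.barrier()           # every rank waits until the build has finished
load_network()           # hypothetical: now safe to load on all ranks
```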
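
And a hedged sketch of how a metric might use the communicator to combine its counts across ranks, assuming the `MPICommunicator` above; this shows the general pattern, not the actual mlGeNN implementation:

```python
import numpy as np


class DistributedSparseCategoricalAccuracy:
    """Accumulates counts locally, then sums them across ranks."""

    def __init__(self, communicator=None):
        self._comm = communicator
        self._correct = 0
        self._total = 0

    def update(self, y_true, y_pred):
        # y_true: integer labels; y_pred: per-class scores
        self._correct += int(np.sum(np.argmax(y_pred, axis=1) == y_true))
        self._total += len(y_true)

    @property
    def result(self):
        correct, total = self._correct, self._total
        if self._comm is not None:
            # Combine per-rank counts so every rank sees the global
            # accuracy rather than the accuracy of its own shard
            correct = self._comm.reduce_sum(correct)
            total = self._comm.reduce_sum(total)
        return correct / total
```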

Other than that, it's all just passing the communicator around, plus a few places where the 'full' batch size is needed rather than the scaled-down one, e.g. for scaling in the EventProp compiler. I've also added a couple of additional examples (at some point I need to tidy the examples up a bit) which demonstrate how you need to change your code to run across multiple GPUs; mostly it's just splitting the dataset and turning off progress bars etc. on all but the first rank (a sketch of those changes follows below).
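
As a rough illustration of those script-level changes, again using the hypothetical `MPICommunicator`; the batch size, dataset and variable names are made up for the example:

```python
import numpy as np

comm = MPICommunicator()

# The compiler builds with a per-rank batch size, but places like
# EventProp's scaling still need the full (global) batch size
full_batch_size = 256
batch_size = full_batch_size // comm.num_ranks

# Toy stand-in for the training data: each rank takes one shard
dataset = np.arange(60000)
shard = np.array_split(dataset, comm.num_ranks)[comm.rank]

# Only show progress bars on the first rank to avoid duplicated output
verbose = (comm.rank == 0)
```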

@neworderofjamie added the enhancement label on Oct 12, 2023
@neworderofjamie added this to the mlGeNN 2.2 milestone on Oct 12, 2023
@tnowotny (Member) left a comment:

Wow - that looks surprisingly simple and elegant.
I was at first confused about where the checkpoints would go, but I see now that only rank 0 writes, so that's fine.

@neworderofjamie merged commit 2d785ab into master on Dec 7, 2023. 1 check passed.