Data parallel training support #79

Merged: 15 commits into master on Dec 7, 2023
Conversation

neworderofjamie (Contributor) commented:

Pleasingly, this was actually very easy to do! Basically:

  • There's a new class of things called 'communicators', which let you query the rank, number of ranks etc. and perform basic communications. I've made an mpi4py implementation for now, as that's what my old code used (see the sketch after this list)
  • CompiledNetwork does some basic things if a communicator is provided:
    • Only building on the first rank
    • Waiting on a barrier before loading
    • Doing the NCCL initialisation
  • The compiler subdivides batches across ranks if a communicator is provided and turns on the magic NCCL flag so GeNN generates the additional bits of code (NCCL multi-GPU reductions genn#449)
  • Metrics like SparseCategoricalAccuracy get passed the communicator and use it to combine their results across ranks (see the second sketch below)
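
For concreteness, here's a minimal sketch of what such a communicator might look like when backed by mpi4py. The class and method names (`MPICommunicator`, `reduce_sum`, etc.) are illustrative assumptions, not mlGeNN's actual API:

```python
# Hypothetical sketch of a communicator backed by mpi4py; the class
# and method names are illustrative, not mlGeNN's actual API.
from mpi4py import MPI


class MPICommunicator:
    """Exposes rank/size queries and basic collective operations."""

    def __init__(self):
        self._comm = MPI.COMM_WORLD

    @property
    def rank(self):
        # Index of this process within the job (0 .. num_ranks - 1)
        return self._comm.Get_rank()

    @property
    def num_ranks(self):
        # Total number of processes taking part in training
        return self._comm.Get_size()

    def barrier(self):
        # Block until every rank has reached this point
        self._comm.Barrier()

    def broadcast(self, data, root=0):
        # Send a picklable Python object from root to all ranks
        return self._comm.bcast(data, root=root)

    def reduce_sum(self, value):
        # Sum a value across all ranks; every rank receives the result
        return self._comm.allreduce(value, op=MPI.SUM)
```

Given something like this, the build-on-rank-0-then-barrier behaviour described above is just a few lines (`compile_and_build` and `load_network` are hypothetical placeholders):

```python
comm = MPICommunicator()

if comm.rank == 0:
    compile_and_build()  # hypothetical: only the first rank builds
comm.barrier()           # every rank waits until the build has finished
load_network()           # hypothetical: now safe to load on all ranks
```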
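
And a hedged sketch of how a metric might use the communicator to combine its counts across ranks, assuming the `MPICommunicator` above; this shows the general pattern, not the actual mlGeNN implementation:

```python
import numpy as np


class DistributedSparseCategoricalAccuracy:
    """Accumulates counts locally, then sums them across ranks."""

    def __init__(self, communicator=None):
        self._comm = communicator
        self._correct = 0
        self._total = 0

    def update(self, y_true, y_pred):
        # y_true: integer labels; y_pred: per-class scores
        self._correct += int(np.sum(np.argmax(y_pred, axis=1) == y_true))
        self._total += len(y_true)

    @property
    def result(self):
        correct, total = self._correct, self._total
        if self._comm is not None:
            # Combine per-rank counts so every rank sees the global
            # accuracy rather than the accuracy of its own shard
            correct = self._comm.reduce_sum(correct)
            total = self._comm.reduce_sum(total)
        return correct / total
```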

Other than that, it's all just passing the communicator around, plus a few places where the 'full' batch size is needed rather than the scaled-down one, e.g. for scaling in the EventProp compiler. I've also added a couple of additional examples (at some point I need to tidy the examples up a bit) which demonstrate how you need to change your code to run across multiple GPUs; mostly it's just splitting the dataset and turning off progress bars etc. on all but the first rank (a sketch of those changes follows below).
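
As a rough illustration of those script-level changes, again using the hypothetical `MPICommunicator`; the batch size, dataset and variable names are made up for the example:

```python
import numpy as np

comm = MPICommunicator()

# The compiler builds with a per-rank batch size, but places like
# EventProp's scaling still need the full (global) batch size
full_batch_size = 256
batch_size = full_batch_size // comm.num_ranks

# Toy stand-in for the training data: each rank takes one shard
dataset = np.arange(60000)
shard = np.array_split(dataset, comm.num_ranks)[comm.rank]

# Only show progress bars on the first rank to avoid duplicated output
verbose = (comm.rank == 0)
```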

@neworderofjamie added the enhancement label on Oct 12, 2023
@neworderofjamie added this to the mlGeNN 2.2 milestone on Oct 12, 2023
@tnowotny (Member) left a comment:

Wow - that looks surprisingly simple and elegant.
I was at first confused about where the checkpoints would go, but I see now that only rank 0 writes, so that's fine.

@neworderofjamie merged commit 2d785ab into master on Dec 7, 2023. 1 check passed.