FSDP PyTorch draft PR #823

davidtweedle · 2024-12-19T02:10:42Z

A draft PR proposing changes to upgrade the PyTorch workloads from DDP (distributed data parallel) to FSDP (fully sharded data parallel).

Workloads changed: cifar, mnist, criteo1tb, imagenet vit, imagenet resnet, librispeech deepspeech, librispeech conformer, ogbg, wmt, fastmri

Summary of changes to momentum (used as a simple test optimizer):

First, compute the weighted loss on each device.
Then call loss.backward(); the gradients of the losses are now all-reduced by a PyTorch communication hook.
Finally, display the correct loss.
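
The arithmetic behind these steps can be sketched as follows. This is a pure-Python simulation, not the PR's code. It assumes that "weighted loss" means each device's weighted sum of per-example losses divided by the global weight total, and that the cross-device reduction is a sum. The names `local_weighted_loss` and `all_reduce_sum` are hypothetical stand-ins, and no torch process group is involved.

```python
# Pure-Python simulation of the loss bookkeeping; no torch, no process group.
# All function names and numbers are hypothetical, not taken from the PR.

def local_weighted_loss(per_example_losses, weights, global_weight_sum):
    """Loss a single device would backpropagate: its weighted loss sum
    divided by the weight total over ALL devices (an assumption)."""
    local_sum = sum(l * w for l, w in zip(per_example_losses, weights))
    return local_sum / global_weight_sum

def all_reduce_sum(values):
    """Stand-in for a sum all-reduce across devices."""
    return sum(values)

# Two simulated devices; weight 0.0 marks a padded example.
device_losses = [[0.5, 1.5, 2.0], [1.0, 3.0, 0.0]]
device_weights = [[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]]

# Step 0: every device learns the global weight total.
global_weight_sum = all_reduce_sum([sum(w) for w in device_weights])

# Step 1: each device computes its own weighted loss.
local_losses = [
    local_weighted_loss(l, w, global_weight_sum)
    for l, w in zip(device_losses, device_weights)
]

# Step 2 (backward) is omitted here: with a sum reduction, the summed
# gradients of the local losses equal the gradient of the global mean loss.

# Step 3: the loss to display is the sum of the local losses.
displayed_loss = all_reduce_sum(local_losses)
print(displayed_loss)
```

Under these assumptions, the displayed value matches the weighted mean over all valid examples on both devices, even though the devices hold different numbers of valid examples.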

github-actions · 2024-12-19T02:10:56Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

davidtweedle added 13 commits November 24, 2024 15:31

Updated cifar workload, and momentum algorithm to use FSDP

37905b1

Removed import causing problems on kaggle

6eccc1d

OGBG updated to FSDP (to be tested still)

d31d79b

wmt workload testing for FSDP

6ccb1e8

added functools import to wmt

b786bfc

First update of FSDP for mnist

a2000bd

First update for FSDP criteo

49774dd

First FSDP update for fastmri

1a30851

First FSDP pytorch update for imagenet resnet

015734f

First FSDP pytorch update for imagenet vit

10a32d4

First update for FSDP pytorch librispeech conformer

11676d6

First update for FSDP pytorch librispeech deepspeech

dbfd233

Typo in FSDP definition for ogbg

40a932f
