
[Not for merge] Madam optimizer with OOM handling #8

Open · csukuangfj wants to merge 12 commits into master
Conversation

@csukuangfj (Collaborator) commented on Aug 14, 2021:

Put it here for discussion. Not ready for merge.

It contains:

  • the Madam optimizer from Dan
  • handling of OOM when using a larger max-duration (e.g., increasing it from 200 to 350)

tensorboard log: https://tensorboard.dev/experiment/WCQbgwK2T0OI9kCWjHPOSw/#scalars
tensorboard log: https://tensorboard.dev/experiment/BedA6yRKRyGpFB2wY709fQ/

@csukuangfj changed the title from "[Not for merge] Madam oom" to "[Not for merge] Madam optimizer with OOM handling" on Aug 14, 2021.
```python
            graph_compiler=graph_compiler,
            is_training=is_training,
        )
    except RuntimeError as ex:
```
@csukuangfj (Collaborator, Author):

This is where OOM is handled. I am testing it. Not sure whether it works or not.
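For readers following along, here is a minimal sketch of the pattern being tried here. The names `compute_loss` and `batch` are hypothetical placeholders; the actual training code in this PR wraps its own loss computation and batch handling:

```python
import torch


def train_one_batch(model, batch, compute_loss, optimizer):
    """Hypothetical wrapper: run the forward pass and skip the whole
    batch if a CUDA out-of-memory error is raised."""
    try:
        loss = compute_loss(
            model=model,
            batch=batch,
            is_training=True,
        )
    except RuntimeError as ex:
        if "out of memory" not in str(ex):
            raise
        # Drop any partially built gradients and cached blocks so the
        # next batch has a chance to fit.
        optimizer.zero_grad()
        torch.cuda.empty_cache()
        return None  # signal the caller to skip this batch

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```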

Collaborator:
Not sure if it's helpful -- there is a repo with some utilities for finding the optimal batch size in PyTorch, and it has some code to deal with OOM. Maybe there is something useful that can be borrowed: https://github.com/BlackHC/toma

Collaborator:

Hmm, after looking into the DDP implementation, it looks to me like the backward gradient reduction is not done in one pass after the backward is completed on individual machines, but during the .backward() of the model. So I think catching errors that happen during .backward() is not going to be possible, and likewise for errors that happen during the top-level model forward(), because DDP seems to have sync points there. Errors in the forward for the CTC or transformer decoder may be catchable, though.

Collaborator:

... but we can't do a new top-level model forward on new data. It might be possible to re-do a CTC or transformer forward with a subset of the minibatch, since its backward pass would be structurally similar to the one on the entire minibatch.
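As a concrete illustration of the "subset of the minibatch" idea (purely a sketch; `ctc_loss_fn` and the slicing over the batch dimension are hypothetical and ignore the supervision bookkeeping icefall actually needs):

```python
import torch


def ctc_loss_with_fallback(ctc_loss_fn, encoder_out, targets, keep_fraction=0.5):
    """Hypothetical sketch: compute the CTC loss on the full minibatch; on
    CUDA OOM, retry on the first keep_fraction of the utterances.  The
    already-computed encoder output is reused, so only the CTC part is
    recomputed, and its backward pass has the same structure as the
    full-batch one."""
    try:
        return ctc_loss_fn(encoder_out, targets)
    except RuntimeError as ex:
        if "out of memory" not in str(ex):
            raise
        torch.cuda.empty_cache()
        n = max(1, int(encoder_out.size(0) * keep_fraction))
        return ctc_loss_fn(encoder_out[:n], targets[:n])
```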

@csukuangfj (Collaborator, Author):

The current code is able to catch CUDA OOM exceptions during model.forward() when training with a single GPU, but it hangs when training with two GPUs using DDP. I am looking into where it hangs.

pytorch/pytorch#18853 (comment) says it is possible to catch CUDA OOM exceptions also for DDP training with a single GPU.

We are currently catching exceptions only for the forward pass, i.e., model.forward, get_tot_scores, and the attention decoder.

> So I think catching errors that happen during .backward() is not going to be possible,

Agreed.
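One way to avoid the hang (not what this PR currently does, just a sketch) is to make all DDP ranks agree on whether to skip a batch, e.g. by all-reducing an OOM flag, so that no rank is left waiting at a sync point. As discussed above, this cannot help if the OOM happens inside the DDP-wrapped forward or backward themselves:

```python
import torch
import torch.distributed as dist


def forward_agreeing_on_oom(compute_loss, device):
    """Hypothetical helper: run compute_loss(); if any DDP rank hits CUDA
    OOM, every rank learns about it and skips the batch together."""
    oom_flag = torch.zeros(1, device=device)
    loss = None
    try:
        loss = compute_loss()
    except RuntimeError as ex:
        if "out of memory" not in str(ex):
            raise
        oom_flag.fill_(1.0)
        torch.cuda.empty_cache()

    if dist.is_available() and dist.is_initialized():
        # Sum the flags across ranks; a non-zero result means some rank OOM'd.
        dist.all_reduce(oom_flag, op=dist.ReduceOp.SUM)

    if oom_flag.item() > 0:
        return None  # all ranks skip backward/step for this batch
    return loss
```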

```python
from icefall.bpe_graph_compiler import BpeCtcTrainingGraphCompiler
from icefall.checkpoint import load_checkpoint
from icefall.checkpoint import save_checkpoint as save_checkpoint_impl
from icefall.dataset.librispeech import LibriSpeechAsrDataModule
```
Collaborator:

I strongly suggest making this class local too (e.g. in a local data.py file) -- in the current repo layout, it will make it much easier to experiment with different types of data setups and augmentations.
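For illustration, the local module could be as small as this (the file name and class name below are hypothetical; one could also copy the class body in wholesale instead of subclassing):

```python
# data.py (hypothetical local module inside the recipe directory)
from icefall.dataset.librispeech import LibriSpeechAsrDataModule


class LocalLibriSpeechAsrDataModule(LibriSpeechAsrDataModule):
    """A per-recipe copy point: override dataloader or augmentation
    methods here instead of editing the shared icefall library."""
    pass
```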

```python
x = x.view(-1, self.size)
target = target.view(-1)
with torch.no_grad():
    true_dist = x.clone()
```
Collaborator:

Instead of x.clone(), torch.empty_like(x) would be more appropriate.
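A self-contained version of what the suggestion amounts to (the function and parameter names are illustrative; the class in the PR keeps its own self.size / self.smoothing attributes, and padding handling is omitted here):

```python
import torch


def smoothed_targets(x: torch.Tensor, target: torch.Tensor,
                     smoothing: float = 0.1) -> torch.Tensor:
    """Build the label-smoothed target distribution.  Every entry of the
    result is overwritten, so an uninitialized tensor is enough; there is
    no need to copy x's values with x.clone()."""
    num_classes = x.size(-1)
    with torch.no_grad():
        true_dist = torch.empty_like(x)
        true_dist.fill_(smoothing / (num_classes - 1))
        true_dist.scatter_(1, target.view(-1, 1), 1.0 - smoothing)
    return true_dist
```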

We have to disable batch norm layers. Otherwise, the process will hang indefinitely.

```python
# NOTE(fangjun): The process hangs when using DDP
# if we try to recover from CUDA OOM, so we disable
# batchnorm layer here.
# self.norm = nn.BatchNorm1d(channels)
```
@csukuangfj (Collaborator, Author):

After disabling batch norm, training with DDP can now recover from OOM in model.forward() as expected.

Collaborator:

Mm. Hopefully with the Madam optimizer the training will still be stable without the batchnorm; we'll have to see. Obviously we would have to compare the performance after this change.
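If removing the batchnorm does hurt, one possible alternative (not something this PR tries, just a sketch) is a normalization without cross-batch statistics, e.g. LayerNorm over the channel dimension:

```python
import torch
import torch.nn as nn


class ChannelwiseLayerNorm(nn.Module):
    """Hypothetical replacement for nn.BatchNorm1d(channels) on (N, C, T)
    input: LayerNorm over C, so there are no running batch statistics and
    nothing that needs to be synchronized across DDP ranks."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (N, C, T) -> (N, T, C), normalize over C, then back to (N, C, T).
        return self.norm(x.transpose(1, 2)).transpose(1, 2)
```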

@csukuangfj (Collaborator, Author):
I will port OOM handling to LF-MMI training as well.
