This is a workaround for the crash when using `apex.optimizers.FusedAdam` on A100s with 80GB. Currently we can only load about half the GPU memory; if I try to pack just a tad more than 40GB, this happens:
With `CUDA_LAUNCH_BLOCKING=1` I am getting:

There is about 35GB free out of 80GB when this happens (before `self.optimizer.step()`).
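For context, `CUDA_LAUNCH_BLOCKING=1` makes CUDA kernel launches synchronous, so the failing kernel is reported at its actual call site rather than at a later, unrelated point. A minimal sketch of enabling it from Python is below, under the assumption that it runs before torch initializes CUDA (exporting it in the shell before launching is the more common approach):

```python
import os

# Must be set before any CUDA context is created, i.e. before the first
# CUDA call made through torch in this process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (intentionally imported after setting the variable)
```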
We observed that if the model is just slightly smaller it all works, so somehow `multi_tensor_applier` tries to duplicate the first large param group and crashes. Splitting it into two halves seems to work around this issue. This workaround is courtesy of @samyam and @jeffra.
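A minimal sketch of that split, assuming the optimizer receives one large parameter group; the function names here are hypothetical and the real change may carve the group differently, but the idea is to break the first large group into two halves before handing it to `apex.optimizers.FusedAdam`, so `multi_tensor_applier` never has to process the whole group at once:

```python
from apex.optimizers import FusedAdam


def split_group_in_half(group):
    """Return two param groups that together cover `group`'s parameters."""
    params = list(group["params"])
    mid = len(params) // 2
    first = {**group, "params": params[:mid]}
    second = {**group, "params": params[mid:]}
    return [first, second]


def build_fused_adam(model, lr=1e-4, weight_decay=0.01):
    # Hypothetical single large group; the real training script may already
    # split parameters further (e.g. by weight decay or layer norm).
    big_group = {"params": list(model.parameters()), "weight_decay": weight_decay}
    return FusedAdam(split_group_in_half(big_group), lr=lr)
```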
Reading other related tickets, I have tried all the proposals, including `set_device`, and none helped (or we were already doing them). See: NVIDIA/apex#319
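For reference, the `set_device` proposal from the linked apex issue amounts to pinning each process to its GPU before building the optimizer; a minimal sketch (with a hypothetical `local_rank`) is below, though as noted it did not help here:

```python
import torch

# `local_rank` is hypothetical here; it normally comes from the launcher,
# e.g. the LOCAL_RANK environment variable set by torch.distributed.
local_rank = 0
torch.cuda.set_device(local_rank)
```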
There was also a coredump; the backtrace is attached. It appears to be failing to free some resource and hitting an illegal memory access there; it looks like device contexts are getting crossed.

log-core.txt