WA for Torch-compile-Z3-act-apt accuracy issue from the Pytorch repo #5590
Conversation
@NirSonnenschein Thank you for the PR. It seems old versions of PyTorch don't have …
We have encountered an accuracy issue when running torch.compile + ZeRO-3 + activation checkpointing: specifically, some grads get zeroed (without torch.compile, this issue is not encountered). This issue was also reproduced by Umesh Chand from the DS team. We found that in the PyTorch repo, torch.compile has been specifically disabled for the checkpoint function using the decorator @torch._disable_dynamo().

Reference to the WA in the PyTorch repo: https://github.com/pytorch/pytorch/blob/ec8b254ef49b4a057cf89c2ae64520fb7b423a3e/torch/utils/checkpoint.py#L324. This indicates that there is some issue with torch.compile and checkpointing (not necessarily DS related).

Given that the checkpointing function in DeepSpeed is based on the PyTorch function, we propose to adopt this WA to ensure correct behavior (it can be removed later if the underlying issue is fixed).

Note: this shouldn't impact non-torch.compile cases.
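For illustration, a minimal sketch of what this decorator-based WA looks like, assuming a recent PyTorch that provides torch._disable_dynamo; the wrapper name and body below are illustrative, not the actual DeepSpeed diff:

```python
import torch
import torch.utils.checkpoint

# Sketch only: mirrors the upstream decorator usage in
# torch/utils/checkpoint.py. `checkpoint_wrapper` is a hypothetical
# name, not the actual DeepSpeed function.
@torch._disable_dynamo()
def checkpoint_wrapper(function, *args, **kwargs):
    # Dynamo skips tracing this function, so activation checkpointing
    # runs eagerly even under torch.compile, avoiding the zeroed grads.
    return torch.utils.checkpoint.checkpoint(
        function, *args, use_reentrant=True, **kwargs
    )
```

The decorator routes compiled callers back to eager execution at this boundary (internally it applies torch._dynamo.disable to the wrapped function), which is also why it has no effect on non-torch.compile runs.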
Force-pushed from e068adf to bc8f511
Thanks for the comment; I've uploaded a new version which should fix this.
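Presumably the new version guards against older PyTorch releases that lack torch._disable_dynamo (the concern raised above). A hedged sketch of one way such a guard can look, assuming a no-op fallback is acceptable; the name is illustrative, not the actual PR code:

```python
import torch

# Assumption: older PyTorch has no torch._disable_dynamo, so fall back
# to a no-op decorator there. `_maybe_disable_dynamo` is a hypothetical
# name, not the actual DeepSpeed identifier.
if hasattr(torch, "_disable_dynamo"):
    _maybe_disable_dynamo = torch._disable_dynamo
else:
    def _maybe_disable_dynamo(fn=None, recursive=True):
        # No dynamo to disable on this PyTorch; leave functions unchanged.
        if fn is None:
            return lambda f: f  # supports the @_maybe_disable_dynamo() form
        return fn
```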
Hi @tohtana, …
Hi @mrwyattii, …
Thanks @tjruwase, …
Co-authored-by: Olatunji Ruwase <[email protected]>