Separate ZeRO3 InflightParamRegistry for train and eval #3884

Merged: 9 commits, Jul 6, 2023

HeyangQin (Contributor) commented Jul 5, 2023

After PR 3462, the InflightParamRegistry is bound to the model, so training and eval share the same InflightParamRegistry. This leaves leftover parameters in flight, which causes the in-flight parameter issues. With this fix and #3883, step3 runs well, and the change does not cause a regression on the Lightning side (which is what PR 3462 tries to fix). Potentially related issues: #3068, #3156, #3735, microsoft/DeepSpeedExamples#337, microsoft/DeepSpeedExamples#591.
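
To illustrate the idea, here is a minimal toy sketch of keeping one in-flight registry per mode. It is not the DeepSpeed implementation: `ToyEngine`, `prefetch`, `wait`, and the `separate_registries` flag are hypothetical names made up for this example, and the registry is a plain dict standing in for `InflightParamRegistry`.

```python
class ToyEngine:
    """Hypothetical stand-in for a model wrapper; not DeepSpeed's API."""

    def __init__(self, separate_registries=True):
        train_registry = {}
        # With separate_registries=False, both modes share one dict,
        # which mimics the behavior described above for PR 3462.
        eval_registry = {} if separate_registries else train_registry
        self._registries = {True: train_registry, False: eval_registry}
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)

    @property
    def inflight(self):
        # Pick the registry that belongs to the current mode.
        return self._registries[self.training]

    def prefetch(self, name):
        # Assumption: a parameter must not be fetched twice while in flight.
        if name in self.inflight:
            raise RuntimeError(f"{name} is already in flight")
        self.inflight[name] = "pending"

    def wait(self, name):
        # Mark the fetch as complete by dropping it from the registry.
        self.inflight.pop(name)


engine = ToyEngine(separate_registries=True)
engine.prefetch("w1")   # left in flight during training
engine.eval()
engine.prefetch("w1")   # fine: eval has its own registry
```

With `separate_registries=False`, the second `prefetch("w1")` raises because the leftover training entry is still visible in eval mode.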

@HeyangQin HeyangQin enabled auto-merge (squash) July 5, 2023 20:43

shyustc commented Jul 8, 2023

After updating to the latest version, I found that the previous issue has been fixed. However, I still hit the 'inflight' error at different locations later in the code. See #3735.
