Separate ZeRO3 InflightParamRegistry for train and eval #3884
After PR 3462, the InflightParamRegistry is bound to the model, so training and eval share the same registry. As a result, leftover in-flight parameters remain in the shared registry and cause in-flight parameter errors. With this fix and #3883, step3 runs well, and the change does not cause a regression on the Lightning side (which is what PR 3462 tries to fix).

Potentially related issues: #3068, #3156, #3735, microsoft/DeepSpeedExamples#337, microsoft/DeepSpeedExamples#591.
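
To illustrate the idea described above, here is a minimal, self-contained sketch of keeping one registry per mode. It is not DeepSpeed's code: `ToyInflightRegistry`, `ToyModel`, and the `inflight` property are hypothetical names used only for illustration, and the registry is modelled as a plain dict from parameter name to a placeholder handle.

```python
from typing import Dict


class ToyInflightRegistry(dict):
    """Hypothetical stand-in: maps a parameter name to a pending fetch handle."""


class ToyModel:
    """Hypothetical model wrapper that keeps one registry per mode."""

    def __init__(self) -> None:
        self.training = True
        # Separate registries keyed by the training flag, so entries left
        # behind in one mode are never visible from the other mode.
        self._registries: Dict[bool, ToyInflightRegistry] = {
            True: ToyInflightRegistry(),
            False: ToyInflightRegistry(),
        }

    def train(self, mode: bool = True) -> "ToyModel":
        self.training = mode
        return self

    def eval(self) -> "ToyModel":
        return self.train(False)

    @property
    def inflight(self) -> ToyInflightRegistry:
        # Select the registry that belongs to the current mode.
        return self._registries[self.training]


model = ToyModel()

# Training leaves a parameter in flight.
model.train()
model.inflight["layer0.weight"] = "pending-handle"

# Switching to eval selects a different, empty registry.
model.eval()
eval_sees_leftover = "layer0.weight" in model.inflight

# Switching back to training finds the original entry again.
model.train()
train_still_has_it = "layer0.weight" in model.inflight
```

With a single shared registry, the eval pass in this sketch would have seen the entry left behind by training; with one registry per mode it does not.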