backwards compatibility w. v020 ckpts, fix issue with zero-1 ckpts #543

Merged (1 commit), Nov 19, 2020

@jeffra (Collaborator) commented Nov 19, 2020

  1. Checkpoints from versions <= v0.2.0 serialized a class from deepspeed.pt.loss_scaler, which raised an exception when loaded with v0.3+ code. Added a patch that redirects deepspeed.pt.loss_scaler to its new home.
  2. Discovered an issue with the latest zero-1 checkpoint changes, related to elastic checkpoint loading: max-elems-per-comm did not align between src/dst partitions. We don't actually have to keep the 'legacy' max-elems-per-comm value at all.
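
For readers unfamiliar with the kind of redirect described in item 1, here is a minimal sketch of one common way to do it: aliasing the old module path in `sys.modules` before unpickling. This is not the code from this PR. The module names `legacy_loss_scaler` and `current_loss_scaler` and the toy `LossScaler` class are hypothetical stand-ins, and flat module names are used only to keep the demo self-contained.

```python
import pickle
import sys
import types

OLD_NAME = "legacy_loss_scaler"    # hypothetical stand-in for the old module path
NEW_NAME = "current_loss_scaler"   # hypothetical stand-in for the new module path


class LossScaler:
    """Toy class standing in for the object serialized in old checkpoints."""

    def __init__(self, scale):
        self.scale = scale


def register(name):
    """Expose LossScaler under module `name`, as if it were defined there."""
    module = types.ModuleType(name)
    module.LossScaler = LossScaler
    LossScaler.__module__ = name
    sys.modules[name] = module
    return module


# 1. "Old release": the class lives at the old path and gets pickled.
register(OLD_NAME)
old_checkpoint = pickle.dumps(LossScaler(scale=1024.0))

# 2. "New release": the module has moved, so the old path no longer exists.
del sys.modules[OLD_NAME]
new_module = register(NEW_NAME)

try:
    pickle.loads(old_checkpoint)
    load_failed = False
except ModuleNotFoundError:
    # The pickle stream still names the old module, which cannot be imported.
    load_failed = True

# 3. The redirect: alias the old module path to the new module, then load.
sys.modules[OLD_NAME] = new_module
restored = pickle.loads(old_checkpoint)
```

The alias works because pickle stores only the module path and class name, then resolves them through the import system at load time. Once the old path points at the new module, old checkpoints load without being rewritten.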

@jeffra merged commit dce054d into master on Nov 19, 2020
@jeffra deleted the jeffra/zero-1-update branch on November 19, 2020 21:48