Reconstruction of fp32 weights on stage3 doesn't work #1009
Here are the steps to reproduce my error with the model reconstruction:
Could you please save one checkpoint and share it? On my dev machine I won't be able to train a 45GB model. Alternatively, can you reproduce the same problem with a much smaller model, say t5-small? Or do you get it only with t5-11b?
Yes, I can reproduce it on t5-small.
Here is a checkpoint of the siamese model based on t5-small:
This is odd as it works for me:
This was generated by my run. Unrelated, but I had to downgrade transformers. If I download your checkpoint, indeed it fails:
Any ideas on what might be different in our envs? Do you somehow copy the checkpoint via some OS/tool that possibly mangles the checkpoint? Can you resume from those checkpoints? I'm on pytorch-nightly, deepspeed master, and downgraded to transformers==4.5.1 to be able to run your script.
Oddly enough, your t5-small checkpoint looks very different from mine: Mine:
Yours:
The optim states are slightly different and the model states are very different (though we don't care about the model states for this purpose, it's an indicator that we use different envs). I used your scripts unmodified, so we should be getting the same results. Could you update your deepspeed and transformers to the latest official versions and try again?
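For reference, here is a minimal sketch of how such a listing can be produced, in case you want to diff the two checkpoints yourself. The paths are placeholders for the per-rank files inside the checkpoint folder, and loading may need deepspeed installed if the files pickle deepspeed objects:

```python
# Rough sketch for eyeballing two DeepSpeed checkpoint files.
# Adjust the paths to whatever `ls` shows in your checkpoint folder
# (e.g. *_model_states.pt / *_optim_states.pt).
import torch

def describe(path):
    ckpt = torch.load(path, map_location="cpu")
    print(f"== {path} ==")
    for key, value in ckpt.items():
        if torch.is_tensor(value):
            print(f"  {key}: tensor {tuple(value.shape)}")
        elif isinstance(value, dict):
            print(f"  {key}: dict with {len(value)} entries")
        elif isinstance(value, (list, tuple)):
            print(f"  {key}: {type(value).__name__} of length {len(value)}")
        else:
            print(f"  {key}: {type(value).__name__}")

describe("mine/global_step10/mp_rank_00_model_states.pt")   # placeholder path
describe("yours/global_step10/mp_rank_00_model_states.pt")  # placeholder path
```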
I have reinstalled deepspeed and transformers to the latest versions from PyPI. I also installed everything in a fresh Python environment with pytorch-nightly. The sizes of the weights files are really different from yours.
Can this problem be caused by CUDA?
After I installed all the packages in the fresh env I got warnings while training, although training ends successfully:
T5SiameseModelOutput is my custom output class. Can it cause problems? I don't even know what went wrong.
I have found an interesting behavior. This checkpoint can be successfully loaded via
The last one is a buglet in deepspeed==0.3.15 - they forgot to remove a debug print, so you can upgrade to deepspeed master and it will go away. I think the slight difference in file size is OK, since it has all kinds of keys stored. I will try to save just the data entries we want from your checkpoint to compare. But why your model states are shorter - this is strange. Basically your model states include the placeholders for the model weights and not the model itself. In ZeRO-3 the model is partitioned over GPUs and deepspeed uses a placeholder tensor of Size([1]) per parameter and reconstructs the real weights on the fly just before each forward. Let me think about this strange discrepancy some more and I will get back to you.
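A minimal sketch of how to check for those placeholders in a saved model-states file; the "module" key is an assumption about the file layout and may differ across DeepSpeed versions:

```python
# Sketch: does a saved model-states file contain full weights or ZeRO-3
# placeholders? Under ZeRO-3 each parameter is replaced on save by a tensor
# with a single element, so almost every entry has numel() == 1.
import torch

ckpt = torch.load("global_step10/mp_rank_00_model_states.pt", map_location="cpu")
# "module" is assumed to be where the model state dict lives in this file;
# adjust the key if your checkpoint is laid out differently.
state_dict = ckpt["module"]

tensors = [t for t in state_dict.values() if torch.is_tensor(t)]
placeholders = sum(1 for t in tensors if t.numel() == 1)
print(f"{placeholders}/{len(tensors)} tensors look like ZeRO-3 placeholders")
```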
Oh, I completely missed the crucial point - I get a zero2 checkpoint and you sent me a zero3 checkpoint. Thank you, @tjruwase, for noticing that! So we aren't testing the same thing. Ah yes, your
Yes. OK, now I changed your script to use a ZeRO-3 config and I can reproduce the problem! I will be looking at it later today.
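For reference, the structural difference between the two runs boils down to the ZeRO stage in the DeepSpeed config. A minimal sketch of the relevant block, written as a Python dict with everything else omitted (the optional gather flag is version-dependent and only included as a comment):

```python
# ds_config sketch: the zero2 vs zero3 difference is this one field.
# The rest of the config (fp16, optimizer, scheduler) can stay the same.
ds_config_zero3 = {
    "zero_optimization": {
        "stage": 3,  # my earlier run used 2, the shared checkpoint used 3
        # Optional and version-dependent: gather a consolidated fp16 model at save time.
        # "stage3_gather_fp16_weights_on_model_save": True,
    },
    "fp16": {"enabled": True},
}
```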
Oh, that's my fault, I hadn't staged the changes to the *.sh scripts in my repo...
That's no problem. The zero2 checkpoint looks significantly different from the zero3 one. We will sort it out.
OK, I see it stores the weights differently under your model. Could you please give this version of the script a chance? https://gist.github.com/stas00/479a6ae2fac070e866440d8a5211f5cd
Please ignore all the debug noise, just watch for the final:
It should be around 270MB for t5-small in your double model. I did only a quick validation, so I'm not 100% sure it reconstructs it correctly. Meanwhile, let me do the same against the staple example script
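If it helps, here is a minimal sketch for sanity-checking that number, assuming the gist writes the reconstructed weights as a plain state dict via torch.save; the file name is a placeholder:

```python
# Sketch: estimate the expected fp32 file size from the reconstructed state
# dict and compare it with what the script actually wrote to disk.
import os
import torch

path = "pytorch_model_fp32.bin"  # placeholder: whatever the script saved
state_dict = torch.load(path, map_location="cpu")
n_params = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
expected_mb = n_params * 4 / 2**20  # fp32 = 4 bytes per parameter
actual_mb = os.path.getsize(path) / 2**20
print(f"{n_params/1e6:.1f}M params -> ~{expected_mb:.0f}MB expected, {actual_mb:.0f}MB on disk")
```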
In #892 @stas00 proposed a new script which can consolidate fp32 weights from an fp16 model checkpoint of stage 3 training.
Unfortunately I have found that the t5-11b model can't be consolidated due to some error:
Maybe @stas00 could say what the problem is and how it can be fixed?
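For readers landing on this issue later: current DeepSpeed releases ship this consolidation both as a zero_to_fp32.py script written into the checkpoint folder and as utility functions. A minimal sketch of the latter, with placeholder paths; the exact API may not exist in the older versions discussed above:

```python
# Sketch using the DeepSpeed utilities that grew out of the script discussed here.
from deepspeed.utils.zero_to_fp32 import (
    get_fp32_state_dict_from_zero_checkpoint,
    load_state_dict_from_zero_checkpoint,
)

checkpoint_dir = "output_dir/checkpoint-500"  # placeholder: folder containing global_step*/

# Option 1: build a consolidated fp32 state dict on CPU (needs enough RAM for the full model).
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# Option 2: load the consolidated fp32 weights straight into an existing model instance.
# model = load_state_dict_from_zero_checkpoint(model, checkpoint_dir)
```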