Is this loss curve normal? #468
Comments
That is a crazy high learning rate, which could be the issue. Also check your data, and check the validation loss for overfitting.
@banyan-god, did you try to match the total batch size of ~0.5M tokens? batch_size * num_of_gpus * grad_accum > 500. Your current total batch size is 40% of the original total batch size, which might matter given the stochastic nature of the training. PS: my single-GPU training is too slow for that. :(
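For context, here is a minimal sketch of the "total batch size" arithmetic being discussed, using the nanoGPT repo defaults (these numbers are assumptions taken from the repo's GPT-2 config, not values confirmed in this thread):

```python
# Sketch: tokens processed per optimizer step in nanoGPT-style training.
# Numbers below are the assumed repo defaults, not this user's exact config.
block_size = 1024        # tokens per sequence (context length)
batch_size = 12          # micro-batch size per GPU
grad_accum_per_gpu = 5   # gradient accumulation steps per GPU
num_gpus = 8

tokens_per_iter = batch_size * num_gpus * grad_accum_per_gpu * block_size
print(f"{tokens_per_iter:,} tokens per optimizer step")  # 491,520 ≈ 0.5M
```

Dropping to 2 GPUs without raising gradient accumulation cuts this total proportionally, which is the concern raised above.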
Yes, I did. I used 2 GPUs instead of 8.
@banyan-god, you need to keep gradient_accumulation_steps at 40, instead of 10, to maintain the "total batch size ~0.5M" cited in the original comment from karpathy. My GPU was fried after 1w of training and I had to stop, but with a total batch size of ~0.5M I was able to break my previous lowest loss. @bigsnarfdude, your comment is not relevant, as those were the losses for loading GPT-2 from OpenAI, if I read the doc correctly.
I've trained 124M, medium, and large using both the OpenWebText and RedPajama datasets. Your iterations should be around 100k, and you will reach the same training and val loss as the GPT-2 loaded from weights. For example, for 124M with grad_acc=5, batch_size=12, and the standard LR provided in the repo, you get a model pretrained from scratch that is very similar to the posted chart.
@yalding so I started another job today with ~572.06M parameters and gradient accumulation of 40, as you suggested. Will report back on progress, or if it explodes.
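For reference, a config sketch of what keeping the ~0.5M-token target on 2 GPUs might look like (nanoGPT config files are plain Python assignments; the values below are illustrative and assume a recent train.py that divides gradient_accumulation_steps by the DDP world size, not the exact config used above):

```python
# Hypothetical 2-GPU override keeping ~0.5M tokens per optimizer step.
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 40   # total across GPUs; with 2 GPUs this is
                                   # 20 micro-steps per GPU (vs. the 5 * 8 default)
# 12 * 1024 * 40 = 491,520 tokens per step, i.e. the ~0.5M target
```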
@banyan-god you are setting batch_size to 5. This again will reduce the total batch size to 5 * 50 * 1024 ≈ 0.25M, which is half of the recommended 0.5M total batch size...
@yalding OK, I rolled back all the changes to the hyperparameters and am just running with the following:
@yalding unfortunately that didn't work either.
Did you validate the config logged in wandb? My last run config: always_save_checkpoint: false. Final train/val loss until it crashed my GPU: train/loss: 3.237154483795166.
I am also wondering if it is possibly something to do with the PyTorch version or OpenWebText.
I'm encountering the same issue. @banyan-god, did you eventually figure out a way to resolve this?
@seanxwzhang I want to say it is a combination of the tokenizer and the dataset. When I switched over to the GPT-4 tokenizer, the problem disappeared.
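For anyone wanting to try the same thing, here is a sketch of the kind of change involved in the data-prep script, assuming the "GPT-4 tokenizer" means tiktoken's cl100k_base encoding (the exact change made above is not shown in the thread):

```python
import tiktoken
import numpy as np

# data/openwebtext/prepare.py uses tiktoken.get_encoding("gpt2") by default;
# swapping to the GPT-4 tokenizer would look roughly like this.
enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 tokenizer
print(enc.n_vocab)                           # ~100k tokens, vs. 50257 for GPT-2

ids = enc.encode_ordinary("hello world")     # encode without special tokens
arr = np.array(ids, dtype=np.uint32)         # ids no longer fit in uint16
```

Note that the model's vocab_size would also need to grow to match, and nanoGPT's prepare script stores token ids as uint16, which only works for vocabularies under 65,536 tokens.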
Interesting; in my case it was fixed by using fp16 instead of bf16. I'm surprised that the tokenizer can have an effect on what looks like a numerical issue (or perhaps it isn't one).
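For completeness, nanoGPT exposes this as a dtype config value; the snippet below is a generic sketch of fp16 mixed precision with PyTorch AMP (not the repo's exact training loop), showing the main mechanical difference from bf16:

```python
import torch

# Minimal sketch of fp16 mixed precision with PyTorch AMP.
# fp16 has a much smaller exponent range than bf16, so a GradScaler is used to
# avoid underflowing gradients; bf16 typically runs without one.
dtype = torch.float16
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == torch.float16))

def training_step(model, optimizer, x, y):
    with torch.autocast(device_type="cuda", dtype=dtype):
        logits, loss = model(x, y)   # assumes a GPT-style model returning (logits, loss)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```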
@seanxwzhang @banyan-god Were you able to converge your training to 2.9 on GPT-2 Small? Did the loss go to NaN or explode back up? I am encountering the same issue and have tried both of your solutions (fp16 and the GPT-4 tokenizer). If possible, please let me know what versions of torch you are using.
I was using torch 2.3.0.
I am running on 2x 4090, and updated the GPU count from 8 to 2 in the gradient_accumulation_steps setting.