Is this loss curve normal? #468

Open · banyan-god opened this issue Mar 27, 2024 · 20 comments

banyan-god commented Mar 27, 2024

[W&B loss chart, 3/27/2024]
I am running on 2x 4090s; I updated the GPU count in gradient_accumulation_steps from 8 to 2:

$ more train_gpt2.py
# config for training GPT-2 (124M) down to very nice loss of ~2.85 on 1 node of 8X A100 40GB
# launch as the following (e.g. in a screen session) and wait ~5 days:
# $ torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py

wandb_log = True
wandb_project = 'owt'
wandb_run_name='gpt2-124M'

# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 20
block_size = 1024
gradient_accumulation_steps = 5 * 2

# this makes total number of tokens be 300B
max_iters = 600000
lr_decay_iters = 600000

# eval stuff
eval_interval = 1000
eval_iters = 200
log_interval = 10

# weight decay
weight_decay = 1e-1
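
For reference, a quick check of what this config works out to per optimizer step (a sketch, assuming nanoGPT's tokens_per_iter of batch_size * block_size * gradient_accumulation_steps, where the configured accumulation value is the total across GPUs and train.py divides it by the DDP world size):

# sketch: tokens per optimizer step with the config above
batch_size = 20
block_size = 1024
gradient_accumulation_steps = 5 * 2  # 10

tokens_per_iter = batch_size * block_size * gradient_accumulation_steps
print(f"{tokens_per_iter:,} tokens/iter")          # 204,800
print(f"{12 * 1024 * 5 * 8:,} reference target")   # 491,520 (~0.5M)

So this run sees roughly 0.2M tokens per step instead of the ~0.5M the stock config targets.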
@VatsaDev

That is a crazy high learning rate; it could be the issue. Also check your data, and check the val loss for overfitting.

@banyan-god

Is it though? That was the value in train.py. Either way, I tried a few runs but no luck:
[W&B loss charts, 4/3/2024]

yalding commented Apr 8, 2024

I have the same issue with a single 4060 Ti (16 GB) GPU.

yalding commented Apr 10, 2024

@banyan-god, did you try to match the total batch size of ~0.5M tokens? batch_size * num_gpus * grad_accum should come out to roughly 480 (× 1024 block size ≈ 491,520 tokens).

Your current total batch size is about 40% of the original, which might hurt given the stochastic nature of the training.

PS: my single-GPU training is too slow for that. :(
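
If it helps, here is a small sketch of picking gradient_accumulation_steps for a given GPU count so the product stays near the ~0.5M-token target (this assumes train.py's convention that the configured value is the total across ranks and must be divisible by the number of GPUs):

# hypothetical helper, not part of the repo
def grad_accum_for_target(batch_size, block_size, num_gpus, target_tokens=491_520):
    steps = round(target_tokens / (batch_size * block_size))
    # round up to a multiple of num_gpus so DDP can split it evenly
    return ((steps + num_gpus - 1) // num_gpus) * num_gpus

print(grad_accum_for_target(12, 1024, num_gpus=2))  # 40
print(grad_accum_for_target(20, 1024, num_gpus=2))  # 24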

@bigsnarfdude

These are the expected losses from the README.md:

| model | params | train loss | val loss |
| ------| ------ | ---------- | -------- |
| gpt2 | 124M         | 3.11  | 3.12     |
| gpt2-medium | 350M  | 2.85  | 2.84     |
| gpt2-large | 774M   | 2.66  | 2.67     |
| gpt2-xl | 1558M     | 2.56  | 2.54     |

banyan-god commented Apr 14, 2024

> @banyan-god, did you try to match the total batch size of ~0.5M tokens? batch_size * num_gpus * grad_accum should come out to roughly 480 (× 1024 block size ≈ 491,520 tokens).
>
> Your current total batch size is about 40% of the original, which might hurt given the stochastic nature of the training.
>
> PS: my single-GPU training is too slow for that. :(

Yes, I did. I used 2 for the GPU count instead of 8:

# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 16
block_size = 1024
gradient_accumulation_steps = 5 * 2

@yalding


yalding commented Apr 15, 2024

@banyan-god, you need to keep gradient_accumulation_steps at 40, not 10, to maintain the "total batch size ~0.5M" cited in the original comment from karpathy.

My GPU was fried after a week of training and I had to stop, but with a total batch size of ~0.5M I was able to beat my previous lowest loss.

@bigsnarfdude, your comment is not relevant, as those are the losses for GPT-2 loaded from the OpenAI weights, if I read the doc correctly.

@bigsnarfdude

I've trained 124M, medium, and large using both the OpenWebText and RedPajama datasets. Your iterations should be around 100k, and you will reach the same training and val loss as the GPT-2 loaded from weights. For example, 124M with grad_acc=5, batch_size=12, and the standard LR provided in the repo gives you a from-scratch pretrained model very similar to the posted chart.
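
Rough scale of that recipe, for anyone comparing (a sketch; the ~9B-token size of the prepared OpenWebText train split is my assumption based on the numbers printed by data/openwebtext/prepare.py):

# tokens seen after ~100k iterations at the stock ~0.5M tokens/step
tokens_per_iter = 12 * 1024 * 5 * 8       # 491,520
total_tokens = tokens_per_iter * 100_000  # ~49.2B
owt_train_tokens = 9.0e9                  # assumed size of train.bin
print(f"{total_tokens/1e9:.1f}B tokens, ~{total_tokens/owt_train_tokens:.1f} passes over OWT")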

banyan-god commented Apr 16, 2024

@yalding, so I started another job today with ~572.06M parameters and gradient accumulation of 40, as you suggested. I will report back on progress, or if it explodes.

always_save_checkpoint:true
backend:"nccl"
batch_size:5
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
log_interval:10
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:1,600
n_head:16
n_layer:16
out_dir:"out"
wandb_log:true
wandb_project:"owt"
wandb_run_name:"gpt2"
warmup_iters:2,000
weight_decay:0.1
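
As a sanity check on the model size, here is where the ~572.06M figure comes from (a sketch assuming the padded GPT-2 vocab of 50304, bias=False, tied embeddings, and nanoGPT's convention of reporting non-embedding parameters, i.e. excluding the position embedding):

# back-of-the-envelope parameter count for n_layer=16, n_head=16, n_embd=1600
n_layer, n_embd, vocab, block = 16, 1600, 50304, 1024

per_block = (
    3 * n_embd * n_embd          # attn.c_attn (q, k, v projections)
    + n_embd * n_embd            # attn.c_proj
    + 2 * (4 * n_embd) * n_embd  # mlp.c_fc + mlp.c_proj
    + 2 * n_embd                 # ln_1 + ln_2 weights (no bias)
)
total = vocab * n_embd + block * n_embd + n_layer * per_block + n_embd  # + ln_f
print(f"{(total - block * n_embd) / 1e6:.2f}M")  # ~572.06M (wpe excluded)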

yalding commented Apr 16, 2024

@banyan-god you are setting batch_size to 5. This again reduces the total batch size, to 5 * 40 * 1024 ≈ 0.2M tokens, which is less than half of the recommended ~0.5M total batch size...

@banyan-god

@yalding, OK, I rolled back all the hyperparameter changes and am just running with the following:
torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2.py

always_save_checkpoint:true
backend:"nccl"
batch_size:12
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006

@banyan-god

@yalding, unfortunately that didn't work either:
[Screenshot of loss curve, 2024-04-17]

yalding commented Apr 17, 2024

Did you validate the config logged in wandb? My last run config:

always_save_checkpoint:false
backend:"nccl"
batch_size:11
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:100
eval_iters:50
eval_only:false
grad_clip:1
gradient_accumulation_steps:50
init_from:"resume"
learning_rate:0.0006
log_interval:1
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:768
n_head:12
n_layer:12
out_dir:"out"
wandb_log:true
wandb_project:"owt"
wandb_run_name:"gpt2-124M-original-seed3000-bs550-resume"
warmup_iters:2,000
weight_decay:0.1

The final train/val loss before it crashed my GPU:

train/loss:3.237154483795166
val/loss:3.248462915420532

banyan-god commented Apr 17, 2024

@yalding

always_save_checkpoint:true
backend:"nccl"
batch_size:12
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
log_interval:10
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:768
n_head:12
n_layer:12
out_dir:"out"
wandb_log:true
wandb_project:"sifra"
wandb_run_name:"sifra-124M"
warmup_iters:2,000
weight_decay:0.1

@banyan-god

I am also wondering if it is possibly something to do with the PyTorch version or the OpenWebText data.
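
One cheap way to rule the data out (a sketch, assuming the default bins written by data/openwebtext/prepare.py, which stores GPT-2 token ids as uint16):

import numpy as np

# sanity-check the prepared OpenWebText bin: all token ids should be
# inside the GPT-2 vocab (0..50256)
data = np.memmap("data/openwebtext/train.bin", dtype=np.uint16, mode="r")
print(f"{len(data):,} tokens")
sample = data[:10_000_000]                # a slice is enough for a quick check
print("max token id in sample:", int(sample.max()))  # expect <= 50256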

@seanxwzhang

I'm encountering the same issue. @banyan-god, did you eventually figure out a way to resolve this?

@banyan-god

@seanxwzhang I want to say it is a combination of the tokenizer and the dataset. When I switched over to the GPT-4 tokenizer, the problem disappeared.
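
For anyone trying the same switch: if that means tiktoken's cl100k_base, its ~100k vocab no longer fits the uint16 bins, so more than the encoder has to change. A sketch of the pieces involved, assuming the stock data/openwebtext/prepare.py and train.py layout (double-check the details against your checkout):

import pickle
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # stock prepare.py uses "gpt2"
print(enc.n_vocab)                            # ~100k, exceeds uint16's 65535

bin_dtype = np.uint32   # prepare.py writes np.uint16 for the 50k GPT-2 vocab;
                        # train.py's get_batch memmaps with np.uint16 too and
                        # needs the same change

# train.py falls back to vocab_size=50304 unless data/<dataset>/meta.pkl
# provides one, so record the new vocab size there:
with open("data/openwebtext/meta.pkl", "wb") as f:
    pickle.dump({"vocab_size": enc.n_vocab}, f)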

@seanxwzhang

Interesting. In my case it was fixed by using fp16 instead of bf16. I'm surprised that the tokenizer can have an effect on what looks like a numerical issue (or perhaps it isn't one).
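
For reference, the bf16 → fp16 switch is just the dtype config value; train.py only enables the GradScaler when dtype is 'float16'. A hedged example of how to set it (the command line mirrors the configurator overrides used elsewhere in this thread):

# in the config file:
dtype = 'float16'   # the default picks bfloat16 when the GPU supports it
# or as a command-line override:
#   torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2.py --dtype=float16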

@mattgorb

@seanxwzhang @banyan-god Were you able to converge your training to ~2.9 on GPT-2 small? Did the loss go to NaN or explode back up? I am encountering the same issue and have tried both of your solutions (fp16 and the GPT-4 tokenizer).

If possible, please let me know which versions of torch you are using.

@seanxwzhang

> @seanxwzhang @banyan-god Were you able to converge your training to ~2.9 on GPT-2 small? Did the loss go to NaN or explode back up? I am encountering the same issue and have tried both of your solutions (fp16 and the GPT-4 tokenizer).
>
> If possible, please let me know which versions of torch you are using.

I was using torch 2.3.0.
