Is this loss curve normal? #468

Open · banyan-god opened this issue Mar 27, 2024 · 20 comments

banyan-god commented Mar 27, 2024

[W&B loss chart, 3/27/2024]
I am running on 2x 4090s; I updated the GPU count in gradient_accumulation_steps from 8 to 2:

$ more train_gpt2.py
# config for training GPT-2 (124M) down to very nice loss of ~2.85 on 1 node of 8X A100 40GB
# launch as the following (e.g. in a screen session) and wait ~5 days:
# $ torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py

wandb_log = True
wandb_project = 'owt'
wandb_run_name='gpt2-124M'

# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 20
block_size = 1024
gradient_accumulation_steps = 5 * 2

# this makes total number of tokens be 300B
max_iters = 600000
lr_decay_iters = 600000

# eval stuff
eval_interval = 1000
eval_iters = 200
log_interval = 10

# weight decay
weight_decay = 1e-1
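
For reference, a quick check of what this config works out to per optimizer step (a sketch, assuming nanoGPT's tokens_per_iter of batch_size * block_size * gradient_accumulation_steps, where the configured accumulation value is the total across GPUs and train.py divides it by the DDP world size):

# sketch: tokens per optimizer step with the config above
batch_size = 20
block_size = 1024
gradient_accumulation_steps = 5 * 2  # 10

tokens_per_iter = batch_size * block_size * gradient_accumulation_steps
print(f"{tokens_per_iter:,} tokens/iter")          # 204,800
print(f"{12 * 1024 * 5 * 8:,} reference target")   # 491,520 (~0.5M)

So this run sees roughly 0.2M tokens per step instead of the ~0.5M the stock config targets.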
@VatsaDev

That is a crazy high learning rate; it could be the issue. Also check your data, and check the val loss for overfitting.

@banyan-god

Is it though? That was the value in train.py. Either way, I tried a few runs but no luck:
[W&B loss charts, 4/3/2024]

yalding commented Apr 8, 2024

I have the same issue with a single 4060 Ti (16 GB) GPU.

yalding commented Apr 10, 2024

@banyan-god, did you try to match the total batch size of ~0.5M tokens? batch_size * num_gpus * grad_accum should come out to roughly 480 (× 1024 block size ≈ 491,520 tokens).

Your current total batch size is about 40% of the original, which might hurt given the stochastic nature of the training.

PS: my single-GPU training is too slow for that. :(
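
If it helps, here is a small sketch of picking gradient_accumulation_steps for a given GPU count so the product stays near the ~0.5M-token target (this assumes train.py's convention that the configured value is the total across ranks and must be divisible by the number of GPUs):

# hypothetical helper, not part of the repo
def grad_accum_for_target(batch_size, block_size, num_gpus, target_tokens=491_520):
    steps = round(target_tokens / (batch_size * block_size))
    # round up to a multiple of num_gpus so DDP can split it evenly
    return ((steps + num_gpus - 1) // num_gpus) * num_gpus

print(grad_accum_for_target(12, 1024, num_gpus=2))  # 40
print(grad_accum_for_target(20, 1024, num_gpus=2))  # 24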

@bigsnarfdude

These are the expected losses from the README.md:

| model | params | train loss | val loss |
| ------| ------ | ---------- | -------- |
| gpt2 | 124M         | 3.11  | 3.12     |
| gpt2-medium | 350M  | 2.85  | 2.84     |
| gpt2-large | 774M   | 2.66  | 2.67     |
| gpt2-xl | 1558M     | 2.56  | 2.54     |

banyan-god commented Apr 14, 2024

> @banyan-god, did you try to match the total batch size of ~0.5M tokens? batch_size * num_gpus * grad_accum should come out to roughly 480 (× 1024 block size ≈ 491,520 tokens).
>
> Your current total batch size is about 40% of the original, which might hurt given the stochastic nature of the training.
>
> PS: my single-GPU training is too slow for that. :(

Yes, I did. I used 2 for the GPU count instead of 8:

# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 16
block_size = 1024
gradient_accumulation_steps = 5 * 2

@yalding


yalding commented Apr 15, 2024

@banyan-god, you need to keep gradient_accumulation_steps at 40, not 10, to maintain the "total batch size ~0.5M" cited in the original comment from karpathy.

My GPU was fried after a week of training and I had to stop, but with a total batch size of ~0.5M I was able to beat my previous lowest loss.

@bigsnarfdude, your comment is not relevant, as those are the losses for GPT-2 loaded from the OpenAI weights, if I read the doc correctly.

@bigsnarfdude

I've trained 124M, medium, and large using both the OpenWebText and RedPajama datasets. Your iterations should be around 100k, and you will reach the same training and val loss as the GPT-2 loaded from weights. For example, 124M with grad_acc=5, batch_size=12, and the standard LR provided in the repo gives you a from-scratch pretrained model very similar to the posted chart.
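
Rough scale of that recipe, for anyone comparing (a sketch; the ~9B-token size of the prepared OpenWebText train split is my assumption based on the numbers printed by data/openwebtext/prepare.py):

# tokens seen after ~100k iterations at the stock ~0.5M tokens/step
tokens_per_iter = 12 * 1024 * 5 * 8       # 491,520
total_tokens = tokens_per_iter * 100_000  # ~49.2B
owt_train_tokens = 9.0e9                  # assumed size of train.bin
print(f"{total_tokens/1e9:.1f}B tokens, ~{total_tokens/owt_train_tokens:.1f} passes over OWT")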

banyan-god commented Apr 16, 2024

@yalding, so I started another job today with ~572.06M parameters and gradient accumulation of 40, as you suggested. I will report back on progress, or if it explodes.

always_save_checkpoint:true
backend:"nccl"
batch_size:5
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
log_interval:10
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:1,600
n_head:16
n_layer:16
out_dir:"out"
wandb_log:true
wandb_project:"owt"
wandb_run_name:"gpt2"
warmup_iters:2,000
weight_decay:0.1
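
As a sanity check on the model size, here is where the ~572.06M figure comes from (a sketch assuming the padded GPT-2 vocab of 50304, bias=False, tied embeddings, and nanoGPT's convention of reporting non-embedding parameters, i.e. excluding the position embedding):

# back-of-the-envelope parameter count for n_layer=16, n_head=16, n_embd=1600
n_layer, n_embd, vocab, block = 16, 1600, 50304, 1024

per_block = (
    3 * n_embd * n_embd          # attn.c_attn (q, k, v projections)
    + n_embd * n_embd            # attn.c_proj
    + 2 * (4 * n_embd) * n_embd  # mlp.c_fc + mlp.c_proj
    + 2 * n_embd                 # ln_1 + ln_2 weights (no bias)
)
total = vocab * n_embd + block * n_embd + n_layer * per_block + n_embd  # + ln_f
print(f"{(total - block * n_embd) / 1e6:.2f}M")  # ~572.06M (wpe excluded)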

yalding commented Apr 16, 2024

@banyan-god you are setting batch_size to 5. This again reduces the total batch size, to 5 * 40 * 1024 ≈ 0.2M tokens, which is less than half of the recommended ~0.5M total batch size...

@banyan-god

@yalding, OK, I rolled back all the hyperparameter changes and am just running with the following:
torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2.py

always_save_checkpoint:true
backend:"nccl"
batch_size:12
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006

@banyan-god

@yalding, unfortunately that didn't work either:
[Screenshot of loss curve, 2024-04-17]

yalding commented Apr 17, 2024

Did you validate the config logged in wandb? My last run config:

always_save_checkpoint:false
backend:"nccl"
batch_size:11
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:100
eval_iters:50
eval_only:false
grad_clip:1
gradient_accumulation_steps:50
init_from:"resume"
learning_rate:0.0006
log_interval:1
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:768
n_head:12
n_layer:12
out_dir:"out"
wandb_log:true
wandb_project:"owt"
wandb_run_name:"gpt2-124M-original-seed3000-bs550-resume"
warmup_iters:2,000
weight_decay:0.1

The final train/val loss before it crashed my GPU:

train/loss:3.237154483795166
val/loss:3.248462915420532

banyan-god commented Apr 17, 2024

@yalding

always_save_checkpoint:true
backend:"nccl"
batch_size:12
beta1:0.9
beta2:0.95
bias:false
block_size:1,024
compile:true
dataset:"openwebtext"
decay_lr:true
device:"cuda"
dropout:0
dtype:"bfloat16"
eval_interval:1,000
eval_iters:200
eval_only:false
grad_clip:1
gradient_accumulation_steps:40
init_from:"scratch"
learning_rate:0.0006
log_interval:10
lr_decay_iters:600,000
max_iters:600,000
min_lr:0.00006
n_embd:768
n_head:12
n_layer:12
out_dir:"out"
wandb_log:true
wandb_project:"sifra"
wandb_run_name:"sifra-124M"
warmup_iters:2,000
weight_decay:0.1

@banyan-god

I am also wondering if it is possibly something to do with the PyTorch version or the OpenWebText data.
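
One cheap way to rule the data out (a sketch, assuming the default bins written by data/openwebtext/prepare.py, which stores GPT-2 token ids as uint16):

import numpy as np

# sanity-check the prepared OpenWebText bin: all token ids should be
# inside the GPT-2 vocab (0..50256)
data = np.memmap("data/openwebtext/train.bin", dtype=np.uint16, mode="r")
print(f"{len(data):,} tokens")
sample = data[:10_000_000]                # a slice is enough for a quick check
print("max token id in sample:", int(sample.max()))  # expect <= 50256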

@seanxwzhang

I'm encountering the same issue. @banyan-god, did you eventually figure out a way to resolve this?

@banyan-god

@seanxwzhang I want to say it is a combination of the tokenizer and the dataset. When I switched over to the GPT-4 tokenizer, the problem disappeared.
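
For anyone trying the same switch: if that means tiktoken's cl100k_base, its ~100k vocab no longer fits the uint16 bins, so more than the encoder has to change. A sketch of the pieces involved, assuming the stock data/openwebtext/prepare.py and train.py layout (double-check the details against your checkout):

import pickle
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # stock prepare.py uses "gpt2"
print(enc.n_vocab)                            # ~100k, exceeds uint16's 65535

bin_dtype = np.uint32   # prepare.py writes np.uint16 for the 50k GPT-2 vocab;
                        # train.py's get_batch memmaps with np.uint16 too and
                        # needs the same change

# train.py falls back to vocab_size=50304 unless data/<dataset>/meta.pkl
# provides one, so record the new vocab size there:
with open("data/openwebtext/meta.pkl", "wb") as f:
    pickle.dump({"vocab_size": enc.n_vocab}, f)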

@seanxwzhang

Interesting. In my case it was fixed by using fp16 instead of bf16. I'm surprised that the tokenizer can have an effect on what looks like a numerical issue (or perhaps it isn't one).
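
For reference, the bf16 → fp16 switch is just the dtype config value; train.py only enables the GradScaler when dtype is 'float16'. A hedged example of how to set it (the command line mirrors the configurator overrides used elsewhere in this thread):

# in the config file:
dtype = 'float16'   # the default picks bfloat16 when the GPU supports it
# or as a command-line override:
#   torchrun --standalone --nproc_per_node=2 train.py config/train_gpt2.py --dtype=float16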

@mattgorb

@seanxwzhang @banyan-god Were you able to converge your training to ~2.9 on GPT-2 small? Did the loss go to NaN or explode back up? I am encountering the same issue and have tried both of your solutions (fp16 and the GPT-4 tokenizer).

If possible, please let me know which versions of torch you are using.

@seanxwzhang

> @seanxwzhang @banyan-god Were you able to converge your training to ~2.9 on GPT-2 small? Did the loss go to NaN or explode back up? I am encountering the same issue and have tried both of your solutions (fp16 and the GPT-4 tokenizer).
>
> If possible, please let me know which versions of torch you are using.

I was using torch 2.3.0.
