[BUG] Loss did not decrease in Albert example after 125000 max step. #447
Comments
I investigated what went wrong when training with only one trainer. Currently, hivemind.Optimizer is hard-wired to use the averaged gradients -- as in "averaged with peers". If you are the only peer, gradients are never averaged, so the optimizer runs with zero gradients all the time. Thank you again for the report! We'll get the fix to master as soon as possible. If possible, I'd appreciate it if you could verify that it works with 2 peers (and/or with the fix above) whenever you have time.
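To make the failure mode concrete, here is a minimal, self-contained PyTorch sketch (a toy model, not hivemind's actual internals) of why stepping on never-populated, zeroed-out gradients freezes training:

```python
import torch

# Toy stand-in for the training loop: one linear layer trained with SGD.
model = torch.nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 10), torch.randn(4, 10)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()  # local gradients are now populated

before = [p.detach().clone() for p in model.parameters()]

# Simulate the single-peer case: the averaging round never completes, so the
# optimizer effectively steps on zero "averaged" gradients instead of the
# local ones.
for p in model.parameters():
    p.grad = torch.zeros_like(p)
opt.step()

# Every parameter is bit-for-bit unchanged, so the loss cannot move.
print(all(torch.equal(b, a) for b, a in zip(before, model.parameters())))  # True
```

Every step of the single-peer run behaves like this, which is why the loss stays pinned at ~11.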
I tried it: with two peers it works, and with one peer it fails (the loss does not decrease).
Yes, it works with two peers. Thank you for the quick response!
[closing the issue. Feel free to add more if you encounter further problems]
Describe the bug
I ran the ALBERT example with WikiText data. I used one peer with the default settings (target_batch_size=4096, train_batch_size=4, max_step=125000, lr=0.00176), but the loss did not decrease during training: it started at ~11 and finished at ~11.
Jan 15 10:30:14.734 [INFO] Step #1 loss = 11.04938
Jan 15 10:32:14.842 [INFO] Step #2 loss = 11.05589
Jan 15 10:34:14.975 [INFO] Step #3 loss = 11.06803
Jan 15 10:36:15.093 [INFO] Step #4 loss = 11.06271
Jan 15 10:38:15.228 [INFO] Step #5 loss = 11.06433
Jan 15 10:40:15.337 [INFO] Step #6 loss = 11.05447
Jan 15 10:41:45.401 [INFO] Step #7 loss = 11.06115
Jan 15 10:43:45.541 [INFO] Step #8 loss = 11.06025
..........
Jan 15 18:09:13.117 [INFO] Step #238 loss = 11.05597
Jan 15 18:11:13.233 [INFO] Step #239 loss = 11.06724
Jan 15 18:13:13.369 [INFO] Step #240 loss = 11.06289
Jan 15 18:15:13.494 [INFO] Step #241 loss = 11.05922
Jan 15 18:16:43.577 [INFO] Step #242 loss = 11.05226
Jan 15 18:18:43.691 [INFO] Step #243 loss = 11.05418
Jan 15 18:20:43.843 [INFO] Step #244 loss = 11.05638
To Reproduce
Run the scripts from the ALBERT example.
For the monitor, I run:
python run_training_monitor.py \
    --experiment_prefix albert_experiment \
    --wandb_project albert_wandb
For the trainer, I run:
IP=/ip4/192.168.0.188/tcp/45731/p2p/QmSRerwCPUSreHhwMuTLHoVHqTfWuT8J57w3sXFZtU8ECo
WANDB_DISABLED=true CUDA_VISIBLE_DEVICES=0 python run_trainer.py \
    --experiment_prefix albert_experiment \
    --initial_peers $IP \
    --logging_first_step \
    --output_dir ./outputs \
    --overwrite_output_dir \
    --logging_dir ./logs \
    --dataset_path="/home/protago/Xiangpeng/hivemind/examples/albert/data/albert_tokenized_wikitext" \
    --per_device_train_batch_size 4 \
    --learning_rate 0.00176 \
    --num_train_epochs=5 \
    --save_steps=60000