[BUG] Loss did not decrease in Albert example after 125000 max step. #447
Comments
I investigated what went wrong when training with only one trainer. Currently, hivemind.Optimizer is hard-wired to use the averaged gradients -- as in "averaged with peers". If you are the only peer, gradients are never averaged, so the optimizer runs with zero gradients all the time. Thank you again for the report! We'll get the fix to master as soon as possible. If possible, I'd appreciate it if you could verify that it works with 2 peers (and/or with the fix above) whenever you have time.
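To make the failure mode concrete, here is a minimal, self-contained PyTorch sketch (a toy model, not hivemind's actual internals) of why stepping on never-populated, zeroed-out gradients freezes training:

```python
import torch

# Toy stand-in for the training loop: one linear layer trained with SGD.
model = torch.nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 10), torch.randn(4, 10)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()  # local gradients are now populated

before = [p.detach().clone() for p in model.parameters()]

# Simulate the single-peer case: the averaging round never completes, so the
# optimizer effectively steps on zero "averaged" gradients instead of the
# local ones.
for p in model.parameters():
    p.grad = torch.zeros_like(p)
opt.step()

# Every parameter is bit-for-bit unchanged, so the loss cannot move.
print(all(torch.equal(b, a) for b, a in zip(before, model.parameters())))  # True
```

Every step of the single-peer run behaves like this, which is why the loss stays pinned at ~11.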
I tried it: with two peers it works, and with one peer it fails (the loss does not decrease).
Yes, it works with two peers. Thank you for the quick response!
[closing the issue. Feel free to add more if you encounter further problems]
Describe the bug
I ran the ALBERT example with WikiText data. I used one peer with the default settings (target_batch_size=4096, train_batch_size=4, max_step=125000, lr=0.00176), but the loss did not decrease during training: it started at ~11 and finished at ~11.
Jan 15 10:30:14.734 [INFO] Step #1 loss = 11.04938
Jan 15 10:32:14.842 [INFO] Step #2 loss = 11.05589
Jan 15 10:34:14.975 [INFO] Step #3 loss = 11.06803
Jan 15 10:36:15.093 [INFO] Step #4 loss = 11.06271
Jan 15 10:38:15.228 [INFO] Step #5 loss = 11.06433
Jan 15 10:40:15.337 [INFO] Step #6 loss = 11.05447
Jan 15 10:41:45.401 [INFO] Step #7 loss = 11.06115
Jan 15 10:43:45.541 [INFO] Step #8 loss = 11.06025
..........
Jan 15 18:09:13.117 [INFO] Step #238 loss = 11.05597
Jan 15 18:11:13.233 [INFO] Step #239 loss = 11.06724
Jan 15 18:13:13.369 [INFO] Step #240 loss = 11.06289
Jan 15 18:15:13.494 [INFO] Step #241 loss = 11.05922
Jan 15 18:16:43.577 [INFO] Step #242 loss = 11.05226
Jan 15 18:18:43.691 [INFO] Step #243 loss = 11.05418
Jan 15 18:20:43.843 [INFO] Step #244 loss = 11.05638
To Reproduce
Run the scripts from the ALBERT example.
For the monitor, I run:
python run_training_monitor.py \
    --experiment_prefix albert_experiment \
    --wandb_project albert_wandb
For the trainer, I run:
IP=/ip4/192.168.0.188/tcp/45731/p2p/QmSRerwCPUSreHhwMuTLHoVHqTfWuT8J57w3sXFZtU8ECo
WANDB_DISABLED=true CUDA_VISIBLE_DEVICES=0 python run_trainer.py \
    --experiment_prefix albert_experiment \
    --initial_peers $IP \
    --logging_first_step \
    --output_dir ./outputs \
    --overwrite_output_dir \
    --logging_dir ./logs \
    --dataset_path="/home/protago/Xiangpeng/hivemind/examples/albert/data/albert_tokenized_wikitext" \
    --per_device_train_batch_size 4 \
    --learning_rate 0.00176 \
    --num_train_epochs=5 \
    --save_steps=60000