Trying out the CIFAR-10 example, it appears that some arguments don't take effect because they are being overwritten somewhere.
I tried changing the number of epochs to run with the --epochs flag, but it looks like the cifar10_deepspeed.py script hard-codes 2 epochs:
for epoch in range(2): # loop over the dataset multiple times
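For comparison, here is a minimal sketch of what wiring the flag into the loop could look like, assuming the script parses its arguments with argparse (the flag name matches what I passed on the command line; the default shown is illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description='CIFAR-10 DeepSpeed example')
# Assumed flag; the real script may declare it differently or not at all.
parser.add_argument('--epochs', type=int, default=2,
                    help='number of passes over the training set')
args = parser.parse_args()

for epoch in range(args.epochs):  # loop over the dataset multiple times
    ...
```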
I also tried changing the learning rate to 0.0005 by editing the ds_config.json file, and it seems like that gets picked up in some parts but overwritten in others.
For example, I see:
worker-0: [2020-10-01 22:41:49,395] [INFO] [config.py:624:print] optimizer_params ............. {'lr': 0.0005, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
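For context, the optimizer block in ds_config.json after my edit looks roughly like this (a sketch reconstructed from the optimizer_params printout above; the optimizer type is whatever the example ships with, assumed here to be Adam):

```json
{
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005,
      "betas": [0.8, 0.999],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  }
}
```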
Which seems to have picked it up, but when it actually runs the training it always says:
worker-0: [2020-10-01 22:42:57,190] [INFO] [logging.py:60:log_dist] [Rank 0] step=18000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
Which seems to have not picked up the lr change (it stays at 0.001 throughout, which also suggests it isn't doing any lr_warmup). I haven't tracked down where in the script the learning rate gets overwritten, but it does seem to be happening.
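As a quick sanity check, one way to see what learning rate the engine is actually using is to print the wrapped optimizer's param groups right after deepspeed.initialize (a sketch; the variable names and initialize arguments follow common usage of the API rather than the script's exact code):

```python
import deepspeed

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
# net, args, and trainset here stand in for whatever the script defines.
model_engine, optimizer, trainloader, lr_scheduler = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=net.parameters(),
    training_data=trainset)

# The effective learning rate lives in the optimizer's param groups; if the
# config change took effect, this should print 0.0005, not 0.001.
for group in optimizer.param_groups:
    print('effective lr:', group['lr'])
```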
You are right about the epochs being hard-coded. Please use this patch to resolve the issue: #759
About the learning rate: I see that it used to be hard-coded to 0.001, but that line of code is already commented out, so this issue should have been resolved.
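For illustration, the pattern being described looks something like the following (hypothetical; the actual lines in the repo may differ, and args/net stand in for the script's own objects):

```python
import deepspeed

# Old behavior: learning rate pinned in code, ignoring ds_config.json, e.g.
# optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

# Current behavior: no client optimizer is passed, so deepspeed.initialize
# builds one from the "optimizer" section of ds_config.json.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=net, model_parameters=net.parameters())
```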