Cifar-10 example arguments being overwritten #57

Closed
Enumaris opened this issue Oct 1, 2020 · 2 comments


Enumaris commented Oct 1, 2020

I'm trying out the CIFAR-10 examples, and it appears that passing in some arguments doesn't work because they are being overwritten somewhere.

I tried changing the number of epochs by using the --epochs flag, but it looks like the cifar10_deepspeed.py script has hard-coded 2 epochs:

```python
for epoch in range(2):  # loop over the dataset multiple times
```
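For reference, here is a minimal sketch of what honoring the flag could look like (assuming the script parses --epochs via argparse; the names below are illustrative, not the script's actual code):

```python
import argparse

# Hypothetical argument parsing; the real script's parser may differ.
parser = argparse.ArgumentParser(description='CIFAR-10 DeepSpeed example')
parser.add_argument('--epochs', type=int, default=2,
                    help='number of passes over the training set')
args = parser.parse_args()

# The loop would then respect the flag instead of the hard-coded 2:
for epoch in range(args.epochs):
    ...  # training step
```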

I also tried to change the learning rate to 0.0005 by editing the ds_config.json file, and it seems like that gets picked up in some parts but overwritten in others.
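For reference, the optimizer section of my ds_config.json now looks roughly like this (the "Adam" type reflects the example's default as I understand it; the parameter values match the log line below):

```json
{
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005,
      "betas": [0.8, 0.999],
      "eps": 1e-8,
      "weight_decay": 3e-7
    }
  }
}
```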

For example, I see:

```
worker-0: [2020-10-01 22:41:49,395] [INFO] [config.py:624:print] optimizer_params ............. {'lr': 0.0005, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
```

That seems to have picked it up, but when training actually runs it always says:

```
worker-0: [2020-10-01 22:42:57,190] [INFO] [logging.py:60:log_dist] [Rank 0] step=18000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
```

That seems to have not picked up the lr change (it also stays at 0.001 throughout, which suggests it's not doing any lr_warmup either). I haven't tracked down where in the script the learning rate gets overwritten... but it does seem to be happening.
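(For context, my understanding is that warmup would normally be configured through a scheduler block in ds_config.json along these lines; the exact values here are illustrative, not from the example's shipped config:)

```json
{
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 0.0005,
      "warmup_num_steps": 1000
    }
  }
}
```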

PareesaMS (Contributor) commented:

You are right about the epochs being hard-coded. Please use this patch to resolve the issue: #759

As for the learning rate: it used to be hard-coded to 0.001, but that line of code has since been commented out, so that part of the issue should already be resolved.

PareesaMS (Contributor) commented:

The issue with the epochs being hard-coded is fixed here, so I am closing this issue.
