[Question] Optimal hyperparameters and scripts to reach 2000 steps/sec training speed #58

wenjie-mo · 2022-10-12T05:17:49Z

Question

Hello, I am wondering which script and hyperparameters can achieve the 2000+ steps/sec training speed mentioned in the paper. So far I have tried the following:

1. run_sample_factory.py algorithm=APPO
   Problem: with the sample_factory library, the parameters lr_schedule and max_entropy_coeff are missing, and I'm not sure what values I should use for them.
2. run_rllib.py
   Problem: every worker fails with the same runtime error, attached below:

3. nocturne_runner.py
   Problem: training is not that fast (around 100 steps/sec at around 40 fps). I tried the suggestions in "Issues Running MAPPO" #38, and fps improved to around 80, but steps/sec stayed about the same.

My settings:
Code: newest code from main branch
OS: Ubuntu 20.04
GPU: RTX 3080 with CUDA 11.6
sample_factory: I have tried both the latest version and commit aed6cc92a7eb3510c4d4bcfac083ced07b5222f9 (as mentioned in the paper)

Please let me know if I did anything wrong when running the scripts. Thanks so much for answering!

eugenevinitsky · 2022-10-12T13:49:22Z

Hi! Sorry you've been having trouble. Let me answer each point piece by piece. First off, that 2k number corresponds to environment stepping time (i.e. no RL algorithm in the loop), so during training you'll see an FPS that differs significantly from algorithm to algorithm, depending on the type of policy used and on whether the environment calculates a per-agent FPS or an overall "amount of experience generated per second in total". As for each particular one:
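To make the distinction above concrete, here is a minimal sketch of timing raw environment stepping with no policy or learner in the loop, and of how a per-agent count differs from a per-environment-step count. `DummyMultiAgentEnv` and `measure_throughput` are hypothetical illustrations, not the Nocturne API; the dummy `step` just burns some CPU in place of real simulation work.

```python
import time


class DummyMultiAgentEnv:
    """Hypothetical stand-in for a multi-agent simulator (NOT the Nocturne API)."""

    def __init__(self, num_agents: int, work: int = 1000):
        self.num_agents = num_agents
        self.work = work

    def reset(self):
        return {agent_id: 0.0 for agent_id in range(self.num_agents)}

    def step(self, actions):
        # Burn a little CPU to mimic the cost of one simulation step.
        acc = 0
        for i in range(self.work):
            acc += i * i
        return {agent_id: float(acc % 7) for agent_id in actions}


def measure_throughput(env, num_steps: int = 1000):
    """Time raw env stepping only: no policy inference, no gradient updates."""
    env.reset()
    actions = {agent_id: 0 for agent_id in range(env.num_agents)}
    start = time.perf_counter()
    for _ in range(num_steps):
        env.step(actions)
    elapsed = time.perf_counter() - start
    env_steps_per_sec = num_steps / elapsed
    return {
        "env_steps_per_sec": env_steps_per_sec,
        # Each env step yields one transition per agent, so a per-agent
        # count is num_agents times larger than the per-step count.
        "agent_steps_per_sec": env_steps_per_sec * env.num_agents,
    }


if __name__ == "__main__":
    print(measure_throughput(DummyMultiAgentEnv(num_agents=10)))
```

The same simulator can therefore be reported at very different "FPS" values depending on which of the two numbers a training library logs, and both will be lower once a policy forward pass and learner updates are added to the loop.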

1. For the first one: we didn't freeze our sample-factory version, and the newest one has an additional hparam that our version didn't have. This is fixed in "Freeze sample-factory version, add missing hparams" #59, which will be merged shortly. If you run that PR on your machine, you should see about 10k-20k fps.
2. I'm looking into this one; it usually means something went wrong with setting the config.
3. For this one, you need to increase the value of n_training_threads. By default, the environment runs without any vectorization. Hope that helps!
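As a rough illustration of why running several environment copies at once raises aggregate throughput, here is a minimal sketch that steps N copies concurrently and counts total steps per second. It is a generic example under stated assumptions, not how `nocturne_runner.py` or its `n_training_threads` setting is actually implemented. `SleepyEnv`, `rollout`, and `collect` are hypothetical names. The sleep stands in for simulator work done outside the Python interpreter, which is what lets plain threads overlap. Real training setups usually run environment copies in separate processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class SleepyEnv:
    """Hypothetical env whose step cost is simulated with a sleep (NOT Nocturne)."""

    def __init__(self, step_cost_s: float = 0.001):
        self.step_cost_s = step_cost_s

    def step(self, action):
        # Assumed: simulator work happens outside Python, so threads can overlap it.
        time.sleep(self.step_cost_s)
        return 0.0


def rollout(env, num_steps: int) -> int:
    """Step a single env copy and return how many steps it completed."""
    for _ in range(num_steps):
        env.step(0)
    return num_steps


def collect(num_envs: int, num_steps: int):
    """Step `num_envs` copies concurrently; return (total_steps, steps_per_sec)."""
    envs = [SleepyEnv() for _ in range(num_envs)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_envs) as pool:
        counts = list(pool.map(lambda env: rollout(env, num_steps), envs))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    return total, total / elapsed


if __name__ == "__main__":
    for n in (1, 4, 8):
        total, sps = collect(num_envs=n, num_steps=200)
        print(f"{n} env copies: {total} steps, {sps:.0f} steps/sec")
```

With a single copy, throughput is bounded by one environment's step latency. With N copies, wall-clock time stays roughly the same while total steps grow N-fold, until CPU cores or the learner become the bottleneck.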

wenjie-mo · 2022-10-12T16:05:30Z

Hi Eugene, thanks so much for the reply and clarification! I will try out these solutions soon and let you know if they all work!

wenjie-mo · 2022-10-12T16:07:11Z

Hi, sorry, I accidentally closed the issue. I'd like to keep it open for tracking purposes. Thanks!

wenjie-mo added the "question" label (Further information is requested) Oct 12, 2022

wenjie-mo closed this as completed Oct 12, 2022

wenjie-mo reopened this Oct 12, 2022
