
Performance Check (Discrete actions) #49

Closed
3 tasks done
araffin opened this issue Jun 9, 2020 · 8 comments · Fixed by #110

araffin (Member) commented Jun 9, 2020

The discrete action counterpart of #48

Associated PR: #110

Test envs: Atari Games (Pong - easy, Breakout - medium, ...)

Miffyli (Collaborator) commented Jun 9, 2020

Initial results with PPO: it seems to mostly match the performance of SB PPO2, but with some glaring discrepancies (see the training runs over six games with somewhat different action spaces). It seems that at least a few games should be used for evaluation, because in some the SB3 version gets similar performance (e.g. MsPacman, Q*bert), while in others it does not reach the same numbers (e.g. Breakout, Enduro). I still have to double-check that the parameters were right, etc.

atari_ppo_sb.pdf
atari_ppo_sb3.pdf
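
For context, a minimal sketch of how such an SB3 Atari PPO run can be set up. The module paths follow recent SB3 releases, and the hyper-parameters below are the usual zoo-style Atari defaults; they are an assumption, not the exact configuration behind the plots above.

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Standard Atari preprocessing: 8 parallel envs plus 4-frame stacking.
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8, seed=0)
env = VecFrameStack(env, n_stack=4)

model = PPO(
    "CnnPolicy",
    env,
    n_steps=128,
    n_epochs=4,
    batch_size=256,
    learning_rate=2.5e-4,
    clip_range=0.1,
    ent_coef=0.01,
    verbose=1,
)
model.learn(total_timesteps=10_000_000)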

araffin (Member, Author) commented Jun 10, 2020

Are you using the zoo? And if so, which wrapper?
You should be using the dqn branch for both SB3 and the zoo.

Miffyli (Collaborator) commented Jun 10, 2020

> Are you using the zoo? And if so, which wrapper?
> You should be using the dqn branch for both SB3 and the zoo.

No zoo; based on this code. These are copied and modified wrappers from SB. The only thing that changes between the SB and SB3 runs is where the algorithm is imported from (roughly as sketched below); the rest is handled by the other code (and is the same).
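
For illustration, a minimal sketch of what that import swap amounts to; CartPole stands in for the wrapped Atari env, and the surrounding harness is assumed.

import gym

# The SB (TensorFlow) run would import:
#   from stable_baselines import PPO2 as Algo
# The SB3 (PyTorch) run only changes this import:
from stable_baselines3 import PPO as Algo

# Everything downstream (wrappers, evaluation, logging) is shared and only
# touches this common constructor/learn() interface.
env = gym.make("CartPole-v1")  # stand-in for the shared wrapped env
model = Algo("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)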

m-rph (Contributor) commented Jul 17, 2020

Cross-posting:

Relevant: I am getting some rather weird performance from DQN; it seems to reach 0 fps (this was with num_threads=1 and the old polyak update). When using an ensemble of 10 estimators I got much better performance, and I can't pinpoint the issue.

[attached image: training curves]

In the policy, instead of having a single QNetwork, I have n_estimators identical QNetworks and their estimates are averaged.
Note that this was running on a GPU and the environment was LunarLander.

n_estimators is a hyper-parameter for a custom version of DQN that uses an ensemble of n_estimators networks, each identical (except for the weights) to the QNetwork of DQN.

This is observed with the latest version of DQN.
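
For clarity, a rough PyTorch sketch of that ensemble idea (a hypothetical module, not the actual code used above): n_estimators identical Q-networks whose outputs are averaged before the greedy action is picked.

import torch
import torch.nn as nn


class EnsembleQNetwork(nn.Module):
    """Hypothetical ensemble of identical Q-networks whose estimates are averaged."""

    def __init__(self, obs_dim: int, n_actions: int, n_estimators: int = 10, hidden: int = 256):
        super().__init__()
        self.estimators = nn.ModuleList(
            nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )
            for _ in range(n_estimators)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Average the Q-value estimates over the ensemble members.
        return torch.stack([q(obs) for q in self.estimators], dim=0).mean(dim=0)


# Greedy action selection on a batch of observations
# (LunarLander-v2 has an 8-dim observation and 4 discrete actions).
q_net = EnsembleQNetwork(obs_dim=8, n_actions=4, n_estimators=10)
actions = q_net(torch.randn(32, 8)).argmax(dim=1)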

jarlva commented Jul 17, 2020

Hello all,

I've been an avid SB1 user for over a year; it's an amazing framework with thorough documentation and an active support community. New RL developments, such as asynchronous PPO, have propelled RL to new highs and can scale 3x or more on the same hardware. In my humble opinion it may be a good time to start thinking seriously about async support. I believe SB3 would greatly benefit from it, making it a strong, viable framework into the future!

Miffyli (Collaborator) commented Jul 17, 2020

@partiallytyped

I will work on DQN next. Could you share which envs/settings you used to get stuck like that with a "standard" setup?

@jarlva

This is on the suggestions list for v1.2, I believe. At the moment we are working on optimizing the performance of even the synchronous variants, and PyTorch is not making things easy with its tendency to use too many threads at once :)
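
As a side note on the threading issue, a common workaround in plain PyTorch training scripts (an assumption about general PyTorch usage, not a description of SB3 internals) is to cap the number of CPU threads:

import torch

# Limit intra-op parallelism so one training process does not grab every CPU core.
torch.set_num_threads(1)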

jarlva commented Jul 17, 2020

Completely understand, @Miffyli. Would it be helpful to review https://github.com/alex-petrenko/sample-factory?

m-rph (Contributor) commented Jul 17, 2020

@Miffyli

The script that runs the DQN agent:

from stable_baselines3 import DQN
import argparse



if __name__ == "__main__":
    parser = argparse.ArgumentParser()
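    # Everything parsed here is forwarded directly to the DQN constructor below.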
    parser.add_argument("--lr","--learning-rate", type=float, default=1e-4, dest="learning_rate")
    parser.add_argument("env", type=str)
    parser.add_argument("--policy", default="MlpPolicy")
    parser.add_argument("--policy-kwargs", type=eval, default={})
    parser.add_argument("--buffer-size", type=int, default=int(1e5))
    parser.add_argument("--learning-starts", type=int, default=5000)
    parser.add_argument("--batch-size", default=32, type=int)
    parser.add_argument("--tau", type=float, default=1.0)
    parser.add_argument("--gamma", default=0.99, type=float)
    parser.add_argument("--train-freq", type=int, default=4)
    parser.add_argument("--gradient-steps", type=int, default=-1)
    parser.add_argument("--n-episodes-rollout", type=int, default=-1)
    parser.add_argument("--target-update-interval", type=int, default=5000)
    parser.add_argument("--exploration-fraction", type=float, default=0.2)
    parser.add_argument("--exploration-initial-eps", type=float, default=1.0)
    parser.add_argument("--exploration-final-eps", type=float, default=0.05)
    learn = argparse.ArgumentParser()
    learn.add_argument("--n-timesteps", default=int(5e5), type=int, dest="total_timesteps")
    learn.add_argument("--eval-freq", type=int, default=10)
    learn.add_argument("--n-eval-episodes", type=int, default=5)
    agent_args, learn_args = parser.parse_known_args()
    learn_args = learn.parse_args(learn_args)
    
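    # Build the agent from the constructor args and train it with the learn() args.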
    agent = DQN(**agent_args.__dict__, verbose=2, create_eval_env=True, tensorboard_log=f"tb/dqn_{agent_args.env}")
    agent.learn(**learn_args.__dict__)

The command I call the above script with:

python dqn.py "LunarLander-v2" --n-timesteps=50000 --learning-rate 1e-4 --batch-size 128 --buffer-size 50000 --learning-starts 0 --gamma 0.99 --target-update-interval 1000 --train-freq 4 --gradient-steps -1 --exploration-fraction 0.12 --exploration-final-eps 0.05 --policy-kwargs "dict(net_arch=[256, 256])"

The hyperparameters (except the learning rate) are taken from the zoo.
