Abstract: Model-free deep reinforcement learning algorithms have been shown to be capable of learning a wide range of robotic skills, but typically require a very large number of samples to achieve good performance. Model-based algorithms, in principle, can provide for much more efficient learning, but have proven difficult to extend to expressive, high-capacity models such as deep neural networks. In this work, we demonstrate that medium-sized neural network models can in fact be combined with model predictive control (MPC) to achieve excellent sample complexity in a model-based reinforcement learning algorithm, producing stable and plausible gaits to accomplish various complex locomotion tasks. We also propose using deep neural network dynamics models to initialize a model-free learner, in order to combine the sample efficiency of model-based approaches with the high task-specific performance of model-free methods. We empirically demonstrate on MuJoCo locomotion tasks that our pure model-based approach trained on just minutes of random action data can follow arbitrary trajectories, and that our hybrid algorithm can accelerate model-free learning on high-speed benchmark tasks, achieving sample efficiency gains of 3-5x on swimmer, cheetah, hopper, and ant agents.
- For installation guide, go to installation.md
- For notes on how to use your own environment, how to edit envs, etc. go to notes.md
cd scripts
./swimmer_mbmf.sh
./cheetah_mbmf.sh
./hopper_mbmf.sh
./ant_mbmf.sh
Each of those scripts does something similar to the following (but for multiple seeds):
python main.py --seed=0 --run_num=1 --yaml_file='swimmer_forward'
python mbmf.py --run_num=1 --which_agent=2
python trpo_run_mf.py --seed=0 --save_trpo_run_num=1 --which_agent=2 --num_workers_trpo=2 --std_on_mlp_policy=0.5
python plot_mbmf.py --trpo_dir=[trpo_dir] --std_on_mlp_policy=0.5 --which_agent=2 --run_nums 1 --seeds 0
Note that [trpo_dir] above corresponds to where the TRPO runs are saved. Probably somewhere in ~/rllab/data/...
Each of these steps are further explained in the following sections.
Need to specify:
--yaml_file Specify the corresponding yaml file
--seed Set random seed to set for numpy and tensorflow
--run_num Specify what directory to save files under
--use_existing_training_data To use the data that already exists in the directory run_num instead of recollecting
--desired_traj_type What type of trajectory to follow (if you want to follow a trajectory)
--num_rollouts_save_for_mf Number of on-policy rollouts to save after last aggregation iteration, to be used later
--might_render If you might want to visualize anything during the run
--visualize_MPC_rollout To set a breakpoint and visualize the on-policy rollouts after each agg iteration
--perform_forwardsim_for_vis To visualize an open-loop prediction made by the learned dynamics model
--print_minimal To not print messages regarding progress/notes/etc.
python main.py --seed=0 --run_num=0 --yaml_file='cheetah_forward'
python main.py --seed=0 --run_num=1 --yaml_file='swimmer_forward'
python main.py --seed=0 --run_num=2 --yaml_file='ant_forward'
python main.py --seed=0 --run_num=3 --yaml_file='hopper_forward'
python main.py --seed=0 --run_num=4 --yaml_file='cheetah_trajfollow' --desired_traj_type='straight' --visualize_MPC_rollout
python main.py --seed=0 --run_num=4 --yaml_file='cheetah_trajfollow' --desired_traj_type='backward' --visualize_MPC_rollout --use_existing_training_data --use_existing_dynamics_model
python main.py --seed=0 --run_num=4 --yaml_file='cheetah_trajfollow' --desired_traj_type='forwardbackward' --visualize_MPC_rollout --use_existing_training_data --use_existing_dynamics_model
python main.py --seed=0 --run_num=5 --yaml_file='swimmer_trajfollow' --desired_traj_type='straight' --visualize_MPC_rollout
python main.py --seed=0 --run_num=5 --yaml_file='swimmer_trajfollow' --desired_traj_type='left_turn' --visualize_MPC_rollout --use_existing_training_data --use_existing_dynamics_model
python main.py --seed=0 --run_num=5 --yaml_file='swimmer_trajfollow' --desired_traj_type='right_turn' --visualize_MPC_rollout --use_existing_training_data --use_existing_dynamics_model
python main.py --seed=0 --run_num=6 --yaml_file='ant_trajfollow' --desired_traj_type='straight' --visualize_MPC_rollout
python main.py --seed=0 --run_num=6 --yaml_file='ant_trajfollow' --desired_traj_type='left_turn' --visualize_MPC_rollout --use_existing_training_data --use_existing_dynamics_model
python main.py --seed=0 --run_num=6 --yaml_file='ant_trajfollow' --desired_traj_type='right_turn' --visualize_MPC_rollout --use_existing_training_data --use_existing_dynamics_model
python main.py --seed=0 --run_num=6 --yaml_file='ant_trajfollow' --desired_traj_type='u_turn' --visualize_MPC_rollout --use_existing_training_data --use_existing_dynamics_model
Need to specify:
--save_trpo_run_num number Number used as part of directory name for saving mbmf TRPO run (you can use 1,2,3,etc to differentiate your different seeds)
--run_num Specify what directory to get relevant MB data from & to save new MBMF files in
--which_agent Specify which agent (1 ant, 2 swimmer, 4 cheetah, 6 hopper)
--std_on_mlp_policy Initial std you want to set on your pre-initialization policy for TRPO to use
--num_workers_trpo How many worker threads (cpu) for TRPO to use
--might_render If you might want to visualize anything during the run
--visualize_mlp_policy To visualize the rollout performed by trained policy (that will serve as pre-initialization for TRPO)
--visualize_on_policy_rollouts To set a breakpoint and visualize the on-policy rollouts after each agg iteration of dagger
--print_minimal To not print messages regarding progress/notes/etc.
--use_existing_pretrained_policy To run only the TRPO part (if you've already done the imitation learning part in the same directory)
Note that the finished TRPO run saves to ~/rllab/data/local/experiments/
python mbmf.py --run_num=1 --which_agent=2 --std_on_mlp_policy=1.0
python mbmf.py --run_num=0 --which_agent=4 --std_on_mlp_policy=0.5
python mbmf.py --run_num=3 --which_agent=6 --std_on_mlp_policy=1.0
python mbmf.py --run_num=2 --which_agent=1 --std_on_mlp_policy=0.5
Run pure TRPO, for comparisons.
Need to specify command line args as desired
--seed Set random seed to set for numpy and tensorflow
--steps_per_rollout Length of each rollout that TRPO should collect
--save_trpo_run_num Number used as part of directory name for saving TRPO run (you can use 1,2,3,etc to differentiate your different seeds)
--which_agent Specify which agent (1 ant, 2 swimmer, 4 cheetah, 6 hopper)
--num_workers_trpo How many worker threads (cpu) for TRPO to use
--num_trpo_iters Total number of TRPO iterations to run before stopping
Note that the finished TRPO run saves to ~/rllab/data/local/experiments/
python trpo_run_mf.py --seed=0 --save_trpo_run_num=1 --which_agent=4 --num_workers_trpo=4
python trpo_run_mf.py --seed=0 --save_trpo_run_num=1 --which_agent=2 --num_workers_trpo=4
python trpo_run_mf.py --seed=0 --save_trpo_run_num=1 --which_agent=1 --num_workers_trpo=4
python trpo_run_mf.py --seed=0 --save_trpo_run_num=1 --which_agent=6 --num_workers_trpo=4
python trpo_run_mf.py --seed=50 --save_trpo_run_num=2 --which_agent=4 --num_workers_trpo=4
python trpo_run_mf.py --seed=50 --save_trpo_run_num=2 --which_agent=2 --num_workers_trpo=4
python trpo_run_mf.py --seed=50 --save_trpo_run_num=2 --which_agent=1 --num_workers_trpo=4
python trpo_run_mf.py --seed=50 --save_trpo_run_num=2 --which_agent=6 --num_workers_trpo=4
- MBMF
-Need to specify the commandline arguments as desired (in plot_mbmf.py)
-Examples of running the plotting script:
cd plotting
python plot_mbmf.py --trpo_dir=[trpo_dir] --std_on_mlp_policy=1.0 --which_agent=2 --run_nums 1 --seeds 0
python plot_mbmf.py --trpo_dir=[trpo_dir] --std_on_mlp_policy=1.0 --which_agent=2 --run_nums 1 2 3 --seeds 0 70 100
Note that [trpo_dir] above corresponds to where the TRPO runs are saved. Probably somewhere in ~/rllab/data/...
-
Dynamics model training and validation losses per aggregation iteration
IPython notebook: plotting/plot_loss.ipynb
Example plots: docs/sample_plots/... -
Visualize a forward simulation (an open-loop multi-step prediction of the elements of the state space)
IPython notebook: plotting/plot_forwardsim.ipynb
Example plots: docs/sample_plots/... -
Visualize the trajectories (on policy rollouts) per aggregation iteration
IPython notebook: plotting/plot_trajfollow.ipynb
Example plots: docs/sample_plots/...