add ddpg #62
base: master
Conversation
Make sure you set a name on every TensorFlow placeholder and operation so the graph is readable. I'm not sure if this implementation is serializable right now, so please ensure you can run DDPG with run_experiment_lite (add a separate launcher for this). See more comments in the code.
sandbox/rocky/tf/algos/ddpg/ddpg.py
Outdated
total_reward += reward
if terminal:
    self._report_total_reward(total_reward, episode_step)
    print("epoch %d, total reward %lf\n" % (episode_step,
Use rllab's logger for this (logger.record_tabular('key', value)) and log more performance measures than just the reward, e.g.:
- actor loss
- critic loss
- max and average predicted Q value
- etc. (maybe you can think of more things that could help us understand the performance)
See the sketch below the list for what this could look like.
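A sketch of the logging inside the training loop; the function arguments (epoch, episode_returns, actor_loss, critic_loss, q_values) stand in for whatever the implementation actually computes:

import numpy as np
from rllab.misc import logger

def log_diagnostics(epoch, episode_returns, actor_loss, critic_loss, q_values):
    # Record one row of diagnostics per epoch; dump_tabular writes it out.
    logger.record_tabular('Epoch', epoch)
    logger.record_tabular('AverageReturn', np.mean(episode_returns))
    logger.record_tabular('ActorLoss', actor_loss)
    logger.record_tabular('CriticLoss', critic_loss)
    logger.record_tabular('MaxQ', np.max(q_values))
    logger.record_tabular('AverageQ', np.mean(q_values))
    logger.dump_tabular()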
sandbox/rocky/tf/algos/ddpg/ddpg.py
Outdated
self._target_critic.name = 'target_critic'

# replay buffer
self._replay_buffer = ReplayBuffer(1e6)
The size of the replay buffer should be user-definable
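For example (a sketch; the parameter name replay_buffer_size is illustrative and ReplayBuffer is the class from this PR):

class DDPG:
    def __init__(self, env, replay_buffer_size=int(1e6)):
        # Let the caller choose the buffer capacity instead of hard-coding 1e6.
        self._replay_buffer = ReplayBuffer(replay_buffer_size)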
sandbox/rocky/tf/algos/ddpg/ddpg.py
Outdated
env,
gamma=0.99,
tau=0.001,
observation_range=(-5, 5),
observation_range isn't used anywhere (and shouldn't be needed).
sandbox/rocky/tf/algos/ddpg/ddpg.py
Outdated
self._replay_buffer = ReplayBuffer(1e6)

if log_dir:
    self._summary_writer = tf.summary.FileWriter(
Use rllab's logger
sandbox/rocky/tf/algos/ddpg/ddpg.py
Outdated
critic_l2_weight_decay=0.01,
action_noise=None,
plot=False,
check_point_dir=None,
Use logger.save_itr_params, which takes care of checkpoint folders etc.
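A rough sketch of saving per-iteration snapshots this way (assuming the actor, critic and env objects are picklable; the dict keys are illustrative):

from rllab.misc import logger

def save_snapshot(itr, actor, critic, env):
    # logger.save_itr_params handles the snapshot directory and file naming.
    params = dict(itr=itr, policy=actor, qf=critic, env=env)
    logger.save_itr_params(itr, params)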
I am not sure what this function is for yet. The checkpoint folder is used by TensorFlow; it seems we don't need to save anything other than the graph.
self._train_op = self._optimizer.apply_gradients(
    zip(grads, self.trainable_vars))

def _build_net(self, state):
Add support for a user-defined number and size of hidden layers, e.g. hidden_layers=[64, 64].
sandbox/rocky/tf/algos/ddpg/ddpg.py
Outdated
self._train_net()
self.save_session(episode_step)

def load_session(self):
We don't need this code if everything is serializable. Make sure you can run your implementation with run_experiment_lite.
sandbox/rocky/tf/algos/ddpg/ddpg.py
Outdated
action_range=(-1, 1),
actor_lr=1e-4,
critic_lr=1e-3,
reward_scale=1,
reward_scale should be called discount (in accordance with the other implementations). A value of 1 seems strange; are you sure it isn't < 1, e.g. 0.99?
In the paper and in the baselines implementation, this value is set to 1.
sandbox/rocky/tf/algos/ddpg/ddpg.py
Outdated
gamma=0.99,
tau=0.001,
observation_range=(-5, 5),
action_range=(-1, 1),
Assume that the algorithm always gets a normalized environment, where action_range is always in [-1, 1]. We need to implement normalization for gym Envs anyway, as described in #64.
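For reference, a rough sketch of how a launcher could obtain such an environment with rllab's existing wrappers (the exact wrapper for the tf sandbox may end up differing once #64 lands):

from rllab.envs.gym_env import GymEnv
from rllab.envs.normalized_env import normalize

# normalize() rescales actions into [-1, 1], so the algorithm itself
# no longer needs an explicit action_range argument.
env = normalize(GymEnv('Pendulum-v0'))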
soft_updates.append(
    tf.assign(target_var, (1. - tau) * target_var + tau * var))
assert len(soft_updates) == len(vars)
self._update_paras = tf.group(*soft_updates)
Make sure you set a meaningful name on every TensorFlow operation
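For example, something along these lines (a sketch; the chosen names are illustrative):

import tensorflow as tf

def build_soft_updates(vars, target_vars, tau):
    # Name each assignment and the grouped op so they are easy to find
    # when inspecting the graph or in TensorBoard.
    soft_updates = []
    for var, target_var in zip(vars, target_vars):
        soft_updates.append(
            tf.assign(target_var,
                      (1. - tau) * target_var + tau * var,
                      name='soft_update_' + var.op.name.replace('/', '_')))
    return tf.group(*soft_updates, name='soft_update_targets')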
def _fc(x,
        output_dim,
        weight_initializer=tf.contrib.layers.xavier_initializer(uniform=False),
Initialize the weights based on the seed from get_seed().
It seems the weight initializer does not use a seed.
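A sketch of wiring the experiment seed into the initializer (xavier_initializer accepts a seed argument; this assumes the launcher has already set the seed so ext.get_seed() returns it):

import tensorflow as tf
from rllab.misc import ext

# With the seed supplied, weight initialization becomes reproducible
# across runs of the same experiment.
weight_initializer = tf.contrib.layers.xavier_initializer(
    uniform=False, seed=ext.get_seed())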
examples/ddpg_gym_tf_pendulum.py
Outdated
RANDOM_SEED = 1234

np.random.seed(RANDOM_SEED)
tf.set_random_seed(RANDOM_SEED)
Remove these lines and call set_seed(seed) instead.
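i.e. something like this (a sketch):

from rllab.misc import ext

# A single call seeds the Python and numpy RNGs (and whatever backend
# RNGs rllab knows about) instead of seeding each library by hand.
ext.set_seed(1234)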
examples/ddpg_gym_tf_pendulum.py
Outdated
tf.set_random_seed(RANDOM_SEED)

env = gym.make('Pendulum-v0')
env.seed(RANDOM_SEED)
use get_seed()
examples/ddpg_gym_tf_pendulum.py
Outdated
env.seed(RANDOM_SEED)

action_dim = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(
the noise should also be based on a seed
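One possible way to do that (a sketch; the mu, sigma and rng parameters on OrnsteinUhlenbeckActionNoise are illustrative, and the constructor in this PR may not accept an rng yet):

import numpy as np
from rllab.misc import ext

# Hypothetical: drive the noise process from an explicitly seeded
# RandomState so exploration noise is reproducible across runs.
rng = np.random.RandomState(ext.get_seed())
action_noise = OrnsteinUhlenbeckActionNoise(
    mu=np.zeros(action_dim), sigma=0.2 * np.ones(action_dim), rng=rng)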
@@ -64,12 +65,12 @@ def _build_net(self, state):
     with tf.variable_scope(self.name) as scope:

         with tf.variable_scope('fc1'):
-            fc1_out = _fc(state, 64)
+            fc1_out = _fc(state, self._hidden_layers[0])
There could be more than 2 hidden layers, so you should create fc layers dynamically based on the array size.
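A minimal sketch of building the stack dynamically (assuming the existing _fc helper; whether the activation lives inside or outside _fc depends on the rest of the implementation):

# One fully connected layer per entry in self._hidden_layers,
# e.g. hidden_layers=[400, 300] or [64, 64, 64].
out = state
for i, size in enumerate(self._hidden_layers):
    with tf.variable_scope('fc%d' % (i + 1)):
        out = tf.nn.relu(_fc(out, size))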
@@ -166,13 +168,13 @@ def _build_net(self, state, action, reuse=False):
         scope.reuse_variables()

         with tf.variable_scope('fc1'):
-            fc1_out = _fc(state, 64)
+            fc1_out = _fc(state, self._hidden_layers[0])
There could be more than 2 hidden layers, so you should create fc layers dynamically based on the array size.
examples/ddpg_gym_tf_pendulum.py
Outdated
 env = gym.make('Pendulum-v0')
-env.seed(RANDOM_SEED)
+env.seed(ext.get_seed())

 action_dim = env.action_space.shape[-1]
 action_noise = OrnsteinUhlenbeckActionNoise(
Please add another launcher script which runs your code via run_experiment_lite. It should work like this:
def run_task(*_):
    # initialize actor, critic, noise, env, etc.
    env = ...
    algo = DDPG(env=env, ...)
    algo.train()

run_experiment_lite(
    run_task,
    n_parallel=20,
    plot=False,
)
We need to make sure your DDPG implementation is serializable.
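For reference, the usual rllab pattern (a sketch, not the final code) is to inherit from Serializable and capture the constructor arguments so run_experiment_lite can pickle and re-create the object:

from rllab.core.serializable import Serializable

class DDPG(Serializable):
    def __init__(self, env, actor, critic):
        # Record the constructor arguments so the instance can be pickled
        # and re-instantiated inside run_experiment_lite.
        Serializable.quick_init(self, locals())
        self._env = env
        self._actor = actor
        self._critic = critic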
add ddpg implementation to master branch