
add ddpg #62

Open

cjcchen wants to merge 4 commits into master
Conversation


@cjcchen cjcchen commented May 16, 2018

Add a DDPG implementation to the master branch.

@cjcchen cjcchen mentioned this pull request May 16, 2018
@ryanjulian ryanjulian added this to the Week of May 14 milestone May 16, 2018
Collaborator

@eric-heiden eric-heiden left a comment

Make sure you set a name on every TensorFlow placeholder and operation so the graph is readable. I'm not sure if this implementation is serializable right now, so please ensure you can run DDPG with run_experiment_lite (add a separate launcher for this). See more comments in the code.

total_reward += reward
if terminal:
    self._report_total_reward(total_reward, episode_step)
    print("epoch %d, total reward %lf\n" % (episode_step,
Collaborator

Use rllab's logger for this (logger.record_tabular('key', value)) and log more performance measures than reward, e.g.:

  • actor loss
  • critic loss
  • max and average predicted Q value
  • etc. (maybe you can think of more things that could help us understand the performance)
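
For reference, a minimal sketch of what that logging could look like with rllab's tabular logger (epoch, episode_returns, actor_loss, critic_loss and q_values are stand-ins for whatever the training loop already tracks, not names from this PR):

import numpy as np
from rllab.misc import logger

# log once per epoch, then flush the row to the tabular output
logger.record_tabular('Epoch', epoch)
logger.record_tabular('AverageReturn', np.mean(episode_returns))
logger.record_tabular('ActorLoss', actor_loss)
logger.record_tabular('CriticLoss', critic_loss)
logger.record_tabular('MaxQValue', np.max(q_values))
logger.record_tabular('AverageQValue', np.mean(q_values))
logger.dump_tabular()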

self._target_critic.name = 'target_critic'

#replay buffer
self._replay_buffer = ReplayBuffer(1e6)
Collaborator

The size of the replay buffer should be user-definable
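
A minimal sketch of the suggested interface (the parameter name replay_buffer_size is an assumption):

# hypothetical constructor signature; the buffer size becomes a keyword argument
def __init__(self, env, gamma=0.99, tau=0.001, replay_buffer_size=int(1e6), **kwargs):
    self._replay_buffer = ReplayBuffer(replay_buffer_size)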

env,
gamma=0.99,
tau=0.001,
observation_range=(-5, 5),
Collaborator

observation_range isn't used anywhere (and shouldn't be needed)

self._replay_buffer = ReplayBuffer(1e6)

if log_dir:
    self._summary_writer = tf.summary.FileWriter(
Collaborator

Use rllab's logger

critic_l2_weight_decay=0.01,
action_noise=None,
plot=False,
check_point_dir=None,
Collaborator

Use logger.save_itr_params which takes care of checkpoint folders etc.

Author

I'm not sure what this function is for yet. The checkpoint folder is used by TensorFlow; it seems we only need to save the graph, not any other parameters.
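
For context, save_itr_params is rllab's snapshotting hook; a rough sketch of typical usage (the attribute names below are assumptions about this DDPG class):

from rllab.misc import logger

# snapshot everything needed to resume training; the logger writes itr_<N>.pkl
# into the snapshot directory configured by run_experiment_lite
params = dict(
    itr=itr,               # current iteration (stand-in variable)
    policy=self._actor,    # assumed attribute names
    qf=self._critic,
    env=self._env,
)
logger.save_itr_params(itr, params)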

self._train_op = self._optimizer.apply_gradients(
    zip(grads, self.trainable_vars))

def _build_net(self, state):
Collaborator

add support for user-defined number and sizes of hidden layers, e.g. hidden_layers=[64, 64]
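
A small sketch of that interface (names are assumptions):

# hypothetical: the network architecture becomes caller-configurable
def __init__(self, env, hidden_layers=(64, 64), **kwargs):
    self._hidden_layers = list(hidden_layers)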

self._train_net()
self.save_session(episode_step)

def load_session(self):
Collaborator

We don't need this code if everything is serializable. Make sure you can run your implementation with run_experiment_lite.

action_range=(-1, 1),
actor_lr=1e-4,
critic_lr=1e-3,
reward_scale=1,
Collaborator

reward_scale should be called discount (in accordance with the other implementations). A value of 1 seems strange; are you sure it isn't <1, e.g. 0.99?

Author

In the paper and in the baselines implementation, this value is set to 1.

gamma=0.99,
tau=0.001,
observation_range=(-5, 5),
action_range=(-1, 1),
Collaborator

Assume that the algorithm always gets a normalized environment, where action_range is always in [-1, 1]. We need to implement normalization for gym Envs anyway, as described in #64.
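
For illustration, a sketch of how a gym environment is typically wrapped on the rllab side (see #64 for the gym normalization work):

from rllab.envs.gym_env import GymEnv
from rllab.envs.normalized_env import normalize

# normalize() rescales the action space, so the algorithm can assume actions in [-1, 1]
env = normalize(GymEnv('Pendulum-v0'))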

soft_updates.append(
    tf.assign(target_var, (1. - tau) * target_var + tau * var))
assert len(soft_updates) == len(vars)
self._update_paras = tf.group(*soft_updates)
Collaborator

Make sure you set a meaningful name on every TensorFlow operation
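
A sketch of the same logic with explicit names (the names themselves are only suggestions):

soft_updates.append(
    tf.assign(target_var, (1. - tau) * target_var + tau * var,
              name='soft_update'))
assert len(soft_updates) == len(vars)
self._update_paras = tf.group(*soft_updates, name='soft_update_targets')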


def _fc(x,
        output_dim,
        weight_initializer=tf.contrib.layers.xavier_initializer(uniform=False),
Collaborator

initialize weights based on seed from get_seed()

Author

It seems these weights are not initialized with a seed.
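
xavier_initializer does take an optional seed argument, so one way to honor the global seed would be (a sketch, assuming get_seed comes from rllab.misc.ext as in the launcher diff below):

from rllab.misc import ext
import tensorflow as tf

# seed the initializer explicitly instead of relying on the global graph seed
weight_initializer = tf.contrib.layers.xavier_initializer(
    uniform=False, seed=ext.get_seed())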

RANDOM_SEED = 1234

np.random.seed(RANDOM_SEED)
tf.set_random_seed(RANDOM_SEED)
Collaborator

@eric-heiden eric-heiden May 17, 2018

Remove these lines and call set_seed(seed) instead.
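
i.e. roughly (a sketch, assuming set_seed comes from rllab.misc.ext):

from rllab.misc.ext import set_seed

# one call replaces the separate numpy/TF seeding; the seed would typically
# come from the experiment configuration
set_seed(seed)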

tf.set_random_seed(RANDOM_SEED)

env = gym.make('Pendulum-v0')
env.seed(RANDOM_SEED)
Collaborator

env.seed(RANDOM_SEED)

action_dim = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(
Collaborator

the noise should also be based on a seed
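
One way to do that, sketched below, is to give the noise process its own seeded RandomState so exploration is reproducible for a given seed (this class is illustrative, not the one used in the PR):

import numpy as np

class SeededOrnsteinUhlenbeckNoise(object):
    """Ornstein-Uhlenbeck process driven by its own seeded RandomState."""

    def __init__(self, mu, sigma=0.2, theta=0.15, dt=1e-2, seed=None):
        self.mu = np.asarray(mu, dtype=np.float64)
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.rng = np.random.RandomState(seed)  # all randomness comes from here
        self.reset()

    def reset(self):
        self.x = np.copy(self.mu)

    def __call__(self):
        dx = (self.theta * (self.mu - self.x) * self.dt +
              self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.mu.shape))
        self.x = self.x + dx
        return self.x

# usage sketch: action_noise = SeededOrnsteinUhlenbeckNoise(np.zeros(action_dim), seed=ext.get_seed())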

@@ -64,12 +65,12 @@ def _build_net(self, state):
    with tf.variable_scope(self.name) as scope:

        with tf.variable_scope('fc1'):
-            fc1_out = _fc(state, 64)
+            fc1_out = _fc(state, self._hidden_layers[0])
Collaborator

There could be more than 2 hidden layers, so you should create fc layers dynamically based on the array size.
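
A sketch of how that could look (variable names follow the snippet above; everything else is assumed):

# build one fc block per entry in self._hidden_layers instead of hard-coding fc1/fc2
out = state
for i, size in enumerate(self._hidden_layers):
    with tf.variable_scope('fc{}'.format(i + 1)):
        out = _fc(out, size)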

@@ -166,13 +168,13 @@ def _build_net(self, state, action, reuse=False):
        scope.reuse_variables()

        with tf.variable_scope('fc1'):
-            fc1_out = _fc(state, 64)
+            fc1_out = _fc(state, self._hidden_layers[0])
Collaborator

There could be more than 2 hidden layers, so you should create fc layers dynamically based on the array size.


env = gym.make('Pendulum-v0')
-env.seed(RANDOM_SEED)
+env.seed(ext.get_seed())

action_dim = env.action_space.shape[-1]
action_noise = OrnsteinUhlenbeckActionNoise(
Collaborator

Please add another launcher script which runs your code via run_experiment_lite. It should work like this:

def run_task(*_):
    # initialize actor, critic, noise, env, etc.
    env = ...

    algo = DDPG(env=env, ...)
    algo.train()

run_experiment_lite(
    run_task,
    n_parallel=20,
    plot=False,
)

We need to make sure your DDPG implementation is serializable.

Labels: algos (New algorithms)

3 participants