Op not fetchable when wrapping optimizer with SyncReplicasOptimizer #1
It seems like some ops that are defined inside the tf.cond are being fetched for synchronizing between workers. I'm surprised that tf.train.SyncReplicasOptimizer actually fetches anything, though, as the distributed session should span all machines. I could imagine that it will be difficult to restructure the code so that all synchronization points are outside of any tf.cond statements. @mrry Do you have an idea how to solve this?
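For context, here is a minimal sketch of the underlying constraint, assuming TF 1.x graph mode; it is illustrative only and is not the actual agents/ppo code:

```python
# Minimal sketch (assuming TF 1.x): an op created inside a tf.cond branch is
# marked as not fetchable, so session.run on that op itself should fail with
# an error like the one in this issue's title. Not the agents/ppo code.
import tensorflow as tf

v = tf.Variable(0)

def true_fn():
    return tf.assign_add(v, 1)  # this AssignAdd op lives inside the cond context

def false_fn():
    return tf.identity(v)

out = tf.cond(tf.constant(True), true_fn, false_fn)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(out)  # fetching the cond's output tensor is fine
    inner_op = [op for op in tf.get_default_graph().get_operations()
                if op.type == "AssignAdd"][0]
    sess.run(inner_op)  # expected to raise: Operation ... is not fetchable
```

If SyncReplicasOptimizer's hooks fetch internal ops that end up inside such a context, that would explain the error.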
I guess one approach would be to just use MPI, as is done in OpenAI Baselines' MpiAdam.
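For what it's worth, a rough sketch of that alternative, loosely in the spirit of Baselines' MpiAdam (the function name below is a placeholder, not the Baselines API): each worker averages a flat numpy gradient with an MPI Allreduce and applies the update locally, so no TensorFlow-level synchronization ops are ever fetched.

```python
# Rough sketch of MPI-based gradient averaging with mpi4py; loosely modeled on
# the idea behind OpenAI Baselines' MpiAdam, not its actual code.
import numpy as np
from mpi4py import MPI

def allreduce_mean(flat_grad):
    """Average a flat numpy gradient vector across all MPI workers."""
    comm = MPI.COMM_WORLD
    out = np.zeros_like(flat_grad)
    comm.Allreduce(flat_grad, out, op=MPI.SUM)
    return out / comm.Get_size()

# Per training step, on every worker (pseudocode):
#   avg_grad = allreduce_mean(local_flat_grad)
#   then apply avg_grad with a local Adam update
```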
Since the error message is:
...can you show the code that creates that op? This doesn't look like something the SyncReplicasOptimizer would create itself.
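(In case it helps answer that: in TF 1.x every Operation records the Python stack at graph-construction time, so something like the sketch below, with a hypothetical helper name, can point at the code that created the op named in the error.)

```python
# Hedged sketch, assuming TF 1.x graph mode: print the construction-time stack
# of the op named in the "not fetchable" error. The helper name is hypothetical.
import traceback
import tensorflow as tf

def print_creation_stack(op_name):
    op = tf.get_default_graph().get_operation_by_name(op_name)
    # op.traceback holds (filename, lineno, function, text) frames
    print("".join(traceback.format_list(op.traceback)))
```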
Sure, the code is here: https://github.com/cwbeitel/agents/blob/master/agents/ppo/algorithm.py; see also kubeflow/training-operator#159 (comment)
@mrry Would you mind taking a look at the code linked above, please?
Shoot, sorry, the links in the first comment above should have been tied to a specific commit. Anyway, the most recent commit also produces the error with sync_replicas=True; see the notebook for params and logs, as well as the ksonnet params for the job that produces the fetchable error. Also, this is running with TF v1.4.1:
INFO:tensorflow:Tensorflow version: 1.4.1
INFO:tensorflow:Tensorflow git version: v1.4.0-19-ga52c8d9
See kubeflow/training-operator#159 and the error reported there.
The error can be reproduced with --sync_replicas set to True (via task.py or by passing the param via the job YAML), using run-remote on a cluster deployed with deploy-gke; a sketch of what this flag typically wires up is below.
@jlewi @danijar
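For readers landing here, a hedged sketch of the kind of wrapping that sync_replicas=True implies, assuming TF 1.x; the optimizer choice, replica counts, and function name are placeholders, not the actual task.py or agents code:

```python
# Hedged sketch of wrapping an optimizer with tf.train.SyncReplicasOptimizer.
# Everything except the SyncReplicasOptimizer API itself is a placeholder.
import tensorflow as tf

def build_train_op(loss, global_step, sync_replicas, num_workers, is_chief):
    opt = tf.train.AdamOptimizer(1e-4)  # placeholder optimizer
    hooks = []
    if sync_replicas:
        opt = tf.train.SyncReplicasOptimizer(
            opt,
            replicas_to_aggregate=num_workers,
            total_num_replicas=num_workers)
        # The hook runs the optimizer's internal sync ops; if those ops end up
        # inside a tf.cond, that fetch appears to be what fails.
        hooks.append(opt.make_session_run_hook(is_chief))
    train_op = opt.minimize(loss, global_step=global_step)
    return train_op, hooks
```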