
Op not fetchable when wrapping optimizer with SyncReplicasOptimizer #1

Open
cwbeitel opened this issue Jan 7, 2018 · 6 comments

cwbeitel commented Jan 7, 2018

See kubeflow/training-operator#159 and the error reported there.

The error can be reproduced with --sync_replicas set to True (via task.py or by passing the param via the job YAML) using run-remote on a cluster deployed with deploy-gke.

@jlewi @danijar


danijar commented Jan 8, 2018

It seems like some ops that are defined inside the tf.cond are being fetched for synchronizing between workers. I'm surprised that tf.train.SyncReplicasOptimizer actually fetches anything though, as the distributed session should span all machines. I could imagine that it will be difficult to restructure the code so that all synchronization points are outside of any tf.cond statements. @mrry Do you have an idea how to solve this?
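
For context, here's a minimal TF 1.x sketch of the usual pattern for wrapping an optimizer with tf.train.SyncReplicasOptimizer (not the agents code; the tiny model and single-replica settings are placeholders just to keep it self-contained). The hook it creates is what runs the extra token-queue/init ops on top of the train op, and those ops have to be fetchable:

```python
import numpy as np
import tensorflow as tf  # written against TF 1.x

# Tiny stand-in model; only the wrapping pattern matters here.
x = tf.placeholder(tf.float32, [None, 1])
w = tf.get_variable('w', [1, 1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))

global_step = tf.train.get_or_create_global_step()
opt = tf.train.AdamOptimizer(1e-3)
# In a real job replicas_to_aggregate / total_num_replicas would equal the
# number of workers; 1 keeps this sketch runnable in a single process.
opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=1,
                                     total_num_replicas=1)
train_op = opt.minimize(loss, global_step=global_step)

# The hook runs the optimizer's sync/init ops; if any op it touches ends up
# inside a tf.cond or tf.while_loop context, it is marked as not fetchable.
sync_hook = opt.make_session_run_hook(is_chief=True)

with tf.train.MonitoredTrainingSession(is_chief=True, hooks=[sync_hook]) as sess:
    for _ in range(10):
        sess.run(train_op, feed_dict={x: np.ones([4, 1], dtype=np.float32)})
```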


cwbeitel commented Jan 8, 2018

I guess one approach would be to just use MPI, as is done in OpenAI Baselines' MpiAdam.
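
For reference, the MPI route would look roughly like the sketch below: a minimal gradient-averaging loop in the spirit of Baselines' MpiAdam, written directly against mpi4py/numpy rather than the actual Baselines or agents code, with a random vector standing in for the real gradient:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def mpi_average(local_grad):
    """Average a flat gradient vector across all MPI workers."""
    buf = np.zeros_like(local_grad)
    comm.Allreduce(local_grad, buf, op=MPI.SUM)
    return buf / comm.Get_size()

# Each worker computes its own gradient (random here as a stand-in),
# averages it with Allreduce, and applies the identical update locally,
# so replicas stay in sync without any TF-level sync ops to fetch.
params = np.zeros(10)
for step in range(100):
    local_grad = np.random.randn(10)
    params -= 1e-3 * mpi_average(local_grad)
```

Launched with something like `mpirun -np 4 python train.py`, every rank ends up applying the same averaged update.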


mrry commented Jan 9, 2018

Since the error message is:

ValueError: Operation u'end_episode/cond/cond/training/scan_1/while/Assign' has been marked as not fetchable.

...can you show the code that creates that op? This doesn't look like something the SyncReplicasOptimizer would do, since it doesn't generally create or consume the result of Assign ops, so for now I suspect the problem is in the user code.
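
For what it's worth, the "not fetchable" part can be reproduced in isolation with a sketch like the one below (TF 1.x, standalone, not the agents code): an Assign created inside a tf.while_loop body, which is what tf.scan expands to, is marked as not fetchable by the loop context, so fetching it directly fails the same way:

```python
import tensorflow as tf  # TF 1.x

v = tf.Variable(0.0)
captured = []  # keep a handle to the op created inside the loop body

def body(i):
    # This Assign lives inside the while context and is therefore unfetchable.
    assign = tf.assign(v, tf.to_float(i))
    captured.append(assign)
    with tf.control_dependencies([assign]):
        return i + 1

loop = tf.while_loop(lambda i: i < 5, body, [tf.constant(0)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(loop)         # fine: the loop output itself is fetchable
    sess.run(captured[0])  # ValueError: Operation ... has been marked as not fetchable
```

If SyncReplicasOptimizer (or anything else) ends up fetching an op that was constructed inside such a context, this is the error you get.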


cwbeitel commented Jan 9, 2018


danijar commented Jan 18, 2018

@mrry Would you mind taking a look at the code linked above, please?

cwbeitel commented

Shoot, sorry, the links in the first comment above should have been tied to a specific commit, i.e. the one that had run-remote, which could be used to reproduce the error.

Anyway, the most recent commit also produces the error with sync_replicas=True; see the notebook for params and logs, as well as the ksonnet params for the job that produces the fetchable error.

Also, this is running with TF v1.4.1:

INFO:tensorflow:Tensorflow version: 1.4.1
INFO:tensorflow:Tensorflow git version: v1.4.0-19-ga52c8d9
