-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
Comments
Issue-Label Bot is automatically applying the labels:
Please mark this comment with 👍 or 👎 to give our bot feedback! |
similar issue in tensorflow repo: tensorflow/tensorflow#30121 |
@johnugeorge @richardsliu @gaocegege Looks like tf-operator needs to support evaluator in addition to ps and worker. An example cluster spec that has all these roles:
On evaluator node, TF_CONFIG should be similar to the following in order to tell TensorFlow to use this node for performing model evaluation:
|
@terrytangyuan If evaluator contains TF_CONFIG cluster spec, some changes should be made in tensorflow 1.12 ( or above ) when train_distribute attribute in tf.estimator.RunConfig is set to None. Otherwise other node will wait for evaluator forever: |
@richardsliu @gaocegege I'm glad to take time to fix this problem if possible. |
LGTM |
�Hi,you mean the way TF_CONFIG environment set by tf-operator currently is correct? |
@meibenjin are you working on this? We are blocked on this too and we'd like to create a patch if you haven't already(would be glad to test your patch if there's an image). |
@meibenjin Welcome the PR. I think it is a problem, but one question: How about the old version TF? |
I think some changes would be made in old version TF if we add evaluator in TF_CONFIG cluster (remove evaluator from TF_CONFIG cluster_spec),In our test when train_distribute attribute in tf.estimator.RunConfig is set to None. Other nodes will wait for evaluator forever: |
@ashahab please see my reply to gaocegege ,We should think about the compatibility in old version TF like Tensorflow1.12。 |
@meibenjin Thanks for the reply. I will comment soon after a deep dive into the TF code. |
@terrytangyuan We already support it if there is no DistributionStrategy. DistributionStrategy requires a validate function call to validate that the evaluator should be in cluster spec. |
Yeah. It is what I worry about. All the replicas will wait for the evaluator session. I am not sure why distribute strategy needs such a validate function to keep evaluators in cluster_spec. Not sure if it is a bug or feature. /cc @terrytangyuan Do you have any idea about it? |
@gaocegege Not sure. It's probably due to some requirements in higher level APIs. We can bring this up in tensorflow/tensorflow#30121. |
I have a PR fix here: #1146 But this will run into the issue mentioned by @meibenjin. |
Please have a look at this tensorflow/tensorflow#27857 (comment). Master node is not officially supported. Because |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
hey, do you guys have a clear solution? I still came out with this problem. |
Issue-Label Bot is automatically applying the labels:
Please mark this comment with 👍 or 👎 to give our bot feedback! |
i have the same problem, is there one solution or workaround please? @terrytangyuan from estimator.run_config.py:
|
@pengyuan There is a workaround here. But it seems a little hacky. An estimator parses all the cluster information from the environment variable original
|
Evaluator was excluded when generatingTF_CONFIG environment in tf-operator, see:https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/tensorflow.go#L110
However, when use Estimator with distribute strategy ,TF 1.12 will raise an error:
similar error occured in TF 1.15:
TF code with distribute strategy (1 ps 1 chief 1 worker 1 evaluator):
Note: If train_distribute attribute in tf.estimator.RunConfig is set to None, it works well.
The text was updated successfully, but these errors were encountered: