[bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
Here is an example: https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator. We cannot run this example with a single worker under tf-operator. After deleting the env var manually, it runs with tf-operator.
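The manual workaround mentioned above can be sketched as follows (a hedged example: `TF_CONFIG` is the env var named in this thread, and clearing it before the Estimator is constructed is the step being described):

```python
import os

# Workaround from the thread: remove TF_CONFIG before creating the
# Estimator so it falls back to local (non-distributed) training.
# pop with a default is a no-op if the variable is already absent.
os.environ.pop("TF_CONFIG", None)
```

This has to run before `tf.estimator` reads the environment, since the cluster spec is resolved when the run config is built.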
I can fix it.

/assign @gaocegege
In PyTorch, the spec is validated to have 1 master with 0 or more workers.
@johnugeorge Will pytorch-operator set MASTER_ADDR in this master pod? |
I think so. But it would not matter even if it does.
Gotcha. TF_CONFIG has a side effect in this case. |
Some users want to use TFJob to run local training jobs with Estimator. They will have a config like this:
A single worker will be created with TF_CONFIG set:
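For illustration, the TF_CONFIG set on that lone worker might look like the following (hypothetical values: the job name, address, and port are illustrative, not the operator's actual output):

```python
import json
import os

# Hypothetical TF_CONFIG for a 1-worker, 0-PS TFJob. The worker address
# is made up for illustration; tf-operator fills in real pod addresses.
tf_config = {
    "cluster": {"worker": ["myjob-worker-0:2222"]},
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# The Estimator reads this env var when its run config is built, so even
# a single-worker job gets treated as a distributed cluster.
print(json.loads(os.environ["TF_CONFIG"])["task"])
# → {'type': 'worker', 'index': 0}
```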
The Estimator will try to load TF_CONFIG from the env var and will pick up this cluster config.
But we do not expect this: TF_CONFIG should not be set for this scenario.
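The condition behind the proposed fix can be sketched like this (a hypothetical helper, not the actual tf-operator change, which lives in the operator's own codebase):

```python
def needs_tf_config(num_masters: int, num_workers: int, num_ps: int) -> bool:
    """Hypothetical check: only set TF_CONFIG when the job is actually
    distributed. A lone worker with no PS and no master should train
    locally, so TF_CONFIG must be omitted for it."""
    return not (num_masters == 0 and num_workers <= 1 and num_ps == 0)
```

With this guard, the 1-worker/0-PS case from the issue title skips TF_CONFIG entirely, while any job with a master, multiple workers, or at least one PS still gets it.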