
[bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078

Closed
gaocegege opened this issue Sep 11, 2019 · 7 comments · Fixed by #1080


gaocegege commented Sep 11, 2019

Some users want to use TFJob to run local training jobs with the Estimator. They have a config like this:

spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - python cifar10_main.py --data-dir=/clever/input/datasets/cifar10
              --job-dir=/tmp/cifar10 --num-gpus=0 --train-steps=1000

One worker will be created with TF_CONFIG:

{
    "cluster": {
        "worker": [
            "tensorflow-190911-rr84s-worker-0.tsk.svc:2222"
        ]
    },
    "task": {
        "type": "worker",
        "index": 0
    },
    "environment": "cloud"
}
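For context, here is a minimal sketch (plain Python, no TensorFlow required) of how an Estimator-style loader reads this variable. The `parse_tf_config` helper is illustrative, not TensorFlow API, but the parsing mirrors what the Estimator's run config does: any non-empty cluster spec makes the job look distributed.

```python
import json
import os

def parse_tf_config():
    # The Estimator reads the TF_CONFIG env var; a missing or empty value
    # means local training. (Illustrative helper, not TensorFlow API.)
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    # Any non-empty cluster spec makes the config look "distributed",
    # even when it contains a single worker and no PS.
    is_distributed = bool(cluster)
    return cluster, task, is_distributed

# Simulate the env var set by tf-operator for the 1-worker job above:
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["tensorflow-190911-rr84s-worker-0.tsk.svc:2222"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
})
cluster, task, is_distributed = parse_tf_config()
print(is_distributed)  # True, although the user intended local training
```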

The Estimator loads TF_CONFIG from the environment variable and picks up this cluster spec.

But this is not what we expect: with a non-empty cluster spec, the Estimator treats the job as distributed instead of running plain local training. We should not set TF_CONFIG for this scenario.
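One way to sketch the operator-side fix (hypothetical Python pseudocode; tf-operator itself is written in Go, and the actual change landed via the linked PR): only set TF_CONFIG when the cluster contains more than one process.

```python
def should_set_tf_config(replicas):
    # replicas: map of replica type -> count, e.g. {"Worker": 1}.
    # Hypothetical helper: set TF_CONFIG only when the job is truly
    # distributed, i.e. more than one process in the cluster.
    workers = replicas.get("Worker", 0)
    ps = replicas.get("PS", 0)
    others = sum(v for k, v in replicas.items() if k not in ("Worker", "PS"))
    return workers + ps + others > 1

print(should_set_tf_config({"Worker": 1}))           # False: local training
print(should_set_tf_config({"Worker": 1, "PS": 1}))  # True: distributed
```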

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.90.


gaocegege commented Sep 11, 2019

Here is an example: https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator

We cannot run the example with one worker under tf-operator. After deleting the env var manually, it runs fine.
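Until the operator fix lands, the manual workaround mentioned above can also be done from the training script itself, before the Estimator is constructed:

```python
import os

# Workaround: drop the env var injected by tf-operator so the
# Estimator falls back to local (non-distributed) training.
os.environ.pop("TF_CONFIG", None)
print("TF_CONFIG" in os.environ)  # False
```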

@gaocegege

/cc @johnugeorge @richardsliu

I can fix it

/assign @gaocegege

@gaocegege gaocegege changed the title [bug] Cannot initialize the training job when the user uses 1 worker and 0 PS [bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS Sep 11, 2019
@johnugeorge

In PyTorch, the validation allows 1 Master with 0 or more workers.

@gaocegege

@johnugeorge Will pytorch-operator set MASTER_ADDR in this master pod?

@johnugeorge

I think so. But it would not matter even if it does.

@gaocegege

Gotcha. TF_CONFIG has a side effect in this case.
