In a TFJob, when the workers are Completed, I want the PS to be Completed too. How can I do this? #657

Closed
nosmatch opened this issue Jun 14, 2018 · 10 comments

@nosmatch

nosmatch commented Jun 14, 2018

{
    "apiVersion": "kubeflow.org/v1alpha2",
    "kind": "TFJob",
    "metadata": {
        "name": "fengzhu-tf-v2-2"
    },
    "spec": {
        "tfReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "command": [
                                    "python",
                                    "/distribute.py"
                                ],
                                "image": "registry.v2.wx.service.mogujie.org/public/tinytf_baseline:201806131831",
                                "name": "tensorflow",
                                "ports": [
                                    {
                                        "containerPort": 2222,
                                        "name": "tfjob-port"
                                    }
                                ],
                                "resources": {}
                            }
                        ]
                    }
                }
            },
            "PS": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "command": [
                                    "python",
                                    "/distribute.py"
                                ],
                                "image": "registry.v2.wx.service.mogujie.org/public/tinytf_baseline:201806131831",
                                "name": "tensorflow",
                                "ports": [
                                    {
                                        "containerPort": 2222,
                                        "name": "tfjob-port"
                                    }
                                ],
                                "resources": {}
                            }
                        ]
                    }
                }
            },
            "Worker": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "command": [
                                    "python",
                                    "/distribute.py"
                                ],
                                "image": "registry.v2.wx.service.mogujie.org/public/tinytf_baseline:201806131831",
                                "name": "tensorflow",
                                "ports": [
                                    {
                                        "containerPort": 2222,
                                        "name": "tfjob-port"
                                    }
                                ],
                                "resources": {}
                            }
                        ]
                    }
                }
            }
        }
    }
}

And for the PS:
import tensorflow as tf

server = tf.train.Server(cluster, job_name=local_type, task_index=local_index)
...
if local_type == "ps":
    print("local server is ps")
    # join() blocks indefinitely, so the PS process never exits on its own.
    server.join()

@nosmatch
Author

@gaocegege

@gaocegege
Member

I do not think you can do it, because the PS is designed to be long-running in TF.
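Since `server.join()` never returns, the PS process could only exit if the training script replaced that call with its own wait. A minimal sketch of that idea follows, assuming each worker sends a completion signal and the PS counts them. It uses the standard library `queue.Queue` and threads as a stand-in for whatever cross-process channel a real job would need (for example, a queue shared through the TF cluster); the names `wait_for_workers`, `worker`, and `run_demo` are hypothetical and not part of TensorFlow or the TFJob operator.

```python
import queue
import threading


def wait_for_workers(done_queue, num_workers, timeout=None):
    """Block until `num_workers` completion signals arrive.

    Hypothetical replacement for an indefinite join on the PS side:
    the caller returns (and the process can exit) once every worker
    has reported that it finished.
    """
    finished = []
    for _ in range(num_workers):
        # Each worker puts its task index on the queue when training ends.
        finished.append(done_queue.get(timeout=timeout))
    return sorted(finished)


def worker(done_queue, task_index):
    # ... training would run here ...
    done_queue.put(task_index)  # signal completion to the PS


def run_demo(num_workers=3):
    # Stand-in for a shared queue; a real job needs a cross-process channel.
    done_queue = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(done_queue, i))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    finished = wait_for_workers(done_queue, num_workers, timeout=5)
    for t in threads:
        t.join()
    return finished


if __name__ == "__main__":
    print(run_demo())
```

With this shape, the PS branch of the script would call the wait function instead of joining the server, so its container exits normally after the last worker reports in. Whether the operator then marks the PS pod as Completed depends on the TFJob version, as discussed in the rest of this thread.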

@nosmatch
Author

But the PS belongs to one TFJob, and when I create a new TFJob, a new PS is created. As I understand it, when a TFJob completes, all the resources it used are deleted, right?

@gaocegege
Member

In v1alpha2, we have not implemented the logic to delete all pods and services after the TFJob completes. Even if we implement it, we would not mark the PS as Completed; we would just delete it.

@nosmatch
Author

Which component deletes the PS server when the TFJob completes? The master?

@nosmatch
Author

Thank you for your reply.

@gaocegege
Member

The controller for v1alpha1 will do it.

@nosmatch
Author

What about v1alpha2?

@gaocegege
Member

We will investigate if there is another way to do the same thing without deletions. Ref #661

@gaocegege
Member

You could use the features in #691 to control the behaviour.
