
status: Support chief #637

Merged: 2 commits merged into kubeflow:master from gaocegege:chief on Jun 13, 2018

Conversation

@gaocegege (Member) commented Jun 12, 2018

Signed-off-by: Ce Gao <[email protected]>


@coveralls commented Jun 12, 2018

Coverage decreased (-0.5%) to 54.436% when pulling e0c19a8 on gaocegege:chief into 2895654 on kubeflow:master.

@gaocegege (Member Author)

/retest

@gaocegege (Member Author)

Maybe we need an Estimator example to test it.
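For reference, a minimal sketch (not part of this PR) of the kind of Estimator program that could exercise the chief role; tf.estimator.train_and_evaluate picks up the TF_CONFIG environment variable that the v1alpha2 operator injects into each replica, so the chief task writes checkpoints while workers and parameter servers take their roles from it:

import numpy as np
import tensorflow as tf

def input_fn():
    # Tiny synthetic regression dataset so the job finishes quickly.
    x = np.random.rand(100, 1).astype(np.float32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, 2.0 * x)).repeat().batch(10)

estimator = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column("x")],
    model_dir="/tmp/chief-test")  # hypothetical model dir

# train_and_evaluate reads TF_CONFIG to decide which task is the coordinator.
tf.estimator.train_and_evaluate(
    estimator,
    tf.estimator.TrainSpec(input_fn=input_fn, max_steps=200),
    tf.estimator.EvalSpec(input_fn=input_fn, steps=10))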

@gaocegege (Member Author)

/retest

@jlewi (Contributor) commented Jun 12, 2018

The test failure should be fixed by #642; can you resync and try it out?

@jlewi (Contributor) commented Jun 12, 2018

I tried this, and it doesn't look like the conditions are getting updated or the pods are being cleaned up:

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  annotations:
    ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/+yQz8oaMRTF932Ms44OSqGQbVH6B1vR0i5E5BrvOMHJH5I4RSTv/hHnE5Vv4wO4u3cm555zfmeQ1385RO0sJA7HLdet+z90YV91I2p9Q2MIHLTdQeLP9IfbQsBwoh0lgjzDkmFIGIqJw0DHgWo014M7cXkRPSm+M0AWiJ5VuZDqBftWK1p6VrF8+frt+2RahtD/iJAjgcAxUUhz12p1gsQv7jhAILHxLSUugutN5WwibTlEyFVZjaFSYQV/So2zEKg6CpVyO64aiptL6o0Lm77I0J+wFtCG9iV2L5Lj4Zf3PpBIbKMLlzICnWuPhmfuaFPvaMo4p9RA3pxu4suW13l9lV5z2lrvZ+SfIZs/XMtZYL58BDd+gXsO3L/fi5+TxSO8zy94z8DLOX96AwAA//8BAAD//6qI3IdKBAAA"}'
  clusterName: ""
  creationTimestamp: 2018-06-12T20:38:24Z
  generation: 0
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
  name: master-is-chief-v1alpha2
  namespace: kubeflow
  resourceVersion: "1080051"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/master-is-chief-v1alpha2
  uid: 8d883088-6e80-11e8-996c-42010af000af
spec:
  tfReplicaSpecs:
    CHIEF:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
    WORKER:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
status:
  conditions: null
  tfReplicaStatuses:
    CHIEF:
      succeeded: 1
    PS:
      active: 2
    WORKER:
      active: 4
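
The script referenced above ships in the master-is-chief-v1alpha2 ConfigMap and is not shown in this PR; a hedged sketch of the kind of check has_chief_or_master.py might perform is to parse the TF_CONFIG the operator injects and fail unless the cluster spec names a chief or master task:

import json
import os
import sys

tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf_config.get("cluster", {})

if "chief" in cluster or "master" in cluster:
    sys.exit(0)  # a coordinator task exists; the test passes
sys.exit("TF_CONFIG names neither a chief nor a master task: %r" % cluster)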

@gaocegege (Member Author)

It works on my side. Could you please try again?

Also, what do you mean by the pods being cleaned up? Do you want all pods to be deleted after the job succeeds?
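
One way to verify whether the operator left pods behind after completion is to list them by label (a sketch using the kubernetes Python client; the tf_job_name label selector is an assumption about how the operator labels its pods):

from kubernetes import client, config

config.load_kube_config()
pods = client.CoreV1Api().list_namespaced_pod(
    namespace="kubeflow",
    label_selector="tf_job_name=master-is-chief-v1alpha2")  # assumed label
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)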

@gaocegege gaocegege changed the title WIP: status: Support chief status: Support chief Jun 13, 2018
pkg: Support chief

Signed-off-by: Ce Gao <[email protected]>

status: Avoid conflicts

Avoid the situation where both Failed and Succeeded appear in conditions

Signed-off-by: Ce Gao <[email protected]>

@gaocegege (Member Author)

/assign @jlewi

@jlewi (Contributor) commented Jun 13, 2018

Thanks. Go ahead and merge it; I'll continue to test things out.
/lgtm
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit eebc838 into kubeflow:master Jun 13, 2018
@gaocegege gaocegege deleted the chief branch June 13, 2018 05:24
yph152 pushed a commit to yph152/tf-operator that referenced this pull request Jun 18, 2018
* pkg: Support chief

Signed-off-by: Ce Gao <[email protected]>

* status: Avoid conflicts

Avoid the situation where both Failed and Succeeded appear in conditions

Signed-off-by: Ce Gao <[email protected]>
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018
* pkg: Support chief

Signed-off-by: Ce Gao <[email protected]>

* status: Avoid conflicts

Avoid the situation where both Failed and Succeeded appear in conditions

Signed-off-by: Ce Gao <[email protected]>