
TFJob not marked as success when master exits but not workers #634

Closed
jlewi opened this issue Jun 11, 2018 · 12 comments


@jlewi
Contributor

jlewi commented Jun 11, 2018

I ran the simple TFJob we use for smoke tests but modified it to have PS, Workers, and Master
https://github.com/kubeflow/tf-operator/blob/master/test/workflows/components/simple_tfjob.jsonnet

The master is marked as succeeded but the job is still marked as running. See below.

This is different behavior from v1alpha1 and seems like a regression.

@gaocegege @ScorpioCPH

  apiVersion: kubeflow.org/v1alpha2
  kind: TFJob
  metadata:
    annotations:
      ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/8yQzUosMRCF9/cxzjo9f8zi3jzAXQjK4IguRKQmXWlDdychVY7IkHeX+APqE8yuOMU5fOecQDnccpGQIizG5wP7Kb0sUhmWxzVN+Yk2MBhD7GFx8/8iHWAws1JPSrAnRJoZFhLmPHG3Wq1h3jXJ5PhbZKcs2oXoC6EaSGbX7OqvOU/B0T6zk6ZckiiXdpWPj8CuDQqLUtFdmoJ7hcUVH7nAQHnOEyk3w1eoS1EpRC4Ce39CmGloKIMri5CW6rsUu/GvdH0afEr9Uv2jUOO3vfu33Xr/2QEWylFSaQVQH2qtBrv9T7bNGbHdpTL+3m57Nny11j9vAAAA//8BAAD//2bf3XNxAgAA"}'
    clusterName: ""
    creationTimestamp: 2018-06-11T22:01:43Z
    generation: 0
    labels:
      app.kubernetes.io/deploy-manager: ksonnet
    name: simple-001
    namespace: kubeflow-test-infra
    resourceVersion: "864756"
    selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow-test-infra/tfjobs/simple-001
    uid: 06ce7f5a-6dc3-11e8-996c-42010af000af
  spec:
    tfReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
      PS:
        replicas: 2
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
      Worker:
        replicas: 4
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
  status:
    conditions:
    - lastTransitionTime: 2018-06-11T22:01:43Z
      lastUpdateTime: 2018-06-11T22:01:43Z
      message: TFJob simple-001 is created.
      reason: TFJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2018-06-11T22:01:48Z
      lastUpdateTime: 2018-06-11T22:01:48Z
      message: TFJob simple-001 is running.
      reason: TFJobRunning
      status: "True"
      type: Running
    startTime: 2018-06-11T22:37:16Z
    tfReplicaStatuses:
      Master:
        succeeded: 1
      PS:
        active: 2
      Worker:
        active: 4

@jlewi
Contributor Author

jlewi commented Jun 11, 2018

The problem appears to be here:
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/controller_status.go#L45

The code doesn't appear to implement the termination policy logic we discussed for the v1alpha2 spec. In particular:

If there is a chief, the job should be marked as completed when the chief exits.
If there is no chief, worker 0 should be considered the chief.
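
For illustration, here is a minimal sketch of that termination-policy check (this is not the actual controller code; the types, maps, and names are hypothetical stand-ins for what the controller tracks in tfReplicaStatuses):

package main

import "fmt"

// ReplicaStatus is a simplified, hypothetical stand-in for the per-replica
// counts the controller keeps in tfReplicaStatuses.
type ReplicaStatus struct {
	Active    int32
	Succeeded int32
}

// jobSucceeded sketches the policy described above: if the job defines a
// Chief (or Master) replica, the job is done as soon as that replica
// succeeds; otherwise worker 0 acts as the chief. Per-pod indexing isn't
// modeled in this sketch, so the no-chief case is approximated by all
// workers having succeeded.
func jobSucceeded(specs map[string]int32, statuses map[string]ReplicaStatus) bool {
	for _, chief := range []string{"Chief", "Master"} {
		if _, ok := specs[chief]; ok {
			return statuses[chief].Succeeded > 0
		}
	}
	return statuses["Worker"].Succeeded >= specs["Worker"]
}

func main() {
	// Mirrors the status block above: master succeeded, PS and workers still active.
	specs := map[string]int32{"Master": 1, "PS": 2, "Worker": 4}
	statuses := map[string]ReplicaStatus{
		"Master": {Succeeded: 1},
		"PS":     {Active: 2},
		"Worker": {Active: 4},
	}
	fmt.Println(jobSucceeded(specs, statuses)) // prints true: the job should be marked Succeeded
}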

@jlewi
Contributor Author

jlewi commented Jun 11, 2018

Support for exit policies was discussed here in the original proposal
kubeflow/community#30 (comment)

@gaocegege
Member

We have no logic for master replicas right now, so it's not just that we're missing support for exit policies. I think it could be added in 0.3. My opinion is to focus on making PS/workers highly available. WDYT?

jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
* The tests are currently disabled because they aren't passing yet because
  termination policy isn't handled correctly (kubeflow#634)

* Changed the v1alpha2 test to use the same smoke test as used by v1alpha1 as
  opposed to using mnist.
  mnist is causing problems because of issues downloading the data;
  see kubeflow/kubeflow#974

* We want a simpler test that allows for more direct testing of the distributed
  communication pattern
* Also mnist is expensive in that it tries to download data.

* Add a parameter tfJobVersion to the deploy script so we can control
  whether we deploy v1alpha1 or v1alpha2

* Parameterize the E2E test workflow by the TFJob version we want to run.

* update test-app - We need to pull in a version of the app which
  has the TFJobVersion flag.

* Create a script to regenerate the test-app for future use.

Related to kubeflow#589
gaocegege self-assigned this Jun 12, 2018
@gaocegege
Member

And FYI, the master spec uses Chief instead of Master.

@gaocegege
Member

gaocegege commented Jun 12, 2018

I implemented the logic in #637, and the code will generate a cluster spec like this:

{'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}

Thus we need to write an example for this case.
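
For illustration, a minimal sketch of how such a cluster spec could be assembled from the replica counts (this is not the code from #637; buildClusterSpec and the host naming are made up for the example):

package main

import "fmt"

// buildClusterSpec maps replica counts to the role -> addresses layout shown
// above, assigning sequential host names on the default port 2222.
func buildClusterSpec(chief, ps, workers int) map[string][]string {
	spec := map[string][]string{}
	host := 0
	add := func(role string, n int) {
		for i := 0; i < n; i++ {
			spec[role] = append(spec[role], fmt.Sprintf("host%d:2222", host))
			host++
		}
	}
	add("chief", chief)
	add("ps", ps)
	add("worker", workers)
	return spec
}

func main() {
	// 1 chief, 2 PS, and 3 workers reproduces the layout in the comment above.
	fmt.Println(buildClusterSpec(1, 2, 3))
}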

@jlewi
Contributor Author

jlewi commented Jun 12, 2018

I don't think we can wait for 0.3 to implement the exit policy. Without it, most TF jobs won't work, which would be a major regression, so we effectively couldn't launch v1alpha2 in 0.2; that would be a major blow to our plans to be v1 by the end of the year.

Lots of TF jobs either have an explicit chief or use worker 0 as the chief.

Can you open up an issue on high availability? It's not clear to me what this means or why it's important.

@jlewi
Contributor Author

jlewi commented Jun 12, 2018

/cc @ddysher @DjangoPeng

@gaocegege
Member

OK, SGTM. We have 3 days before the deadline, and I will work on it.

jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
k8s-ci-robot pushed a commit that referenced this issue Jun 13, 2018
* Changes to support v1alpha2 testing in presubmits.

* Fix versionTag logic; we need to allow for case where versionTag is an
empty string.
@gaocegege
Member

I think it is closed by #637

jlewi added a commit to jlewi/k8s that referenced this issue Jun 14, 2018
@DjangoPeng
Member

Thanks! @gaocegege

jlewi added a commit to jlewi/k8s that referenced this issue Jun 15, 2018
* Fix kubeflow#634

* Speedup the E2E test by running the build and setup cluster steps in parallel
  * To do this we split the setup step into two steps 1. setting up the cluster
    and 2. setting up Kubeflow.

    Fix kubeflow#659

* Shorten the name of the workflow for v1alpha2
   * Otherwise the label for the workflow pod becomes too long and argo
     can't run it.

* Pin the test worker image so that we don't get broken when someone updates
  the latest image
  * Make it a parameter in the prow_config.yaml

* Use a file lock to ensure only one instance of test_runner is modifying
  the ksonnet app at a time; this should help with various test flakes.
k8s-ci-robot pushed a commit that referenced this issue Jun 16, 2018
@372046933

@gaocegege Can you provide a sample training job with Chief, Master, and Worker?

@gaocegege
Member

@372046933 I do not really train a model; I just use busybox to test the functionality.

yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018