
TFJob not marked as success when master exits but not workers #634

Closed
jlewi opened this issue Jun 11, 2018 · 12 comments


@jlewi
Contributor

jlewi commented Jun 11, 2018

I ran the simple TFJob we use for smoke tests but modified it to have PS, Workers, and Master
https://github.com/kubeflow/tf-operator/blob/master/test/workflows/components/simple_tfjob.jsonnet

The master is marked as succeeded but the job is still marked as running. See below.

This is different behavior from v1alpha1 and seems like a regression.

@gaocegege @ScorpioCPH

  apiVersion: kubeflow.org/v1alpha2
  kind: TFJob
  metadata:
    annotations:
      ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/8yQzUosMRCF9/cxzjo9f8zi3jzAXQjK4IguRKQmXWlDdychVY7IkHeX+APqE8yuOMU5fOecQDnccpGQIizG5wP7Kb0sUhmWxzVN+Yk2MBhD7GFx8/8iHWAws1JPSrAnRJoZFhLmPHG3Wq1h3jXJ5PhbZKcs2oXoC6EaSGbX7OqvOU/B0T6zk6ZckiiXdpWPj8CuDQqLUtFdmoJ7hcUVH7nAQHnOEyk3w1eoS1EpRC4Ce39CmGloKIMri5CW6rsUu/GvdH0afEr9Uv2jUOO3vfu33Xr/2QEWylFSaQVQH2qtBrv9T7bNGbHdpTL+3m57Nny11j9vAAAA//8BAAD//2bf3XNxAgAA"}'
    clusterName: ""
    creationTimestamp: 2018-06-11T22:01:43Z
    generation: 0
    labels:
      app.kubernetes.io/deploy-manager: ksonnet
    name: simple-001
    namespace: kubeflow-test-infra
    resourceVersion: "864756"
    selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow-test-infra/tfjobs/simple-001
    uid: 06ce7f5a-6dc3-11e8-996c-42010af000af
  spec:
    tfReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
      PS:
        replicas: 2
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
      Worker:
        replicas: 4
        restartPolicy: Never
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
  status:
    conditions:
    - lastTransitionTime: 2018-06-11T22:01:43Z
      lastUpdateTime: 2018-06-11T22:01:43Z
      message: TFJob simple-001 is created.
      reason: TFJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: 2018-06-11T22:01:48Z
      lastUpdateTime: 2018-06-11T22:01:48Z
      message: TFJob simple-001 is running.
      reason: TFJobRunning
      status: "True"
      type: Running
    startTime: 2018-06-11T22:37:16Z
    tfReplicaStatuses:
      Master:
        succeeded: 1
      PS:
        active: 2
      Worker:
        active: 4

@jlewi
Contributor Author

jlewi commented Jun 11, 2018

The problem appears to be here:
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/controller_status.go#L45

The code doesn't appear to implement the termination policy logic we discussed for the v1alpha2 spec. In particular:

If there is a chief, the job should be marked as completed when the chief exits.
If there is no chief, worker 0 should be considered the chief.
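
For illustration, here is a minimal sketch of that termination-policy check (this is not the actual controller code; the types, maps, and names are hypothetical stand-ins for what the controller tracks in tfReplicaStatuses):

package main

import "fmt"

// ReplicaStatus is a simplified, hypothetical stand-in for the per-replica
// counts the controller keeps in tfReplicaStatuses.
type ReplicaStatus struct {
	Active    int32
	Succeeded int32
}

// jobSucceeded sketches the policy described above: if the job defines a
// Chief (or Master) replica, the job is done as soon as that replica
// succeeds; otherwise worker 0 acts as the chief. Per-pod indexing isn't
// modeled in this sketch, so the no-chief case is approximated by all
// workers having succeeded.
func jobSucceeded(specs map[string]int32, statuses map[string]ReplicaStatus) bool {
	for _, chief := range []string{"Chief", "Master"} {
		if _, ok := specs[chief]; ok {
			return statuses[chief].Succeeded > 0
		}
	}
	return statuses["Worker"].Succeeded >= specs["Worker"]
}

func main() {
	// Mirrors the status block above: master succeeded, PS and workers still active.
	specs := map[string]int32{"Master": 1, "PS": 2, "Worker": 4}
	statuses := map[string]ReplicaStatus{
		"Master": {Succeeded: 1},
		"PS":     {Active: 2},
		"Worker": {Active: 4},
	}
	fmt.Println(jobSucceeded(specs, statuses)) // prints true: the job should be marked Succeeded
}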

@jlewi
Contributor Author

jlewi commented Jun 11, 2018

Support for exit policies was discussed here in the original proposal
kubeflow/community#30 (comment)

@gaocegege
Member

We have no logic for master replicas right now, so it's not just that we're missing support for exit policies. I think it could be added in 0.3. My opinion is to focus on making PS/workers highly available. WDYT?

jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
* The tests are currently disabled because they aren't passing yet because
  termination policy isn't handled correctly (kubeflow#634)

* Changed the v1alpha2 test to use the same smoke test as used by v1alpha1 as
  opposed to using mnist.
  mnist is causing problems because of issues downloading the data;
  see kubeflow/kubeflow#974

* We want a simpler test that allows for more direct testing of the distributed
  communication pattern
* Also mnist is expensive in that it tries to download data.

* Add a parameter tfJobVersion to the deploy script so we can control
  whether we deploy v1alpha1 or v1alpha2

* Parameterize the E2E test workflow by the TFJob version we want to run.

* update test-app - We need to pull in a version of the app which
  has the TFJobVersion flag.

* Create a script to regenerate the test-app for future use.

Related to kubeflow#589
gaocegege self-assigned this Jun 12, 2018
@gaocegege
Member

And FYI, the master spec uses Chief instead of Master.

@gaocegege
Member

gaocegege commented Jun 12, 2018

I implemented the logic in #637, and the code will generate a cluster spec like this:

{'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}

Thus we need to write an example for this case.
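
For illustration, a minimal sketch of how such a cluster spec could be assembled from the replica counts (this is not the code from #637; buildClusterSpec and the host naming are made up for the example):

package main

import "fmt"

// buildClusterSpec maps replica counts to the role -> addresses layout shown
// above, assigning sequential host names on the default port 2222.
func buildClusterSpec(chief, ps, workers int) map[string][]string {
	spec := map[string][]string{}
	host := 0
	add := func(role string, n int) {
		for i := 0; i < n; i++ {
			spec[role] = append(spec[role], fmt.Sprintf("host%d:2222", host))
			host++
		}
	}
	add("chief", chief)
	add("ps", ps)
	add("worker", workers)
	return spec
}

func main() {
	// 1 chief, 2 PS, and 3 workers reproduces the layout in the comment above.
	fmt.Println(buildClusterSpec(1, 2, 3))
}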

@jlewi
Contributor Author

jlewi commented Jun 12, 2018

I don't think we can wait for 0.3 to implement the exit policy. Without it, most TF jobs won't work, which would be a major regression, so we effectively couldn't launch v1alpha2 in 0.2; that would be a major blow to our plans to be v1 by the end of the year.

Lots of TF jobs either have an explicit chief or use worker 0 as the chief.

Can you open up an issue on high availability? It's not clear to me what this means or why it's important.

@jlewi
Contributor Author

jlewi commented Jun 12, 2018

/cc @ddysher @DjangoPeng

@gaocegege
Member

OK, SGTM. We have 3 days before the deadline, and I will work on it.

jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
jlewi added a commit to jlewi/k8s that referenced this issue Jun 12, 2018
k8s-ci-robot pushed a commit that referenced this issue Jun 13, 2018
* Changes to support v1alpha2 testing in presubmits.

* Fix versionTag logic; we need to allow for case where versionTag is an
empty string.
@gaocegege
Member

I think it is closed by #637

jlewi added a commit to jlewi/k8s that referenced this issue Jun 14, 2018
@DjangoPeng
Member

Thanks! @gaocegege

jlewi added a commit to jlewi/k8s that referenced this issue Jun 15, 2018
* Fix kubeflow#634

* Speedup the E2E test by running the build and setup cluster steps in parallel
  * To do this we split the setup step into two steps 1. setting up the cluster
    and 2. setting up Kubeflow.

    Fix kubeflow#659

* Shorten the name of the workflow for v1alpha2
   * Otherwise the label for the workflow pod becomes too long and argo
     can't run it.

* Pin the test worker image so that we don't get broken when someone updates
  the latest image
  * Make it a parameter in the prow_config.yaml

* Use a file lock to ensure only one instance of test_runner is modifying
  the ksonnet app at a time; this should help with various test flakes.
k8s-ci-robot pushed a commit that referenced this issue Jun 16, 2018
@372046933

@gaocegege Can you provide a sample training job with Chief, Master, and Worker?

@gaocegege
Member

@372046933 I do not really train a model; I just use busybox to test the functionality.

yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
yph152 pushed a commit to yph152/tf-operator that referenced this issue Jun 18, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this issue Jul 9, 2018