Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If a TFJob spec is invalid mark the job as failed with an appropriate condition #815

Merged
merged 1 commit into from
Sep 1, 2018

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Aug 31, 2018

  • If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we
    want to update the TFJob status to indicate it failed.

  • We need to use the REST API to update the TFJob status because we won't
    be able to deserialize the json to TFJob.

  • Fix Surface invalid spec errors in a more user friendly way #755

  • Added an E2E test

  • I tested it manually and got the following result

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-08-31T23:37:14Z
  generation: 1
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
    ksonnet.io/component: invalid-tfjob
  name: invalid-tfjob
  namespace: kubeflow
  resourceVersion: "1826961"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob
  uid: ca7b4b02-ad76-11e8-be57-42010a8e0084
spec:
  notTheActualField:
    Ps:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
status:
  conditions:
  - lastTransitionTime: 2018-08-31T23:37:14Z
    lastUpdateTime: 2018-08-31T23:37:14Z
    message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to
      marshal the object to TFJob'
    reason: FailedInvalidTFJobSpec
    status: "True"
    type: Failed
  tfReplicaStatuses: null

This change is Reviewable

@jlewi
Copy link
Contributor Author

jlewi commented Aug 31, 2018

/assign @gaocegege
/assign @richardsliu

@jlewi jlewi changed the title If a TFJob spec is invalid mark the job as failed with an appropriate… If a TFJob spec is invalid mark the job as failed with an appropriate condition Aug 31, 2018
@TravisBuddy
Copy link

Travis tests have failed

Hey @jlewi,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

1st Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/tensorflow/job.go:1::warning: file is not goimported (goimports)
pkg/util/k8sutil/client.go:1::warning: file is not goimported (goimports)

3rd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/tensorflow/job.go:1::warning: file is not goimported (goimports)
pkg/util/k8sutil/client.go:1::warning: file is not goimported (goimports)

travis_time:end:041cafb8:start=1535759387961045059,finish=1535759533070855959,duration=145109810900

@coveralls
Copy link

coveralls commented Sep 1, 2018

Coverage Status

Coverage decreased (-1.3%) to 56.932% when pulling 685e744 on jlewi:invalid_spec into 1bc8425 on kubeflow:master.

@TravisBuddy
Copy link

Travis tests have failed

Hey @jlewi,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

2nd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/tensorflow/job.go:1::warning: file is not goimported (goimports)
pkg/util/k8sutil/client.go:1::warning: file is not goimported (goimports)

3rd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/tensorflow/job.go:1::warning: file is not goimported (goimports)
pkg/util/k8sutil/client.go:1::warning: file is not goimported (goimports)

travis_time:end:0aed1668:start=1535766190904099749,finish=1535766308575260949,duration=117671161200

@TravisBuddy
Copy link

Travis tests have failed

Hey @jlewi,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

2nd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/tensorflow/job.go:1::warning: file is not goimported (goimports)
pkg/util/k8sutil/client.go:1::warning: file is not goimported (goimports)

3rd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/util/k8sutil/client.go:1::warning: file is not goimported (goimports)
pkg/controller.v2/tensorflow/job.go:1::warning: file is not goimported (goimports)

travis_time:end:051e40d8:start=1535767586443320354,finish=1535767715411724086,duration=128968403732

@TravisBuddy
Copy link

Travis tests have failed

Hey @jlewi,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

2nd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/controller.v2/tensorflow/job.go:1::warning: file is not goimported (goimports)
pkg/util/k8sutil/client.go:1::warning: file is not goimported (goimports)

3rd Build

gometalinter --config=linter_config.json --vendor ./...
pkg/util/k8sutil/client.go:1::warning: file is not goimported (goimports)
pkg/controller.v2/tensorflow/job.go:1::warning: file is not goimported (goimports)

travis_time:end:12b57e08:start=1535768130181658937,finish=1535768240004950095,duration=109823291158

… condition.

* If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we
  want to update the TFJob status to indicate it failed.

* We need to use the REST API to update the TFJob status because we won't
  be able to deserialize the json to TFJob.

* Related to kubeflow#755

* I created invalid-tfjob.jsonnet which can be used in an E2E test but
  I haven't included the E2E test in this PR.

* I tested it manually and got the following result

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-08-31T23:37:14Z
  generation: 1
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
    ksonnet.io/component: invalid-tfjob
  name: invalid-tfjob
  namespace: kubeflow
  resourceVersion: "1826961"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob
  uid: ca7b4b02-ad76-11e8-be57-42010a8e0084
spec:
  notTheActualField:
    Ps:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
status:
  conditions:
  - lastTransitionTime: 2018-08-31T23:37:14Z
    lastUpdateTime: 2018-08-31T23:37:14Z
    message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to
      marshal the object to TFJob'
    reason: FailedInvalidTFJobSpec
    status: "True"
    type: Failed
  tfReplicaStatuses: null

* Add an E2E test; test_runner.py was getting overly complicated so I created
  a new main file to run the test and just call methods in test_runner.py as
  needed.
@jlewi
Copy link
Contributor Author

jlewi commented Sep 1, 2018

/test all

@gaocegege
Copy link
Member

/cc @johnugeorge

I think you might be interested in the PR.

/lgtm
/approve

I think creating a client every time we meet an invalid spec works well now but I will try to find a better way to implement it.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants