Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface invalid spec errors in a more user friendly way #755

Closed
jlewi opened this issue Jul 25, 2018 · 8 comments
Closed

Surface invalid spec errors in a more user friendly way #755

jlewi opened this issue Jul 25, 2018 · 8 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Jul 25, 2018

Right now if a user submits a TFJob with invalid spec the job will be created and nothing will happen.

The logs for the TFJob operator indicate that the spec is invalid but this isn't very discoverable. The only indication that the job is invalid is the fact there is no status.

I'd like the error to be surfaced when user does

kubectl describe tfjobs <job>
kubectl get tfjobs -o yaml <job>

Firing off a K8s events linked to the job should be doable. This should work as long as the metadata is accurate enough which I believe should be guaranteed if the job is being accepted.

See also #605 CRD validation

@gaocegege
Copy link
Member

The TFJob will get an event when the spec is invalid, I think the code is here: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/controller_tfjob.go#L28

If it does not have the event, it should be a bug.

@jlewi
Copy link
Contributor Author

jlewi commented Jul 30, 2018

I think the problem is that pass tfjob to the event recorder
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/controller_tfjob.go#L30

But I think TFJob will be nil because it was unable to successfully parse the unstructured object into TFJob. So we should be passing the unstructured object to the event recorder.

@gaocegege
Copy link
Member

@gaocegege
Copy link
Member

/assign @gaocegege

@gaocegege
Copy link
Member

I think we should add an e2e test for invalid spec.

@jlewi
Copy link
Contributor Author

jlewi commented Aug 31, 2018

Thanks @gaocegege I think this provides a good short term solution. But I'd really like to surface the error in the status of the TFJob as a failed condition so that it shows up as

kubectl get tfjob -o yaml

It looks like we can use the metaunstructured package to easily set the the status on the TFJob struct

SetNestedField(un, tfJobStatus, "status")

We then just need to issue an update call to the APIServer.

@gaocegege Do you know if there's a convenient method to do this?

We can't use TFJobController.updateTFJobStatus because we don't have a TFJob.

Is there a clientset for unstructured objects?

@jlewi
Copy link
Contributor Author

jlewi commented Aug 31, 2018

@jlewi
Copy link
Contributor Author

jlewi commented Aug 31, 2018

I'll take a stab at this.

jlewi added a commit to jlewi/k8s that referenced this issue Aug 31, 2018
… condition.

* If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we
  want to update the TFJob status to indicate it failed.

* We need to use the REST API to update the TFJob status because we won't
  be able to deserialize the json to TFJob.

* Related to kubeflow#755

* I created invalid-tfjob.jsonnet which can be used in an E2E test but
  I haven't included the E2E test in this PR.

* I tested it manually and got the following result

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-08-31T23:37:14Z
  generation: 1
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
    ksonnet.io/component: invalid-tfjob
  name: invalid-tfjob
  namespace: kubeflow
  resourceVersion: "1826961"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob
  uid: ca7b4b02-ad76-11e8-be57-42010a8e0084
spec:
  notTheActualField:
    Ps:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
status:
  conditions:
  - lastTransitionTime: 2018-08-31T23:37:14Z
    lastUpdateTime: 2018-08-31T23:37:14Z
    message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to
      marshal the object to TFJob'
    reason: FailedInvalidTFJobSpec
    status: "True"
    type: Failed
  tfReplicaStatuses: null
jlewi added a commit to jlewi/k8s that referenced this issue Sep 1, 2018
… condition.

* If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we
  want to update the TFJob status to indicate it failed.

* We need to use the REST API to update the TFJob status because we won't
  be able to deserialize the json to TFJob.

* Related to kubeflow#755

* I created invalid-tfjob.jsonnet which can be used in an E2E test but
  I haven't included the E2E test in this PR.

* I tested it manually and got the following result

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-08-31T23:37:14Z
  generation: 1
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
    ksonnet.io/component: invalid-tfjob
  name: invalid-tfjob
  namespace: kubeflow
  resourceVersion: "1826961"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob
  uid: ca7b4b02-ad76-11e8-be57-42010a8e0084
spec:
  notTheActualField:
    Ps:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
status:
  conditions:
  - lastTransitionTime: 2018-08-31T23:37:14Z
    lastUpdateTime: 2018-08-31T23:37:14Z
    message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to
      marshal the object to TFJob'
    reason: FailedInvalidTFJobSpec
    status: "True"
    type: Failed
  tfReplicaStatuses: null

* Add an E2E test; test_runner.py was getting overly complicated so I created
  a new main file to run the test and just call methods in test_runner.py as
  needed.
k8s-ci-robot pushed a commit that referenced this issue Sep 1, 2018
… condition. (#815)

* If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we
  want to update the TFJob status to indicate it failed.

* We need to use the REST API to update the TFJob status because we won't
  be able to deserialize the json to TFJob.

* Related to #755

* I created invalid-tfjob.jsonnet which can be used in an E2E test but
  I haven't included the E2E test in this PR.

* I tested it manually and got the following result

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-08-31T23:37:14Z
  generation: 1
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
    ksonnet.io/component: invalid-tfjob
  name: invalid-tfjob
  namespace: kubeflow
  resourceVersion: "1826961"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob
  uid: ca7b4b02-ad76-11e8-be57-42010a8e0084
spec:
  notTheActualField:
    Ps:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        spec:
          containers:
          - image: busybox
            name: tensorflow
status:
  conditions:
  - lastTransitionTime: 2018-08-31T23:37:14Z
    lastUpdateTime: 2018-08-31T23:37:14Z
    message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to
      marshal the object to TFJob'
    reason: FailedInvalidTFJobSpec
    status: "True"
    type: Failed
  tfReplicaStatuses: null

* Add an E2E test; test_runner.py was getting overly complicated so I created
  a new main file to run the test and just call methods in test_runner.py as
  needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants