-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Surface invalid spec errors in a more user friendly way #755
Comments
The TFJob will get an event when the spec is invalid, I think the code is here: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/controller_tfjob.go#L28 If it does not have the event, it should be a bug. |
I think the problem is that pass tfjob to the event recorder But I think TFJob will be nil because it was unable to successfully parse the unstructured object into TFJob. So we should be passing the unstructured object to the event recorder. |
Do you mean here https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/tfcontroller/tfjob.go#L37 ? I will try to fix it. |
/assign @gaocegege |
I think we should add an e2e test for invalid spec. |
Thanks @gaocegege I think this provides a good short term solution. But I'd really like to surface the error in the status of the TFJob as a failed condition so that it shows up as
It looks like we can use the metaunstructured package to easily set the the status on the TFJob struct
We then just need to issue an update call to the APIServer. @gaocegege Do you know if there's a convenient method to do this? We can't use TFJobController.updateTFJobStatus because we don't have a TFJob. Is there a clientset for unstructured objects? |
I think we could use the RestClient interface |
I'll take a stab at this. |
… condition. * If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we want to update the TFJob status to indicate it failed. * We need to use the REST API to update the TFJob status because we won't be able to deserialize the json to TFJob. * Related to kubeflow#755 * I created invalid-tfjob.jsonnet which can be used in an E2E test but I haven't included the E2E test in this PR. * I tested it manually and got the following result apiVersion: kubeflow.org/v1alpha2 kind: TFJob metadata: clusterName: "" creationTimestamp: 2018-08-31T23:37:14Z generation: 1 labels: app.kubernetes.io/deploy-manager: ksonnet ksonnet.io/component: invalid-tfjob name: invalid-tfjob namespace: kubeflow resourceVersion: "1826961" selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob uid: ca7b4b02-ad76-11e8-be57-42010a8e0084 spec: notTheActualField: Ps: replicas: 2 restartPolicy: Never template: spec: containers: - image: busybox name: tensorflow Worker: replicas: 4 restartPolicy: Never template: spec: containers: - image: busybox name: tensorflow status: conditions: - lastTransitionTime: 2018-08-31T23:37:14Z lastUpdateTime: 2018-08-31T23:37:14Z message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob' reason: FailedInvalidTFJobSpec status: "True" type: Failed tfReplicaStatuses: null
… condition. * If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we want to update the TFJob status to indicate it failed. * We need to use the REST API to update the TFJob status because we won't be able to deserialize the json to TFJob. * Related to kubeflow#755 * I created invalid-tfjob.jsonnet which can be used in an E2E test but I haven't included the E2E test in this PR. * I tested it manually and got the following result apiVersion: kubeflow.org/v1alpha2 kind: TFJob metadata: clusterName: "" creationTimestamp: 2018-08-31T23:37:14Z generation: 1 labels: app.kubernetes.io/deploy-manager: ksonnet ksonnet.io/component: invalid-tfjob name: invalid-tfjob namespace: kubeflow resourceVersion: "1826961" selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob uid: ca7b4b02-ad76-11e8-be57-42010a8e0084 spec: notTheActualField: Ps: replicas: 2 restartPolicy: Never template: spec: containers: - image: busybox name: tensorflow Worker: replicas: 4 restartPolicy: Never template: spec: containers: - image: busybox name: tensorflow status: conditions: - lastTransitionTime: 2018-08-31T23:37:14Z lastUpdateTime: 2018-08-31T23:37:14Z message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob' reason: FailedInvalidTFJobSpec status: "True" type: Failed tfReplicaStatuses: null * Add an E2E test; test_runner.py was getting overly complicated so I created a new main file to run the test and just call methods in test_runner.py as needed.
… condition. (#815) * If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we want to update the TFJob status to indicate it failed. * We need to use the REST API to update the TFJob status because we won't be able to deserialize the json to TFJob. * Related to #755 * I created invalid-tfjob.jsonnet which can be used in an E2E test but I haven't included the E2E test in this PR. * I tested it manually and got the following result apiVersion: kubeflow.org/v1alpha2 kind: TFJob metadata: clusterName: "" creationTimestamp: 2018-08-31T23:37:14Z generation: 1 labels: app.kubernetes.io/deploy-manager: ksonnet ksonnet.io/component: invalid-tfjob name: invalid-tfjob namespace: kubeflow resourceVersion: "1826961" selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob uid: ca7b4b02-ad76-11e8-be57-42010a8e0084 spec: notTheActualField: Ps: replicas: 2 restartPolicy: Never template: spec: containers: - image: busybox name: tensorflow Worker: replicas: 4 restartPolicy: Never template: spec: containers: - image: busybox name: tensorflow status: conditions: - lastTransitionTime: 2018-08-31T23:37:14Z lastUpdateTime: 2018-08-31T23:37:14Z message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob' reason: FailedInvalidTFJobSpec status: "True" type: Failed tfReplicaStatuses: null * Add an E2E test; test_runner.py was getting overly complicated so I created a new main file to run the test and just call methods in test_runner.py as needed.
Right now if a user submits a TFJob with invalid spec the job will be created and nothing will happen.
The logs for the TFJob operator indicate that the spec is invalid but this isn't very discoverable. The only indication that the job is invalid is the fact there is no status.
I'd like the error to be surfaced when user does
Firing off a K8s events linked to the job should be doable. This should work as long as the metadata is accurate enough which I believe should be guaranteed if the job is being accepted.
See also #605 CRD validation
The text was updated successfully, but these errors were encountered: