If a TFJob spec is invalid mark the job as failed with an appropriate condition #815
/assign @gaocegege |
Travis tests have failed
Hey @jlewi,
1st Build: gometalinter --config=linter_config.json --vendor ./...
3rd Build: gometalinter --config=linter_config.json --vendor ./...
|
Travis tests have failed
Hey @jlewi,
2nd Build: gometalinter --config=linter_config.json --vendor ./...
3rd Build: gometalinter --config=linter_config.json --vendor ./...
|
… condition.

* If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we want to update the TFJob status to indicate it failed.
* We need to use the REST API to update the TFJob status because we won't be able to deserialize the json to TFJob.
* Related to kubeflow#755
* I created invalid-tfjob.jsonnet which can be used in an E2E test but I haven't included the E2E test in this PR.
* I tested it manually and got the following result

      apiVersion: kubeflow.org/v1alpha2
      kind: TFJob
      metadata:
        clusterName: ""
        creationTimestamp: 2018-08-31T23:37:14Z
        generation: 1
        labels:
          app.kubernetes.io/deploy-manager: ksonnet
          ksonnet.io/component: invalid-tfjob
        name: invalid-tfjob
        namespace: kubeflow
        resourceVersion: "1826961"
        selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/invalid-tfjob
        uid: ca7b4b02-ad76-11e8-be57-42010a8e0084
      spec:
        notTheActualField:
          Ps:
            replicas: 2
            restartPolicy: Never
            template:
              spec:
                containers:
                - image: busybox
                  name: tensorflow
          Worker:
            replicas: 4
            restartPolicy: Never
            template:
              spec:
                containers:
                - image: busybox
                  name: tensorflow
      status:
        conditions:
        - lastTransitionTime: 2018-08-31T23:37:14Z
          lastUpdateTime: 2018-08-31T23:37:14Z
          message: 'Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob'
          reason: FailedInvalidTFJobSpec
          status: "True"
          type: Failed
        tfReplicaStatuses: null

* Add an E2E test; test_runner.py was getting overly complicated so I created a new main file to run the test and just call methods in test_runner.py as needed.
/test all |
/cc @johnugeorge I think you might be interested in the PR.
/lgtm
I think creating a client every time we meet an invalid spec works well for now, but I will try to find a better way to implement it. |
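One possible direction for the concern above about creating a client on every invalid spec, offered only as an illustration and not as something this PR proposes: build the client once and reuse it. In the sketch below, `restClient` and `clientCache` are hypothetical names, and the `build` function stands in for whatever actually constructs the client.

```go
package main

import "sync"

// restClient is a hypothetical placeholder for the client that issues the
// status update; the real client type is not shown in this conversation.
type restClient struct {
	host string
}

// clientCache builds the client on first use and hands back the same
// instance (or the same error) on every later call.
type clientCache struct {
	once   sync.Once
	client *restClient
	err    error
	build  func() (*restClient, error)
}

// get is safe to call from multiple goroutines; build runs at most once.
func (c *clientCache) get() (*restClient, error) {
	c.once.Do(func() {
		c.client, c.err = c.build()
	})
	return c.client, c.err
}
```

The trade-off is that a failed first build is also cached, so a real implementation might prefer retrying on error.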
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: gaocegege
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* If a TFJob spec is invalid (e.g. can't be marshaled to TFJob YAML) we want to update the TFJob status to indicate it failed.
* We need to use the REST API to update the TFJob status because we won't be able to deserialize the json to TFJob.
* Fix #755 (Surface invalid spec errors in a more user friendly way)
* Added an E2E test
* I tested it manually and got the following result