TfJob operator stops working on invalid spec #561
Comments
See also #437 - OpenAPI Validation for TFJob controller
It is hard to validate the config using the CRD validation feature, since we have PodTemplateSpec in the definition. The feature does not support
Can you explain more about $ref and how it is used? Would it be possible to just use OpenAPI validation to ensure that container args are strings, not integers? I guess another solution might be to use an admission controller to validate the spec. @enisoc Any suggestions on how to handle this? @jessesuen I think you faced a similar problem with the Argo CRD; what did you do?
If you just want to validate container args, and not everything in PodTemplateSpec, then it may be feasible to write an OpenAPI schema by hand for that. If you want to validate the whole PodTemplateSpec, the best workaround I've heard of so far is this one (although I haven't tried it personally):
Thanks, I will take a look. We want to validate the whole PodTemplateSpec 😄
@jlewi coincidentally, I literally just "fixed" this in the workflow controller. But it's more of a workaround for the upstream Kubernetes issue: kubernetes/kubernetes#57705. I followed the recommendation in this comment: Instead of using the auto-generated workflow informer, I wrote a The fix can be seen here:
Thanks for your reply @jessesuen @jlewi. I wrote a tool to generate the validation from the OpenAPI specification: https://github.com/gaocegege/crd-validation. The generated CRD for tfjob v1alpha2 is https://github.com/gaocegege/crd-validation/blob/master/generated/tfjob-crd-v1alpha2.yaml. However, we hit an issue on the Kubernetes side: kubernetes/kubernetes#59485 (comment). Kubernetes does not support additionalProperties, which is needed for map types. Unfortunately, upstream said it will be implemented in 1.11, so maybe we should wait for it. After 1.11, I think we will be able to use the tool above to generate validations for all CRDs in the Kubeflow community.
At this moment, validating all types in the CRD is not practical, and the CRD validation feature has some limitations, such as the lack of additionalProperties support. I think the workaround from jessesuen is a good way to solve the problem.
@gaocegege I like the idea of following @jessesuen's workaround and then implementing what validation we can with CRD validation.
Yeah, I am working on it. 😄
I submitted a job with an invalid spec (container args contained integers rather than strings). The job was created but it was never started and the status was never updated. Furthermore, I think this blocked the TFJob operator from processing any other jobs. Deleting the job fixed things.
The TFJob operator showed the following logs.
I believe what's happening is that since the spec is invalid the result of List can't be successfully parsed into a Go struct. As a result, I think the TFJob operator is unable to work.
I think this is a problem in the underlying informer package; i.e. it's not robust to invalid specs. We should check if this is a known issue and if there is an existing bug. (Ideally, it would just ignore invalid specs.)
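The failure mode described here can be sketched in a few lines. The operator itself is written in Go; the Python below only illustrates the logic, and every name in it (`decode_job`, `decode_list_strict`, `decode_list_tolerant`, the sample payload and its flattened `spec.args` field) is hypothetical rather than taken from the TFJob code or the informer package. It contrasts an all-or-nothing decode of a List response, where one bad item hides every job, with a per-item decode that sets invalid items aside.

```python
import json


def decode_job(raw):
    """Decode one job dict into a typed record, rejecting non-string args.

    Stands in for decoding into a typed struct whose args field only
    accepts strings (an assumption made for this illustration).
    """
    name = raw["metadata"]["name"]
    args = raw["spec"]["args"]
    if not all(isinstance(a, str) for a in args):
        raise TypeError(f"job {name!r}: container args must be strings, got {args!r}")
    return {"name": name, "args": list(args)}


def decode_list_strict(payload):
    """All-or-nothing decode: a single invalid item fails the whole list."""
    return [decode_job(item) for item in json.loads(payload)["items"]]


def decode_list_tolerant(payload):
    """Per-item decode: keep valid jobs and report the names of invalid ones."""
    valid, invalid = [], []
    for item in json.loads(payload)["items"]:
        try:
            valid.append(decode_job(item))
        except (KeyError, TypeError):
            invalid.append(item.get("metadata", {}).get("name", "<unknown>"))
    return valid, invalid


# Hypothetical List response: one well-formed job and one with an integer arg.
PAYLOAD = json.dumps({
    "items": [
        {"metadata": {"name": "good-job"}, "spec": {"args": ["--steps", "100"]}},
        {"metadata": {"name": "bad-job"}, "spec": {"args": ["--steps", 100]}},
    ]
})

if __name__ == "__main__":
    try:
        decode_list_strict(PAYLOAD)
    except TypeError as err:
        print("strict decode failed:", err)
    valid, invalid = decode_list_tolerant(PAYLOAD)
    print("valid:", [job["name"] for job in valid], "invalid:", invalid)
```

With the tolerant variant, the names collected in `invalid` are the jobs a controller could go back and mark as failed, instead of stalling on the whole list.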
I think we could fix this in a number of ways in the TFJob controller:
If we use the CRD spec validation feature and provide a Swagger spec, I think we could prevent invalid specs from being accepted in the first place.
The operator could try to catch this error and then find and update the invalid spec.
Swagger is probably the best place to start.
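As a rough illustration of what schema validation would buy here, the sketch below checks a spec against a hand-written fragment in the style of an OpenAPI v3 schema before the spec would be accepted. It is Python for brevity (the operator is Go). The `ARGS_SCHEMA` fragment, the flattened `spec.args` path, and the `validate` helper are all hypothetical, and the helper understands only `type`, `properties`, and `items`. It is not how Kubernetes implements CRD validation.

```python
# Hypothetical schema fragment: args must be an array of strings.
ARGS_SCHEMA = {
    "type": "object",
    "properties": {
        "args": {"type": "array", "items": {"type": "string"}},
    },
}

# Mapping from the schema type names used above to Python types.
_TYPES = {"object": dict, "array": list, "string": str}


def validate(value, schema, path="spec"):
    """Return a list of error strings; an empty list means the value is valid."""
    if not isinstance(value, _TYPES[schema["type"]]):
        return [f"{path}: expected {schema['type']}, got {type(value).__name__}"]
    errors = []
    if schema["type"] == "object":
        # Only check declared properties that are actually present.
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
    elif schema["type"] == "array" and "items" in schema:
        # Every element must satisfy the item schema.
        for i, item in enumerate(value):
            errors += validate(item, schema["items"], f"{path}[{i}]")
    return errors


if __name__ == "__main__":
    print(validate({"args": ["--steps", "100"]}, ARGS_SCHEMA))
    print(validate({"args": ["--steps", 100]}, ARGS_SCHEMA))
```

A spec with an integer in `args` is rejected with an error naming the offending element, so it would never reach the controller's list and decode path.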
We should try to get this fixed in 0.2