
Introducing a ValidatingWebhook admission controller to validate specs #1019

Closed
whynowy opened this issue Jan 15, 2021 · 5 comments · Fixed by #1021
Labels
enhancement New feature or request

Comments

whynowy commented Jan 15, 2021

Is your feature request related to a problem? Please describe.
Spec validation is needed for all of the CRD objects. For example, the following operations are not expected: creating a Sensor object without defining any dependencies, updating the auth strategy of an existing EventBus object, and so on.

Currently, spec validation happens during controller reconciliation, which means users will not know whether the spec they provided, or the operation they performed, is allowed until they check the status field of the CRD object.

Describe the solution you'd like
To let users know earlier whether the spec or operation is valid, we can introduce a ValidatingAdmissionWebhook, so that they see the errors before the spec is stored in etcd.

This will introduce:

  1. A Deployment and a Service running in the argo-events namespace, which implement the validation logic.
  2. A ValidatingWebhookConfiguration object for the admission controller configuration. It will be automatically created by the Deployment in 1.)

With this validating webhook, right after an invalid spec is applied you will get an error message like the one below (assuming a native EventBus requires a replica number >= 3):

kubectl apply -f test-eventbus.yaml
Error from server (BadRequest): error when creating "test-eventbus.yaml": admission webhook "webhook.argo-events.argoproj.io" denied the request: Replicas must be no less than 3
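
For illustration, here is a minimal sketch of what the validation endpoint behind that Service could look like. The handler name, the listen path, and the assumption that the native NATS replica count lives at spec.nats.native.replicas are placeholders of mine, not the actual implementation (which is in #1021):

    // Minimal sketch of the webhook HTTP endpoint (hypothetical names).
    package main

    import (
        "encoding/json"
        "net/http"

        admissionv1 "k8s.io/api/admission/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // eventBusObject holds only the fields this sketch validates; the replica
    // count for a native NATS EventBus is assumed to be at spec.nats.native.replicas.
    type eventBusObject struct {
        Spec struct {
            NATS *struct {
                Native *struct {
                    Replicas int32 `json:"replicas"`
                } `json:"native"`
            } `json:"nats"`
        } `json:"spec"`
    }

    // validateEventBus decodes the AdmissionReview sent by the API server and
    // denies the request when a native EventBus asks for fewer than 3 replicas.
    func validateEventBus(w http.ResponseWriter, r *http.Request) {
        var review admissionv1.AdmissionReview
        if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
            http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
            return
        }

        resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
        var eb eventBusObject
        if err := json.Unmarshal(review.Request.Object.Raw, &eb); err == nil {
            if eb.Spec.NATS != nil && eb.Spec.NATS.Native != nil && eb.Spec.NATS.Native.Replicas < 3 {
                resp.Allowed = false
                resp.Result = &metav1.Status{Message: "Replicas must be no less than 3"}
            }
        }
        review.Response = resp

        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(&review)
    }

    func main() {
        http.HandleFunc("/validate", validateEventBus)
        // The serving cert/key are assumed to be mounted into the pod or
        // generated at startup; see the certificate discussion below.
        http.ListenAndServeTLS(":8443", "/tls/tls.crt", "/tls/tls.key", nil)
    }

The webhook only returns Allowed plus an optional message; the API server formats that into the kubectl error shown above.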

Additional context
This validating webhook will be included in the cluster-scoped installation spec; it might not be included in the namespace-scoped installation.


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

whynowy added the enhancement (New feature or request) label Jan 15, 2021
jessesuen commented Jan 15, 2021

Here is the perspective from argo-rollouts, and why we chose not to implement a ValidatingWebhook (at least not at this time) and instead chose to have Error conditions on the Rollout object:

  1. With a ValidatingWebhook, you must choose a failurePolicy for when the API server gets errors while talking to your validation endpoint, for example when your validating webhook server is down. failurePolicy can be either Ignore or Fail (see the configuration sketch after this comment).

If failurePolicy is Ignore, then you must live with the fact that your controller will at some point have to deal with the possibility of malformed objects entering the system. On the other hand, if the failurePolicy is Fail, then you have to make sure your endpoint has a high degree of uptime, otherwise downtime of the webhook will prevent any mutations of event objects from happening.

  2. Admission webhooks are more complicated to operate because proper certificate trust has to be established between the webhook server validating the resource and the K8s API server. In many projects this is handled by Kubernetes Jobs which run as part of the installation process to generate the trust, but it forces the installation process to become more imperative / complex.

  3. Validating/mutating webhooks are, by design, directly in the data path for object mutation and will add to the latency of API requests. Depending on the anticipated churn of the object, it may also require you to horizontally scale the webhook server to handle the request volume with acceptable performance.

Rollouts came to the conclusion that it would rather deal with malformed objects that somehow entered the cluster after the fact, and instead provide clear indications of a Rollout with an InvalidSpec condition. In the future we may improve the user experience by complementing spec validation with a validating webhook, since that catches the problem much earlier. But principally, the primary means of dealing with invalid objects in argo-rollouts is Error conditions, with the future possibility of complementing that with an optional but not mandatory validating webhook (failurePolicy: Ignore).
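
For reference, the failurePolicy mentioned in point 1 is a per-webhook field on the ValidatingWebhookConfiguration object. Below is a rough sketch, using client-go types, of what a configuration registered by the webhook Deployment could look like with failurePolicy: Ignore; the object name, Service name, path, and resource rules are placeholders, not the actual argo-events configuration:

    // Sketch of a ValidatingWebhookConfiguration with failurePolicy: Ignore.
    // Names, paths, and rules are hypothetical placeholders.
    package webhook

    import (
        admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func newWebhookConfig(caBundle []byte) *admissionregistrationv1.ValidatingWebhookConfiguration {
        ignore := admissionregistrationv1.Ignore
        sideEffects := admissionregistrationv1.SideEffectClassNone
        path := "/validate"
        return &admissionregistrationv1.ValidatingWebhookConfiguration{
            ObjectMeta: metav1.ObjectMeta{Name: "argo-events-webhook"},
            Webhooks: []admissionregistrationv1.ValidatingWebhook{{
                Name:                    "webhook.argo-events.argoproj.io",
                AdmissionReviewVersions: []string{"v1"},
                SideEffects:             &sideEffects,
                // Ignore: if the webhook server is unreachable, the API server
                // admits the request and reconciliation validation still applies.
                FailurePolicy: &ignore,
                ClientConfig: admissionregistrationv1.WebhookClientConfig{
                    Service: &admissionregistrationv1.ServiceReference{
                        Namespace: "argo-events",
                        Name:      "events-webhook",
                        Path:      &path,
                    },
                    CABundle: caBundle,
                },
                Rules: []admissionregistrationv1.RuleWithOperations{{
                    Operations: []admissionregistrationv1.OperationType{
                        admissionregistrationv1.Create,
                        admissionregistrationv1.Update,
                    },
                    Rule: admissionregistrationv1.Rule{
                        APIGroups:   []string{"argoproj.io"},
                        APIVersions: []string{"v1alpha1"},
                        Resources:   []string{"eventbus", "eventsources", "sensors"},
                    },
                }},
            }},
        }
    }

The caBundle here is the PEM of the certificate authority that signed the webhook's serving certificate, which is how the API server learns to trust the endpoint (the certificate concern raised in point 2 above).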

tvalasek commented
The argo CLI comes with a lint option which we use in our CI/CD. Would that be a sufficient spec validator? Honestly, I think catching spec issues right before applying to K8s is too late: you want to be able to catch these before CD kicks off (a validation step in your CI).

whynowy closed this as completed Jan 15, 2021
whynowy reopened this Jan 15, 2021
whynowy commented Jan 15, 2021

@jessesuen

  1. Introducing a ValidatingWebhook is just a supplement; it will not replace the current reconciliation error conditions, so using an Ignore failure policy is good enough.

  2. The certificate stuff can be automatically managed; see my implementation (and the rough sketch after this list).

    func (ac *AdmissionController) configureCerts(ctx context.Context, clientAuth tls.ClientAuthType) (*tls.Config, []byte, error) {

  3. In terms of API request latency: first of all, we only expect a fixed pattern of requests to go through the new webhook, so the filtering won't have much impact on latency. For those requests that do need to go through the webhook, we can try to avoid or reduce extra calls to the api-server and other heavy operations, and I don't think there will be a lot of requests operating on argo-events related CRD objects.
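
For readers who don't want to open the PR, here is a rough, hypothetical sketch of the kind of self-signed certificate bootstrap a configureCerts-style function performs: generate a serving certificate for the webhook Service DNS name inside the pod, build the TLS config from it, and return the PEM so it can be injected as caBundle into the ValidatingWebhookConfiguration. The helper name and details below are assumptions, not the actual implementation:

    package webhook

    import (
        "crypto/rand"
        "crypto/rsa"
        "crypto/tls"
        "crypto/x509"
        "crypto/x509/pkix"
        "encoding/pem"
        "math/big"
        "time"
    )

    // selfSignedServingCert is a hypothetical stand-in for configureCerts:
    // it creates a self-signed serving certificate for the webhook Service
    // DNS name and returns a TLS config plus the PEM-encoded certificate,
    // which can later be injected as caBundle into the webhook configuration.
    func selfSignedServingCert(serviceDNSName string) (*tls.Config, []byte, error) {
        key, err := rsa.GenerateKey(rand.Reader, 2048)
        if err != nil {
            return nil, nil, err
        }
        tmpl := &x509.Certificate{
            SerialNumber:          big.NewInt(1),
            Subject:               pkix.Name{CommonName: serviceDNSName},
            DNSNames:              []string{serviceDNSName},
            NotBefore:             time.Now().Add(-time.Hour),
            NotAfter:              time.Now().Add(365 * 24 * time.Hour),
            KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment | x509.KeyUsageCertSign,
            ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
            BasicConstraintsValid: true,
            IsCA:                  true, // self-signed: the serving cert acts as its own CA
        }
        der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
        if err != nil {
            return nil, nil, err
        }
        certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
        keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(key)})
        pair, err := tls.X509KeyPair(certPEM, keyPEM)
        if err != nil {
            return nil, nil, err
        }
        return &tls.Config{Certificates: []tls.Certificate{pair}}, certPEM, nil
    }

In practice the generated certificate and key would also be persisted (for example in a Secret) so that pod restarts reuse them instead of regenerating the cert and re-patching the caBundle every time.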

whynowy commented Jan 15, 2021

The argo CLI comes with a lint option which we use in our CI/CD. Would that be a sufficient spec validator? Honestly, I think catching spec issues right before applying to K8s is too late: you want to be able to catch these before CD kicks off (a validation step in your CI).

lint can tell you if there's an issue with your spec, but there's no way to prevent it from going into the k8s etcd if people apply it anyway.

whynowy commented Jan 16, 2021

BTW - lint cannot solve the problem of updating an existing object (e.g. some fields are not allowed to be updated), @tvalasek
