
[BUG] CloneSet validating webhook rejects pod with restartPolicy OnFailure even though it is supported #794

Closed
shuyingliang opened this issue Nov 9, 2021 · 22 comments

Comments

@shuyingliang

shuyingliang commented Nov 9, 2021

What happened:
We are deploying a workload using CloneSet with restartPolicy set to OnFailure, because one of the colocated containers is not a long-running service and needs to exit after its job is done.

We configured the CloneSet validating webhook (vcloneset.kb.io) with failurePolicy: Ignore.

However, the pod request was rejected with the error:

"admission webhook \"vcloneset.kb.io\" denied the request: spec.template.spec.restartPolicy: Unsupported value: \"OnFailure\": supported values: \"Always\""}

We then disabled the webhook completely (via hacks such as giving the webhook a wrong cert in the caBundle field; the controller-manager happily proceeds, but the webhook no longer takes effect), and our workload deployed successfully with the expected behavior (the colocated container in the pod is not restarted if it exits successfully).

I noticed code in the code base that makes the webhook fail the request whenever the restartPolicy is not Always.

What you expected to happen:

We expect that

  1. When the webhook failurePolicy is Ignore, the request should not be failed just because the pod restartPolicy is not Always.
  2. OnFailure is expected to behave correctly, so why add guards such as the one here? Can we remove it?

How to reproduce it (as minimally and precisely as possible):
Just create a CloneSet with the pod restartPolicy set to OnFailure and the validating webhook failurePolicy set to Ignore,
and you will see the pod request rejected with the aforementioned error.

Environment
We are using Kruise version 0.9.0, but the same issue exists in other versions.
If the problem is fixed, we'd appreciate the change being made available for versions >= 0.9.0.

@shuyingliang shuyingliang added the kind/bug Something isn't working label Nov 9, 2021
@FillZpp
Member

FillZpp commented Nov 9, 2021

@wgitg Actually, workloads for both stateless and stateful long-running services, e.g. Deployment, CloneSet, StatefulSet, are all limited so that restartPolicy can only be Always.
Validation for Deployment/ReplicaSet

This is to avoid continuously creating pods when pods enter the Completed state, whether that is expected or not.
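
For context, here is a minimal sketch (not the actual Kruise source; the package and function names are hypothetical) of the kind of guard being discussed, mirroring the upstream restriction that the pod template's restartPolicy must be Always:

package validating

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/validation/field"
)

// validateRestartPolicy rejects any restartPolicy other than Always on the
// pod template, which is what produces the "Unsupported value" error quoted above.
func validateRestartPolicy(template *corev1.PodTemplateSpec, fldPath *field.Path) field.ErrorList {
    allErrs := field.ErrorList{}
    if template.Spec.RestartPolicy != corev1.RestartPolicyAlways {
        allErrs = append(allErrs, field.NotSupported(
            fldPath.Child("spec", "restartPolicy"),
            template.Spec.RestartPolicy,
            []string{string(corev1.RestartPolicyAlways)},
        ))
    }
    return allErrs
}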

@shuyingliang
Author

shuyingliang commented Nov 9, 2021

This is to avoid continuously creating pods when pods enter the Completed state, whether that is expected or not.

@FillZpp I think you meant to say that restartPolicy: Always restarts applications whenever they stop, regardless of whether they stopped expectedly or not. :)

To circle back to the first point here: any insight into why, with the webhook configuration failurePolicy set to Ignore, the pod request with restartPolicy: OnFailure is still rejected? Isn't the validating webhook supposed to ignore/allow such a pod request?

@FillZpp
Member

FillZpp commented Nov 9, 2021

@wgitg You can read this doc. Ignore means that an error calling the webhook is ignored and the API request is allowed to continue, for example when the webhook has crashed and the apiserver cannot connect to it. It does not cover an expected rejection carried in the webhook response.

@shuyingliang
Author

shuyingliang commented Nov 9, 2021

@FillZpp The thing is that the CloneSet validating webhook rejects the request even though we specified the webhook configuration failurePolicy as Ignore.

@FillZpp
Member

FillZpp commented Nov 9, 2021

You may need to understand that (see the sketch below):

  • A rejection from the webhook is an expected failure; the apiserver will not ignore it.
  • A connection error is an unexpected failure; the apiserver will ignore it.
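
As an illustration (using the Go API types rather than a YAML manifest, purely as a sketch), the failurePolicy knob sits on the webhook configuration object and only governs failed calls:

package example

import (
    admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
)

// failurePolicy: Ignore tells the apiserver what to do when the call to the
// webhook itself fails (timeout, connection refused, bad TLS). It has no
// effect on a well-formed response that sets allowed: false.
var ignore = admissionregistrationv1.Ignore

var vclonesetWebhook = admissionregistrationv1.ValidatingWebhook{
    Name:          "vcloneset.kb.io",
    FailurePolicy: &ignore, // applies only to failed calls, not to denials
    // ClientConfig, Rules, SideEffects, AdmissionReviewVersions omitted for brevity.
}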

@shuyingliang
Author

Well, we are talking about the first case here, where we expected the apiserver to ignore the error from the webhook and allow the request even though the webhook returns an error, as specified in the doc:
"an error calling the webhook is ignored and the API request is allowed to continue"

Is it possible that the error here

return admission.Errored(http.StatusUnprocessableEntity, allErrs.ToAggregate())

returned from the CloneSet validating webhook is not ignored by the apiserver and is instead used to reject the request?

@FillZpp
Member

FillZpp commented Nov 9, 2021

But the question is: if the apiserver allowed the request no matter what response the webhook returns under Ignore, what would be the point of this webhook existing...

@shuyingliang
Author

shuyingliang commented Nov 9, 2021

@FillZpp Good question. But isn't that what the doc says: "an error calling the webhook is ignored and the API request is allowed to continue"?

The doc does not specify which errors are ignored, or whether every error should be ignored. Clarifying this will help us know whether anything needs to change in how the CloneSet validating webhook returns errors, or whether nothing needs to be done. Do you agree?

Maybe we can get more insight from the apiserver code? Are you familiar with the part of the apiserver code that implements this behavior ("an error calling the webhook is ignored and the API request is allowed to continue")? Searching the code...

@FillZpp
Member

FillZpp commented Nov 9, 2021

It is the error for the HTTPS call, not the rejection in the response.

You have to know that when the webhook rejects a request, the HTTPS response code is still 200; the rejection message is carried in the response body, such as:

{
    "allowed": false,
    "status": {
        "code": 400,
        "message": "spec.template.spec.restartPolicy: Unsupported value ......"
    }
}
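
As a rough illustration (assuming a controller-runtime based webhook using the admission/v1 types, which is where the admission.Errored call quoted earlier comes from), such a denial is built roughly like this and still travels over a successful HTTP 200 call:

package example

import (
    "net/http"

    admissionv1 "k8s.io/api/admission/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// denyRestartPolicy sketches the shape of the response above: the webhook call
// itself succeeds (HTTP 200), and the rejection is expressed as allowed: false
// plus a status code and message inside the AdmissionResponse body.
func denyRestartPolicy(msg string) admission.Response {
    return admission.Response{
        AdmissionResponse: admissionv1.AdmissionResponse{
            Allowed: false,
            Result: &metav1.Status{
                Code:    http.StatusUnprocessableEntity, // 422, matching the admission.Errored call quoted earlier
                Message: msg,
            },
        },
    }
}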

@shuyingliang
Author

OK, following your logic, in the case where failurePolicy is Ignore, shouldn't the HTTPS response then look like this?

{
    "allowed": true,
    "status": {
        "code": 200,
        "message": "spec.template.spec.restartPolicy: Unsupported value ......"
    }
}

@FillZpp
Member

FillZpp commented Nov 9, 2021

No, the webhook itself does not care about failurePolicy. It is only used by the kube-apiserver to ignore failed HTTPS calls, i.e. to skip validation when the webhook is abnormal.

@shuyingliang
Author

Understood. But back to the original ask: why is the request rejected by the apiserver (so basically no pod can be created) when the CloneSet validating webhook's failurePolicy is specified as Ignore? What do you suggest at this point?

@FillZpp
Member

FillZpp commented Nov 9, 2021

I don't quite understand your question. As I said, even when failurePolicy is Ignore, the kube-apiserver gets the 200 response from the webhook and uses the status code 400 in it as the return code to the user.

@shuyingliang
Author

As I said, even when failurePolicy is Ignore, the kube-apiserver gets the 200 response from the webhook and uses the status code 400 in it as the return code to the user.

Thanks for the summary.
IIUC, given the situation where no pods are created for the CloneSet even with the webhook configuration failurePolicy set to Ignore, you think nothing needs to be done on the CloneSet side to address it?

@FillZpp
Member

FillZpp commented Nov 9, 2021

I'm not sure whether we should provide a new feature-gate that allows users to create a CloneSet with an unrestricted restartPolicy. In your case, is it that some of the containers in the Pod will exit while the other containers always remain running?

@shuyingliang
Author

shuyingliang commented Nov 9, 2021

@FillZpp Yes. The use case is that not all of the applications are long-running services; some colocated containers in the same CloneSet pod will do some work and exit.

A per-container restartPolicy might be expensive to support (?), but the feature gate seems like a great idea to enable this flexibility. Can you enable CloneSet to do that?

@veophi
Member

veophi commented Nov 10, 2021

@wgitg For a Pod with RestartPolicy=OnFailure, its restart behavior is as follows (a sketch of this rule follows the list):

  • For containers that have already exited with exitCode=0, an in-place update will not be triggered by replacing their image;
  • For running containers, whether the in-place update can complete depends on the exitCode of the container when it is stopped. If exitCode=0, the in-place update will not be triggered either (the container will just be killed, and not be recreated).
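
A hedged sketch of the rule described above, in terms of the kubelet's restart semantics rather than Kruise's actual in-place update code: under OnFailure, a container that terminates with exit code 0 is not restarted, so killing it for an in-place image replacement does not bring up a new container with the new image.

package example

import (
    corev1 "k8s.io/api/core/v1"
)

// willBeRestarted illustrates the semantics behind the two bullet points above:
// once a container has terminated with exit code 0 under OnFailure, it will not
// be restarted, so an in-place image update cannot take effect on it.
func willBeRestarted(policy corev1.RestartPolicy, status corev1.ContainerStatus) bool {
    term := status.State.Terminated
    if term == nil {
        // Container is still running or waiting; there is no restart decision yet.
        return false
    }
    switch policy {
    case corev1.RestartPolicyAlways:
        return true
    case corev1.RestartPolicyOnFailure:
        return term.ExitCode != 0 // exit 0 => treated as completed, no restart
    default: // corev1.RestartPolicyNever
        return false
    }
}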

@FillZpp
Member

FillZpp commented Nov 10, 2021

@wgitg If you are going to use the in-place update feature, please note the comment above. Is that acceptable to you?

Besides, could those colocated containers be defined as init containers, so that you don't have to set the OnFailure restartPolicy?

@shuyingliang
Author

@veophi @FillZpp Thanks for the clarification above! It is a surprise that OnFailure can affect the in-place update feature. :(

Yes, we do want to leverage the in-place update feature for all containers in the same pod. In fact, part of the reason we did not use an init container is that in-place update only works on app containers, and we did not want changes to the init container of the colocated short-running app to cause the entire pod to be recreated and thus affect the main applications in the same pod.

However, I am still pondering why the in-place update flow is mixed with the pod restart policy like this. In-place update is about restarting containers to deploy a new version, so shouldn't it happen irrespective of how a container exited before?

@stale

stale bot commented Feb 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Feb 8, 2022
@shuyingliang
Author

Given the complications discussed, we think the alternative proposal can help us truly achieve the in-place update. We may close this issue.

@stale stale bot removed the wontfix This will not be worked on label Feb 10, 2022
@FillZpp
Member

FillZpp commented Feb 10, 2022

closed by #901

@FillZpp FillZpp closed this as completed Feb 10, 2022