[kueue] Check certificates readiness before the webhook server. #1707
@@ -292,7 +293,12 @@ func setupProbeEndpoints(mgr ctrl.Manager) {
 	// the function, otherwise a not fully-initialized webhook server (without
 	// ready certs) fails the start of the manager.
 	if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
-		return mgr.GetWebhookServer().StartedChecker()(req)
+		select {
+		case <-certsReady:
When we discussed this originally, I wasn't sure whether this is needed (#1676 (comment)). I'm still not 100% sure, but the change reflects the intention and should be harmless. Also, an analogous change was made in another project: https://github.com/kubernetes-sigs/hierarchical-namespaces/blob/d367fc08a261135b63d22aeb01c688317d9b7e02/cmd/manager/main.go#L296-L307, so I'm happy to try this fix.
@trasc can you give a rationale for why you think this fixes the issue?
From the logs, it does look like there is some race condition between setting the certs and starting the manager, so I can see the relationship. But the specific sequence of events is unclear to me.
From the old thread, #1676 (comment):
What I verified experimentally is that if I just call GetWebhookServer before WaitForCertsReady, the webhook is not initialized, and it stays that way.
I ran these experiments just by modifying the code and calling GetWebhookServer before or after line 226 in c4a2255:
cert.WaitForCertsReady(setupLog, certsReady)
I think what might be happening now is that the readiness probe is called after the readiness server starts listening (which probably does not require certs) but before the certs are initialized.
@trasc can you give a rationale for why you think this fixes the issue?
From the logs, it does look like there is some race condition between setting the certs and starting the manager, so I can see the relationship. But the specific sequence of events is unclear to me.
Unfortunately I don't have anything concrete. I suspect that in the original case, the Start hit in the middle of a rotation. But it's just a suspicion.
/test pull-kueue-test-e2e-main-1-26
If we go with the fix, we need to cherry-pick this to 0.5, as we did with #1682. A separate release note is probably not needed.
Oops, looks like there is still an issue: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1707/pull-kueue-test-e2e-main-1-27/1755619946483683328
Yes, one of the managers did not get the certificates: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1707/pull-kueue-test-e2e-main-1-27/1755619946483683328/artifacts/run-test-multikueue-e2e-1.27.3/kind-worker2-control-plane/pods/kueue-system_kueue-controller-manager-77656c747d-bmrfm_e1741fd8-969b-4768-966c-f7f6b4708a76/manager/0.log. We could give up if the certificates are not ready within X minutes; the pod will be replaced and maybe we have better luck next time :) Or we could extend the wait timeout.
/test pull-kueue-test-e2e-main-1-26
Maybe let's see if a longer timeout helps, though it might be hard to confirm with certainty. EDIT: this way we will know that if the issue repeats with a timeout of, say, 6m, it is pretty much non-recoverable. Restarting the pod may cover a wide range of issues.
I'll try it tomorrow.
/test pull-kueue-test-e2e-main-1-26
Did you mean the timeout in the bash script?
Yes
What still makes me curious is how the condition in the bash script was met in the first failure we saw. It must mean the server responded OK at least once and was terminated later.
/test pull-kueue-test-e2e-main-1-26
/lgtm Let's leave the timeout for another PR
LGTM label has been added. Git tree hash: 99f8fc8556c3471708d5b27babfa5c830cf6858b
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, trasc
/cherry-pick release-0.5
@trasc: new pull request created: #1713
…rnetes-sigs#1707)
* [kueue] Check certificates readiness before the webhook server.
* Don't keep the caller busy.
What type of PR is this?
/kind bug
/kind flake
What this PR does / why we need it:
Check certificates readiness before checking if the webhook server is ready.
Which issue(s) this PR fixes:
Fixes #1700
Special notes for your reviewer:
Does this PR introduce a user-facing change?