[kueue] Check certificates readiness before the webhook server. #1707

@trasc trasc commented Feb 8, 2024

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Check certificates readiness before checking if the webhook server is ready.
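
A minimal sketch of the idea, using only the Go standard library: the readiness check first looks at a channel that is closed once the certificates are ready, and only then delegates to the webhook server's started check. The names `certGatedCheck` and `alwaysStarted` are hypothetical (the latter stands in for `mgr.GetWebhookServer().StartedChecker()`), and the non-blocking `default` branch with its error message is an assumption, not necessarily the exact code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// certGatedCheck (hypothetical helper) wraps a readiness checker so that it
// reports "not ready" until certsReady is closed. The select never blocks,
// so a probe issued before the certificates exist returns immediately.
func certGatedCheck(certsReady <-chan struct{}, started func(*http.Request) error) func(*http.Request) error {
	return func(req *http.Request) error {
		select {
		case <-certsReady:
			// Certificates are in place: ask the underlying checker.
			return started(req)
		default:
			// Assumed behavior: fail the probe instead of waiting.
			return errors.New("certificates are not ready")
		}
	}
}

// alwaysStarted is a stand-in for the webhook server's started checker.
func alwaysStarted(_ *http.Request) error { return nil }

func main() {
	certsReady := make(chan struct{})
	check := certGatedCheck(certsReady, alwaysStarted)

	fmt.Println("before certs:", check(nil))
	close(certsReady) // what a cert rotator would do once the certs are written
	fmt.Println("after certs:", check(nil))
}
```

Because the check is a closure, the underlying checker is not consulted until the certificates are ready, which matches the intent of delaying the `GetWebhookServer` call.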

Which issue(s) this PR fixes:

Fixes #1700

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. labels Feb 8, 2024

netlify bot commented Feb 8, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 953e41e
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/65c4f802d8d94900083eef5b

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 8, 2024
@trasc trasc marked this pull request as ready for review February 8, 2024 15:16
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Feb 8, 2024
@@ -292,7 +293,12 @@ func setupProbeEndpoints(mgr ctrl.Manager) {
 	// the function, otherwise a not fully-initialized webhook server (without
 	// ready certs) fails the start of the manager.
 	if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
-		return mgr.GetWebhookServer().StartedChecker()(req)
+		select {
+		case <-certsReady:
Contributor

When we discussed this originally, I wasn't sure whether this was needed (#1676 (comment)). I'm still not 100% sure, but the change reflects the intention and should be harmless. Also, an analogous change was made in another project: https://github.com/kubernetes-sigs/hierarchical-namespaces/blob/d367fc08a261135b63d22aeb01c688317d9b7e02/cmd/manager/main.go#L296-L307, so I'm happy to try this fix.

Contributor

@alculquicondor alculquicondor Feb 8, 2024

@trasc can you give a rationale on why you think this fixes the issue?

From the logs, it does look like there is some race condition between setting the certs and starting the manager, so I can see the relationship. But the specific sequence of events is unclear to me.

Contributor

From the old thread: #1676 (comment):

What I verified experimentally is that if I just call GetWebhookServer before WaitForCertsReady then the webhook is not initialized, and it stays that way.

I ran this experiment just by modifying the code and calling GetWebhookServer before or after cert.WaitForCertsReady(setupLog, certsReady).

I think what might be happening now is that the readiness probe is called in the window between the readiness server starting to listen (which probably does not require certs) and the certs being initialized.

Contributor Author

@trasc can you give a rationale on why you think this fixes the issue?

From the logs, it does look like there is some race condition between setting the certs and starting the manager, so I can see the relationship. But the specific sequence of events is unclear to me.

Unfortunately I don't have anything concrete. I suspect that in the original case, the Start hit in the middle of a rotation, but it's just a suspicion.

cmd/kueue/main.go (outdated, resolved)
@trasc
Contributor Author

trasc commented Feb 8, 2024

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo
Contributor

mimowo commented Feb 8, 2024

If we go with the fix, we need to cherry-pick this to 0.5, as we did with #1682. A separate release note is probably not needed.

@mimowo
Contributor

mimowo commented Feb 8, 2024

ups, looks like there is still an issue: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1707/pull-kueue-test-e2e-main-1-27/1755619946483683328: error: timed out waiting for the condition on deployments/kueue-controller-manager

@trasc
Contributor Author

trasc commented Feb 8, 2024

ups, looks like there is still an issue: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1707/pull-kueue-test-e2e-main-1-27/1755619946483683328: error: timed out waiting for the condition on deployments/kueue-controller-manager

Yes, one of the managers did not get the certificates: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/1707/pull-kueue-test-e2e-main-1-27/1755619946483683328/artifacts/run-test-multikueue-e2e-1.27.3/kind-worker2-control-plane/pods/kueue-system_kueue-controller-manager-77656c747d-bmrfm_e1741fd8-969b-4768-966c-f7f6b4708a76/manager/0.log

We could give up if the certificates are not ready within X minutes; the pod would be replaced, and maybe we'd have better luck next time :)

Or extend that wait timeout.
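
The "give up after X minutes" option could look roughly like the sketch below. `waitForCerts` and `errCertsTimeout` are hypothetical names, not part of Kueue or any library; the sketch assumes `certsReady` is a channel that is closed once the certificates are in place, and that the caller exits on timeout so the process gets restarted.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errCertsTimeout is a hypothetical sentinel error used only in this sketch.
var errCertsTimeout = errors.New("certificates not ready before the deadline")

// waitForCerts blocks until certsReady is closed or the timeout elapses,
// whichever happens first.
func waitForCerts(certsReady <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-certsReady:
		return nil
	case <-timer.C:
		return errCertsTimeout
	}
}

func main() {
	certsReady := make(chan struct{})

	// Simulate a certificate rotator that finishes after a short delay.
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(certsReady)
	}()

	// A real manager would use a timeout of minutes and exit with a
	// non-zero status on error, leaving the restart to Kubernetes.
	if err := waitForCerts(certsReady, 2*time.Second); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("certificates ready, continuing startup")
}
```

Extending the timeout in the test script instead leaves the manager unchanged and only gives a slow start more time to complete.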

@trasc
Contributor Author

trasc commented Feb 8, 2024

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@mimowo
Contributor

mimowo commented Feb 8, 2024

Maybe let's see if a longer timeout helps, though it might be hard to confirm with certainty.

EDIT: this way, if the issue repeats with a timeout of, say, 6m, we will know that it is pretty much non-recoverable. Restarting the pod may cover a wide range of issues.

@trasc
Contributor Author

trasc commented Feb 8, 2024

Maybe let's see if a longer timeout helps, though it might be hard to confirm with certainty.

EDIT: this way, if the issue repeats with a timeout of, say, 6m, we will know that it is pretty much non-recoverable. Restarting the pod may cover a wide range of issues.

I'll try it tomorrow.

@trasc
Contributor Author

trasc commented Feb 8, 2024

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@alculquicondor
Contributor

Did you mean timeout in the bash script?

@mimowo
Copy link
Contributor

mimowo commented Feb 8, 2024

Did you mean timeout in the bash script?

Yes

@mimowo
Contributor

mimowo commented Feb 8, 2024

What still makes me curious is how the condition in the bash script was met in the first failure we saw. It must mean the server responded OK at least once and was terminated later.

@trasc
Contributor Author

trasc commented Feb 8, 2024

/test pull-kueue-test-e2e-main-1-26
/test pull-kueue-test-e2e-main-1-27
/test pull-kueue-test-e2e-main-1-28
/test pull-kueue-test-e2e-main-1-29

@alculquicondor
Contributor

/lgtm
/approve

Let's leave the timeout for another PR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 99f8fc8556c3471708d5b27babfa5c830cf6858b

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2024
@k8s-ci-robot k8s-ci-robot merged commit f39f9c7 into kubernetes-sigs:main Feb 9, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Feb 9, 2024
@trasc trasc deleted the dont-check-server-status-before-certs-ready branch February 9, 2024 08:19
@trasc
Contributor Author

trasc commented Feb 9, 2024

/cherry-pick release-0.5

@k8s-infra-cherrypick-robot
Contributor

@trasc: new pull request created: #1713

In response to this:

/cherry-pick release-0.5


kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
…rnetes-sigs#1707)

* [kueue] Check certificates readiness before the webhook server.

* Don't keep the caller busy.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

MultiKueue e2e test flaky on startup
5 participants