Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kueue] Check certificates readiness before the webhook server. #1707

Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 9 additions & 3 deletions cmd/kueue/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ package main

import (
"context"
"errors"
"flag"
"net/http"
"os"
Expand Down Expand Up @@ -165,7 +166,7 @@ func main() {

serverVersionFetcher := setupServerVersionFetcher(mgr, kubeConfig)

setupProbeEndpoints(mgr)
setupProbeEndpoints(mgr, certsReady)
// Cert won't be ready until manager starts, so start a goroutine here which
// will block until the cert is ready before setting up the controllers.
// Controllers who register after manager starts will start directly.
Expand Down Expand Up @@ -276,7 +277,7 @@ func setupControllers(mgr ctrl.Manager, cCache *cache.Cache, queues *queue.Manag
}

// setupProbeEndpoints registers the health endpoints
func setupProbeEndpoints(mgr ctrl.Manager) {
func setupProbeEndpoints(mgr ctrl.Manager, certsReady <-chan struct{}) {
defer setupLog.Info("Probe endpoints are configured on healthz and readyz")

if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
Expand All @@ -292,7 +293,12 @@ func setupProbeEndpoints(mgr ctrl.Manager) {
// the function, otherwise a not fully-initialized webhook server (without
// ready certs) fails the start of the manager.
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
return mgr.GetWebhookServer().StartedChecker()(req)
select {
case <-certsReady:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we discussed this originally I wasn't sure if this is needed #1676 (comment). And I'm still not 100% sure, but the change reflects the intention, and it should be harmless. Also, the analogous change is done in another project: https://github.com/kubernetes-sigs/hierarchical-namespaces/blob/d367fc08a261135b63d22aeb01c688317d9b7e02/cmd/manager/main.go#L296-L307, so I'm happy to try this fix.

Copy link
Contributor

@alculquicondor alculquicondor Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trasc can you give a rationale on why you think this fixes the issue?

From the logs, it does look like there is some race condition between setting the certs and starting the manager, so I can see the relationship. But the specific set of events are unclear to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the old thread: #1676 (comment):

What I verified experimentally is that if I just call GetWebhookServer before WaitForCertsReady then the webhook is not initialized, and it stays that way.

I did this experiments just by modifying the code and calling GetWebhookServer before or after

cert.WaitForCertsReady(setupLog, certsReady)
.

I think what might be now happening when the readiness probe is called between the readiness server is listening (probably does not require certs) and the certs are initialized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trasc can you give a rationale on why you think this fixes the issue?

From the logs, it does look like there is some race condition between setting the certs and starting the manager, so I can see the relationship. But the specific set of events are unclear to me.

Unfortunately I don't have anything concert, I suspect that it the original case, the Start hit in the middle of a rotation. But it's just a suspicion.

return mgr.GetWebhookServer().StartedChecker()(req)
default:
return errors.New("certificates are not ready")
}
}); err != nil {
setupLog.Error(err, "unable to set up ready check")
os.Exit(1)
Expand Down