Fix Kueue startup by waiting for webhooks server using probes #1676
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,5 +55,6 @@ function cluster_kueue_deploy { | |
else | ||
kubectl apply --server-side -k test/e2e/config | ||
fi | ||
kubectl wait --for=condition=available --timeout=3m deployment/kueue-controller-manager -n kueue-system | ||
Can we do this somewhere else? In the MultiKueue case it could help to deploy in the manager and, instead of waiting for it to be ready right away, also deploy in the workers, then check the availability of all of them later on.
I see, this is a simple way of waiting for Kueue to be ready. I imagine users would like to use it in their setup scripts. Typically it takes a couple of seconds. One alternative is to revert the removal of Other alternatives I see modify
+1 on changing
Let's leave this for a follow up.
Sure, but I'm not sure what the preferred option is as a follow up. Is it returning to the |
||
} | ||
|
or could the `GetWebhookServer()` return different servers on different calls?
I cannot do this, then the replicas never get ready. The function needs to get the webhook server once ready. I think the webhook server is set asynchronously in controller-runtime.
.... very strange, maybe the cert manager is doing it ...
Maybe add a comment about this if confirmed.
All other references in sigs are using it directly.
https://github.com/search?q=org%3Akubernetes-sigs%20StartedChecker&type=code
I think it is indeed about the certs not being ready. One of the projects actually solves this problem too: https://github.com/kubernetes-sigs/hierarchical-namespaces/blob/d367fc08a261135b63d22aeb01c688317d9b7e02/cmd/manager/main.go#L296-L307.
Here they wait explicitly for the certs to be ready, I guess we could do the same so that we don't get the transient errors logged. WDYT?
TBH, I'm not sure why either. What I verified experimentally is that if I just call `GetWebhookServer` before `WaitForCertsReady`, then the webhook is not initialized, and it stays that way.
I then get the following logs, even when keeping the readyz probe as it is currently but just invoking `GetWebhookServer()` before registering the probe:
I suspect this has something to do with the one-time memoization of the not-yet-initialized webhook server as a runnable here: https://github.com/kubernetes-sigs/controller-runtime/blob/73519a939e27c8465c6c49694356cbd66daf1677/pkg/manager/internal.go#L258.
However, I'm not sure of the details. I would probably need to investigate further with a debugger. Let me know if you have some ideas on how to get to the bottom of it.
Let me put what I know so far in the comment.
I meant transient errors before the webhook server is listening, but I didn't know what they look like in practice before writing. I was thinking maybe we have many errors about certs not being ready. From what I can see, we get these errors:
because the checker checks the `started` field before attempting the connection. So, there is only a small window where we can get transient errors, between setting `started: true` and actually listening: https://github.com/kubernetes-sigs/controller-runtime/blob/8475c55f3e00e5611de0880eccd785efa85e8e38/pkg/webhook/server.go#L261-L263. However, I have never seen these errors so far.
It's normal behavior when the pod is not yet ready (previously we were "ready" from the start).
Maybe something like:
could decrease the number of messages.
But I find it a bit odd to block the HTTP handler.
I was thinking about waiting explicitly for `certsReady` (it is done in the `hierarchical-namespaces` project, as pointed out above).
However, I think it doesn't improve the messages, because we already wait for the certs to be ready: `mgr.GetWebhookServer().StartedChecker()` first checks if `started=true` (`started=true` implicitly means that the certs are ready). If the certs aren't ready we get the nice message: `{"checker": "readyz", "error": "webhook server has not been started yet"}`. I think this is enough, and there is no reason to complicate the code.
Ok, I think I have some new insights:
If we call `GetWebhookServer` early, then the not fully initialized webhook server runnable is registered here: https://github.com/kubernetes-sigs/controller-runtime/blob/73519a939e27c8465c6c49694356cbd66daf1677/pkg/manager/internal.go#L257.
When the manager starts, it starts all runnables, including the webhook server, but that fails. In this case I see in the logs early on:
Note the presence of "Stopping and waiting for non leader election runnables" just after the start.
OTOH (as in this PR), if we delay calling `GetWebhookServer`, then the runnables are added after the certs are ready, and only then is the webhook server started, so it starts successfully. The log `Starting webhook server` then appears much later (after certs ready).
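The eager-versus-delayed behavior described here can be reproduced with a toy model. Everything in it is hypothetical (`toyManager`, `toyServer`, `lazyReadyz`) and it is not controller-runtime code. It only assumes what the comment states: the server is created and registered once, queued runnables are started when the manager starts, and starting the server fails without certs. The `certsReady` guard inside the probe is an extra assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyServer and toyManager are hypothetical stand-ins for this sketch.
type toyServer struct{ started bool }

type toyManager struct {
	once       sync.Once
	server     *toyServer
	certsReady bool
	running    bool
	pending    []*toyServer
}

// startServer fails while the certificates are missing.
func (m *toyManager) startServer(s *toyServer) error {
	if !m.certsReady {
		return errors.New("certs not ready")
	}
	s.started = true
	return nil
}

// GetWebhookServer creates and registers the server exactly once. Before
// Start the server is queued; after Start it is started immediately.
func (m *toyManager) GetWebhookServer() *toyServer {
	m.once.Do(func() {
		m.server = &toyServer{}
		if m.running {
			_ = m.startServer(m.server)
			return
		}
		m.pending = append(m.pending, m.server)
	})
	return m.server
}

// Start starts every queued runnable and gives up on the first failure.
func (m *toyManager) Start() error {
	for _, s := range m.pending {
		if err := m.startServer(s); err != nil {
			return fmt.Errorf("manager stopping: %w", err)
		}
	}
	m.running = true
	return nil
}

// lazyReadyz defers the GetWebhookServer call until the probe actually runs.
func (m *toyManager) lazyReadyz() func() error {
	return func() error {
		if !m.certsReady { // assumption of this sketch
			return errors.New("certs not ready")
		}
		if !m.GetWebhookServer().started {
			return errors.New("webhook server has not been started yet")
		}
		return nil
	}
}

// eagerScenario registers the server before the certs exist.
func eagerScenario() error {
	m := &toyManager{}
	_ = m.GetWebhookServer()
	return m.Start()
}

// lazyScenario only registers the probe; the server is resolved later.
func lazyScenario() (startErr, probeErr error) {
	m := &toyManager{}
	probe := m.lazyReadyz()
	startErr = m.Start()
	m.certsReady = true
	probeErr = probe()
	return startErr, probeErr
}

func main() {
	fmt.Println("eager:", eagerScenario())
	startErr, probeErr := lazyScenario()
	fmt.Println("lazy:", startErr, probeErr)
}
```

Wrapping the lookup in a closure is what moves the one-time registration from setup time to probe time in this model.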
slightly polished the comment