Fix Kueue startup by waiting for webhooks server using probes #1676
Conversation
Force-pushed from ccd05b6 to 2c7918f
All green, retrying to get more signal:
/test pull-kueue-test-e2e-main-1-26
cmd/kueue/main.go
Outdated
	return mgr.GetWebhookServer().StartedChecker()(req)
}

if err := mgr.AddHealthzCheck("healthz", readyz); err != nil {
Why change the healthz?
I'm not sure the change to healthz is required. I will test if this can be simplified. Do you know what healthz is used for?
it should indicate if the container is running, https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#types-of-probe
Good call, reverted setting it, going to test.
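(For context, a minimal sketch of how the probes end up wired after this revert; it assumes the standard controller-runtime healthz package and the usual main.go scaffolding such as setupLog, and may not match Kueue's main.go exactly.)

// healthz stays a plain liveness-style ping: it only reports that the process is up.
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
	setupLog.Error(err, "unable to set up health check")
	os.Exit(1)
}
// readyz gates readiness on the webhook server having started, so the Pod only
// receives traffic once the webhooks can actually be served.
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	return mgr.GetWebhookServer().StartedChecker()(req)
}); err != nil {
	setupLog.Error(err, "unable to set up ready check")
	os.Exit(1)
}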
@@ -55,5 +55,6 @@ function cluster_kueue_deploy {
  else
    kubectl apply --server-side -k test/e2e/config
  fi
  kubectl wait --for=condition=available --timeout=3m deployment/kueue-controller-manager -n kueue-system
Can we do this somewhere else? It could help if, in the case of MultiKueue, we deploy in the manager and, instead of waiting for it to be ready now, we also deploy in the workers and then check the availability of all of them later on.
I see, this is a simple way of waiting for Kueue to be ready. I imagine users would like to use it in their setup scripts. Typically it takes a couple of seconds.
One alternative is to revert the removal of KueueReadyForTesting, but I don't like it, because it recreates objects; this takes time and could cover up some issues (as it actually does currently in the single-cluster case). Actually, in this PR I wanted to remove it to test that it is not necessary.
Other alternatives I see are to modify KueueReadyForTesting to check the deployment status as in this line, or to run the entire setup of the different clusters for MultiKueue in parallel at the script level.
+1 on changing KueueReadyForTesting to an availability check for the deployment.
Let's leave this for a follow up.
Sure, but I'm not sure what the preferred option for the follow-up is. Is it returning to KueueReadyForTesting, but waiting for the deployment to be available, as in the command line?
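(For concreteness, one possible shape of that follow-up, as a sketch only: a test helper that waits for the kueue-controller-manager Deployment to report Available=True. The function name and the use of gomega and the controller-runtime client are assumptions for illustration, not the existing KueueReadyForTesting code.)

package util // illustrative package for the sketch

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForKueueAvailable is a hypothetical replacement for KueueReadyForTesting:
// instead of creating and deleting probe objects, it waits until the
// kueue-controller-manager Deployment reports the Available condition as True.
func waitForKueueAvailable(ctx context.Context, c client.Client) {
	key := types.NamespacedName{Namespace: "kueue-system", Name: "kueue-controller-manager"}
	gomega.Eventually(func(g gomega.Gomega) {
		deployment := &appsv1.Deployment{}
		g.Expect(c.Get(ctx, key, deployment)).To(gomega.Succeed())
		available := false
		for _, cond := range deployment.Status.Conditions {
			if cond.Type == appsv1.DeploymentAvailable && cond.Status == corev1.ConditionTrue {
				available = true
			}
		}
		g.Expect(available).To(gomega.BeTrue())
	}, 3*time.Minute, time.Second).Should(gomega.Succeed())
}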
/test pull-kueue-test-e2e-main-1-26
if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
	return mgr.GetWebhookServer().StartedChecker()(req)
}); err != nil {
Suggested change:
- if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
- 	return mgr.GetWebhookServer().StartedChecker()(req)
- }); err != nil {
+ if err := mgr.AddReadyzCheck("readyz", mgr.GetWebhookServer().StartedChecker()); err != nil {
Or could GetWebhookServer() return different servers on different calls?
I cannot do this; then the replicas never get ready. The function needs to get the webhook server once it is ready. I think the webhook server is set asynchronously in controller-runtime.
... very strange, maybe the cert manager is doing it ...
Maybe add a comment about this if confirmed.
All other references in sigs are using it directly:
https://github.com/search?q=org%3Akubernetes-sigs%20StartedChecker&type=code
I think it is indeed about the certs not being ready. One of the projects actually solves this problem too: https://github.com/kubernetes-sigs/hierarchical-namespaces/blob/d367fc08a261135b63d22aeb01c688317d9b7e02/cmd/manager/main.go#L296-L307.
Here they wait explicitly for the certs to be ready, I guess we could do the same so that we don't get the transient errors logged. WDYT?
GetWebhookServer().StartedChecker is what we need to determine the readiness. I'm only not sure why we need a closure for calling it, since from what I can see the field holding the webhook is private and only set via an option when creating the manager; however, it's not a huge problem.
TBH, I'm not sure why either. What I verified experimentally is that if I just call GetWebhookServer before WaitForCertsReady, then the webhook is not initialized, and it stays that way.
I get the following logs even if I keep the readyz probe as it is currently, but just invoke GetWebhookServer() before registering the probe:
2024-01-31T19:49:50.466048439Z INFO setup kueue/main.go:291 readyz {"ws": {"Options":{"Host":"","Port":9443,"CertDir":"/tmp/k8s-webhook-server/serving-certs","CertName":"tls.crt","KeyName":"tls.key","ClientCAName":"","TLSOpts":null,"WebhookMux":{}}}, "err": "webhook server has not been started yet"}
I suspect this has something to do with the one-time registration of the not-yet-initialized webhook server as a runnable here: https://github.com/kubernetes-sigs/controller-runtime/blob/73519a939e27c8465c6c49694356cbd66daf1677/pkg/manager/internal.go#L258.
However, I'm not sure of the details. I would need to investigate further, probably with a debugger. Let me know if you have some ideas on how to get to the bottom of it.
Let me put what I know so far in the comment.
And:
"I guess we could do the same so that we don't get the transient errors logged."
What kind of errors?
I meant transient errors before the webhook server is listening, but I didn't know what they look like in practice before writing. I was thinking we might get many errors about the certs not being ready. From what I can see we get these errors:
2024-01-31T13:09:29.832263236Z stderr F 2024-01-31T13:09:29.832081013Z DEBUG controller-runtime.healthz healthz/healthz.go:60 healthz check failed {"checker": "readyz", "error": "webhook server has not been started yet"}
because the checker checks the started field before attempting the connection. So, there is only a small window where we can get transient errors, between setting started: true and actually listening: https://github.com/kubernetes-sigs/controller-runtime/blob/8475c55f3e00e5611de0880eccd785efa85e8e38/pkg/webhook/server.go#L261-L263. However, I have never seen these errors so far.
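(For reference, the checker behavior described above, as a paraphrased sketch only; webhookState and its fields are illustrative stand-ins, not the controller-runtime types — see the linked server.go for the real code.)

package sketch // illustrative package for the sketch

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net/http"
	"sync"
)

// webhookState is an illustrative stand-in for the webhook server's internal state.
type webhookState struct {
	mu      sync.Mutex
	started bool
	addr    string // e.g. ":9443"
}

// startedChecker mimics the two-step check discussed above.
func startedChecker(s *webhookState) func(*http.Request) error {
	return func(_ *http.Request) error {
		s.mu.Lock()
		defer s.mu.Unlock()
		if !s.started {
			// This is the message seen in the readyz logs before the certs are ready.
			return errors.New("webhook server has not been started yet")
		}
		// Small window: started can become true slightly before the listener
		// accepts connections, so this dial may still fail transiently.
		conn, err := tls.Dial("tcp", s.addr, &tls.Config{InsecureSkipVerify: true})
		if err != nil {
			return fmt.Errorf("webhook server is not reachable: %w", err)
		}
		return conn.Close()
	}
}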
"I have never seen these errors so far."
It's normal behavior when the pod is not yet ready (previously we were "ready" from the start).
Maybe something like:
Suggested change:
- if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
- 	return mgr.GetWebhookServer().StartedChecker()(req)
- }); err != nil {
+ if err := mgr.AddReadyzCheck("readyz", func(req *http.Request) error {
+ 	select {
+ 	case <-certsReady:
+ 		return mgr.GetWebhookServer().StartedChecker()(req)
+ 	case <-req.Context().Done():
+ 		return errors.New("request canceled")
+ 	}
+ }); err != nil {
Could decrease the number of messages. But I find it a bit odd to block the HTTP handler.
I was thinking about waiting explicitly for certsReady (it is done in the hierarchical-namespaces project, as pointed out above).
However, I think it doesn't improve the messages, because we already wait for the certs to be ready: mgr.GetWebhookServer().StartedChecker() first checks if started=true (and started=true implicitly means that the certs are ready). If the certs aren't ready we get the nice message: {"checker": "readyz", "error": "webhook server has not been started yet"}. I think this is enough, and there is no reason to complicate the code.
Ok, I think I have some new insights into:
"I'm only not sure why we need a closure for calling it, since from what I can see the field holding the webhook is private and only set via an option when creating the manager; however, it's not a huge problem."
If we call GetWebhookServer early, then the not fully initialized webhook server runnable is registered here: https://github.com/kubernetes-sigs/controller-runtime/blob/73519a939e27c8465c6c49694356cbd66daf1677/pkg/manager/internal.go#L257.
When the manager starts, it starts all runnables, including the webhook server, but that fails; in this case I see in the logs early on:
2024-02-01T12:44:13.308211223Z INFO controller-runtime.webhook webhook/server.go:191 Starting webhook server
2024-02-01T12:44:13.40926745Z INFO controller/controller.go:178 Starting EventSource {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-02-01T12:44:13.40932954Z INFO controller/controller.go:178 Starting EventSource {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-02-01T12:44:13.40933832Z INFO manager/internal.go:516 Stopping and waiting for non leader election runnables
Note the presence of "Stopping and waiting for non leader election runnables" just after the start.
OTOH (as in this PR), if we delay calling GetWebhookServer, then the runnable is added after the certs are ready, and only then is the webhook server started, and it starts successfully. Then the Starting webhook server log appears much later (after the certs are ready).
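(To make the ordering concrete, a rough sketch of the startup sequence this implies; setupProbeEndpoints, setupWebhooks, and the certsReady wiring are assumptions based on the discussion above, not the exact Kueue main.go.)

// Probes are registered before the manager starts; GetWebhookServer() is
// deliberately not called here, only inside the readyz closure.
setupProbeEndpoints(mgr) // hypothetical helper wrapping AddHealthzCheck/AddReadyzCheck

// Webhook setup runs in a goroutine that first waits for the certs, so the
// webhook server runnable ends up being registered only after the certs exist.
go func() {
	<-certsReady       // closed by the cert rotator once the certs are ready
	setupWebhooks(mgr) // hypothetical helper; this is where GetWebhookServer() gets called
}()

// Only now are the runnables started; per the observation above, the webhook
// server then starts successfully instead of being stopped right away.
if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
	setupLog.Error(err, "problem running manager")
	os.Exit(1)
}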
slightly polished the comment
All green, retrying to get more signal:
Force-pushed from 2b70e0c to d2e420b
/lgtm
/approve
/hold
for a couple more re-runs just in case.
LGTM label has been added. Git tree hash: 672838e0922a62f82b328cd35c211a88005ba5c6
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alculquicondor, mimowo
/test pull-kueue-test-e2e-main-1-26
/hold cancel
@alculquicondor @tenzen-y I have added a release note, because this is actually a user-facing change IMO. Previously it was really hard to tell when Kueue can accept early requests, especially in the HA mode. Should we consider cherry-picking?
I'm fine with cherry-picking. This patch would be helpful for both HA and non-HA users. Without this patch, we need to manually check if the manager has started, since the readiness probe server starts in spite of not being ready.
+1 on cherry pick
…etes-sigs#1676)
* Wait for webhooks server using probes
* Delete KueueReadyForTesting
* revert the setting of healthz
* Add a comment about the readyz probe
What type of PR is this?
/kind bug
What this PR does / why we need it:
To fix the flaky test.
To also let users wait for the Kueue webhooks to be ready without the need for creating/deleting objects as in KueueReadyForTesting.
Which issue(s) this PR fixes:
Fixes #1658
Special notes for your reviewer:
Tested locally, trying to see what will happen on GH CI.
Does this PR introduce a user-facing change?