Improve liveness and readiness probes accuracy #1546

astefanutti · 2024-01-05T10:09:23Z

What would you like to be added:

The liveness and readiness probes should succeed only when the operator is fully operational. This means they should report a healthy status only when all the components, i.e., webhooks, extension API servers, and controllers, have started and are servicing requests.

Why is this needed:

At the moment, the liveness and readiness probes report a healthy status unconditionally, as soon as the controller-runtime manager has started, even though the operator and its components (webhooks, extension API servers, controllers) may not be operational yet.

That can happen, for example, during the time it takes for the certificates generated by cert-controller to be propagated into the secret volume mount.

In such cases, having the probes report an accurate status can help recover from issues, and lets other components correctly wait for the operator to be ready.

Completion requirements:

This enhancement requires the following artifacts:

Design doc
API change
Docs update

The artifacts should be linked in subsequent comments.

mimowo · 2024-02-07T14:03:10Z

I believe this issue is already addressed with #1676 (and cherry-picked to 0.5.x).
@astefanutti PTAL, if so then we can close.

astefanutti · 2024-02-08T09:21:45Z

@mimowo thanks for the update. I'd say #1676 fixes most of this issue.

There are a couple of things in the scope of this issue that we may still want to address like:

Including the status of the visibility API server in the readiness check
Improving the accuracy / usefulness of the liveness check

So I think we can close this one, and we can create some finer-grained ones for the two left-overs above. WDYT?

mimowo · 2024-02-08T09:37:45Z

Including the status of the visibility API server in the readiness check

Seems like a valid improvement indeed, otherwise we may be getting errors from the server.

Improving the accuracy / usefulness of the liveness check

Any concrete example, other than the visibility API server, where we could improve accuracy?

So I think we can close this one, and we can create some finer-grained ones for the two left-overs above. WDYT?

I prefer more fine-grained issues, so that we can prioritize / close them independently.

astefanutti · 2024-02-08T17:20:57Z

Including the status of the visibility API server in the readiness check

Seems like a valid improvement indeed, otherwise we may be getting errors from the server.

Agreed.

Improving the accuracy / usefulness of the liveness check
Any concrete example, other than the visibility API server, where we could improve accuracy?

I wonder if probing for the webhooks in the liveness probe, as done in the readiness probe now with #1676, would be useful to mitigate the situation where cert-controller might fail to generate / inject the certificates at start time, and possibly during cert rotation. Do you think that would be an improvement over the current liveness probe implementation?

So I think we can close this one, and we can create some finer-grained ones for the two left-overs above. WDYT?

I prefer more fine-grained issues, so that we can prioritize / close them independently.

Sounds good, I'll create them and close this one.

mimowo · 2024-02-08T17:45:04Z

I wonder if probing for the webhooks in the liveness probe

Oh, interesting. I assumed the liveness probe = readiness probe, but indeed the liveness probe = health probe in the controller manager: https://github.com/kubernetes-sigs/controller-runtime/blob/7032a3cc91d2afc4c2d54e4a4891cf75da9f75f5/pkg/manager/internal.go#L281-L285.

And currently, the health probe is just a ping in Kueue. So yes, I think having the liveness probe check the webhook server is better, so that we can see if the server is down. cc @trasc @alculquicondor

k8s-triage-robot · 2024-05-08T18:02:29Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tenzen-y · 2024-05-09T07:13:52Z

/remove-lifecycle stale

mbobrovskyi · 2024-07-08T08:07:47Z

/assign

mimowo · 2024-07-18T10:29:50Z

@astefanutti since this question was posted, we have added the readiness probe to Kueue in #1676 (and later improved it in follow-ups).

Do we still have some known gaps to cover in the readiness probe, or scenarios that we want to cover in the liveness probe?

If not, I would suggest parking (closing) this issue until we have such scenarios.

astefanutti · 2024-07-29T10:19:23Z

@mimowo apologies for the late reply.

I think what's been done to improve the readiness probe mostly addresses this issue, and I agree with your suggestion to close it and create some finer-grained ones if needed.

astefanutti · 2024-07-29T10:19:28Z

/close

k8s-ci-robot · 2024-07-29T10:19:33Z

@astefanutti: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

astefanutti added the kind/feature label Jan 5, 2024

k8s-ci-robot added the lifecycle/stale label May 8, 2024

k8s-ci-robot removed the lifecycle/stale label May 9, 2024

k8s-ci-robot assigned mbobrovskyi Jul 8, 2024

k8s-ci-robot closed this as completed Jul 29, 2024
