
Allow sidecars to specify probe port #8949

Closed
emaildanwilson opened this issue Aug 6, 2020 · 24 comments · Fixed by #14853
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Well-understood/specified features, ready for coding. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Issues which should be fixed (post-triage)
Milestone

Comments

@emaildanwilson
Contributor

Currently the API for container livenessProbe and readinessProbe doesn't allow port to be specified since containerPort is already provided. In a multi-container use case, containerPort can only be specified on one container which provides a clear way to identify the container and port which receives traffic. As a side effect, it's currently not possible to configure a sidecar container with liveness/readiness probes using either httpGet or tcpSocket.

A non-ideal workaround exists for readinessProbe: design the primary container's health-check code to also check the health of any dependent sidecars. This isn't a typical consideration for the service developer and would not work in many cases for third-party containers.

Since Kubernetes restarts the specific container that fails a livenessProbe, and NOT the entire pod, the same workaround is not effective for livenessProbe and would instead result in unnecessary restarts of the primary container.
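For concreteness, here is a hedged sketch of the kind of spec this issue asks for: a multi-container Knative Service whose sidecar declares its own probe. All names, images, and ports below are hypothetical, and at the time of filing the validation webhook rejects the sidecar's probe fields:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-svc                   # hypothetical name
spec:
  template:
    spec:
      containers:
        - name: primary
          image: example.com/app      # hypothetical image
          ports:
            - containerPort: 8080     # only one container may declare a port
        - name: sidecar
          image: example.com/sidecar  # hypothetical image
          # No containerPort may be set here, so the probe needs its own
          # explicit port; this issue asks that such a probe be accepted.
          readinessProbe:
            httpGet:
              path: /healthz
              port: 9090
```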

@emaildanwilson emaildanwilson added the kind/feature Well-understood/specified features, ready for coding. label Aug 6, 2020
@emaildanwilson
Contributor Author

I can pick this up and work on a PR. Before I do, though...

Is there agreement that:

  1. This should be fixed
  2. The approach is to update the field mask to allow livenessProbe and readinessProbe to be defined on sidecar containers only.

cc @markusthoemmes

@mattmoor
Member

The workaround of proxying through the single exposed container doesn't work for exec probes, which are the de facto way of health-checking gRPC services.

The point about the liveness probe restarting the container rather than the pod is another great point, and another problem with this suggested workaround.

My personal inclination would be to see us allow {liveness,readiness}Probe on sidecars.

cc @dprotaso @markusthoemmes @savitaashture
cc @dgerd (for historical purposes, and because we miss you 😉 )

@github-actions

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2020
@markusthoemmes
Contributor

Ack, allowing the probes seems okay for sidecar containers.

@mattmoor
Member

/lifecycle frozen

@knative-prow-robot knative-prow-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 14, 2021
@evankanderson
Member

/area
/kind api-change

/good-first-issue
Since it sounds like there's agreement on the general API shape. If not, use /remove-good-first-issue.

/triage acecpted

@knative-prow-robot
Contributor

@evankanderson:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/area
/kind api-change

/good-first-issue
Since it sounds like there's agreement on the general API shape. If not, use /remove-good-first-issue.

/triage acecpted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot
Contributor

@evankanderson: The label(s) area/api-change, triage/acecpted cannot be applied, because the repository doesn't have them.

In response to this:

/area
/kind api-change

/good-first-issue
Since it sounds like there's agreement on the general API shape. If not, use /remove-good-first-issue.

/triage acecpted

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Mar 22, 2021
@evankanderson
Member

/kind api-change
/triage accepted

@knative-prow-robot knative-prow-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API triage/accepted Issues which should be fixed (post-triage) labels Mar 22, 2021
@dprotaso dprotaso added this to the 0.24.x milestone Jun 12, 2021
@dprotaso dprotaso modified the milestones: 0.24.x, 0.25.x Jul 6, 2021
@salaboy
Member

salaboy commented Jul 7, 2021

/assign

@salaboy
Member

salaboy commented Jul 7, 2021

Ok, so I started looking at this issue (as my "good first issue"); please remember that I am new to the internals.
This is what I understood so far:

  1. For the primary container (the one which specifies a containerPort), the readiness probe is cleared (set to nil) because the queue-proxy executes the readiness probe instead. This is done here: https://github.com/knative/serving/blob/main/pkg/reconciler/revision/resources/deploy.go#L163

  2. For the primary container (the one which specifies a containerPort), the liveness probe, if HTTP is used, is overridden to networking.BackendHTTPPort so it is redirected through the queue-proxy; if TCP is used, the port provided by the user is kept.

  3. For all sidecar containers, none of this is done.

My questions based on this:

  1. Should we redirect the probes for the sidecar containers to the queue-proxy container, as is done for the main container, or should the sidecars' probes be executed directly by the kubelet? I can't think of a reason why they shouldn't be.
  2. Internally, no manipulation is done for sidecar probes, but there is a validation webhook screaming at you when you try to add probes to sidecar containers:
Error from server (BadRequest): error when creating "service.yaml": admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.containers[1].livenessProbe, spec.template.spec.containers[1].livenessProbe.initialDelaySeconds, spec.template.spec.containers[1].livenessProbe.periodSeconds

Update: This validation can be turned off by removing this check: https://github.com/knative/serving/blob/main/pkg/apis/serving/k8s_validation.go#L362

It would be great to understand historically why this was not allowed, as my gut feeling tells me that having probes coming from different places might cause race conditions or random behaviour.

I will keep adding my findings here until I have a PR with some changes to discuss in more detail. If someone can shed some light on the expected behaviour, that would be great.
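To make the probe rewrite described in points 1 and 2 above easier to follow, here is an illustrative before/after of the probe fields on the primary container in the generated Deployment. This is a sketch only; the exact generated fields may differ, and the queue-proxy port value shown (8012 for networking.BackendHTTPPort) is an assumption:

```yaml
# User-specified probes on the primary container (Revision spec):
readinessProbe:
  httpGet:
    path: /ready
    port: 8080              # the user's containerPort
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080

# What the reconciler emits for the primary container in the Deployment:
# - the readinessProbe is removed (set to nil); the queue-proxy performs
#   it on the container's behalf (point 1 above)
# - the HTTP livenessProbe is redirected through the queue-proxy (point 2)
livenessProbe:
  httpGet:
    path: /healthz
    port: 8012              # networking.BackendHTTPPort (assumed value)
```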

@salaboy
Member

salaboy commented Jul 8, 2021

@dprotaso @evankanderson I've created a simple PR with a test showing that the probes are maintained if the validation is removed from the validation webhook. The question still remains: do we want the sidecar containers' probes to be performed by the kubelet?

I feel that I am missing context here to make a decision on that. In my head it makes some sense to only manage the "serving" container, but at the same time I realise that I don't fully understand the entire lifecycle of KServices, which might require the queue-proxy container to run the probes for sidecars as well.

@dprotaso dprotaso modified the milestones: 0.25.x, 0.26.x Aug 26, 2021
@dprotaso dprotaso removed this from the 0.26.x milestone Oct 6, 2021
@dprotaso dprotaso modified the milestones: v1.1.0, v1.2.0 Jan 4, 2022
@dprotaso dprotaso modified the milestones: v1.2.0, v1.3.0 Feb 2, 2022
@dprotaso dprotaso modified the milestones: v1.3.0, v1.4.0 Mar 10, 2022
@dprotaso dprotaso modified the milestones: v1.4.0, v1.5.0 May 10, 2022
@dprotaso dprotaso modified the milestones: v1.5.0, Icebox Jul 14, 2022
@vishal-chdhry
Member

/assign

@vishal-chdhry
Member

Hi @dprotaso! I would love to contribute to Knative, but I am new to this project. Can you please give me some pointers on what to change?

@vishal-chdhry vishal-chdhry removed their assignment Feb 8, 2023
@evankanderson
Member

Hey, I'm not Dave, but I saw your note today.

I think the notes left by @salaboy are roughly correct -- when we added sidecar containers, we weren't sure how liveness and readiness probes should interact with the overall Pod health and, in particular, with the ability of the application and queue-proxy to serve traffic.

I think relaxing the validation Mauricio pointed to is possible and "safe" -- the largest risk would be an increase in cold-start time for pods which define readiness probes on their additional containers.

We intercept the probes on the primary container (as determined by specifying one containerPort, rather than zero) to enable the queue-proxy and activator to probe for initial readiness at a faster interval than supported by kubelet, but this mechanism should either continue to work (since it was independent of kubelet-reported readiness) or should introduce additional readiness delay (delaying cold start, but this can be a documented outcome).

@KaranJagtiani
Contributor

Is this issue still valid? I'm new to Knative and I want to contribute. Based on the history of this issue, the PR (that was raised and closed), and the code, what I understand is that relaxing the liveness probe validation on the sidecar containers would be safe to implement, but we still need to keep the primary container validation based on containerPort. Am I getting this right, or are there some other data points that I need to understand? @evankanderson @dprotaso

@evankanderson
Member

I don't think any work has happened on this since my comment above, so this should be available for work.

@KaranJagtiani
Contributor

In that case can I be assigned the issue? @evankanderson

@evankanderson
Member

/assign @KaranJagtiani

@dprotaso
Member

dprotaso commented Sep 11, 2023

We intercept the probes on the primary container (as determined by specifying one containerPort, rather than zero) to enable the queue-proxy and activator to probe for initial readiness at a faster interval than supported by kubelet, but this mechanism should either continue to work (since it was independent of kubelet-reported readiness) or should introduce additional readiness delay (delaying cold start, but this can be a documented outcome).

The activator also does active probing (to the queue-proxy) after initial readiness in order track healthy endpoints for load balancing decisions.

I think relaxing the validation Mauricio pointed to is possible and "safe"

Thinking about this, I'd say it's not -- we're deviating from what's considered a 'ready' Pod by only considering a subset of containers.

So it seems like we could:

  1. Ensure all probes go through the queue-proxy (excluding exec probes)
  2. When probes exist on the sidecar containers, the activator skips its probing and relies on the K8s API to understand Pod readiness (this already happens)
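As a concrete illustration of why option 1 cannot cover everything: an exec probe (the common pattern for gRPC health checks mentioned earlier in this thread) runs inside the target container, so the queue-proxy has no way to perform it on a sidecar's behalf. The binary and port below are hypothetical:

```yaml
# A sidecar exec probe that cannot be routed through the queue-proxy;
# for probes like this, falling back to Pod readiness (option 2) is the
# only choice.
livenessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:9090"]  # hypothetical binary and port
```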

@dprotaso
Member

dprotaso commented Sep 19, 2023

To elaborate a bit more: even if we complete #1, since our sidecar is unable to perform exec probes, we should probably fall back on #2 and have the activator look at Pod readiness.

I believe we currently disable probing when we have an exec probe configured for a revision

@ReToCode
Member

I believe we currently disable probing when we have an exec probe configured for a revision

Only partially; we still do some TCP probing from the queue-proxy. A full test set can be found in the FT. I'm working on PR #14853 to enable multi-container probing that works for liveness and readiness probes.

@ReToCode
Member

ReToCode commented Feb 1, 2024

/assign
