Readiness check behaves wrongly #3233

Closed
SvenW opened this issue Jun 7, 2023 · 14 comments · Fixed by #3276

SvenW commented Jun 7, 2023

Describe the bug
The default readiness check behaves strangely. If the router gets a SIGINT/SIGTERM it cleans up all remaining connections, but it keeps responding {"status":"UP"} throughout the whole termination process, so it keeps receiving new connections that it won't be able to terminate in time before being killed by Kubernetes. That is fine for the liveness probe, but not for the readiness probe.

To Reproduce
Steps to reproduce the behavior:

  1. Setup services '...'
  2. Look at the /health endpoint
  3. See that it reports an UP status even though it has received a SIGINT/SIGTERM

Expected behavior
As soon as the router gets a termination signal it should stop answering {"status":"UP"} on readiness probe checks. It's still fine to do so on the liveness probe. It would also be nice to be able to configure how long this termination grace period should be in the YAML.

Additional context
This is when running on GKE v1.24.12
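
For reference, the probes I'm talking about are plain HTTP checks against the router's /health endpoint, along these lines (just a sketch; the port and exact probe settings are assumptions based on the default health_check listener):

```
# Sketch of the probes being discussed; port 8088 is an assumption based on
# the router's default health_check listener.
livenessProbe:
  httpGet:
    path: /health
    port: 8088
readinessProbe:
  httpGet:
    path: /health
    port: 8088
```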


Meemaw commented Jun 7, 2023

You can already configure terminationGracePeriodSeconds:

terminationGracePeriodSeconds: 30
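
For clarity, that's a field on the pod spec, so in a Deployment it goes roughly here (the container name and image tag are just placeholders):

```
# Illustrative placement only; terminationGracePeriodSeconds sits on the pod
# spec, next to containers.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: router
          image: ghcr.io/apollographql/router:v1.x  # placeholder tag
```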


SvenW commented Jun 8, 2023

> You can already configure terminationGracePeriodSeconds:
>
> terminationGracePeriodSeconds: 30

Yes, but unfortunately that's not sufficient. For the readiness check to behave correctly, both that value and how long the router itself spends draining connections need to be configurable. We need to keep responding to requests while the LB is still removing us, and only after that removal stop serving new requests and drain the old connections.


SvenW commented Jun 12, 2023

I've worked around this by adding a shell script that catches the signals and creates a file when the router is shutting down. The readiness probe then checks whether that file exists to determine if the pod is still ready. This works, but it's not especially nice :(


garypen commented Jun 15, 2023

I agree this isn't a great situation. I've got a WIP PR which addresses this problem and I've bumped the priority to P1. My hope is that I'll find some time to fix this soon.

garypen self-assigned this Jun 15, 2023

voslartomas commented Jun 21, 2023

@SvenW Could you share what exactly you did to overcome this issue, please?

I have tried adding a `touch /tmp/shutdown` in a preStop hook and then checking for that file in the readiness probe, but I still see 502 errors when deploying a new version of the router.


SvenW commented Jun 21, 2023

> @SvenW Could you share what exactly you did to overcome this issue, please?
>
> I have tried adding a `touch /tmp/shutdown` in a preStop hook and then checking for that file in the readiness probe, but I still see 502 errors when deploying a new version of the router.

Sure. Disregard the ugliness in my little bash script here:

#!/bin/bash

# This first trap is superseded by the trap set a couple of lines below.
trap 'kill -TERM $PID' TERM INT
# Start the router wrapper in the background and remember its PID.
exec /dist/router_wrapper.sh &
PID=$!
# On SIGTERM/SIGINT, create the shutdown marker file ($SHUTDOWN_FILE points at
# the file the readiness probe checks, i.e. /tmp/shutdown).
trap 'touch $SHUTDOWN_FILE' TERM INT
# Returns when a trapped signal arrives (or the router exits on its own).
wait $PID
# Give the LB time to stop routing traffic to us; keep this shorter than
# terminationGracePeriodSeconds.
sleep 20
trap - TERM INT
# Now forward the shutdown to the router so it drains its remaining connections.
kill -SIGINT $PID
wait $PID
EXIT_STATUS=$?

For it to work I had to send a SIGINT to the router PID; if I didn't, it never received the termination signal as it should. In the router's deployment.yaml I also had to rewrite the readiness probe to a shell-based one like this:

readinessProbe:
  exec:
    command:
    - sh
    - -c
    - "if [ -f /tmp/shutdown ]; then exit 1; else exit 0; fi;"

@voslartomas

That makes sense, that is actually beautiful, thanks!

@voslartomas

@SvenW And do you have any special configuration in deployment?


SvenW commented Jun 21, 2023

> @SvenW And do you have any special configuration in deployment?

I just use the default terminationGracePeriodSeconds, which is 30s, and for the probe I've used the default values for now.

@voslartomas

All right, one more question: why do you have that 20s sleep?


SvenW commented Jun 21, 2023

> All right, one more question: why do you have that 20s sleep?

It's something less than the terminationGracePeriodSeconds, so we have a little time to report that we're not ready before the router drains the currently running requests for us.

@voslartomas

> All right, one more question: why do you have that 20s sleep?
>
> It's something less than the terminationGracePeriodSeconds, so we have a little time to report that we're not ready before the router drains the currently running requests for us.

But the first command, trap 'kill -TERM $PID' TERM INT, will send a TERM signal to the router as soon as the script receives one, right? What is that for? I thought we wanted to catch the signal, first create the shutdown file, wait 20s, and only then send the TERM signal to the router? Or am I missing something?


SvenW commented Jun 21, 2023

> All right, one more question: why do you have that 20s sleep?
>
> It's something less than the terminationGracePeriodSeconds, so we have a little time to report that we're not ready before the router drains the currently running requests for us.
>
> But the first command, trap 'kill -TERM $PID' TERM INT, will send a TERM signal to the router as soon as the script receives one, right? What is that for? I thought we wanted to catch the signal, first create the shutdown file, wait 20s, and only then send the TERM signal to the router? Or am I missing something?

The script does catch the TERM signal: the second trap replaces the first, so receiving TERM/INT only touches the shutdown file, and we then send SIGINT to the PID after the 20s sleep. You can test it locally by building your Dockerfile and sending a SIGINT or SIGTERM to the container to validate that it works as it should. If you have ideas for improving the bash script I would really appreciate it.

@voslartomas

All right, it works! I just had to add a preStop hook with a 10 second sleep for some reason; after that, 0 502 errors on deployment. Thanks!
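
For anyone hitting the same thing, the preStop hook is roughly this (10 seconds is just the value that happened to work for me):

```
# Rough shape of the preStop hook that removed the remaining 502s.
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
```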

garypen added a commit that referenced this issue Jun 30, 2023
The router is a complex beast and it's not entirely straightforward to
determine if a router is live/ready.

From a Kubernetes point of view, a service is:

- live [if it isn't deadlocked](https://www.linkedin.com/posts/llarsson_betterdevopsonkubernetes-devops-devsecops-activity-7018587202121076736-LRxE)
- ready if it is able to start accepting traffic

(For more details:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)

Our current health check isn't actually doing either of these things. It returns a payload which indicates whether the router is "healthy" or not, and it always returns a hard-coded "UP" with HTTP status code 200. As long as it is serving traffic it appears "healthy".

To address this we need:

1. A mechanism within the router to define whether it is ready or live.
2. A new interface whereby an interested party may determine if the
router is ready or live.

Both of these questions are significantly complicated by the fact that the router is architected to maintain connectivity in the face of changing configuration, schemas, etc. Multiple interested parties may be operating against the same router with different states (via different connections) and thus could, in theory, get different answers. However, I'm working on the assumption that, for k8s control purposes, a client would only be interested in the current (latest) state of the router, so that's what we'll track.

---

1. How do we know if the router is ready or live?

Ready

- Not currently processing a change to its set of inputs
- Is in state Running
- Resources available (open files limit not reached) (although this removes a pod from service, which would be bad)
- Is live

Live

- Health check enabled and responding

2. Extending our existing health interface

If we add support for query parameters named "ready" and "live" to our
existing health endpoint, then we can provide a backwards compatible
interface which leverages the existing "/health" endpoint. We would
support both POST and GET for maximum compatibility.

Sample queries:

```
curl -XPOST "http://localhost:8088/health?ready" OR curl  "http://localhost:8088/health?ready"

curl -XPOST "http://localhost:8088/health?live" OR curl "http://localhost:8088/health?live"
```

You could only query for one at a time (which I think is fine).

Note: Our health check returns a 200 with a JSON payload. That isn't easy to process in k8s, especially when you can't install useful utilities (like curl, jq, etc.) into your container image because of security and CVE concerns. When the router is not live or ready, we now return a 503, which k8s interprets as a probe failure.
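
As an illustration only (not part of this change), probes could then consume the endpoint roughly like this; the port comes from the sample queries above and the timing values are placeholders:

```
# Illustrative probe configuration; port from the sample queries above,
# timing values are placeholders.
livenessProbe:
  httpGet:
    path: "/health?live"
    port: 8088
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: "/health?ready"
    port: 8088
  periodSeconds: 5
```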

Fixes #3233 

<!-- start metadata -->

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final
PR is raised.

- [x] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or
ask for it to be labeled) as `manual test`

---------

Co-authored-by: Bryn Cooke <[email protected]>
Co-authored-by: Pierre Tondereau <[email protected]>
Co-authored-by: Avery Harnish <[email protected]>
Co-authored-by: bryn <[email protected]>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Geoffroy Couprie <[email protected]>
Co-authored-by: Simon Sapin <[email protected]>
Co-authored-by: o0Ignition0o <[email protected]>
Co-authored-by: Jesse Rosenberger <[email protected]>
Co-authored-by: Jeremy Lempereur <[email protected]>
Co-authored-by: Dylan Anthony <[email protected]>
BrynCooke added a commit that referenced this issue Jul 12, 2023