Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proxy exits with wrong code after SIGTERM #1529

Closed
cdaguerre opened this issue Nov 5, 2022 · 5 comments · Fixed by #1530
Closed

Proxy exits with wrong code after SIGTERM #1529

cdaguerre opened this issue Nov 5, 2022 · 5 comments · Fixed by #1530
Assignees
Labels
priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@cdaguerre
Copy link

Bug Description

Graceful termination seems impossible in v2. Whether I set --max-sigterm-delay or not doesn't matter, process always has exit code 137, even when there are no active connections.

Example code (or command)

- name: cloudsql-proxy
  image: {{ $root.Values.sqlProxy.image }}
  command:
    - '/kubexit/kubexit'
    - '/cloud-sql-proxy'
    - '--address=127.0.0.1'
    - '--max-sigterm-delay=10s'
    - '--structured-logs'
    - '--prometheus'
    {{- range $k, $v := $root.Values.sqlProxy.connections }}
    - '{{ $v.dsn }}?port={{ $v.port }}'
    {{- end }}
  lifecycle:
    preStop:
      exec:
        command: ['sleep', '10']

Stacktrace

Capture d’écran 2022-11-05 à 18 05 03

How to reproduce

  1. ?
  2. ?

Environment

  1. OS type and version: GKE 1.24.4-gke.800
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.0.0-preview.2
@cdaguerre cdaguerre added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Nov 5, 2022
@enocom enocom added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: question Request for information or clarification. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Nov 6, 2022
@enocom
Copy link
Member

enocom commented Nov 6, 2022

Thanks for the issue @cdaguerre.

Looking at these exit codes again, I see that SIGTERM is conventionally 143 and SIGKILL is 137.

I think what this means is that we should be returning exit code 143 instead of 137 here: https://github.com/GoogleCloudPlatform/cloud-sql-proxy/blob/main/cmd/errors.go#L29.

Cf. https://komodor.com/learn/sigterm-signal-15-exit-code-143-linux-graceful-termination/

Is that your expectation?

@enocom enocom added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. labels Nov 6, 2022
@cdaguerre
Copy link
Author

cdaguerre commented Nov 7, 2022

Hey @enocom
Actually in v1, even when the proxy received a SIGTERM, the exit code would be 0 if there were no more active connections.
The use case is for running the proxy as a sidecar in kubernetes Jobs that run to completion.
You need to have a way to kill the cloudsql-proxy sidecar container when the main container finished.
Any exit code other then 0 from one of the containers causes the job to be considered failed.

Basically, ideal behaviour to gracefully terminate on SIGTERM taking the --max-sigterm-delay into account.

  • Exit 0 as soon as there is no active connection anymore
  • After sigterm delay,
    • exit 137 (SIGKILL) if there is an active connection
    • exit 0 if there is no active connection (probably already exited before delay...)

EDIT: added link to v1 code:

// Shutdown waits up to a given amount of time for all active connections to

@enocom enocom changed the title Graceful termination in v2 Proxy exits with wrong code after SIGTERM Nov 8, 2022
@enocom
Copy link
Member

enocom commented Nov 9, 2022

Thanks for the clarification.

There are two issues here:

  1. The v2 proxy uses the wrong exit code for SIGTERM. We'll fix that here.
  2. There's no way to cleanly shutdown the proxy without some wrapper script. That's tracked here: Provide a way to shutdown the process from another container #828.

Since we've made this worse in v2 with the new non-0 exit code, I'll prioritize adding a /quitquitquit endpoint to solve this.

@adamstrawson
Copy link

Hey @enocom Just to dig up this issue again, it seems that we're hitting the same behaviour that @cdaguerre mentions here despite having no open connections (exit code: 143)

We use cloud-sql-proxy on Kubernetes Jobs, wrapped around timeout so that it gracefully shuts down the container after completion. Because of this, and the 143 exit code, kubernetes sees the job as failed and puts it into an Error state.

terminated - Error (exit code: 143)

- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.0.0-buster
  imagePullPolicy: Always
  command: ["/bin/sh", "-c"]
  args:
  - timeout --preserve-status -s SIGTERM 60s /cloud-sql-proxy \
  --max-sigterm-delay=60s --structured-logs --health-check \
  --http-address=0.0.0.0 --port=5432 --private-ip \
  {INSTANCE_CONNECTION}

Prior to v2, this behaved gracefully as expected.

Happy to open as a new issue if you'd prefer.

@enocom
Copy link
Member

enocom commented Feb 14, 2023

We recently added support for a /quitquitquit endpoint to support exactly this use case. See here for details: https://github.com/GoogleCloudPlatform/cloud-sql-proxy#localhost-admin-server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants