GCR image missing shell in v1.16 #317

Closed
istarkov opened this issue Oct 10, 2019 · 30 comments

Labels: type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design)

Comments

@istarkov

It seems like ash is gone from the latest container version.

We used it in Cloud Build like this:

  - id: 'db-proxy'
    name: eu.gcr.io/cloudsql-docker/gce-proxy
    waitFor: ['extract-api-build']
    entrypoint: ash
    args:
      - '-c'
      - |
        /cloud_sql_proxy -dir=/cloudsql -instances=$(cat ./_CLOUD_SQL_INSTANCE_) & \
        while [ ! -f /cloudsql/stop ]; do
          sleep 2;
        done
        kill $!
    volumes:
      - name: db
        path: /cloudsql

This trick allowed us to keep the connection open until a touch /cloudsql/stop command from another step.

Now there is no ash or bash in the container, and it has become impossible to use it in Cloud Build in a simple manner.
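
The step that later releases the proxy is essentially just a touch of that stop file from any builder that has a shell, along these lines (the step id, builder image, and waitFor target here are only illustrative):

  - id: 'stop-db-proxy'
    name: 'gcr.io/cloud-builders/gcloud'
    waitFor: ['deploy-api']  # replace with the step(s) that need the database
    entrypoint: bash
    args: ['-c', 'touch /cloudsql/stop']
    volumes:
      - name: db
        path: /cloudsql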

@kurtisvg kurtisvg changed the title Seems like ash has gone from the latest container version GCR image missing shell in v1.16 Oct 11, 2019
@kurtisvg kurtisvg added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Oct 11, 2019
@kurtisvg kurtisvg self-assigned this Oct 11, 2019
@kurtisvg
Contributor

Hey folks, apologies if this broke any builds.

This was done intentionally: we switched from alpine to distroless as the base image to improve security. (You can see the new build in our Dockerfile.) I think we'll need to evaluate whether we want to offer a second image for folks that need a shell.

I'm not sure of the best way to introduce a shell into a distroless container, but it looks like you can take the above Dockerfile and switch the distroless base image to the debug tag to add the busybox shell in the meantime.

@istarkov
Author

Thank you for the answer.
Maybe a simpler solution exists for accessing SQL from Cloud Build steps, as that's the only case where we need a shell in the proxy container.

@AdriVanHoudt

Is version 1.15 built somewhere else or differently? The commit referenced seems to also be in 1.15.
I don't know if it is "officially" documented anywhere, but I'm betting that most people who use the container in Cloud Build are using it like @istarkov mentioned, probably taken from or inspired by https://stackoverflow.com/a/52366671/2339622

@elmarti

elmarti commented Oct 11, 2019

Yeah, this broke my CD pipeline. It's fundamentally my fault for copy/pasting the Stack Overflow answer above and for using :latest as the image version in Cloud Build, but it's still too big a change not to warrant a major version bump. The current fix for me is to specify a version for the step (excuse the JSON; this will not work as-is in a cloudbuild.yaml):

 {
      id: "cloud-sql-proxy",
      name: "gcr.io/cloudsql-docker/gce-proxy:1.15",
      waitFor: ["other_step"],
      entrypoint: "sh",
      args: [
        "-c",
        `/cloud_sql_proxy -dir=/cloudsql -instances=${cloudsqlconnection} -credential_file=<vcredentialfilepath> & while [ ! -f /cloudsql/stop ]; do sleep 2; done`
      ],

      volumes: [
        {
          name: "db",
          path: "/cloudsql"
        }
      ]
    },

Changing the image to gcr.io/cloudsql-docker/gce-proxy:1.15 in YAML should work the same way.
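
In cloudbuild.yaml form, that step would look roughly like this (connection name and credential file path are placeholders, as above):

  - id: 'cloud-sql-proxy'
    name: 'gcr.io/cloudsql-docker/gce-proxy:1.15'
    waitFor: ['other_step']
    entrypoint: 'sh'
    args:
      - '-c'
      - '/cloud_sql_proxy -dir=/cloudsql -instances=<CONNECTION_NAME> -credential_file=<CREDENTIAL_FILE_PATH> & while [ ! -f /cloudsql/stop ]; do sleep 2; done'
    volumes:
      - name: 'db'
        path: '/cloudsql'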

@oldskool

oldskool commented Oct 11, 2019

I have the same issue. I don't really mind that there is no shell per se, but using 1.16 currently breaks the deployment of this image in our GKE Kubernetes clusters, as we used nc in our liveness probes, and that binary can now no longer be found either.

Switching to distroless is quite a radical change and should be tagged as a new major release version 2.0.0 IMHO. This would make it clear to everyone that fundamental changes have been made that might very well break builds.
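
For context, the probe we had was along these lines (the port and flags are illustrative); a tcpSocket probe, which the kubelet performs itself, is a shell-free alternative:

        livenessProbe:
          exec:
            # worked on the alpine-based 1.15 image; fails on distroless 1.16 (no nc)
            command: ['nc', '-z', '127.0.0.1', '3306']
          periodSeconds: 10
        # shell-free alternative:
        # livenessProbe:
        #   tcpSocket:
        #     port: 3306
        #   periodSeconds: 10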

@kurtisvg
Contributor

Hey folks,

Apologies for any inconvenience this caused. I would strongly recommend avoiding the use of latest in the future, and instead following container best practices by pinning to a specific version (but updating it often).

We're not planning on guaranteeing the existence of any shells or other tools in the image other than the proxy binary. If we were to define an API for this image, it would only contain the path that the proxy is located at (/cloud_sql_proxy). I apologize that this wasn't clearer previously, and will make sure going forward it is clearly stated in our docs.

For using the proxy with other tools, I would recommend downloading a release binary into an environment with the correct versions of other tools. We have versioned release links on the release page that can be used.

For Cloud Build, I believe the correct way to connect to Cloud SQL would be to download the proxy to /workspace and execute it as part of the same step. I'll work on verifying this and getting a more concrete example for folks to take a look at in the next few days.

@istarkov
Author

No need for apologies; it's great, easy-to-use software, and setting a tag for the image is not a big deal, at least for me.

For Cloud Build, I believe the correct way to connect to Cloud SQL would be to download the proxy to /workspace and execute it as part of the same step. I'll work on verifying this and getting a more concrete example for folks to take a look at in the next few days.

One issue: that same step can also be bash-less and curl/wget-less, which causes us to create additional steps and containers.

For example, having the ability to stop the proxy from other build steps would remove the need for bash.

@oldskool

oldskool commented Oct 11, 2019

Apologies for any inconvenience this caused. I would strongly recommend avoiding the use of latest in the future, and instead following container best practices by pinning to a specific version (but updating it often).

We're not using latest, but we do have automated container updates via Flux CD, based on semantic versioning, and had 1.* on the whitelist. As said before, I see such a fundamental change to the underlying container as a major update (1.15 > 2.0) rather than a minor one (1.15 > 1.16), so I would not have expected such a backwards-incompatible change. For now we've pinned it to 1.16 after we fixed the issues it caused.

@roychri

roychri commented Oct 11, 2019

Any chance the release notes could mention the fact that you changed to distroless?
(https://github.com/GoogleCloudPlatform/cloudsql-proxy/releases/tag/1.16)

That seems to be a big change worthy of being mentioned there.

@ffjeremy

ffjeremy commented Oct 16, 2019

I was using a similar solution (running a shell script in the gce-proxy container), and it broke for me as well. Here is the workaround I have come up with: use the docker builder to send a TERM signal to the gce-proxy container.

steps:
  # DB Proxy
  # Launch Cloud SQL proxy and keep it open
  # until the db migration step is finished
  - name: "gcr.io/cloudsql-docker/gce-proxy:1.16"
    id: proxy
    waitFor: ["-"]
    volumes:
      - name: db
        path: "/cloudsql"
    args:
      - "/cloud_sql_proxy"
      - "-dir=/cloudsql"
      - "-instances=$_DB_CONNECTION_NAME"

  # Run Migrations.
  - name: "gcr.io/$PROJECT_ID/my-api"
    id: dbmigrate
    waitFor: ["-"]
    volumes:
      - name: db
        path: "/cloudsql"
    args:
      - "sh"
      - "-c"
      - "sleep 5; echo hello world; ls -lh /cloudsql"

  - name: "gcr.io/cloud-builders/docker"
    id: killproxy
    waitFor: ["dbmigrate"]
    entrypoint: "sh"
    args:
      - "-c"
      - 'docker kill -s TERM $(docker ps -q --filter "volume=db")'

@istarkov
Author

Super!!! Thank you!

@kurtisvg
Contributor

Alternatively, here is another solution that keeps the execution of the proxy contained to a single step:

  - id: cmd-with-proxy
    name: [YOUR-CONTAINER-HERE]
    timeout: 100s
    entrypoint: sh
    args:
      - -c
      - '(/workspace/cloud_sql_proxy -dir=/workspace -instances=[INSTANCE_CONNECTION_NAME] & sleep 2) && [YOUR-COMMAND-HERE]'

This starts the proxy in the background, waits 2 seconds for it to start up, and then executes the specified command. Since the proxy is in a background process, it will automatically exit when the process is complete. It does require that the proxy is in the /workspace volume, but this can be done as a prereq step if necessary:

  - id: proxy-install
    name: alpine:3.10
    entrypoint: sh
    args:
      - -c
      - 'wget -O /workspace/cloud_sql_proxy https://storage.googleapis.com/cloudsql-proxy/v1.16/cloud_sql_proxy.linux.386 &&  chmod +x /workspace/cloud_sql_proxy'
    waitFor: ['-']

I verified earlier today that this works, and shows the step as failed if either the proxy fails to start or if the script exits with a non-zero code.

Here's the cloudbuild.yaml I used to verify this:

steps:
  - id: proxy-install
    name: alpine:3.10
    entrypoint: sh
    args:
      - -c
      - 'wget -O /workspace/cloud_sql_proxy https://storage.googleapis.com/cloudsql-proxy/v1.16/cloud_sql_proxy.linux.386 &&  chmod +x /workspace/cloud_sql_proxy'
    waitFor: ['-']
  - id: execute-with-proxy
    name: python:3.7
    timeout: 100s
    entrypoint: sh
    args:
      - -c
      - '(/workspace/cloud_sql_proxy -dir=/workspace -instances=[INSTANCE_CONNECTION_NAME] & sleep 2) && (pip install -r requirements.txt && python test_sql.py)'
    waitFor: ['proxy-install']

And here's test_sql.py:

import os

import sqlalchemy

# Connection settings (the environment variable names here are only illustrative).
db_user = os.environ['DB_USER']
db_pass = os.environ['DB_PASS']
db_name = os.environ['DB_NAME']
cloud_sql_connection_name = os.environ['CLOUD_SQL_CONNECTION_NAME']

db = sqlalchemy.create_engine(
    # Equivalent URL:
    # mysql+pymysql://<db_user>:<db_pass>@/<db_name>?unix_socket=/workspace/<cloud_sql_instance_name>
    sqlalchemy.engine.url.URL(
        drivername='mysql+pymysql',
        username=db_user,
        password=db_pass,
        database=db_name,
        query={
            'unix_socket': '/workspace/{}'.format(cloud_sql_connection_name)})
)

try:
    with db.connect() as conn:
        now = conn.execute('SELECT NOW() as now').fetchone()
        print('Connection successful.')
except Exception as ex:
    print('Connection not successful: {}'.format(ex))
    exit(1)

I believe this should also work if the proxy was listening on a local TCP port, as the proxy is now running in the same container (but I haven't verified this).
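
Untested, but the TCP variant of the step would look roughly like this (using the =tcp: port suffix on -instances); the test script would then connect to 127.0.0.1:3306 instead of the Unix socket:

  - id: execute-with-proxy-tcp
    name: python:3.7
    timeout: 100s
    entrypoint: sh
    args:
      - -c
      - '(/workspace/cloud_sql_proxy -instances=[INSTANCE_CONNECTION_NAME]=tcp:3306 & sleep 2) && (pip install -r requirements.txt && python test_sql.py)'
    waitFor: ['proxy-install']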

@cjrh

cjrh commented Nov 7, 2019

For Kubernetes deployments, the provided Kubernetes doc gives a lifecycle preStop hook calling "sleep": https://github.com/GoogleCloudPlatform/cloudsql-proxy/blob/master/Kubernetes.md#creating-the-cloud-sql-proxy-deployment

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cloudsqlproxy
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: cloudsqlproxy
    spec:
      containers:
       # Make sure to specify image tag in production
       # Check out the newest version in release page
       # https://github.com/GoogleCloudPlatform/cloudsql-proxy/releases
      - image: b.gcr.io/cloudsql-docker/gce-proxy:latest
       # 'Always' if imageTag is 'latest', else set to 'IfNotPresent'
        imagePullPolicy: Always
        name: cloudsqlproxy
        command:
        - /cloud_sql_proxy
        - -dir=/cloudsql
        - -instances=project:database1=tcp:0.0.0.0:3306,project:database2=tcp:0.0.0.0:3307
        - -credential_file=/credentials/credentials.json
        # set term_timeout if require graceful handling of shutdown
        # NOTE: proxy will stop accepting new connections; only wait on existing connections
        - term_timeout=10s
        lifecycle:
          preStop:
            exec:
              # (optional) add a preStop hook so that termination is delayed
              # this is required if your server still require new connections (e.g., connection pools)
              command: ['sleep', '10']

<snip>

This may be a silly question, but is this ['sleep', '10'] still expected to work in the new 1.16 distroless container? This did not work for me in 1.16, but it did work in 1.15. I am relying on this sleep above to let my app container finish up currently-processing requests (that involve db access) before the proxy gets SIGTERM.

@kurtisvg
Contributor

@cjrh Not a stupid question. Sleep won't work since there isn't a shell to execute it inside the proxy.

The term_timeout flag shouldn't prevent new connections from being formed. The pod should receive a SIGTERM signal and wait the time limit before shutting down. The k8s documentation mentions it removes the container from the endpoints list, but since you are using a local IP I don't see why it wouldn't continue to serve traffic.

It also looks like our example has a typo in it. Can you try the following and see if it works without the lifecycle?

        command:
        - /cloud_sql_proxy
        - -dir=/cloudsql
        - -instances=project:database1=tcp:0.0.0.0:3306,project:database2=tcp:0.0.0.0:3307
        - -credential_file=/credentials/credentials.json
        - -term_timeout=10s

If so, please let me know and I'll update it accordingly.

@cjrh

cjrh commented Nov 14, 2019

@kurtisvg Thanks for your reply!

Yep it should be - -term_timeout=10; I spotted that already and corrected the error in my version. That does not solve my problem, which I will try to explain below:

It seems to me there is a subtle, but important difference between the effect of using sleep and term_timeout. When my pod must be terminated (e.g., during a rolling deploy), k8s will do the following 2 things simultaneously:

  1. tell the ingress controller to remove the pod
  2. EITHER send SIGTERM to each container in the pod, OR run the preStop hook, if present, followed by a SIGTERM when the hook is done.

However, for (1) it seems that the pod is not immediately removed from the ingress, and in practice there is an awkward delay where I still see new connections being made to my app container for a second or two, until the pod is removed from the LB. My goal is to also handle these requests (until the pod is removed from the LB and it no longer receives requests), and this typically means they require DB access via the sql proxy. If these requests are not handled correctly, errors are returned to the caller service.

From my admittedly minimal testing, it seems like the sql proxy will not allow any new connections in the time window after SIGTERM, but before the existing open connections are closed according to the -term_timeout setting. So this is why I want a preStop sleep: to allow enough time for new requests to be handled during the time window that my pod is being removed from the ingress.

I apologise if I'm not making sense as I have minimal experience with k8s. The TLDR is that I'm trying to use a preStop sleep to give enough time for the ingress to remove the pod from the LB so that all new incoming requests to that pod can be handled successfully (both by the app container and the sql proxy container), and then after removal from the LB is complete, proceed with the shutdown sequence as normal.

Perhaps an alternative design might be to just return some kind of specific error back to the upstream service and have it do retries. But that's messy.

What is best practice for dealing with that tiny window of time between the pod being marked TERMINATING and it getting removed from the LB? Right now, I am using version 1.15 of the sql proxy because it still has a shell, and the "connection draining" behaviour I've described above is working well, so I'm unsure how to proceed here with 1.16 and beyond.

If it matters, my app uses gunicorn.

@kurtisvg
Contributor

From my admittedly minimal testing, it seems like the sql proxy will not allow any new connections in the time window after SIGTERM, but before the existing open connections are closed according to the -term_timeout setting

I would consider this a bug - IIRC the intended behavior was to continue accepting new connections until the term_timeout duration expired. If this is the way the proxy is itself behaving, please open a new issue (with steps to reproduce/test).

@cjrh

cjrh commented Nov 16, 2019

I would consider this a bug - IIRC the intended behavior was to continue accepting new connections until the term_timeout duration expired.

Reproduction isn't necessary to disprove this; the code in Shutdown() is clear enough:

// Shutdown waits up to a given amount of time for all active connections to
// close. Returns an error if there are still active connections after waiting
// for the whole length of the timeout.
func (c *Client) Shutdown(termTimeout time.Duration) error {
	termTime := time.Now().Add(termTimeout)
	for termTime.After(time.Now()) && atomic.LoadUint64(&c.ConnectionsCounter) > 0 {
		time.Sleep(1)
	}

	active := atomic.LoadUint64(&c.ConnectionsCounter)
	if active == 0 {
		return nil
	}
	return fmt.Errorf("%d active connections still exist after waiting for %v", active, termTimeout)
}

The sleeping for-loop will terminate if either the specified term_timeout value is reached or the number of connections drops to zero. As I described before, this can result in the proxy container exiting while the pod is still receiving incoming requests via the ingress.

Should I make a new issue to request that this be changed to the following?

func (c *Client) Shutdown(termTimeout time.Duration) error {
        time.Sleep(termTimeout)
	active := atomic.LoadUint64(&c.ConnectionsCounter)
	if active == 0 {
		return nil
	}
	return fmt.Errorf("%d active connections still exist after waiting for %v", active, termTimeout)
}


@kurtisvg
Contributor

@cjrh - Yes, please open a new issue. According to this comment on the PR for the feature, the intention was to allow connections to resume during shutdown.

@cleberjsantos

cleberjsantos commented Dec 9, 2019

The problem is with https://github.com/GoogleContainerTools/distroless: this base image does not have sh in its package list, see here:
https://github.com/GoogleContainerTools/distroless/blob/master/base/base.bzl#L29-L33

The cloudsql-proxy Dockerfile (https://github.com/GoogleCloudPlatform/cloudsql-proxy/blob/master/Dockerfile#L27 or https://github.com/GoogleCloudPlatform/cloudsql-proxy/blob/1.16/Dockerfile#L27) uses distroless as the base image for the build.

A possible solution is to add the entry below:

diff --git a/base/base.bzl b/base/base.bzl
index acde679..7ee5739 100644
--- a/base/base.bzl
+++ b/base/base.bzl
@@ -29,6 +29,7 @@ def distro_components(distro_suffix):
         debs = [
             DISTRO_PACKAGES[distro_suffix]["base-files"],
             DISTRO_PACKAGES[distro_suffix]["netbase"],
+            DISTRO_PACKAGES[distro_suffix]["sh"],
             DISTRO_PACKAGES[distro_suffix]["tzdata"],
         ],
         env = {

@austincollinpena

austincollinpena commented Feb 18, 2020

@kurtisvg
I'm currently following your steps but am running into issues. When running, I get:
django.db.utils.OperationalError: could not connect to server: No such file or directory Is the server running locally and accepting connections on Unix domain socket "/cloudsql/sample-kubernetes-268320:us-west1:cloud-build-staging/.s.PGSQL.3306"?

My cloudbuild.yaml is:

steps:
  - id: proxy-install
    name: alpine:3.10
    entrypoint: sh
    args:
      - -c
      - 'wget -O /workspace/cloud_sql_proxy https://storage.googleapis.com/cloudsql-proxy/v1.16/cloud_sql_proxy.linux.386 &&  chmod +x /workspace/cloud_sql_proxy'
    waitFor: ['-']
  - id: Test
    name: 'python:3.7.6-buster'
    env:
      - "CLOUDBUILD_TEST=True"
    timeout: 100s
    entrypoint: /bin/sh
    args:
      - -c
      - '(/workspace/cloud_sql_proxy -dir=/workspace -instances=sample-kubernetes-268320:us-west1:cloud-build-staging & sleep 2) && (pip install -r requirements.txt && cd fastestfollowup && python3 manage.py test)'
    waitFor: ['proxy-install']

Do you have any ideas of where I should start to look to fix this?
Edit:
I felt it was bad form to ask for help in this manner, so I opened a Stack Overflow question here if you would be so kind to take a look.

@pratikmallya

For the cloudsql-proxy authors: what do you think about adding a command to the proxy itself so it can do the healthcheck, i.e. cloud_sql_proxy healthz, which could be invoked as a healthcheck?

@kurtisvg
Contributor

kurtisvg commented Apr 8, 2020

Closing this in favor of #330 and #371

@kurtisvg kurtisvg closed this as completed Apr 8, 2020
@dinigo

dinigo commented May 12, 2020

@kurtisvg can you give some context on the outcome of this issue? You link an issue for an alpine-based image and one for an ubuntu-based image there. Which one is it based on now?

I came here looking for a way to implement this healthcheck like:

 database:
    image: gcr.io/cloudsql-docker/gce-proxy:1.17
    # ...
    healthcheck:
      test: pg_isready --dbname=${DB_NAME} --host=${DB_HOST} --port=${DB_PORT} --username=${DB_USER}

But I'm not sure of the outcome of the discussion. Can you please give me some context on how to access this shell? I suppose that Ubuntu does have a shell while alpine does not, so which one is it built against now?

Thanks!

@kurtisvg
Contributor

kurtisvg commented May 12, 2020

  1. The default base image will remain based on distroless, which doesn't provide a shell. This is to remain more secure by default.
  2. We will be adding alternative images (likely vX.X-buster and vX.X-alpine) that will use different base images. AFAIK, both buster (debian) and alpine images include shells by default.
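
Once those are available, the original Cloud Build pattern from this issue should work again by pointing at one of those tags, roughly like this (the tag below is a placeholder; check the releases page for the actual name):

  - id: db-proxy
    name: gcr.io/cloudsql-docker/gce-proxy:<VERSION>-buster
    entrypoint: sh
    args:
      - -c
      - '/cloud_sql_proxy -dir=/cloudsql -instances=<INSTANCE_CONNECTION_NAME> & while [ ! -f /cloudsql/stop ]; do sleep 2; done'
    volumes:
      - name: db
        path: /cloudsql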

@ekampf

ekampf commented Jun 15, 2020

@kurtisvg any ETA for those buster/alpine images?

@kurtisvg
Contributor

@ekampf Hopefully before the end of the month - @dmahugh is working on them.

@CarlQLange

I would suggest this issue could be closed, since the -alpine and -buster images exist now.

@kurtisvg
Contributor

kurtisvg commented Feb 9, 2021

@CarlQLange Looks already closed to me :)

@CarlQLange

Oh my goodness, what a brain fart! Sorry about that!

@aseem-hegshetye

Alternatively, here is another solution that keeps the execution of the proxy contained to a single step: [...] (quoting @kurtisvg's comment above)

This is great. It would be even better if this were incorporated into an official GCP document showing how to deploy from Cloud Build while keeping the Cloud SQL proxy running in the background. If such a doc already exists, it's very hard to find; please provide a link for future reference.
