
GCP Cloud Build 1 hour timeout - failed to fetch oauth token: unexpected status: 401 Unauthorized #1205

Open
dougdonohoe opened this issue Jul 11, 2022 · 6 comments



dougdonohoe commented Jul 11, 2022

Background

At my company, I have a new Apple M1 MacBook Pro. Most of our build infrastructure is built on amd64 images, which run very slowly and flakily on the arm64 M1 laptop. I've been working on building multi-architecture images using docker buildx and have run into a problem automating these builds in GCP Cloud Build and publishing to/from GCP Artifact Registry.

Our regular amd64 image build normally takes about 15 minutes. Adding the arm64 platform shoots the build time to an hour and a half. While this build works locally on my laptop, it times out on GCP Cloud Build if it is longer than an hour.

Problem Synopsis

Running a Cloud Build job that does docker buildx build --platform linux/amd64,linux/arm64 --push fails with a 401 when it attempts to push the images to Artifact Registry once the build has run for more than an hour:

Step #1 - "long-build":  > exporting to image:
Step #1 - "long-build": ------
Step #1 - "long-build": error: failed to solve: failed to fetch oauth token: unexpected status: 401 Unauthorized

Workaround

I discovered the workaround is to break the build into three steps:

  1. Build without --push or --cache-to
  2. Stop buildx builder
  3. Build normally

My guess is that stopping the builder forces oauth tokens to be re-fetched in step 3.

My docker-buildx.sh script is what I ended up using to do this.
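For reference, here is a rough sketch of the three-step idea in shell. The builder name, platform list, and image tag are placeholders; the real logic lives in docker-buildx.sh:

  #!/usr/bin/env bash
  set -euo pipefail

  PLATFORMS="linux/amd64,linux/arm64"
  IMAGE="us-central1-docker.pkg.dev/${PROJECT_ID}/my-repo/my-image:latest"

  # 1. The long build: warm the builder's local cache, but don't push and
  #    don't use --cache-to, so no registry credentials are needed at the end.
  docker buildx create --name multiarch --use
  docker buildx build --platform "${PLATFORMS}" -t "${IMAGE}" .

  # 2. Stop the builder; its state (and cache) is kept, but presumably the
  #    cached oauth token is dropped with the session.
  docker buildx stop multiarch

  # 3. Build again with --push. The layers are already cached, so this pass
  #    is quick and finishes well inside the token's one-hour lifetime.
  docker buildx build --platform "${PLATFORMS}" -t "${IMAGE}" --push .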

Details

A full explanation with a reproducible example can be found in my build-timeout repo.

Next Steps

It isn't clear where this problem lies (e.g., in buildx or with Google Cloud Build/Artifact Registry or some combination). I'm also going to raise this issue with Google through my company as a support ticket.

I haven't dug into the internals of buildx (and how it interacts with things like GCP Artifact Registry) or how Docker works in an environment like GCP Cloud Build, so I was hoping a buildx core contributor might have an idea of where to look or what might be happening. The workaround of stopping the builder is an interesting clue.

Possible solution?

Not sure if this will work, but maybe buildx could request the tokens on demand, rather than at startup, so that a long build doesn't outlive the token?


dougdonohoe commented Jul 11, 2022

This ZcashFoundation PR seems to fix the same issue in GitHub CI. I wonder if there is a way to set this in GCP Cloud Build?

          # Some builds might take over an hour, and Google's default lifetime duration for
          # an access token is 1 hour (3600s). We increase this to 3 hours (10800s)
          # as some builds take over an hour.
          access_token_lifetime: 10800s

Docs on this state:

access_token_lifetime: (Optional) Desired lifetime duration of the access token, in seconds. This must be specified as the number of seconds with a trailing "s" (e.g. 30s). The default value is 1 hour (3600s). The maximum value is 1 hour, unless the constraints/iam.allowServiceAccountCredentialLifetimeExtension organization policy is enabled, in which case the maximum value is 12 hours.

I time-boxed a dive into this at an hour, but didn't learn much (too many unknowns about how Docker runs inside of GCP).
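For anyone who lands here from GitHub Actions: that setting goes on the google-github-actions/auth step, roughly like this. The pool, provider, and service account names below are placeholders:

  - id: auth
    uses: google-github-actions/auth@v1
    with:
      workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/my-pool/providers/my-provider
      service_account: builder@my-project.iam.gserviceaccount.com
      # Anything above 3600s only works if the
      # constraints/iam.allowServiceAccountCredentialLifetimeExtension
      # org policy is enabled, per the docs quoted above.
      access_token_lifetime: 10800s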

@parkerroan

+1 @dougdonohoe, seeing this issue as well. The workaround for me was to create a service account JSON key, pass it base64-encoded through an env var, and do a docker login with it prior to the buildx command. This is not ideal for security, but it works. Posting here to help out the next person who encounters this.

Here is the implementation for anyone stuck and finding this through Google:

  - name: gcr.io/cloud-builders/docker
    entrypoint: 'bash'
    args:
      - -c
      - |
        # Write key to "/workspace"
        printf $_SERVICE_KEY | base64 --decode > /workspace/key.json
    id: base64 env secret key
  - name: gcr.io/cloud-builders/docker
    entrypoint: 'bash'
    args:
      - -c
      - |
        # Read from "/workspace"
        docker logout https://gcr.io && cat /workspace/key.json | docker login -u _json_key --password-stdin https://gcr.io &&
        docker buildx build --platform $_DOCKER_BUILDX_PLATFORMS --build-arg GITHUB_TOKEN=$_GITHUB_TOKEN \
        --build-arg CRYPTO_KEY=$_CRYPTO_KEY --build-arg PROJECT_NAME=$_PROJECT_NAME \
        -t gcr.io/$PROJECT_ID/$_DEPLOYMENT_NAME:$BRANCH_NAME \
        -t gcr.io/$PROJECT_ID/$_DEPLOYMENT_NAME:${BRANCH_NAME}_${BUILD_ID} --push .
    id: build-multi-architecture-container-image


dougdonohoe commented Jan 25, 2023

Thanks @parkerroan - out of curiosity, what permissions did you give the service account that the key belongs to? Also, this sounds like a good use for Google Secret Manager; my article, noted below, has an example of using it for SSH keys.
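If it helps, Cloud Build can also read the key directly from Secret Manager instead of a base64 substitution. A minimal sketch, assuming a secret named docker-sa-key holding the JSON key:

  availableSecrets:
    secretManager:
      - versionName: projects/$PROJECT_ID/secrets/docker-sa-key/versions/latest
        env: SERVICE_KEY
  steps:
    - name: gcr.io/cloud-builders/docker
      entrypoint: bash
      secretEnv: ['SERVICE_KEY']
      args:
        - -c
        - |
          # $$ escapes the variable so Cloud Build doesn't treat it as a substitution.
          echo "$$SERVICE_KEY" | docker login -u _json_key --password-stdin https://gcr.io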

This might also be of interest: I just published a Medium post on how to use arm64 VMs to vastly speed up docker buildx builds in GCP Cloud Build. For us, this has reduced build times to under an hour.

Also mentioned in that article is my docker-buildx.sh script, which I use to break the docker buildx build into three steps to avoid this bug.


Davidnet commented May 2, 2023

Hey @dougdonohoe, I just stumbled into this very issue. Did you find a way to solve it? Is there any way to extend the lifetime of the token provided by Google?

@dougdonohoe

@Davidnet I haven't explored @parkerroan's solution yet. What we do is use this script to break the build into two parts, which has provided some success: https://github.com/dougdonohoe/multi-arch-docker/blob/main/docker-buildx.sh


Davidnet commented May 2, 2023

Perfect, @dougdonohoe. I made it work by specifying the image in the images field instead of pushing from buildx. I could see in the logs that Cloud Build got unexpected status: 401 Unauthorized, but I imagine Cloud Build retried and was eventually successful. So I guess that if you specify images, Cloud Build will push the image itself by default.

Hope this brings some visibility to the issue.

"options": {
        "diskSizeGb": "200",
        "machineType": "N1_HIGHCPU_8",
        "logging": "CLOUD_LOGGING_ONLY"
    },
    "timeout": "18000s",
    "images": [
        "us-central1-docker.pkg.dev/project/repo/image_name:image_tag"
    ]
