buildx kubernetes driver sometimes returns `ERROR: error dialing backend: remote error: tls: internal error` #2668

dcherniv · 2024-08-31T21:47:58Z

Contributing guidelines

I've read the contributing guidelines and wholeheartedly agree

I've found a bug and checked that ...

... the documentation does not mention anything about my problem
... there are no open or closed issues that are related to my problem

Description

On EKS 1.29, specifically on ARM64 nodes, there appears to be a race condition in which a node is reported as Ready in the cluster before its CSR has been signed.
The issue goes away after a number of seconds, once the CSR is approved and issued.
During that window, however, all calls to pods on the node return the error above.
The buildx kubernetes driver specifically returns this:

ERROR: error dialing backend: remote error: tls: internal error
ERROR: context deadline exceeded

NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
 \_ buildx-php-release-8.2-2b0efca1    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-b4c6a857-d35e-4171-b4d3-0e38e1749975-05a58&kubeconfig=   error                linux/arm64*
Failed to get status for buildx-php-release-8.2-2b0efca (buildx-php-release-8.2-2b0efca1): listing workers: failed to list workers: DeadlineExceeded: context deadline exceeded
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

I was browsing the buildx source code and found the call that lists the workers: https://github.com/docker/buildx/blob/master/vendor/github.com/moby/buildkit/client/workers.go#L31
It's not clear what the retry logic is here, though. It seems that when we get the TLS internal error above, the code just gives up and buildx exits.
Is it possible to handle this specific error somehow? It's transient, and the call should succeed if buildx retries with some kind of exponential backoff.

buildx is started as follows:

          docker buildx create --bootstrap --name=buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION} --driver=kubernetes --platform=linux/amd64 \
                 --buildkitd-flags '--debug --trace' \
                 --driver-opt='"annotations=karpenter.sh/do-not-disrupt=true,karpenter.sh/do-not-evict=true","image=162166941288.dkr.ecr.us-east-1.amazonaws.com/moby/buildkit:v0.15.2","timeout=600s","requests.memory=28Gi","nodeselector=runners=dedicated,kubernetes.io/arch=amd64","tolerations=key=runners,value=dedicated"'
          sleep 10
          docker buildx ls
          docker buildx create --append --bootstrap --name=buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION} --driver=kubernetes --platform=linux/arm64 \
                 --buildkitd-flags '--debug --trace' \
                 --driver-opt='"annotations=karpenter.sh/do-not-disrupt=true,karpenter.sh/do-not-evict=true","image=162166941288.dkr.ecr.us-east-1.amazonaws.com/moby/buildkit:v0.15.2","timeout=600s","requests.memory=28Gi","nodeselector=runners=dedicated,kubernetes.io/arch=arm64","tolerations=key=runners,value=dedicated;key=arch,value=arm64"'
          sleep 10
          docker buildx ls

buildx and docker versions:
buildkit remote agent to be booted on the nodes: moby/buildkit:v0.15.2
docker version: docker:27.2-dind with its built-in buildkit, no modifications.
The whole setup runs on self-hosted GitHub Actions runners using version 0.9.3 of oci://ghcr.io/actions/actions-runner-controller-charts

It seems to happen only under heavy load on the cluster. We have a repo where we build about 20-30 Docker images in parallel (it's our base images repo). Each Docker image requests 2 buildx kubernetes workers, one for amd64 and one for arm64, so a lot of nodes get spun up at the same time.

Expected behaviour

buildx should not die when it encounters a transient error.

Actual behaviour

Failure log follows:

#1 [internal] booting buildkit
W0831 21:09:50.507151     230 warnings.go:70] metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
#1 waiting for 1 pods to be ready, timeout: 10 minutes
#1 waiting for 1 pods to be ready, timeout: 10 minutes 66.8s done
#1 DONE 66.8s
buildx-php-release-8.2-2b0efca
NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
#1 [internal] booting buildkit
W0831 21:11:07.543695     328 warnings.go:70] metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
#1 waiting for 1 pods to be ready, timeout: 10 minutes
#1 waiting for 1 pods to be ready, timeout: 10 minutes 71.9s done
#1 DONE 71.9s
buildx-php-release-8.2-2b0efca
ERROR: error dialing backend: remote error: tls: internal error
ERROR: context deadline exceeded

Sometimes it is able to proceed past this error (I'm guessing due to the `sleep 10` statement), but not always.
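Until buildx retries on its own, one possible workaround is to replace the fixed `sleep 10` with a bounded retry in the calling script. The following is a minimal sketch, not anything buildx provides: the helper name `retry_with_backoff`, the attempt count, and the delays are all hypothetical, and it assumes the wrapped command exits non-zero when it hits the transient error.

```shell
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command> [args...]
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is exhausted.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    # Double the wait before the next attempt.
    delay=$((delay * 2))
  done
}
```

It could be used as, for example, `retry_with_backoff 6 5 docker buildx inspect --bootstrap "$BUILDER_NAME"`, where `BUILDER_NAME` is a placeholder for the builder created above. Retrying an inspect call rather than `docker buildx create --append` avoids appending the same node more than once, though whether inspect surfaces the TLS error as a non-zero exit is an assumption that would need checking.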

Buildx version

github.com/docker/buildx v0.16.2 99dea6d

Docker info

/ # docker info
Client:
 Version:    27.2.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.2
    Path:     /usr/local/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.2
    Path:     /usr/local/libexec/docker/cli-plugins/docker-compose

Builders list

NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
 \_ buildx-php-release-8.2-2b0efca1    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-b4c6a857-d35e-4171-b4d3-0e38e1749975-05a58&kubeconfig=   error                linux/arm64*
Failed to get status for buildx-php-release-8.2-2b0efca (buildx-php-release-8.2-2b0efca1): listing workers: failed to list workers: DeadlineExceeded: context deadline exceeded
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

Configuration

FROM public.ecr.aws/docker/library/php:8.2-apache

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get -y update &&\
    apt-get -y install gnupg git unzip


Build logs

No response

Additional info

No response


dcherniv · 2024-08-31T22:30:01Z

Found the reason why this error pops up: awslabs/amazon-eks-ami#1944
TL;DR: apparently, in some cases nodes on EKS report Ready status before their CSR has been signed and approved. Pods get scheduled and start running, and when buildx attempts to get their status, it gets the error above.
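If that is the cause, a script could wait for the pending kubelet CSRs to clear before calling buildx. The sketch below rests on several assumptions: the function names are hypothetical, the relevant CSRs are assumed to use the `kubernetes.io/kubelet-serving` signer, and the parsing assumes the default `kubectl get csr` table layout (signer name in the third column, condition in the last). It also counts pending CSRs across the whole cluster, so on a cluster that keeps adding nodes it may wait longer than necessary.

```shell
# Hypothetical helper: count CSRs still Pending for the kubelet serving signer.
# Assumes the default `kubectl get csr` table layout: signer name in column 3,
# condition in the last column. A failing kubectl call counts as zero pending.
pending_kubelet_serving_csrs() {
  kubectl get csr --no-headers 2>/dev/null |
    awk '$3 == "kubernetes.io/kubelet-serving" && $NF == "Pending" { n++ } END { print n + 0 }'
}

# Hypothetical helper: poll until no such CSR is pending.
# Usage: wait_for_kubelet_csrs [max_wait_seconds] [interval_seconds]
wait_for_kubelet_csrs() {
  max_wait="${1:-120}"
  interval="${2:-5}"
  waited=0
  while [ "$(pending_kubelet_serving_csrs)" -gt 0 ]; do
    # Stop waiting once the time budget is used up.
    if [ "$waited" -ge "$max_wait" ]; then
      echo "kubelet serving CSRs still pending after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Calling `wait_for_kubelet_csrs 120 5` before `docker buildx ls` would then block for up to two minutes. This only narrows the window; it does not replace retrying on the client side.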

tonistiigi · 2024-09-04T14:43:46Z

cc @AkihiroSuda

dcherniv · 2024-09-09T20:01:18Z

It sounds like this discussion has already taken place, and the consensus was that clients should retry:
kubernetes/kubernetes#73047

dcherniv added the status/triage label Aug 31, 2024

crazy-max added the area/driver/kubernetes label Sep 2, 2024

AkihiroSuda added the kind/bug label Sep 4, 2024
