buildx kubernetes driver sometimes returns ERROR: error dialing backend: remote error: tls: internal error
#2668
Contributing guidelines
I've found a bug and checked that ...
Description
On EKS 1.29, specifically on ARM64 nodes, there appears to be a race condition in which a node is reported as ready in the cluster before its CSR has been signed.
The issue goes away after a number of seconds, once the CSR is approved and issued.
During that window, however, all calls to pods on that node return the above-mentioned error.
Buildx kubernetes driver specifically returns this:
I was browsing the buildx source code and found the call to list the workers: https://github.com/docker/buildx/blob/master/vendor/github.com/moby/buildkit/client/workers.go#L31
But it's not clear what the retry logic is here. It seems to me when we get the above tls internal error, the code just dies and buildx quits.
Is it possible to handle this specific error somehow? It's transient, and should succeed if buildx does some kind of exponential backoff.
buildx is started as follows:
buildx and docker versions:
buildkit remote agent to be booted on the nodes:
moby/buildkit:v0.15.2
docker version:
docker:27.2-dind
with its built-in buildkit, no modifications.
The whole setup runs on self-hosted github-actions runners using version 0.9.3 of oci://ghcr.io/actions/actions-runner-controller-charts.
It seems to only happen under heavy load on the cluster. We have a repo where we build about 20-30 docker images in parallel (it's our base images repo). Each docker image requests 2 buildx kubernetes workers, one for amd64 and one for arm64. So a lot of nodes get spun up at the same time.
Expected behaviour
buildx to not die when it encounters a transient error.
Actual behaviour
Failure log follows:
Sometimes it is able to proceed past this error (I'm guessing due to the sleep 10 statement), but not always.
Buildx version
github.com/docker/buildx v0.16.2 99dea6d
Docker info
Builders list
Configuration