[BUG] ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on #3635

Closed
Antiarchitect opened this issue Feb 14, 2023 · 10 comments

@Antiarchitect

Antiarchitect commented Feb 14, 2023

Original issue: docker/buildx#1308, but according to the code it's buildkit that throws this error, in CalcSlowCache.

Using docker buildx bake we sporadically get this error (on different build steps):

> [assets] importing cache manifest from xxx/yyy/zzz:main--assets:
------
ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on [assets 3/9] COPY --from=cache /cache/node-modules /app/apps/xxx/yyy/node_modules
$ docker version
Client:
 Version:           23.0.1
 API version:       1.42
 Go version:        go1.19.5
 Git commit:        a5ee5b1
 Built:             Thu Feb  9 19:45:43 2023
 OS/Arch:           linux/amd64
 Context:           default
Server: Docker Engine - Community
 Engine:
  Version:          23.0.1
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.5
  Git commit:       bc3805a
  Built:            Thu Feb  9 19:48:50 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.16
  GitCommit:        31aa4358a36870b21a992d3ad2bef29e1d693bec
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.10.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.16.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 65
 Server Version: 23.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 31aa4358a36870b21a992d3ad2bef29e1d693bec
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 6.1.10-1.el7.elrepo.x86_64
 Operating System: Alpine Linux v3.17 (containerized)
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 125.6GiB
 Name: app-gitlab-runners-docker-5
 ID: c9f224dc-3a07-449d-9c82-846b9433ba8b
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://dockerhub.enapter.ninja/
 Live Restore Enabled: false
 Product License: Community Engine
 Default Address Pools:
   Base: 10.10.0.0/16, Size: 24
WARNING: API is accessible on http://0.0.0.0:2375/ without encryption.
         Access to the remote API is equivalent to root access on the host. Refer
         to the 'Docker daemon attack surface' section in the documentation for
         more information: https://docs.docker.com/go/attack-surface/
@jedevc
Member

jedevc commented Mar 4, 2024

Spent some time looking into this - it's specifically related to the code path here:

buildkit/solver/jobs.go

Lines 808 to 812 in 2713491

if p != nil {
    st := s.st.solver.getState(s.st.vtx.Inputs()[index])
    if st == nil {
        return "", errors.Errorf("failed to get state for index %d on %v", index, s.st.vtx.Name())
    }

This happens when CacheMap.PreprocessFunc is set (which could explain why we seem to see this mostly on Copy operations, though the code indicates it could also happen on Exec operations).

One of two things seems to be happening:

  • We are hitting this code path before the state was added to actives - I don't think this should be possible; the scheduler "shouldn't" let it happen.
  • We are hitting this code path after the state has been removed from actives:
    • Potentially a reference-counting issue, where a state is used by two jobs, one job disconnects, and the state is erroneously removed (edge merging may also be related, which would explain why we suddenly start seeing this on long-running instances that get hammered with lots of CI jobs all at once) - see the sketch below.
    • Potentially a locking issue, where our locks aren't broad enough and another component can somehow get access to actives in a scenario where it shouldn't.
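
To make the reference-counting hypothesis concrete, here is a deliberately simplified sketch (not buildkit's actual code - the types and names below are made up) of how an unbalanced reference on a shared vertex state could leave a later getState returning nil:

// refrace.go - a simplified illustration of the reference-counting hypothesis
// above, NOT buildkit's actual code: a shared "actives" map keyed by vertex
// digest, where each job is supposed to hold a reference. If a disconnecting
// job releases a state that another job still relies on (for example because
// edge merging made both jobs point at the same vertex but only one reference
// was recorded), a later getState returns nil and the caller would report
// "failed to get state for index ...".
package main

import (
    "fmt"
    "sync"
)

type state struct{ refs int }

type solver struct {
    mu      sync.Mutex
    actives map[string]*state
}

func (s *solver) acquire(key string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    st, ok := s.actives[key]
    if !ok {
        st = &state{}
        s.actives[key] = st
    }
    st.refs++
}

func (s *solver) release(key string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if st, ok := s.actives[key]; ok {
        st.refs--
        if st.refs <= 0 {
            delete(s.actives, key) // state removed while another job may still need it
        }
    }
}

func (s *solver) getState(key string) *state {
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.actives[key]
}

func main() {
    s := &solver{actives: map[string]*state{}}

    // Jobs A and B end up sharing the same vertex, but only job A's reference
    // is recorded (the hypothetical accounting bug).
    s.acquire("vertex-digest")

    // Job A finishes or disconnects and releases its reference...
    s.release("vertex-digest")

    // ...so job B's slow-cache computation now gets nil back.
    if st := s.getState("vertex-digest"); st == nil {
        fmt.Println("failed to get state for index 0 on <vertex>")
    }
}

If that is indeed what happens, the fix would be on the accounting side (making sure every job that ends up sharing a merged vertex actually holds a reference) rather than in the lookup itself.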

@lucasteligioridis

For what it's worth, I have experienced this sporadically as well. If it helps, I've only seen it with buildx bake related builds - that could be a coincidence, and maybe it's actually the assumption from the description in this issue 👍🏼

@jedevc
Member

jedevc commented Mar 8, 2024

Yeah, that tracks with my observations as well - the issue definitely seems related to edge merging / shared vertices, which happens when multiple build requests run at the same time and similar nodes get merged in the internal solve graph.

We see this when we have a large runner that lots of builds share - you could also see it with bake.

I'm headed out on holiday for a bit, so won't have a lot of time to spend digging into this for a while. But as a general request: if anyone in this thread is experiencing this issue and has a reliable reproducer (ideally minimal, but not necessary), that would go a long way toward tracking this down - currently we really only have the error message and some hunches to go on.

@pjonsson

pjonsson commented Apr 7, 2024

I don't have a reproducer, but I think I just hit this on Ubuntu 22.04 LTS with Docker 26.0.0, freshly reinstalled on Monday (6 days ago), without any involvement of bake:

...
#34 [builder 16/22] COPY --link /sourcecode ./sourcecode
#34 DONE 0.1s
#35 [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .
#35 ERROR: failed to get state for index 0 on [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .
------
 > [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .:
------
ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on [builder 17/22] RUN /usr/local/bin/directory-checksum --max-depth 8 .

@jedevc is this log hitting the Exec operation you mention in #3635 (comment)?

The reason I reinstalled docker was moby/moby#46136, so I removed the previous docker 26.0.0 deb packages, did rm -rf /var/lib/docker and then installed the packages again according to the instructions.

The machines build many different images, and quite a few of those images are based on the same Dockerfile with different ARGs (the failing excerpt above is when building that Dockerfile). The Dockerfile has a builder image, a base image of common things, and a prod/dev image FROM the base image. The prod/dev images are built in parallel. There are nightly jobs that build all images with --no-cache, and some of those jobs have enough images to build so they build multiple base images in parallel.

All builds have SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct) and are done with docker build using the default builder, nobody has ever executed bake on the machines.
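
With jedevc's reproducer request in mind, the setup above could be scripted roughly like this. This is only a sketch, not a confirmed reproducer: the variant names, tags, build-arg name and build context below are placeholders for whatever the real Dockerfile uses.

// stress.go - a rough harness that mirrors the setup described above: the same
// Dockerfile built several times in parallel with different ARG values against
// a single daemon/builder, with --no-cache as in the nightly jobs.
package main

import (
    "fmt"
    "os"
    "os/exec"
    "sync"
)

func main() {
    variants := []string{"prod", "dev", "test"} // stand-ins for the real ARG values
    epoch := os.Getenv("SOURCE_DATE_EPOCH")     // e.g. set from: git log -1 --pretty=%ct

    var wg sync.WaitGroup
    for _, v := range variants {
        wg.Add(1)
        go func(v string) {
            defer wg.Done()
            // Plain "docker build" with the default builder, as described above.
            cmd := exec.Command("docker", "build",
                "--no-cache",
                "--build-arg", "variant="+v,
                "--build-arg", "SOURCE_DATE_EPOCH="+epoch,
                "-t", "repro:"+v,
                ".")
            cmd.Stdout = os.Stdout
            cmd.Stderr = os.Stderr
            if err := cmd.Run(); err != nil {
                fmt.Printf("build %s failed: %v\n", v, err)
            }
        }(v)
    }
    wg.Wait()
}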

I don't know if it relates to this bug, but there is something interesting with space usage on the machine that got the error:

$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          92        0         32.98GB   32.98GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     1553      17        42.9GB    41.87GB
$ sudo du -hs /var/lib/docker
96G	/var/lib/docker
$ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
32.98+42.9
75.88

For comparison, on my local machine (Ubuntu 24.04, Docker 26.0.0):

$ docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          43        6         38.04GB   29.2GB (76%)
Containers      7         7         2.249GB   0B (0%)
Local Volumes   33        0         1.029GB   1.029GB (100%)
Build Cache     455       0         0B        0B
$ sudo du -hs /var/lib/docker
56G	/var/lib/docker
$ bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
38.04+2.249+1.029
41.318

So whatever else is stored in /var/lib/docker takes ~15G on my machine, but ~20G on the machine where I got the cache error.

Detailed version info for the machine that got the cache error:

$ docker version
Client: Docker Engine - Community
 Version:           26.0.0
 API version:       1.45
 Go version:        go1.21.8
 Git commit:        2ae903e
 Built:             Wed Mar 20 15:17:48 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.0.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       8b79278
  Built:            Wed Mar 20 15:17:48 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ docker info
Client: Docker Engine - Community
 Version:    26.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.25.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 211
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-101-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.751GiB
 Name: <machine>
 ID: <id>
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Registry Mirrors:
  https://<mirror>/
 Live Restore Enabled: false

@thaJeztah
Member

@pjonsson Curious: what does du show if you also use the -x / --one-file-system option? It's possible it's double-counting mounts.

https://linux.die.net/man/1/du

-x, --one-file-system

skip directories on different file systems

Here's a quick example on a machine I had running:

docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          1         1         9.89MB    0B (0%)
Containers      1         1         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B

du -hs /var/lib/docker/overlay2/
20M	/var/lib/docker/overlay2/
root@swarm-test-01:~# du -xhs /var/lib/docker/overlay2/
9.6M	/var/lib/docker/overlay2/

@pjonsson

pjonsson commented Apr 16, 2024

The machines have been running since my comment last week, and they run apt-get upgrade nightly, so they are now on Docker 26.0.1. The Docker 26.0.0 from last week was installed with a non-existent /var/lib/docker, so whatever is responsible for the content in that directory came with the Docker deb packages. With that said, here's the current state:

# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          22        0         9.174GB   9.174GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     1407      0         42.59GB   42.59GB
# du -hs /var/lib/docker/overlay2
102G	/var/lib/docker/overlay2
# du -xhs /var/lib/docker/overlay2
102G	/var/lib/docker/overlay2

So the difference is ~50G now.

@thaJeztah My apologies if I came across as trying to tell you where the bugs are in the system - I don't know that at all; it just appears that something involving the (in layman's terms) "storage" must be wrong. Just this morning one machine failed to build images, and even after a reboot it says Error response from daemon: error getting build cache usage: failed to get usage for 1xeqlft7tbtfb9veq92jdzws8: layer does not exist when I run docker system df.

I understand that you would prefer a reproducer, but it's difficult to reproduce the conditions of a CI machine. Is there anything else we could do to provide more information that would help? I saw the Go race detector mentioned in another thread, and I would personally be happy to switch the CI machines to some other apt source with debug packages for a couple of weeks (if Docker's performance stays within 5x) and provide logs to you.

Edit: I ran docker system prune on the machine that couldn't find the layer; it apparently still has 280 build cache entries (?) summing to 0 bytes:

# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          22        0         20.51GB   20.51GB (100%)
Containers      0         0         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     280       0         0B        0B
# du -xhs /var/lib/docker/overlay2
45G     /var/lib/docker/overlay2

Docker version information:

# docker info
Client: Docker Engine - Community
 Version:    26.0.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.26.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
..

@dancysoft
Contributor

I ran into this today:

#18 DONE 15.6s
#19 [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]
#19 ERROR: failed to get state for index 0 on [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]

xref: https://gitlab.wikimedia.org/repos/releng/scap/-/jobs/278321

The client is buildctl and the server is buildkitd v0.13.2.

The situation matches what @jedevc described above: 6 separate jobs started at the same time, many of them probably hitting the same buildkitd.

The log entry from buildkitd is about the same:

time="2024-06-06T16:42:49Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
time="2024-06-06T16:42:50Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Canceled desc = context canceled"
time="2024-06-06T16:42:58Z" level=error msg="/moby.buildkit.v1.Control/Solve returned error: rpc error: code = Unknown desc = failed to compute cache key: failed to get state for index 0 on [test-bullseye 11/12] COPY --chown=65533:65533 [./, .]"

@tonistiigi
Member

tonistiigi commented Jun 6, 2024

The intermittent failed to get state for index errors should be solved by #4887

@snaumenko-st

snaumenko-st commented Jun 24, 2024

Unfortunately, we are still facing the same issue with the latest Docker 26.1.4 release and buildkit 0.14.1.

@jfheins

jfheins commented Aug 2, 2024

We also had the same issue. Previously we were building a few containers per agent in GitHub Actions:

    - name: Build images
      run: docker compose build ${{ inputs.service }}
    
    - name: Push images
      run: docker compose push ${{ inputs.service }}

After replacing this with docker buildx bake --push ${{ inputs.service }} it worked twice.
Every run that came later got this error at least once:

ERROR: failed to solve: DeadlineExceeded: failed to compute cache key: failed to copy: httpReadSeeker: failed open: failed to authorize: no active session for ...: context deadline exceeded

We are building 21 containers across 4 agents, so each agent has 5 containers to build. It was nice and faster than docker compose, but it just did not work.

Docker version: 27.0.3
buildx version: v0.15.1 1c1dbb2
