[BUG] ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on … (#3635)
Spent some time looking into this; it's specifically related to the codepath here: Lines 808 to 812 in 2713491.
This happens when … One of two things seems to be happening: …
|
For what it's worth, I have experienced this sporadically. If it helps, I've only seen it with … |
Yeah, that tracks with my observations as well. The issue definitely seems related to edge merging / shared vertices: when multiple build requests run at the same time and contain similar nodes, buildkit merges them in the internal solve graph. We see this on a large shared runner that lots of builds share; you could also see it with bake. I'm headed out on holiday for a bit, so I won't have much time to dig into this for a while. But as a general request: if anyone in this thread is experiencing this issue and has a reliable reproducer (ideally minimal, but not necessarily), that would go a long way toward tracking this down; currently we really only have the error message and some hunches to go on. |
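Not from the thread, but as an illustration of the kind of minimal reproducer being asked for: a script along these lines (all stage names, tags, and arguments are hypothetical) would exercise the parallel-build / shared-vertex path described above.

```sh
#!/bin/sh
# Hypothetical reproducer sketch: run several builds of the same
# Dockerfile concurrently with different build args, so the solver sees
# many similar vertices at once and may merge edges between requests.
set -e

cat > Dockerfile <<'EOF'
FROM alpine:3.19 AS base
RUN apk add --no-cache curl

FROM base AS final
ARG VARIANT=default
RUN echo "building variant $VARIANT" > /variant.txt
EOF

# Launch the builds in parallel; the shared "base" stage is the kind of
# common vertex that can be merged across concurrent solves.
for v in one two three four five; do
  docker buildx build --build-arg VARIANT="$v" -t "repro:$v" . &
done
wait
```

Whether this actually triggers the error is untested; it only mirrors the conditions reported in the thread (concurrent solves with shared stages).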
I don't have a reproducer, but I think I just hit this on Ubuntu 22.04 LTS with Docker 26.0.0, freshly reinstalled on Monday (6 days ago), without any involvement of bake: …
@jedevc is this log hitting the Exec operation you mention in #3635 (comment)?

The reason I reinstalled docker was moby/moby#46136, so I removed the previous docker 26.0.0 deb packages, did …

The machines build many different images, and quite a few of those images are based on the same Dockerfile with different ARGs (the failing excerpt above is from building that Dockerfile). The Dockerfile has a builder image, a base image of common things, and prod/dev images FROM the base image. The prod/dev images are built in parallel. There are nightly jobs that build all images with …

All builds have …

I don't know if it relates to this bug, but there is something interesting about space usage on the machine that got the error:
For comparison, on my local machine (Ubuntu 24.04, Docker 26.0.0): …

so whatever else is stored in /var/lib/docker takes ~15G on my machine, but 20G on the machine where I got the cache error.

Detailed version info for the machine that got the cache error: …
|
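For concreteness, a sketch of the build pattern described in the comment above; the targets, args, and tags are assumptions, not taken from the actual CI setup.

```sh
# Hypothetical sketch: one Dockerfile with a builder stage, a shared base
# stage, and prod/dev stages built in parallel with different ARGs.
docker buildx build --target prod --build-arg APP_ENV=prod -t app:prod . &
docker buildx build --target dev  --build-arg APP_ENV=dev  -t app:dev  . &
wait

# The nightly jobs rebuild everything; the exact flag was elided above,
# but --no-cache is the usual way to force a full rebuild.
docker buildx build --no-cache --target prod -t app:prod .
```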
@pjonsson curious; what does du show if you add -x? Without -x, du also counts filesystems mounted below the given path, such as container mounts under overlay2: https://linux.die.net/man/1/du
Here's a quick example on a machine I had running:

root@swarm-test-01:~# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          1         1         9.89MB    0B (0%)
Containers      1         1         0B        0B
Local Volumes   0         0         0B        0B
Build Cache     0         0         0B        0B

root@swarm-test-01:~# du -hs /var/lib/docker/overlay2/
20M     /var/lib/docker/overlay2/

root@swarm-test-01:~# du -xhs /var/lib/docker/overlay2/
9.6M    /var/lib/docker/overlay2/
|
The machines have been running since my comment last week, and they run …

so the difference is ~50G now.

@thaJeztah my apologies if I came across as trying to tell you where the bugs are in the system; I don't know that at all. It just appears that something involving the (in layman's terms) "storage" must be wrong. And just this morning one machine failed to build images, and even after a reboot it says …

I understand that you would prefer to have a reproducer, but it's difficult to reproduce the conditions of a CI machine. Is there anything else we could do to provide more information that would help you? I saw the Go race detector mentioned in some other thread, and personally I would be happy to switch the CI machines to some other apt source with debug packages for a couple of weeks (if the performance of Docker stays within 5x) and provide logs to you.

Edit: ran …
Docker version information:
|
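Not part of the thread, but regarding the offer above: one low-effort way to gather more daemon-side detail without a special build is to enable debug logging and capture a goroutine stack dump around a failure. The paths and unit name below are typical defaults, not taken from the thread; a race-detector build would still require custom packages.

```sh
# Enable debug logging for the daemon (note: this overwrites any
# existing daemon.json; merge by hand if one already exists).
echo '{"debug": true}' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker

# When a build fails, ask dockerd to dump all goroutine stacks; the
# daemon writes them to a file and logs where it put them.
sudo kill -USR1 "$(pidof dockerd)"
sudo journalctl -u docker.service | grep -i "goroutine stacks"
```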
I ran into this today:
xref: https://gitlab.wikimedia.org/repos/releng/scap/-/jobs/278321

The client is …

The situation matches what @jedevc described above: 6 separate jobs started at the same time, many of them probably hitting the same …

The log entry from …
|
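Not from the comment, but for anyone trying to capture the corresponding daemon-side log entry, something along these lines usually works (the unit name depends on the installation):

```sh
# Search the Docker daemon's journal for the cache-key failure around
# the time of the failed CI job.
journalctl -u docker.service --since "1 hour ago" \
  | grep -i "failed to compute cache key"
```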
intermittent |
Unfortunately, we are still facing the same issue with the latest Docker 26.1.4 release and buildkit 0.14.1. |
We also had the same issue. Previously we were building a few containers per agent in GitHub Actions: …

After replacing this with …

we are building 21 containers across 4 agents, so each agent has 5 containers to build. It was nice and faster than docker compose, but it just did not work. Docker version: 27.0.3 |
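The replacement command was elided above, but given the docker compose comparison and the bake discussion earlier in the thread, it was presumably something along these lines (the bake file and target names are assumptions):

```sh
# Hypothetical per-agent invocation: each CI agent builds its slice of
# the 21 targets in one bake call, so buildkit solves them concurrently.
docker buildx bake --file docker-bake.hcl \
  service-01 service-02 service-03 service-04 service-05
```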
Original issue: docker/buildx#1308, but according to the code it's buildkit throwing this error in CalcSlowCache.

Using docker buildx bake, we sporadically get (on different build steps): …
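The error output itself was elided above; based on the issue title, the failing line in the bake output looks like this (the trailing vertex name varies per build step and is omitted here):

```
ERROR: failed to solve: failed to compute cache key: failed to get state for index 0 on …
```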