docker buildx build hangs when using some cached layers #2009
Comments
Added some logging to …
Log with the start of the loop is here; it hung after/on …
We're seeing the same problem too. Environment: … As requested in buildkit Slack, I have manually killed each process; log attached: buildkit-prod-596464b95f-4z2ls.log (3x 100% threads)
To see if this problem still exists in …: buildkit-test-fc7544dc8-kgjks.log (1x 100% thread). And here is another: …
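For anyone hitting the same hang, here is a minimal sketch of how such stack dumps can be produced. It assumes the builder uses the docker-container driver and that the container name follows buildx's usual `buildx_buildkit_<builder-name>0` pattern; the name used below is illustrative, not taken from this issue. buildkitd is a Go binary, so SIGQUIT makes the runtime print every goroutine stack before the process exits.

```sh
# Sketch only: the container name is illustrative. Sending SIGQUIT kills
# buildkitd, but the Go runtime first dumps all goroutine stacks to stderr,
# which ends up in the container logs.

# Locate the buildkit builder container.
docker ps --filter name=buildx_buildkit

# Send SIGQUIT to buildkitd (PID 1 in the container).
docker kill --signal=SIGQUIT buildx_buildkit_builder-builder0

# Save the goroutine dump from the container logs.
docker logs buildx_buildkit_builder-builder0 > buildkitd-stacks.log 2>&1
```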
Continued troubleshooting... I have isolated the trigger for this condition to one specific image in our library, and it can be triggered without being a parallel build. It might be important that this is a large image (10.7 GB). This image builds using several remote …
Troubleshooting steps, which haven't helped:
What has helped:
This leads me to wonder if previously built/released images (which we use as a possible remote cache source) may be corrupt in some way. Build failure messages like this one (which feels related to #2303) seem to support this:
When buildkitd enters this 100% CPU state but does eventually continue on to successful completion, the corresponding master/debug messages look like this:
As you can see from the timestamps, at best it sits stuck there for 90 seconds, but it can stay stuck for 6-7 minutes. At this point there is one buildkitd thread pegged at 100% and no obvious disk/network activity. I don't know how to dig any deeper to see what it's doing, except to generate these stack traces for the developers to look at.
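One way to dig a bit deeper is sketched below, under a couple of assumptions: that buildkitd's `--debugaddr` flag serves Go's pprof handlers (check `buildkitd --help` for your version), that busybox `wget` is available inside the moby/buildkit image, and that the builder name, container name, and port chosen here are illustrative rather than taken from this issue.

```sh
# Sketch only: names and port are illustrative; assumes --debugaddr exposes
# /debug/pprof and that busybox wget exists in the moby/buildkit image.

# Create a builder whose buildkitd also listens on a debug/pprof address.
docker buildx create --name builder-debug \
  --driver docker-container \
  --driver-opt image=moby/buildkit:buildx-stable-1 \
  --buildkitd-flags '--debug --debugaddr 0.0.0.0:6060' \
  --use

# While a build is pegged at 100% CPU, grab a 30s CPU profile and a full
# goroutine dump from inside the builder container.
docker exec buildx_buildkit_builder-debug0 \
  wget -q -O /tmp/cpu.pprof 'http://127.0.0.1:6060/debug/pprof/profile?seconds=30'
docker exec buildx_buildkit_builder-debug0 \
  wget -q -O /tmp/goroutines.txt 'http://127.0.0.1:6060/debug/pprof/goroutine?debug=2'

# Copy the results out; cpu.pprof can be inspected with `go tool pprof`.
docker cp buildx_buildkit_builder-debug0:/tmp/cpu.pprof .
docker cp buildx_buildkit_builder-debug0:/tmp/goroutines.txt .
```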
An update of sorts. This bug is possibly related to #2526, and it is possibly fixed by this merge: #2556.

In our case, through trial and error, we found that running several very large image builds in parallel was what triggered the issue. We restructured our pipelines to run the largest image builds serially to better avoid these hangs, and it's been running without hitting this issue for a couple of weeks now (even without the fix linked above).

Looking at the CPU profile of the builder when it's checksumming layers, there is a definite improvement following this MR, so I can confirm that we have less wasted CPU with the new implementation, but I can't say for sure that the issue is fixed either, sorry. Other users may be able to offer better feedback. (If I attempt to test by simply telling buildkit to build all of the images in parallel, it still panics, which is a different problem.)
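For reference, here is a rough sketch of the workaround described above: run the largest builds one at a time and keep only the smaller ones in parallel. The image names, registry, and context paths are placeholders, not taken from this issue.

```sh
# Sketch of the "build large images serially" workaround; every name and path
# here is a placeholder.

# Largest images: one at a time, so only one build is checksumming huge layers.
for img in big-image-a big-image-b; do
  docker buildx build --pull \
    --cache-from "registry.example.com/$img:latest" \
    --cache-to type=inline \
    -t "registry.example.com/$img:latest" \
    "./$img" || exit 1
done

# Smaller images can still be built in parallel.
docker buildx build --pull -t registry.example.com/small-a:latest ./small-a &
docker buildx build --pull -t registry.example.com/small-b:latest ./small-b &
wait
```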
buildx-0.5.1, moby/buildkit:buildx-stable-1 (be8e8392f56c), Docker version 20.10.5, build 55c4c88, on linux x86_64. Unfortunately I cannot provide a reproducer, since it's an internal project.

`docker buildx build --cache-from … --cache-to type=inline --pull .` hangs on various steps indefinitely (or for a very long time; in GH Actions there was a 6 hour timeout). The buildkit container is started with `docker buildx create --name builder-builder --driver docker-container --driver-opt image=moby/buildkit:buildx-stable-1 --buildkitd-flags --debug --use` (the setup is stolen from the `docker/setup-buildx-action@v1` GH action); logs after sending SIGQUIT are here. This time the hang was on …

During the hang buildkitd eats some CPU; `perf top -g` shows …

I can reproduce it every time I build the image and can provide any additional info if needed. If `--cache-from` is removed, the image is built just fine.
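The reporter's setup, laid out as a script for readability. The `--cache-from` ref was elided in the report, so `$CACHE_IMAGE` below is only a placeholder; note that the buildkitd flags generally need quoting so they reach buildkitd as one argument.

```sh
# The builder setup and the hanging build from the report above, as a script.
# $CACHE_IMAGE is a placeholder for the elided --cache-from ref.

docker buildx create --name builder-builder \
  --driver docker-container \
  --driver-opt image=moby/buildkit:buildx-stable-1 \
  --buildkitd-flags '--debug' \
  --use

# Hangs on various steps when a cache source is given; completes normally
# when --cache-from is dropped.
docker buildx build \
  --cache-from "$CACHE_IMAGE" \
  --cache-to type=inline \
  --pull \
  .
```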