
docker buildx build hangs when using some cached layers #2009

Open
ojab opened this issue Mar 5, 2021 · 6 comments

@ojab

ojab commented Mar 5, 2021

buildx-0.5.1, moby/buildkit:buildx-stable-1 (be8e8392f56c), Docker version 20.10.5, build 55c4c88 on linux x86_64.

Unfortunately I cannot provide a reproducer, since it's an internal project.

docker buildx build --cache-from … --cache-to type=inline --pull . hangs on various steps indefinitely (or at least for a very long time; in GH Actions it hit the 6-hour timeout). The buildkit container is started with docker buildx create --name builder-builder --driver docker-container --driver-opt image=moby/buildkit:buildx-stable-1 --buildkitd-flags --debug --use (the setup is taken from the docker/setup-buildx-action@v1 GH action). Logs after sending SIGQUIT are here. This time the hang was on

 => CACHED [… 11/13] RUN bundle exec bootsnap precompile --gemfile config/ app/ lib/    282.6s

During the hang buildkitd eats some CPU. perf top -g shows:

  Children      Self  Shared Object     Symbol
+   75,62%     0,39%  buildkitd         [.] github.com/moby/buildkit/solver.(*cacheManager).ensurePersistentKey
+   68,22%     0,08%  buildkitd         [.] github.com/moby/buildkit/solver/bboltcachestorage.(*Store).HasLink
+   67,33%     0,28%  buildkitd         [.] go.etcd.io/bbolt.(*DB).View
+   59,37%     0,34%  buildkitd         [.] github.com/moby/buildkit/solver/bboltcachestorage.(*Store).HasLink.func1
+   39,27%     0,44%  buildkitd         [.] go.etcd.io/bbolt.(*Cursor).seek
+   38,38%     9,83%  buildkitd         [.] go.etcd.io/bbolt.(*Cursor).search
+   29,51%     0,52%  buildkitd         [.] go.etcd.io/bbolt.(*Cursor).searchPage
+   24,31%     0,33%  buildkitd         [.] go.etcd.io/bbolt.(*Bucket).Bucket
+   22,37%     0,07%  buildkitd         [.] runtime.systemstack
+   21,22%     0,17%  buildkitd         [.] runtime.gcDrain
+   21,22%     0,24%  buildkitd         [.] go.etcd.io/bbolt.(*Bucket).Get
+   19,76%    10,83%  buildkitd         [.] runtime.scanobject
+   17,86%     2,92%  buildkitd         [.] sort.Search
+   15,52%     7,50%  buildkitd         [.] runtime.mallocgc
+   11,83%    11,26%  buildkitd         [.] cmpbody
+    9,81%     0,00%  buildkitd         [.] runtime.goexit
+    9,51%     0,00%  buildkitd         [.] github.com/moby/buildkit/solver/internal/pipe.NewWithFunction.func2
+    9,51%     0,00%  buildkitd         [.] github.com/moby/buildkit/solver.(*edge).loadCache-fm
+    9,51%     0,00%  buildkitd         [.] github.com/moby/buildkit/solver.(*edge).loadCache
+    9,51%     0,00%  buildkitd         [.] github.com/moby/buildkit/solver.(*sharedOp).LoadCache
+    9,51%     0,00%  buildkitd         [.] github.com/moby/buildkit/solver.(*combinedCacheManager).Load
+    9,51%     0,00%  buildkitd         [.] github.com/moby/buildkit/solver.(*cacheManager).Save
+    9,00%     1,36%  buildkitd         [.] runtime.growslice
+    8,40%     0,46%  buildkitd         [.] encoding/json.Marshal
+    8,24%     0,32%  buildkitd         [.] go.etcd.io/bbolt.(*Cursor).nsearch
+    7,29%     6,73%  buildkitd         [.] runtime.findObject
+    6,21%     0,79%  buildkitd         [.] go.etcd.io/bbolt.(*DB).beginTx
+    6,16%     0,02%  buildkitd         [.] go.etcd.io/bbolt.(*DB).Begin
+    5,88%     0,09%  buildkitd         [.] encoding/json.(*encodeState).marshal
+    5,69%     0,31%  buildkitd         [.] runtime.newobject
+    5,42%     0,24%  buildkitd         [.] encoding/json.(*encodeState).reflectValue
+    4,98%     1,01%  buildkitd         [.] go.etcd.io/bbolt.(*Bucket).openBucket
+    4,68%     4,52%  buildkitd         [.] runtime.heapBitsSetType
+    3,98%     0,01%  buildkitd         [.] encoding/json.structEncoder.encode-fm
+    3,84%     0,67%  buildkitd         [.] encoding/json.structEncoder.encode
+    3,46%     0,49%  buildkitd         [.] go.etcd.io/bbolt.(*Tx).init

I can reproduce it every time I build the image and can provide any additional info if needed. If --cache-from is removed, the image builds just fine.

@ojab

ojab commented Mar 5, 2021

I added some logging to HasLink (just printing its arguments); the output looks like an infinite loop:

time="2021-03-05T09:39:52Z" level=debug msg="Got id ulqurtcb1dvawk58l6u6tkr45, link {\"Input\":1,\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\",\"Selector\":\"sha256:e15d509af2a4bce0cf98b5e84a2425725a4d64251ea333a4fb8e9b6cad1f25ab\"}, target csrwd0ukivoem0z1mp65pwju1"
time="2021-03-05T09:39:52Z" level=debug msg="Got id sha256:64f1674a08578c2ad8b68c155dcaf8fe288309b53e06e09612b35f5f2be2b26c, link {\"Input\":1,\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target csrwd0ukivoem0z1mp65pwju1"
time="2021-03-05T09:39:52Z" level=debug msg="Got id dul8ljhyfs84k963n5pyieina, link {\"Digest\":\"sha256:a430bc0b49b450a86e72d01f1ea40d2313ddd5c24006066781ef5870c341803c\"}, target i86ur3uho5wx2fx1rujsusid7"
time="2021-03-05T09:39:52Z" level=debug msg="Got id 2uxf1tpsf7akxiuya8ijwlyjt, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target dul8ljhyfs84k963n5pyieina"
time="2021-03-05T09:39:52Z" level=debug msg="Got id uw5iew8vp453b7u4qhc1sal9d, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target dul8ljhyfs84k963n5pyieina"
time="2021-03-05T09:39:52Z" level=debug msg="Got id unfvaqrd8ay15tfivnfcedaof, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target dul8ljhyfs84k963n5pyieina"
time="2021-03-05T09:39:52Z" level=debug msg="Got id x1uf4wkz0lul7n7tgupk9f957, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target dul8ljhyfs84k963n5pyieina"
time="2021-03-05T09:39:52Z" level=debug msg="Got id ulqurtcb1dvawk58l6u6tkr45, link {\"Input\":1,\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\",\"Selector\":\"sha256:e15d509af2a4bce0cf98b5e84a2425725a4d64251ea333a4fb8e9b6cad1f25ab\"}, target dul8ljhyfs84k963n5pyieina"
time="2021-03-05T09:39:52Z" level=debug msg="Got id sha256:64f1674a08578c2ad8b68c155dcaf8fe288309b53e06e09612b35f5f2be2b26c, link {\"Input\":1,\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target dul8ljhyfs84k963n5pyieina"
time="2021-03-05T09:39:52Z" level=debug msg="Got id 9z2wuznxzgsujgbdymm3ujram, link {\"Digest\":\"sha256:a430bc0b49b450a86e72d01f1ea40d2313ddd5c24006066781ef5870c341803c\"}, target i86ur3uho5wx2fx1rujsusid7"
time="2021-03-05T09:39:52Z" level=debug msg="Got id 2uxf1tpsf7akxiuya8ijwlyjt, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 9z2wuznxzgsujgbdymm3ujram"
time="2021-03-05T09:39:52Z" level=debug msg="Got id uw5iew8vp453b7u4qhc1sal9d, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 9z2wuznxzgsujgbdymm3ujram"
time="2021-03-05T09:39:52Z" level=debug msg="Got id unfvaqrd8ay15tfivnfcedaof, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 9z2wuznxzgsujgbdymm3ujram"
time="2021-03-05T09:39:52Z" level=debug msg="Got id x1uf4wkz0lul7n7tgupk9f957, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 9z2wuznxzgsujgbdymm3ujram"
time="2021-03-05T09:39:52Z" level=debug msg="Got id ulqurtcb1dvawk58l6u6tkr45, link {\"Input\":1,\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\",\"Selector\":\"sha256:e15d509af2a4bce0cf98b5e84a2425725a4d64251ea333a4fb8e9b6cad1f25ab\"}, target 9z2wuznxzgsujgbdymm3ujram"
time="2021-03-05T09:39:52Z" level=debug msg="Got id sha256:64f1674a08578c2ad8b68c155dcaf8fe288309b53e06e09612b35f5f2be2b26c, link {\"Input\":1,\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 9z2wuznxzgsujgbdymm3ujram"
time="2021-03-05T09:39:52Z" level=debug msg="Got id 88gaa9ddk1s9xu4c0ky5oiwfe, link {\"Digest\":\"sha256:a430bc0b49b450a86e72d01f1ea40d2313ddd5c24006066781ef5870c341803c\"}, target i86ur3uho5wx2fx1rujsusid7"
time="2021-03-05T09:39:52Z" level=debug msg="Got id 2uxf1tpsf7akxiuya8ijwlyjt, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 88gaa9ddk1s9xu4c0ky5oiwfe"
time="2021-03-05T09:39:52Z" level=debug msg="Got id uw5iew8vp453b7u4qhc1sal9d, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 88gaa9ddk1s9xu4c0ky5oiwfe"
time="2021-03-05T09:39:52Z" level=debug msg="Got id unfvaqrd8ay15tfivnfcedaof, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 88gaa9ddk1s9xu4c0ky5oiwfe"
time="2021-03-05T09:39:52Z" level=debug msg="Got id x1uf4wkz0lul7n7tgupk9f957, link {\"Digest\":\"sha256:ce05dc67e00f4d190db9acced98f9b8f18ca5b8a1f41a0f2ed80bd2dbff6547a\"}, target 88gaa9ddk1s9xu4c0ky5oiwfe"

@ojab

ojab commented Mar 5, 2021

A log with the start of the loop is here; it hung after/on loading the cache for the [builder_yarn 11/11] RUN yarn build step.

@griffous

We're seeing the same problem.
One of our pipelines, which runs 5x parallel builds (we use docker bake), seems to trigger a condition that causes a buildkit thread to enter an endless loop consuming 100% CPU.
Subsequent builds trigger the same condition until the container becomes overwhelmed by zombie 100%-CPU buildkit threads and we kill it.

Environment:
buildkitd: v0.9.2
docker buildx: v0.6.1
Platform: Kubernetes 1.21, full-privilege containers with persistent storage (300 GB), 8 vCPUs, 25 GB RAM.

As requested in the buildkit Slack, I manually killed each process with kill -s SIGQUIT 1 and then downloaded the container's logs. Hopefully there is something common in the stack traces leading to the issue :)
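
For reference, the capture amounted to roughly the following (a sketch, assuming the builders run as ordinary Kubernetes pods; the pod name is an example taken from the log list below, and kill -s SIGQUIT 1 makes the Go runtime dump all goroutine stacks to the container log before exiting):

    # send SIGQUIT to buildkitd (PID 1) inside one affected pod
    kubectl exec buildkit-prod-596464b95f-4z2ls -- kill -s SIGQUIT 1

    # the pod restarts after the dump, so grab the previous container's log
    kubectl logs --previous buildkit-prod-596464b95f-4z2ls > buildkit-prod-596464b95f-4z2ls.log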

buildkit-prod-596464b95f-4z2ls.log (3x 100% threads)
buildkit-prod-596464b95f-qdfsx.log (2x 100% threads)
buildkit-prod-596464b95f-sfnt8.log (2x 100% threads)

@griffous

griffous commented Dec 11, 2021

To see if this problem still exists in master, I created a new builder pod based on the master tag from dockerhub, and we hit the same problem with this fresh builder:

buildkit-test-fc7544dc8-kgjks.log (1x 100% thread)

And here is another master builder, this time with --debug enabled, which I SIGQUIT killed after 30 minutes of no progress:
buildkit-test-6df4c4cd7d-qj7vh.log

@griffous

griffous commented Dec 13, 2021

Continued troubleshooting...

I have isolated the trigger for this condition to one specific image in our library, and it can be triggered without a parallel build. It might be relevant that this is a large image (10.7 GB).

This image is built using several remote cache-from tags (roughly as in the command sketched after this list):

  1. The final image for this branch ($BRANCH)
  2. The final image from the previous release ("latest")
  3. An earlier build stage 1 for this branch ($BRANCH)
  4. An earlier build stage 1 for the previous release ("latest")
  5. An earlier build stage 2 for this branch ($BRANCH)
  6. An earlier build stage 2 for the previous release ("latest")
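
For illustration, the invocation looks roughly like this (a sketch only: the registry, repository and tag names are placeholders rather than our real internal refs, and the stage1/stage2 tags stand for the intermediate build stages we push as cache):

    docker buildx build \
      --cache-from type=registry,ref=registry.example.com/app:$BRANCH \
      --cache-from type=registry,ref=registry.example.com/app:latest \
      --cache-from type=registry,ref=registry.example.com/app:stage1-$BRANCH \
      --cache-from type=registry,ref=registry.example.com/app:stage1-latest \
      --cache-from type=registry,ref=registry.example.com/app:stage2-$BRANCH \
      --cache-from type=registry,ref=registry.example.com/app:stage2-latest \
      --tag registry.example.com/app:$BRANCH \
      --push .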

Troubleshooting steps, which haven't helped:

  • Recreate the builder container
  • Recreate the builder container with fresh local storage (no cache).
  • Use buildkit builder tag master instead of 0.9.2
  • Remove flag --oci-worker-gc-keepstorage
  • Remove flag --oci-worker-net=cni
  • Switch to containerd by using the flag --containerd-worker-net=cni
  • Delete the image tag from the registry that cache-from number 1 pointed at.

What has helped:

  • Removing the cache-from number 2 argument (the "latest" final image).

This leads me to wonder whether previously built/released images (which we use as a possible remote cache source) may be corrupt in some way. Build failure messages like this one (which feels related to #2303) seem to support this:

error: failed to solve: failed to get edge: inconsistent graph state

When buildkitd enters this 100% CPU state but does eventually continue on to completion, the corresponding master/--debug messages look like this:

time="2021-12-13T09:56:00Z" level=debug msg="snapshot garbage collected" d=2.090066ms snapshotter=overlayfs
time="2021-12-13T09:57:30Z" level=warning msg="invalid image config with unaccounted layers"
time="2021-12-13T12:26:33Z" level=debug msg="snapshot garbage collected" d=2.969114ms snapshotter=overlayfs
time="2021-12-13T12:28:38Z" level=warning msg="invalid image config with unaccounted layers"
time="2021-12-13T12:48:49Z" level=debug msg="gc cleaned up 3841102621 bytes"
time="2021-12-13T12:55:21Z" level=warning msg="invalid image config with unaccounted layers"

As you can see from the timestamps, at best it sits stuck there for 90 seconds, but it can also get stuck for 6-7 minutes.
If it hasn't cleared this state within 10 minutes, it usually never does (it stays stuck forever).

While it is in this state, there is one buildkitd thread pegged at 100% CPU and no obvious disk/network activity. I don't know how to dig any deeper to see what it's doing, except to generate these stack traces for the developers to look at.
How can I troubleshoot further?

@griffous

An update of sorts.

This bug is possibly related to #2526, and it is possibly fixed by this merge: #2556

In our case, through trial and error we found that running several very large image builds in parallel was what triggered the issue. We restructured our pipelines to run the largest image builds serially to avoid these hangs, and they have been running without hitting this issue for a couple of weeks now (even without the fix linked above).

Looking at the CPU profile of the builder while it is checksumming layers, there is a definite improvement following that merge, so I can confirm that the new implementation wastes less CPU, but I can't say for sure that this issue is fixed, sorry. Other users may be able to offer better feedback.
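
(For anyone wanting to capture a similar profile, the sketch below is an assumption about the setup rather than exactly what we ran. It assumes buildkitd is started with a debug address, e.g. --debugaddr 0.0.0.0:6060, which exposes the standard Go pprof endpoints; the host and ports are placeholders.)

    # take a 30-second CPU profile from the running builder and open it in a browser
    go tool pprof -http=:8080 "http://<builder-host>:6060/debug/pprof/profile?seconds=30"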

(If I attempt to test by simply telling buildkit to build all of the images in parallel, it still panics, which is a different problem.)
