Concurrent map write on cache export #2041
Tonis kindly created a race-detector enabled master/buildkit image for me to try. Using this builder, I do see DATA RACE messages printed in the container log, so that's working correctly, and hopefully this will help the developers who look into this case :)
With this image, and a fresh set of docker prunes, I am not seeing builder crashes; however, some stages/images which previously had cache-from (registry) hits are now rebuilding instead. And during their rebuilds they're actually failing with issues having to do with the contents of cache mounts (freshly downloaded files suddenly vanishing, causing install scripts to fail). This might have more to do with the switch from 0.8.2 to master, so I will try to rebuild our entire library with a new set of image tags, push those, and continue trying to reproduce the (export) crash with this master/race-detector builder.
Let me know if there is anything else I can add to the case which would be of help?
Update: the master/race image doesn't seem to be able to reliably 'prepare build cache for export', even for a small number of targets. So I switched to moby/master and continued building the entire 30-stage library as parallel as buildkit could manage. I built and switched to a master + CNI binaries/config image and was able to successfully complete the build and push of our 11 Oracle images/targets.
More digging brought me to bug #1984. I commented out all of the cache-froms, reran the same 11 targets (cached on disk), and it successfully built and exported! So --cache-from seems to be at least one of the triggers causing my crashes. Next I tried again using the tonistiigi/buildkit:tcp-concurrency-fix image mentioned in the other bug, and that builder can build the same 11 targets with --cache-froms enabled! Excellent. Next I tried building the full library (30 targets), and that crashed it:
This has been really informative though, and I think it's pointing to two issues:
Is there any chance we can get an updated #1984 implementation against master, so I can try again with the latest code?
Thanks @tonistiigi. Test results: on the very first build I am also seeing a strange issue in which cache volumes are not writable by the run command. When building the second time, I was able to capture a crash on a remote host
The buildkit image was generated by the build above, and this buildkit/Dockerfile
I also produced the crash on my local machine; here is a second example:
Hi team, coming back to this after Easter: today I am getting this crash attempting to build 11 targets, using the same master + CNI builder. 0b74ad4ed96bfc66bf7f6bf760fc504bcb9d3e3c6fbb1c52a1ad37c2941e0f54-json.log
Is there any more information I can collect for you @tonistiigi?
#2067 handles the last error. Please provide a reproducer for how to hit this error, though.
Thanks. I'm keeping an eye on Docker Hub for an updated image with this change. It doesn't look like it has hit nightly or master yet: https://hub.docker.com/r/moby/buildkit/tags
I will try to repro it with a simpler test case.
Update - my recent development/builds against the
And then I managed to crash it again today, while attempting to build 25 stages. Top of the log here:
I know you're after a reproducer, and I'll keep trying to make a simplified use case.
I have made a simplified reproducer: buildkit_crasher.zip
It uses a containerised registry to keep everything local, and a 3-stage build, fanning out as 1=>3=>30 images. To increase the chances of a crash - and I know how odd this sounds - run a
Let me know if there is anything else I can do to help with finding/fixing this bug? Example crash from this reproducer:
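For readers without the zip, this is only a rough sketch of the reproducer's shape, not its actual contents: the registry port, builder name reuse, target names, and the described bake file are assumptions.

# Local registry so everything stays on the machine.
docker run -d --name registry -p 5000:5000 registry:2
# A docker-bake.hcl (not reproduced here) defines one shared base stage, three
# intermediate stages built FROM it, and ~30 leaf targets built FROM those,
# all tagged localhost:5000/... and exporting inline cache metadata.
docker buildx bake \
  --builder buildkit-v0.8.2 \
  --set '*.cache-to=type=inline' \
  --push \
  leaf-01 leaf-02 leaf-03   # ... through leaf-30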
I have also crashed the race_detector version that @tonistiigi built for me around a month ago. If my reproducer doesn't work for you locally, but the race-detector feature and its logs would be helpful for troubleshooting, please build me a new race-detector enabled image and I'll happily re-run it again :)
I tried this again today, using the current
Unfortunately it's still panicking with the same error:
Is there anything that can be done to make buildkitd work more reliably on large builds?
I've just built a custom image of Docker 20.10.7 with buildkit v0.9.0 for my build farm, and am now facing a similar error (to moby/moby#42422), but in other places:
Thanks for sharing @Antiarchitect. I am still seeing panics with our builds, even when using the
Our way of dealing with this is to wrap our buildkit commands in retry/timeout blocks, and to only feed <10 targets to buildkit at a time. Together this gives us a CI pipeline which fairly consistently "works" - the best we can do until the bug is resolved. Here is my most recent run.
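A minimal sketch of that retry/timeout-plus-batching approach, with a batch size, timeout, and retry count chosen as illustrative assumptions rather than the author's actual CI script (target names are taken from the bake command later in this issue):

#!/usr/bin/env bash
# Feed bake only a few targets at a time; kill and retry any run that hangs or panics.
set -u
TARGETS=(weblogic-11g-base weblogic-12c-base weblogic-10.3.6 weblogic-12.2.1.1 weblogic-12.2.1.3 oracledb-12.2-base oracledb-19.3-base)
BATCH_SIZE=5

build_batch() {
  local attempt
  for attempt in 1 2 3; do
    # Time out the bake if progress keeps ticking but nothing is happening.
    if timeout 30m docker buildx bake --builder buildkit-v0.8.2 --push "$@"; then
      return 0
    fi
    echo "bake failed (attempt $attempt), retrying..." >&2
    sleep 30
  done
  return 1
}

for ((i = 0; i < ${#TARGETS[@]}; i += BATCH_SIZE)); do
  build_batch "${TARGETS[@]:i:BATCH_SIZE}" || exit 1
done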
I bet the new errors appear 10 times less often than the original one (just my impression), but they are still present.
You're probably right - the race is still there; fewer threads simply trigger it less often. I also see gracefully handled errors like this one with the more recent code. It's better than a panic, but weird all the same.
Update for v0.9.1: I can still immediately crash a v0.9.1 buildkit by giving it "too many" parallel builds, so unfortunately the root cause of this issue still hasn't been found.
@maxlaverse ptal at this as well. Related to #2296 (comment), but from that trace it looks like the cachekey itself needs some locking (or cloning, or a lock at a higher level).
Thanks, I'll try it out as soon as it's available and get back to you with the results.
Hi Tonis! I just gave this a try with our "full build" (47 targets), and the latest
Good news: it didn't panic this time. 🥳🥳🥳 Unfortunately this build is still failing for us, though :/
Run 1 failed with this error after 36 seconds:
Run 2 failed with this error after 8 minutes:
Run 3 failed with a similar error after 7 minutes, so I gather it's the same problem as run 2:
And I stopped trying after 3 attempts. Unless you feel these errors are related, I think we can consider this bug closed.
I am also running into a similar
I'm seeing a segfault in docker:dind 23.0 ( buildx output:
daemon output:
@maaarghk I also faced a problem similar to yours on 23.0.0: moby/moby#44918 (comment)
We also faced this issue, and we found a way to reproduce it with near 100% probability: https://gist.github.com/skokhanovskiy/a8f13d3ffe411ff0ece5fde8af9c26f1
The key here is to create multiple similar images with a set of common layers, push them to a registry, and pass the list of these images to the
After that, building the images with compose becomes unstable because the docker daemon fails:
In this case, to mitigate the issue, just avoid passing images with a common set of layers to the
Hope this will help to get it fixed soon.
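A hedged illustration of that pattern, shown with plain buildx commands rather than the compose setup from the gist; the image names and local registry are made up. Several images sharing base layers are all listed as cache sources, which is what seems to trigger the failure; the mitigation is to list only one of them.

REGISTRY=localhost:5000
# Several similar images sharing common base layers, pushed with inline cache metadata.
docker buildx build --push -t $REGISTRY/app-a --cache-to type=inline .
docker buildx build --push -t $REGISTRY/app-b --cache-to type=inline .
docker buildx build --push -t $REGISTRY/app-c --cache-to type=inline .

# Trigger pattern: pass the whole list of similar images as cache sources.
docker buildx build -t $REGISTRY/app-a \
  --cache-from $REGISTRY/app-a \
  --cache-from $REGISTRY/app-b \
  --cache-from $REGISTRY/app-c .

# Mitigation pattern: only one cache source per build.
docker buildx build -t $REGISTRY/app-a --cache-from $REGISTRY/app-a .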
Overview
I am seeing consistent crashes when building many large stages in parallel with buildkit and docker buildx bake. The crashes seem to happen when buildkit is attempting to do many image exports in parallel.
Both buildx --load and --push outputs result in crashes, using either the built-in or the docker-container-based buildkit 0.8.2 instances.
When buildkit dies, the progress output keeps incrementing time, but nothing is actually happening. Typically I observe an export of some kind running, and in the instances where I don't, I think it's just that the UI didn't refresh the active threads before buildkit died.
Environment
Software: Ubuntu 20.10, Docker 20.10.5, docker buildx bake.
Hardware: 8 cores, 64GB RAM, NVMe storage.
Build info: Largish (~1200 line) multistage Dockerfile. There are around 30 stages, 11 of which are end stages.
Most of the images are quite large, in the 3-10GB range (Oracle Database and Weblogic images).
The build uses these features: inline cache export, cache-from, --mount=type=secret, and --mount=type=cache.
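For context, a single-target sketch of that feature combination; the registry, image name, and the Dockerfile line in the comment are placeholders, not the actual build:

# The Dockerfile uses secret and cache mounts along these lines (hypothetical):
#   RUN --mount=type=secret,id=token --mount=type=cache,target=/var/cache/yum ./install.sh
docker buildx build \
  --builder buildkit-v0.8.2 \
  --secret id=token,src=.token \
  --cache-to type=inline \
  --cache-from type=registry,ref=registry.example.com/oracledb-19.3-base \
  --push -t registry.example.com/oracledb-19.3-base .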
buildkit image (tags) which I can crash with our bake/build:
- :cni-v0.8.2 (this is v0.8.2 with the CNI binaries, running in CNI networking mode)
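For reference, a builder like the one named in the bake command below is typically created along these lines; the image name here is a placeholder for the custom CNI-enabled build, not the author's actual image.

# Create and bootstrap a docker-container builder from a custom buildkit image.
docker buildx create \
  --name buildkit-v0.8.2 \
  --driver docker-container \
  --driver-opt image=example/buildkit:cni-v0.8.2 \
  --use
docker buildx inspect --bootstrap buildkit-v0.8.2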
A typical bake command looks like this.
time docker buildx bake --set *.labels.com.tiatechnology.authors=xxxx --set *.labels.com.tiatechnology.buildhost=jeeves --set *.labels.com.tiatechnology.buildurl=local --set "*.labels.com.tiatechnology.created=2021-03-26 13:50:34 CET" --set "*.labels.com.tiatechnology.dockerversions=Client/Server versions: 20.10.5/20.10.5" --set *.labels.com.tiatechnology.revision=d498421126227c7628ea99cd0c0492415989825b --set *.labels.com.tiatechnology.source=https://git.xxxxx --set *.labels.com.tiatechnology.version=monodockerfile --builder buildkit-v0.8.2 --progress=auto --set *.cache-to=type=inline --set *.secrets=id=token,src=.token --push weblogic-11g-base weblogic-12c-base weblogic-10.3.6 weblogic-12.2.1.1 weblogic-12.2.1.3 oracledb-12.2-base oracledb-19.3-base
We're using an internal docker registry, and our images rely on internal build artifacts, so it can't be used for a repro. If you can isolate the internal conditions which cause the crash from the logs, I might be able to construct a more generic test case - at this stage I don't understand the internal conditions leading to the crash.
Observations
When there are many targets which need building (cache misses), buildkit has fewer concurrent exports to process, which tends to result in a more stable run for the same set/count of targets.
When all layers are cached, however, 6 or more bake targets tend to crash buildkit within a few seconds of the bake & container starting.
There isn't a magic target number which triggers crashes. A run of 5 targets might crash on first try, but after a pause and retry, it will run successfully the second time. The more targets there are, the higher the chance of a crash.
Logs
The attachment 3dba0d7f53934d47fe23b56d05ddf2d01362610d9c580e9cbedd9dc788a8f05c-json.log is the container log for the builder created and crashed using these commands:
The second attachment ea41283f5920e3f10471b160ceed8b684eea37ec3351936dc50e74b393772b6b-json.log was created much the same way, but with --debug used instead:
On Tonis' suggestion, I will try to build a buildkitd/container with the go race detector enabled: wish me luck!
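One possible way to do that, sketched under the assumption of a local moby/buildkit checkout; this is not the project's official race-build workflow, and paths are illustrative.

# Build buildkitd with the Go race detector (which requires cgo) and run it directly;
# runc and any CNI config still need to be present on the host for the OCI worker.
git clone https://github.com/moby/buildkit.git
cd buildkit
CGO_ENABLED=1 go build -race -o bin/buildkitd ./cmd/buildkitd
sudo ./bin/buildkitd --addr unix:///run/buildkit/buildkitd.sock
# Any data race detected during a build is then reported in buildkitd's stderr/log.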