
Infinite looping in a build with a lot of cache imports #1984

Open
robtaylor opened this issue Feb 17, 2021 · 18 comments · Fixed by #1986

Comments

@robtaylor
Contributor

I have a public build here that's looping indefinitely. The build is using latest stable.

https://github.com/robtaylor/openlane/runs/1910363514?check_suite_focus=true

@tonistiigi
Member

Are there logs? I can't see any from the link. Can you repro locally, or is it related to your cache state?

The only similar issues I've seen are related to a COPY loop in the Dockerfile, where the same file is copied again to a target that already contains it. I didn't instantly spot a case like this, but it's a long Dockerfile so I might have missed something.
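
For illustration, the kind of pattern meant here looks roughly like this (a hypothetical sketch, not taken from the openlane Dockerfile; stage and path names are placeholders):

```dockerfile
# Hypothetical sketch of a COPY loop: a stage copies files it already has
# back onto itself, so the target already contains the copied content.
FROM alpine AS tools
COPY tools/ /opt/tools/

# "final" is based on "tools", so it already contains /opt/tools/ ...
FROM tools AS final
# ... and this copies the same files again into a target that already has them
COPY --from=tools /opt/tools/ /opt/tools/
```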

@robtaylor
Contributor Author

robtaylor commented Feb 17, 2021

This one seems to have stored the logs better: https://github.com/robtaylor/openlane/runs/1914403173?check_suite_focus=true (the loop triggers a GitHub workflow issue when the worker runs out of disk space).

Raw logs here: https://pipelines.actions.githubusercontent.com/6HcEdXX4QosuWhCiyfFIsk7uCtk1AKLuOCm1NIJ1Yrn64POax9/_apis/pipelines/1/runs/54/signedlogcontent/465?urlExpires=2021-02-17T22%3A38%3A02.7320963Z&urlSigningMethod=HMACV1&urlSignature=M6DSoXBN1qeWhLLzsAkL3IEGQVZX3N2etuxQgUzfxWA%3D
Built commit: robtaylor/openlane@aa36332

The cache is clean at the start of the build. Cache is fetched from inline caches on multiple artefacts on Docker Hub.
I suspect something in the cache state is causing this - I previously hit this and managed to get it working by regenerating one of the images. I haven't done that yet, so you still have a repro ;)

It's happening locally as well. Locally it tends to eventually fail with a cache item size mismatch error (I'm presuming due to running out of space on the builder Docker instance).

I've gone over all the COPY --from lines; they all have unique destination folders.

@tonistiigi tonistiigi changed the title Infinite looping in multi stage build using COPY --from Infinite looping in a build with a lot of cache imports Feb 18, 2021
@tonistiigi
Member

I can't reproduce because the images you import cache from are private, and running the build skips them for me, so I can't see what is weird in one of these images. If you can't get me access to the images, maybe you can give me steps to generate the cache into a local registry so that it hits this issue on import.
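
For example, roughly along these lines (a rough sketch; the registry, target, image names, and context path are placeholders, and with the docker-container driver extra network options may be needed to reach a localhost registry):

```bash
# Throwaway local registry to hold the cache images (name and port are placeholders)
docker run -d --name cache-registry -p 5000:5000 registry:2

# Build one of the dependency stages with inline cache metadata and push it
docker buildx build \
  --target some-dependency \
  --cache-to type=inline \
  --tag localhost:5000/openlane/some-dependency:cache \
  --push docker/

# Run the top-level build importing that cache (repeat --cache-from per source image)
docker buildx build \
  --cache-from type=registry,ref=localhost:5000/openlane/some-dependency:cache \
  --tag localhost:5000/openlane/openlane:latest \
  --push docker/
```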

@tonistiigi
Member

tonistiigi commented Feb 18, 2021

My hunch is that because you import so many cache sources, somewhere there is a loop. E.g. file A is copied into B, but based on another cache source the builder determines that B is the same as A, or maybe its parent, and then gets confused.

@robtaylor
Contributor Author

I'm a little confused: all the images on the shapebuild Docker Hub account are public.

To reproduce locally, use:
make -C docker CACHE_ID=shapebuild CROSS_BUILD=1 openlane

@robtaylor
Contributor Author

I've looked for loops, but I can't spot any. Could it be possible that some of the build stages are generating identical files that cause the clash?

@tonistiigi
Member

I didn't know this was the CACHE_ID. It's masked in the logs https://github.com/robtaylor/openlane/runs/1914403173?check_suite_focus=true#step:8:14

@robtaylor
Contributor Author

robtaylor commented Feb 18, 2021 via email

@tonistiigi
Member

I can reproduce now. Will look again tomorrow. If you need to update these cache source images please let me know first and I'll make a backup of them for my testing.

@tonistiigi
Member

So it looks like there isn't actually any loop; it just looks like one because of the massive amount of data. Two issues are in play here. The first is an issue with limiting concurrency when pulling the layers: it does not work properly with so many separate images being pulled in parallel, so too many TCP connections are created and things break down (I got an error from Cloudflare). The second issue is excessive progress rows in --progress=plain output when importing remote cache. We saw this in another project as well, and it looks like a regression in the last release, but I haven't looked into it yet. It seems harmless, just extra data on the client side.

I pushed an image with the concurrency fix: tonistiigi/buildkit:tcp-concurrency-fix.

With that image I didn't have issues with pulling the cache. It took about 5 minutes, though, with almost 400 requests. I didn't wait until the QEMU builds finished, but everything up to that point seemed normal. https://gist.github.com/tonistiigi/66f16d8daf6750d29ce8f51ac9a228c2

@robtaylor
Contributor Author

That's amazing work, thank you Toni!

Out of interest, why is it making 400 requests?

@tonistiigi
Member

tonistiigi commented Feb 18, 2021

> Out of interest, why is it making 400 requests?

There are 36 (edit: I initially counted only one arch, which gave 18) cache sources. In total they pull 221 layers (I didn't check how many different cache sources match, but a lot). The rest are things like pulling manifests, image configs, authentication handshakes, etc. (maybe some redirects also showed up in the logs).

Did you have a chance to validate tonistiigi/buildkit:tcp-concurrency-fix locally? You can set a custom buildkit image with --driver-opt image=<ref> on buildx create.
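
For example (the builder name below is a placeholder):

```bash
# Create a builder backed by the patched BuildKit image and switch to it
docker buildx create \
  --name tcp-fix-test \
  --driver docker-container \
  --driver-opt image=tonistiigi/buildkit:tcp-concurrency-fix \
  --use

# Start the builder and confirm which BuildKit image it is running
docker buildx inspect --bootstrap
```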

@robtaylor
Contributor Author

robtaylor commented Feb 19, 2021

@tonistiigi For some reason, it's now decided to rebuild all the dependencies. Could that be an effect of this patch?

https://pipelines.actions.githubusercontent.com/6HcEdXX4QosuWhCiyfFIsk7uCtk1AKLuOCm1NIJ1Yrn64POax9/_apis/pipelines/1/runs/60/signedlogcontent/465?urlExpires=2021-02-19T12%3A13%3A28.1551892Z&urlSigningMethod=HMACV1&urlSignature=cGkBZtUj%2BUv78UX4zkv6u8FtU5fkWUtqtgaM3GJVYjo%3D

Interestingly, it worked fine locally a couple of times, but it is now behaving the same as the build above.

@robtaylor
Contributor Author

@tonistiigi @AkihiroSuda this isn't fixed for me... why is it closed?

@AkihiroSuda
Member

The PR #1986 description contained "partly fixes #1984", so GitHub automatically fully closed this issue 😢
Anyway, the PR was fully reverted in #1989, so we need to "fully" reopen this again.

@AkihiroSuda AkihiroSuda reopened this Feb 22, 2021
@robtaylor
Contributor Author

Thanks @AkihiroSuda !

@robtaylor
Contributor Author

Any more thoughts on this?

@ojab

ojab commented Mar 5, 2021

I created another issue about an infinite loop with cached layers, #2009; it doesn't look like a duplicate after reading the comments here.
Please tell me if it is, and I'll close it and subscribe here instead.
