Race condition when multiple parallel stages depend on the same cache mount path. #5566

Closed
KevinMind opened this issue Dec 4, 2024 · 4 comments · Fixed by #5588

@KevinMind

I'm not sure if this is the same problem, but it does seem like implementing this feature might also solve this bug.

I think this is the minimal setup to reproduce this issue:

  • two stages that do not depend on each other mount the same cache directory
  • both stages are required by the final stage, so both run in the build
FROM python:3.11-slim-bookworm AS one

FROM one AS two

RUN \
  --mount=type=cache,target=/npm/cache \
  npm install

FROM one AS three

RUN \
  --mount=type=cache,target=/npm/cache \
  npm install

FROM one AS four

COPY --stage=two / /
COPY --stage=three / /

Realistic example:

logs: https://github.com/mozilla/addons-server/actions/runs/12159813323/job/33910761294?pr=22911

0.210 runc run failed: unable to start container process: error during container init: error mounting "/var/lib/docker/tmp/buildkit-mount2661530153" to rootfs at "/deps/cache/npm": create mountpoint for /deps/cache/npm mount: mkdirat /var/lib/docker/buildkit/executor/s20w0ugneetszzlvewmiu49i6/rootfs/deps/cache/npm: file exists

This is a race condition: sometimes it happens, sometimes it doesn't. It feels like this should not be allowed to happen; BuildKit should be smart enough to lazily create the mount points, reuse them if they already exist, or key the cache mounts by stage so they are independent of each other.
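
As a possible workaround along the lines of that last suggestion (a sketch only, not tested against this setup): cache mounts accept an id, which defaults to the target path, so giving each stage its own id makes the caches stage-independent instead of shared:

# npm-two / npm-three are arbitrary ids chosen for this sketch; distinct ids
# give each stage its own cache instead of one shared mount
FROM one AS two

RUN \
  --mount=type=cache,id=npm-two,target=/npm/cache \
  npm install

FROM one AS three

RUN \
  --mount=type=cache,id=npm-three,target=/npm/cache \
  npm install

The trade-off is that the npm cache is no longer shared between the two stages.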

@tonistiigi
Member

I don't think the provided example is very realistic. Apart from there being no COPY --stage flag, if you run npm install in exactly the same context then BuildKit will optimize it out and there will not be two instances of npm install running, but only one.

It's not clear to me which mkdir actually hits the race condition, so a runnable reproducer would clear it up.
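
To illustrate the optimization point above (a sketch, not taken from the issue): when two stages have an identical base image and identical instructions, BuildKit resolves them to the same build step and runs it only once, so the commands have to differ before two executions can race on the same cache mount:

# these two stages are identical, so the RUN step is executed only once
FROM alpine AS a
RUN --mount=type=cache,target=/npm/cache echo install

FROM alpine AS b
RUN --mount=type=cache,target=/npm/cache echo install

Making the commands differ, as the reproducer below does with echo 1 and echo 2, is what produces two parallel executions sharing the same cache mounts.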

@tonistiigi
Member

I managed to reproduce it with the following Dockerfile:

FROM alpine AS b1
RUN --mount=type=cache,target=/foo/bar --mount=type=cache,target=/foo/bar/baz echo 1

FROM alpine AS b2
RUN --mount=type=cache,target=/foo/bar --mount=type=cache,target=/foo/bar/baz echo 2

FROM scratch
COPY --from=b1 /etc/passwd p1
COPY --from=b2 /etc/passwd p2

Script

#!/usr/bin/env bash

set -ex

for i in $(seq 1 100); do
  docker buildx build --no-cache .
done

AFAICS this is fundamentally a problem in runc, so I will first report it there. Theoretically, I think we could add some locking so that runc startup is not called in parallel when the specs share the same mounts.

@thaJeztah
Member

@tonistiigi I marked this as fixed through #5588, but I just saw that you mentioned some possible changes on the BuildKit side as well, so I'm not sure whether this one should be closed or stay open for that.

@tonistiigi
Member

@thaJeztah Closing is correct
