
"unlinkat ([path]) (Directory not empty)" errors with experimental_worker_for_repo_fetching #22680

Open
jfirebaugh opened this issue Jun 10, 2024 · 23 comments
Labels
P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. type: bug

Comments

@jfirebaugh

Description of the bug:

After upgrading to Bazel 7.2.0 and removing --experimental_worker_for_repo_fetching=off from .bazelrc (because #21803 was fixed), we started to observe failures of the following form:

ERROR: /home/circleci/figma/WORKSPACE.bazel:157:27: fetching node_repositories rule //external:nodejs_linux_amd64: java.io.IOException: unlinkat (/home/circleci/.cache/bazel/_bazel_circleci/a6e8cb0ce75ceb1ea1faa0cca0b0ae0d/external/nodejs_linux_amd64/bin/nodejs) (Directory not empty)
ERROR: no such package '@@nodejs_linux_amd64//': unlinkat (/home/circleci/.cache/bazel/_bazel_circleci/a6e8cb0ce75ceb1ea1faa0cca0b0ae0d/external/nodejs_linux_amd64/bin/nodejs) (Directory not empty)

These errors appear to be linked with repository fetch restarts, particularly "fetch interrupted due to memory pressure; restarting". In two of two cases where I have seen this error, it happened in a repo where the repo fetch got restarted, judging by either the presence of that log message or the JSON trace profile.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

Bazel Slack discussion: https://bazelbuild.slack.com/archives/CA31HN1T3/p1718057176146129

Any other information, logs, or outputs that you want to share?

No response

@jfirebaugh
Author

I am presently confirming that --experimental_worker_for_repo_fetching=off mitigates the issue.

@Wyverald Wyverald added P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositories, WORKSPACE file. and removed untriaged labels Jun 10, 2024
@jfirebaugh
Author

Still on 7.2.0, haven't seen this again with --experimental_worker_for_repo_fetching=off.
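For reference, here's a minimal .bazelrc sketch of that mitigation. The flag and value are the ones discussed in this thread; putting it under common (rather than, say, build) is just a placement choice:

# .bazelrc -- sketch of the workaround discussed above:
# re-disable worker-based repo fetching until the underlying bug is resolved
common --experimental_worker_for_repo_fetching=off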

@fmeum
Collaborator

fmeum commented Jun 14, 2024

I managed to reproduce this via #22748 (comment):

INFO: repository @@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2' used the following cache hits instead of downloading the corresponding file.
 * Hash '16a51843ef28d79f06c864eb305266b3daa1dc2a932af02a82ab139e42c8f2c2aed34dbca2ba8187134c16415e9f4cc6ca0e9ea40083df6a63564e7c9e0204ad' for https://registry.npmjs.org/decode-uri-component/-/decode-uri-component-0.2.2.tgz
If the definition of 'repository @@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2' was updated, verify that the hashes were also updated.
com.google.devtools.build.lib.rules.repository.RepositoryFunction$RepositoryFunctionException: java.io.IOException: /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
	at com.google.devtools.build.lib.rules.repository.RepositoryFunction.setupRepoRoot(RepositoryFunction.java:139)
	at com.google.devtools.build.lib.bazel.repository.starlark.StarlarkRepositoryFunction.lambda$fetch$1(StarlarkRepositoryFunction.java:154)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at java.base/java.util.concurrent.ThreadPerTaskExecutor$TaskRunner.run(ThreadPerTaskExecutor.java:314)
	at java.base/java.lang.VirtualThread.run(VirtualThread.java:309)
Caused by: java.io.IOException: /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
	at com.google.devtools.build.lib.unix.NativePosixFiles.remove(Native Method)
	at com.google.devtools.build.lib.unix.UnixFileSystem.delete(UnixFileSystem.java:444)
	at com.google.devtools.build.lib.vfs.FileSystem.deleteTree(FileSystem.java:232)
	at com.google.devtools.build.lib.vfs.Path.deleteTree(Path.java:594)
	at com.google.devtools.build.lib.rules.repository.RepositoryFunction.setupRepoRoot(RepositoryFunction.java:136)
	... 6 more
ERROR: <builtin>: fetching npm_import_rule rule //:aspect_rules_js~~npm~npm__decode-uri-component__0.2.2: java.io.IOException: /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
ERROR: no such package '@@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2//': /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
ERROR: /Users/fmeum/git/examples/frontend/BUILD.bazel:10:22: //:.aspect_rules_js/node_modules/[email protected]/pkg depends on @@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2//:pkg in repository @@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 which failed to fetch. no such package '@@aspect_rules_js~~npm~npm__decode-uri-component__0.2.2//': /private/var/tmp/_bazel_fmeum/0465a0857e26a35c072f48ab751a1794/external/aspect_rules_js~~npm~npm__decode-uri-component__0.2.2 (Directory not empty)
Use --verbose_failures to see the command lines of failed build steps.
ERROR: Analysis of target '//react/src:test_lib' failed; build aborted: Analysis failed
INFO: Elapsed time: 38.951s, Critical Path: 0.02s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED:
    Fetching repository @@aspect_rules_swc~~swc~swc_darwin-arm64; fetch interrupted due to memory pressure; restarting. 19s
    Fetching repository @@aspect_rules_js~~npm~npm__buffer-from__1.1.2; fetch interrupted due to memory pressure; restarting. 11s
    Fetching https://registry.npmjs.org/@typescript-eslint/type-utils/-/type-utils-5.44.0.tgz; 11s
    Fetching repository @@aspect_rules_js~~npm~npm__esbuild-windows-64__0.15.18; fetch interrupted due to memory pressure; restarting. 10s
    Fetching repository @@aspect_rules_js~~npm~npm__esbuild-freebsd-arm64__0.15.18; fetch interrupted due to memory pressure; restarting. 8s
    Fetching https://registry.npmjs.org/is-weakset/-/is-weakset-2.0.2.tgz; 7s
    Fetching repository @@aspect_rules_js~~npm~npm__at_next_swc-win32-arm64-msvc__13.0.5; fetch interrupted due to memory pressure; restarting. 6s
    Fetching repository @@aspect_rules_js~~npm~npm__randombytes__2.1.0; fetch interrupted due to memory pressure; restarting. 5s ... (20 fetches)

@meteorcloudy
Member

@bazel-io fork 7.2.1

copybara-service bot pushed a commit that referenced this issue Jun 19, 2024
`StarlarkBaseExternalContext` now implements `AutoCloseable` and, in `close()`:
1. Cancels all pending async tasks.
2. Awaits their termination.
3. Cleans up the working directory (always for module extensions, on failure for repo rules).
4. Fails if there were pending async tasks in an otherwise successful evaluation.

Previously, module extensions didn't do any of those. Repo rules did 1 and 4 and sometimes 3, but not in all cases.

This change required replacing the fixed-size thread pool in `DownloadManager` with virtual threads, thereby resolving a TODO about not using a fixed-size thread pool for the `GrpcRemoteDownloader`.

Work towards #22680
Work towards #22748

Closes #22772.

PiperOrigin-RevId: 644669599
Change-Id: Ib71e5bf346830b92277ac2bd473e11c834cb2624
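To make that lifecycle concrete, here is a rough, illustrative Java sketch of the four close() steps the commit message describes. The class name, fields, and helper methods below are hypothetical; this is not the actual StarlarkBaseExternalContext code, only the shape of the behavior described above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Future;

// Hypothetical sketch, not the real Bazel class.
abstract class ExternalContextSketch implements AutoCloseable {
  private final List<Future<?>> asyncTasks = new ArrayList<>();
  private final Path workingDirectory;
  private boolean evaluationSucceeded;

  ExternalContextSketch(Path workingDirectory) {
    this.workingDirectory = workingDirectory;
  }

  void markSuccess() {
    evaluationSucceeded = true;
  }

  /** True for module extensions (always clean up), false for repo rules (only on failure). */
  abstract boolean alwaysCleanUpWorkingDirectory();

  @Override
  public void close() throws IOException {
    // 1. Cancel all pending async tasks (e.g. downloads started by the rule).
    boolean hadPendingTasks = false;
    for (Future<?> task : asyncTasks) {
      hadPendingTasks |= !task.isDone();
      task.cancel(/* mayInterruptIfRunning= */ true);
    }
    // 2. Await their termination so nothing keeps writing into the repo
    //    directory while (or after) it gets deleted.
    for (Future<?> task : asyncTasks) {
      try {
        task.get();
      } catch (Exception e) {
        // Cancelled or failed tasks are expected here; they only need to have stopped.
      }
    }
    // 3. Clean up the working directory: always for module extensions,
    //    only on failure for repo rules.
    if (alwaysCleanUpWorkingDirectory() || !evaluationSucceeded) {
      try (var paths = Files.walk(workingDirectory)) {
        for (Path p : paths.sorted(Comparator.reverseOrder()).toList()) {
          Files.deleteIfExists(p);
        }
      }
    }
    // 4. Fail if tasks were still pending in an otherwise successful evaluation.
    if (hadPendingTasks && evaluationSucceeded) {
      throw new IllegalStateException("pending async tasks after successful evaluation");
    }
  }
}

The ordering (await termination before deleting anything) is presumably what keeps a still-running download or extraction from racing with the directory cleanup.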
fmeum added a commit to fmeum/bazel that referenced this issue Jun 20, 2024
github-merge-queue bot pushed a commit that referenced this issue Jun 20, 2024
fmeum added a commit to fmeum/bazel that referenced this issue Jun 25, 2024
github-merge-queue bot pushed a commit that referenced this issue Jun 25, 2024
fmeum added a commit to fmeum/bazel that referenced this issue Jul 18, 2024
jjmaestro added a commit to jjmaestro/bazel_issue_22680_repro that referenced this issue Jul 22, 2024
jjmaestro added a commit to jjmaestro/bazel_issue_22680_repro that referenced this issue Jul 22, 2024
jjmaestro added a commit to jjmaestro/bazel_issue_22680_repro that referenced this issue Jul 22, 2024
jjmaestro added a commit to jjmaestro/bazel_issue_22680_repro that referenced this issue Jul 22, 2024
jjmaestro added a commit to jjmaestro/bazel_issue_22680_repro that referenced this issue Jul 22, 2024
@jjmaestro
Contributor

jjmaestro commented Jul 22, 2024

I have a repro of this bug here: https://github.com/jjmaestro/bazel_issue_22680_repro

It always fails when I run it inside that Docker container.

@fmeum and @iancha1992 & @moroten (from #22748) could you check it out and see if this also fails for you? Thanks!!

PS: also, sorry for the commit noise. I did a few push --force and didn't realize that GH would post the commits here because I had linked the issue 😓

@iancha1992
Member

@jjmaestro Did you test with 7.2.1? According to #23029 (comment), the fix may have been included in 7.2.1.

@jjmaestro
Contributor

@jjmaestro Did you test with 7.2.1? According to #23029 (comment), the fix may have been included in 7.2.1.

Yup, that's exactly what I did. The repro repo that I linked is set to 7.2.1rc2. Can you give it a quick check? :)

It's literally one git clone and make away 👌 That will run a Docker image with Bazel 7.2.1rc2 and do one bazel build that works (cache dir inside Docker) and two that fail, with the cache directory outside Docker. One of those also uses the --experimental_worker_for_repo_fetching=off flag, which was mentioned as a mitigation for this issue, but it still fails in the repro I built.

I hope this will help debug what's causing this!

@jjmaestro
Contributor

Ah, ok, sorry, my brain just caught on to what you are saying 😅 I've updated the repo and re-run it with 7.2.1 and it's still broken. I had tested it with both versions myself, dunno why my brain short-circuited and committed the rc2 version instead of the released version 😫 Sorry for the noise!

Can you check it out and see if that helps with identifying what's going on?

Thanks!

@meteorcloudy
Member

@jjmaestro Does it work with Bazel built from last_green? (Sorry, I'm a bit too occupied right now to try the repro.)

@jjmaestro
Contributor

@meteorcloudy not sure, I haven't tested it yet with a custom-built version. I was just starting to clone the repo and read up on how to build Bazel from source, so I'll play with this a bit more and will get back to you!

@fmeum
Collaborator

fmeum commented Jul 23, 2024

I ran the reproducer, but all three builds passed for me.

@jjmaestro
Contributor

@fmeum wow, that's surprising! It's a 100% repro for me, it always fails when I run it. And I had a friend run the repro and it also failed for them, which is why I thought it could be something with Bazel. But maybe it's something with Docker and/or some incompatibility or configuration?

I'm running Docker 26.1.4, build 5650f9b on a Mac Studio (MacOS 13.6.6 on M1 Max), so Docker runs on linux/arm64.

What's your setup? Also, any ideas as to how I could debug this further? I'd love if I could get some help and see if I can dig into this further :)

Thanks!!

@fmeum
Collaborator

fmeum commented Jul 23, 2024

I ran this in a Debian VM on a Mac M3 host.

Could you share the stack traces and build failures you got?

@jjmaestro
Contributor

jjmaestro commented Jul 23, 2024

I ran this in a Debian VM on a Mac M3 host.

Yeah, I just ran it on a Debian VM (debian12 running in UTM) on my Mac and it does work... so my guess is that there's something going on with the virtualization that Docker uses on Mac.

Could you share the stack traces and build failures you got?

This is the error I get running build with --verbose_failures:

$ bazel --output_base /tmp/out/bazel build --verbose_failures //testimage:tarball

INFO: Repository rules_python~~python~python_3_11_aarch64-unknown-linux-gnu instantiated at:
  <builtin>: in <toplevel>
Repository rule python_repository defined at:
  /tmp/out/bazel/external/rules_python~/python/repositories.bzl:443:36: in <toplevel>
INFO: repository @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu' used the following cache hits instead of downloading the corresponding file.
 * Hash 'b3a7199ac2615d75fb906e5ba556432efcf24baf8651fc70370d9f052d4069ee' for https://github.com/indygreg/python-build-standalone/releases/download/20240415/cpython-3.11.9+20240415-aarch64-unknown-linux-gnu-install_only.tar.gz
If the definition of 'repository @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu' was updated, verify that the hashes were also updated.
ERROR: An error occurred during the fetch of repository 'rules_python~~python~python_3_11_aarch64-unknown-linux-gnu':
   Traceback (most recent call last):
        File "/tmp/out/bazel/external/rules_python~/python/repositories.bzl", line 255, column 28, in _python_repository_impl
                rctx.delete("share/terminfo")
Error in delete: java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: /tmp/out/bazel/external/rules_pkg~/pkg/private/tar/BUILD:47:10: @@rules_pkg~//pkg/private/tar:build_tar depends on @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//:python_runtimes in repository @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu which failed to fetch. no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: Analysis of target '//testimage:tarball' failed; build aborted: Analysis failed
INFO: Elapsed time: 7.711s, Critical Path: 0.01s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED: 
    Fetching repository @@_main~_repo_rules~postgres_src; starting 7s
    Fetching repository @@rules_foreign_cc~~tools~cmake-3.23.2-linux-aarch64; starting 7s
    Fetching /tmp/out/bazel/external/_main~_repo_rules~postgres_src; Extracting REL_16_3.tar.gz 6s
    Fetching https://github.com/Kitware/CMake/releases/download/v3.23.2/cmake-3.23.2-linux-aarch64.tar.gz; 20.1 MiB (44.5%) 6s

Re. the stack traces, I'm a Bazel n00b so I don't know if there's any other flag or way to get more verbose output or logs. If there is, please let me know and I'll get them.

Thanks!

@jjmaestro
Contributor

It seems the offending line is L255, rctx.delete("share/terminfo"), in https://github.com/bazelbuild/rules_python/blob/084b877c98b580839ceab2b071b02fc6768f3de6/python/repositories.bzl#L237-L256

So, Bazel fails when trying to delete a directory with a "Directory not empty" error, or something like that :-?

I'll try to dig deeper and see if I can get more info on the Bazel side

@fmeum
Collaborator

fmeum commented Jul 24, 2024

Since the stack trace is produced by an rctx.delete call in a Python repo rule and the crash still happens with virtual workers turned off, I wonder whether this may be a legitimate domain-specific non-Bazel issue. Have you raised this on the rules_python repo? I won't rule out that this is ultimately a Bazel issue though.

@jjmaestro
Contributor

No, I haven't yet. It could definitely be either, but I was more inclined to suspect that the error is somewhere in Bazel, because that rctx.delete in rules_python works under certain conditions, e.g. when the directory / files are not inside a Docker volume running under the "Docker virtualization" (whatever that is, I need to check what the VM is and how it differs from e.g. UTM, where it doesn't repro)... and that, to me, points more to a potential bug somewhere deeper than a rule.

I'll try to dig a bit more into this and see where it goes!

@jjmaestro
Contributor

Small update: IMHO there's definitely something weird going on... to me, this looks like some sort of race condition / non-deterministic behavior. When I add printf statements to do some print-debugging, I get a different error than when I just run bazel built from main!?

When I just compile main at [17eadaf366] and use that bazel-dev, I get LOG-main_without_prints, which is the same error I've been getting with a 7.2.1 binary:

INFO: Repository rules_python~~python~python_3_11_aarch64-unknown-linux-gnu instantiated at:
  <builtin>: in <toplevel>
Repository rule python_repository defined at:
  /tmp/out/bazel/external/rules_python~/python/repositories.bzl:443:36: in <toplevel>
ERROR: An error occurred during the fetch of repository 'rules_python~~python~python_3_11_aarch64-unknown-linux-gnu':
   Traceback (most recent call last):
	File "/tmp/out/bazel/external/rules_python~/python/repositories.bzl", line 255, column 28, in _python_repository_impl
		rctx.delete("share/terminfo")
Error in delete: java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
Analyzing: target //testimage:tarball (277 packages loaded, 9517 targets configured)
[1 / 1] checking cached actions
ERROR: no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: /tmp/out/bazel/external/rules_pkg~/pkg/private/tar/BUILD:47:10: @@rules_pkg~//pkg/private/tar:build_tar depends on @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//:python_runtimes in repository @@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu which failed to fetch. no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': java.io.IOException: unlinkat (/tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/share/terminfo/X) (Directory not empty)
ERROR: Analysis of target '//testimage:tarball' failed; build aborted: Analysis failed
INFO: Elapsed time: 63.166s, Critical Path: 0.02s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED: 

But when I run the repro in Docker using a compiled bazel-dev with this patch applied to main at [17eadaf366], I get a different error: LOG-main_with_prints:

<PRINT-DEBUG LOGS SHOWING DELETED PATHS>
(...)
ERROR: no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': error globbing [bin/**, extensions/**, include/**, lib/**, libs/**, share/**] - [**/* *, lib/libpython3.11.so, lib/**/*.a, lib/python3.11/**/test/**, lib/python3.11/**/tests/**, **/__pycache__/*.pyc.*, **/__pycache__/*.pyc, **/__pycache__/*.pyo] op=FILES: /tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/lib/python3.11/site-packages/pip/_internal/operations/build (No such file or directory)
ERROR: /tmp/out/bazel/external/rules_pkg~/pkg/private/tar/BUILD:47:10: Target '@@rules_pkg~//pkg/private/tar:build_tar' depends on toolchain '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//:python_runtimes', which cannot be found: no such package '@@rules_python~~python~python_3_11_aarch64-unknown-linux-gnu//': error globbing [bin/**, extensions/**, include/**, lib/**, libs/**, share/**] - [**/* *, lib/libpython3.11.so, lib/**/*.a, lib/python3.11/**/test/**, lib/python3.11/**/tests/**, **/__pycache__/*.pyc.*, **/__pycache__/*.pyc, **/__pycache__/*.pyo] op=FILES: /tmp/out/bazel/external/rules_python~~python~python_3_11_aarch64-unknown-linux-gnu/lib/python3.11/site-packages/pip/_internal/operations/build (No such file or directory)'
ERROR: Analysis of target '//testimage:tarball' failed; build aborted: Analysis failed
INFO: Elapsed time: 50.333s, Critical Path: 0.02s
INFO: 1 process: 1 internal.
ERROR: Build did NOT complete successfully
FAILED: 

This seems to me like the delete that failed before is now working when the logging / printf lines are executed, and then it fails somewhere else, or something like that.

In any case, whether this is what's happening or something else, it seems quite weird to me that these errors would be caused by rules_python and that I get different results (different errors) just by having or not having printf statements in the Bazel code :-? Maybe I'm missing something and/or I'm doing something wrong, dunno.

I'll keep checking further, hopefully I'll be able to isolate / get closer to what's actually failing.

@fmeum
Collaborator

fmeum commented Jul 25, 2024

CC @rickeylev @aignas in case you know of any logic in the rules_python repo rules that could result in concurrent access to the files it deletes.

@jjmaestro
Contributor

@fmeum BTW, just found this while looking for the implementation of deleteTreesBelow:

// On macOS and some other non-Linux OSes, on some filesystems, readdir(dir)
// may return NULL after an entry in dir is deleted even if not all files have
// been read yet - see
// https://pubs.opengroup.org/onlinepubs/9699919799/functions/readdir.html;
// "If a file is removed from or added to the directory after the most recent
// call to opendir() or rewinddir(), whether a subsequent call to readdir()
// returns an entry for that file is unspecified." We thus read all the names
// of dir's entries before deleting. We don't want to simply use fts(3)
// because we want to be able to chmod at any point in the directory hierarchy
// to retry a filesystem operation after hitting an EACCES.
// If in the future we hit any problems here due to the unspecified behavior
// of readdir() when a file has been deleted by a different thread we can use
// some form of locking to make sure the threads don't try to clean up the
// same directory at the same time; or doing it in a loop until the directory
// is really empty.

Part of the comment was added by @oquenchil in d6c79db, and the commit message mentioned:

Asynchronous deletion may make it more likely for the problem to surface but the likelihood is still very low.

Could this be the issue? Maybe running the deletes in a Docker volume on MacOS makes triggering the bug more likely :-?
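For illustration, here is a minimal Java sketch of the fallback the comment's last sentence mentions, "doing it in a loop until the directory is really empty". It is not Bazel's actual deleteTreesBelow implementation, just the retry idea, with the entry names collected before any deletion as the comment suggests:

import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Bazel's deleteTreesBelow.
final class RetryingDeleteTree {
  static void deleteTree(Path root, int maxAttempts) throws IOException {
    DirectoryNotEmptyException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        deleteRecursively(root);
        return; // the whole tree is gone
      } catch (DirectoryNotEmptyException e) {
        // readdir() missed entries, or new ones appeared concurrently; rescan and retry.
        last = e;
      }
    }
    throw last != null ? last : new IOException("could not delete " + root);
  }

  private static void deleteRecursively(Path path) throws IOException {
    if (Files.isDirectory(path, LinkOption.NOFOLLOW_LINKS)) {
      // Read all entry names before deleting any of them, mirroring the
      // precaution about readdir() behavior described in the comment above.
      List<Path> entries = new ArrayList<>();
      try (DirectoryStream<Path> stream = Files.newDirectoryStream(path)) {
        for (Path entry : stream) {
          entries.add(entry);
        }
      } catch (NoSuchFileException e) {
        return; // the directory vanished concurrently; nothing left to do
      }
      for (Path entry : entries) {
        deleteRecursively(entry);
      }
    }
    // Throws DirectoryNotEmptyException if entries remain (or reappeared).
    Files.deleteIfExists(path);
  }
}

This only illustrates the comment's suggestion; Bazel's real code is the native unlinkat-based deletion in NativePosixFiles / deleteTreesBelow quoted above.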

@rickeylev
Contributor

rickeylev commented Jul 25, 2024 via email

@jjmaestro
Contributor

Digging into why this happens in Docker running on MacOS, I see that it's using virtiofs to share the filesystem in those volumes. And, looking into virtiofs and Docker... well, there are quite a few present and past issues, quite the repertoire, like docker/for-mac#7004 and a lot of permissions issues.

I've also checked past issues with Bazel and virtiofs and sure enough, found a past permission issue: MacOS + Docker + Cache on the VirtioFS volume => "ERROR: Failed to initialize sandbox: /sandbox/inaccessibleHelperDir (Permission denied)" #16364 .

I'm trying to "prove this" by using Docker with gRPC-FUSE instead but... that also fails the build with a different error 🤣 😭

# bazel --output_base /tmp/out/bazel build --verbose_failures //testimage:tarball
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Analyzed target //testimage:tarball (278 packages loaded, 18569 targets configured).
ERROR: /tmp/out/bazel/external/rules_foreign_cc~/toolchains/private/BUILD.bazel:23:10: output tree artifact external/rules_foreign_cc~/toolchains/private/make was not created
ERROR: /tmp/out/bazel/external/rules_foreign_cc~/toolchains/private/BUILD.bazel:23:10: BootstrapGNUMake external/rules_foreign_cc~/toolchains/private/make [for tool] failed: not all outputs were created or valid
Target //testimage:tarball failed to build
INFO: Elapsed time: 148.828s, Critical Path: 24.86s
INFO: 110 processes: 32 internal, 78 processwrapper-sandbox.
ERROR: Build did NOT complete successfully

And... after that failure, I can't remove the bazel cache! 😅

$ rm -rf /tmp/out/bazel 
rm: cannot remove '/tmp/out/bazel/execroot/_main/bazel-out/aarch64-opt-exec-ST-d57f47055a04/bin/external/rules_foreign_cc~/toolchains/private/make/make/bin/make': Operation not permitted
rm: cannot remove '/tmp/out/bazel/execroot/_main/bazel-out/aarch64-opt-exec-ST-d57f47055a04/bin/external/rules_foreign_cc~/toolchains/private/make/make/include/gnumake.h': Operation not permitted
rm: cannot remove '/tmp/out/bazel/execroot/_main/bazel-out/aarch64-opt-exec-ST-d57f47055a04/bin/external/rules_foreign_cc~/toolchains/private/make/make/share/info/make.info': Operation not permitted
(...)
rm: cannot remove '/tmp/out/bazel/execroot/_main/bazel-out/aarch64-opt-exec-ST-d57f47055a04/bin/external/rules_foreign_cc~/toolchains/private/make/make/share/man/man1': Operation not permitted

I get the same errors even when trying to remove the folder outside Docker. I need to sudo rm in MacOS to remove it. Something seems very broken in Docker, both with virtiofs and gRPC-FUSE.

So, I have a feeling there are one or more bugs in virtiofs and/or docker-on-mac with virtiofs, and that's what's causing the breakage 😞 I'll probably just avoid all of this and work on a Linux VM directly so that there's no volume shared with MacOS 🤷

Thanks for all your help!

@jjmaestro
Contributor

BTW, for completeness, I've also tried removing Docker Desktop and running the docker CLI with Lima.

With "everything default" (qemu and sshfs), the three builds in the repro repo run OK ("as expected"):

limactl create --name=docker template://docker

With a vz VM using virtiofs:

limactl create --name=docker --vm-type=vz --mount-type=virtiofs template://docker

and making sure that home and /tmp/lima are mounted rw (limactl edit docker --mount-writable, or edit $HOME/.lima/docker/lima.yaml):

mounts:
  - location: "~"
    writable: true
  - location: "/tmp/lima"
    writable: true

all three builds also work!

So my guess remains what I mentioned in my last comment: something's going on with Docker Desktop and virtiofs. Maybe there's a bug in the virtiofs version that Docker Desktop uses, or something in how Docker Desktop uses virtiofs triggers the issue, or something else.

oliviernotteghem pushed a commit to uber-common/bazel that referenced this issue Aug 13, 2024
oliviernotteghem pushed a commit to uber-common/bazel that referenced this issue Dec 4, 2024