"dangling symbolic link" flakes when switching from --noenable_bzlmod -> --enable_bzlmod #22867

Open
quic-sbjorkle opened this issue Jun 24, 2024 · 19 comments
Labels
area-Bzlmod Bzlmod-specific PRs, issues, and feature requests P2 We'll consider working on this in future. (Assignee optional) team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. type: bug

Comments

@quic-sbjorkle

Description of the bug:

Switching from --noenable_bzlmod to the default --enable_bzlmod occasionally causes builds to flake, resulting in "dangling symbolic link" errors. Although this issue was reported in #20886 and supposedly fixed in commit 52adf0b, the problem persists.

Error message from build:

13:34:43 ERROR: /Top-bazel/478b2cbff2254079381e27d1a245fab2/external/FOO/BUILD.bazel:99:11: output '_solib_k8/_U@@FOO_Ulibrary___Uexternal_FOO_Sexternal_FOO_Slib_Smodified_Urunpath/libBar.so.5' is a dangling symbolic link
13:34:44 ERROR: /Top-bazel/478b2cbff2254079381e27d1a245fab2/external/FOO/BUILD.bazel:99:11: SolibSymlink _solib_k8/_U@@FOO_S_S_FOO_Ulibrary___Uexternal_SFOO_Sexternal_SFOO_Slib_Smodified_Urunpath/libBar.so.5 failed: not all outputs were created or valid
13:34:44 [207,396 / 262,586] 645 / 1863 tests, 1 failed; checking cached actions

This error occurs quite often in CI and less often on local development machines. It appears to be linked to the local action cache becoming corrupted when toggling --enable_bzlmod. Once the error manifests on a node, it persists on that node until the action cache is cleared or ./bazel clean is executed.

Below are extracted parts from a decoded action cache:

In index:
bazel-out/ubuntu22-fastbuild/bin/_solib_k8/_U@@FOOcc_Ulibrary___Uexternal_SFOO_Sexternal_SFOO_Slib_Smodified_Urunpath/libBar.so.5 <==> 333425 <-- Missing entry

ls -alh bazel-out/ubuntu22-fastbuild/bin/_solib_k8/_U@@FOOcc_Ulibrary___Uexternal_SCLibs_UFOO_Sexternal_SFOO_Slib_Smodified_Urunpath/libBar.so.5 <--- File exists but is a dangling symlink
lrwxrwxrwx 1 user 1002 182 May 23 11:34 bazel-out/ubuntu22-fastbuild/bin/_solib_k8/_U@@FOO_S_S_FOO_Ucc_Ulibrary___Uexternal_SFOO_Sexternal_SFOO_Slib_Smodified_Urunpath/libBar.so.5 -> /Top-bazel/478b2cbff2254079381e27d1a245fab2/execroot/_main/bazel-out/ubuntu22-fastbuild/bin/external/FOO/external/FOO/lib/modified_runpath/libBar.so.5
333424, bazel-out/ubuntu22-fastbuild/bin/external/FOO/external/FOO/lib/modified_runpath/libBar.so.5:
      actionKey = efa037ea581a671c7388fa04511f5002423d8f58bf4942ba7e4bbaba04eeeea2
      usedClientEnvKey = 3ef42779f2af10825fa190221bb26ed61875ae41b887cc046b95040348911d16
      digestKey = b4ab98ca955cd21c9533086be59710a64be28843af7bed2436b88f07b58000ae
      bazel-out/ubuntu22-fastbuild/bin/external/FOO/external/FOO/lib/modified_runpath/libBar.so.5 = RemoteFileArtifactValueWithExpiration{digest=0x91949D49AED5522D6D40827CB94D192037457E6DC5036D0482D771E4F389D831, size=5976097, locationIndex=1, materializationExecPath=null, expireAtEpochMilli=1715950679390}
      packed_len = 186

A known workaround is to use the flag --nouse_action_cache.
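
To check whether a node is already in the bad state before clearing the cache, one option is to scan the output tree for symlinks whose targets are missing. The following is a minimal sketch, not part of Bazel; the function name `find_dangling_symlinks` is hypothetical, and it only reports links that are dangling at the time of the scan.

```python
import os


def find_dangling_symlinks(root):
    """Return sorted paths under root that are symlinks to a missing target."""
    dangling = []
    # followlinks=False: do not descend into symlinked directories.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself, while exists() follows it,
            # so a link that fails exists() points at a missing target.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)
```

Calling `find_dangling_symlinks` on the `_solib_k8` directory shown above would be expected to list entries such as the `libBar.so.5` link if its target has not been materialized.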

Which category does this issue belong to?

Local Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

It's not clear how to reproduce this issue with ease. Attempts to manually induce this state on a local machine in a controlled manner have been unsuccessful.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.2.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@fmeum
Collaborator

fmeum commented Jun 24, 2024

What does FOO look like here? In particular, does it contain any ~ characters?

bazel-out/ubuntu22-fastbuild/bin/external/FOO/external/FOO/lib/modified_runpath/libBar.so.5 also looks very weird as it has two external segments. Is that what you see in the cache entry or is it a result of redacting information?

@quic-sbjorkle
Author

quic-sbjorkle commented Jun 24, 2024

@fmeum, no. Just [A-Z0-9_]:
bazel-out/ubuntu22-fastbuild/bin/_solib_k8/_U@@CLibs_UGt_Ulinux_U64_S_S_CCLibs_UGt_Ucc_Ulibrary___Uexternal_SCLibs_UGt_Ulinux_U64_Sexternal_SCLibs_UGt_Ulinux_U64_Slib_Smodified_Urunpath/libBar.so.5

You are correct with the observation of 2 external/ directories:
333424, bazel-out/ubuntu22-fastbuild/bin/external/CLibs_Gt_linux_64/external/CLibs_Gt_linux_64/lib/modified_runpath/libBar.so.5
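
When scanning a decoded action cache for more entries of this shape, a small pattern check can flag paths in which the same external/<repo>/ segment appears twice in a row. This is a sketch under the assumption that such a repetition is suspicious; the helper name is hypothetical, and a match is only a hint, since a repository could legitimately contain a subdirectory named external.

```python
import re

# Matches "external/<repo>/" immediately followed by the same "external/<repo>/".
_REPEATED_EXTERNAL = re.compile(r"(?:^|/)(external/[^/]+/)\1")


def has_repeated_external_segment(exec_path):
    """Return True if exec_path repeats the same external/<repo>/ segment back to back."""
    return _REPEATED_EXTERNAL.search(exec_path) is not None
```

Applied to the entry above, the check matches; applied to the same path with a single external/CLibs_Gt_linux_64/ segment, it does not.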

@fmeum
Collaborator

fmeum commented Jun 24, 2024

@justinhorvitz Not sure who to ask about this, but do you happen to have an idea how such corrupted paths could end up in the action cache?

@fmeum
Collaborator

fmeum commented Jun 24, 2024

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Jun 24, 2024
@iancha1992
Member

@bazel-io fork 7.3.0

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Jun 24, 2024
@justinhorvitz
Contributor

The action cache indexes the path strings of outputs' exec paths. I'm not sure why you've concluded that there is action cache corruption? Is the correct output path bin/external/CLibs_Gt_linux_64/lib/modified_runpath/libBar.so.5 and not bin/external/CLibs_Gt_linux_64/external/CLibs_Gt_linux_64/lib/modified_runpath/libBar.so.5?

@fmeum
Collaborator

fmeum commented Jun 24, 2024

> The action cache indexes the path strings of outputs' exec paths. I'm not sure why you've concluded that there is action cache corruption? Is the correct output path bin/external/CLibs_Gt_linux_64/lib/modified_runpath/libBar.so.5 and not bin/external/CLibs_Gt_linux_64/external/CLibs_Gt_linux_64/lib/modified_runpath/libBar.so.5?

Yes, I don't know of any source of the latter type of path that wouldn't be a bug. I don't know whether this comes from action cache logic though.

@justinhorvitz
Contributor

justinhorvitz commented Jun 24, 2024

I'm afraid I can't be of too much help here. Is it at all possible that multiple concurrent builds are writing to the same action cache (wild guess)?

If it could be reproduced, I would try to see if I could hit a breakpoint where the questionable path string is being written to the action cache. It would be in PersistentStringIndexer#getOrCreate, or perhaps it's being loaded from disk incorrectly, which would be when instantiating the PersistentIndexMap.

@quic-sbjorkle
Author

We don’t have concurrent builds on these machines, so multiple simultaneous writes to the action cache are not possible.
It also happens fairly frequently on multiple machines in the cluster. Maybe 1 in 50 builds will end up in this corrupt state when --enable_bzlmod is toggled, but it will be tricky to catch with a debugger.
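
To put the "1 in 50" estimate in perspective for planning a logging or debugging run, the sketch below computes how many toggled builds would be needed to see at least one failure with a given probability. It assumes each build fails independently at the same rate, which may not hold here; the function name is hypothetical.

```python
import math


def builds_needed(failure_rate, confidence):
    """Smallest n with 1 - (1 - failure_rate)**n >= confidence.

    Assumes independent builds with a constant failure rate.
    """
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))
```

Under these assumptions, `builds_needed(1 / 50, 0.95)` evaluates to 149, which suggests that collecting logs across many CI runs is more practical than waiting at a breakpoint.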

@justinhorvitz
Contributor

Pinging @Wyverald since this seems related to Bzlmod. I do not have any further ideas beyond adding more verbose logging to try to gather data.

@zhengwei143 zhengwei143 added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. and removed team-Local-Exec Issues and PRs for the Execution (Local) team labels Jul 2, 2024
@Wyverald
Member

Wyverald commented Jul 8, 2024

cc @tjgq in case you have any observations.

I have no idea how enabling Bzlmod could make a difference here, except for maybe the presence of the ~ character in paths.

@quic-sbjorkle, if you're able to specify a custom Bazel version on your CI (using Bazelisk, for example), I could create a custom build of Bazel that includes extra logging. (That is, if anyone has an idea of what sort of extra logging would help here.)

@Wyverald Wyverald added P2 We'll consider working on this in future. (Assignee optional) and removed untriaged labels Jul 8, 2024
@Wyverald Wyverald added the area-Bzlmod Bzlmod-specific PRs, issues, and feature requests label Jul 8, 2024
@Gormo

Gormo commented Jul 9, 2024

@Wyverald Yes, we can specify a custom Bazel to be used in CI, so if you have some logging points we can add them as a patch. However, @quic-sbjorkle is on vacation until the beginning of August, so we can't do it until he's back.

@amit-mittal

We are also seeing Bazel flakiness on our Linux agents since upgrading from Bazel 6 to Bazel 7.2.1. We aren't using Bzlmod yet.

Go builds would fail with the below error:

ERROR: /....: output ..../_virtual_imports/docker_proto.withgogoimport/.../docker.proto' is a dangling symbolic link
ERROR: ....: Symlinking virtual .proto sources for .... failed: not all outputs were created or valid
Use --verbose_failures to see the command lines of failed build steps.
ERROR: .... ~testmain.a failed: not all outputs were created or valid

Javascript builds would fail with the below error:

[webpack-cli] Error: Cannot find module 'source-map'
     Require stack:
     - .cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-fastbuild/bin/node_modules/css-minimizer-webpack-plugin/dist/index.js
     - .cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-fastbuild/bin/tools/js/webpack.config.js
     - .cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-opt-exec-ST-a828a81199fe/bin/tools/js/webpack/webpack_js_binary.sh.runfiles/yext/node_modules/.aspect_rules_js/[email protected]_webpack_5.76.3/node_modules/webpack-cli/lib/webpack-cli.js
     - .cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-opt-exec-ST-a828a81199fe/bin/tools/js/webpack/webpack_js_binary.sh.runfiles/yext/node_modules/.aspect_rules_js/[email protected]_webpack_5.76.3/node_modules/webpack-cli/lib/bootstrap.js
     - .cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-opt-exec-ST-a828a81199fe/bin/tools/js/webpack/webpack_js_binary.sh.runfiles/yext/node_modules/.aspect_rules_js/[email protected]_webpack_5.76.3/node_modules/webpack-cli/bin/cli.js
     - .cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-opt-exec-ST-a828a81199fe/bin/tools/js/webpack/webpack_js_binary.sh.runfiles/yext/node_modules/.aspect_rules_js/[email protected]_webpack-cli_4.4.0/node_modules/webpack/bin/webpack.js
         at Module._resolveFilename (node:internal/modules/cjs/loader:1140:15)
         at Module._load (node:internal/modules/cjs/loader:981:27)
         at Module.require (node:internal/modules/cjs/loader:1231:19)
         at require (.cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-opt-exec-ST-a828a81199fe/bin/tools/js/webpack/webpack_js_binary.sh.runfiles/yext/node_modules/.aspect_rules_js/[email protected]/node_modules/v8-compile-cache/v8-compile-cache.js:159:20)
         at Object.<anonymous> (.cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-fastbuild/bin/node_modules/css-minimizer-webpack-plugin/dist/index.js:7:5)
         at Module._compile (.cache/bazel/.../523ecd3a78772af5fd843de01aa3ce97/sandbox/linux-sandbox/13411/execroot/yext/bazel-out/k8-opt-exec-ST-a828a81199fe/bin/tools/js/webpack/webpack_js_binary.sh.runfiles/yext/node_modules/.aspect_rules_js/[email protected]/node_modules/v8-compile-cache/v8-compile-cache.js:192:30)
         at Module._extensions..js (node:internal/modules/cjs/loader:1422:10)
         at Module.load (node:internal/modules/cjs/loader:1203:32)
         at Module._load (node:internal/modules/cjs/loader:1019:12)
         at Module.require (node:internal/modules/cjs/loader:1231:19) {
       code: 'MODULE_NOT_FOUND',
       requireStack: [
         ...
       ]
     }

Rerunning the builds typically fixes the issue. If it is relevant, we also have the --noallow_unresolved_symlinks and --experimental_inprocess_symlink_creation flags set in our .bazelrc.

@meteorcloudy
Member

Ping @Wyverald

@quic-sbjorkle
Author

@Wyverald, I'm back from vacation now, so if you have any proposed log points I can deploy a patched Bazel and collect those logs in our environment.

@Wyverald
Member

Wyverald commented Aug 5, 2024

Sorry, but I'm really not familiar enough with local action execution to even know where to insert log statements.

@fmeum, @tjgq, @zhengwei143 -- any pointers here? Where could we add log statements to help debug this?

@adamscybot

adamscybot commented Aug 6, 2024

> Rerunning the builds typically fixes the issue. If it is relevant, we also have the --noallow_unresolved_symlinks and --experimental_inprocess_symlink_creation flags set in our .bazelrc.

@amit-mittal Just want to point out the Webpack one could be to do with aspect-build/rules_js#1877 if you are using rules_js.

It would be best for everyone to check that --experimental_inprocess_symlink_creation is not enabled, in order to disambiguate these issues. It is currently unclear whether this flag negatively affects other rules when used with Bazel 7.
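
As a quick first pass before inspecting --announce_rc output, a text scan of a .bazelrc can show whether a boolean flag is switched on. The sketch below is a simplification and not how Bazel itself parses rc files: it ignores import statements, --config sections, and per-command scoping, treats everything after # as a comment, and lets the last mention win. The function name is hypothetical.

```python
def flag_enabled_in_bazelrc(bazelrc_text, flag_name):
    """Return True if the last mention of flag_name in bazelrc_text turns it on.

    Handles --flag, --noflag, and --flag=<value>; text after '#' is ignored.
    """
    enabled = False
    for raw_line in bazelrc_text.splitlines():
        line = raw_line.split("#", 1)[0]
        for token in line.split():
            if token == "--" + flag_name:
                enabled = True
            elif token == "--no" + flag_name:
                enabled = False
            elif token.startswith("--" + flag_name + "="):
                value = token.split("=", 1)[1].lower()
                # Assumed set of truthy spellings for boolean flags.
                enabled = value in ("1", "true", "yes", "t", "y")
    return enabled
```

For example, passing the contents of a .bazelrc and the name `experimental_inprocess_symlink_creation` indicates whether that file enables the flag; the effective value for a given invocation is still best confirmed with --announce_rc.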

@fmeum
Collaborator

fmeum commented Aug 6, 2024

@Wyverald Places to log could be (based on the release-7.2.0 branch):

fp.addPath(symlink.getExecPath());
fp.addPath(getPrimaryInput().getExecPath());
(both properties)

public void put(String key, ActionCache.Entry entry) {
(entry has toString implemented)

@quic-sbjorkle Could you also provide the output of --announce_rc and the command-line flags for both CI runs, with and without Bzlmod enabled?

@fmeum
Collaborator

fmeum commented Aug 6, 2024

@quic-sbjorkle Could you also share the definitions of the targets producing @CLibs_Gt_linux_U64//:CLibs_Gt_cc_library and @CLibs_Gt_linux_U64//lib/modified_runpath:libBar.so.5? Does that repo have any subdirectories called external?
