"dangling symbolic link" flakes when switching from --noenable_bzlmod -> --enable_bzlmod #22867
Comments
What does
|
@fmeum, no, just [A-Z0-9_]. You are correct in observing 2 external/ directories: |
@justinhorvitz Not sure who to ask about this, but do you happen to have an idea how such corrupted paths could end up in the action cache? |
@bazel-io flag |
@bazel-io fork 7.3.0 |
The action cache indexes the path strings of outputs' exec paths. I'm not sure why you've concluded that there is action cache corruption? Is the correct output path |
Yes, I don't know of any source of the latter type of path that wouldn't be a bug. I don't know whether this comes from action cache logic though. |
I'm afraid I can't be of too much help here. Is it at all possible that multiple concurrent builds are writing to the same action cache (wild guess)? If it could be reproduced, I would try to see if I could hit a breakpoint where the questionable path string is being written to the action cache. It would be in |
We don’t have concurrent builds on these machines, so multiple simultaneous writes to the action cache are not possible. |
@Wyverald since this seems related to bzlmod. I do not have any further ideas beyond adding more verbose logging to try and gather data. |
cc @tjgq in case you have any observations. I have no idea how enabling Bzlmod could make a difference here, except for maybe the presence of the @quic-sbjorkle, if you're able to specify a custom Bazel version on your CI (using Bazelisk for example), I could create a custom build of Bazel that includes extra logging. (That is, if anyone has an idea what sort of extra logging would help here.) |
@Wyverald Yes we can specify a custom bazel to be used in CI so if you have some logging points we can add that as a patch. However @quic-sbjorkle is on vacation until beginning of August so we can't do it until he's back. |
We are also seeing the bazel flakiness since upgrading to bazel 7.2.1 from bazel 6 on our Linux agents. We haven't
Go builds would fail with the below error:
Javascript builds would fail with the below error:
Rerunning the builds typically fixes the issue. If it is relevant, we also have |
Ping @Wyverald |
@Wyverald, I'm back from vacation now, so if you have any proposed log points I can deploy a patched bazel and collect those logs in our environment. |
Sorry, but I'm really not familiar enough with local action execution to even know where to insert log statements. @fmeum, @tjgq, @zhengwei143 -- any pointers here? Where could we add log statements to help debug this? |
@amit-mittal Just want to point out that the Webpack one could be related to aspect-build/rules_js#1877 if you are using rules_js. Best everyone checks |
@Wyverald Places to log could be (based on release-7.2.0 branch):
bazel/src/main/java/com/google/devtools/build/lib/rules/cpp/SolibSymlinkAction.java, Lines 103 to 104 in 5b546af
bazel/src/main/java/com/google/devtools/build/lib/actions/cache/CompactPersistentActionCache.java, Line 359 in 49a9502
toString implemented)
bazel/src/main/java/com/google/devtools/build/lib/actions/cache/CompactPersistentActionCache.java, Line 349 in 49a9502
@quic-sbjorkle Could you also provide the output of |
@quic-sbjorkle Could you also share the definitions of the targets producing |
Description of the bug:
Switching from --noenable_bzlmod to the default --enable_bzlmod occasionally causes builds to flake, resulting in "dangling symbolic link" errors. Although this issue was reported in #20886 and supposedly fixed in commit 52adf0b, the problem persists.
Error message from build:
This error occurs quite often when running in CI, though it happens less often on local development machines. It appears to be linked to the local action cache becoming corrupted somehow when toggling bzlmod_enabled. Once the error manifests on a node, it persists on that node until the action cache is cleared or a ./bazel clean is executed.
Below are extracted parts from a decoded action cache:
In index:
bazel-out/ubuntu22-fastbuild/bin/_solib_k8/_U@@FOOcc_Ulibrary___Uexternal_SFOO_Sexternal_SFOO_Slib_Smodified_Urunpath/libBar.so.5 <==> 333425 <-- Missing entry
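For anyone wanting to check their own decoded action cache for similar entries, here is a minimal sketch in Python. It is not part of Bazel. It assumes index entries have the `<path> <==> <id>` shape shown above, and that `_S` is the escaped form of `/` inside `_solib_*` directory names, so a path containing `external_S` more than once has a repeated external/ segment. The function name `suspicious_index_entries` is hypothetical.

```python
import re

# Assumption: index entries in the decoded dump look like "<path> <==> <id>",
# optionally followed by an annotation, as in the excerpt above.
INDEX_LINE = re.compile(r"^(?P<path>\S+) <==> (?P<id>\d+)")

# Assumption: "_S" is the escaped form of "/" inside _solib_* directory names,
# so "external_S" marks one "external/" path segment.
ESCAPED_EXTERNAL = "external_S"


def suspicious_index_entries(dump_text):
    """Return (path, id) pairs whose path repeats the escaped external/ segment."""
    hits = []
    for line in dump_text.splitlines():
        match = INDEX_LINE.match(line.strip())
        if match is None:
            continue  # not an index entry
        path = match.group("path")
        if path.count(ESCAPED_EXTERNAL) > 1:
            hits.append((path, int(match.group("id"))))
    return hits
```

Running it over the full decoded dump lists every index entry with the doubled segment, which may help tell whether the problem is limited to one target or affects several.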
A known workaround is to use the flag --nouse_action_cache.
Which category does this issue belong to?
Local Execution
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
It's not clear how to reproduce this issue with ease. Attempts to manually induce this state on a local machine in a controlled manner have been unsuccessful.
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
release 7.2.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD?
No response
If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response