--experimental_remote_cache_eviction_retries doesn't evict all metadata causing build failure
#18696
Comments
That's interesting. Maybe I am wrong, but I don't think the code for cleaning Bazel's internal state depends on the specific action, i.e. if it works for other actions, it should work for
For the missing artifact, how much time passed between when it was generated and when Bazel requested to download it? And how long does one invocation of this build usually take? Is it possible to collect server-side logs of the access patterns for this particular blob, e.g. create time, refresh time, delete time, etc.? |
I don't think it matters? The first build fails with missing cache item, and it rebuilds the whole build, and the input shouldn't be assumed to exist, so the action that generates the missing item should be re-run. That's not happening. Instead the dependent action is simply run again (which is guaranteed to fail). |
The generating action is expected to be re-run. Do you have an execution log for the generating action from the second invocation? Otherwise, it might be the case that the file was evicted again before Bazel requested to download it. |
Patrick should be able to provide that. The build log shows only 2 actions though (status.txt and the dependent action), in the gRPC log for the second run there isn't any upload (or any calls really) for that blob/action, and it was part of the findMissingBlobs call, so I assume that means Bazel knows that it wasn't generated locally. |
This looks like a Bazel bug to me. Probably the state was not cleaned due to race conditions. |
@coeuvre I sent you a gRPC log that I had from the second invocation that caused the build to fail. I hope it's useful. I can add some logging for collecting an execution log, and hope to catch it again in the next few days. |
5 was picked randomly for --experimental_remote_cache_eviction_retries in the integration test. While it works perfectly for testing the rewinding, it cannot report transient errors (e.g. #18696) because the next retry will probably fix them. This CL changes it to 1. Also add an assert to check that the invocation ids from the second attempt are different, to catch #18694. PiperOrigin-RevId: 541855893 Change-Id: I5f07cc1ed91a328454ed4949a04ff1acf6fa98b7
@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you. |
cc: @coeuvre for visibility |
IIRC, the log @BalestraPatrick shared with me didn't reveal the root cause. How frequent is the issue now in Bazel 7? Can you capture the logs again given that the code has changed a lot in Bazel 7? |
We are also experiencing this with Bazel 7, but it is pretty hard to reproduce. I have not yet managed to reproduce it in a way that lets me troubleshoot it. |
We're hitting the same issue too, and often enough that BwoB is not usable for our (local) development environment. We're using Bazel 7. Can this be prioritized? |
It's worth trying whether this is fixed by the upcoming release 7.2 which includes eda0fe4. |
7.2.0rc2 does not fix the issue for us. |
I think it would be good to change the spec from SHOULD to MUST in that case, or at least explain the consequences. Bazel is by far the most used client of this spec. |
This doesn't really vibe with https://bazel.build/remote/caching, which
Even if b were not on the web page, it's pretty much implicit in option a: if you bring an HTTP cache, you need to clean it up, because you can't have unbounded growth of your cache. So what do you do? Expire the objects that have lived the longest, or use some sort of LRU if your cache is fancier. Bazel ought to be able to handle both, or otherwise it needs to be explicitly stated that these strategies are not supported and that you need to defer to special Bazel cache software instead. |
@coeuvre I'm not sure this is 100% necessary - if using a HTTP cache like |
I will revert the change and explore other possible solutions. |
Will the revert (b899783) be cherry-picked to 7.4.0 release? |
The original commit is not cherry-picked to 7.4.0 so cherry-pick for the revert is not necessary. |
@bazel-io fork 7.4.0 |
Previously, it was possible for an HTTP cache to delete CAS entries referenced by an AC entry without deleting the AC entry itself, because an HTTP cache doesn't understand the relationship between AC and CAS. This could result in permanent build errors because Bazel always trusts the AC from the remote cache, assuming all referenced CAS entries exist. Now, we record the digests of lost blobs before build rewinding, so that during the next build, we can ignore the stale AC and continue with execution. Fixes bazelbuild#18696. RELNOTES: Added support for using a remote cache that evicts blobs and doesn't have AC integrity check (e.g. HTTP cache). PiperOrigin-RevId: 672536163 Change-Id: Ic1271431d28333f6d86e5963542d15a133075157
Previously, it was possible for an HTTP cache to delete CAS entries referenced by an AC entry without deleting the AC entry itself, because an HTTP cache doesn't understand the relationship between AC and CAS. This could result in permanent build errors because Bazel always trusts the AC from the remote cache, assuming all referenced CAS entries exist. Now, we record the digests of lost blobs before build rewinding, so that during the next build, we can ignore the stale AC and continue with execution. Fixes #18696. RELNOTES: Added support for using a remote cache that evicts blobs and doesn't have AC integrity check (e.g. HTTP cache). PiperOrigin-RevId: 672536163 Change-Id: Ic1271431d28333f6d86e5963542d15a133075157 Commit 5d81579 Co-authored-by: Googler <[email protected]>
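To make the failure mode in the commit message concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the class names, the dictionary-based AC/CAS, and the string digests are all hypothetical and are not Bazel's actual data structures or code. It shows a cache that evicts a CAS blob while keeping the AC entry, and a client that records the lost digest so the next attempt ignores the stale AC entry and re-executes the action.

```python
class ToyRemoteCache:
    """Toy cache (hypothetical): AC maps action keys to output digests,
    CAS maps digests to blob contents."""

    def __init__(self):
        self.ac = {}
        self.cas = {}

    def evict_blob(self, digest):
        # An HTTP cache may drop a CAS entry while keeping the AC entry.
        self.cas.pop(digest, None)


class ToyClient:
    """Toy client (hypothetical) that remembers lost digests across attempts."""

    def __init__(self, cache):
        self.cache = cache
        self.lost_digests = set()  # digests recorded when a blob was missing
        self.executions = 0

    def _execute(self, action_key):
        # Pretend to run the action and upload its output and AC entry.
        self.executions += 1
        digest = "digest-of-" + action_key
        self.cache.cas[digest] = "output-of-" + action_key
        self.cache.ac[action_key] = digest
        return digest

    def build(self, action_key):
        digest = self.cache.ac.get(action_key)
        if digest is None or digest in self.lost_digests:
            # No AC hit, or the AC entry is known to be stale: re-execute.
            digest = self._execute(action_key)
            self.lost_digests.discard(digest)
        blob = self.cache.cas.get(digest)
        if blob is None:
            # Record the loss so the next attempt ignores the stale AC entry.
            self.lost_digests.add(digest)
            raise LookupError("lost blob: " + digest)
        return blob
```

In this model, a client without the `lost_digests` set would keep trusting the AC entry and fail on every attempt, which matches the "permanent build errors" described above.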
A fix for this issue has been included in Bazel 7.4.0 RC1. Please test out the release candidate and report any issues as soon as possible. |
I am testing 7.4.0 RC2. I will update here if I observe build failures. If I don't say anything, consider it a validation that the fix works as expected 😄 |
👋 Using
My configuration is:
The thing is: I only see the error once in the logs, but I would expect to see it 5 times due to the default value of |
Since you mentioned the error only appeared once in the log, does it mean the build completed successfully after retries? If not, what's the error message? |
Nope, the build failed after the |
Argh, I think |
Aha, you are right, that explains it. Thanks! I will set it and continue monitoring 🤞 |
👋 Did not see the error again since last comment. Thanks! 🙏 |
Description of the bug:
Hello!
We have used --experimental_remote_cache_eviction_retries=1 with --remote_download_toplevel during the last few weeks, and we noticed from our data that in some cases, when Bazel exits with exit code 39, the following invocation will fail with the exact same error. This is not always the case (we have seen it correctly recover in some situations), but for one specific CppArchive action, we see it failing in about 1% of our builds.
The logs look like the following:
The only way to recover from this failure seems to be to run bazel clean or disable BwtB (--remote_download_toplevel).
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
In the gRPC log for the second invocation, we see the following:
Which operating system are you running Bazel on?
macOS
What is the output of bazel info release?
6.2.0 @ 286306e
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response
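For anyone who needs a stopgap, the workaround from the description (run bazel clean after a build exits with code 39, then build again) could be scripted roughly as below. This is only a sketch: the function name, the injectable `run` parameter, and the `//foo` target in the usage note are hypothetical, and the flags are simply the two quoted in the description.

```python
import subprocess

# Exit code reported in the description when the build fails on a missing blob.
REMOTE_CACHE_EVICTED = 39


def build_with_clean_fallback(targets, run=subprocess.call):
    """Run `bazel build`; on exit code 39, run `bazel clean` and retry once.

    `run` takes an argument list and returns an exit code. It defaults to
    subprocess.call and is injectable so the logic can be tested without Bazel.
    """
    flags = [
        "--experimental_remote_cache_eviction_retries=1",
        "--remote_download_toplevel",
    ]
    cmd = ["bazel", "build", *flags, *targets]
    code = run(cmd)
    if code == REMOTE_CACHE_EVICTED:
        # Drop local state that still points at the evicted blob, then retry.
        run(["bazel", "clean"])
        code = run(cmd)
    return code
```

A call such as `build_with_clean_fallback(["//foo"])` returns the final exit code. Note that bazel clean discards local outputs, so the retry is a full rebuild.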