Bazel CI: Android Testing is failing on Windows with Bazel@HEAD #6847
FYI @buchgr
@meteorcloudy, how do I find the commit that Bazel was built at? I cannot repro this locally with Bazel built at f2b3bec. Also, the output root on CI is under …
Never mind, found it in the "Build Bazel (Windows)" task on the same page!
Great, we explicitly set the output user root to …
Oh, you're right! I was looking at the "Bazel Info" part of the log. That does not use the …
I see, but it would be better if we used …
I tried to rerun the job in Culprit Finder. Looks like it's a flaky error; d235a06 is certainly not the culprit. Sorry for the noise here.
No worries! Thanks for looking again!
Are P1 and Team-Windows adequate?
Still need to look into what caused it; it can be reliably reproduced in the downstream builds.
As a side note, it would help with reproducing if some step (maybe "Bazel Info") printed the client environment. Is that possible?
Yes, that's doable!
A similar failure is happening to TensorFlow: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/679#0910f88b-7b15-49d8-97c6-6d73f3aa45d6
Let me take a look!
Add the --[no]experimental_header_validation_debug flag. This flag tells Bazel to print extra debugging information when a C++ compilation action fails header inclusion validation. We will enable this flag on BuildKite in hopes of catching the culprit of bazelbuild#6847. After we find the culprit, this commit should be reverted and the --experimental_header_validation_debug flag should be removed from BuildKite's configuration. See bazelbuild#6847
I noticed something in https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/679#71068d7e-b147-46c6-a7ea-dcf6ddb97425 that I didn't before:
The error comes from …
Hmmm, a wild guess: could this be a poisoned cache entry? Header validation on Windows works by asking the compiler to write the list of included files (via the … flag). I haven't yet figured out how, but maybe the compiler's output from an earlier build (with a different output root) was somehow written to a file in the output cache, and incorrectly retrieved for header validation?

A cache issue would also explain why this issue is flaky, as pointed out by @meteorcloudy here: #6847 (comment)
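The mechanism under suspicion can be sketched roughly like this. This is a minimal illustration, not Bazel's actual code (the real filter is the Java `ShowIncludesFilter`); the `Note: including file:` prefix is what MSVC emits for English locales, and the helper name is invented:

```python
# Hypothetical sketch of /showIncludes-style header discovery.
PREFIX = "Note: including file:"

def discover_headers(compiler_output: str, execroot: str) -> list[str]:
    """Strip the absolute execroot prefix from each reported header path."""
    deps = []
    for line in compiler_output.splitlines():
        if not line.startswith(PREFIX):
            continue
        path = line[len(PREFIX):].strip()
        if path.lower().startswith(execroot.lower()):
            # In-workspace header: record it relative to the execroot.
            deps.append(path[len(execroot):].lstrip("\\"))
        else:
            # Outside the execroot (e.g. a system header): keep as-is.
            deps.append(path)
    return deps
```

Under the cache-poisoning theory, a replayed cached output embeds absolute paths recorded under a *different* execroot, so the `startswith` check above fails and the absolute path leaks through to validation, which would match the flaky failures.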
More evidence for the cache poisoning: In the previous build of "Android Testing" on Windows (678), the build failed with a similar error (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/678#515c01ce-c6a1-4f9e-989f-ec4aef899b24):
The output base was … In the previous build before that (677) (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/677#9a4f231f-c38a-4b6a-a691-38be5ec7768b), "Android Testing" on Windows built successfully. Guess what the output base was? It was …

My theory is that build 677 inserted a cache entry that builds 678 and 679 incorrectly retrieved, and they failed header validation because they tried stripping the wrong execroot prefix.
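The mismatch theory can be demonstrated with two made-up output bases (the real hashes are elided in the logs quoted above; these values are invented purely for illustration):

```python
# Invented paths for illustration; the real output-base hashes are
# elided in the quoted logs.
producer_execroot = "C:\\b\\aaaa1111\\execroot\\ws"   # e.g. build 677
consumer_execroot = "C:\\b\\bbbb2222\\execroot\\ws"   # e.g. builds 678/679

# The cached compiler output embeds the *producer's* execroot:
cached_path = producer_execroot + "\\foo\\bar.h"

# The consumer tries to strip *its own* execroot prefix, which fails,
# so the absolute path survives and header validation rejects it as
# an undeclared inclusion.
prefix_matches = cached_path.startswith(consumer_execroot)
print(prefix_matches)  # False
```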
@lberki told me it's quite possible that Bazel stores the action's stdout in the remote cache, because that's how Bazel can replay the action's output if it was cached. The latest "Android Testing" build as of now is 689 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/689#56cc7a37-3397-4d8a-b06b-ea7bb02980fe) and it's failing just like the others, only with different paths:
This can be traced back to build 682 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/682#98aa10ac-cbc6-43de-9aea-dd9cda8ce0a4), which also failed (which I find strange, because how did it insert a cache entry then?). Or maybe the output root …

Build 681 also used …

Build 680 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/680#905b6889-1c55-43a4-af02-22d19f9b4e33) used the …
@buchgr, @meteorcloudy: …
@laszlocsomor Thanks for looking into this! Your conclusion is very plausible; I suspect it's caused by cbacbb9. @ulfjack confirmed we do cache the action's outErr in the remote cache. But we haven't understood why this wasn't happening before.
I believe I have understood the underlying problem. cbacbb9 actually revealed a header-checking bug when using remote execution. After cbacbb9, we store the output to a file first, so whether or not the remote cache is hit, we always parse the original output. Because the output can contain an absolute path (the execroot) from a different machine, we get this problem. I'm working on a fix to remove the execroot prefix from the include file paths in the output. We'll see if that works.
After some investigation, I think filtering the original output before storing it to a file is not possible without applying a filter to the …
Fixes bazelbuild/bazel#6847

This change makes include scanning work on Windows for builds with remote caching or remote execution enabled. After this change, the ShowIncludesFilter will look for the first `execroot\<workspace_name>` in the output header file paths, and consider `C:\...\execroot\<workspace_name>` to be the execroot path. Because the execroot path can differ when the remote cache is hit, we ignore it and only add the relative path as a dependency.

I was quite reluctant to make this change, because parsing `execroot\<workspace_name>` out of a path is not guaranteed to always work. But the only case where this could go wrong is when people use an output base that already contains `execroot\<workspace_name>`, which should essentially never happen.

Closes #6931.

Change-Id: Ife2cb91c75f1b5b297851400e672db2b35ff09e0
PiperOrigin-RevId: 225553627
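The approach the commit describes can be sketched as follows. This is assumed logic in Python, not the actual Java ShowIncludesFilter; the function name is invented. The idea: find the first `execroot\<workspace_name>` segment and keep only the remainder as the dependency path, regardless of which output base produced it:

```python
def to_workspace_relative(path: str, workspace_name: str) -> str:
    """Return the part of `path` after the first execroot\\<workspace_name>
    segment, or the original path if no such segment exists."""
    marker = "execroot\\" + workspace_name + "\\"
    idx = path.find(marker)
    if idx == -1:
        return path  # e.g. a system header outside any execroot
    return path[idx + len(marker):]
```

Because only the relative part is recorded, it no longer matters which machine's (or which output base's) execroot appears in the cached output. The remaining failure mode, as the commit message notes, is an output base that itself contains `execroot\<workspace_name>`.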
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/649#8425be19-6d12-4ec4-934b-228d76be474c
Culprit Finder says d235a06 is the culprit. @laszlocsomor