Bazel CI: Android Testing is failing on Windows with Bazel@HEAD #6847

meteorcloudy · 2018-12-05T10:46:55Z

https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/649#8425be19-6d12-4ec4-934b-228d76be474c

ERROR: D:/b/ny24af4c/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/file-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/file-jni.cc':
  | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

~~Culprit Finder says d235a06 is the culprit~~

@laszlocsomor

meteorcloudy · 2018-12-05T10:47:04Z

FYI @buchgr

laszlocsomor · 2018-12-05T12:57:53Z

@meteorcloudy, how do I find the commit that Bazel was built at? I cannot repro this locally with Bazel built at f2b3bec.

Also, the output root on CI is under C:/windows/system32/config/systemprofile -- that doesn't look good.

laszlocsomor · 2018-12-05T13:03:23Z

Never mind, found it in the "Build Bazel (Windows)" task on the same page!

meteorcloudy · 2018-12-05T13:05:24Z

Great, we explicitly set the output user root to D:/b, so it shouldn't be C:/windows/system32/config/systemprofile
https://github.com/bazelbuild/continuous-integration/blob/master/buildkite/bazelci.py#L856

laszlocsomor · 2018-12-05T13:06:37Z

Oh, you're right! I was looking at the "Bazel Info" part of the log. That does not use the --output_user_root flag.

meteorcloudy · 2018-12-05T13:09:49Z

I see, but it would be better to also use the --output_user_root flag for bazel info. I'll send a change.

meteorcloudy · 2018-12-05T13:34:06Z

I reran the job in Culprit Finder:
https://buildkite.com/bazel/culprit-finder/builds/41#b24d1754-a989-47f4-86e9-0345e771560a
This time it says everything is passing.

Looks like it's a flaky error; d235a06 is certainly not the culprit. Sorry for the noise here.

laszlocsomor · 2018-12-05T13:34:48Z

No worries! Thanks for looking again!

laszlocsomor · 2018-12-05T13:35:24Z

Are P1 and Team-Windows adequate?

meteorcloudy · 2018-12-05T13:37:59Z

We still need to look into what caused it; it can be reproduced reliably downstream:
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/649#1064573d-b8f3-4b7a-be4c-cf6467da4312

laszlocsomor · 2018-12-05T13:39:51Z

As a side-note, it would help repro-ing if some step (maybe "Bazel Info") printed the client environment. Is that possible?

meteorcloudy · 2018-12-05T13:41:16Z

Yes, that's doable!

meteorcloudy · 2018-12-07T13:22:02Z

A similar failure is happening in TensorFlow: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/679#0910f88b-7b15-49d8-97c6-6d73f3aa45d6

laszlocsomor · 2018-12-07T13:33:21Z

Let me take a look!

Add the --[no]experimental_header_validation_debug flag.

This flag tells Bazel to print extra debugging information when a C++ compilation action fails header inclusion validation.

We will enable this flag on BuildKite in hopes of catching the culprit of bazelbuild#6847. After we find the culprit, this commit should be reverted and the --experimental_header_validation_debug flag from BuildKite's configuration should be removed.

See bazelbuild#6847

laszlocsomor · 2018-12-10T08:42:21Z

I noticed something in https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/679#71068d7e-b147-46c6-a7ea-dcf6ddb97425 that I didn't before:

ERROR: D:/b/fkulpsov/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/file-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/file-jni.cc':
  | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

The error comes from D:/b/fkulpsov/... but it complains about an inclusion from D:/b/hh7flgnl/.... I have no idea what hh7flgnl is or where it comes from -- the output base (according to the "Bazel Info" step) is D:/b/fkulpsov.

laszlocsomor · 2018-12-10T08:48:55Z

Hmmm, a wild guess: could this be a poisoned cache entry?

Header validation on Windows works by asking the compiler to write the list of included files (via the /showIncludes flag), then validating the list against the allowed list of files. The VC++ compiler, however, outputs absolute paths, and Bazel strips known prefixes from these paths (e.g. the execution root) to get relative paths.

I haven't yet figured out how, but maybe the compiler's output from an earlier build (with a different output root) was somehow written to file in the output cache, and incorrectly retrieved for header validation?

A cache issue would also explain why this issue is flaky, as pointed out by @meteorcloudy here: #6847 (comment)
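
To make the mechanism described above concrete, here is a minimal Python sketch of the two steps: collecting the paths the compiler prints for /showIncludes, and stripping a known execroot prefix. It is an illustration only, not Bazel's implementation. The function names are hypothetical, and it assumes the English compiler output prefix `Note: including file:`.

```python
# Illustrative sketch only; not Bazel's actual code. Function names are hypothetical.
# Assumes the English MSVC prefix that /showIncludes prints before each header path.
PREFIX = "Note: including file:"


def parse_show_includes(compiler_output):
    """Return the header paths listed in /showIncludes-style output."""
    headers = []
    for line in compiler_output.splitlines():
        if line.startswith(PREFIX):
            # Nested includes are indented after the colon, so strip whitespace.
            headers.append(line[len(PREFIX):].strip())
    return headers


def to_relative(path, execroot):
    """Strip the execroot prefix if present; otherwise return the path as-is.

    Both arguments are normalized to forward slashes and compared
    case-insensitively, since Windows paths are case-insensitive.
    """
    norm_path = path.replace("\\", "/")
    norm_root = execroot.replace("\\", "/").rstrip("/") + "/"
    if norm_path.lower().startswith(norm_root.lower()):
        return norm_path[len(norm_root):]
    return norm_path
```

When the execroot passed to `to_relative` does not match the one embedded in the path, the path stays absolute, which is the situation the error messages in this thread show.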

laszlocsomor · 2018-12-10T08:54:54Z

More evidence for the cache poisoning:

In the previous build of "Android Testing" on Windows (678), the build failed with a similar error (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/678#515c01ce-c6a1-4f9e-989f-ec4aef899b24):

ERROR: D:/b/ncxufbua/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/processes-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/processes-jni.cc':
  | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

The output base was D:/b/ncxufbua, and the missing inclusion was again from D:/b/hh7flgnl.

In the previous build before that (677) (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/677#9a4f231f-c38a-4b6a-a691-38be5ec7768b), "Android Testing" on Windows built successfully. Guess what the output base was? It was D:/b/hh7flgnl.

My theory is that build 677 inserted a cache entry that builds 678 and 679 incorrectly retrieved, and failed header validation because they tried stripping the wrong execroot prefix.
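
The theory can be sketched in a few lines of Python. This is a simplified model, not Bazel code: the helper names are hypothetical, and the declared-header set contains only the one header from the error messages above. It shows that include output recorded under one execroot validates cleanly there, but is reported as undeclared when a build with a different execroot tries to strip its own prefix.

```python
# Simplified model of the theory; helper names are hypothetical, not Bazel APIs.
def strip_execroot(path, execroot):
    """Remove the execroot prefix if the path starts with it."""
    root = execroot.rstrip("/") + "/"
    return path[len(root):] if path.startswith(root) else path


def undeclared_inclusions(included_paths, execroot, declared):
    """Return included paths that do not map to a declared relative path."""
    return [p for p in included_paths
            if strip_execroot(p, execroot) not in declared]


HEADER = "bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h"
declared = {HEADER}

# Output as recorded by build 677, whose output base was D:/b/hh7flgnl.
cached_includes = ["D:/b/hh7flgnl/execroot/__main__/" + HEADER]

# Build 677 validating its own output: the prefix matches.
same_root = undeclared_inclusions(
    cached_includes, "D:/b/hh7flgnl/execroot/__main__", declared)

# Build 678 (output base D:/b/ncxufbua) validating the cached output.
other_root = undeclared_inclusions(
    cached_includes, "D:/b/ncxufbua/execroot/__main__", declared)

print("same execroot:", same_root)
print("different execroot:", other_root)
```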

laszlocsomor · 2018-12-10T09:55:21Z

@lberki told me it's quite possible that Bazel stores the action's stdout in the remote cache, because that's how Bazel can replay the action's output if it was cached.

The latest "Android Testing" build as of now is 689 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/689#56cc7a37-3397-4d8a-b06b-ea7bb02980fe) and it's failing just like the others, only with different paths:

ERROR: D:/b/wanmbwio/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/processes-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/processes-jni.cc':
  | 'D:/b/tggxeo2z/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

This can be traced back to build 682 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/682#98aa10ac-cbc6-43de-9aea-dd9cda8ce0a4), which also failed. I find that strange: how did it insert a cache entry then? Or maybe the output root D:/b/tggxeo2z was reused in this build and had already been used in an earlier successful build that inserted this cache entry?

ERROR: D:/b/tggxeo2z/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/jni-util.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/jni-util.cc':
  | 'D:/b/ny24af4c/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

Build 681 also used the D:/b/tggxeo2z output root, and failed with an inclusion ostensibly from D:/b/ny24af4c:

ERROR: D:/b/tggxeo2z/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/jni-util.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
  | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/jni-util.cc':
  | 'D:/b/ny24af4c/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

Build 680 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/680#905b6889-1c55-43a4-af02-22d19f9b4e33) used the D:/b/ny24af4c output root and failed with an undeclared inclusion from D:/b/hh7flgnl, which we already know from builds 678 and 679.

laszlocsomor · 2018-12-10T09:59:17Z

@buchgr, @meteorcloudy:
Do you know:

- whether my theory is plausible, i.e. whether Bazel really stores the compilation action's outErr in the remote cache?
- how we could postprocess the action's outErr to remove the execroot prefix, so the cached file will only have relative paths?
- if it's possible to purge bad entries from the cache, or purge the whole cache that CI uses?

meteorcloudy · 2018-12-13T10:09:15Z

@laszlocsomor Thanks for looking into this! Your conclusion is very plausible, I suspect it's caused by cbacbb9.

@ulfjack confirmed that we do cache the action's outErr in the remote cache. But we don't yet understand why this wasn't happening before.

meteorcloudy · 2018-12-13T13:24:42Z

I believe I now understand the underlying problem. cbacbb9 actually revealed a header checking bug when using remote execution.
Before cbacbb9, we filtered the original output during the actual execution of the compile action, which means the stored action output was already filtered. In later builds, when we replayed the output from the stored file due to a remote cache hit, the output was already filtered, so we no longer captured any header file dependencies.

After cbacbb9, we store the output to a file first, so whether or not the remote cache is hit, we always parse the original output. Because the output can contain an absolute path (execroot) from a different machine, we get this problem.
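
The difference between the two behaviors can be modeled with a small Python sketch. It is a toy model under stated assumptions, not Bazel's code: the cache is a plain dict, the function names are hypothetical, and `Note: including file:` is the assumed English /showIncludes prefix. A cache hit is modeled by calling the function with no fresh compiler output.

```python
# Toy model of the behavior before and after cbacbb9; names are hypothetical.
PREFIX = "Note: including file:"


def filter_includes(raw_output):
    """Split compiler output into (visible text, list of included paths)."""
    visible, includes = [], []
    for line in raw_output.splitlines():
        if line.startswith(PREFIX):
            includes.append(line[len(PREFIX):].strip())
        else:
            visible.append(line)
    return "\n".join(visible), includes


def run_before(raw_output, cache, key):
    """Old behavior: filter while the action runs and store the filtered text.

    On a cache hit the stored text has no include lines left, so no header
    dependencies are captured at all.
    """
    if key in cache:
        return filter_includes(cache[key])[1]
    visible, includes = filter_includes(raw_output)
    cache[key] = visible
    return includes


def run_after(raw_output, cache, key):
    """New behavior: store the raw output first, then always parse the stored text.

    On a cache hit the include lines carry the execroot of whichever machine
    populated the cache.
    """
    if key not in cache:
        cache[key] = raw_output
    return filter_includes(cache[key])[1]


raw = ("file-jni.cc\n"
       "Note: including file: D:/b/hh7flgnl/execroot/__main__/bazel-out/host/"
       "genfiles/external/bazel_tools/src/main/native/jni_md.h")

old_cache, new_cache = {}, {}
before_first = run_before(raw, old_cache, "file-jni")
before_hit = run_before(None, old_cache, "file-jni")  # cache hit on another machine
after_first = run_after(raw, new_cache, "file-jni")
after_hit = run_after(None, new_cache, "file-jni")    # cache hit on another machine
```

In this model `before_hit` is empty, which hides the problem, while `after_hit` still contains the path under D:/b/hh7flgnl, which is what the failing builds report.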

I'm working on a fix to remove the execroot prefix from include file paths in the output. We'll see if that works.

meteorcloudy · 2018-12-14T14:03:26Z

After some investigation, I think filtering the original output before storing it to a file is not possible without applying a filter to the FileOutErr class, but that's incompatible with some SpawnRunner implementations. That's why @ulfjack authored cbacbb9. So I had to go with a suboptimal solution; please see #6931.

Fixes bazelbuild/bazel#6847

This change is for making including scanning work on Windows for builds with remote caching or remote execution enabled.

After this change, the ShowIncludesFilter will look for the first `execroot\<workspace_name>` in the output header file paths, then it considers `C:\...\execroot\<workspace_name>` as the execroot path. Because execroot path could be different if remote cache is hit, we ignore it and only add the relative path as dependencies.

I'm quite unwilling to make this change, because parsing `execroot\\<workspace_name>` for execroot is not guaranteed to work always. But considering the only case this could go wrong is when people use an output base that already contains `execroot\\<workspace_name>`, which I think should never happen.

Closes #6931.

Change-Id: Ife2cb91c75f1b5b297851400e672db2b35ff09e0
PiperOrigin-RevId: 225553627
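
The approach the commit message describes can be sketched in Python as follows. This is only an illustration of the idea, not the actual ShowIncludesFilter code (which lives in Bazel's Java sources); the function name is hypothetical.

```python
# Sketch of the idea in the commit message; not the real ShowIncludesFilter.
# The function name is hypothetical.
def relative_to_execroot(path, workspace_name):
    # Look for the first "execroot\<workspace_name>\" marker and keep only
    # what follows it. Paths without the marker are returned unchanged.
    marker = "execroot\\" + workspace_name + "\\"
    index = path.find(marker)
    if index == -1:
        return path
    return path[index + len(marker):]


suffix = ("bazel-out\\host\\genfiles\\external\\bazel_tools"
          "\\src\\main\\native\\jni_md.h")

# The same header as seen from two different output bases in this thread.
from_hh7flgnl = relative_to_execroot(
    "D:\\b\\hh7flgnl\\execroot\\__main__\\" + suffix, "__main__")
from_tggxeo2z = relative_to_execroot(
    "D:\\b\\tggxeo2z\\execroot\\__main__\\" + suffix, "__main__")
```

Both calls yield the same relative path regardless of the output base, so a cached entry produced on another machine still validates. As the commit message notes, the heuristic can misfire if the output base itself already contains `execroot\<workspace_name>`, because the first match would then be the wrong one.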

meteorcloudy added the type: bug, P1, and breakage labels Dec 5, 2018

meteorcloudy assigned laszlocsomor Dec 5, 2018

meteorcloudy added the area-Windows label Dec 5, 2018

meteorcloudy unassigned laszlocsomor Dec 5, 2018

meteorcloudy removed the area-Windows label Dec 5, 2018

jin mentioned this issue Dec 5, 2018

incompatible_package_name_is_a_function: Remove PACKAGE_NAME and REPOSITORY_NAME #5827

Closed

laszlocsomor self-assigned this Dec 7, 2018

laszlocsomor mentioned this issue Dec 7, 2018

C++ compilation: flag for header-validation debug #6866

Closed

meteorcloudy mentioned this issue Dec 14, 2018

Make ShowIncludesFilter ignore execroot differences #6931

Closed

bazel-io closed this as completed in a1aa5c1 Dec 14, 2018

meteorcloudy mentioned this issue Jan 24, 2019

Include checking fails on windows for generated C header files (undeclared inclusion(s)) #7030

Closed

meteorcloudy mentioned this issue Aug 14, 2019

Windows: Bazel cannot share cpp cache between two projects with different workspace name #9172

Closed
