Bazel's own build is not reproducible on Mac #4770

Closed
gkossakowski opened this issue Mar 5, 2018 · 21 comments

@gkossakowski

gkossakowski commented Mar 5, 2018

Description of the problem / feature request:

We build Bazel from pinned source code in the -dist.zip downloaded from https://github.com/bazelbuild/bazel/releases/download.
Bazel is built with ./compile.sh on each Mac laptop separately. It turns out that, despite all Macs having the same configuration, the digests of Bazel's build outputs differ. E.g. ijar has a different hash on each machine.

Feature requests: what underlying problem are you trying to solve with this feature?

Share JDK-based (e.g. java rules) artifacts between Macs.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Build Bazel from the -dist.zip on two Macs with the same configuration, then run the following in a sample Bazel project:

md5 "$(./bazel info execution_root)"/external/bazel_tools/tools/jdk/ijar/ijar
MD5 (/private/var/tmp/_bazel_gkk/0ad33e868c3191b3f635e0a81b14b2c9/execroot/com_stripe_zoolander/external/bazel_tools/tools/jdk/ijar/ijar)
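
For machines where the `md5` tool is unavailable, the same comparison can be made with a small script. This is a minimal sketch, not part of Bazel: the function name `file_digest` is hypothetical, and it only assumes Python's standard `hashlib` module.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running `file_digest` on the ijar path reported by `bazel info execution_root` on each Mac and comparing the two strings would show whether the binaries differ.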

What operating system are you running Bazel on?

Mac

What's the output of bazel info release?

release 0.9.0- (@non-git)

If bazel info release returns "development version" or "(@non-git)", tell us how you built Bazel.

See above.

Have you found anything relevant by searching the web?

Earlier discussion of me debugging java rules reproducibility: https://groups.google.com/d/msg/bazel-discuss/5M-QoZ4gPq8/d_y1dEWnAAAJ

Any other information, logs, or outputs that you want to share?

Digest logging (or some other simple way of recursively seeing what goes into each action cache entry) would make problems like this easier to pin down in the future.
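
Until something like that exists in Bazel, a rough approximation is to hash every file under an output directory and compare the listings between machines. The sketch below is an assumption about how one might do this by hand; `tree_manifest` and `format_manifest` are hypothetical helpers, not Bazel features, and they say nothing about action cache internals.

```python
import hashlib
import os


def tree_manifest(root, algorithm="sha1"):
    """Map each regular file's path (relative to root) to its hex digest."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and other non-regular entries
            with open(full, "rb") as handle:
                # Reads the whole file at once; fine for a sketch.
                digest = hashlib.new(algorithm, handle.read()).hexdigest()
            manifest[os.path.relpath(full, root)] = digest
    return manifest


def format_manifest(manifest):
    """Render the manifest as sorted 'path digest' lines."""
    return "\n".join(f"{path} {digest}" for path, digest in sorted(manifest.items()))
```

Writing `format_manifest(tree_manifest(...))` to a file on each machine gives two listings that can be compared with `diff`.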

@laszlocsomor
Contributor

Is this the same issue as #4769?

@gkossakowski
Author

No, it's different. #4769 is about the digestKey of local_jdk not being reproducible across machines configured the same way. I can trigger that bug on both macOS and Linux.

This one is about ijar (which isn't even written in Java) having different bits across Macs configured the same way. I could only trigger this on Mac, not on Linux.

@laszlocsomor
Contributor

laszlocsomor commented Mar 12, 2018

Sorry about the silence!

Do you also see different binaries if you bazel build //third_party/ijar on various machines, or is it only when building using ./compile.sh? If the former, then the cc_* rules might be non-deterministic on Mac -- that'd be bad. If the latter, then just compile.sh might be non-deterministic. Still bad, but less so than the other case.

@jmillikin-stripe
Contributor

I've been helping Grzegorz debug this a bit. We've seen different ijar SHA1s from bazel build //third_party/ijar on various macOS machines. The SHA1 is consistent when built on the same machine, which makes me think the issue is somewhere in the Xcode toolchain. It's possible that there's some machine- or installation-specific identifier being embedded.

@jmillikin-stripe
Contributor

jmillikin-stripe commented Mar 12, 2018

I built ijar on my corp and personal laptops, and verified that Apple's Clang is generating different object code and linking options on different machines at the same compiler major version. Some of this is probably due to the OS version skew, but it's evidence that the binaries are far more machine-dependent on macOS than they are on Linux.

Notably, the Clang on my personal laptop seems to be linking its outputs with Objective C runtimes.

home$ cc --version
Apple LLVM version 9.0.0 (clang-900.0.38)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

home$ otool -L bazel-bin/third_party/ijar/ijar
bazel-bin/third_party/ijar/ijar:
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.0)
	/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 1443.14.0)
	/usr/lib/libobjc.A.dylib (compatibility version 1.0.0, current version 228.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.0.0)
home$
work$ cc --version
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

work$ otool -L bazel-bin/third_party/ijar/ijar
bazel-bin/third_party/ijar/ijar:
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.0.0)
work$
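
The difference between the two `otool -L` listings above can be extracted mechanically. The following is a minimal sketch that parses the text exactly as pasted in this comment; `linked_libraries` is a hypothetical helper name, and the parsing assumes each entry has the form `<path> (compatibility version ...)`.

```python
HOME_OUTPUT = """\
bazel-bin/third_party/ijar/ijar:
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.0)
	/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 1443.14.0)
	/usr/lib/libobjc.A.dylib (compatibility version 1.0.0, current version 228.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.0.0)
"""

WORK_OUTPUT = """\
bazel-bin/third_party/ijar/ijar:
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 400.9.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.0.0)
"""


def linked_libraries(otool_output):
    """Extract the library paths from the text printed by `otool -L`."""
    libraries = set()
    for line in otool_output.splitlines()[1:]:  # the first line names the binary
        line = line.strip()
        if line:
            # Keep only the path in front of the parenthesised version info.
            libraries.add(line.split(" (", 1)[0])
    return libraries


# Libraries linked only by the binary built on the "home" machine.
extra = linked_libraries(HOME_OUTPUT) - linked_libraries(WORK_OUTPUT)
for path in sorted(extra):
    print(path)
```

For the two listings above, the set difference contains the Foundation framework and `libobjc.A.dylib`, the Objective-C pieces mentioned in this comment.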

@jmillikin-stripe
Contributor

jmillikin-stripe commented Mar 12, 2018

On Grzegorz's machine (same compiler as my work$ log above), Bazel invokes the linker with -fobjc-link-runtime. This is coming from CROSSTOOL. Looks like some machines have action_config { name: "c++-link-executable" implies: "contains_objc_source"} and others don't.

I looked a little bit at the CROSSTOOL generator but it doesn't seem to be doing anything obviously wrong.

@jmillikin-stripe
Contributor

I think I've found the root cause: my work laptop doesn't have Xcode installed, which makes Bazel think it's not using Apple's toolchain. Some unknown set of macOS-specific options is not being set in CROSSTOOL, but compilation seems to work fine for C/C++ code.

Here's the output of xcode-select and Bazel's Xcode locator tool on my laptop:

$ xcode-select --print-path
/Library/Developer/CommandLineTools
$ bazel-bazel-0.11.1/external/local_config_cc/xcode-locator-bin 
error: Error Domain=NSOSStatusErrorDomain Code=-10814 "kLSApplicationNotFoundErr: E.g. no application claims the file"

@laszlocsomor
Contributor

@jmillikin-stripe : thanks for your debugging efforts!
/cc @c-parsons @rupertks -- does this look familiar?

@buchgr
Contributor

buchgr commented Mar 28, 2018

Has this been fixed? The test works on my macbook with latest released Bazel.

@jmillikin-stripe
Contributor

@buchgr Which test? I don't think any test has been added to verify that the presence of Xcode doesn't affect ijar builds.

@buchgr
Contributor

buchgr commented Mar 28, 2018

I think I have mixed things up. I was under the impression that this made the bazel_determinism_test fail.

@buchgr
Contributor

buchgr commented Mar 29, 2018

I have verified that the bazel_determinism_test does not work on our CI machines.

https://source.cloud.google.com/results/invocations/0cedab47-afb6-4f3b-9d57-6e7aa89802f2/targets/%2F%2Fsrc%2Ftest%2Fshell%2Fbazel:bazel_determinism_test/tests

Files /private/var/tmp/_bazel_buildkite/30004132848cb6cbb0d8bc124cd9712b/bazel-sandbox/8820973750646175047/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum1 and /private/var/tmp/_bazel_buildkite/30004132848cb6cbb0d8bc124cd9712b/bazel-sandbox/8820973750646175047/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum2 differ
-- Test log: -----------------------------------------------------------
--- /private/var/tmp/_bazel_buildkite/30004132848cb6cbb0d8bc124cd9712b/bazel-sandbox/8820973750646175047/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum1	2018-03-28 18:00:43.000000000 +0000
+++ /private/var/tmp/_bazel_buildkite/30004132848cb6cbb0d8bc124cd9712b/bazel-sandbox/8820973750646175047/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum2	2018-03-28 18:10:34.000000000 +0000
@@ -10417,0 +10418 @@
+ecd53ba69a8d479d3fa4234e959f869cd10f7ebc68860d2b7915879f8b8b2c54
@@ -10605 +10605,0 @@
-f1954b59039b74d0a0ee3b2bced748604b95b8455a5bf80489296bd81878a5c8
------------------------------------------------------------------------
test_determinism FAILED: terminated because this command returned a non-zero status:
/private/var/tmp/_bazel_buildkite/30004132848cb6cbb0d8bc124cd9712b/bazel-sandbox/8820973750646175047/execroot/io_bazel/bazel-out/darwin-fastbuild/bin/src/test/shell/bazel/bazel_determinism_test.runfiles/io_bazel/src/test/shell/bazel/bazel_determinism_test:51: in call to test_determinism
INFO[bazel_determinism_test 2018-03-28 18:10:34 (+0000)] Cleaning up workspace

@laszlocsomor
Contributor

I'm likely the wrong assignee for this bug. Mac is neither my domain of expertise, nor do I have the capacity to work on this.
@buchgr , you've been making great progress on this bug, thank you! Let me actually assign it to you because that makes more sense to me.

@philwo
Member

philwo commented Mar 30, 2018

Here's the log of the failure with #4945 in:

-- Test log: -----------------------------------------------------------
--- /private/var/tmp/_bazel_buildkite/0633d5fdbdb93738350e2aa41b56650c/bazel-sandbox/4191562436938518249/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum1	2018-03-30 20:06:15.000000000 +0000
+++ /private/var/tmp/_bazel_buildkite/0633d5fdbdb93738350e2aa41b56650c/bazel-sandbox/4191562436938518249/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum2	2018-03-30 20:11:25.000000000 +0000
@@ -11340 +11340 @@
-bazel-genfiles/external/local_jdk/_ijar/langtools-neverlink/external/local_jdk/lib/tools-ijar.jar b527a295c572a29d2f8c00b62b5887c25d8b2364
+bazel-genfiles/external/local_jdk/_ijar/langtools-neverlink/external/local_jdk/lib/tools-ijar.jar 16ba523f99b8e41420bef8e002a0211ae3d441af
------------------------------------------------------------------------
test_determinism FAILED: Non-deterministic outputs found! .

@philwo
Member

philwo commented Mar 30, 2018

I can manually repro this on a CI machine, but not on my personal iMac. This is without any of our special flags, so without remote caching, etc. - interestingly, it's the same file with the same hashes in the diff, although I ran it on a different machine.

Note that the CI machines have the Bazel with embedded JDK installed and on my personal machine I have the version without an embedded JDK. I'm not sure if that might play a role here.

I copied the tools-ijar.jar from the out1 and out2 directories from the test_tmpdir and diffed their contents:

philwo@philwo-macbookpro ~/ijar jar tvf run1_tools-ijar.jar > 1
philwo@philwo-macbookpro ~/ijar jar tvf run2_tools-ijar.jar > 2
philwo@philwo-macbookpro ~/ijar diff -u 1 2
--- 1	2018-03-30 23:37:17.000000000 +0200
+++ 2	2018-03-30 23:37:21.000000000 +0200
@@ -2326,8 +2326,9 @@
    302 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/Pretty$UncheckedIOException.class
  10460 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/Pretty.class
  16253 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeCopier.class
+   569 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeInfo$PosKind.class
    489 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeInfo$TypeAnnotationFinder.class
-  6884 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeInfo.class
+  7031 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeInfo.class
   1936 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeMaker$AnnotationBuilder.class
  26481 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeMaker.class
   8291 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeScanner.class
@@ -2889,7 +2890,7 @@
    392 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/jdi/StratumLineInfo.class
    284 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/jdi/StringReferenceImpl.class
   1079 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/jdi/SunCommandLineLauncher.class
-   372 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/jdi/TargetVM$EventController.class
+   356 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/jdi/TargetVM$EventController.class
    578 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/jdi/TargetVM.class
    278 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/jdi/ThreadAction.class
    522 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/jdi/ThreadGroupReferenceImpl$Cache.class
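
The same comparison can be scripted without `jar tvf`. This is a minimal sketch using Python's standard `zipfile` module (a jar is a zip archive); `jar_entries` and `diff_jars` are hypothetical helper names, and comparing size plus CRC-32 is a choice made for this illustration rather than what the determinism test does.

```python
import zipfile


def jar_entries(jar):
    """Map entry name -> (uncompressed size, CRC-32) for every entry in a jar."""
    with zipfile.ZipFile(jar) as archive:
        return {info.filename: (info.file_size, info.CRC) for info in archive.infolist()}


def diff_jars(first, second):
    """Report entries that exist in only one jar or whose contents differ."""
    a, b = jar_entries(first), jar_entries(second)
    return {
        "only_in_first": sorted(a.keys() - b.keys()),
        "only_in_second": sorted(b.keys() - a.keys()),
        "changed": sorted(name for name in a.keys() & b.keys() if a[name] != b[name]),
    }
```

Applied to the two tools-ijar.jar files above, `diff_jars` would be expected to list `TreeInfo$PosKind.class` under `only_in_second`, and `TreeInfo.class` and `TargetVM$EventController.class` under `changed`, since those entries differ in the listing.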

bazel-io pushed a commit that referenced this issue Mar 30, 2018
Don't ask me how so many things can be wrong in a single test...

Progress towards #4770.

FYI @rupertks @buchgr @ulfjack

## Replace 25000 invocations of perl with a single "sha256sum"

This speeds up the test by a factor 2x on my iMac (before: 1200s, now: 600s).

On macOS, "shasum" is a Perl script. Instead of simply passing all input files to the thing at once, we were invoking it once per file. This means roughly 25,000 invocations of Perl per test run. And it's even worse - it wasn't just a call to that Perl script, it was wrapped in a "cat | shasum | cut" pipeline, resulting in silent data loss when you accidentally passed multiple input files to the thing, 75,000 processes being spawned just to compute hashes and losing the file name of what was actually hashed. WTF.

Also, we were using SHA256 to essentially verify that two directory trees are equal. For this purpose, relying on SHA1 should be absolutely fine - and that is, provided by a good native implementation, four times faster than `shasum`. It saves another 10 seconds of the overall run.

With this change, the test also prints the result of a failed determinism check in an easier to read format "filename hash" instead of "hash filename" and on top of that, it also prints the filenames in the diff on macOS, which was missing formerly. Without this, it was basically impossible to debug failures of this test on macOS, as you couldn't see *which files were different*. You had *one* job, bazel_determinism_test.

Before:

```
-- Test log: -----------------------------------------------------------
--- /private/var/tmp/_bazel_buildkite/30004132848cb6cbb0d8bc124cd9712b/bazel-sandbox/8820973750646175047/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum1	2018-03-28 18:00:43.000000000 +0000
+++ /private/var/tmp/_bazel_buildkite/30004132848cb6cbb0d8bc124cd9712b/bazel-sandbox/8820973750646175047/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum2	2018-03-28 18:10:34.000000000 +0000
@@ -10417,0 +10418 @@
+ecd53ba69a8d479d3fa4234e959f869cd10f7ebc68860d2b7915879f8b8b2c54
@@ -10605 +10605,0 @@
-f1954b59039b74d0a0ee3b2bced748604b95b8455a5bf80489296bd81878a5c8
------------------------------------------------------------------------
```

Now (I artificially introduced non-hermeticism to show what a failure would look like):

```
-- Test log: -----------------------------------------------------------
--- /private/var/tmp/_bazel_philwo/7a01905b4627ca044e5e3f5ad5b14d26/bazel-sandbox/5464595340038418595/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum1 2018-03-30 17:12:39.000000000 +0000
+++ /private/var/tmp/_bazel_philwo/7a01905b4627ca044e5e3f5ad5b14d26/bazel-sandbox/5464595340038418595/execroot/io_bazel/_tmp/e503f3f3df14b71e247bc3d7d9bf3608/sum2 2018-03-30 17:17:27.000000000 +0000
@@ -903 +903 @@
-bazel-bin/src/bazel 31d811338ca364f0631560dd4d29406dd6a778ce
+bazel-bin/src/bazel 8f009173894730b00a1d1d6349af7d10f4d21cf3
@@ -5656 +5656 @@
-bazel-bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar f5ec8c4415ad8ecdc0385affc68f2dd4dbf241ef
+bazel-bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar 9899ae35cf431087a34a830bfdaf19d99616689c
@@ -8343 +8343 @@
-bazel-bin/src/main/java/com/google/devtools/build/lib/worker/_javac/worker/libworker_classes/com/google/devtools/build/lib/worker/WorkerFactory.class 780baa17c19ef99ef0b9291db1791ed8e0f1b231
+bazel-bin/src/main/java/com/google/devtools/build/lib/worker/_javac/worker/libworker_classes/com/google/devtools/build/lib/worker/WorkerFactory.class d45c14f09e73e7fcdf01f96aa32646c87b704bc2
@@ -8359 +8359 @@
-bazel-bin/src/main/java/com/google/devtools/build/lib/worker/libworker.jar 60e3afbfec17da7e44c1f0f61cf2a446196717be
+bazel-bin/src/main/java/com/google/devtools/build/lib/worker/libworker.jar 70f557e87d1b32b2e46c79554fe6bf3b89aeaf6e
@@ -11343 +11343 @@
-bazel-genfiles/src/install_base_key 3fad754e4ea19bd1120df5bf16e1f39372e6b9fe
+bazel-genfiles/src/install_base_key 7d7e8b62493912c5ec153032e104640e3980e6b3
@@ -11376 +11376 @@
-bazel-genfiles/src/package.zip 1ce3431b021ca338806162eca72ff84118001df5
+bazel-genfiles/src/package.zip 65f4801d91bbe10cba0d2d4d55c7cf319cd6722d
------------------------------------------------------------------------
test_determinism FAILED: Non-deterministic outputs found! .
```
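
To illustrate why the "filename hash" format is easier to work with, here is a minimal sketch of comparing two such listings. It is written in Python for brevity and is not the test's actual implementation; `parse_manifest` and `nondeterministic_files` are hypothetical names. The sample lines reuse hashes from the log above, except where a comment marks an entry as hypothetical.

```python
RUN1 = """\
bazel-bin/src/bazel 31d811338ca364f0631560dd4d29406dd6a778ce
bazel-genfiles/src/package.zip 1ce3431b021ca338806162eca72ff84118001df5
"""

# Hypothetical second run: package.zip is kept identical on purpose so the
# example contains one matching and one differing entry.
RUN2 = """\
bazel-bin/src/bazel 8f009173894730b00a1d1d6349af7d10f4d21cf3
bazel-genfiles/src/package.zip 1ce3431b021ca338806162eca72ff84118001df5
"""


def parse_manifest(text):
    """Parse 'filename hash' lines into a dict keyed by filename."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            # Split on the last space so filenames containing spaces survive.
            path, digest = line.rsplit(" ", 1)
            entries[path] = digest
    return entries


def nondeterministic_files(first_text, second_text):
    """Return the files that are missing from one run or hash differently."""
    first, second = parse_manifest(first_text), parse_manifest(second_text)
    return sorted(p for p in first.keys() | second.keys() if first.get(p) != second.get(p))


differing = nondeterministic_files(RUN1, RUN2)
print(differing)
```

Because each line carries its filename, a mismatch points directly at the offending output, which the old hash-only format could not do.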

## Remove obsolete check for BAZEL_TEST_XTRACE

That string does not appear anywhere in our repo, except for these two lines in the test, so there's no point in checking for it.

## Remove obsolete check for Java 7

That was about time.

## Performance improvements and usability fixes

- There's no need to use mktemp to create a unique directory under TEST_TMPDIR, as every test suite has its own TEST_TMPDIR.
- There's no need to remove stuff, as this will just degrade performance and make debugging harder. The surrounding Bazel or system will clean up later.
- There's no need to copy bazel-bin/src/bazel to ./bazel1 before calling it, as you can just call the built bazel from its original location.
- There's no need to run "bazel clean" before the second "bazel build" invocation - it's better to just use two separate output_bases. This is faster and also makes debugging easier, as you can compare the two output_bases in case of a test failure.
- There's no need to call "diff" twice - we can just save the output immediately in the `if` block.

Closes #4945.

PiperOrigin-RevId: 191118833

@gkossakowski
Author

this is a bit of a tangent but i found bazel without embedded jdk to give non-reproducible results: #4769
i had to switch to bazel with embedded jdk to get reproducible builds with scala/java rules

@buchgr
Contributor

buchgr commented Apr 9, 2018

@philwo it seems that the two bazel binaries are using a different javac?

@cushon
Contributor

cushon commented Apr 12, 2018

This class was added to the JDK by the fix for JDK-8180660, which was backported to JDK 8u. I don't know how the determinism test works, but it looks like you're seeing skew between two different JDKs.

+   569 Fri Jan 01 00:00:00 CET 2010 com/sun/tools/javac/tree/TreeInfo$PosKind.class

@buchgr
Contributor

buchgr commented Aug 6, 2018

Is this still a problem?

@hlopko
Member

hlopko commented Jan 11, 2019

@iirina @meisterT maybe this was fixed by your work on jdks?

@meisterT
Member

well, in 6a0a8de#diff-68be8e4b177e1489fffa0557873b6943 we enabled the determinism test again for Mac, so I assume it's fixed. If not, please reopen.
