Bazel 0.16.0 crashes with StackOverflow on Windows #5730
Interestingly, we also had one successful run with Bazel 0.16.0 on Windows: https://buildkite.com/bazel/bazel-bazel/builds/3703#4638243b-5e19-47ba-9cd6-806cf1f49248
The problem seems to be a bad path, see the 2nd line of your log:
The backslash is missing after "c:". I'll look into it.
Haha, never mind, that has nothing to do with the StackOverflow :)
I downgraded our Windows VMs to 0.15.2 and now the same build that reproducibly crashed before seems to work fine.
Seeing this in the Android-testing pipeline, may be related? Or is the Android build simply too large?
Any updates on this issue @philwo @buchgr @laszlocsomor?
Sorry, I have no updates. Anyone else?
Update: I can also reproduce this error on my local machine with 0.16.0rc3.
Just tried 0.16.0rc1, it also failed with "Server terminated abruptly". So the culprit predates the base commit of 0.16.0.
@meteorcloudy I don't think that commit is in Bazel 0.16.0 (it uses OpenJDK 9), though?
@philwo You are right
I'm running bisect from 0.15.2's base commit to 0.16.0's base commit. It will take some time.
@meteorcloudy if you want to rule out that the JDK9 is at fault, please use #5786 which is the 0.16.0 code base with JDK8.
@buchgr Are we using JDK9 in 0.15.2?
The error seems to always happen in a [1] http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#threads_oom |
I'm a little unclear on the root cause here -- do we anticipate this is a JDK9 issue that would be fixed by the commits in #5760?
@c-parsons no, I think this is unrelated to #5760.
@c-parsons Yes, this issue is gone if we revert the JDK version to 8. But @buchgr has figured out a fix to make it work with JDK9 as well. I ran a test of a Bazel 0.16.0 build with his fix all night, and it didn't fail once in 100 reruns. #5760 is irrelevant to this issue; I'm cc'ing you on the internal CL for fixing this.
We found that with JDK9 and up Bazel would sometimes crash with a StackOverflowError in one of the Command-Accumulator-Thread-* threads. We experimentally found that this error was due to these threads being constrained to a 32KiB stack size. The default stack size for JVM threads on most 64-bit systems is 1MiB (So that's 3% of the default).

The purpose of the Command-Accumulator-Threads is to read stdout/stderr from processes that Bazel launches locally. The proposed fix is to just use the system default stack size for these threads. The alternative is to increase the size limit to some arbitrary number that happens to work, but this is likely premature optimization and I'd like to avoid that if possible.

We further found that this code even predates Blaze/Bazel and is from 2005.

PiperOrigin-RevId: 208009940
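The stack-size constraint described in the commit message can be illustrated with a small standalone Java program. This is only a sketch and not Bazel's code: the class `StackSizeDemo` and its methods are hypothetical, and the thread name is borrowed from the commit message purely for illustration. It relies on the standard `Thread(ThreadGroup, Runnable, String, long stackSize)` constructor, where a `stackSize` of 0 tells the JVM to ignore the parameter and use its default. The JVM treats a non-zero value as a hint and may round it up to a platform minimum, so the depths printed will vary by platform and JDK.

```java
public class StackSizeDemo {

    // Recurses until the stack is exhausted, counting frames in counter[0].
    private static void descend(int[] counter) {
        counter[0]++;
        descend(counter);
    }

    // Runs the recursion on a fresh thread created with the requested stack
    // size (0 means "use the JVM default") and returns the depth reached
    // before a StackOverflowError was thrown.
    static int maxDepth(long stackSizeBytes) throws InterruptedException {
        int[] result = new int[1];
        Runnable probe = () -> {
            try {
                descend(result);
            } catch (StackOverflowError expected) {
                // The stack has unwound by the time we get here; the depth
                // reached is already recorded in result[0].
            }
        };
        // Thread name is illustrative only; it mirrors the naming in the
        // commit message but this is not Bazel's actual thread setup.
        Thread t = new Thread(null, probe, "Command-Accumulator-Thread-demo", stackSizeBytes);
        t.start();
        t.join(); // join() makes the write to result[0] visible here
        return result[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("32 KiB requested: depth " + maxDepth(32 * 1024));
        System.out.println("JVM default (0):  depth " + maxDepth(0));
    }
}
```

On a typical 64-bit JVM the thread that requested 32 KiB is expected to overflow at a much shallower depth than the one using the default, which is the effect the fix removes by no longer passing an explicit size.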
Baseline: 4f64b77

Cherry picks:
   + 4c9a0c8: reduce the size of bazel's embedded jdk
   + d3228b6: remote: limit number of open tcp connections by default. Fixes #5491
   + 8ff87c1: Fix autodetection of linker flags
   + c4622ac: Fix autodetection of -z linker flags
   + 1021965: blaze_util_posix.cc: fix order of #define
   + ab1f269: blaze_util_freebsd.cc: include path.h explicitly
   + 68e92b4: openjdk: update macOS openjdk image. Fixes #5532
   + f45c224: Set the start time of binary and JSON profiles to zero correctly.
   + bca1912: remote: fix race on download error. Fixes #5047
   + 3842bd3: jdk: use parallel old gc and disable compact strings
   + 6bd0bdf: Add objc-fully-link to the list of actions that require the apple_env feature. This fixes apple_static_library functionality.
   + f330439: Add the action_names_test_files target to the OSS version of tools/buils_defs/cc/BUILD.
   + d215b64: Fix StackOverflowError on Windows. Fixes #5730
   + 366da4c: In java_rules_skylark depend on the javabase through //tools/jdk:current_java_runtime
   + 30c601d: Don't use @local_jdk for jni headers
   + c56699d: 'DumpPlatformClasspath' now dumps the current JDK's default platform classpath

This release is a patch release that contains fixes for several serious regressions that were found after the release of Bazel 0.16.0.

In particular this release resolves the following issues:

 - Bazel crashes with a StackOverflowError on Windows (See #5730)
 - Bazel requires a locally installed JDK and does not fall back to the embedded JDK (See #5744)
 - Bazel fails to build for Homebrew on macOS El Capitan (See #5777)
 - A regression in apple_static_library (See #5683)

Please watch our blog for a more detailed release announcement.
After upgrading our Buildkite VMs to Bazel 0.16.0, we noticed that Bazel reproducibly crashes with a StackOverflow a few seconds after it starts building:
https://buildkite.com/bazel/bazel-bazel/builds/3702#a46f7545-e15e-4eca-bee2-904709ee4fed
I managed to repro this manually on the Windows VM and grabbed the jvm.out log:
@buchgr and my theory is that it must be caused by one of the last three cherry-picks that went into 0.16.0, because apparently 0.16.0rc3 was tested with downstream projects (which should have shown this issue), but 0.16.0rc4 (which became the final version) wasn't:
https://buildkite.com/bazel/bazel-with-downstream-projects-bazel/builds?branch=release-0.16.0