Bazel 0.16.0 crashes with StackOverflow on Windows #5730

Closed
philwo opened this issue Aug 1, 2018 · 21 comments
Assignees
Labels: breakage, category: misc > misc, P0 (This is an emergency and more important than other current work; assignee required), platform: windows, type: bug

Comments

@philwo
Member

philwo commented Aug 1, 2018

After upgrading our Buildkite VMs to Bazel 0.16.0, we noticed that Bazel reproducibly crashes with a StackOverflow a few seconds after it starts building:

https://buildkite.com/bazel/bazel-bazel/builds/3702#a46f7545-e15e-4eca-bee2-904709ee4fed

I managed to repro this manually on the Windows VM and grabbed the jvm.out log:

Can't load log handler "java.util.logging.FileHandler"
java.nio.file.NoSuchFileException: c:users\philwo\_bazel~1\ozmjprom\java.log.lck
java.nio.file.NoSuchFileException: c:users\philwo\_bazel~1\ozmjprom\java.log.lck
	at java.base/sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
	at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
	at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
	at java.base/sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown Source)
	at java.base/java.nio.channels.FileChannel.open(Unknown Source)
	at java.base/java.nio.channels.FileChannel.open(Unknown Source)
	at java.logging/java.util.logging.FileHandler.openFiles(Unknown Source)
	at java.logging/java.util.logging.FileHandler.<init>(Unknown Source)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
	at java.base/java.lang.Class.newInstance(Unknown Source)
	at java.logging/java.util.logging.LogManager.createLoggerHandlers(Unknown Source)
	at java.logging/java.util.logging.LogManager.access$1300(Unknown Source)
	at java.logging/java.util.logging.LogManager$4.run(Unknown Source)
	at java.logging/java.util.logging.LogManager$4.run(Unknown Source)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.logging/java.util.logging.LogManager.loadLoggerHandlers(Unknown Source)
	at java.logging/java.util.logging.LogManager.initializeGlobalHandlers(Unknown Source)
	at java.logging/java.util.logging.LogManager.access$1800(Unknown Source)
	at java.logging/java.util.logging.LogManager$RootLogger.accessCheckedHandlers(Unknown Source)
	at java.logging/java.util.logging.Logger.getHandlers(Unknown Source)
	at java.logging/java.util.logging.Logger.log(Unknown Source)
	at java.logging/java.util.logging.Logger.doLog(Unknown Source)
	at java.logging/java.util.logging.Logger.log(Unknown Source)
	at java.logging/java.util.logging.Logger.info(Unknown Source)
	at com.google.devtools.build.lib.analysis.BlazeVersionInfo.logVersionInfo(BlazeVersionInfo.java:64)
	at com.google.devtools.build.lib.analysis.BlazeVersionInfo.setBuildInfo(BlazeVersionInfo.java:79)
	at com.google.devtools.build.lib.bazel.Bazel.main(Bazel.java:65)
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by io.netty.util.internal.ReflectionUtil (file:/C:/Users/philwo/_bazel_philwo/install/d522db92d7621d404f960f24e49d0696/_embedded_binaries/A-server.jar) to field sun.nio.ch.SelectorImpl.selectedKeys
WARNING: Please consider reporting this to the maintainers of io.netty.util.internal.ReflectionUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
java.lang.StackOverflowError
	at java.base/java.util.concurrent.FutureTask.finishCompletion(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.set(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Exception: java.lang.StackOverflowError thrown from the UncaughtExceptionHandler in thread "Command-Accumulator-Thread-19"

@buchgr's and my theory is that it must be caused by one of the last three cherry-picks that went into 0.16.0, because apparently 0.16.0rc3 was tested with downstream projects (which should have shown this issue), but 0.16.0rc4 (which became the final version) wasn't:

https://buildkite.com/bazel/bazel-with-downstream-projects-bazel/builds?branch=release-0.16.0

philwo assigned philwo, laszlocsomor and buchgr and unassigned philwo and laszlocsomor Aug 1, 2018
philwo added the labels type: bug, P0 (This is an emergency and more important than other current work; assignee required), platform: windows, category: misc > misc, and breakage Aug 1, 2018
@philwo
Member Author

philwo commented Aug 1, 2018

Interestingly, we also had one successful run with Bazel 0.16.0 on Windows: https://buildkite.com/bazel/bazel-bazel/builds/3703#4638243b-5e19-47ba-9cd6-806cf1f49248

@laszlocsomor
Contributor

The problem seems to be a bad path, see the 2nd line of your log:

java.nio.file.NoSuchFileException: c:users\philwo\_bazel~1\ozmjprom\java.log.lck

The backslash is missing after "c:". I'll look into it.
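To illustrate why `c:users\...` looks suspicious: on Windows, a drive letter followed directly by a name (with no backslash after the colon) is resolved relative to the current directory on that drive rather than from the drive root. The sketch below is a hypothetical string-level check that flags such paths on any platform. The class and method names are invented for illustration and are not Bazel code.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Bazel: classifies Windows-style path strings.
class DrivePathCheck {
    // Matches a drive letter and colon that are NOT followed by a slash or
    // backslash, e.g. "c:users\foo" (drive-relative) but not "c:\users\foo".
    private static final Pattern DRIVE_RELATIVE =
            Pattern.compile("^[A-Za-z]:(?![\\\\/]).*");

    static boolean isDriveRelative(String path) {
        return DRIVE_RELATIVE.matcher(path).matches();
    }

    public static void main(String[] args) {
        // Path as it appears in the log: prints true
        System.out.println(isDriveRelative("c:users\\philwo\\_bazel~1\\ozmjprom\\java.log.lck"));
        // Same path with the backslash after the drive: prints false
        System.out.println(isDriveRelative("c:\\users\\philwo\\_bazel~1\\ozmjprom\\java.log.lck"));
    }
}
```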

@laszlocsomor
Contributor

Haha, never mind, that has nothing to do with the StackOverflow :)

@philwo
Member Author

philwo commented Aug 1, 2018

I downgraded our Windows VMs to 0.15.2 and now the same build that reproducibly crashed before seems to work fine.

@jin
Member

jin commented Aug 1, 2018

Seeing this in the Android-testing pipeline, may be related? Or is the Android build simply too large?


ERROR: D:/build/buildkite-worker-windows-java8-fvgb-1/bazel/android-testing/ui/uiautomator/BasicSample/BUILD.bazel:5:1: Couldn't build file ui/uiautomator/BasicSample/_renamed/BasicSampleLib/AndroidManifest.xml: Merging manifest for //ui/uiautomator/BasicSample:BasicSampleLib failed (Exit 1)
--
  | Error occurred during initialization of VM
  | Could not reserve enough space for code cache
  | ERROR: C:/users/b/_bazel_b/p5ow7b7b/external/com_android_support_support_media_compat_27_0_2/BUILD:7:1: Couldn't build file external/com_android_support_support_media_compat_27_0_2/com_android_support_support_media_compat_27_0_2_symbols/local.bin: Parsing Android resources for @com_android_support_support_media_compat_27_0_2//:com_android_support_support_media_compat_27_0_2 failed (Exit 1)
  | java.lang.OutOfMemoryError
  | at java.util.zip.ZipFile.open(Native Method)
  | at java.util.zip.ZipFile.<init>(ZipFile.java:225)
  | at java.util.zip.ZipFile.<init>(ZipFile.java:155)
  | at java.util.jar.JarFile.<init>(JarFile.java:166)
  | at java.util.jar.JarFile.<init>(JarFile.java:103)
  | at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:930)
  | at sun.misc.URLClassPath$JarLoader.access$800(URLClassPath.java:791)
  | at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:876)
  | at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:869)
  | at java.security.AccessController.doPrivileged(Native Method)
  | at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:868)
  | at sun.misc.URLClassPath$JarLoader.<init>(URLClassPath.java:841)
  | at sun.misc.URLClassPath$3.run(URLClassPath.java:565)
  | at sun.misc.URLClassPath$3.run(URLClassPath.java:555)
  | at java.security.AccessController.doPrivileged(Native Method)
  | at sun.misc.URLClassPath.getLoader(URLClassPath.java:554)
  | at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
  | at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:484)
  | at sun.misc.URLClassPath.getResource(URLClassPath.java:238)
  | at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
  | at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
  | at java.security.AccessController.doPrivileged(Native Method)
  | at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
  | at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  | at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
  | at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  | at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
  | Error: A JNI error has occurred, please check your installation and try again
  |  
  | INFO: From Parsing Android resources for @com_android_support_test_monitor_1_0_2_alpha1//:com_android_support_test_monitor_1_0_2_alpha1:
  | OpenJDK 64-Bit Server VM warning: Initialization of C1 CompilerThread14 thread failed (no space to run compilers)


@dslomov
Contributor

dslomov commented Aug 6, 2018

Any updates on this issue @philwo @buchgr @laszlocsomor ?

@laszlocsomor
Contributor

Sorry, I have no updates. Anyone else?

buchgr removed their assignment Aug 6, 2018
@meteorcloudy
Member

Update: I can also reproduce this error on my local machine with 0.16.0rc3

@meteorcloudy
Member

Just tried 0.16.0rc1, it also failed with "Server terminated abruptly". So the culprit is even before the base commit of 0.16.0.

@meteorcloudy
Member

meteorcloudy commented Aug 8, 2018

Now I suspect 4c9149d, because I can reproduce the same error at this commit.

@philwo
Member Author

philwo commented Aug 8, 2018

@meteorcloudy I don't think that commit is in Bazel 0.16.0 (it uses OpenJDK 9), though?

@meteorcloudy
Member

@philwo You are right

@meteorcloudy
Member

I'm running bisect from 0.15.2's base commit to 0.16.0's base commit. It will take some time.

@buchgr
Contributor

buchgr commented Aug 8, 2018

@meteorcloudy if you want to rule out that the JDK9 is at fault, please use #5786 which is the 0.16.0 code base with JDK8.

@meteorcloudy
Member

@buchgr Are we using JDK9 in 0.15.2?

@buchgr
Contributor

buchgr commented Aug 8, 2018

@meteorcloudy no

@buchgr
Contributor

buchgr commented Aug 8, 2018

The error seems to always happen in a Command-Accumulator-Thread-* thread. Looking at the code, we specifically limit the stack size of these threads to 32 KiB (presumably to lower the memory consumption). Now, the default stack size on 64-bit Windows is 1 MiB [1]. I think if @meteorcloudy finds the JDK9 change to be at fault, then a likely explanation is that the implementation of the thread pool changed and now requires more than 32 KiB of stack.

[1] http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#threads_oom
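For context, Java lets a caller request a per-thread stack size through the four-argument `Thread` constructor; the JVM treats the value as a hint and may round it up or ignore it. The sketch below is not Bazel's actual code: the class name, the method names, and the recursion probe are assumptions made for illustration. It counts how many frames a trivial recursion reaches in a thread created with a 32 KiB request and in one created with the default (a `stackSize` of 0).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative probe, not Bazel code.
class StackSizeProbe {
    // Recurses until the thread's stack is exhausted, counting frames.
    private static void recurse(AtomicInteger depth) {
        depth.incrementAndGet();
        recurse(depth);
    }

    // Runs the probe in a new thread; stackSizeBytes == 0 means "JVM default".
    static int maxDepth(long stackSizeBytes) {
        AtomicInteger depth = new AtomicInteger();
        Runnable probe = () -> {
            try {
                recurse(depth);
            } catch (StackOverflowError expected) {
                // Stack exhausted; depth now holds the number of frames reached.
            }
        };
        Thread t = new Thread(null, probe, "stack-probe", stackSizeBytes);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return depth.get();
    }

    public static void main(String[] args) {
        System.out.println("32 KiB request: " + maxDepth(32 * 1024) + " frames");
        System.out.println("default stack:  " + maxDepth(0) + " frames");
    }
}
```

The absolute numbers vary by JVM, platform, and JIT state, so the output is only useful as a relative comparison on one machine.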

@meteorcloudy
Member

I have confirmed it didn't fail on #5786. I'm now testing a possible fix suggested by @buchgr.

@c-parsons
Contributor

I'm a little unclear on the root cause here -- do we anticipate this is a JDK9 issue that would be fixed by the commits in #5760 ?

@cushon
Contributor

cushon commented Aug 8, 2018

@c-parsons no, I think this is unrelated to #5760.

@meteorcloudy
Member

@c-parsons Yes, this issue is gone if we revert the JDK version to 8. But @buchgr has figured out a fix that makes it work with JDK9 as well. I ran a test of Bazel 0.16.0 with his fix all night, and it didn't fail in 100 reruns.

#5760 is unrelated to this issue. I'm cc'ing you on the internal CL that fixes this.

buchgr added a commit to buchgr/bazel that referenced this issue Aug 9, 2018
We found that with JDK9 and up Bazel would sometimes crash
with a StackOverflowError in one of the Command-Accumulator-Thread-*
threads. We experimentally found that this error was due to these
threads being constrained to a 32KiB stack size. The default stack
size for JVM threads on most 64-bit systems is 1MiB (So that's 3%
of the default). The purpose of the Command-Accumulator-Threads
is to read stdout/stderr from processes that Bazel launches locally.

The proposed fix is to just use the system default stack size for
these threads. The alternative is to increase the size limit to
some arbitrary number that happens to work, but this is likely
premature optimization and I'd like to avoid that if possible. We
further found that this code even predates Blaze/Bazel and is
from 2005.

PiperOrigin-RevId: 208009940
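As a rough illustration of the approach described in this commit message, the sketch below reads a stream on a worker thread created without an explicit stack size, so the JVM default applies. Everything here is an assumption for illustration: the class name, the thread-name prefix (borrowed from the log earlier in this thread), and the in-memory stream standing in for a real process's stdout. It is not the code from the actual change.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch, not the code from the Bazel change.
class DefaultStackAccumulator {
    private static final AtomicInteger COUNTER = new AtomicInteger();

    // No stackSize argument is passed, so each thread gets the default stack.
    static final ThreadFactory FACTORY = r -> {
        Thread t = new Thread(r, "Command-Accumulator-Thread-" + COUNTER.incrementAndGet());
        t.setDaemon(true);
        return t;
    };

    // Reads the whole stream on a pool thread and returns it as UTF-8 text.
    static String drain(InputStream in) {
        ExecutorService pool = Executors.newCachedThreadPool(FACTORY);
        try {
            Future<String> result = pool.submit(() -> {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            });
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a child process's stdout.
        InputStream fake = new ByteArrayInputStream("hello\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(drain(fake));
    }
}
```

With a real process, the stream would come from something like `Process.getInputStream()`; the trade-off named in the commit message is that default-sized stacks reserve more memory per thread than a 32 KiB request would.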
bazel-io pushed a commit that referenced this issue Aug 13, 2018
Baseline: 4f64b77

Cherry picks:
   + 4c9a0c8:
     reduce the size of bazel's embedded jdk
   + d3228b6:
     remote: limit number of open tcp connections by default. Fixes
     #5491
   + 8ff87c1:
     Fix autodetection of linker flags
   + c4622ac:
     Fix autodetection of -z linker flags
   + 1021965:
     blaze_util_posix.cc: fix order of #define
   + ab1f269:
     blaze_util_freebsd.cc: include path.h explicitly
   + 68e92b4:
     openjdk: update macOS openjdk image. Fixes #5532
   + f45c224:
     Set the start time of binary and JSON profiles to zero correctly.
   + bca1912:
     remote: fix race on download error. Fixes #5047
   + 3842bd3:
     jdk: use parallel old gc and disable compact strings
   + 6bd0bdf:
     Add objc-fully-link to the list of actions that require the
     apple_env feature. This fixes apple_static_library functionality.
   + f330439:
     Add the action_names_test_files target to the OSS version of
     tools/build_defs/cc/BUILD.
   + d215b64:
     Fix StackOverflowError on Windows. Fixes #5730
   + 366da4c:
     In java_rules_skylark depend on the javabase through
     //tools/jdk:current_java_runtime
   + 30c601d:
     Don't use @local_jdk for jni headers
   + c56699d:
     'DumpPlatformClasspath' now dumps the current JDK's default
     platform classpath

This release is a patch release that contains fixes for several serious
regressions that were found after the release of Bazel 0.16.0.

In particular this release resolves the following issues:

 - Bazel crashes with a StackOverflowError on Windows (See #5730)
 - Bazel requires a locally installed JDK and does not fall back
   to the embedded JDK (See #5744)
 - Bazel fails to build for Homebrew on macOS El Capitan (See #5777)
 - A regression in apple_static_library (See #5683)

Please watch our blog for a more detailed release announcement.
laurentlb pushed a commit that referenced this issue Sep 12, 2018
We found that with JDK9 and up Bazel would sometimes crash
with a StackOverflowError in one of the Command-Accumulator-Thread-*
threads. We experimentally found that this error was due to these
threads being constrained to a 32KiB stack size. The default stack
size for JVM threads on most 64-bit systems is 1MiB (So that's 3%
of the default). The purpose of the Command-Accumulator-Threads
is to read stdout/stderr from processes that Bazel launches locally.

The proposed fix is to just use the system default stack size for
these threads. The alternative is to increase the size limit to
some arbitrary number that happens to work, but this is likely
premature optimization and I'd like to avoid that if possible. We
further found that this code even predates Blaze/Bazel and is
from 2005.

PiperOrigin-RevId: 208009940