JVM crashes when starting the job_worker process #12502

Closed
liupan664021 opened this issue Nov 13, 2020 · 5 comments · Fixed by #12504
Labels
type-bug This issue is about a bug

Comments

@liupan664021
Contributor

Alluxio Version:
master branch, 2.3.0, 2.4.0

Describe the bug
I compiled Alluxio locally on CentOS 6 without errors. However, when I try to start the job_worker process, I get the following error and the JVM crashes.

2020-11-12 15:56:04,999 INFO network.NettyUtils (NettyUtils.java:checkNettyEpollAvailable) - EPOLL_MODE is available
2020-11-12 15:56:05,517 INFO metrics.MetricsSystem (MetricsSystem.java:startSinksFromConfig) - Starting sinks with config: {}.
2020-11-12 15:56:05,519 INFO metrics.MetricsHeartbeatContext (MetricsHeartbeatContext.java:addHeartbeat) - Created metrics heartbeat with ID app-8127555058117044977. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2020-11-12 15:56:05,547 INFO network.TieredIdentityFactory (TieredIdentityFactory.java:localIdentity) - Initialized tiered identity TieredIdentity(node=100.76.19.7, rack=presto-ss-qe-presto-test)
2020-11-12 15:56:05,596 INFO util.log (Log.java:initialized) - Logging initialized @1076ms to org.eclipse.jetty.util.log.Slf4jLog
2020-11-12 15:56:05,725 INFO alluxio.ProcessUtils (ProcessUtils.java:run) - Starting Alluxio job worker.
2020-11-12 15:56:05,725 INFO alluxio.ProcessUtils (ProcessUtils.java:run) - Running under Java 1.8.0_252
2020-11-12 15:56:05,726 INFO web.WebServer (WebServer.java:start) - Alluxio Job Manager Worker Web service starting @ /0.0.0.0:30003
2020-11-12 15:56:05,727 INFO metrics.MetricsHeartbeatContext (MetricsHeartbeatContext.java:addHeartbeat) - Created metrics heartbeat with ID app-4950460193034851762. This ID will be used for identifying info from the client. It can be set manually through the alluxio.user.app.id property
2020-11-12 15:56:05,730 INFO server.Server (Server.java:doStart) - jetty-9.4.31.v20200723; built: 2020-07-23T17:57:36.812Z; git: 450ba27947e13e66baa8cd1ce7e85a4461cacc1d; jvm 1.8.0_252-b4
2020-11-12 15:56:05,756 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@7cbd9d24{/metrics/json,null,AVAILABLE}
2020-11-12 15:56:05,757 WARN security.SecurityHandler (ConstraintSecurityHandler.java:checkPathsWithUncoveredHttpMethods) - [email protected]@50dfbc58{/,null,STARTING} has uncovered http methods for path: /
2020-11-12 15:56:09,586 INFO handler.ContextHandler (ContextHandler.java:doStart) - Started o.e.j.s.ServletContextHandler@50dfbc58{/,null,AVAILABLE}
2020-11-12 15:56:09,594 INFO server.AbstractConnector (AbstractConnector.java:doStart) - Started ServerConnector@4470fbd6{HTTP/1.1, (http/1.1)}{0.0.0.0:30003}
2020-11-12 15:56:09,595 INFO server.Server (Server.java:doStart) - Started @5075ms
2020-11-12 15:56:09,595 INFO web.WebServer (WebServer.java:start) - Alluxio Job Manager Worker Web service started @ /0.0.0.0:30003
2020-11-12 15:56:09,653 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:start) - Started Alluxio job worker with id 1605167752223
2020-11-12 15:56:09,653 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:start) - Alluxio job worker version 2.5.0-SNAPSHOT started. bindHost=/0.0.0.0:30001, connectHost=tdw-100-76-19-7:30001, rpcPort=30001, webPort=30003
2020-11-12 15:56:09,653 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:startServingRPCServer) - Starting gRPC server on address tdw-100-76-19-7:30001
2020-11-12 15:56:09,689 INFO worker.AlluxioJobWorkerProcess (AlluxioJobWorkerProcess.java:startServingRPCServer) - Started gRPC server on address tdw-100-76-19-7:30001
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007ff6822993b8, pid=100388, tid=0x00007ff6001c1700
#
# JRE version: OpenJDK Runtime Environment (8.0_252-b04) (build 1.8.0_252-b4)
# Java VM: OpenJDK 64-Bit Server VM (25.252-b4 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [ld-linux-x86-64.so.2+0xb3b8] _dl_relocate_object+0x98
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /data/tdwadmin/tdwenv/panyliu/alluxio-2.5-tq-0.1.0-SNAPSHOT/bin/hs_err_pid100388.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

Here is the stack info in the detailed crash report file.

Stack: [0x00007f23423e8000,0x00007f23424e9000], sp=0x00007f23424e4de0, free space=1011k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [ld-linux-x86-64.so.2+0xab12] _dl_relocate_object+0xa2
C [ld-linux-x86-64.so.2+0x1315f] dl_open_worker+0x38f
C [ld-linux-x86-64.so.2+0xe7b6] _dl_catch_error+0x66
C [libdl.so.2+0xf76] dlopen_doit+0x66

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0
j java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+328
j java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+48
j java.lang.Runtime.load0(Ljava/lang/Class;Ljava/lang/String;)V+57
j java.lang.System.load(Ljava/lang/String;)V+7
j com.sun.jna.Native.loadNativeDispatchLibraryFromClasspath()V+110
j com.sun.jna.Native.loadNativeDispatchLibrary()V+420
j com.sun.jna.Native.<clinit>()V+108
v ~StubRoutines::call_stub
j oshi.jna.platform.linux.LinuxLibc.<clinit>()V+4
v ~StubRoutines::call_stub
j oshi.hardware.platform.linux.LinuxCentralProcessor.getSystemLoadAverage(I)[D+24
j alluxio.worker.job.command.JobWorkerHealthReporter.compute()V+18
j alluxio.worker.job.command.CommandHandlingExecutor.heartbeat()V+4
j alluxio.heartbeat.HeartbeatThread.run()V+78
j java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4
J 2294 C1 java.util.concurrent.FutureTask.run()V (126 bytes) @ 0x00007f244966ea64 [0x00007f244966e800+0x264]
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub

Here is the gdb debug info (I am not familiar with gdb):

(gdb) info shared
From To Syms Read Shared Object Library
No linux-vdso.so.1
0x00007f7ad75fb060 0x00007f7ad75fc4f8 Yes /lib64/libonion.so
0x00007f7ad70c6950 0x00007f7ad70d30f8 Yes /lib64/libpthread.so.0
0x00007f7ad6eac410 0x00007f7ad6eb9778 Yes /data/tdwenv/TencentKona-8.0.3-262/bin/../lib/amd64/jli/libjli.so
0x00007f7ad6ca6e10 0x00007f7ad6ca78e8 Yes /lib64/libdl.so.2
0x00007f7ad6918580 0x00007f7ad6a49594 Yes /lib64/libc.so.6
0x00007f7ad72dfae0 0x00007f7ad72f8950 Yes /lib64/ld-linux-x86-64.so.2
0x00007f7ad5aec870 0x00007f7ad63e9058 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
0x00007f7ad55d4790 0x00007f7ad5641748 Yes /lib64/libm.so.6
0x00007f7ad53c92a0 0x00007f7ad53cc2d8 Yes /lib64/librt.so.1
0x00007f7ad51bb340 0x00007f7ad51c22b8 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libverify.so
0x00007f7ad4f9a5c0 0x00007f7ad4fadf78 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libjava.so
0x00007f7ad4d738a0 0x00007f7ad4d84898 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libzip.so
0x00007f7aa433fe30 0x00007f7aa43470e8 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libnio.so
0x00007f7aa4124bf0 0x00007f7aa4134098 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libnet.so
0x00007f7a98289a70 0x00007f7a9828c498 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libmanagement.so
No /tmp/libnetty_transport_native_epoll_x86_648487691960033771233.so
0x00007f7a69de8790 0x00007f7a69de8b98 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libjaas_unix.so
0x00007f7a68594840 0x00007f7a685b27b8 Yes (*) /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libsunec.so
0x00007f7a68377910 0x00007f7a68387f18 Yes /lib64/libgcc_s-4.4.6-20110824.so.1
No /home/panyliu/.cache/JNA/temp/jna3956775499404260402.tmp
(*): Shared library is missing debugging information.
(gdb) bt
#0 0x00007f7ad692bb15 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f7ad692cf25 in abort () at abort.c:89
#2 0x00007f7ad6211735 in os::abort(bool) () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#3 0x00007f7ad63b8ee3 in VMError::report_and_die() () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#4 0x00007f7ad6218242 in JVM_handle_linux_signal () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#5 0x00007f7ad620d4d3 in signalHandler(int, siginfo*, void*) () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#6 <signal handler called>
#7 _dl_relocate_object (scope=0x7f79b40231d8, reloc_mode=, consider_profiling=0) at dl-reloc.c:238
#8 0x00007f7ad72f215f in dl_open_worker (a=) at dl-open.c:416
#9 0x00007f7ad72ed7b6 in _dl_catch_error (objname=0x7f7a68cd4fd0, errstring=0x7f7a68cd4fc8, mallocedp=0x7f7a68cd4fdf, operate=0x7f7ad72f1dd0 <dl_open_worker>, args=0x7f7a68cd4f80) at dl-error.c:177
#10 0x00007f7ad72f191a in _dl_open (file=0x7f79b4022950 "/home/panyliu/.cache/JNA/temp/jna3956775499404260402.tmp", mode=-2147483647, caller_dlopen=0x7f7ad6214d1d, nsid=-2, argc=19, argv=, env=0x7fffb87486e8)
at dl-open.c:650
#11 0x00007f7ad6ca6f76 in dlopen_doit (a=0x7f7a68cd51a0) at dlopen.c:66
#12 0x00007f7ad72ed7b6 in _dl_catch_error (objname=0x7f79b40011d0, errstring=0x7f79b40011d8, mallocedp=0x7f79b40011c8, operate=0x7f7ad6ca6f10 <dlopen_doit>, args=0x7f7a68cd51a0) at dl-error.c:177
#13 0x00007f7ad6ca72ec in _dlerror_run (operate=0x7f7ad6ca6f10 <dlopen_doit>, args=0x7f7a68cd51a0) at dlerror.c:163
#14 0x00007f7ad6ca6ef1 in __dlopen (file=, mode=) at dlopen.c:87
#15 0x00007f7ad6214d1d in os::dll_load(char const*, char*, int) () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#16 0x00007f7ad600a173 in JVM_LoadLibrary () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/server/libjvm.so
#17 0x00007f7ad4f9b7b8 in Java_java_lang_ClassLoader_00024NativeLibrary_load () from /data/tdwenv/TencentKona-8.0.3-262/jre/lib/amd64/libjava.so
#18 0x00007f7ac1018507 in ?? ()
#19 0x00000007099e9038 in ?? ()
#20 0x00007f7ac10080a1 in ?? ()
#21 0x00007f7a68cd5da8 in ?? ()
#22 0x00007f7ac10080a1 in ?? ()
#23 0x00007f7a68cd5d50 in ?? ()
#24 0x0000000000000000 in ?? ()
(gdb)

The crash seems related to loading a native library. Perhaps the JVM cannot find the .so file, or something else goes wrong during loading, so it receives a signal from the Linux kernel and shuts down.
I know little about how Alluxio uses JNA, so I have not figured out the cause of the crash yet. Any suggestions are appreciated.
By the way, this problem does not happen with the community release, so it seems to be related to the build environment, but I am not sure.
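
The Java frames in the crash report show the native load being reached through OSHI's LinuxCentralProcessor.getSystemLoadAverage, which initializes JNA. A SIGSEGV raised in native code cannot be caught with a Java try/catch, so guarding that call would not help. As a minimal sketch, assuming only the load average is needed, the JDK's own OperatingSystemMXBean can report it without going through JNA. The class name LoadAverageProbe is hypothetical, and this is not the change that was merged for this issue.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: reads the system load average through the JDK,
// without loading OSHI or JNA.
public class LoadAverageProbe {

  // Returns the load average for the last minute, or a negative value
  // if the platform does not provide one.
  static double systemLoadAverage() {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    return os.getSystemLoadAverage();
  }

  public static void main(String[] args) {
    double load = systemLoadAverage();
    if (load < 0) {
      System.out.println("Load average is not available on this platform");
    } else {
      System.out.printf("1-minute load average: %.2f%n", load);
    }
  }
}
```

Running this class on the affected CentOS 6 machine should show whether load-average reporting works there when JNA is not involved.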

To Reproduce
Compile Alluxio locally on CentOS 6.
Then run the following command:
./bin/alluxio-start.sh local

Expected behavior
The job_worker process starts successfully.

Urgency
HIGH

Additional context
When I upgrade OSHI to a version above 5.3.1, the problem is solved.
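
Because the behavior differs between OSHI versions and between a local build and the community release, it may help to confirm which jars the job worker actually picks up. Below is a rough sketch, assuming oshi.SystemInfo and com.sun.jna.Native are the classes of interest. LibraryLocator and describe are hypothetical names. Classes are looked up without initialization so that the check itself does not trigger the JNA native load seen in the stack trace. The Implementation-Version comes from the jar manifest and may be null if the jar does not declare it.

```java
import java.security.CodeSource;
import java.util.Optional;

// Hypothetical diagnostic: reports where a class was loaded from and the
// version recorded in its jar manifest, if any.
public class LibraryLocator {

  static Optional<String> describe(String className) {
    try {
      // initialize=false: do not run static initializers (no native load).
      Class<?> cls = Class.forName(className, false, LibraryLocator.class.getClassLoader());
      Package pkg = cls.getPackage();
      String version = pkg == null ? null : pkg.getImplementationVersion();
      CodeSource src = cls.getProtectionDomain().getCodeSource();
      String location = src == null ? "unknown location" : String.valueOf(src.getLocation());
      return Optional.of(location + " (Implementation-Version: " + version + ")");
    } catch (ClassNotFoundException | LinkageError e) {
      // The class is not on the classpath or could not be linked.
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    for (String name : new String[] {"oshi.SystemInfo", "com.sun.jna.Native"}) {
      System.out.println(name + " -> " + describe(name).orElse("not on classpath"));
    }
  }
}
```

Run it with the same classpath as the job worker. The jar file name in the printed location usually includes the library version.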

@apc999
Contributor

apc999 commented Nov 13, 2020

@liupan664021 which Java version are you running?
For example, what is the output of java -version?

@apc999
Contributor

apc999 commented Nov 13, 2020

@bradyoo @yuzhu do you think this could be related to the Java 11 change?

@yuzhu
Contributor

yuzhu commented Nov 13, 2020

We probably need the Java version before we can begin to investigate. I have not seen this kind of error before.

alluxio-bot pushed a commit that referenced this issue Nov 13, 2020
Update the OSHI version from 4.2.0 to 5.3.5 to solve the JVM crash problem
mentioned in #12502.

Fixes #12502

pr-link: #12504
change-id: cid-1c0b23d6c9cd68b42baaf9eb5193d4be9641b97e
@bradyoo
Contributor

bradyoo commented Nov 13, 2020

It appears that @liupan664021 has figured out the issue and made a fix. The commit is good so I just merged it.

@maobaolong
Contributor

@apc999 We use JDK 8.

alluxio-bot pushed a commit that referenced this issue Nov 18, 2020
Update the OSHI version from 4.2.0 to 5.3.5 to solve the JVM crash problem
mentioned in #12502.

Fixes #12502

pr-link: #12504
change-id: cid-1c0b23d6c9cd68b42baaf9eb5193d4be9641b97e