Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot see native call frames in GDB #2587

Closed
YaSuenag opened this issue Jun 19, 2020 · 12 comments
Closed

Cannot see native call frames in GDB #2587

YaSuenag opened this issue Jun 19, 2020 · 12 comments
Assignees
Labels

Comments

@YaSuenag
Copy link
Contributor

YaSuenag commented Jun 19, 2020

I tried to see native call frames of customized GraalVM which was built with -H:GenerateDebugInfo=1 in native debugger (GDB) on Linux, but I could not see them in GDB.

I expect I can see call frames which are implemented by Java in libpolyglot. objdump --dwarf=info libpolyglot.so shows DWARF information for Java layer, so GenerateDebugInfo seems to be applied partially.

Steps to reproduce the issue

  1. Build GraalVM with -H:GenerateDebugInfo=1 with following patch
diff --git a/sdk/mx.sdk/mx_sdk_vm_impl.py b/sdk/mx.sdk/mx_sdk_vm_impl.py
index de295302e50..68343ab58d7 100644
--- a/sdk/mx.sdk/mx_sdk_vm_impl.py
+++ b/sdk/mx.sdk/mx_sdk_vm_impl.py
@@ -1893,6 +1893,7 @@ class GraalVmSVMNativeImageBuildTask(GraalVmNativeImageBuildTask):
         build_args = [
             '--macro:' + GraalVmNativeProperties.macro_name(self.subject.native_image_config),
             '-H:NumberOfThreads=' + str(self.parallelism),
+            '-H:GenerateDebugInfo=1',
         ]
         if self.subject.native_image_config.is_polyglot:
             build_args += ["--macro:truffle", "--language:all"]
diff --git a/substratevm/mx.substratevm/mx_substratevm.py b/substratevm/mx.substratevm/mx_substratevm.py
index cc551793ed9..6899a5df20b 100644
--- a/substratevm/mx.substratevm/mx_substratevm.py
+++ b/substratevm/mx.substratevm/mx_substratevm.py
@@ -976,6 +976,7 @@ mx_sdk_vm.register_graalvm_component(mx_sdk_vm.GraalVmJreComponent(
         "-Dorg.graalvm.polyglot.nativeapi.libraryPath=${.}/../../../polyglot/",
         "-H:CStandard=C11",
         "-H:+SpawnIsolates",
+        "-H:GenerateDebugInfo=1",
     ],
     polyglot_lib_jar_dependencies=[
         "substratevm:POLYGLOT_NATIVE_API",
  1. Create C source file to crash

libpolyglot-crash.c

#include <stddef.h>
#include <polyglot_api.h>

main(){
  poly_thread dummy_thread; /* Not initialized */
  poly_create_context_builder((poly_thread)&dummy_thread, NULL, 0, NULL);
}
  1. Compile and run libpolyglot-crash.c
$ gcc -I$GRAALVM_HOME/lib/polyglot -L$GRAALVM_HOME/lib/polyglot libpolyglot-crash.c -lpolyglot
libpolyglot-crash.c:4:1: warning: return type defaults to 'int' [-Wimplicit-int]
    4 | main(){
      | ^~~~
$ LD_LIBRARY_PATH=$GRAALVM_HOME/lib/polyglot ./a.out
Segmentation fault (core dumped)
  1. Check native call frames on GDB
$ coredumpctl debug -1
           PID: 53134 (a.out)
  :
(gdb) bt
#0  0x00007fda0f8e7d6c in ?? ()
   from /home/yasuenag/github/graal/sdk/latest_graalvm_home/lib/polyglot/libpolyglot.so
#1  0x00007fda0f9198c9 in ?? ()
   from /home/yasuenag/github/graal/sdk/latest_graalvm_home/lib/polyglot/libpolyglot.so
#2  0x00007fda0f91985c in ?? ()
   from /home/yasuenag/github/graal/sdk/latest_graalvm_home/lib/polyglot/libpolyglot.so
#3  0x0000000000000000 in ?? ()

Describe GraalVM and your environment:

  • GraalVM version: upstream (commit 64d60d2)
  • JDK major version: 11
  • OS: Fedora 32
  • Architecture: AMD64
@olpaw
Copy link
Member

olpaw commented Jun 19, 2020

@YaSuenag it would be great if you could come up with a reproducer that does not involve changing how libpolyglot gets built.
How about a "hello world" that is built with --shared -H:GenerateDebugInfo=1. For inspiration see:

if '--shared' in args:

@YaSuenag
Copy link
Contributor Author

AFAICS libpolyglot.so in GraalVM 20.1.0 and upstream does not have DWARF entry for Java layer, so we need to recompile it with GenerateDebugInfo.

I tried to build GraalVM with following patch, but still does not work well on GDB.

diff --git a/sdk/mx.sdk/mx_sdk_vm_impl.py b/sdk/mx.sdk/mx_sdk_vm_impl.py
index de295302e50..4453e49ff15 100644
--- a/sdk/mx.sdk/mx_sdk_vm_impl.py
+++ b/sdk/mx.sdk/mx_sdk_vm_impl.py
@@ -247,6 +247,8 @@ def registered_graalvm_components(stage1=False):
                                '-Dgraalvm.libpolyglot=true',
                                '-Dorg.graalvm.polyglot.install_name_id=@rpath/jre/lib/polyglot/<lib:polyglot>',
                                '--tool:all',
+                               '--shared',
+                               '-H:GenerateDebugInfo=1',
                            ] + libpoly_build_args,
                         is_polyglot=True,
                     )],

@YaSuenag
Copy link
Contributor Author

Of course we can reproduce same problem on GraalVM 20.1.0 with my reproducer (libpolyglot-crash.c)

@adinn
Copy link
Collaborator

adinn commented Jun 19, 2020

Hi @YaSuenag @olpaw
I'm not clear what the problem is here as I don't know how the polyglot library is built. It looks like it is a shared lib (.so) built using the native image generator which you are then linking into a C program. Is that right?

If so can you provide precise instructions as to how to build the library (it is does not appear to be built when I run mx build in the substratevm directory).

@YaSuenag Are you sure your build tree includes this PR that is needed allow gdb 9.x to work (that is the version of gdb that is used on fedora 32).

@YaSuenag
Copy link
Contributor Author

Hi @YaSuenag @olpaw
I'm not clear what the problem is here as I don't know how the polyglot library is built. It looks like it is a shared lib (.so) built using the native image generator which you are then linking into a C program. Is that right?

Yes, libpolyglot.so is built from PolyglotNativeAPI.java by native image generator. So I expect that I can see Java frame in native backtrace.

If so can you provide precise instructions as to how to build the library (it is does not appear to be built when I run mx build in the substratevm directory).

I run mx --env ce build in vm directory (it is introduced in vm/README.md, and set GRAALVM_HOME to <graal src>/sdk/latest_graalvm_home.

@YaSuenag Are you sure your build tree includes this PR that is needed allow gdb 9.x to work (that is the version of gdb that is used on fedora 32).

Yes, head of my source tree is 64d60d2.

@adinn
Copy link
Collaborator

adinn commented Jun 19, 2020

Hi @YaSuenag Thanks for the info.

I'll try to follow your instructions and see if I can find an error in the DWARF debug info or in gdb that explains the crash.

@adinn
Copy link
Collaborator

adinn commented Jun 22, 2020

HI @YaSuenag
I built libpolyglot and it contains all the expected symbols as you described. There appear to be a few things going wrong here.

The first problem is that gdb needs to know where to find libpolyglot.so. I ran the program from inside gdb with LD_LIBRARY_PATH set correctly and saw this:

$ LD_LIBRARY_PATH=..../graal/sdk/latest_graalvm_home/lib/polyglot gdb a.out
Reading symbols from a.out...
(gdb) set directories .:../sources/graal:../sources/jdk
(gdb) b main
Breakpoint 1 at 0x40112e: file libpolyglot-crash.c, line 6.
(gdb) run
. . .
Breakpoint 1, main () at libpolyglot-crash.c:6
6	  poly_create_context_builder((poly_thread)&dummy_thread, NULL, 0, NULL);
(gdb) s
com.oracle.svm.core.code.IsolateEnterStub::PolyglotNativeAPI_poly_create_context_builder_232c64e3b5beacd8181c73f1c062cb7e8da69f3d(org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotIsolateThread, org.graalvm.nativeimage.c.type.CCharPointerPointer, org.graalvm.word.UnsignedWord, org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotContextBuilderPointer)(void) () at com/oracle/svm/core/code/IsolateEnterStub.java:1
1	/*
(gdb) s
442	            if (thread.isNull()) {
(gdb) s
49	        throw new IllegalArgumentException("Value of VM thread local variable cannot be accessed during native image generation");
(gdb) s
59	        throw new IllegalArgumentException("Value of VM thread local variable cannot be accessed during native image generation");
(gdb) s
403	        boolean needSlowPath = ThreadingSupportImpl.needsNativeToJavaSlowpath() || !StatusSupport.compareAndSetNativeToNewStatus(newStatus);
(gdb) s
com.oracle.svm.core.code.IsolateEnterStub::PolyglotNativeAPI_poly_create_context_builder_232c64e3b5beacd8181c73f1c062cb7e8da69f3d(org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotIsolateThread, org.graalvm.nativeimage.c.type.CCharPointerPointer, org.graalvm.word.UnsignedWord, org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotContextBuilderPointer)(void) () at com/oracle/svm/core/threadlocal/FastThreadLocalInt.java:75
75	        throw new IllegalArgumentException("Value of VM thread local variable cannot be accessed during native image generation");
(gdb) bt
#0  com.oracle.svm.core.code.IsolateEnterStub::PolyglotNativeAPI_poly_create_context_builder_232c64e3b5beacd8181c73f1c062cb7e8da69f3d(org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotIsolateThread, org.graalvm.nativeimage.c.type.CCharPointerPointer, org.graalvm.word.UnsignedWord, org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotContextBuilderPointer)(void) () at com/oracle/svm/core/threadlocal/FastThreadLocalInt.java:59
#1  0x0000000000401149 in main () at libpolyglot-crash.c:6
(gdb) s
403	        boolean needSlowPath = ThreadingSupportImpl.needsNativeToJavaSlowpath() || !StatusSupport.compareAndSetNativeToNewStatus(newStatus);
(gdb) s
com.oracle.svm.core.code.IsolateEnterStub::PolyglotNativeAPI_poly_create_context_builder_232c64e3b5beacd8181c73f1c062cb7e8da69f3d(org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotIsolateThread, org.graalvm.nativeimage.c.type.CCharPointerPointer, org.graalvm.word.UnsignedWord, org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotContextBuilderPointer)(void) () at com/oracle/svm/core/thread/Safepoint.java:404
404	        if (BranchProbabilityNode.probability(BranchProbabilityNode.VERY_SLOW_PATH_PROBABILITY, needSlowPath)) {
(gdb) s    com.oracle.svm.core.code.IsolateEnterStub::PolyglotNativeAPI_poly_create_context_builder_232c64e3b5beacd8181c73f1c062cb7e8da69f3d(org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotIsolateThread, org.graalvm.nativeimage.c.type.CCharPointerPointer, org.graalvm.word.UnsignedWord, org.graalvm.polyglot.nativeapi.types.PolyglotNativeAPITypes$PolyglotContextBuilderPointer)(void) () at com/oracle/svm/core/code/IsolateEnterStub.java:1
1	/*
(gdb) s
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb) 

So, if the LD_LIBRARY_PATH Is set when gdb is started then the debugger appears to find symbols and refers code addresses back to the original source file. The only oddity I see above is that method PolyglotNativeAPI_poly_create_context_builder_232c64e.... is located at line 1 of file IsolateEnterStub.java. That's not too surprising as it appears to be a compiler generated method.

The second problem is that the debugger does not find a thread stack in the core file.

$ LD_LIBRARY_PATH=/home/adinn/redhat/openjdk/graal/graal/sdk/latest_graalvm_home/lib/polyglot gdb a.out
. . .
Reading symbols from a.out...
(gdb) bt
No stack.
(gdb) 

I think this is because the crash is not dumping details of the stack at the SEGV into the core file. I don't think it is a problem with the debuginfo because the debug session shows gdb working right up to the crash point.

@YaSuenag
Copy link
Contributor Author

YaSuenag commented Jun 23, 2020

Thanks @adinn ! I could saw call stacks when I set breakpoint in GDB.

I checked details in below, I guess the information is available what we need at least, however they seem not to work well on GDB. Do you have any idea to resolve it?

Details

Backtrace on GDB

(gdb) bt
#0  0x00007fffee036d7c in ?? ()
   from /home/yasuenag/github/graal/sdk/latest_graalvm_home/lib/polyglot/libpolyglot.so
#1  0x00007fffee0688d9 in ?? ()
   from /home/yasuenag/github/graal/sdk/latest_graalvm_home/lib/polyglot/libpolyglot.so
#2  0x00007fffee06886c in ?? ()
   from /home/yasuenag/github/graal/sdk/latest_graalvm_home/lib/polyglot/libpolyglot.so
#3  0x0000000000000000 in ?? ()

Load address from /proc/<PID>/maps

7fffed836000-7fffedb77000 r--p 00000000 fd:00 728417                     /home/yasuenag/github/graal/sdk/mxbuild/linux-amd64/GRAALVM_CE_JAVA11/graalvm-ce-java11-20.2.0-dev/lib/polyglot/libpolyglot.so
7fffedb77000-7ffff0fbb000 r-xp 00341000 fd:00 728417                     /home/yasuenag/github/graal/sdk/mxbuild/linux-amd64/GRAALVM_CE_JAVA11/graalvm-ce-java11-20.2.0-dev/lib/polyglot/libpolyglot.so
7ffff0fbb000-7ffff7fc4000 r--p 03785000 fd:00 728417                     /home/yasuenag/github/graal/sdk/mxbuild/linux-amd64/GRAALVM_CE_JAVA11/graalvm-ce-java11-20.2.0-dev/lib/polyglot/libpolyglot.so
7ffff7fc4000-7ffff7fc5000 r--p 0a78d000 fd:00 728417                     /home/yasuenag/github/graal/sdk/mxbuild/linux-amd64/GRAALVM_CE_JAVA11/graalvm-ce-java11-20.2.0-dev/lib/polyglot/libpolyglot.so
7ffff7fc5000-7ffff7fc7000 rw-p 0a78e000 fd:00 728417                     /home/yasuenag/github/graal/sdk/mxbuild/linux-amd64/GRAALVM_CE_JAVA11/graalvm-ce-java11-20.2.0-dev/lib/polyglot/libpolyglot.so

Offset of crashed address

0x7fffee036d7c - 0x7fffed836000 = 0x800D7C‬

DWARF information of 0x800D7C‬

 <0><2604b>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <2604c>   DW_AT_language    : 11    (Java)
    <2604d>   DW_AT_name        : (indirect string, offset: 0xdb78c9): VMErrorSubstitutions.java
    <26051>   DW_AT_low_pc      : 0x800d50
    <26059>   DW_AT_high_pc     : 0x801076
    <26061>   DW_AT_stmt_list   : 0x3d65c7
 <1><26065>: Abbrev Number: 2 (DW_TAG_subprogram)
    <26066>   DW_AT_name        : (indirect string, offset: 0xcbbcc7): com.oracle.svm.core.jdk.VMErrorSubstitutions::doShutdown(java.lang.String, java.lang.Throwable)
    <2606a>   DW_AT_low_pc      : 0x800d50
    <26072>   DW_AT_high_pc     : 0x801076
    <2607a>   DW_AT_external    : 1

@adinn
Copy link
Collaborator

adinn commented Jun 23, 2020

I'm not sure why gdb is failing to resolve the crash pc in the core to an address in libpolyglot.so. It does look like all the information needed is available to gdb. I will see if I can identify the problem by debugging gdb itself. Also, I will ask the Red Hat gdb team if they can explain what is going wrong.

@YaSuenag
Copy link
Contributor Author

Thank you so much @adinn !

BTW I tried to use LLDB (lldb-10.0.0-1.fc32.x86_64) to analyze core, it shows interesting backtrace as below:

(lldb) bt
* thread #1, name = 'a.out', stop reason = signal SIGSEGV
  * frame #0: 0x00007f234933fd7c libpolyglot.so`IsolateEnterStub__VmLocatorSymbol__vmLocatorSymbol__bec84cad1f8708102cd8814ef3e496531bf6ff5b__bbf2dbb2d6a07a8e1dae8a3072b01ad86ecc1a50 at RealLog.java:385
    frame #1: 0x00007f2348b2e9e0
    frame #2: 0x00007f2348992042 libc.so.6`__libc_start_main(main=(a.out`main), argc=1, argv=0x00007fffdea7ed78, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffdea7ed68) at libc-start.c:308:16
    frame #3: 0x000000000040106e a.out`_start + 46

I wonder why VmLocatorSymbol points to RealLog.java.

I checked symbols in libpolyglot.so with nm, it do not have any Java symbols. Does it work well if we add symbols of Java methods?

@adinn
Copy link
Collaborator

adinn commented Jun 23, 2020

Hi @YaSuenag
I think I have found what is going wrong. I tried using x/i 0xnnnnnn to list instructions at various code addresses in the library range and noticed that up to a specific address they are listed as

0xnnnnn <name_of_method>  insn op1, op2

and after that address they get listed as as

0xnnnnn insn op1, op2

i.e. this looks like a problem with the .debug_aranges section. The last range for which a method name was printed was StrictMathInvoker::tanh. and its range section includes this

Address            Length
0000000000804740 0000000000000019 
0000000000804760 0000000000000019 
0000000000804780 0000000000000019 
00000000008047a0 0000000000000019 
00000000008047c0 0000000000000019 
00000000008047e0 0000000000000019 
0000000000804800 0000000000000019 
0000000000804820 0000000000000019 
0000000000804840 0000000000000019 
000000000302d320 0000000000000026 
000000000302d350 0000000000000026 
. . .

i.e. the address ranges in this compilation unit include a massive jump forward out of the normal generated method code range. Later range entries are ignored because they switch back to addresses in the range starting at 0x804860.

Looking at the info section around that sudden offset jump I see this

. . .
    <25a96>   DW_AT_name        : (indirect string, offset: 0x3d4864): com.oracle.svm.core.jdk.StrictMathInvoker::sinh(org.graalvm.word.WordBase, org.graalvm.word.WordBase, double)
    <25a9a>   DW_AT_low_pc      : 0x804820
    <25aa2>   DW_AT_high_pc     : 0x804839
    <25aaa>   DW_AT_external    : 1
 <1><25aab>: Abbrev Number: 2 (DW_TAG_subprogram)
    <25aac>   DW_AT_name        : (indirect string, offset: 0xb684a2): com.oracle.svm.core.jdk.StrictMathInvoker::tanh(org.graalvm.word.WordBase, org.graalvm.word.WordBase, double)
    <25ab0>   DW_AT_low_pc      : 0x804840
    <25ab8>   DW_AT_high_pc     : 0x804859
    <25ac0>   DW_AT_external    : 1
 <1><25ac1>: Abbrev Number: 2 (DW_TAG_subprogram)
    <25ac2>   DW_AT_name        : (indirect string, offset: 0x3c75b6): com.oracle.svm.core.jdk.StrictMathInvoker::acos**(org.graalvm.word.WordBase, org.graalvm.word.WordBase, double)
    <25ac6>   DW_AT_low_pc      : 0x302d320
    <25ace>   DW_AT_high_pc     : 0x302d346
    <25ad6>   DW_AT_external    : 1
 <1><25ad7>: Abbrev Number: 2 (DW_TAG_subprogram)
    <25ad8>   DW_AT_name        : (indirect string, offset: 0xfa30fb): com.oracle.svm.core.jdk.StrictMathInvoker::asin**(org.graalvm.word.WordBase, org.graalvm.word.WordBase, double)
    <25adc>   DW_AT_low_pc      : 0x302d350
    <25ae4>   DW_AT_high_pc     : 0x302d376
    <25aec>   DW_AT_external    : 1
. . .

The methods with name StrictMathInvoker::acos** and StrictMathInvoker::asin** appear to be stubs that invoke methods of libm.

An assumption of the current debug info generator is that methods associated with a given class are all compiled at once and occupy a continuous address range in the generated code section. The range generator simply walks through the methods in compile order and outputs ranges based on the method offset in the code segment. So, it looks like I will either have to filter out the range entries for these stubs or else do some sorting before generating the address ranges.

I am surprised the asserts in the debug info generator code did not object to successive ranges for classes crossing over each other. I will investigate further and let you know when I work out what is needed to fix this.

@adinn
Copy link
Collaborator

adinn commented Jun 24, 2020

It seems these extra methods whose names end with a '**' suffix are used as deoptimization targets. They are generated after all the normal methods associated with the class. That means that each class is associated with two distinct ranges of code addresses and that ranges from one class can be interleaved with ranges from a later class.

Unfortunately, the compilation units in the info section need to be sorted in non-overlapping, lo to hi address ranges. Similarly, the frame and range sections also need to list method frame info and range info in non-overlapping, lo to hi address ranges. So, in order to keep gdb happy I have to treat classes which have deopt targets as two compilation units,splitting the methods between them and generate the relevant content in two stages.

I have a patch ready and will raise a PR for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants