Fix issue #713 #714

cshung · 2024-01-24T23:15:02Z

It appears to me that this change in #524 is problematic.

Original:

        unw_word_t hdr;
        if ((*a->access_mem)(as, eh_frame_table, &hdr, 0, arg) < 0) {
            return -UNW_EINVAL;
        }
        struct dwarf_eh_frame_hdr* exhdr = (struct dwarf_eh_frame_hdr*)&hdr;

Changed:

        struct dwarf_eh_frame_hdr* exhdr = NULL;
        if ((*a->access_mem)(as, eh_frame_table, (unw_word_t*)&exhdr, 0, arg) < 0) {
            return -UNW_EINVAL;
        }

In the original code, exhdr always points to a stack slot. In the changed code, exhdr holds whatever access_mem writes into it, or stays NULL if access_mem writes nothing.

This change makes sure exhdr points to the stack again, and I reserve more space on the stack so that the write does not overwrite random stack slots.

bregma · 2024-01-25T15:35:09Z

Previous to #524 the code would just cast a (possibly) uninitialized value to a structure pointer and dereference that pointer, leading to undefined behaviour which apparently usually did something that was not crashing in the mystery test scenario in #713.

The change in #524 initialized that pointer to NULL, so under the same circumstances you get the undefined behaviour of dereferencing a NULL, which happens to cause a segfault on some systems in the mystery test scenario reported in #713.

This change will initialize the pointer to an uninitialized area on the stack, and under the same circumstances will lead to undefined behaviour which apparently does not always crash under whatever mystery test scenario is being used to reproduce #713.

Switching from undefined behaviour to undefined behaviour is not really a very robust fix, even if it makes some unknown symptom in some unknown mystery test failure go away. I would think the acceptable fix is to figure out why the access_mem() call is succeeding but failing to read the pointer data in the first place.

cshung · 2024-01-25T19:49:25Z

To be clear, throughout this reply, existing code means the code before #524 and current state means the code after #524. Remote means the address space of the process that owns the stack to be unwound; local means the address space of the process running libunwind.

I think the key misunderstanding here is that, in the scenario we care about, access_mem writes to the memory pointed to by its 3rd parameter rather than reading from it. The initial value doesn't matter at all, and the segfault has nothing to do with the NULL.

In our scenario, when access_mem is called, the function will be this function as part of the .NET runtime. We are calling this because we initialized the accessor here.

That function goes through multiple layers of abstraction that we don't need to go into. Suffice it to say that, throughout the layers, we interpret the 2nd argument of access_mem as an address on the remote side, and the 3rd argument of access_mem as a buffer to store the data that was read.

In particular, the code never reads the memory pointed to by the 3rd parameter, so it doesn't matter what the initial value was. It will be filled with the remote memory contents anyway.
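A minimal sketch of that contract, assuming a read-only accessor backed by a local byte array that stands in for remote memory. The names `word_t`, `remote_mem`, `mock_access_mem` and `initial_value_is_irrelevant` are hypothetical and are not part of libunwind or the .NET runtime:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for unw_word_t. */
typedef uintptr_t word_t;

/* Fake "remote" memory; in the real scenario it belongs to another process. */
static const unsigned char remote_mem[16] = {
    1, 0x1b, 0x03, 0x3b, 0x10, 0x20, 0x30, 0x40,
    0x50, 0x60, 0x70, 0x80, 0, 0, 0, 0
};

/* Read-only accessor sketch: `addr` is an offset into remote_mem (standing in
   for a remote address) and `valp` is a local output buffer of one word.
   The previous contents of *valp are never read. */
static int mock_access_mem(word_t addr, word_t *valp, int write)
{
    if (write || addr > sizeof remote_mem - sizeof *valp)
        return -1;
    memcpy(valp, remote_mem + addr, sizeof *valp);
    return 0;
}

/* Returns 1 if the result does not depend on what the buffer held before. */
static int initial_value_is_irrelevant(void)
{
    word_t a = 0;            /* buffer "initialized to NULL" */
    word_t b = (word_t)-1;   /* buffer holding arbitrary garbage */
    if (mock_access_mem(0, &a, 0) < 0 || mock_access_mem(0, &b, 0) < 0)
        return 0;
    return a == b;
}
```

Both buffers end up with the same bytes, which is the sense in which the initial value of the 3rd argument does not matter.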

In the original version of the code, the buffer is only sizeof(unw_word_t). I investigated this because the buffer is later reinterpreted as a dwarf_eh_frame_hdr, which looks suspicious because sizeof(unw_word_t) == 8 but sizeof(dwarf_eh_frame_hdr) == 12. Interestingly, as far as I can tell, the implementation of access_mem always reads just sizeof(unw_word_t) bytes, so the existing code was actually right from a buffer size perspective, at least for the access_mem call.
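The size relationship can be checked with a small sketch. `mock_eh_frame_hdr` below is a hypothetical mirror of the layout discussed in this thread for a 64-bit target (four one-byte fields followed by an 8-byte field, packed with the GCC/Clang attribute); it is not copied from dwarf-eh.h:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the header layout on a 64-bit target. */
struct __attribute__((packed)) mock_eh_frame_hdr {
    unsigned char version;
    unsigned char eh_frame_ptr_enc;
    unsigned char fde_count_enc;
    unsigned char table_enc;
    uint64_t      eh_frame;
};

/* Total size of the packed header: 4 + 8 bytes, no padding. */
static size_t header_size(void)
{
    return sizeof(struct mock_eh_frame_hdr);
}

/* Returns 1 if the four one-byte fields all lie inside the first 8-byte word. */
static int byte_fields_fit_in_one_word(void)
{
    return offsetof(struct mock_eh_frame_hdr, table_enc) + 1 <= sizeof(uint64_t);
}

/* Returns 1 if eh_frame starts inside the first word but extends past it,
   so a single-word read cannot deliver that field completely. */
static int eh_frame_straddles_first_word(void)
{
    size_t start = offsetof(struct mock_eh_frame_hdr, eh_frame);
    return start < sizeof(uint64_t) && start + sizeof(uint64_t) > sizeof(uint64_t);
}
```

Under these assumptions, a single-word read covers every field that is accessed through exhdr, but not the whole structure.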

Next, the code attempts to read the rest of dwarf_eh_frame_hdr (i.e. the eh_frame_ptr field and the fde_count field) into some other local variables. The key is that we never read exhdr->eh_frame, so it doesn't matter that &(exhdr->eh_frame) != &eh_frame_start. The same goes for fde_count.

That explains how the existing code worked. Now let's take a look at the current state. After the change, exhdr holds whatever bytes were read from remote memory, and that is why we get the access violation: the eh_frame_hdr data is most likely not a valid pointer in the local process address space.
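To illustrate the difference without triggering the crash, here is a sketch that only inspects the pointer value and never dereferences it. `mock_hdr`, `remote_hdr` and `mock_access_mem` are hypothetical names, and the memcpy into the pointer object stands in for passing `(unw_word_t*)&exhdr` as the buffer:

```c
#include <stdint.h>
#include <string.h>

struct mock_hdr {
    unsigned char version, eh_frame_ptr_enc, fde_count_enc, table_enc;
};

/* One pointer-sized word of "remote" header bytes (hypothetical data). */
static const unsigned char remote_hdr[sizeof(uintptr_t)] = { 1, 0x1b, 0x03, 0x3b };

/* Hypothetical accessor: copies the remote word into the caller's buffer. */
static int mock_access_mem(uintptr_t *valp)
{
    memcpy(valp, remote_hdr, sizeof *valp);
    return 0;
}

/* Post-#524 shape: the pointer variable itself receives the remote bytes.
   Returns the resulting pointer value as an integer. */
static uintptr_t pointer_value_after_read(void)
{
    struct mock_hdr *exhdr = NULL;
    uintptr_t word;
    mock_access_mem(&word);
    memcpy(&exhdr, &word, sizeof exhdr);  /* pointer now holds header bytes */
    return (uintptr_t)exhdr;
}

/* Pre-#524 shape: the remote bytes land in a stack word and are read there. */
static int version_via_stack_word(void)
{
    uintptr_t hdr;
    unsigned char bytes[sizeof hdr];
    mock_access_mem(&hdr);
    memcpy(bytes, &hdr, sizeof hdr);
    return bytes[0];
}
```

In the first function the "pointer" is just the header bytes reinterpreted as an address, so dereferencing it would read from an arbitrary local address. In the second, the version byte is read from the stack copy.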

I can reproduce the access violation deterministically. Unfortunately, that involves building the .NET runtime; let me know if you need it. Alternatively, I can get whatever info you might need from the scenario.

Can you approve the CI run so that I can see if that breaks anything else?

bertwesarg · 2024-01-26T07:30:53Z

I agree with @cshung. Both the existing and the current code are faulty. As long as only the first sizeof(unw_word_t) bytes are accessed via exhdr, it now looks correct. Therefore I suggest adding a comment to that effect, something like "We only access the first 4 char-sized members of dwarf_eh_frame_hdr which always fit into sizeof(unw_word_t)".

I'm also curious why the later dwarf_read_encoded_pointer calls that read the remaining members do not use the address space and the accessors at all. All of the dwarf_readX functions in dwarf_i.h mark them UNUSED.

bregma · 2024-01-26T14:23:31Z

It looks like the design of get_proc_info_in_range() is just broken.

The call to access_mem() reads a pointer from what could be a remote address space, then goes on to dereference it as if it's in the local address space. That's just wrong. No amount of initializing the pointer before it's read is going to change how wrong the design is.

There are no unit tests exercising this API, so I wouldn't expect any change to its implementation to affect CI. This was a mistake when merging the original PR (#377). Visual inspection shows that it won't and cannot fix the actual problem.

The first thing to do to fix this is to come up with a unit test that demonstrates the problem and mark it as XFAIL. Then, in a separate PR, a fix that makes the test pass and un-XFAIL the test.

cshung · 2024-01-26T19:13:09Z

> All of the dwarf_readX functions in dwarf_i.h mark them UNUSED

You are probably reading the wrong implementation there. Those implementations are under the UNW_LOCAL_ONLY define. In the remote scenario, they are defined later here, and these implementations do use the access_mem capability to read remote memory.

Here is a stack showing how dwarf_read_encoded_pointer eventually leads to access_mem in the repro.

(lldb) bt
* thread #1, name = 'createdump', stop reason = step in
  * frame #0: 0x00007f63a4b1f51b libmscordaccore.so`access_mem(as=0x00005578263a85b0, addr=139736535711312, valp=0x00007ffcf6ad5080, write=0, arg=0x00007ffcf6ad5b50) at remote-unwind.cpp:2055
    frame #1: 0x00007f63a4b16874 libmscordaccore.so`dwarf_readu8(as=0x00005578263a85b0, a=0x00005578263a85b0, addr=0x00007ffcf6ad5310, valp="", arg=0x00007ffcf6ad5b50) at dwarf_i.h:144
    frame #2: 0x00007f63a4b16162 libmscordaccore.so`dwarf_readu16(as=0x00005578263a85b0, a=0x00005578263a85b0, addr=0x00007ffcf6ad5310, val=0x00007ffcf6ad5126, arg=0x00007ffcf6ad5b50) at dwarf_i.h:161
    frame #3: 0x00007f63a4b16232 libmscordaccore.so`dwarf_readu32(as=0x00005578263a85b0, a=0x00005578263a85b0, addr=0x00007ffcf6ad5310, val=0x00007ffcf6ad5174, arg=0x00007ffcf6ad5b50) at dwarf_i.h:179
    frame #4: 0x00007f63a4b16472 libmscordaccore.so`dwarf_reads32(as=0x00005578263a85b0, a=0x00005578263a85b0, addr=0x00007ffcf6ad5310, val=0x00007ffcf6ad5238, arg=0x00007ffcf6ad5b50) at dwarf_i.h:241
    frame #5: 0x00007f63a4b15d52 libmscordaccore.so`_Ux86_64_dwarf_read_encoded_pointer [inlined] dwarf_read_encoded_pointer_inlined(as=0x00005578263a85b0, a=0x00005578263a85b0, addr=0x00007ffcf6ad5310, encoding='\e', pi=0x00007ffcf6ad5cc8, valp=0x00007ffcf6ad5308, arg=0x00007ffcf6ad5b50) at dwarf_i.h:416
    frame #6: 0x00007f63a4b15a9d libmscordaccore.so`_Ux86_64_dwarf_read_encoded_pointer(as=0x00005578263a85b0, a=0x00005578263a85b0, addr=0x00007ffcf6ad5310, encoding='\e', pi=0x00007ffcf6ad5cc8, valp=0x00007ffcf6ad5308, arg=0x00007ffcf6ad5b50) at Gpe.c:37
    frame #7: 0x00007f63a4af6ee6 libmscordaccore.so`_Ux86_64_get_proc_info_in_range(start_ip=139736535625728, end_ip=139736535729184, eh_frame_table=139736535711312, eh_frame_table_len=2276, exidx_frame_table=0, exidx_frame_table_len=0, as=0x00005578263a85b0, ip=139736535700266, pi=0x00007ffcf6ad5cc8, need_unwind_info=1, arg=0x00007ffcf6ad5b50) at Gget_proc_info_in_range.c:77
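For illustration only, the shape of that call chain (a signed 32-bit read built from 32-bit, 16-bit and 8-bit reads, each ending in a word-sized accessor call) can be sketched as follows. All names are hypothetical, the data is assumed to be little-endian, and this is not libunwind's actual implementation:

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t word_t;

/* Fake "remote" memory holding 0x12345678 and then -2, both little-endian. */
static const unsigned char remote_mem[16] = {
    0x78, 0x56, 0x34, 0x12, 0xfe, 0xff, 0xff, 0xff,
    0, 0, 0, 0, 0, 0, 0, 0
};

/* Word-granular accessor: only aligned, in-range word reads succeed. */
static int mock_access_mem(word_t addr, word_t *valp)
{
    if (addr % sizeof *valp != 0 || addr > sizeof remote_mem - sizeof *valp)
        return -1;
    memcpy(valp, remote_mem + addr, sizeof *valp);
    return 0;
}

/* Read one byte by fetching the aligned word that contains it. */
static int read_u8(word_t *addr, uint8_t *out)
{
    word_t aligned = *addr & ~(word_t)(sizeof(word_t) - 1);
    word_t word;
    unsigned char bytes[sizeof word];
    if (mock_access_mem(aligned, &word) < 0)
        return -1;
    memcpy(bytes, &word, sizeof word);
    *out = bytes[*addr - aligned];
    *addr += 1;
    return 0;
}

static int read_u16(word_t *addr, uint16_t *out)
{
    uint8_t lo, hi;
    if (read_u8(addr, &lo) < 0 || read_u8(addr, &hi) < 0)
        return -1;
    *out = (uint16_t)(lo | (hi << 8));
    return 0;
}

static int read_u32(word_t *addr, uint32_t *out)
{
    uint16_t lo, hi;
    if (read_u16(addr, &lo) < 0 || read_u16(addr, &hi) < 0)
        return -1;
    *out = (uint32_t)lo | ((uint32_t)hi << 16);
    return 0;
}

static int read_s32(word_t *addr, int32_t *out)
{
    uint32_t u;
    if (read_u32(addr, &u) < 0)
        return -1;
    memcpy(out, &u, sizeof u);  /* reinterpret the bits as signed */
    return 0;
}
```

Every read goes through the accessor, so no remote address is ever dereferenced locally in this sketch.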

cshung · 2024-01-26T19:21:02Z

> then goes on to dereference it as if it's in the local address space.

Only the current state (i.e. code after #524) does. The existing state (i.e. code before #524) copied the remote memory to the stack and then read it by dereferencing a pointer to the stack. The same happens with this change.

For the exhdr we are talking about, we only read eh_frame_ptr_enc and fde_count_enc from it. These are both just unsigned char. Once access_mem has copied the data into the stack slot, it is safe to read them.

> There are no unit tests exercising this API

I like the fact that we insist on having a test, but writing a unit test for this is beyond me. Could you or @am11 give a hand here?

bregma · 2024-01-26T20:16:43Z

I think the following change is clearer and expresses intent better. There is still an aliasing violation and possibly an alignment issue but no worse than the very original pre-524 code.

diff --git a/src/dwarf/Gget_proc_info_in_range.c b/src/dwarf/Gget_proc_info_in_range.c
index 5701c5d2..788aa7a1 100644
--- a/src/dwarf/Gget_proc_info_in_range.c
+++ b/src/dwarf/Gget_proc_info_in_range.c
@@ -58,13 +58,13 @@ unw_get_proc_info_in_range (unw_word_t        start_ip,
     if (eh_frame_table != 0) {
         unw_accessors_t *a = unw_get_accessors_int (as);
 
-        struct dwarf_eh_frame_hdr* exhdr = NULL;
+        struct dwarf_eh_frame_hdr exhdr;
         if ((*a->access_mem)(as, eh_frame_table, (unw_word_t*)&exhdr, 0, arg) < 0) {
             return -UNW_EINVAL;
         }
 
-        if (exhdr->version != DW_EH_VERSION) {
-            Debug (1, "Unexpected version %d\n", exhdr->version);
+        if (exhdr.version != DW_EH_VERSION) {
+            Debug (1, "Unexpected version %d\n", exhdr.version);
             return -UNW_EBADVERSION;
         }
         unw_word_t addr = eh_frame_table + offsetof(struct dwarf_eh_frame_hdr, eh_frame);
@@ -72,12 +72,12 @@ unw_get_proc_info_in_range (unw_word_t        start_ip,
         unw_word_t fde_count;
 
         /* read eh_frame_ptr */
-        if ((ret = dwarf_read_encoded_pointer(as, a, &addr, exhdr->eh_frame_ptr_enc, pi, &eh_frame_start, arg)) < 0) {
+        if ((ret = dwarf_read_encoded_pointer(as, a, &addr, exhdr.eh_frame_ptr_enc, pi, &eh_frame_start, arg)) < 0) {
             return ret;
         }
 
         /* read fde_count */
-        if ((ret = dwarf_read_encoded_pointer(as, a, &addr, exhdr->fde_count_enc, pi, &fde_count, arg)) < 0) {
+        if ((ret = dwarf_read_encoded_pointer(as, a, &addr, exhdr.fde_count_enc, pi, &fde_count, arg)) < 0) {
             return ret;
         }
 
@@ -87,8 +87,8 @@ unw_get_proc_info_in_range (unw_word_t        start_ip,
             return -UNW_ENOINFO;
         }
 
-        if (exhdr->table_enc != (DW_EH_PE_datarel | DW_EH_PE_sdata4)) {
-            Debug (1, "Table encoding not supported %x\n", exhdr->table_enc);
+        if (exhdr.table_enc != (DW_EH_PE_datarel | DW_EH_PE_sdata4)) {
+            Debug (1, "Table encoding not supported %x\n", exhdr.table_enc);
             return -UNW_EINVAL;
         }

cshung · 2024-01-26T21:29:40Z

> I think the following change is clearer and expresses intent better. There is still an aliasing violation and possibly an alignment issue but no worse than the very original pre-524 code.

Thanks for the help! This patch is slightly larger, but it does convey the intent better. I put this patch into my scenario and confirmed that it fixes #713, at least for my repro.

cshung · 2024-01-29T21:45:39Z

Don't worry about the force push. I didn't change any code; I was just experimenting with the CI to see if I can get a test baseline. It looks like a new run requires a new approval, so I just compared it with the previous run instead.

It appears that the new run has a new failure with qemu ppc, and there are a few warnings that GitHub is deprecating Node 16 in favor of Node 20.

None of them appear to be caused by my change.

cshung · 2024-01-30T20:41:31Z

GCC is telling me that it may be better for us to just keep the pre-#524 code at this spot.

  In file included from /__w/1/s/src/native/external/libunwind/src/dwarf/Lget_proc_info_in_range.c:4:
  /__w/1/s/src/native/external/libunwind/src/dwarf/Gget_proc_info_in_range.c: In function ‘_ULx86_64_get_proc_info_in_range’:
  /__w/1/s/src/native/external/libunwind/src/dwarf/Gget_proc_info_in_range.c:62:9: error: converting a packed ‘struct dwarf_eh_frame_hdr’ pointer (alignment 1) to a ‘unw_word_t’ {aka ‘long unsigned int’} pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
     62 |         if ((*a->access_mem)(as, eh_frame_table, (unw_word_t*)&exhdr, 0, arg) < 0) {
        |         ^~
  In file included from /__w/1/s/src/native/external/libunwind/src/dwarf/Gget_proc_info_in_range.c:23:
  /__w/1/s/src/native/external/libunwind/include/dwarf-eh.h:115:32: note: defined here
    115 | struct __attribute__((packed)) dwarf_eh_frame_hdr
        |                                ^~~~~~~~~~~~~~~~~~
  cc1: all warnings being treated as errors

My read of the warning is that because dwarf_eh_frame_hdr is a packed struct, the compiler might end up putting it in a stack slot that is not 8-byte aligned. When we then reinterpret that address as a unw_word_t*, it would be an unaligned pointer to an aligned type, which could be fatal. By declaring a variable of type unw_word_t on the stack first, we force the compiler to align it, and it is not a problem to reinterpret an aligned address as a pointer to a packed struct.

To get around both the current GCC warning and the original arm64 warning that was addressed by #524, I simply increased the buffer size. That should solve both warnings.
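The alignment asymmetry can be sketched with C11 `_Alignof`. `mock_eh_frame_hdr` is a hypothetical mirror of the packed header (GCC/Clang attribute syntax), not the definition from dwarf-eh.h:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed header: packing reduces its alignment requirement. */
struct __attribute__((packed)) mock_eh_frame_hdr {
    unsigned char version;
    unsigned char eh_frame_ptr_enc;
    unsigned char fde_count_enc;
    unsigned char table_enc;
    uint64_t      eh_frame;
};

static size_t packed_hdr_alignment(void)
{
    return _Alignof(struct mock_eh_frame_hdr);
}

static size_t word_alignment(void)
{
    return _Alignof(uint64_t);
}

/* A word-typed stack slot satisfies both alignment requirements, so viewing
   it as the packed header is fine from an alignment point of view. The
   reverse direction is what the warning is about. */
static int word_slot_suits_packed_hdr(void)
{
    uint64_t slot = 0;
    return (uintptr_t)&slot % _Alignof(uint64_t) == 0
        && (uintptr_t)&slot % _Alignof(struct mock_eh_frame_hdr) == 0;
}
```

This only addresses alignment. The aliasing question is separate.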

bregma · 2024-01-31T19:40:30Z

The big problem is the strict aliasing violation. The most correct way to work around this would be like so.

@@ -58,13 +58,15 @@ unw_get_proc_info_in_range (unw_word_t        start_ip,
     if (eh_frame_table != 0) {
         unw_accessors_t *a = unw_get_accessors_int (as);
 
-        struct dwarf_eh_frame_hdr* exhdr = NULL;
-        if ((*a->access_mem)(as, eh_frame_table, (unw_word_t*)&exhdr, 0, arg) < 0) {
+        unw_word_t data;
+        if ((*a->access_mem)(as, eh_frame_table, &data, 0, arg) < 0) {
             return -UNW_EINVAL;
         }
 
-        if (exhdr->version != DW_EH_VERSION) {
-            Debug (1, "Unexpected version %d\n", exhdr->version);
+        struct dwarf_eh_frame_hdr exhdr;
+        memcpy(&exhdr, &data, sizeof(data));
+        if (exhdr.version != DW_EH_VERSION) {
+            Debug (1, "Unexpected version %d\n", exhdr.version);
             return -UNW_EBADVERSION;
         }
         unw_word_t addr = eh_frame_table + offsetof(struct dwarf_eh_frame_hdr, eh_frame);

The compiler will recognize and optimize away the memcpy() call (or at least GCC, clang, and qcc do), alignment rules are always respected, and the compiler will not have any chance to do something unexpected because of the strict aliasing violation, because it's now gone.

And yes, I realize it's almost back to the pre-524 code, except for the aliasing violation.
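As a standalone sketch of the same pattern outside libunwind (read one word through the accessor, then memcpy it into a local header object), assuming a hypothetical `mock_access_mem` and a hypothetical 64-bit mirror of the packed header:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the packed header on a 64-bit target. */
struct __attribute__((packed)) mock_eh_frame_hdr {
    unsigned char version;
    unsigned char eh_frame_ptr_enc;
    unsigned char fde_count_enc;
    unsigned char table_enc;
    uint64_t      eh_frame;
};

/* Fake "remote" header bytes (hypothetical data). */
static const unsigned char remote_hdr[8] = { 1, 0x1b, 0x03, 0x3b, 0, 0, 0, 0 };

/* Hypothetical accessor: fills one aligned word from remote memory. */
static int mock_access_mem(uint64_t *valp)
{
    memcpy(valp, remote_hdr, sizeof *valp);
    return 0;
}

/* Reads the first word of the header into *out without any pointer cast.
   Only the four one-byte fields are complete afterwards; eh_frame is not. */
static int read_header_prefix(struct mock_eh_frame_hdr *out)
{
    uint64_t data;
    if (mock_access_mem(&data) < 0)
        return -1;
    memset(out, 0, sizeof *out);
    memcpy(out, &data, sizeof data);  /* copies 8 of the 12 bytes */
    return 0;
}
```

The word has its natural alignment, the struct is a distinct object, and the only transfer between them is a byte copy, so neither the alignment warning nor the aliasing concern applies to this sketch.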

src/dwarf/Gget_proc_info_in_range.c

Co-authored-by: Bert Wesarg <[email protected]>

bregma

I think this is good now.

Can you also make a PR for the v1.8-stable branch?

Fix issue #713

b22db9e

New version

3cbb401

hoyosjs mentioned this pull request Jan 30, 2024

[release/9.0-preview1] Revert "Update HP libunwind to v1.8.0" dotnet/runtime#97679

Merged

Fix GCC compilation warning by making sure we have the necessary alignment

e5733a4

Fix strict aliasing issue

62bef76

bertwesarg reviewed Feb 1, 2024


Update src/dwarf/Gget_proc_info_in_range.c

42799f2

Co-authored-by: Bert Wesarg <[email protected]>

bregma approved these changes Feb 1, 2024


This was referenced Feb 1, 2024

Fix issue #713 #717

Merged

Patch libunwind manually to fix the access violation to unblock source build dotnet/runtime#97813

Merged

bregma merged commit 5ce1a7a into libunwind:master Feb 1, 2024
15 of 29 checks passed

cshung deleted the public/fix-issue-713 branch February 1, 2024 15:48
