-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue #713 #714
Fix issue #713 #714
Conversation
Previous to #524 the code would just cast a (possibly) uninitialized value to a structure pointer and dereference that pointer, leading to undefined behaviour which apparently usually did something that was not crashing in the mystery test scenario in #713. The change in #524 initialized that pointer to NULL, so under the same circumstances you get the undefined behaviour of dereferencing a NULL, which happens to cause a segfault on some systems in the mystery test scenario reported in #713. This change will initialize the pointer to an uninitialized area on the stack, an under the same circumstances will lead to undefined behaviour which apparently does not always crash under whatever mystery test scenario is being used to reproduce #713. Switching from undefined behaviour to undefined behaviour is not really a very robust fix, even if it makes some unknown symptom in some unknown mystery test failure go away. I would think the acceptable fix is to figure out why the |
To make it clear, throughout this reply, existing code means the code before #524. Current state means the code after #524. Remote means the address space of the process that owns the stack to be unwinded, local means the address space of the process running libunwind. I think the key misunderstanding here is that In our scenario, when That function goes through multiple layers of abstraction that we don't need to go into. Suffice to say, throughout the layers, we interpreted that the 2nd argument of In particular, the code never read the memory pointed by the 3rd parameter, it didn't matter what the initial value was. It will be filled by the remote memory anyway. In the original version of the code, the buffer is only Next, the code attempts to read the rest of That explained how the existing code worked, let's that a look at the current state. After the change I can reproduce the access violation deterministically. Unfortunately that involve building the .NET runtime, let me know if you need it. Alternatively, I can get whatever info you might need from the scenario. Can you approve the CI run so that I can see if that breaks anything else? |
I agree with @cshung. The existing and current code is faulty. And as long as only the first But I'm also curious, why the later |
It looks like the design of The call to There are no unit tests exercising this API so I wouldn't expect any change to its implementation to affect CI. This was a mistake when merging the original PR (#377) Visual inspection shows that it won't and can not fix the actual problem. The first thing to do to fix this is to come up with a unit test that demonstrates the problem and mark it as XFAIL. Then, in a separate PR, a fix that makes the test pass and un-XFAIL the test. |
You are probably reading the wrong implementation there, those implementations are under the Here is a stack on how the
|
Only the current state (i.e. code after #524) does. The existing state (i.e. code before #524) copied the remote memory to the stack and then read by dereferencing a pointer to the stack, the same happened after this change. For the
I liked the fact that we insist on having a test, but writing a unit test for this is beyond me, would you or @am11 can give a hand here? |
I think the following change is clearer and expresses intent better. There is still an aliasing violation and possibly an alignment issue but no worse than the very original pre-524 code.
|
Thanks for the help! This patch is slightly larger, but it does convey the intent better. I put this patch into my scenario and confirms it fixed #713, at least for my repro. |
Don't worry about the force push, I didn't change any code, I was just trying to experiment with the CI to see if I can get a test baseline. It looks like new run requires new approval, so I just compared that with the previous run instead. It appears that the new run has a new failure with qemu ppc, and there are a few warnings about GitHub is depreciating Node 16 in favor of Node 20. None of them feel like caused by my change. |
GCC is telling me maybe it is better for us to just keep the pre #524 code on this spot.
My read of the warning is that because To get around both the current GCC warning and the original arm64 warning that was addressed by #524, I simply increased the buffer size, that should solve both warnings. |
The big problem is the strict aliasing violation. The most correct way to work around this would be like so.
The compiler will recognize and optimize away the And yes, I realize it's almost back to the pre-524 code, except for the aliasing violation. |
Co-authored-by: Bert Wesarg <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good now.
Can you also make a PR for the v1.8-stable branch?
It appears to me that this change in #524 is problematic.
Original:
Changed:
Originally,
exhdr
will always point to the stack, while the latter will point to whateveraccess_mem
may write it to, orNULL
if it doesn't.The change will make sure
exhdr
still point to the stack, and I leave more spaces on the stack so that it will not overwrite random stack slots.