-
Notifications
You must be signed in to change notification settings - Fork 588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replay failed with "Expected syscall_bp_vm to be clear" #3285
Comments
This is usually a rarer way for a divergence to manifest (i.e. we missed the expected syscall breakpoint and ended up somewhere else). |
Ok, I believe what happens here is that the tracee rewrites the ip in sigreturn to hijack execution, write out the error messages and then exit. This is safe to do in the context of the original application because the syscalls after the hijack are carefully controlled, but unfortunately for us, I believe this causes us to jump out of the syscallbuf, causing corrupted syscallbuf state the next time we enter the syscall. Let me see if I can come up with a reproducer. |
This one is tricky. Because of the possibility of switching away in the signal handler, I don't think we can really support taking signals in the syscallbuf at all, because that'll leave syscallbuf state undefined. Our current mechanism for deferring syscalls is also insufficient, because it doesn't handle syscall restarts properly. I think we may need to significantly rejigger the way we handle this to always force the tracee out of the syscallbuf before delivering a signal or restarting a desched syscall. |
Yeah, for a long time I've considered that it might be good to handle signal interruption of buffered syscalls by unwinding the syscall buffering so signal handlers see the "correct" stack. One issue with that, though, is that I think we would need to unpatch the syscall as well so the IP is correct and the syscall restarts normally. And that sounds expensive if we have to constantly unpatch and repatch syscalls. |
Maybe we could get away with a set of special stubs that do "syscall; jmp back to application code" which we set the IP to after we've unwound the syscallbuf stack. IP wouldn't be in an application binary mapping, which could confuse some signal handlers, but registers and stack would at least be correct. |
I've thought about this a fair bit and I think what I want to do here is to put an extra syscall instruction and jump back to the application hook into the extended jump patch. One core issue here is that we somewhere need to store the address to jump back to. I briefly considered using some of the padding inside the sigframe to store the value of |
That's pretty much what I had in mind. |
So, I've got this basically working, but of course stack unwinding out of the bail path is a problem, because that's now in the extended jump patch rather than in the syscallbuf code for which we had previously carefully crafted the unwind info. The best thing I could come up with to fix that is to have librrpage create a bunch of empty pages that the return stubs go into, give them appropriate unwind info and use that to return. I thought about dynamic registration, but the problem is that all those unwinders have function based apis that are generally not re-entrant or signal safe and we have no guarantee about when we add an extended jump patch. A similar problem applies to dynamically loading a .so that has template space. As a result of this we'd only have a finite number of syscall locations that could have jump stubs with proper unwind info. That said, I think as long as we set that number high enough, we're probably unlikely to run into any problem, so it's more of a theoretical concern, but I'd like to hear your thoughts. |
I guess I should say that there is an alternative where rr writes out its own unwind table format and we teach the various unwind libraries how to read that with a one-time registration, but of course that would force the application developers to upgrade or switch unwind libraries, which is probably a no-go. |
Does the bail path have to be in the stub? I made sure the clone fallback path on aarch64 is in librrpreload for better unwind info.... |
It doesn't if we can have a place to store the jump back address to that gets properly saved/restored across sigreturns. As I mentioned above I considered using the signal frame padding for this purpose, but I'm not 100% comfortable with it, because we don't know what external code might do with it. We could potentially play games with r11, since that gets cleared overridden on syscall entry anyway, but that might be a bit too magic. |
So a fixed address in thread local storage won't work because syscall in the signal handler might mess with it? |
yes, the sigreturn may switch away and abandon the stack, never returning to the interrupted syscall. |
That said, perhaps the best option is just to unpatch the syscall if we see a signal being actually delivered in the bail path and just fudging things out from under the signal handler to make it seem that the signal was delivered in the ordinary program. We could re-patch on the next execution as long as we add a special case to the sigreturn code to unpatch again if there's a sigreturn into a patched region. Patching/unpatching is reasonably expensive, but it's not all that more expensive than setting/unsetting breakpoints, which we already do in this situation, so perhaps it's not that bad. |
Would RR know if the sig return finished without returning to where it started? If so then it seems that just using alt stack (or another stack accessed from thread local storage) would work and the tracer would need to manually reset it if an unusual return happens. |
There isn't really anything that prevents the tracee from just capturing the state arbitrarily and jumping to it later at arbitrary points (and real applications rely on this). From the kernel's perspective, getting a signal is a setjmp and |
I was mainly thinking that I'm pretty sure in an earlier version of the julia signal handler (may or may not be in the merged version) I was directly using assembly code to jump to a different target (also I think the SIGFPE handler still does that). |
yeah, so I was mainly wondering if we could notice that on the next sigreturn that the previous one didn't run. But I guess there's nothing illegal for the application to just spend hours in a signal handler before doing a normal sigreturn and short of reliably unwinding the stack we wouldn't know if that has happened. I was also thinking the value of the stack pointer could be an indication that the tracee has abandoned the signal frame but if the application does any sort of tricks with the stack then that won't work.... |
There's probably heuristics that could work reasonably well, but 99% probably isn't good enough here. So far I'm liking the patching and unpatching the best. There's some annoying edge cases, but it gives us the nice property that application signal handlers will never see any ip outside the application code. I think it'd be too expensive for every desched, but as long as it's only on delivered signals, I think we're probably ok. |
Actually, maybe we don't even have to unpatch at all. What if we just fudge the ip in the signal frame to point into the original location where the syscall instruction would have been and fix that back up on sigreturns. The signal handler would be able to unwind correctly, because the cfi still matches what was there before. Sure, if it actually went and read instruction memory, it'd get confused, but that's the case with my unpatching scheme also. |
Do you mean unpatch on sigreturns, or adjust the IP on sigreturns? I think the former would be safer right? |
I mean adjust the IP to point into the tail of the extended jump patch. I've got this working now and it seems to be working well. Just fighting with GDB a bit to give somewhat reasonable unwind info. Will have a PR up soon. |
OK. I wonder how you handle the trailing instructions of the syscall hooks. |
Just copy them into the extended jump patch, so the patch looks like
|
This is a major redesign of the syscallbuf code with the goal of establishing the invariant that we never switch away from a tracee while it's in the syscallbuf code. Instead, we unwind the syscallbuf code completely and execute the syscall at a special syscall instruction now placed in the extended jump patch. The primary motivation for this that this fixes #3285, but I think the change is overall very beneficial. We have significant complexity in the recorder to deal with the possibility of interrupting the tracee during the syscallbuf. This commit does not yet remove this complexity (the change is already very big), but that should be easy to do as a follow up. Additionally, we used to be unable to perform syscall buffering for syscalls performed inside a signal handler that interrupted a syscall. This had performance implications on use cases like stack walkers, which often perform multiple memory-probing system calls for every frame to deal with the possibility of invalid unwind info. There are many details here, but here's a high level overview. The layout of the new extended jump patch is: ``` <stack setup> call <syscallbuf_hook> // Bail path returns here <stack restore> syscall <code from the original patch site> // Non-bail path returns here. jmp return_addr ``` One detail worth mentioning is what happens if a signal gets delivered once the tracee is out of the syscallbuf, but still in the extended jump patch (i.e. after the stack restore). In this case, rr will rewrite the ip of the signal frame to point to the equivalent ip in the original, now patched code section. Of course the instructions in question are no longer there, but the CFI will nevertheless be generally accurate for the current register state (excluding weird CFI that explicitly references the ip of course). This allows unwinders in the end-user-application to never have to unwind through any frame in the rr syscallbuf, which seems like a desirable property. Of course, `sigreturn` must perform the opposite transformation to avoid actually returning into a patched-out location. The main drawback of this scheme is that while the application will never see a location without CFI, GDB does still lack unwind information in the extended jump stub. This is not a new problem, but syscall events are now in the extended jump stub, so they come up quite frequently. I don't think this is a huge problem - it's basically the same situation we used to have before the vdso changes. I believe the best way to fix this would be to establish some way of having rr inform gdb of its jump patches (in fact gdb already has this kind of mechanism for tracepoints, it's just not exposed for tracepoints initiated by the gdb server), but I don't intend to do this myself anytime in the near future. That said, I should note that doing this would not require any changes on the record side, so could be done anytime and start working retroactively for already recorded traces.
This adds a test case to model #3285, where the test case pokes the sigframe to force sigreturn to switch to a different function than that which incurred the signal.
This is a major redesign of the syscallbuf code with the goal of establishing the invariant that we never switch away from a tracee while it's in the syscallbuf code. Instead, we unwind the syscallbuf code completely and execute the syscall at a special syscall instruction now placed in the extended jump patch. The primary motivation for this that this fixes #3285, but I think the change is overall very beneficial. We have significant complexity in the recorder to deal with the possibility of interrupting the tracee during the syscallbuf. This commit does not yet remove this complexity (the change is already very big), but that should be easy to do as a follow up. Additionally, we used to be unable to perform syscall buffering for syscalls performed inside a signal handler that interrupted a syscall. This had performance implications on use cases like stack walkers, which often perform multiple memory-probing system calls for every frame to deal with the possibility of invalid unwind info. There are many details here, but here's a high level overview. The layout of the new extended jump patch is: ``` <stack setup> call <syscallbuf_hook> // Bail path returns here <stack restore> syscall <code from the original patch site> // Non-bail path returns here. jmp return_addr ``` One detail worth mentioning is what happens if a signal gets delivered once the tracee is out of the syscallbuf, but still in the extended jump patch (i.e. after the stack restore). In this case, rr will rewrite the ip of the signal frame to point to the equivalent ip in the original, now patched code section. Of course the instructions in question are no longer there, but the CFI will nevertheless be generally accurate for the current register state (excluding weird CFI that explicitly references the ip of course). This allows unwinders in the end-user-application to never have to unwind through any frame in the rr syscallbuf, which seems like a desirable property. Of course, `sigreturn` must perform the opposite transformation to avoid actually returning into a patched-out location. The main drawback of this scheme is that while the application will never see a location without CFI, GDB does still lack unwind information in the extended jump stub. This is not a new problem, but syscall events are now in the extended jump stub, so they come up quite frequently. I don't think this is a huge problem - it's basically the same situation we used to have before the vdso changes. I believe the best way to fix this would be to establish some way of having rr inform gdb of its jump patches (in fact gdb already has this kind of mechanism for tracepoints, it's just not exposed for tracepoints initiated by the gdb server), but I don't intend to do this myself anytime in the near future. That said, I should note that doing this would not require any changes on the record side, so could be done anytime and start working retroactively for already recorded traces.
This adds a test case to model #3285, where the test case pokes the sigframe to force sigreturn to switch to a different function than that which incurred the signal.
This is a major redesign of the syscallbuf code with the goal of establishing the invariant that we never switch away from a tracee while it's in the syscallbuf code. Instead, we unwind the syscallbuf code completely and execute the syscall at a special syscall instruction now placed in the extended jump patch. The primary motivation for this that this fixes #3285, but I think the change is overall very beneficial. We have significant complexity in the recorder to deal with the possibility of interrupting the tracee during the syscallbuf. This commit does not yet remove this complexity (the change is already very big), but that should be easy to do as a follow up. Additionally, we used to be unable to perform syscall buffering for syscalls performed inside a signal handler that interrupted a syscall. This had performance implications on use cases like stack walkers, which often perform multiple memory-probing system calls for every frame to deal with the possibility of invalid unwind info. There are many details here, but here's a high level overview. The layout of the new extended jump patch is: ``` <stack setup> call <syscallbuf_hook> // Bail path returns here <stack restore> syscall <code from the original patch site> // Non-bail path returns here. jmp return_addr ``` One detail worth mentioning is what happens if a signal gets delivered once the tracee is out of the syscallbuf, but still in the extended jump patch (i.e. after the stack restore). In this case, rr will rewrite the ip of the signal frame to point to the equivalent ip in the original, now patched code section. Of course the instructions in question are no longer there, but the CFI will nevertheless be generally accurate for the current register state (excluding weird CFI that explicitly references the ip of course). This allows unwinders in the end-user-application to never have to unwind through any frame in the rr syscallbuf, which seems like a desirable property. Of course, `sigreturn` must perform the opposite transformation to avoid actually returning into a patched-out location. The main drawback of this scheme is that while the application will never see a location without CFI, GDB does still lack unwind information in the extended jump stub. This is not a new problem, but syscall events are now in the extended jump stub, so they come up quite frequently. I don't think this is a huge problem - it's basically the same situation we used to have before the vdso changes. I believe the best way to fix this would be to establish some way of having rr inform gdb of its jump patches (in fact gdb already has this kind of mechanism for tracepoints, it's just not exposed for tracepoints initiated by the gdb server), but I don't intend to do this myself anytime in the near future. That said, I should note that doing this would not require any changes on the record side, so could be done anytime and start working retroactively for already recorded traces.
This adds a test case to model #3285, where the test case pokes the sigframe to force sigreturn to switch to a different function than that which incurred the signal.
This is a major redesign of the syscallbuf code with the goal of establishing the invariant that we never switch away from a tracee while it's in the syscallbuf code. Instead, we unwind the syscallbuf code completely and execute the syscall at a special syscall instruction now placed in the extended jump patch. The primary motivation for this that this fixes #3285, but I think the change is overall very beneficial. We have significant complexity in the recorder to deal with the possibility of interrupting the tracee during the syscallbuf. This commit does not yet remove this complexity (the change is already very big), but that should be easy to do as a follow up. Additionally, we used to be unable to perform syscall buffering for syscalls performed inside a signal handler that interrupted a syscall. This had performance implications on use cases like stack walkers, which often perform multiple memory-probing system calls for every frame to deal with the possibility of invalid unwind info. There are many details here, but here's a high level overview. The layout of the new extended jump patch is: ``` <stack setup> call <syscallbuf_hook> // Bail path returns here <stack restore> syscall <code from the original patch site> // Non-bail path returns here. jmp return_addr ``` One detail worth mentioning is what happens if a signal gets delivered once the tracee is out of the syscallbuf, but still in the extended jump patch (i.e. after the stack restore). In this case, rr will rewrite the ip of the signal frame to point to the equivalent ip in the original, now patched code section. Of course the instructions in question are no longer there, but the CFI will nevertheless be generally accurate for the current register state (excluding weird CFI that explicitly references the ip of course). This allows unwinders in the end-user-application to never have to unwind through any frame in the rr syscallbuf, which seems like a desirable property. Of course, `sigreturn` must perform the opposite transformation to avoid actually returning into a patched-out location. The main drawback of this scheme is that while the application will never see a location without CFI, GDB does still lack unwind information in the extended jump stub. This is not a new problem, but syscall events are now in the extended jump stub, so they come up quite frequently. I don't think this is a huge problem - it's basically the same situation we used to have before the vdso changes. I believe the best way to fix this would be to establish some way of having rr inform gdb of its jump patches (in fact gdb already has this kind of mechanism for tracepoints, it's just not exposed for tracepoints initiated by the gdb server), but I don't intend to do this myself anytime in the near future. That said, I should note that doing this would not require any changes on the record side, so could be done anytime and start working retroactively for already recorded traces.
This adds a test case to model #3285, where the test case pokes the sigframe to force sigreturn to switch to a different function than that which incurred the signal.
I built from the PR (after cherry-picking some commits for gcc 13 compatibility), and still get this crash: Details
|
In trace
julia-9
from https://buildkite.com/organizations/julialang/pipelines/julia-master/builds/13170/jobs/01819131-f9a7-48e4-926f-23e48419c663/artifacts/018191c4-b7b8-461b-97c8-6f6f44bfc49f, we have when attempting to replay:[snip]
Will look into this further.
The text was updated successfully, but these errors were encountered: