-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle plain exit for tasks #2939
base: master
Are you sure you want to change the base?
Conversation
Can you reproduce that? Can you attach gdb using the emergency debugger and disassemble the instructions around 0xffffffffff600400 ? |
This comment has been minimized.
This comment has been minimized.
Can you disassemble the first frame ( |
This comment has been minimized.
This comment has been minimized.
Ok ... I don't see what in there would make a vsyscall (at least not directly). What does |
Hrm, sorry, actually |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Ok it looks like it's |
This comment has been minimized.
This comment has been minimized.
Or maybe a Christopher Nolan movie.
This doesn't make a lot of sense to me. The vsyscall address (0xffffffffff600400) is here but we don't ever call it? |
Totally!
Now we're both on the same page... and now? [side note: that still doesn't seem related to the PR ;-) ] |
No this has nothing to do with your PR but I'd still like to get to the bottom of it, because this is the third report we've had of it in a week and you're the most responsive reporter so far :) If you |
This comment has been minimized.
This comment has been minimized.
Very last question! What is the output of |
This comment has been minimized.
This comment has been minimized.
Looks like there is no VDSO present on this system, hence the vsyscall was selected. That is very old stuff that I don't want to worry about supporting ... though it could be done if we really wanted to or someone wanted to contribute it. Basically if we can't monkeypatch the vsyscall and restart it, instead we would emulate the vsyscalls in rr via AutoRemoteSyscall. |
As for the actual patch here, this is very tricky code. What is supposed to happen if the task is already exited, is that waitpid will return ECHILD, in which case So I don't want to accept this patch, at least not without a very clear explanation of why the current code doesn't work and how the new code fixes it. |
I don't know more about why it works that way, just have recognized that with the old code the actual process stood at "defunc" and rr waiting "forever" (I waited only 4 minutes) to return from |
This comment has been minimized.
This comment has been minimized.
Is this with the 3.10 kernel? Possibly the kernel behavior changed at some point. |
Yes. Should we query the kernel version and make that conditional? |
Monkeypatcher.cc
A check with different code to be executed if we can't patch the site.
Not really no. Like a lot of rr, this is not easy at all. Whoever fixes this will need to understand exactly how vsyscalls work, and learn to understand AutoRemoteSyscall and other rr machinery. |
No, I don't want to add stuff for a kernel version we don't officially support. I mean, maybe we could hack around this, but missing |
Can you please create a reasonable issue with enough information about that (the one above should be fine) so that we have it in one place - and can close the others that are related as duplicate? |
I'll add info to #2929. |
... and after the patch different simple programs actually work fine for record and replay on kernel 3.10. |
It would only be useful if there are actual users using 3.11 and hitting this bug. (And if this patch is actually correct, which is not clear yet.) |
I totally agree on this part, the only thing I can say is: works here better than the current version (which always lead to "inferior hang" at its process exit, while this version exits cleanly and seems to also record correct as the replay is possible - but because of the vsyscall issue only simple tests are covered so far). Is there anything else I can do or should we handle this as a "stale PR"? |
This comment has been minimized.
This comment has been minimized.
@rocallahan How did you get to this conclusion?
So vDSO "in general" is available in both cases from above. If something is missing in this area would a libc preload be useful or "nothing but a kernel update" (I guess at least at the minor version number, so nothing that will happen for the LTS RHEL7/CentOS7)? |
@khuey identified the vsyscall bug and fixed it here: 8a15f2a
This isn't enough to land a patch. We need to understand the problem we're fixing, why the patch fixes it, and why it makes rr's invariants stronger or at least not weaker. And even than, adding complexity to fix one testcase on a distro release where rr mostly doesn't work for other reasons is probably not worthwhile. |
Did that and recording works like a charm now, even with a much more complex scenario. Replay of the simple and medium tests now work fine; the complex one fails on replay, I'll create an issue for that later. |
The original point of this issue got a bit "hijacked", I've cleaned the post unrelated to the main issue - the actual pr - as those are resolved with later versions of RR. It may be reasonable if others could do the same. I've also checked again and found 4e10088 to be the origin of this code, coming from @Keno - but I have no idea if there's something he can do here (possibly share some wisdom outside of the comment). A small recap of the state of this PR:
"last" means:
--> attached to rr into
but this time the call to Any insights or ideas appreciated. |
The comment seems pretty clear that this will hang in a particular situation. Have you validated that it does not for a particular reason? I don't remember all the details, but I do remember thinking this through pretty carefully and this being the only option. I may have overlooked something, but I'd need to see some evidence that the problem in the comment is resolved. This kind of code you really need to develop while staring at the kernel wait code at the same time. |
The issue with the old kernel and old ode is that it always hangs. With the new code it hangs sometimes with the old kernel. |
Right, I'm suggesting with the new code it will sometimes hang in the new kernel also. |
Rechecked: the "some still hanging" tests after the patch and the old kernel also hang with the new kernel. I've already tested |
fixes #2627, fixes #2640
Note: this definitely should get a code review (seeing the file's history I guess goes to @rocallahan), and likely also an adjustment of the comment in the code.
I've just tested a bit and found in the simple tests I've done (single thread, all child processes of the "inferior" are already finished) pass and that the "hang" is fixed, the record works and the replay is possible, too.
... but other programs (multiple shared objects loaded) do fail with (is that #2929 ?):
I have no clue if this is because of my change in this PR (then they shouldn't be merged) or if this is another issue (the please merge and let others test/fix on top of that)...