Implement persistent checkpoints #2184
Thanks -- what ideas do you have for saving all the application state (registers, mmapped files, file descriptors, etc.) all in one go, rather than as a succession of deltas like it is currently? How would that work conceptually with reference to the current codebase? (Just some broad outlines -- I will dig deeper once you suggest a possible approach.)
rr already has a notion of checkpoints. The tricky bit is serializing the contents of the address spaces held by the tasks. The main part of the checkpoint should be written using a new Capnproto type. The Capnproto data can be followed by the block data itself. Everything should be compressed using brotli, and I think we can get all the data into a single file.
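As a rough sketch of what that on-disk shape could look like (the length-prefix layout here is an assumption for illustration, not rr's actual trace format), appending one brotli-compressed memory block after the Capnproto metadata might be:

```cpp
// Sketch only: append one brotli-compressed memory block to a checkpoint
// file. The header layout is an assumed convention, not rr's format.
#include <brotli/encode.h>
#include <cstdint>
#include <cstdio>
#include <vector>

bool append_compressed_block(FILE* out, const uint8_t* data, size_t len) {
  size_t encoded_size = BrotliEncoderMaxCompressedSize(len);
  std::vector<uint8_t> encoded(encoded_size);
  if (!BrotliEncoderCompress(BROTLI_DEFAULT_QUALITY, BROTLI_DEFAULT_WINDOW,
                             BROTLI_DEFAULT_MODE, len, data, &encoded_size,
                             encoded.data())) {
    return false;
  }
  // Prefix each block with uncompressed and compressed sizes so the restore
  // path knows how much to read and how large a buffer to decompress into.
  uint64_t header[2] = {len, encoded_size};
  return fwrite(header, sizeof(header), 1, out) == 1 &&
         fwrite(encoded.data(), 1, encoded_size, out) == encoded_size;
}
```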
For restoring checkpoints I think we need a static function in …
Let's call that new function … I think it would be best if …
General update: I have started looking at the code you pointed out above. I'll definitely ask you questions as they come up. Right now I'm trying to achieve a basic understanding of the rr codebase and stepping through rr source (in gdb) to get a feel for things.
Question:
In general the nomenclature -- session, task, task group, leaders, group leaders, tid/pid, etc. -- is a bit confusing, as these terms are generally quite overloaded in systems-programming lingo. Is there a glossary or list of definitions I can check out somewhere that would explain what these mean in the context of the rr code?
When cloning a …
No, a session tracks an entire set of processes and their threads. When you record, everything in one recording is a single session. Generally we try pretty hard to make our terms match exactly the way they're used in the kernel. Hence `ThreadGroup` rather than "process".
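For orientation, a minimal sketch of that containment (illustrative shapes only, not rr's real class declarations):

```cpp
// Illustrative containment only: one recording is one session; a session
// tracks many thread-groups (kernel-speak for "process"); each thread-group
// contains tasks (kernel-speak for "thread"), and the group leader is the
// task whose tid equals the thread-group's tgid (what getpid() returns).
#include <map>
#include <memory>
#include <vector>

struct Task { int tid; };                     // one kernel thread
struct ThreadGroup {                          // one "process"
  int tgid;                                   // the leader's tid == tgid
  std::vector<std::shared_ptr<Task>> tasks;
};
struct Session {                              // everything in one recording
  std::map<int, std::shared_ptr<ThreadGroup>> thread_groups;  // keyed by tgid
};
```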
So: … Hope I've got this correct!
Not quite. All threads in a thread-group share the same address space, but it is also possible for different thread-groups to share the same address space, thanks to the magic of `clone(CLONE_VM)`.
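A minimal standalone illustration of that: `clone()` with `CLONE_VM` but without `CLONE_THREAD` creates a new thread-group that shares the caller's address space.

```cpp
// Two thread-groups, one address space: CLONE_VM without CLONE_THREAD.
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>
#include <csignal>
#include <cstdio>
#include <cstdlib>

static int shared_value = 0;

static int child_fn(void*) {
  shared_value = 42;                         // same address space as parent
  std::printf("child pid: %d\n", getpid());  // new pid: its own thread-group
  return 0;
}

int main() {
  const size_t stack_size = 1 << 20;
  char* stack = static_cast<char*>(std::malloc(stack_size));
  // CLONE_VM shares memory; omitting CLONE_THREAD makes a new thread-group.
  pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, nullptr);
  waitpid(pid, nullptr, 0);
  std::printf("parent pid: %d, shared_value: %d\n", getpid(), shared_value);
  std::free(stack);
}
```

The child prints a different pid than the parent (a separate thread-group), yet the parent observes the child's write to `shared_value` (a shared address space).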
Thanks! Question: looking at the code, when you are doing checkpointing you are essentially doing a remote `fork()` …
Yes. We reset the hardware tick count every time we resume execution. The only ongoing tick count is the one we keep in `Task`.
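A sketch of what resetting the tick count on resume could look like, assuming `perf_fd` is a `perf_event_open()` fd counting rr-style ticks (the real bookkeeping lives in rr's Task/PerfCounters code, so names here are illustrative):

```cpp
// Sketch: fold the hardware count into a software total, then zero the
// hardware counter before letting the tracee run again.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>

void resume_with_fresh_ticks(int perf_fd, uint64_t* accumulated_ticks) {
  uint64_t count = 0;
  (void)read(perf_fd, &count, sizeof(count));  // ticks since the last reset
  *accumulated_ticks += count;                 // the only ongoing total
  ioctl(perf_fd, PERF_EVENT_IOC_RESET, 0);     // hardware count back to zero
  ioctl(perf_fd, PERF_EVENT_IOC_ENABLE, 0);
  // ...then PTRACE_CONT / PTRACE_SYSCALL the task...
}
```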
When you say "resume execution", what does that mean exactly in this context: resume execution after a syscall, resume execution from a checkpoint, or some other definition of resuming? Also, practically speaking, … Now, I also understand that rr uses checkpointing to home in on a (reverse-continue) breakpoint: rr needs to keep finding a series of runs, starting at later and later points of program execution, in which a user-initiated breakpoint is hit, is hit, is hit, and then (finally) is not hit; it will just take the run before the one where it is not hit. So it will use various checkpoints created during this process as points to start execution from, correct? So essentially my question is: do the gdb concept of checkpointing and rr's internal use of checkpointing (to find the appropriate moment to trigger the breakpoint in the debugger) use the same underlying rr mechanism?
Yes, we override the gdb checkpoint commands.
Yes. That's all controlled by `ReplayTimeline`.
Yes.
I understand this statement conceptually: we want to avoid writing redundant stuff. Can you explain the mmap_ file steps a bit? I understand that rr saves the mappings upon recording and then successively stores deltas, but an explanation of the whole mmap_ flow would be useful for greater understanding/context.
The logic for deciding what to do for a mmap is all in … Some pages of these files will later be changed in memory by the program (e.g. shared libraries, after relocations are processed by ld.so). To support efficient checkpointing we could keep track of which pages are ever PROT_WRITE and avoid storing the others in the checkpoint (if a page was never PROT_WRITE, it can't have changed).
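A sketch of that suggested optimization (types and page-granular bookkeeping are illustrative, not rr's):

```cpp
// Sketch: remember which pages have ever been mapped PROT_WRITE. At
// checkpoint time, a page that was never writable cannot have diverged
// from its backing mmap_ file, so it need not be stored.
#include <sys/mman.h>
#include <cstdint>
#include <set>

constexpr uintptr_t kPageSize = 4096;

class EverWritablePages {
 public:
  // Call for every mmap/mprotect the recorder observes.
  void on_protection(uintptr_t start, uintptr_t end, int prot) {
    if (!(prot & PROT_WRITE)) return;
    for (uintptr_t p = start & ~(kPageSize - 1); p < end; p += kPageSize)
      pages_.insert(p);
  }
  // At checkpoint time: pages absent from the set can be skipped.
  bool must_save(uintptr_t page) const { return pages_.count(page) != 0; }

 private:
  std::set<uintptr_t> pages_;  // page-granular; fine for a sketch
};
```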
I've been reading the extended rr technical report from arXiv. Thank you for such a great technical summary of how rr works. As part of my familiarization with rr internals, I have a question from that paper.
The read() system call is restarted with PTRACE_SYSCALL (because we don't want in-process recording here, as the call was found to be blocking). Now, as mentioned earlier, this read() could be blocking (maybe waiting for a write() on a pipe), so it should result in the thread being descheduled from the processor right away. I don't understand the comment that "the thread will stop when the system call exits". The thread has already been stopped, i.e. descheduled, waiting for the read() call to complete (as it's waiting on a write()), isn't it?
Descheduling the thread from the CPU does not count as a ptrace stop. From ptrace's point of view, the thread is still "running". In ptrace terms the thread doesn't stop until the syscall exits.
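A toy tracer that makes this concrete: between the syscall-entry stop and the syscall-exit stop, the tracer is simply parked in `waitpid()`; a tracee descheduled inside `read()` produces no stop at all.

```cpp
// Toy PTRACE_SYSCALL loop: "cat" blocks in read(2) on stdin, and while it
// is blocked (descheduled) we sit in waitpid() -- no ptrace stop occurs
// until the syscall actually exits.
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
  pid_t child = fork();
  if (child == 0) {
    ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
    execlp("cat", "cat", static_cast<char*>(nullptr));
    _exit(1);
  }
  int status;
  waitpid(child, &status, 0);  // initial exec stop
  for (;;) {
    ptrace(PTRACE_SYSCALL, child, nullptr, nullptr);  // run to next stop
    if (waitpid(child, &status, 0) < 0 || WIFEXITED(status)) break;
    // Reached twice per syscall: once at entry, once at exit.
    std::printf("syscall boundary\n");
  }
}
```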
Thanks -- so the next ptrace-stop happens when the read() exits. So the next question is: how do you prevent a deadlock in the pipe example we have been discussing? The …
That's what the desched event is for. We set up a desched event fd to send a signal to the tracee when it is descheduled. That signal delivery causes a ptrace trap that rr can see.
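A sketch of such a desched fd, assuming the perf-based setup described in the rr paper (the signal number and ownership details here are illustrative, not rr's exact configuration):

```cpp
// Sketch: a software perf counter that overflows on every context switch,
// wired up so overflow delivers a signal to the task that owns it. The
// resulting signal causes a ptrace trap the tracer can observe.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

int make_desched_fd(pid_t tid, int sig) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_SOFTWARE;
  attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
  attr.sample_period = 1;  // overflow (and thus a signal) on every switch
  attr.disabled = 1;       // armed only around a buffered blocking syscall
  int fd = syscall(SYS_perf_event_open, &attr, tid, /*cpu=*/-1,
                   /*group_fd=*/-1, /*flags=*/0);
  if (fd < 0) return -1;
  fcntl(fd, F_SETFL, O_ASYNC);  // deliver a signal on counter overflow...
  fcntl(fd, F_SETSIG, sig);     // ...this signal, which traps to the tracer
  fcntl(fd, F_SETOWN, tid);     // ...targeted at this task
  return fd;
}
```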
I thought the desched event was only there to prevent deadlocks during in-process syscall recording. In fact the paper describes how the desched event prevents such a deadlock when the …
Oh, sorry, I misunderstood your question. On the normal ptrace path blocking syscalls return …
I have been studying the …
Look at …
I'm trying to understand `ReplayTimeline` a bit. However, the code is pretty dense in there and I'm a bit confused by Marks, ProtoMark, and InternalMark. Can someone give me a short (or long :-) ) explanation of the terminology and basic concepts here?
Read all the comments in `ReplayTimeline.h`.
A couple of sentences about the motivation for creating marks lazily would be useful. Also, Marks contain InternalMarks which contain ProtoMarks -- it's not clear from this encapsulation how ProtoMarks would be a "lazy" version of a mark.
This comment could be elaborated a bit more -- it's a bit cryptic/minimalist at the moment. Essentially, ReplayTimeline is complex because of the various kinds of marks, the various ways of measuring progress, and the various "keys". Your explanation of clone in #2184 (comment) was short but extremely useful; using it as a jumping-off point I was able to explore the code in a debugger, and I now understand session and task cloning quite well. A similar overview explanation would be lovely for ReplayTimeline too!
Also, conceptually, what does this map "mean"? MarkKey represents frame time + ticks. InternalMark contains a ProtoMark, which itself stores a MarkKey and registers. So I guess what I want to know is how one MarkKey can correspond to many InternalMarks.
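For illustration, a sketch of the shapes involved (these declarations are illustrative, not rr's actual ones): a MarkKey -- (trace time, ticks) -- is not unique, because two distinct program states can share a key when the tick count doesn't advance between stops, so each key maps to an ordered list of marks and the registers in the ProtoMark disambiguate among them.

```cpp
// Illustrative only: one key, many marks. The vector is kept in execution
// order, so the timeline can tell which same-key mark comes first.
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Registers { uint64_t ip; /* ... */ };

struct MarkKey {
  uint64_t trace_time;  // frame count in the trace
  uint64_t ticks;       // progress within that frame
  bool operator<(const MarkKey& o) const {
    return trace_time != o.trace_time ? trace_time < o.trace_time
                                      : ticks < o.ticks;
  }
};

struct ProtoMark { MarkKey key; Registers regs; };
struct InternalMark { ProtoMark proto; /* checkpoint, etc. */ };

using MarkIndex =
    std::map<MarkKey, std::vector<std::shared_ptr<InternalMark>>>;
```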
Hopefully this helps: 0beec8c
Thank you so much for the comments. It makes sense now!
I have a question about Section 3, "In-process system call interception", in the extended technical report.
So I understand this interception library to be … In Section 3.1: …
(Emphasis mine.) The statements in Section 3 and Section 3.1 seem a bit confusing when put together. What do you mean by "common" system calls in Section 3? Are they all the system calls except the ones that "escape" LD_PRELOAD because of (a) direct system calls, (b) variations in C libraries, or (c) applications that require their own preloading? Or are there some explicit, common calls simply enumerated and then overridden in …? The reason this is important is that you're facing a penalty of two context switches for these calls if they raise a ptrace event to rr. Here is the algorithm as I understand it: … In other words, is there an explicit enumeration of system calls intercepted in Case A, or is it just the ones that "escape" due to (a) direct system calls, (b) variations in C libraries, or (c) applications that require their own preloading?
LD_PRELOAD doesn't intercept system calls. It just gets our library loaded. OK, there are a few API functions that get overridden, but that's not the common case. Syscall interception happens in your Case B.
Sorry, I'm still a bit confused. Isn't the whole point of in-process system call interception somewhat defeated if each and every system call has to raise at least one ptrace event? Shouldn't we be raising zero ptrace events for at least some system calls, so they do their work in the syscall buffer without rr getting in the way? In fact, Figure 2 in the paper seems to indicate that this is what happens for read(): there is a syscall_hook() in the diagram that redirects arg 2 to the syscall buffer and presumably also makes sure things happen purely within the tracee process. So I guess my question is: how, for some system calls, does the ptrace event get avoided? In fact you even mention in the paper: …
But in your above answer you say: …
I thought that LD_PRELOAD should be defining stuff that intercepts some of the system calls, but now you're saying it really intercepts basically API calls (e.g. pthreads, X Windows, etc.).
We raise one ptrace event per syscall instruction. When the same instruction is used again, no ptrace event is raised.
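A sketch of that bookkeeping (names like `patch_syscall_site` are stand-ins, not rr's identifiers): on the first trap at a given syscall instruction, rewrite the site to call into the in-process interception code and remember the address, so that instruction never traps again.

```cpp
// Sketch: patch each syscall instruction once; after patching, executions
// of that site run in-process with no ptrace event at all.
#include <cstdint>
#include <unordered_set>

class SyscallPatcher {
 public:
  // Called from the ptrace syscall-stop handler with the trapping ip.
  void on_syscall_trap(uintptr_t ip) {
    if (patched_.count(ip)) return;  // already rewritten
    if (patch_syscall_site(ip)) {    // redirect into the in-process hook
      patched_.insert(ip);           // future executions: no ptrace event
    }
  }

 private:
  // Stand-in: the real thing rewrites the syscall instruction to jump into
  // the interception library (e.g. via /proc/<pid>/mem, see below).
  bool patch_syscall_site(uintptr_t /*ip*/) { return false; }
  std::unordered_set<uintptr_t> patched_;
};
```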
OK, that makes sense. What if the instruction is in W^X memory? Do you need to change the memory flags to make it writable? What if the instruction is in a shared library, e.g. libc? Does the COW nature of the mapping allow you to change the instructions that are in the libc space?
We patch instructions in writeable memory, hoping that the process isn't going to modify them in a way that messes with our patch. We patch instructions in non-writeable and private memory by writing through `/proc/<pid>/mem`, which bypasses page protections; for a private mapping the tracee gets its own copy-on-write page, so other users of e.g. libc are unaffected.
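A sketch of that write path (a simplified stand-in for rr's remote-memory machinery): `pwrite()` on `/proc/<pid>/mem` ignores page protections, and writing into a MAP_PRIVATE text mapping triggers copy-on-write for the tracee alone.

```cpp
// Sketch: overwrite instruction bytes in a stopped tracee through
// /proc/<pid>/mem, which works even for non-PROT_WRITE pages.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

bool poke_text(pid_t tid, uintptr_t addr, const uint8_t* bytes, size_t len) {
  char path[64];
  std::snprintf(path, sizeof(path), "/proc/%d/mem", tid);
  int fd = open(path, O_RDWR);
  if (fd < 0) return false;
  // The tracee must be in a ptrace stop for this write to be permitted.
  bool ok = pwrite(fd, bytes, len, static_cast<off_t>(addr)) ==
            static_cast<ssize_t>(len);
  close(fd);
  return ok;
}
```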
@rocallahan @khuey …
OK thanks for the update!
Hey, I am also very interested in this feature: is there any progress on it, or help needed?
It sounds like @sidkshatriya has not been able to make any progress, so if you want to have a go yourself, go ahead.
So I figured I'd pop in here to update that work on persistent checkpointing is under way. I've gotten it to work for recordings made with the syscall buffer disabled (`rr record -n`).
We could speed up the debugging process by allowing the trace to contain persistent checkpoints. Here are some features we could support:

- An `rr create-checkpoints` command that does a full replay of a trace and periodically saves checkpoints of the state into the trace directory.
- `rr replay -g <N>` should start replay from the most recent persisted checkpoint.
- `rr replay` should load persisted checkpoints during reverse execution.
- `rr record` should optionally save checkpoints periodically into the trace directory to speed up the above operations.

The minimal implementation that would be useful would be the first two features. Other features could be added on top of that.
The main issues are the mechanics of saving and restoring checkpoints and the format of the persisted checkpoint data.