Implement persistent checkpoints #2184

Open
rocallahan opened this issue Apr 21, 2018 · 40 comments

@rocallahan
Collaborator

We could speed up the debugging process by allowing the trace to contain persistent checkpoints. Here are some features we could support:

  • A rr create-checkpoints command that does a full replay of a trace and periodically saves checkpoints of the state into the trace directory.
  • Make rr replay -g <N> start replay from the most recent persisted checkpoint.
  • Make rr replay load persisted checkpoints during reverse execution.
  • Make rr record optionally save checkpoints periodically into the trace directory to speed up the above operations.

The minimal useful implementation would be the first two features. The other features could be added on top of that.

The main issues are the mechanics of saving and restoring checkpoints and the format of the persisted checkpoint data.

@sidkshatriya
Contributor

Thanks -- what ideas do you have for saving all the application state (registers, mmapped files, file descriptors, etc.) in one go, rather than as a succession of deltas as it is currently? How would that work conceptually with reference to the current codebase? (Just some broad outlines -- I will dig deeper once you suggest a possible approach.)

@rocallahan
Collaborator Author

rr already has a notion of checkpoints. At the Session API level, ReplaySession::clone returns a ReplaySession that serves as a checkpoint. The returned session is "partially initialized", i.e. the thread states are stored in Session::clone_completion. We should implement persistent checkpoints by creating an API on ReplaySession that clones the current session and then writes the contents of CloneCompletion to a file. Serializing Task::CapturedState should be easy.

The tricky bit is serializing the contents of the address spaces held by AddressSpaceClone::clone_leader. A naive approach would be to walk the mappings of each clone_leader's AddressSpace and simply dump the contents into the checkpoint file along with the metadata for each mapping. That would probably be OK, but we could easily do a lot better by avoiding writing to the checkpoint file any pages that we know are equal to the corresponding pages of the mmap_ file in the trace they were mapped from.
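
For illustration only, here is a minimal standalone sketch of that naive approach, using the generic /proc interfaces rather than rr's own AddressSpace bookkeeping (dump_address_space is a made-up name; a real implementation would walk each clone_leader's AddressSpace, emit rr's own mapping metadata, and skip pages known to match the trace's mmap_ files):

  // Dump every readable mapping of a stopped process: one metadata line per
  // mapping, followed by the raw bytes read through /proc/<pid>/mem.
  #include <cstdio>
  #include <fstream>
  #include <ostream>
  #include <string>
  #include <vector>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <unistd.h>

  void dump_address_space(pid_t pid, std::ostream& out) {
    std::ifstream maps("/proc/" + std::to_string(pid) + "/maps");
    int mem_fd = open(("/proc/" + std::to_string(pid) + "/mem").c_str(), O_RDONLY);
    if (mem_fd < 0) return;
    std::string line;
    while (std::getline(maps, line)) {
      unsigned long start, end;
      char perms[5] = {};
      if (sscanf(line.c_str(), "%lx-%lx %4s", &start, &end, perms) != 3) continue;
      if (perms[0] != 'r') continue;        // skip unreadable mappings
      out << line << "\n";                  // mapping metadata
      std::vector<char> buf(end - start);
      ssize_t n = pread(mem_fd, buf.data(), buf.size(), (off_t)start);
      if (n > 0) out.write(buf.data(), n);  // raw contents of the mapping
    }
    close(mem_fd);
  }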

The main part of the checkpoint should be written using a new Capnproto type. The Capnproto data can be followed by the block data itself. Everything should be compressed using brotli and I think we can get all the data into a single file.

@rocallahan
Collaborator Author

For restoring checkpoints I think we need a static function in ReplaySession that's like create but takes two parameters: a trace_dir and a file descriptor for the opened checkpoint file, returning a new ReplaySession in the "partially initialized" state. It would deserialize the data, initialize the clone_leader tasks and create their memory mappings. It's a bit tricky but it would resemble how process_exec works in replay_syscalls.cc: exec an rr_exec_stub(_32), remove its existing mappings, and create new ones.

@rocallahan
Collaborator Author

Let's call that new function load_checkpoint.

I think it would be best if load_checkpoint actually calls ReplaySession::create to create a new ReplaySession and then clones that initial task to create each clone_leader. That would avoid having to modify or duplicate the (complicated) session startup logic.

@sidkshatriya
Contributor

sidkshatriya commented Apr 28, 2018

General update: I have started looking at the code you pointed out above. I'll definitely ask you questions as they come up. Right now I'm trying to achieve a basic understanding of the rr codebase and stepping through the rr source (in gdb) to get a feel for things.

@sidkshatriya
Contributor

sidkshatriya commented Apr 28, 2018

Questions:

  • What exactly is a leader (clone leader, task leader, group leader)?
  • A session tracks multiple tasks. Is a session basically tracking everything in a process (pid)?

In general the nomenclature -- session, task, task group, leaders, group leaders, tid/pid, etc. -- is a bit confusing, as these terms are quite overloaded in systems-programming lingo. Is there a glossary or list of definitions I can check out somewhere that explains what these mean in the context of the rr code?

@rocallahan
Collaborator Author

When cloning a ReplaySession, for each address space, we select one task (group_leader) that's using that address space and cause it to fork(), making a copy of that address space. Those cloned tasks are the clone_leaders.

task_leader isn't really a thing. The comment mentioning it needs to be updated.

Is a session basically tracking everything in a process (pid)?

No, a session tracks an entire set of processes and their threads. When you record, everything in one recording is a RecordSession. Replaying with rr replay -a uses exactly one ReplaySession.

Generally we try pretty hard to make our terms match exactly the way they're used in the kernel. Hence Task, ThreadGroup (not 'task group', which is something obscure that rr doesn't care about), tid. We don't use `pid' much because it's actually a quite ambiguous concept in Linux; 'thread group id' is more specific. The 'leader' idea is rr-specific but hopefully I've explained it adequately above.

@sidkshatriya
Contributor

So:

  • tid == Task == task_struct in Linux, and tgid == ThreadGroup.
  • group_leader is any one tid out of all the tids under a specific tgid (as all the tids under a tgid share an address space). If we have n different tgids then we will have 1 group leader per tgid.

Hope I've got this correct!

@rocallahan
Collaborator Author

Not quite. All threads in a thread-group share the same address space, but it is also possible for different thread-groups to share the same address space thanks to the magic of CLONE_VM.
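
A tiny standalone example of that (not rr code, just vanilla Linux): clone() with CLONE_VM but without CLONE_THREAD creates a child in a new thread group that still shares the parent's address space.

  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE   // for clone()
  #endif
  #include <sched.h>
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static int shared_value = 0;

  static int child_fn(void*) {
    shared_value = 42;   // visible to the parent: the address space is shared
    return 0;
  }

  int main() {
    const size_t stack_size = 1024 * 1024;
    char* stack = (char*)malloc(stack_size);
    // CLONE_VM without CLONE_THREAD: a new thread group, but the same mm.
    pid_t child = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, nullptr);
    waitpid(child, nullptr, 0);
    printf("shared_value = %d\n", shared_value);   // prints 42
    free(stack);
    return 0;
  }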

@sidkshatriya
Contributor

Thanks!

Question: Looking at the code, when you are doing checkpointing you are essentially doing a remote fork() call (in the tracee). This way you preserve the system state at that point, since fork() is copy-on-write (COW), correct? What impact does this have on "ticks", the number of RCBs? Does the checkpointed session (or process??) get the exact same tick count when it does a fork?

@rocallahan
Collaborator Author

Yes.

We reset the hardware tick count every time we resume execution. The only ongoing tick count is the one we keep in Task, and we copy that when we clone each Task.

@sidkshatriya
Contributor

sidkshatriya commented Apr 28, 2018

We reset the hardware tick count every time we resume execution.

When you say resume execution, what exactly does that mean in this context? Resume execution after a syscall, resume execution from a checkpoint, or some other definition of resuming?

Also, practically speaking, the gdb checkpoint command should trigger a ReplaySession::clone, correct? So in that sense, when the gdb client asks for a checkpoint, the rr backend creates it, correct?

Now I also understand that rr uses checkpointing to home in on a (reverse-continue) breakpoint. rr needs to keep finding a series of runs, starting at more and more advanced points of program execution, in which a user-initiated breakpoint is hit, is hit, is hit, and then (finally) is not hit (it then takes the last run in which the breakpoint was hit). So it will use various checkpoints created during this process to start running the execution from, correct?

So essentially my question is: do the gdb concept of checkpointing and rr's internal use of checkpointing (to find the appropriate moment to trigger the breakpoint in the debugger) use the same underlying ReplaySession::clone feature?

@rocallahan
Collaborator Author

When you say resume execution, what exactly does that mean in this context?

Task::resume_execution. That runs every time we run any code in the tracee process via PTRACE_CONT or one of its variants --- user code, syscalls, whatever.

So in that sense, when the gdb client asks for a checkpoint, the rr backend creates it, correct?

Yes, we override the checkpoint command to clone ReplaySession.

So it will use various checkpoints created during this process to start running the execution from, correct?

Yes. That's all controlled by ReplayTimeline.

do the gdb concept of checkpointing and rr's internal use of checkpointing (to find the appropriate moment to trigger the breakpoint in the debugger) use the same underlying ReplaySession::clone feature?

Yes.

@sidkshatriya
Contributor

That would probably be OK, but we could easily do a lot better by avoiding writing to the checkpoint file any pages that we know are equal to the corresponding pages of the mmap_ file in the trace they were mapped from.

I understand this statement conceptually: we want to avoid writing redundant data. Can you explain the mmap_ file handling a bit? I understand that rr saves the mappings upon recording and then stores successive deltas, but an explanation of the whole mmap_ flow would be useful for greater understanding/context.

@khuey
Collaborator

khuey commented Apr 28, 2018

The logic for deciding what to do for a mmap is all in TraceWriter::write_mapped_region. For certain files (like system libraries) we try to be efficient and clone (if the file system supports it) or hardlink the file into the trace. Neither of these operations uses any (appreciable amount of) disk space compared to actually copying the file's contents into the trace.

Some pages of these files will later be changed in memory by the program (e.g. shared libraries after relocations are processed by ld.so). To support efficient checkpointing we could keep track of which pages are ever PROT_WRITE and avoid storing them in the checkpoint (because if they were never PROT_WRITE they can't have changed).
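
A rough sketch of the bookkeeping that idea implies (names like EverWritableSet are made up; this is not how rr stores its mapping state):

  #include <cstddef>
  #include <cstdint>
  #include <set>
  #include <sys/mman.h>   // PROT_WRITE

  // Tracks which pages have ever been mapped or mprotect'd writable. Pages
  // that were never PROT_WRITE cannot have diverged from the file they were
  // mapped from, so a checkpoint could skip their contents and point at the
  // trace's mmap_ file instead. (Assumes 4 KiB pages for simplicity.)
  class EverWritableSet {
    static const uintptr_t kPageSize = 4096;
    std::set<uintptr_t> pages_;
  public:
    void note_protection(uintptr_t addr, size_t len, int prot) {
      if (!(prot & PROT_WRITE)) return;
      uintptr_t start = addr & ~(kPageSize - 1);
      uintptr_t end = (addr + len + kPageSize - 1) & ~(kPageSize - 1);
      for (uintptr_t p = start; p < end; p += kPageSize) pages_.insert(p);
    }
    bool was_ever_writable(uintptr_t addr) const {
      return pages_.count(addr & ~(kPageSize - 1)) != 0;
    }
  };

Every mmap/mprotect the tracee performs would feed note_protection; at checkpoint time, was_ever_writable decides whether a page's contents must be stored or can be reconstructed from the trace.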

@sidkshatriya
Contributor

I've been reading the extended rr technical report from ArXiv. Thank you for such a great technical summary of how rr works.

As part of my familiarization with rr internals I have a question from that paper.

3.3 Detecting Blocked System Calls

[...] Meanwhile the read system call in T is restarted and treated much like a regular non-intercepted system call. Its parameters are rewritten to use scratch buffers. T is resumed using PTRACE_SYSCALL so that the thread will stop when the system call exits. After the system call exits, T will eventually be rescheduled. At that point the system-call exit will be recorded.

The read() system call is restarted with PTRACE_SYSCALL (because we don't want in-process recording here, as the call was found to block). Now as mentioned earlier, this read() could be blocking (maybe waiting for a write() on a pipe). So this should result in the thread being de-scheduled from the processor right away. I don't understand the comment "that the thread will stop when the system call exits". The thread has already been stopped, i.e. descheduled, waiting for the read() call to complete (as it's waiting on a write()), isn't it?

@rocallahan
Collaborator Author

Descheduling the thread from the CPU does not count as a ptrace stop. From ptrace's point of view, the thread is still "running". In ptrace terms the thread doesn't stop until the syscall exits.

@sidkshatriya
Contributor

Thanks -- so the next ptrace-stop happens when the read() call exits. De-scheduling of a process does not cause a ptrace-stop (unless there is a perf event).

So the next question is: how do you prevent a deadlock in the pipe example we have been discussing? The read() call is blocking, yet ptrace still thinks the thread is running. rr would need to start running another thread which has the corresponding call to write() (to unblock the thread containing the read system call) while maintaining the invariant that only 1 thread is running at any given time. How can rr schedule another thread to run while the thread containing read() is "running", i.e. blocked (but to ptrace it means the same thing)?

@khuey
Collaborator

khuey commented Apr 29, 2018

That's what the desched event is for. We set up a desched event fd to send a signal to the tracee when it is descheduled. That signal delivery causes a ptrace trap that rr can see.
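
For anyone following along, the underlying Linux mechanism looks roughly like this (a standalone sketch, not rr's actual code; arm_desched_event is a made-up name and SIGPWR is chosen purely for illustration): a per-thread perf counter for context switches with a sample period of 1, armed so that each deschedule queues a signal, whose delivery produces the ptrace stop rr observes.

  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE   // for F_SETSIG / F_SETOWN_EX
  #endif
  #include <cstring>
  #include <fcntl.h>
  #include <linux/perf_event.h>
  #include <signal.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int arm_desched_event(pid_t tid) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
    attr.sample_period = 1;               // every context switch "overflows"
    attr.disabled = 1;
    int fd = syscall(SYS_perf_event_open, &attr, tid, /*cpu=*/-1,
                     /*group_fd=*/-1, /*flags=*/0);
    if (fd < 0) return -1;
    fcntl(fd, F_SETFL, O_ASYNC);          // deliver a signal on counter overflow
    fcntl(fd, F_SETSIG, SIGPWR);          // which signal to deliver
    struct f_owner_ex owner = { F_OWNER_TID, tid };
    fcntl(fd, F_SETOWN_EX, &owner);       // deliver it to this specific thread
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);  // start counting
    return fd;
  }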

@sidkshatriya
Contributor

I thought the desched event was only there to prevent deadlocks during in-process syscall recording. In fact the paper describes how the desched event prevents such a deadlock when the read() call is not being ptraced. It then describes how the read() system call is restarted using PTRACE_SYSCALL. The question is: how do you prevent it from blocking this time around?

@khuey
Collaborator

khuey commented Apr 29, 2018

Oh, sorry, I misunderstood your question. On the normal ptrace path blocking syscalls return ALLOW_SWITCH from rec_prepare_syscall_arch. That triggers a different path in the scheduler (grep it for PREVENT_SWITCH) which ends up checking is_task_runnable on the current task. That'll do a nonblocking waitpid and if that doesn't return a new ptrace event we'll switch tasks.
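
In code, the check is along these lines (a simplified illustration, not the actual is_task_runnable; has_pending_stop is a made-up name):

  #include <sys/types.h>
  #include <sys/wait.h>

  // Returns true if the tracee already has a ptrace stop (or exit) to report.
  // Returning false means it is still blocked inside the kernel, so the
  // scheduler is free to switch to another runnable task.
  bool has_pending_stop(pid_t tid, int* status) {
    return waitpid(tid, status, WNOHANG | __WALL) == tid;
  }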

@sidkshatriya
Contributor

I have been studying the ReplaySession::clone call in detail. While stepping through the debugger, one can see a remote clone syscall being made. The remote registers for the syscall are set up by doing a flush_regs() call. Then a PTRACE_SINGLESTEP is issued. What is interesting is that ip() is at 0x7000000F when the ptrace call is issued. This is the syscall instruction in the rr_page_64_replay page, so it all works out and makes sense. But what is surprising to me is... where/how does ip() get set to 0x7000000F? I can understand that it needs to be at that address, but we could be anywhere in the tracee codebase. How do we end up at the correct address for the syscall instruction to run?

@khuey
Collaborator

khuey commented Apr 30, 2018

Look at AutoRemoteSyscalls::setup_path.
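
Stripped of rr's abstractions, the general technique looks roughly like this on x86-64 (illustrative only; remote_syscall is a made-up name, and AutoRemoteSyscalls additionally saves/restores memory and chooses the syscall instruction inside the rr page):

  #include <sys/ptrace.h>
  #include <sys/user.h>
  #include <sys/wait.h>
  #include <unistd.h>

  // Make a ptrace-stopped tracee execute one syscall at `syscall_ip`, the
  // address of a known `syscall` instruction in its address space, then
  // restore its original registers.
  long remote_syscall(pid_t tid, unsigned long syscall_ip, long nr,
                      unsigned long arg1, unsigned long arg2) {
    struct user_regs_struct saved, regs;
    ptrace(PTRACE_GETREGS, tid, nullptr, &saved);
    regs = saved;
    regs.rip = syscall_ip;   // point the ip at the syscall instruction
    regs.rax = nr;           // syscall number
    regs.rdi = arg1;
    regs.rsi = arg2;
    ptrace(PTRACE_SETREGS, tid, nullptr, &regs);
    ptrace(PTRACE_SINGLESTEP, tid, nullptr, nullptr);  // execute just that insn
    waitpid(tid, nullptr, __WALL);
    ptrace(PTRACE_GETREGS, tid, nullptr, &regs);       // rax holds the result
    long result = (long)regs.rax;
    ptrace(PTRACE_SETREGS, tid, nullptr, &saved);      // put the tracee back
    return result;
  }

So the answer to "how do we end up at the right address" is simply: the tracer rewrites the register state (including the ip) before resuming, which is what the flush_regs() call pushes to the tracee.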

@sidkshatriya
Contributor

sidkshatriya commented May 3, 2018

I'm trying to understand ReplayTimeline a bit. However the code is pretty dense in there and I'm a bit confused by Mark, ProtoMark and InternalMark. Can someone give me a short (or long :-) ) explanation of the terminology and basic concepts here?

@rocallahan
Collaborator Author

Read all the comments in ReplayTimeline.h and for whatever is still not clear, we should expand those comments.

@sidkshatriya
Contributor

sidkshatriya commented May 4, 2018

  /**
   * All the information we'll need to construct a mark lazily.
   */
  struct ProtoMark {

A couple of sentences about the motivation for creating marks lazily would be useful. Also, Marks contain InternalMarks which contain ProtoMarks -- it's not clear from this encapsulation how ProtoMarks would be a "lazy" version of a mark.

  /**
   * MarkKey + Registers are assumed to identify a unique program state.
   * We can't order these states directly based on this data, so we have to
   * record the ordering in the ReplayTimeline.
   */
  struct InternalMark {

This comment could be elaborated a bit more -- it's a bit cryptic/minimalist at the moment.

Essentially ReplayTimeline is complex because of the various kinds of marks, the various ways of measuring progress, and the various "keys". Your explanation of clone in #2184 (comment) was short but extremely useful. Using that as a jumping-off point I was able to explore the code in a debugger and I now understand session and task cloning quite well. A similar overview explanation would be lovely for ReplayTimeline too!

@sidkshatriya
Contributor

sidkshatriya commented May 4, 2018

Also, conceptually, what does this map "mean":

  std::map<MarkKey, std::vector<std::shared_ptr<InternalMark>>> marks;

MarkKey represents frame time + ticks. InternalMark contains a ProtoMark, which itself stores a MarkKey and registers. So I guess what I want to know is: how can one MarkKey correspond to many InternalMarks?

@rocallahan
Collaborator Author

Hopefully this helps: 0beec8c

@sidkshatriya
Contributor

Thank you so much for the comments. It makes sense now!

@sidkshatriya
Contributor

I have a question about Section 3 - In-process system call interception in the extended technical report

(Section 3) [...] Therefore we inject into the recorded process a library that intercepts common system calls, performs the system call without triggering a ptrace trap [...]

So I understand this interception library to be librrpreload.so that is inserted via LD_PRELOAD.

In Section 3.1

[...] In practice, we have found that method to be insufficient, due to applications making direct system calls, and fragile due to variations in C libraries, and applications that require their own preloading.
Instead, when the tracee makes a system call, RR is notified via a ptrace trap and it tries to rewrite the system-call instruction to call into our interception library.

(Emphasis mine)

The statements in Section 3 and Section 3.1 seem a bit confusing when put together. What do you mean by "common" system calls in Section 3? Are they all the system calls except the ones that "escape" LD_PRELOAD because of (a) direct system calls, (b) variations in C libraries, (c) applications that require their own preloading? Or are there some explicit and common calls simply enumerated and then overridden in librrpreload.so?

The reason that this is important is that you're facing a penalty of 2 context switches for these calls if they raise a ptrace event to rr.

Here is the algorithm as I understand it. A system call happens, and then:

  • Case A: the system call was intercepted by LD_PRELOAD. Allow/deny of the system call happens via seccomp-bpf (see the rough sketch after this list). If "allow" then no ptrace event. If "deny" then a ptrace event happens.
  • Case B: the system call was not intercepted by LD_PRELOAD. A ptrace event is raised, and the system call is rewritten via ptrace to go through the interception library, i.e. Case A.

In other words, is there an explicit enumeration of system calls intercepted in Case A, or is it just the ones that "escape" due to (a) direct system calls, (b) variations in C libraries, (c) applications that require their own preloading?
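
To make the Case A allow/deny step concrete, here is a rough standalone seccomp-bpf sketch of how that decision could be keyed on the instruction pointer (purely illustrative: UNTRACED_SYSCALL_IP and install_filter are made-up names, the address is hypothetical, and rr's real filter is more involved). Syscalls issued from one whitelisted instruction are allowed with no ptrace event; everything else gets SECCOMP_RET_TRACE, i.e. a ptrace stop.

  #include <linux/filter.h>
  #include <linux/seccomp.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/prctl.h>

  // Assumes a little-endian 64-bit target (low word of the ip at offset + 0).
  static const uint64_t UNTRACED_SYSCALL_IP = 0x70000010;  // hypothetical

  static struct sock_filter filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, instruction_pointer)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
             (uint32_t)(UNTRACED_SYSCALL_IP & 0xffffffff), 0, 3),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, instruction_pointer) + 4),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
             (uint32_t)(UNTRACED_SYSCALL_IP >> 32), 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),   // issued from the known ip
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),   // anything else traps to rr
  };

  int install_filter() {
    struct sock_fprog prog = {
      (unsigned short)(sizeof(filter) / sizeof(filter[0])), filter
    };
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
  }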

@rocallahan
Collaborator Author

LD_PRELOAD doesn't intercept system calls. It just gets our library loaded. OK, there are a few API functions that get overridden, but that's not the common case. Syscall interception happens in your Case B.

@sidkshatriya
Contributor

Sorry I'm still a bit confused. Isn't the whole point of in-process system call interception somewhat defeated if each and every system call has to raise at least a single ptrace event? Shouldn't we be raising zero ptrace events for some system calls at least so they do their stuff in the syscall buffer without rr getting in the way?

In fact, Figure 2 in the paper seems to indicate that this is what happens for read(). There is a "syscall_hook()" in the diagram that redirects arg 2 to the syscall buffer and presumably also makes sure things happen purely within the tracee process.

So I guess my question is: how, for some system calls, does the ptrace event get avoided? In fact you even mention in the paper:

(Section 3) [...] Therefore we inject into the recorded process a library that intercepts common system calls, performs the system call without triggering a ptrace trap [...]

But in your above answer you say:

LD_PRELOAD doesn't intercept system calls. It just gets our library loaded. OK, there are a few API functions that get overridden, but that's not the common case.

I thought that LD_PRELOAD should be defining stuff that intercepts some of the system calls, but now you're saying it really only intercepts API calls (e.g. pthread, X windows, etc.).

@rocallahan
Collaborator Author

We raise one ptrace event per syscall instruction. When the same instruction is used again, no ptrace event is raised.

@sidkshatriya
Contributor

Ok that makes sense. What if the instruction is in W^X memory? Do you need to change the memory flags to make it writable? What if the instruction is in a shared library e.g. libc? Does the COW nature of the mapping allow you to change the instructions that are in the libc space?

@rocallahan
Collaborator Author

We patch instructions in writeable memory, hoping that the process isn't going to modify them in a way that messes with our patch. We patch instructions in non-writeable and private memory; we write through /proc/.../mem which bypasses page permissions (although see Task::write_bytes_fallible) and private mappings are COW so that works fine.
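
A bare-bones illustration of that write-through-/proc trick (not rr's code; poke_tracee is a made-up name, and Task::write_bytes_fallible handles several extra cases):

  #include <cstdint>
  #include <cstdio>
  #include <fcntl.h>
  #include <sys/types.h>
  #include <unistd.h>

  // Overwrite `len` bytes at `addr` in a ptrace-stopped tracee. Writes through
  // /proc/<pid>/mem ignore the mapping's page protections, and because private
  // mappings are copy-on-write this modifies only the tracee's copy, not the
  // underlying file or other processes mapping it.
  ssize_t poke_tracee(pid_t tid, uintptr_t addr, const void* bytes, size_t len) {
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/mem", (int)tid);
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    ssize_t written = pwrite(fd, bytes, len, (off_t)addr);
    close(fd);
    return written;
  }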

@sidkshatriya
Contributor

@rocallahan @khuey
Update: I've not been able to devote time to this (very interesting) project in the last couple of months. I expect to restart work on this from this week onwards.

@rocallahan
Collaborator Author

OK thanks for the update!

@alehander92
Contributor

Hey, I am also very interested in this feature: is there any progress / help needed for it?

@rocallahan
Collaborator Author

It sounds like @sidkshatriya has not been able to make any progress, so if you want to have a go yourself, go ahead.

@theIDinside
Contributor

theIDinside commented Aug 5, 2022

So I figured I'd pop in here to report that work on persistent checkpointing is under way. I've gotten it working for recordings made without the syscall buffer (-n).
