Signal Handling #1682

Open
skmp opened this issue May 4, 2022 · 42 comments
Labels
documentation Improvements or additions to documentation

skmp commented May 4, 2022

Splitting from #1558 & #1677, as well as discussions with @neobrain and @Sonicadvance1.

The issues

(a) Signals can interrupt the JIT compiler, syscall handling, other FEX-related code, 3rd party libraries, or thunked libraries, none of which are guaranteed to be signal re-entrant safe. Any code that touches non-stack memory or uses mutexes is possibly not signal safe. We currently block signals around some code, either using ScopedSignalMaskWith* guards or manually (eg, the dispatcher disabling signal handling around calls to CompileCode).

(b) Signals can interrupt the translated code in the middle of operations that would normally be atomic wrt signals. This may or may not be a problem, depending on how we have implemented x86. A good example is REP* operations. This can be an issue even without LSE elimination, as the recovered guest state might be "torn".

(c) Similar to above, signals can interrupt the translated code in places where we can't recover the guest architectural state, due to optimisations.

(d) Similar to above, synchronous signals might be generated which need to recover a full context and cannot be deferred.

Group 1: From x86 instructions

  • SIGSEGV (memops, permissions / unmapped memory)
  • SIGBUS (memops, mapping past end of file)
  • SIGFPE (all floating point exceptions? Integer overflow too?)

Group 2: Handled from the x86 frontend

  • SIGILL (not handled instruction)
  • SIGTRAP (breakpoint, int3 or int 0x3)
  • SIGEMT (not generated)

Group 3: Generated from system calls

  • SIGSYS (Bad system call, SVr4; seccomp)
  • SIGABRT (raise / __pthread_kill / kill others?)

(e) Signal latency. Whenever we block signals, like we do around ::CompileCode, or with the signal + mutex lock guards, signal delivery gets delayed. This is mostly a concern for long-standing / non constant time signal blocking, like around ::CompileCode (which can take 10+ milliseconds with complex blocks). There is an argument to be made that we should compile blocks faster, though that will never 100% solve the issue. Also, signal handlers can be delayed while code for them is getting compiled, particularly during their first run.

(f) With deferred signals the opposite problem also appears: we consume the signal too fast. I'm not sure if this can result in an extra signal being queued while a signal is deferred. Also, the signal might appear 'dequeued' to the sender while it is still 'pending' in FEX, which might lead to some guest instructions running (a bit of 'execution overshoot'). This condition can be detected, but is extremely unlikely to matter to the guest.

(g) While signal delivery is not guaranteed to happen at any particular speed, lovely features like signal queue merging, which can lose information about the delivered signals, can uncover bugs / assumptions made in the guest code.

Current status

Our current "signal safety strategy" for (a) is to sprinkle signal disabling code around regions that deadlock. This is very inconsistent throughout the codebase, and there are several bugs waiting to be hit. In general, this is a compromise between "likely to lockup" and "performant code".

For (b) and (c) we currently only partially recover the guest architectural state, store it alongside the host architectural state, and hope the guest code doesn't care too much about the contents of the guest state, and that it will not modify it. We depend on returning to the interrupted host code using the stored host architectural state, in order to resume execution in the middle of any torn instructions, and eventually exit from some point with a valid guest state. This poses another limitation: the interrupted block cannot be discarded from the code cache, so the code cache cannot be cleared. This might also have further implications around SMC and code invalidations.
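As a rough illustration of the scoped-guard style of signal blocking mentioned for (a), here is a minimal sketch built on plain pthread_sigmask. ScopedSignalBlock and IsBlocked are hypothetical names used only for this example; they are not the actual ScopedSignalMaskWith* guards.

```cpp
#include <pthread.h>
#include <signal.h>

// Hypothetical guard (not FEX's real ScopedSignalMaskWith* types): blocks
// every blockable signal on the calling thread for the object's lifetime.
class ScopedSignalBlock {
public:
  ScopedSignalBlock() {
    sigset_t all;
    sigfillset(&all);
    // SIGKILL and SIGSTOP cannot be blocked; the kernel silently ignores them here.
    pthread_sigmask(SIG_SETMASK, &all, &saved_);
  }
  ~ScopedSignalBlock() { pthread_sigmask(SIG_SETMASK, &saved_, nullptr); }

  ScopedSignalBlock(const ScopedSignalBlock&) = delete;
  ScopedSignalBlock& operator=(const ScopedSignalBlock&) = delete;

private:
  sigset_t saved_;  // mask that was active before the guard was created
};

// Helper for experiments: is `sig` currently blocked on the calling thread?
inline bool IsBlocked(int sig) {
  sigset_t cur;
  pthread_sigmask(SIG_SETMASK, nullptr, &cur);
  return sigismember(&cur, sig) == 1;
}
```

Every guard costs two syscalls, and a synchronous fault (eg SIGSEGV) raised while it is active cannot be handled, which is part of why this approach is a compromise.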

Proposed solutions

For (a) I'd like us to have clear guidelines on how to handle this, as well as a mode that might be slower but offers guaranteed stability. This needs some thought, but is not too hard.

For (b) and (c) the only viable solution I can think of is a combination of deferring the signal delivery until we have a fully recoverable guest state, and storing metadata that can help us exit from the middle of a block. (c) can be avoided by limiting store elimination from LSE and disabling DSE. We can have a tradeoff between "defer delay" vs "runtime performance".

For (d.1), we'll need special state flushing semantics and/or recovery metadata and/or exit blocks in instructions that may cause them. This requires extra caution around SRA.

For (d.2), the frontend can take care of everything.

For (d.3), we can likely merge it with the syscall handling case of (a)

For (e) we can implement some form of 'aborts' for long running cases with blocked signals, ie early exits during ::CompileCode or even possible conditional aborts ie temporarily pausing the execution but only aborting if re-executed before getting resumed.

For (f), we can modify the behavior of syscalls where signal queueing status can be detected, and make them take actual signal delivery by FEX to the guest into account. This cannot be perfect during guest/host process interop.

For (g), we can implement 'user mode queueing', possibly on top of (f), to get closer to native guest behaviour.

(e) + (f) + (g) are edge case behaviors that are unlikely to matter in practice, and can mostly get triggered by compilation stutter completely altering the expected timing of the guest application.

Related Tickets

#518, #650, #1228, #1666

Other information

Unity depends on at least graceful handling of asynchronous SIGPWR, SIGXCPU (GC, loose context requirements) and SIGSEGV w/ null pointers (NullReferenceException generation, strict context requirements).

skmp commented May 10, 2022

(Updated with more details on synchronous signals, improved formatting and unity details)

skmp commented May 11, 2022

(Updated with more details about signal latency, execution overshoot, and signal queue merging)

skmp commented May 11, 2022

As an alternative to blocking signals during critical sections, with signal deferring we can handle the signal after the critical section, or at specific signal handling spots. This greatly reduces the complexity and overhead of avoiding deadlocks.
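To make the deferring idea concrete, here is a minimal single-threaded sketch. All names (DeferSignals, OnSignal, HandleNow, InstallDeferredHandler) are hypothetical and this is not how FEX implements it; a real version would need per-thread state and would have to keep the siginfo and context around.

```cpp
#include <signal.h>

static volatile sig_atomic_t g_defer_depth = 0;  // >0 while inside a critical section
static volatile sig_atomic_t g_pending = 0;      // a signal arrived while deferred
static volatile sig_atomic_t g_handled = 0;      // how many times the real work ran

// Stand-in for the real signal work (eg, delivering to the guest).
static void HandleNow() { g_handled = g_handled + 1; }

// Host handler: only records the signal if we are inside a critical section.
static void OnSignal(int) {
  if (g_defer_depth > 0) {
    g_pending = 1;
    return;
  }
  HandleNow();
}

// RAII critical section: no signal mask syscalls, just a counter.
struct DeferSignals {
  DeferSignals() { g_defer_depth = g_defer_depth + 1; }
  ~DeferSignals() {
    g_defer_depth = g_defer_depth - 1;
    if (g_defer_depth == 0 && g_pending) {
      g_pending = 0;
      HandleNow();  // deliver what was deferred
    }
  }
};

inline void InstallDeferredHandler(int sig) {
  struct sigaction sa = {};
  sa.sa_handler = OnSignal;
  sigemptyset(&sa.sa_mask);
  sigaction(sig, &sa, nullptr);
}
```

Note that the single pending flag merges multiple signals arriving during one critical section, which is the same kind of information loss described in (g).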

skmp commented May 11, 2022

Another complication is signals that cannot be blocked or caught, namely SIGKILL and SIGSTOP. While these are not much of an issue when (a) targeted at the process itself, they may be an issue if they can be (b) directed at threads. pthread_kill doesn't allow that to happen, however tgkill/tkill might. Also, pthread_suspend_np & friends[1] have to be implemented somehow if they exist on Linux.

This ties in with our thread / process lifetime handling and exit(2) / exit_group(2), and I'm not 100% sure how that works. From a high level perspective, I know we use SIG63 to stop and terminate threads. We possibly need a separate documentation ticket there.

Complications from (a) might show up around post-guest-process-termination cleanup work, such as flushing / merging AOT/Code Caching logic.

Complications from (b) in addition might also show up from thread specific memory leaks, like code buffers or helper objects.

Ideas for solutions

The only way to handle SIGKILL gracefully is to do a sleight of hand in the sender, either sending another signal before, or replacing SIGKILL with some special signal. This won't work 100% correctly with guest/host interop.

Another way is to have a "watcher" daemon that takes care of things as threads and processes die. This is more complex, but can work 100% correctly with guest/host interop.

[1] pthread_suspend_np & friends

     int pthread_suspend_np(pthread_t thread);
     void pthread_suspend_all_np(void);
     int pthread_resume_np(pthread_t thread);
     void pthread_resume_all_np(void);

@skmp skmp added the documentation Improvements or additions to documentation label May 11, 2022

skmp commented May 16, 2022

Another edge case that we have to consider is the host side of a thunked library registering signal handlers.

skmp commented May 16, 2022

(Deferred signals investigation is in #1666, will update here with a summary once that is closed)

skmp commented May 16, 2022

Another edge case is how to handle cpu state reconstruction around thunks.

Of course, reconstructing the context is impossible if we deliver the signal while the thunk is running. Also, depending on the library that is thunked, it might be unreasonable for the guest to assume things about the cpu state or the code that is being run. Exceptions are applications that might try to interpret the opcodes in the signal handler, eg around segfaults. Another case could be detecting interrupted system calls.

Thunks may perform syscalls, block or execute for an arbitrary amount of time, so deferring delivery to the return to guest could lead to bugs. And synchronous signals cannot be blocked.

A "plausible" state to present to the guest would be {invalid register context, RIP pointing to the guest-thunk function thunkOp}.

Handling signal delivery in thunks does complicate things a bit for fully deferred signals as it forces us to store the host state along with the guest context.

skmp commented May 17, 2022

On storing the host context and returning with rt_sigreturn

Based on the linux source (https://elixir.bootlin.com/linux/latest/source/arch/x86/include/uapi/asm/sigcontext.h#L1920)

struct sigframe
{
    char __user *pretcode;
    int sig;
    struct sigcontext sc;
    struct _xstate fpstate;
    unsigned long extramask[_NSIG_WORDS-1];
    char retcode[8];
};

struct rt_sigframe
{
    char __user *pretcode;
    int sig;
    struct siginfo __user *pinfo;
    void __user *puc;
    struct siginfo info;
    struct ucontext uc;
    struct _xstate fpstate;
    char retcode[8];
};

and

struct _xstate {
	struct _fpstate			fpstate;
	struct _header			xstate_hdr;
	struct _ymmh_state		ymmh;
	/* New processor state extensions go here: */
};

and

struct _header {
	__u64				xfeatures;
	__u64				reserved1[2];
	__u64				reserved2[5];
};

Where xfeatures can be

/*
 * List of XSAVE features Linux knows about:
 */
enum xfeature {
	XFEATURE_FP,
	XFEATURE_SSE,
	/*
	 * Values above here are "legacy states".
	 * Those below are "extended states".
	 */
	XFEATURE_YMM,
	XFEATURE_BNDREGS,
	XFEATURE_BNDCSR,
	XFEATURE_OPMASK,
	XFEATURE_ZMM_Hi256,
	XFEATURE_Hi16_ZMM,
	XFEATURE_PT_UNIMPLEMENTED_SO_FAR,
	XFEATURE_PKRU,
	XFEATURE_PASID,
	XFEATURE_RSRVD_COMP_11,
	XFEATURE_RSRVD_COMP_12,
	XFEATURE_RSRVD_COMP_13,
	XFEATURE_RSRVD_COMP_14,
	XFEATURE_LBR,
	XFEATURE_RSRVD_COMP_16,
	XFEATURE_XTILE_CFG,
	XFEATURE_XTILE_DATA,

	XFEATURE_MAX,
};

#define XFEATURE_MASK_FP		(1 << XFEATURE_FP)
#define XFEATURE_MASK_SSE		(1 << XFEATURE_SSE)
#define XFEATURE_MASK_YMM		(1 << XFEATURE_YMM)
#define XFEATURE_MASK_BNDREGS		(1 << XFEATURE_BNDREGS)
#define XFEATURE_MASK_BNDCSR		(1 << XFEATURE_BNDCSR)
#define XFEATURE_MASK_OPMASK		(1 << XFEATURE_OPMASK)
#define XFEATURE_MASK_ZMM_Hi256		(1 << XFEATURE_ZMM_Hi256)
#define XFEATURE_MASK_Hi16_ZMM		(1 << XFEATURE_Hi16_ZMM)
#define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
#define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
#define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
#define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
#define XFEATURE_MASK_XTILE_CFG		(1 << XFEATURE_XTILE_CFG)
#define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)

#define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
#define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
					 | XFEATURE_MASK_ZMM_Hi256 \
					 | XFEATURE_MASK_Hi16_ZMM)

#ifdef CONFIG_X86_64
# define XFEATURE_MASK_XTILE		(XFEATURE_MASK_XTILE_DATA \
					 | XFEATURE_MASK_XTILE_CFG)
#else
# define XFEATURE_MASK_XTILE		(0)
#endif

#define FIRST_EXTENDED_XFEATURE	XFEATURE_YMM

We could treat the aarch64 host state as some "new processor state extension" and store ContextBackup there instead of on the host stack, presenting a more 'plausible' state to the guest. We could even use the top bit of xfeatures to indicate this as a 'new but unknown' extension to the guest.

This would allow us to return via rt_sigreturn instead of the SignalOp that is used now. It would also allow us to set the retcode the same way the kernel sets it, for guests that depend on that.
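A small sketch of the "top bit of xfeatures" idea, under the assumption that bit 63 is free to use as a marker. XStateHeader mirrors the kernel's struct _header shown above; kHostStateFeature, MarkHostStatePresent and HasHostState are hypothetical names.

```cpp
#include <cstdint>

// Same layout as the kernel's `struct _header` (64 bytes).
struct XStateHeader {
  uint64_t xfeatures;
  uint64_t reserved1[2];
  uint64_t reserved2[5];
};
static_assert(sizeof(XStateHeader) == 64, "xstate header is 64 bytes");

// Assumption: the top bit is unused by real XSAVE features, so we claim it
// to say "a host (aarch64) context blob follows the known state components".
constexpr uint64_t kHostStateFeature = 1ULL << 63;

inline void MarkHostStatePresent(XStateHeader& h) { h.xfeatures |= kHostStateFeature; }

inline bool HasHostState(const XStateHeader& h) {
  return (h.xfeatures & kHostStateFeature) != 0;
}
```

On signal return, a frame without the marker would mean there is no host state to restore, only guest state.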

Currently we detect RIP changes in 64-bit contexts, we could detect them for any context -- or even hash the context to detect 'any' changes.
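Hashing the context to detect 'any' change could look roughly like the sketch below. GuestContext is a hypothetical stand-in, not FEX's real context layout, and FNV-1a is just one convenient hash choice.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the guest-visible register context.
// All members are 64-bit, so the struct has no padding bytes.
struct GuestContext {
  uint64_t gregs[16];
  uint64_t rip;
  uint64_t rflags;
};

// 64-bit FNV-1a over the raw bytes of the context.
inline uint64_t HashContext(const GuestContext& ctx) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(&ctx);
  uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (size_t i = 0; i < sizeof(ctx); ++i) {
    h ^= p[i];
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}
```

The hash would be taken before calling the guest handler and compared after it returns; a mismatch means the guest edited the context and we cannot simply resume the interrupted host code.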

In conjunction with partial deferred signals

Apart from simplifying the existing flow and presenting a more plausible state to the guest, nothing would change.

In conjunction with guest state reconstruction

Whenever we have a reconstructed guest state, we can leave our 'xfeature' flag unset and skip storing the host state.

skmp commented May 17, 2022

Defining some terms here to make things more understandable.

(fully/plausibly/not) redispatchable [guest state]

A state we can redispatch / recompile and the guest would not be able to detect any changes.

If fully, any code can be run.

If plausibly, the same code that was interrupted during state reconstruction can be resumed, but not arbitrary code.

If not, host state is needed in addition to the guest state to resume execution.

(full/plausible/partial) guest state reconstruction

A guest state reconstructed from the partial guest state stored in CPUState, the host registers as well as any metadata.

full means the guest state is fully redispatchable

plausible means the guest state is plausibly redispatchable

partial means the state is not redispatchable.

host deferred signals

(I've also mentioned this as partial in the past)

Signal delivery is deferred only for host code, except thunks. This avoids reentrancy amplification from our logic, and removes the overhead of using signal masks around locks. It also guarantees that any 3rd party libraries we use directly don't need to have their locks modified - again, thunks are not included here.

guest deferred signals

Signal delivery is deferred for guest code, so we can do guest state reconstruction.

@skmp skmp changed the title Signal safety Signal Handling May 17, 2022

skmp commented May 17, 2022

Reading our code, we generate the context on the guest stack by reading RSP without any reconstruction, which can lead to us overwriting the guest stack.

We back off 128 bytes (redzone) in x86, which likely makes the problem less likely to happen.

We could either require the guest RSP to be synchronized or reconstructed, or have a guaranteed uncertainty boundary that we skip over.
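The "uncertainty boundary" option can be sketched as a frame address calculation. This is only an illustration: SignalFrameAddress and its uncertainty parameter are hypothetical, and the final alignment step (16-byte aligned, minus 8) is modelled on what the x86-64 kernel does as far as I can tell.

```cpp
#include <cstdint>

constexpr uint64_t kRedZone = 128;  // x86-64 red zone below RSP

// observed_rsp: the RSP value we read without reconstruction.
// uncertainty:  hypothetical upper bound on how far below observed_rsp the
//               real guest RSP might be.
inline uint64_t SignalFrameAddress(uint64_t observed_rsp, uint64_t frame_size,
                                   uint64_t uncertainty) {
  uint64_t sp = observed_rsp - uncertainty - kRedZone;  // skip anything the guest may own
  sp -= frame_size;                                     // make room for the frame
  sp = (sp & ~uint64_t{15}) - 8;                        // 16-byte align, then -8 as after a call
  return sp;
}
```

A larger uncertainty bound wastes guest stack, which matters for small stacks and MINSIGSTKSZ-sized alternate stacks.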

skmp commented May 17, 2022

Redzone may be applied both for 32-bit and 64-bit processes in a 64-bit kernel. Source: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/signal.c#L252

ia32 compat in x64 kernels is in https://elixir.bootlin.com/linux/latest/source/arch/x86/ia32/ia32_signal.c

skmp commented May 17, 2022

https://elixir.bootlin.com/linux/latest/source/arch/x86/ia32/ia32_signal.c#L347

https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/sigframe.h#L23

The rt_sigframe has some special meaning for some applications (eg, gdb) according to comments around the kernel; we could match ours with it for guest gdb debugging.

skmp commented May 17, 2022

One more note on the actual flow in the linux kernel.

For non usermode-linux kernels, signal delivery starts from handle_signal_work (https://elixir.bootlin.com/linux/latest/source/kernel/entry/common.c#L143) which calls arch_do_signal_or_restart (https://elixir.bootlin.com/linux/latest/source/include/linux/entry-common.h#L280) implemented in https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/signal.c#L864.

arch_do_signal_or_restart calls handle_signal (https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/signal.c#L786) and returns if there is a signal to handle, or handles automatic syscall restart and calls restore_saved_sigmask and exits.

handle_signal (https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/signal.c#L786) first handles v8086 mode (we don't care), then handles syscall restart, then handles single stepping, then sets up the frame with setup_rt_frame, and if there are no errors there, it clears DF, RF, TF and the fpu state. Finally, signal_setup_done is called.

setup_rt_frame (https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/signal.c#L763) calls ia32_setup_rt_frame (compat task, SA_SIGINFO) or ia32_setup_frame (compat task, no SA_SIGINFO) or __setup_rt_frame (x86-64 task)

signal_setup_done (https://elixir.bootlin.com/linux/latest/source/kernel/signal.c#L2905) calls force_sigsegv (https://elixir.bootlin.com/linux/latest/source/kernel/signal.c#L1697) on error or signal_delivered (https://elixir.bootlin.com/linux/latest/source/kernel/signal.c#L2886)

force_sigsegv (https://elixir.bootlin.com/linux/latest/source/kernel/signal.c#L1697) sends a non-fatal sigsegv if not in a sigsegv loop. I haven't fully followed that path, and will do so, but I expect no surprises.

signal_delivered(https://elixir.bootlin.com/linux/latest/source/kernel/signal.c#L2886) handles SA_NODEFER, SS_AUTODISARM

syscall restarting

The syscall number is saved in regs->orig_ax (syscall_get_nr, https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/syscall.h#L38). Interestingly enough, syscall_rollback is not used during syscall restart, but manually reimplemented regs->ax = regs->orig_ax;.

From there on, setup_signal_stack_si (https://elixir.bootlin.com/linux/latest/source/arch/x86/um/signal.c#L361)
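For reference, the syscall restart handling in handle_signal can be paraphrased roughly as below. This is a sketch from my reading of the kernel source, not a copy; Regs and FixupSyscallForHandler are hypothetical names, and the ERESTART* values are the kernel-internal ones that never reach user space.

```cpp
#include <cerrno>
#include <cstdint>

// Kernel-internal restart codes (include/linux/errno.h).
constexpr int64_t kERESTARTSYS = 512;
constexpr int64_t kERESTARTNOINTR = 513;
constexpr int64_t kERESTARTNOHAND = 514;
constexpr int64_t kERESTART_RESTARTBLOCK = 516;

// Hypothetical, minimal view of pt_regs.
struct Regs {
  int64_t ax;       // syscall return value
  int64_t orig_ax;  // syscall number, -1 if not in a syscall
  uint64_t ip;
};

// Adjusts the interrupted syscall's result before a handler is invoked.
inline void FixupSyscallForHandler(Regs& regs, bool sa_restart) {
  if (regs.orig_ax == -1) return;  // not inside a syscall
  switch (regs.ax) {
    case -kERESTART_RESTARTBLOCK:
    case -kERESTARTNOHAND:
      regs.ax = -EINTR;
      break;
    case -kERESTARTSYS:
      if (!sa_restart) {
        regs.ax = -EINTR;
        break;
      }
      [[fallthrough]];
    case -kERESTARTNOINTR:
      regs.ax = regs.orig_ax;  // reload the syscall number
      regs.ip -= 2;            // back up over the 2-byte syscall instruction
      break;
    default:
      break;
  }
}
```

The guest-visible effect is that with SA_RESTART the handler's context shows RIP pointing at the syscall instruction and RAX holding the syscall number again.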

skmp commented May 18, 2022

Complication: Signal mask handling vs thunks

A host thunk may modify the signal mask without the guest being informed about it, so if the guest then queries the mask (eg, via sigprocmask) it will get an unexpected value.

We could synchronize the masks upon entering the guest after a thunk (either recursed or returning), and make sure we forward the masking to the thunk.

This may not be 100% correct as we need to always have a couple of signals unblocked.

Also, a thunk may call sigaction behind our backs, and cause further issues there.

skmp commented May 18, 2022

Complication: Signal mask handling on guest signal returns

When returning from a guest signal handler we need to give the guest's signal mask back to the kernel, not the host's. A guest application can modify the signal mask stored in ucontext.
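The native behaviour we need to match can be observed with a small experiment: a handler edits uc_sigmask and the kernel installs the edited mask on sigreturn. A sketch (the function names are made up for this example):

```cpp
#include <pthread.h>
#include <signal.h>
#include <ucontext.h>

// Handler edits the mask saved in the ucontext; it is applied on sigreturn.
static void BlockUsr2OnReturn(int, siginfo_t*, void* uctx) {
  ucontext_t* uc = static_cast<ucontext_t*>(uctx);
  sigaddset(&uc->uc_sigmask, SIGUSR2);
}

// Returns true if SIGUSR2 is blocked after the SIGUSR1 handler returned.
inline bool Usr2BlockedAfterHandler() {
  struct sigaction sa = {}, old_sa;
  sa.sa_sigaction = BlockUsr2OnReturn;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGUSR1, &sa, &old_sa);

  // Start with both signals unblocked on this thread.
  sigset_t set, saved;
  sigemptyset(&set);
  sigaddset(&set, SIGUSR1);
  sigaddset(&set, SIGUSR2);
  pthread_sigmask(SIG_UNBLOCK, &set, &saved);

  raise(SIGUSR1);  // the handler runs and returns before raise() does

  sigset_t now;
  pthread_sigmask(SIG_SETMASK, nullptr, &now);
  bool blocked = sigismember(&now, SIGUSR2) == 1;

  // Restore the original mask and disposition.
  pthread_sigmask(SIG_SETMASK, &saved, nullptr);
  sigaction(SIGUSR1, &old_sa, nullptr);
  return blocked;
}
```

Under FEX the same guest code should end up with SIGUSR2 blocked from the guest's point of view, while our internal signals stay unblocked on the host.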

skmp commented May 18, 2022

To Investigate: X86ContextBackup/ArmContextBackup may not contain or restore all of the context

AVX and other extensions in x86, SVE and other extensions in arm.

There's a // XXX: Save 256bit and 512bit AVX register state in the relevant code.

skmp commented May 19, 2022

From discussion with @Sonicadvance1, MINSIGSTKSZ (2048 bytes in old applications, variable in glibc > 2.34) is a limiting factor on how much stack we use.
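For the host side, the variable minimum can be queried at runtime. A minimal sketch, assuming glibc's getauxval and falling back to the MINSIGSTKSZ macro when the kernel does not export AT_MINSIGSTKSZ; MinSignalStackSize is a hypothetical helper, and the value is for the host architecture, not for the x86 guest.

```cpp
#include <signal.h>
#include <sys/auxv.h>
#include <cstddef>

// Returns the minimum signal stack size the host reports for this process.
inline size_t MinSignalStackSize() {
#ifdef AT_MINSIGSTKSZ
  // The kernel exports the real signal frame size here on newer versions;
  // getauxval returns 0 when the entry is missing.
  unsigned long from_kernel = getauxval(AT_MINSIGSTKSZ);
  if (from_kernel != 0) return from_kernel;
#endif
  return static_cast<size_t>(MINSIGSTKSZ);
}
```

A guest built against old headers still assumes 2048 bytes, so whatever we push on the guest stack has to fit within that, regardless of what the host reports.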

This is also an issue with AVX512 contexts (see https://sourceware.org/bugzilla/show_bug.cgi?id=20305)

skmp commented May 23, 2022

Ran into guest stack overflows in a sample app that used 4 KB stacks, both with main and skmp/guest-rt_sigreturn.

skmp commented May 24, 2022

Another interesting tidbit from do_sigaction (common signal handling code) in the kernel source

POSIX 3.3.1.3:
 "Setting a signal action to SIG_IGN for a signal that is
  pending shall cause the pending signal to be discarded,
  whether or not it is blocked."
 "Setting a signal action to SIG_DFL for a signal that is
  pending and whose default action is to ignore the signal
  (for example, SIGCHLD), shall cause the pending signal to
  be discarded, whether or not it is blocked"
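The first rule is easy to check natively. A sketch of such a check (helper names are made up here; this is not one of our existing tests):

```cpp
#include <pthread.h>
#include <signal.h>

// Is `sig` pending for the calling thread (or its process)?
inline bool IsPending(int sig) {
  sigset_t pending;
  sigpending(&pending);
  return sigismember(&pending, sig) == 1;
}

// Returns true if a blocked, pending SIGUSR1 disappears once its action is
// set to SIG_IGN.
inline bool IgnoreDiscardsPending() {
  sigset_t block, saved;
  sigemptyset(&block);
  sigaddset(&block, SIGUSR1);
  pthread_sigmask(SIG_BLOCK, &block, &saved);

  // Make sure the signal is not already ignored, then make it pending.
  struct sigaction dfl = {}, ign = {}, old_sa;
  dfl.sa_handler = SIG_DFL;
  sigemptyset(&dfl.sa_mask);
  sigaction(SIGUSR1, &dfl, &old_sa);
  raise(SIGUSR1);
  bool was_pending = IsPending(SIGUSR1);

  // Setting SIG_IGN should discard the pending instance.
  ign.sa_handler = SIG_IGN;
  sigemptyset(&ign.sa_mask);
  sigaction(SIGUSR1, &ign, nullptr);
  bool still_pending = IsPending(SIGUSR1);

  // Unblock while still ignored, then restore the original disposition.
  pthread_sigmask(SIG_SETMASK, &saved, nullptr);
  sigaction(SIGUSR1, &old_sa, nullptr);
  return was_pending && !still_pending;
}
```

With deferred signals, a signal FEX has already consumed from the kernel but not yet delivered would need the same treatment when the guest calls sigaction.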

@Sonicadvance1
Member

Watch out: signal queueing has changed how this works a bit.

skmp commented May 24, 2022

Yes, I'm not trusting anything (except how the kernel actually implements things) at this point

skmp commented May 24, 2022

And everything is subject to a test case

skmp commented May 24, 2022

Note: As a general design goal, it would be nice to move more (most?) of the guest signal handling to syscalls from the jit dispatcher

skmp commented May 24, 2022

Bug: sigaction::sa_mask, and sigaction/signal self blocking is not working in the current implementation at all.

skmp commented May 24, 2022

GuestSigAction: GuestSAMask doesn't match kernel structure padding. Kernels are configured by default with only 64 signals, so this should be ok, but they could be configured with up to 1024 signals, so there's space for that many in the kernel interfaces.
I don't know if any shipping kernel supports the extended signals.

skmp commented May 24, 2022

Note from https://man7.org/linux/man-pages/man2/sigaction.2.html

   Undocumented
       Before the introduction of SA_SIGINFO, it was also possible to
       get some additional information about the signal.  This was done
       by providing an sa_handler signal handler with a second argument
       of type struct sigcontext, which is the same structure as the one
       that is passed in the uc_mcontext field of the ucontext structure
       that is passed (via a pointer) in the third argument of the
       sa_sigaction handler.  See the relevant Linux kernel sources for
       details.  This use is obsolete now.

skmp commented May 24, 2022

Note: There are several flags of sigaction that we may not support correctly right now (SA_EXPOSE_TAGBITS, SA_RESETHAND, SA_RESTORER, SA_NODEFER)

skmp commented May 24, 2022

Note: pselect6 (and possibly others) take the signal mask as a parameter, which might have complications around host -> guest signal handover.

skmp commented May 25, 2022

Issue: Our internal signals that shouldn't be blocked (SIGRT31, any others?) will pass a guest-sent, guest-blocked signal to the guest, instead of holding it.

We need to at least defer in our signal handling and not deliver to the guest, though this can get tricky, and is probably leaky wrt sigwait and other system calls.

At least conformance-interfaces-sigwaitinfo-2-1.test.jit.posix triggers this, though the test itself ignores the issue.

From what I can see, there's no simple way (and maybe none at all) for us to "safely" steal or overload signals from the guest.

I did a quick search for alternative solutions, and possibly ptracing ourselves is a cleaner solution to all this, though that needs a deeper investigation.

skmp commented May 25, 2022

Issue: glibc internally uses signals 32 and 33 (and possibly 34, though perhaps only in LinuxThreads, which is no longer used). Guest glibc and host glibc handling might cause issues there. We need test cases.

This can also cause issues around thunks.

skmp commented May 25, 2022

Note: Host-thunks probably should use the guest signal interface and delivery, to present a consistent view to the guest. Otherwise, our internal signal handling state would get corrupted.

skmp commented May 25, 2022

from https://stackoverflow.com/questions/12680624/what-has-happened-to-the-32-and-33-kill-signals

The POSIX realtime signals option defines a set of signals from SIGRTMIN to SIGRTMAX which have various useful properties (e.g. they have a well-defined delivery priority -- lowest signal number first -- and multiple instances of the same signal can be queued, and associated with a parameter, via sigqueue()). These are implemented by the kernel using signal numbers 32 upwards.
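The difference between merged standard signals and queued realtime signals can be seen with a short experiment. A sketch with made-up helper names, using raise() so the signal is directed at the calling thread:

```cpp
#include <pthread.h>
#include <signal.h>

static volatile sig_atomic_t g_deliveries = 0;
static void CountDelivery(int) { g_deliveries = g_deliveries + 1; }

// Raises `sig` `times` times while it is blocked, then unblocks it and
// returns how many times the handler actually ran.
inline int DeliveriesAfterBlockedRaises(int sig, int times) {
  struct sigaction sa = {}, old_sa;
  sa.sa_handler = CountDelivery;
  sigemptyset(&sa.sa_mask);
  sigaction(sig, &sa, &old_sa);

  sigset_t set, saved;
  sigemptyset(&set);
  sigaddset(&set, sig);
  pthread_sigmask(SIG_BLOCK, &set, &saved);

  g_deliveries = 0;
  for (int i = 0; i < times; ++i) raise(sig);

  // Pending instances are delivered before this call returns.
  pthread_sigmask(SIG_UNBLOCK, &set, nullptr);
  int seen = g_deliveries;

  // Restore the original mask and disposition.
  pthread_sigmask(SIG_SETMASK, &saved, nullptr);
  sigaction(sig, &old_sa, nullptr);
  return seen;
}
```

On Linux I would expect three blocked raises of SIGUSR1 to produce one delivery and three of SIGRTMIN to produce three, which is the behaviour 'user mode queueing' would have to preserve.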

skmp commented May 25, 2022

Note:

After returning from a signal-catching function, the value of errno is unspecified if the signal-catching function or any function it called assigned a value to errno and the signal-catching function did not save and restore the original value of errno.

This might have some implications for thunk interworking

skmp commented May 31, 2022

SIGSTOP always pauses the process group, not only a single thread (tested locally, some stackoverflow answer verified, citation needed).

ptrace works only across different process groups (tested locally, some stackoverflow answer verified, citation needed)

GLIBC internal signals: https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/sysdeps/unix/sysv/linux/internal-signals.h#L30

#define SIGCANCEL       __SIGRTMIN


/* Signal needed for the kernel-supported POSIX timer implementation.
   We can reuse the cancellation signal since we can distinguish
   cancellation from timer expirations.  */
#define SIGTIMER        SIGCANCEL


/* Signal used to implement the setuid et.al. functions.  */
#define SIGSETXID       (__SIGRTMIN + 1)


/* How many signal numbers need to be reserved for libpthread's private uses
   (SIGCANCEL and SIGSETXID).  */
#define RESERVED_SIGRT  2

skmp commented May 31, 2022

For some timer details,

https://man7.org/linux/man-pages/man2/timer_create.2.html

Likely to be those two

SIGEV_THREAD
              Upon timer expiration, invoke sigev_notify_function as if
              it were the start function of a new thread.  See
              [sigevent(7)](https://man7.org/linux/man-pages/man7/sigevent.7.html) for details.

       SIGEV_THREAD_ID (Linux-specific)
              As for SIGEV_SIGNAL, but the signal is targeted at the
              thread whose ID is given in sigev_notify_thread_id, which
              must be a thread in the same process as the caller.  The
              sigev_notify_thread_id field specifies a kernel thread ID,
              that is, the value returned by [clone(2)](https://man7.org/linux/man-pages/man2/clone.2.html) or [gettid(2)](https://man7.org/linux/man-pages/man2/gettid.2.html).
              This flag is intended only for use by threading libraries.

skmp commented May 31, 2022

Digging deeper in glibc,

It seems to create a helper thread (
https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/sysdeps/unix/sysv/linux/timer_routines.c#L62)

which does
while (__sigwaitinfo (&sigtimer_set, &si) < 0);
and then
if (si.si_code == SI_TIMER)
and then creates a new thread to run the callback if it is found.

The only other users are __libc_signal_block_sigtimer (never used) and __libc_signal_unblock_sigtimer (used in the second helper thread, https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/sysdeps/unix/sysv/linux/timer_routines.c#L44; no idea why, it looks like a bug or unnecessary, unless it is so that thread can be canceled, since SIGTIMER == SIGCANCEL).

The thread itself is created in ___timer_create (https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/sysdeps/unix/sysv/linux/timer_create.c#L70)

It is re-set to be re-created post-fork (https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/sysdeps/unix/sysv/linux/timer_routines.c#L118)

called via fork_system_setup_after_fork https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/sysdeps/nptl/fork.h#L48

as only the forking thread makes it across the fork.

___timer_create sets the TID of the waiting thread so the signal gets received by only that special thread.

struct sigevent sev =
	  { .sigev_value.sival_ptr = newp,
	    .sigev_signo = SIGTIMER,
	    .sigev_notify = SIGEV_SIGNAL | SIGEV_THREAD_ID,
	    ._sigev_un = { ._pad = { [0] = __timer_helper_tid } } };

skmp commented May 31, 2022

We correctly forward SIG32 to the guest (wrote two tests about it), and it should be safe to use posix timers with SIGEV_THREAD from within FEX or any host thunks, as glibc targets by tid for it.

We cannot safely use pthread_cancel towards a thread that is also a guest thread, as it will cause a guest abort, not a host abort, and any host thread-level destructors will not be called.

While not tested yet, SIG33 should work fine for the guest, however doing setuid from FEX or a thunk is likely to not work correctly.

SIG33 is sent through INLINE_SETXID_SYSCALL https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/sysdeps/nptl/setxid.h#L29 (only on multi threaded programs)

via __nptl_setxid (https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/nptl/nptl_setxid.c#L175)

and finally setxid_signal_thread
(https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/nptl/nptl_setxid.c#L154)

and received with __nptl_setxid_sighandler
(https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/nptl/nptl_setxid.c#L56)

While a bit ugly, we could piggyback on SIG33, as it is (almost? citation needed) never blocked, and use SI_TIMER, as only SI_TKILL is handled by glibc.

skmp commented May 31, 2022

Looking in detail in our implementation,

SignalEvent::Stop is used to

  • Implement the exit syscall (via Context::StopThread)
  • Implement OP_Break in the interpreter (via Context::StopThread)
  • Implement graceful shutdown for exit_group (via Context::Stop, has a race bug)
  • Implement graceful shutdown for Debugger (via Context::Stop in CloseCallback and ExitHandler)

SignalEvent::Pause is used

  • To implement PauseCallback in Debugger/Main.cpp (via Context::Pause)
  • In several places in GdbServer.cpp to pause all threads (via Context::Pause)

SignalEvent::Return is used

  • in Context::Run (? why?) Which is used by GdbServer.cpp
  • In Context::Step via Context::Run, which also uses CoreRunningMode::MODE_SINGLESTEP which is broken, as the blocks check on entry to early exit, not on exit.
  • In SignalReturn of Interpreter/BranchOps.cpp to implement signal returns (this seems the only intended use)

JIT signal returns use either ud2 or hlt(0) in the dispatcher

Can we avoid using signals there?

  • Uses of StopThread to stop the current thread can be switched over to use the dispatcher
  • Uses of Context::StopThread to stop other threads, or Context::Stop, are tricky and hard to do without re-thinking our stopping strategy.
  • SignalEvent::Pause is more tricky, though debugger only. We could pause threads via a change of CoreRunningMode, which is checked on every block when gdb is attached, though this would still block on syscalls. In a more complex solution, the debugger could ptrace FEX, which has other advantages as well.
  • SignalEvent::Return can be switched over to use the dispatcher, and/or even become not needed by implementing rt_sigreturn and sigreturn

I'll look into removing all uses, apart from SignalEvent::Pause for debugger, and switching that one to SIG33 + SI_TIMER.

skmp commented Jun 1, 2022

We need to handle SIGSEGV, SIGBUS internally, which means if the guest masks them, we handle them incorrectly.

SIGSEGV, SIGBUS, SIGILL, SIGFPE all terminate the process based on my testing if masked and generated synchronously. They obey normal queueing rules otherwise (kill, t/tgkill, sigqueue, timer?)
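The synchronous case can be reproduced with a fork-based check along these lines. This is a sketch written for this note, not the linked test; FaultWithSigsegvBlocked is a made-up name, and the fault is triggered by writing to a PROT_NONE page.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// If the kernel ever ran this handler, the child would exit cleanly.
static void NeverReached(int) { _exit(0); }

// Forks a child that blocks SIGSEGV, installs a handler and then faults.
// Returns the signal that killed the child, 0 if it exited normally,
// or -1 on fork/wait failure.
inline int FaultWithSigsegvBlocked() {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    struct sigaction sa = {};
    sa.sa_handler = NeverReached;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);

    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGSEGV);
    sigprocmask(SIG_BLOCK, &set, nullptr);

    void* page = mmap(nullptr, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) _exit(2);
    *static_cast<volatile char*>(page) = 1;  // synchronous SIGSEGV
    _exit(1);  // only reached if the write did not fault
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Natively the child is killed by SIGSEGV even though a handler is installed, so FEX has to distinguish a guest-masked synchronous fault from one it needs for its own purposes.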

Short test: https://github.com/FEX-Emu/fex-assorted-tests-bins/blob/main/src/tests/signal/synchronous-signal-block.cpp

skmp commented Jun 1, 2022

Interesting read: http://davmac.org/davpage/linux/rtsignals.html

Well, with kernel 2.6.22.6 at least, you *can* have two of the same signal pending: one queued via sigqueue (to the process) plus one via raise or pthread_kill (to the thread). The thread-specific signal gets delivered first).

skmp commented Jun 5, 2022

/proc/{pid}/status and /proc/self/status include

SigQ:	1/192451
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	7be3c0fe28014a03
SigIgn:	0000000000001000
SigCgt:	00000001000004ec
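For reference, these masks are hex bitmaps where bit N-1 corresponds to signal N. A small parsing sketch (ParseStatusMask and MaskHasSignal are hypothetical helpers that work on the text of the status file):

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Extracts a hex mask such as "SigBlk:\t7be3c0fe28014a03" from the text of
// a status file. Returns 0 if the field is missing.
inline uint64_t ParseStatusMask(const std::string& status, const std::string& field) {
  std::istringstream in(status);
  std::string line;
  const std::string prefix = field + ":";
  while (std::getline(in, line)) {
    if (line.compare(0, prefix.size(), prefix) == 0) {
      // stoull skips the leading tab and parses the hex digits.
      return std::stoull(line.substr(prefix.size()), nullptr, 16);
    }
  }
  return 0;
}

// Bit N-1 of the mask corresponds to signal number N.
inline bool MaskHasSignal(uint64_t mask, int sig) {
  return sig >= 1 && sig <= 64 && ((mask >> (sig - 1)) & 1) != 0;
}
```

In the sample above, SigCgt has the bits for signals 11 (SIGSEGV) and 33 set, and SigIgn has signal 13 (SIGPIPE). A guest reading its own status file under FEX would see the host's view, including handlers FEX installed.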

Something to be mindful of

@skmp skmp moved this to 🆕 Unplanned in Next Project Milestone Aug 18, 2022

skmp commented Aug 22, 2022

Another thing I realised today, we can have double faulting on arm64 due to atomics emulation, when a Guest SIGSEGV arises during SIGBUS handling of an unaligned atomic for host.

Projects
Status: 🆕 Unschedulled

2 participants