
Does the concept of a compiler fence make any sense? #347

Open · RalfJung opened this issue Jul 5, 2022 · 129 comments

@RalfJung (Member) commented Jul 5, 2022

Atomic fences are specified in terms of things that happen when certain atomic reads occur:

  • on thread A: first a fence, then some (atomic) store
  • on thread B: first an (atomic) load, then a fence

In that particular situation, if the load reads-from the store, then the fences kick in and have an effect. That is the only effect they have, I think.
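
As a concrete illustration, here is a minimal sketch of that situation (names and values are made up):

use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// Thread A: first a fence, then an atomic store.
fn thread_a() {
    DATA.store(42, Ordering::Relaxed);
    fence(Ordering::Release);
    READY.store(true, Ordering::Relaxed);
}

// Thread B: first an atomic load, then a fence.
fn thread_b() {
    if READY.load(Ordering::Relaxed) {
        fence(Ordering::Acquire);
        // The READY load read-from the READY store, so the two fences
        // synchronize and this load is guaranteed to observe 42.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    }
}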

So, if your program contains no atomic accesses, but some atomic fences, those fences do nothing.
We also think that an atomic fence has at least all the effects of a compiler fence, i.e., a compiler fence is strictly weaker than an atomic fence. But that means a compiler fence has no effect on programs without atomic accesses -- which is just wrong, that's not how they are supposed to behave.

So what is the operational spec of a compiler fence? I have no idea.
See the discussion here for some more details; let's continue the discussion in this issue.

@Diggsey commented Jul 5, 2022

I think atomic fences have additional guarantees that cannot be specified in terms of the abstract machine. Compiler fences provide a subset of those additional guarantees.

These guarantees would probably have to be specified in the same way that FFI or inline assembly is specified - as a relation between the abstract machine state and the underlying lower level state of the machine.

If you look at the uses for a compiler fence: https://stackoverflow.com/a/18454971

Both uses boil down to a single thread of execution being interrupted (either by an actual interrupt, or by a context switch) to run some other code, and the programmer wants to synchronize access to memory between the interrupted code and the interrupting code. This cannot be specified at an abstract machine level, because the abstract machine doesn't have a concept of a thread being interrupted in this way.

Going back to "atomic fences have additional guarantees that cannot be specified in terms of the abstract machine":

I think this is evidenced by the fact that we would expect atomic access from Rust code to be able to synchronize with atomic accesses from another language (or indeed assembly) in the same program. Therefore there must be "more to" atomic accesses than just their semantics within the abstract machine.

@RalfJung (Member, Author) commented Jul 5, 2022

So you are going in the direction of what was discussed around here.

These guarantees would probably have to be specified in the same way that FFI or inline assembly is specified - as a relation between the abstract machine state and the underlying lower level state of the machine.

That would block a lot of optimizations that we want to allow when writing concurrent software with atomics. In particular, we want there to be a meaningful difference between atomic fences with different orderings, which I do not think would be the case under your proposal. So I don't think atomic fences should have any effect like that.

So maybe the statement that an atomic fence is strictly stronger than a compiler fence is wrong?

I think this is evidenced by the fact that we would expect atomic access from Rust code to be able to synchronize with atomic accesses from another language (or indeed assembly) in the same program.

I don't agree. Atomics are a feature inside the language and its memory model, so one should only expect them to synchronize inside that memory model. So whatever the other code does must be cast in terms of that memory model.

C++ atomics are not some abstract way of talking about what the hardware does; they are entirely their own thing, which is then implemented in terms of what the hardware does. There are sometimes even multiple mutually incompatible schemes for implementing the same model on a given piece of hardware!

@DemiMarie (in the old thread)

Are there semantics for interrupts at all? That is a prerequisite for having semantics for compiler fences.

I guess part of the question here is whether interrupts exist as a language feature (that we'd have to add to the AM) and compiler fences interact with that, or whether we consider them to be outside the AM and compiler-fence to be a way to somehow coordinate with out-of-AM activities.

Atomic operations are not an out-of-AM mechanism. This is one of the many key differences between atomic accesses and volatile accesses. (E.g., the compiler is allowed to reorder and remove atomic accesses when that preserves the semantics of the abstract memory model.) So it would be very strange to me if atomic fences were out-of-AM mechanisms, and indeed that would put a lot of extra cost on concurrent algorithms that are perfectly happy staying entirely inside the AM.

@DemiMarie

I think this is evidenced by the fact that we would expect atomic access from Rust code to be able to synchronize with atomic accesses from another language (or indeed assembly) in the same program.

I don't agree. Atomics are a feature inside the language and its memory model, so one should only expect them to synchronize inside that memory model. So whatever the other code does must be cast in terms of that memory model.

Cross-FFI atomics need to work. The question is how to specify atomics in such a way that they do.

@Diggsey commented Jul 5, 2022

In particular we want there to be a meaningful difference between atomic fences with different orderings, which I do not think is the case under your proposal. So I don't think atomic fences should have any effect like that.

That seems like a bit of a leap? I assume you are referring to:

There are sometimes even multiple different mutually incompatible schemes to implement the same model on a given hardware!

I would expect atomics to be like part of the ABI, i.e., the compiler is not expected to make any kind of atomic accesses from FFI code work correctly, but as long as the FFI code follows the same rules (for that ordering) as the Rust compiler does, they would be expected to synchronize. I don't see why this would make orderings redundant.

@Diggsey commented Jul 5, 2022

I don't agree. Atomics are a feature inside the language and its memory model, so one should only expect them to synchronize inside that memory model. So whatever the other code does must be cast in terms of that memory model.

It's my understanding that atomics are the expected way to do IPC via shared memory. I don't see how to reconcile your statement with that. (This is certainly a use-case supported by C++'s atomics)

@RalfJung (Member, Author) commented Jul 5, 2022

It's my understanding that atomics are the expected way to do IPC via shared memory. I don't see how to reconcile your statement with that. (This is certainly a use-case supported by C++'s atomics)

For shared memory with other code that follows the same memory model. Basically, with other instances of (roughly) the same AM.

@Lokathor (Contributor) commented Jul 5, 2022

I'm unclear: are you saying that Rust atomics can sync with C++ atomics, or not?

@RalfJung (Member, Author) commented Jul 5, 2022

That seems like a bit of a leap? I assume you are referring to:

No that's not what I am referring to. I'm saying atomic::fence has this ordering parameter and we agree it is relevant for the semantics, right? Because under your spec I don't see how it is.

I'm unclear: are you saying that Rust atomics can sync with C++ atomics, or not?

We are using the C++ memory model so that works fine.

But syncing e.g. with assembly code only makes sense if whatever that assembly code does can be expressed in terms of this memory model. You can't expect an atomic fence to have any effect other than what it says in the memory model.

@RalfJung (Member, Author) commented Jul 5, 2022

I guess part of the question here is whether interrupts exist as a language feature (that we'd have to add to the AM) and compiler fences interact with that, or whether we consider them to be outside the AM and compiler-fence to be a way to somehow coordinate with out-of-AM activities.

To be clear, I think saying that compiler_fence is an out-of-AM syncing mechanism is not a bad idea. However, that would make compiler_fence incomparable in strength with atomic fences (and having atomic::compiler_fence just makes no sense).

That is actually consistent with what @comex wrote

But if we do want to allow reordering, I'd say that compiler_fence should only have a specified effect in conjunction with volatile accesses, similar to how atomic fences only have an effect in conjunction with atomic accesses.

Volatile accesses and atomic accesses are also incomparable in strength.

However, I just realized compiler_fence has an Ordering, and I have no idea what that is supposed to mean then...

@DemiMarie's comment was marked as off-topic.

@Diggsey commented Jul 6, 2022

I'm saying atomic::fence has this ordering parameter and we agree it is relevant for the semantics, right? Because under your spec I don't see how it is.

Oh I see - because the memory ordering is part of the C++/Rust memory model.

I guess another option would be to add the concept of "interruption" to the abstract machine, and then define compiler fences in terms of that. I think it's possible to define everything about "interruption" except for what triggers it - that's the only part that's "outside" what we can define.

@RalfJung's comment was marked as off-topic.

@m-ou-se (Member) commented Jul 6, 2022

But that means a compiler fence has no effect on programs without atomic accesses -- which is just wrong, that's not how they are supposed to behave.

Why is that wrong?

Single-threaded / signal / compiler fences are relevant for signal handlers, interrupts, and things like Linux' SYS_membarrier.

@Lokathor (Contributor) commented Jul 6, 2022

What's wrong is the "that means a compiler fence has no effect..." part, because they're supposed to have some sort of effect.

We all seem to agree they should do something.

@m-ou-se (Member) commented Jul 6, 2022

They do have an effect, but not in a program without any atomic operations.

@Lokathor (Contributor) commented Jul 6, 2022

Ah, then I was incorrect and we seem to disagree.

Many people (myself included) have been under the impression that a compiler_fence has an effect even in a program with no atomic operations (particularly, even for targets where there are no atomics). If that's not the case we really need to fix the docs.

@m-ou-se (Member) commented Jul 6, 2022

The C++ standard defines atomic_signal_fence (aka our atomic::compiler_fence) as:

Equivalent to atomic_thread_fence(order)¹, except that the resulting ordering constraints are established only between a thread and a signal handler executed in the same thread.

(¹ Aka our atomic::fence().)

And adds:

[Note 1: atomic_signal_fence can be used to specify the order in which actions performed by the thread become visible to the signal handler.
Compiler optimizations and reorderings of loads and stores are inhibited in the same way as with atomic_thread_fence, but the hardware fence instructions that atomic_thread_fence would have inserted are not emitted.
— end note]

Other types of interrupts or things like SYS_membarrier aren't part of the C++ standard, but are also use cases for a single-threaded/compiler/signal fence.

A more practical, more widely applicable way of defining atomic::compiler_fence would be to say it is identical to atomic::fence, except that it doesn't work across hardware threads/cores. That means nothing in the abstract machine, of course. So it's just a regular atomic fence, but with an additional assumption/requirement from outside the abstract machine. If the abstract machine defines signal handlers or interrupts, it could possibly mention something about those situations specifically.
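
Under that reading, the canonical usage looks something like the following sketch (modeled on the signal-handler example in the std docs for compiler_fence; whether the non-atomic write is formally justified is part of what this issue is about):

use std::sync::atomic::{compiler_fence, AtomicBool, Ordering};

static mut IMPORTANT_RESULT: u32 = 0;
static IS_READY: AtomicBool = AtomicBool::new(false);

// Main thread code.
fn compute_and_publish() {
    unsafe { IMPORTANT_RESULT = 42 };
    // Keeps the compiler from sinking the write above below the flag
    // store; no hardware fence instruction is emitted for this.
    compiler_fence(Ordering::Release);
    IS_READY.store(true, Ordering::Relaxed);
}

// Signal handler, executed on the *same* thread.
fn on_signal() {
    if IS_READY.load(Ordering::Relaxed) {
        compiler_fence(Ordering::Acquire);
        let _result = unsafe { IMPORTANT_RESULT }; // guaranteed to be 42
    }
}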

@Lokathor (Contributor) commented Jul 6, 2022

Well, I've never programmed C++, and so honestly I only vaguely follow what that's saying.

I'm unclear on what the implications are for targets without atomics, where we might assume that there's only one core and possibly an interrupt handler.

@m-ou-se (Member) commented Jul 6, 2022

When you say "without atomics", do you mean without atomic read-modify-write/cas operations, or even without load and store?

If a platform has no atomic load or store for any size of atomic, then there's no way to communicate between a signal handler/interrupt and the main code at all through shared memory. In that case there's indeed no use for a fence.

@Lokathor (Contributor) commented Jul 6, 2022

I mean, for example, ARMv4T, which has uninterruptible read-write "swap" instructions, swp (u32) and swpb (u8), but no atomic compare-and-swap and no other atomic read-modify-write operations.

And it does have interrupts.

But you're telling me there's no official/legal way for the interrupt to communicate anything to the main program? That seems... unfortunate.

@m-ou-se (Member) commented Jul 6, 2022

You don't need compare-and-swap or atomic-read-modify-write. Just a simple atomic store to set a flag and a load to check it later is enough.
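
For instance, an interrupt handler on such a single-core target can publish a flag like this (a minimal sketch; the interrupt wiring is target-specific and omitted):

use core::sync::atomic::{AtomicBool, Ordering};

static SHUTDOWN_REQUESTED: AtomicBool = AtomicBool::new(false);

// Interrupt handler: on a single-core chip this AtomicBool store
// compiles to an ordinary byte store, which is already atomic there.
fn on_interrupt() {
    SHUTDOWN_REQUESTED.store(true, Ordering::Relaxed);
}

// Main loop: an atomic load checks the flag later.
fn main_loop() {
    while !SHUTDOWN_REQUESTED.load(Ordering::Relaxed) {
        // ... do work ...
    }
}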

@Lokathor (Contributor) commented Jul 6, 2022

Well, there's normal load and store instructions, but not atomic anything instructions.

@RalfJung (Member, Author) commented Jul 6, 2022

Aren't these loads and stores atomic, if they are "uninterruptable" and there is no multi-core version of this hardware?

@m-ou-se (Member) commented Jul 6, 2022

Well, there's normal load and store instructions, but not atomic anything instructions.

That's the same. All load and store instructions on nearly all architectures are also atomic.

Aren't these loads and stores atomic, if they are "uninterruptable" and there is no multi-core version of this hardware?

Indeed.

@Lokathor (Contributor) commented Jul 6, 2022

Oh, well then sure, there's atomic loading and storing.

@chorman0773 (Contributor)

BTW, I interpret core::sync::atomic::compiler_fence as being Rust's version of atomic_signal_fence - that is, a fence that can only synchronize with atomic ops and fences on the same thread of execution as the calling code (including from a signal handler on that thread).

Reading it this way makes quite a bit of sense to me at least - it just modifies how the fence-atomic, atomic-fence, and fence-fence synchronization rules function.

@Lokathor (Contributor) commented Jul 7, 2022

Query: How does one enforce ordering between volatile and non-volatile ops?

Previously, I've also seen people assume that compiler_fence can do this.

But, again, apparently compiler_fence does nothing at all for volatile/plain ops ordering?

@chorman0773 (Contributor)

Query: How does one enforce ordering between volatile and non-volatile ops?

You can't*.
If the volatile op was also atomic, though, then you could get the same synchronization behaviour in a signal/interrupt handler as a full fence would.

And yes, rust needs volatile atomic operations.

@repnop commented Jul 7, 2022

How does one enforce ordering between volatile and non-volatile ops?

This is what I've been using compiler_fence for. To write driver code I need some way to prevent the compiler from reordering non-volatile and volatile operations, and the docs for compiler_fence don't mention anything about atomics until the example, so I was under the impression that the reason it lives in core::sync::atomic is that it uses the Ordering enum from there. Looking at the source, though, it indeed calls atomic fence intrinsics:

#[inline]
#[stable(feature = "compiler_fences", since = "1.21.0")]
#[rustc_diagnostic_item = "compiler_fence"]
pub fn compiler_fence(order: Ordering) {
    // SAFETY: using an atomic fence is safe.
    unsafe {
        match order {
            Acquire => intrinsics::atomic_singlethreadfence_acq(),
            Release => intrinsics::atomic_singlethreadfence_rel(),
            AcqRel => intrinsics::atomic_singlethreadfence_acqrel(),
            SeqCst => intrinsics::atomic_singlethreadfence(),
            Relaxed => panic!("there is no such thing as a relaxed compiler fence"),
        }
    }
}

which makes me think that the docs really need to be improved to mention that it only applies to atomic accesses, if that's the only guarantee it makes; it doesn't help that it links to the Linux memory barrier docs, which discuss much more than just atomic orderings. IMO there should be some other way to prevent non-volatile & volatile reordering, as it's necessary to be able to guarantee that for things like driver code. The Linux kernel has functions specifically for accessing MMIO with barriers to prevent reordering in code, but do we make any such guarantee about asm! preventing function calls from being reordered inside the caller?
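
One workaround in current use is an empty asm! block: unless options like nomem say otherwise, the compiler must assume inline assembly may read and write any memory, so no access, volatile or not, can be reordered across it. A sketch (folklore, not a documented volatile-ordering guarantee):

use core::arch::asm;

// Compiler-only barrier for *all* memory accesses, including
// non-atomic and volatile ones. The empty block emits no
// instructions, but the compiler must assume it may read or write
// any memory, so no access can be moved across it.
#[inline]
fn memory_clobber_barrier() {
    unsafe { asm!("", options(nostack, preserves_flags)) };
}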

@briansmith

Two issues:

  1. The reordering model is not normative; atomic orderings affect visible side effects.
  2. compiler_fence performs no actions that synchronize operations on any thread with another thread. In particular, compiler_fence typically generates no code, including the code required to prevent the processor from interfering with visibility across hardware threads.

I see what you're saying. Importantly, for any clarification of Mutex Acquire/Release semantics, we can do it in terms of atomic::fence instead of compiler_fence.

So it would instead need to be more like:

  • Mutex::lock() does std::sync::atomic::fence(Ordering::Acquire) or stronger synchronization.
  • Dropping the MutexGuard does std::sync::atomic::fence(Ordering::Release) or stronger synchronization.

...which makes it out of scope for this issue.

OTOH, I think the way compiler_fence uses the words "read" and "write" is confusing because I consider any atomic loads/stores to be "read"/"write" operations, respectively. So I think compiler_fence must do the same thing as fence w.r.t. ordering all atomic load/stores, but not necessarily w.r.t. non-atomic reads/writes.

@chorman0773 (Contributor) commented Jun 8, 2024

The documented definition of compiler_fence is mostly nonsensical IMO. The closest definition I have is the one above: it is equivalent to atomic_signal_fence from C++, i.e., it will perform the same synchronizations as fence, but will only attach to atomic operations and other fences (including compiler_fences) on the same thread of execution (possibly being executed by a signal handler).

Any other definition is either not implemented (compiler_fence is as strong as fence), useless (compiler_fence is a no-op), or ill-defined with regard to the rest of the model (this includes the current documentation).

@briansmith

will perform the same synchronizations as fence, but will only attach to atomic operations and other fences (including compiler_fences) on the same thread of execution (possibly being executed by a signal handler).

This could be simplified by saying that:

  • A compiler fence only attaches to other compiler fences, only on the same thread,
  • An atomic::fence is a compiler_fence / interacts with compiler fences as though it is also a compiler fence (define the mapping for each Ordering value).
  • atomic load with Ordering::Acquire, Mutex::[try_]lock(), Once::call_once_* (before calling the init function), OnceLock::get_* (before calling the init function), etc. all act like atomic::fence() (with the same mapping to compiler_fence), and analogously for Ordering::Release with the corresponding operations, etc.
  • Would OnceCell have any interaction with compiler_fence, analogous to how a OnceLock has to interact with atomic fences? It kind of seems like it would make sense by the analogy above. Either way, it's worth explicitly documenting the (lack of) interaction.

Then every time we define a new thing that acts like an atomic fence, we would get the documentation for the interaction with compiler fences "for free."

Note that this could be done regardless of whether compiler_fence is effectively renamed to something that indicates it is signal-handler specific.

@chorman0773 (Contributor) commented Jun 9, 2024

fence subsuming compiler_fence is fine from a semantics perspective (I don't know that it simplifies the semantics to define it that way, though).

All non-Relaxed atomics being fences does not work, since a fence has effects that apply to multiple atomic accesses, whereas an atomic access alone only forms synchronizes-with edges between actual store/load pairs.

@chorman0773 (Contributor) commented Jun 9, 2024

https://eel.is/c++draft/atomics.fences

A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

(and similar provisions for synchronizing with atomic operations on both sides)

Notably, there's no restriction on which Xs are chosen, so it applies to all possibly valid X and Y pairs.

@RalfJung (Member, Author) commented Jun 10, 2024

Importantly, for any clarification of Mutex Acquire/Release semantics, we can do it in terms of atomic::fence instead of compiler_fence.

Yes... though we have to be a bit careful in the wording here. Mutex acquire can be implemented entirely with acquire loads, not using any acquire fence, so I don't think a fence-based explanation can work. Instead the clarification needs to be in terms of which happens-before / synchronizes-with edges are guaranteed to exist.

But as you said that's getting off-topic for this issue.

The documented definition of compiler_fence is mostly nonsensical IMO.

Agreed. Explanations in terms of reordering may be good for getting some first intuition, but they are unsuited as normative definitions. And the only normative definition for compiler_fence that I know of is C++'s atomic_signal_fence.

@briansmith

Regarding clarifying how a Mutex works, I filed rust-lang/rust#126239.

@RalfJung (Member, Author)

Seen on LWN:

My understanding is that compiler fences can be used instead of thread fences if you know that two threads only ever run on the same physical CPUs. So, as far as the compiler is concerned, they should block the same optimizations as thread fences, generating the same code apart from the fence instructions themselves. The assumptions that are made on generated instructions might be dubious in terms of portability, but they're fine with respect to data races and hence UB.

That's not how compiler fences are specified, but I wonder -- does anything go wrong with this approach?

@Diggsey commented Aug 17, 2024

That sounds right - the "same physical CPU" doesn't have meaning inside the AM AIUI, but I think we all expected that there would be a non-AM component to the definition of a compiler fence.

I guess a follow-up question is "what counts as the same physical CPU?". Is it a logical CPU core, a physical CPU core, a physical CPU, or just left up to the platform to decide (in which case it would be good to have answers for at least x86[-64] and ARM64)?

@RalfJung (Member, Author)

I guess a follow-up question is "what counts as the same physical CPU?".

It's something that behaves entirely in-order.

So when you run in userspace, a single thread (with its signal handler) is "the same physical CPU". I would say that if you spawn a thread with thread::spawn, that is not the same "physical CPU" even if you pin all threads to one core; this is because the AM sees you create that thread so it may assume it to be truly an independent thread.

On hardware, it's whatever the unit of "sequential computation" in this hardware is. If "logical CPU core" refers to things like hyperthreading, then I think that's what it is.

@Diggsey commented Aug 17, 2024

I would say that if you spawn a thread with thread::spawn, that is not the same "physical CPU" even if you pin all threads to one core; this is because the AM sees you create that thread so it may assume it to be truly an independent thread.

That doesn't seem like it would enable any optimizations, since the compiler is not necessarily aware of what signal handlers exist, and so would still have to restrict re-orderings on both threads (even assuming they execute on different cores) under the presumption that both threads have a signal handler.

Given that, it seems as though a user could effectively rely on compiler fences working cross-thread (on the same logical core) even if we don't guarantee that explicitly.

On hardware, it's whatever the unit of "sequential computation" in this hardware is. If "logical CPU core" refers to things like hyperthreading, then I think that's what it is.

Yeah, I guess the pathological case for hyper-threading is that despite two logical cores sharing the same L1 cache, the cores execute instructions out-of-order, breaking any guarantees of the fence.

@hanna-kruppe

I have a use case for compiler fences that (as far as I can tell) was not yet discussed in this issue, and where I'd want compiler fences to mean something even across CPU cores: stack sampling for profiling, especially in bytecode interpreters.

The gold standard way of building sampling profilers is to periodically interrupt a currently running thread, and take a snapshot of its PC, call stack, etc. before you let it continue. Doing it this way requires some fiddly and platform-specific code to suspend threads but avoids large, systematic inaccuracies introduced by only sampling at certain "safe points" in the program. For this to work well, you need two kinds of "synchronization" in a muddy, conventional sense:

  1. Suspending the target threads needs to "happen before" (not necessarily in the AM sense) the act of taking a snapshot. In other words, you actually suspend the thread before you take a snapshot, so you don't have to worry about data races while taking the snapshot.
  2. The snapshot should, as much as possible, give a coherent view of some program point at the source level. While we want the program to run normally and enjoy any optimizations usually applied to it, there are a few "reorderings" that would be very problematic for sampling.

The first point is generally ensured by the platform-specific facilities used to suspend a thread. Only the second point is related to compiler fences. For example, setting up a stack frame during a call is not "atomic" in any real sense, but it's better if the stack pointer is (from the profiler's POV) only adjusted after the return address is written to the expected position relative to the new stack pointer. An AOT or JIT compiler that generates machine code directly, or an interpreter written in assembly, can achieve this by just being careful about the sequence of CPU instructions it uses to manipulate the call stack. Because the thread will be completely interrupted when the sample is taken, this can use conventional non-atomic loads/stores even on machines with very weak memory models (as long as they have precise interrupts). After all, that's how any sampling profiling of C, C++, Rust, etc. code works.

But a bytecode interpreter written in Rust, which wants to permit cross-thread sampling of the bytecode call stack, needs to be more careful. There's basically three options:

  1. Use inline assembly for those bits of the interpreter, which could otherwise be pure Rust (and in many cases, 100% safe Rust). This would force some tricky platform-specific bits into a core part of the interpreter, and the necessary memory clobbers may inhibit a lot of compiler optimizations that would be unproblematic and desirable.
  2. Come up with some clever scheme to guarantee all the synchronization you want using only regular atomics and fences. This is likely to cause overhead: call stack manipulation happens frequently, and the bytecode PC (which is also sampled) is updated even more frequently. There's no contention but even uncontended atomics are measurably more expensive than the non-atomic equivalent if you do enough of them.
  3. Use only relaxed loads and stores (no RMW) plus compiler fences, hoping that those will be cheaper than the previous option while still effectively guaranteeing the desired behavior on the platforms where you've also implemented the "interrupt a thread" bit.

The third option might look like this (on the interpreter side; assume the code that takes the snapshot has a matching acquire fence):

use std::sync::Arc;
use std::sync::atomic::{compiler_fence, AtomicU32, Ordering::{Relaxed, Release}};

struct Interpreter {
    // all Arcs shared with the profiler thread
    pc: Arc<AtomicU32>,
    sp: Arc<AtomicU32>,
    stack: Arc<[AtomicU32]>
}

impl Interpreter {
    fn exec_call(&mut self, new_pc: u32) {
        let sp = self.sp.load(Relaxed);
        let old_pc = self.pc.load(Relaxed);
        self.stack[sp as usize].store(old_pc, Relaxed);
        self.pc.store(new_pc, Relaxed);
        // SP increment needs to happen after storing the new return address
        compiler_fence(Release);
        self.sp.store(sp + 1, Relaxed);
    }

    fn exec_return(&mut self) {
        let sp = self.sp.load(Relaxed);
        let new_pc = self.stack[(sp - 1) as usize].load(Relaxed);
        self.pc.store(new_pc, Relaxed);
        self.sp.store(sp - 1, Relaxed);
        // SP decrement needs to happen before any writes to the stack from a subsequent call.
        compiler_fence(Release);
    }
}

Note that this is all safe code, and taking the snapshot also uses relaxed loads, so the problem is only possible race conditions. The argument for why this sort of thing should work necessarily depends on the platform-specific bits not shown above. On Linux, suspending is generally done by delivering a signal to the thread you want to interrupt. The signal handler takes the snapshot and uses proper synchronization primitives (e.g., POSIX semaphores, because those are async-signal-safe) to talk to the profiler thread. If a compiler fence is like a regular fence restricted to a signal handler and the thread it interrupts, you can probably argue for the correctness almost entirely in Rust memory model terms. Or disprove it, if I made a mistake in the example code.

But on Windows and macOS, the preferred/only(?) way is to ask the OS to suspend the target thread entirely and take the sample from another thread, which will generally run on another CPU core. Even outside of AM concerns, thinking only about the CPU architecture, it's not obvious that this can be correct. However, in practice it should be fine:

  1. Once the OS has suspended the target thread by proactively interrupting it or waiting for its time slice to run out, the CPU has committed to the illusion of stopping at the boundary between two adjacent instructions, similar to how a signal handler might be invoked. If well-placed compiler fences restrict the translation from AM to machine code sufficiently for a signal handler to make sense of the program state in this situation, the only additional difficulty for taking a snapshot from another CPU core should be due to what that means for the CPU memory model.
  2. Within the CPU memory model, it's very likely that the combination of everything the OS does establishes as much synchronization as needed. That includes interrupting the target thread (if it was running), doing the front half of a context switch (if the target thread was running), and somehow getting the confirmation that it's done to the profiler thread (and the core it's running on).

Because it's all safe code in the end (well, except the "suspending a running thread" bit, but that's a different matter) I am not exactly losing sleep over it. Still, this is a very gray area and I'd love to have more official ammunition for this way of using compiler fences. Alternatively, a convincing argument why compiler fences are the wrong tool for this and I need to do X instead would also be welcome.

@RalfJung (Member, Author) commented Sep 9, 2024

But on Windows and macOS, the preferred/only(?) way is to ask the OS to suspend the target thread entirely and take the sample from another thread, which will generally run on another CPU core.

Do these OSes provide any way to synchronize with the suspended thread?

An OS already has to ensure that migrating a thread to another core does sufficient synchronization and acts as if, logically, the thread continues running on the same core. I assume they have to do the same here. If so, then arguably one can say that a thread X that does things after thread A got suspended (in a "happens-before" sense of "after") is "running on the same core as A", even if it's not the same hardware core.

@hanna-kruppe commented Sep 9, 2024

I've not seen any OS documentation that comes out and says anything like that. This is a very sparsely documented area, e.g., it took me a whole lot of searching to learn that Windows requires you to call GetThreadContext (even if you don't care about its contents) because just calling SuspendThread won't wait for the thread to actually be suspended (edit: substituted better source). I still don't know if there's an equivalent gotcha on macOS, because its equivalent is even less documented.

I agree that migrating a thread between cores faces a similar problem and operating systems solve it somehow. In practical terms, it also seems very likely that they make "suspend another thread and read its thread state" apply enough synchronization (whatever that means on the CPU in question) to make basic tools like profilers work. However, it's not exactly the same as migrating a single thread and I'm not totally sure if that complicates anything under sufficiently weak memory models. For example, if thread migration requires some strong barrier instructions both on the old and the new core, then the OS would also have to do that on the other core when reporting that the thread was suspended, even if a weaker barrier was sufficient for basic message passing. Or maybe there's a fast path that checks a "is this thread running?" flag without necessarily creating enough synchronization. In this case, the semantics of compiler fence might affect whether it needs to emit machine code or not on the relevant platforms.

@RalfJung (Member, Author) commented Sep 9, 2024

A compiler fence never has to emit machine code. That's pretty much its defining property. ;)

Fundamentally, the OS needs to provide some operation that is guaranteed to sync with a previously suspended thread (if the "suspend a thread" function does not do that itself). And it needs to be strong enough to do everything a migration of that thread to the current core would do. I understand OSes don't document that, but this is not something Rust is much involved with. Once that is given, one can make an argument that the "inspecting" thread has basically "taken over the logical core" of the previously running thread and therefore a compiler fence is sufficient.

@hanna-kruppe commented Sep 9, 2024

Well, in practice it sometimes does emit machine code (rust-lang/rust#62256) and on a sufficiently cursed target machine some extra code may be required just to implement whatever Rust ends up choosing as the meaning of "compiler fences" (even if such a target would likely also have crappy codegen for regular loads and stores). It shouldn't, but strong promises about generated machine code are always very dubious.

In any case, I see what you're saying. If compiler fences were defined with respect to logical CPU cores, one could make that argument. I wouldn't love relying on it, but it would be better than if compiler fences only meant anything within a single AM thread (because I don't see how one could argue about that).

@RalfJung (Member, Author) commented Sep 9, 2024

Well, in practice it sometimes does emit machine code (rust-lang/rust#62256)

That seems to bottom out here. I don't know what sw does, but no hardware fence should be needed here. Ofc the fence can change which instructions are emitted for surrounding operations.

But anyway that's kind of off-topic here.

In any case, I see what you're saying. If compiler fences were defined with respect to logical CPU cores, one could make that argument. I wouldn't love relying on it, but it would be better than if compiler fences only meant anything within a single AM thread (because I don't see how one could argue about that).

The challenge for the docs is how to define "logical CPU core". But other than that, I think it is the right definition. So if someone wants to update the docs with an attempt at such a definition, please go ahead. :)

@hanna-kruppe commented Sep 9, 2024

For what it's worth, another use case for cross-thread compiler fences is when they're paired with another thread doing membarrier syscalls (mentioned briefly earlier in the thread) or other ways of achieving the same effect. In those cases, you can't very well argue that the side calling membarrier has taken over the CPU core from the thread using compiler fences, because the latter is not suspended and may in fact be running at the same time. Clearly the OS + hardware establishes some sort of synchronization, but how does that connect to the compiler fence? Perhaps you could pretend that the OS scheduler interrupts issue both a compiler fence and a regular fence every time one fires on a logical core? That seems to imply much more synchronization happening than the AM would like to admit. Maybe it won't prevent any optimizations if userspace code can't rely on those fences because it can't know when they happen (for lack of any shared state with the interrupt).
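
A hypothetical sketch of that pairing on Linux, assuming the libc crate (the command values are copied from linux/membarrier.h; whether the compiler fence formally pairs with the syscall is exactly the open question):

use std::sync::atomic::{compiler_fence, AtomicBool, Ordering};

// Values from linux/membarrier.h.
const MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED: libc::c_long = 16;
const MEMBARRIER_CMD_PRIVATE_EXPEDITED: libc::c_long = 8;

static FLAG: AtomicBool = AtomicBool::new(false);

// Must be called once at startup before the expedited command is used.
fn register() {
    unsafe { libc::syscall(libc::SYS_membarrier, MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) };
}

// Hot side: runs frequently and pays only for a compiler fence.
fn fast_path() {
    FLAG.store(true, Ordering::Relaxed);
    compiler_fence(Ordering::SeqCst);
    // ... work that must not be reordered before the store ...
}

// Cold side: the syscall interrupts every core this process runs on,
// which (the folk argument goes) supplies the hardware barrier the
// fast path omitted.
fn slow_path() {
    unsafe { libc::syscall(libc::SYS_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) };
    compiler_fence(Ordering::SeqCst);
    let _observed = FLAG.load(Ordering::Relaxed);
}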

@RalfJung (Member, Author) commented Sep 9, 2024 via email

@hanna-kruppe commented Sep 9, 2024

It is an asymmetric fence in a way, but IPI is only one possible way of implementing it. Linux membarrier can either do an IPI or it can piggy-back on the RCU grace period (whatever that means, don't ask me). barrierd is a userspace + eBPF solution mostly based on waiting until any interrupt has fired on every core at least once (relying on those being serializing for the CPU).

@chorman0773 (Contributor)

Interrupts are a different concurrency context from the thread they're running on. Synchronizing between an interrupt handler running on a thread and another thread won't do anything to synchronize between the two threads.

@hanna-kruppe

It's not obvious to me why there should be any difference between asking the OS to send my thread a signal every 10ms and doing something funky in the signal handler, or (in bare-metal code) setting up a timer interrupt for every 10 ms and doing the same funky stuff in the interrupt handler. It's not literally exactly the same, but both seem equally foreign to the AM and both have pretty much the same implications for the hardware memory model.

@DemiMarie

There are accelerators on which one isn’t even guaranteed to read what one just wrote to memory unless one uses a barrier instruction.

@bjorn3 (Member) commented Sep 9, 2024

There are accelerators on which one isn’t even guaranteed to read what one just wrote to memory unless one uses a barrier instruction.

For such an accelerator to run Rust, the compiler would need to insert a barrier between every read and every write. The C++ memory model enforces happens-before within a single thread. If the CPU core doesn't guarantee this, the compiler is required to insert barriers as necessary to enforce it.

@DemiMarie

There are accelerators on which one isn’t even guaranteed to read what one just wrote to memory unless one uses a barrier instruction.

For such an accelerator to run Rust, the compiler would need to insert a barrier between every read and every write. The C++ memory model enforces happens-before within a single thread. If the CPU core doesn't guarantee this, the compiler is required to insert barriers as necessary to enforce it.

Given that this would be ludicrously expensive, I suspect that the compiler would instead only allow local memory (which is usually coherent with program execution) to be accessed using Rust pointer syntax, and require access to incoherent memory to use function calls that are opaque to the compiler.

@DemiMarie

This is off-topic, though.

@CAD97 commented Sep 10, 2024

Based on the stack sampling example, I think it could be potentially reasonable to document compiler_fence as synchronization with the concept of thread suspension. Expositionally, if you can determine that a thread of execution is suspended (i.e. not currently executing) through some target-specific manner (e.g. using thread suspension APIs or being an interrupt running on that thread), then the observable state will be consistent with

  • either no writes after any given compiler_fence(Release) have happened or all writes before that fence have, and
  • either no reads after any given compiler_fence(Acquire) have happened or all reads before that fence have

but unfortunately it's on the thread suspension to ensure this is actually true on weak memory targets (e.g. by ensuring all weak operations commit before considering a thread to be fully suspended) as otherwise the weak operations could become visible out of order across the intended ordering of the compiler_fence.

TL;DR: that's a lot of words to say again that synchronizing within a logical CPU core must be transitive across synchronizing with thread suspension on any useful system.
