-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: tight loops should be preemptible #10958
Comments
Another solution is to send a signal to the thread to force
it to be preempted.
i.e. once a thread is stopped, it masks a certain signal, and
if there are some threads running user code that refuse to
stop, we send that signal to the current process. In
runtime.sigtramp, it can force the switch of the goroutine
and then mask the signal.
This solution introduces no overhead to the code, at the
expense of complicating the runtime design.
|
The problem with interrupts is that the goroutine can be at an arbitrary address. The interrupted address is probably at a place where we have no data about where the pointers are. We'd either have to record a lot more frame layout/type/liveness information (including registers), or we'd have to simulate the goroutine forward to a safepoint. Both are challenging. |
The problem of inserting artificial preemption points is reduced
numerical performance.
Yet another solution is to just ignore the problem. User is supposed
to insert runtime.Gosched() calls at appropriate points as needed.
btw, isn't the long term GC plan is to never stop the world?
|
It seems to me that we can preempt at a write barrier. So the only loops we are talking about are those that make no writes to the heap and make no function calls. If we think in terms of adding
then the normal path through the loop will have two extra instructions (add/beq) where the add may be to either a register or a memory location, depending on overall register pressure. This will be measurable in tight loops but for most cases shouldn't be too bad. |
No pointer writes to the heap. But maybe this is enough? If we can preempt at the first write barrier, then even if we let the mutator run it can't modify anything that the GC cares about. So maybe we just set the write barrier enabled bit and let goroutines run. Once a goroutine sees the write barrier bit (how do we force that?) it can't modify any memory the GC cares about. So it is safe to start the GC without waiting for every goroutine to stop. |
True. That's why I suggested unrolling loops, which AFAIK is a standard solution to this problem. However, I think the check can be done in just two instructions, even without adding Ian's loop counter. Just CMP the preempt flag and branch if it's set. That branch will almost never be hit, so it will be highly predictable, and the preempt flag should be in the L1, so the check may in fact be extremely cheap.
This certainly reduces the set of programs that have this problem, but I don't think it actually helps with either of the examples I gave, since numerical kernels probably won't have heap pointer writes, and the runtime test that can deadlock the GC certainly has no pointer writes. |
This is the reference for doing a GC safepoint at every instruction. It is Most systems I know of do what Austin suggests, unroll the loop and insert On Tue, May 26, 2015 at 9:24 PM, Austin Clements [email protected]
|
To be precise, the preemption check seems only at newstack()? For example, package main
import (
"fmt"
"runtime"
"time"
)
func test() {
a := 100
for i := 1; i < 1000; i++ {
a = i*100/i + a
}
}
func main() {
runtime.GOMAXPROCS(1)
go func() {
for {
test()
}
}()
time.Sleep(100 * time.Millisecond)
fmt.Println("hello world")
} The test() is not inline, and the infinite loop calls test(), but it would not preempt at the calls, because the morestack() --> newstack() not involved, so the "hello world" would never be printed. |
test() is not inlined, but since it is a leaf it is promoted to nosplit, so it doesn't have the preemption check any more. Fixing that sounds much easier than the rest of this bug. Maybe we could forbid nosplit promotion for functions called from loops? |
For the reference, the same problem appeared in the HotSpot JVM. I remember two approaches:
This mechanism was ingenious but also insanely complicated. It was abandoned eventually for:
My vote would be for 2) to start in loops where @ianlancetaylor's suggestion doesn't work. I suspect that for all but the smallest long-running inner loops (where unrolling might make sense independently), the performance penalty is acceptable. If this is not good enough, and we don't want to pay the code size cost of unrolling a loop multiple times, here's another idea based on 1): Instead of the backward branch check as in 2) plus unrolling, keep the original loop, and generate (at compile time) a 2nd version of the loop body that ends in a runtime call to suspend itself instead of the backward branch. The code size cost is about the cost of having the loop unrolled twice. When the goroutine needs to run to a safe point, use a signal to temporarily suspend the goroutine, switch the pc to the corresponding pc in the copy of the loop body, continue with execution there and have the code suspend itself at a safe point. There's no dynamic code generation involved, generating the extra loop body is trivial at compile-time, the extra amount of code is less than with loop unrolling, and there's only a little bit of runtime work to modify the pc. The regular code would run at full speed if no garbage collection is needed. |
Another idea, similiar to @ianlancetaylor's idea, is to estimate the cost (in ns) of the loop and only check for required suspension every N loops through that loop body. Once the compiler can unroll/unpeel, that check-every-N logic can be either inlined after the unrolled body, if unrolling is beneficial, or kept when unrolling makes no sense for that loop body. Such logic also reads better when using debuggers. |
@nightlyone The check-every-N seems more complex than just having two extra instructions (compare and branch). And if unrolling is done, that compare-and-branch is already needed only every N loop iterations only (where N is the number of times the loop was unrolled). |
@griesemer not sure how the test will avoid atomic loads, that's why I suggested the check-every-N. So the pseudo-go-code before unrolling would look like this: loop:
// loop-body
if counter > N {
counter = 0
if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
runtime.Gosched()
}
}
counter++
goto loop after unrolling it would look like loop:
// loop-body0
// loop-body2
// ...
// loop-bodyM
if counter > N/M {
counter = 0
if need_stop = atomic.LoadBool(&runtime.getg().need_stop); need_stop {
runtime.Gosched()
}
}
counter++
goto loop So the code inserted would be a constant overhead, but still run every N loops. |
The load doesn't have to be atomic. The current preemption check in the function prologue isn't atomic. One tricky bit with the compare and branch is that the actual preemption point needs to have no live registers. Presumably, for performance reasons, we want the loop to be able to keep things in registers, so the code we branch to on a preempt has to flush any live registers before the preempt and reload them after the preempt. I don't think this will be particularly hard, but it is something to consider, since it might affect which stage of the compiler is responsible for generating this code. |
@aclements Indeed. Which is why perhaps switching the pc to a 2nd version of the loop body that ends in a safe point might not be much more complex and permit the loop to run at full speed in the normal case. |
From the runtime's perspective, I think this would be more complicated because stealing a signal is a logistic pain and we'd have to deal with tables mapping from fast loop PCs to slow loop PCs. The compiler would have to generate these tables. This seems like a very clever plan B, but I think first we should trying adding a no-op compare and branch and see if it's actually a problem for dense numerical kernels. |
The new SSA register allocator handles situations like this well, keeping everything in registers on the common edges and spill/call/restore on the unlikely case. |
The load is not dependent on anything in the loop. Assuming the loop is On Fri, Sep 11, 2015 at 3:38 PM, Keith Randall [email protected]
|
CL https://golang.org/cl/19516 mentions this issue. |
TestCrashDumpsAllThreads carefully sets the number of Ps to one greater than the number of non-preemptible loops it starts so that the main goroutine can continue to run (necessary because of #10958). However, if GC starts, it can take over that one spare P and lock up the system while waiting for the non-preemptible loops, causing the test to eventually time out. This deadlock is easily reproducible if you run the runtime test with GOGC=1. Fix this by forcing GOGC=off when running this test. Change-Id: Ifb22da5ce33f9a61700a326ea92fcf4b049721d1 Reviewed-on: https://go-review.googlesource.com/19516 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Russ Cox <[email protected]>
This adds asynchronous preemption function for amd64 and 386. These functions spill and restore all register state that can be used by user Go code. For the moment we stub out the other arches. For #10958, #24543. Change-Id: I6f93fabe9875f4834922a5712362e79045c00aca Reviewed-on: https://go-review.googlesource.com/c/go/+/201759 Run-TryBot: Austin Clements <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
This adds support for scanning the stack when a goroutine is stopped at an async safe point. This is not yet lit up because asyncPreempt is not yet injected, but prepares us for that. This works by conservatively scanning the registers dumped in the frame of asyncPreempt and its parent frame, which was stopped at an asynchronous safe point. Conservative scanning works by only marking words that are pointers to valid, allocated heap objects. One complication is pointers to stack objects. In this case, we can't determine if the stack object is still "allocated" or if it was freed by an earlier GC. Hence, we need to propagate the conservative-ness of scanning stack objects: if all pointers found to a stack object were found via conservative scanning, then the stack object itself needs to be scanned conservatively, since its pointers may point to dead objects. For #10958, #24543. Change-Id: I7ff84b058c37cde3de8a982da07002eaba126fd6 Reviewed-on: https://go-review.googlesource.com/c/go/+/201761 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
This adds support for pausing a running G by sending a signal to its M. The main complication is that we want to target a G, but can only send a signal to an M. Hence, the protocol we use is to simply mark the G for preemption (which we already do) and send the M a "wake up and look around" signal. The signal checks if it's running a G with a preemption request and stops it if so in the same way that stack check preemptions stop Gs. Since the preemption may fail (the G could be moved or the signal could arrive at an unsafe point), we keep a count of the number of received preemption signals. This lets stopG detect if its request failed and should be retried without an explicit channel back to suspendG. For #10958, #24543. Change-Id: I3e1538d5ea5200aeb434374abb5d5fdc56107e53 Reviewed-on: https://go-review.googlesource.com/c/go/+/201760 Run-TryBot: Austin Clements <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
This adds signal-based preemption to preemptone. Since STW and forEachP ultimately use preemptone, this also makes these work with async preemption. This also makes freezetheworld more robust so tracebacks from fatal panics should be far less likely to report "goroutine running on other thread; stack unavailable". For #10958, #24543. (This doesn't fix it yet because asynchronous preemption only works on POSIX platforms on 386 and amd64 right now.) Change-Id: If776181dd5a9b3026a7b89a1b5266521b95a5f61 Reviewed-on: https://go-review.googlesource.com/c/go/+/201762 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
This adds a test of preempting a loop containing no synchronous safe points for STW and stack scanning. We couldn't add this test earlier because it requires scheduler, STW, and stack scanning preemption to all be working. For #10958, #24543. Change-Id: I73292db78ca3d14aab11bdafd26d03986920ef0a Reviewed-on: https://go-review.googlesource.com/c/go/+/201777 Run-TryBot: Austin Clements <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
Change https://golang.org/cl/204340 mentions this issue: |
Change https://golang.org/cl/207779 mentions this issue: |
Change https://golang.org/cl/207961 mentions this issue: |
On Windows, there is currently a race between unminit closing the thread's handle and profileloop1 suspending the thread using its handle. If another handle reuses the same handle value, this can lead to unpredictable results. To fix this, we protect the thread handle with a lock and duplicate it under this lock in profileloop1 before using it. This is going to become a much bigger problem with non-cooperative preemption (#10958, #24543), which uses the same basic mechanism as profileloop1. Change-Id: I9d62b83051df8c03f3363344438e37781a69ce16 Reviewed-on: https://go-review.googlesource.com/c/go/+/207779 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Cherry Zhang <[email protected]> Reviewed-by: Alex Brainman <[email protected]>
This implements preemptM on Windows using SuspendThead and ResumeThread. Unlike on POSIX platforms, preemptM on Windows happens synchronously. This means we need a make a few other tweaks to suspendG: 1. We need to CAS the G back to _Grunning before doing the preemptM, or there's a good chance we'll just catch the G spinning on its status in the runtime, which won't be preemptible. 2. We need to rate-limit preemptM attempts. Otherwise, if the first attempt catches the G at a non-preemptible point, the busy loop in suspendG may hammer it so hard that it never makes it past that non-preemptible point. Updates #10958, #24543. Change-Id: Ie53b098811096f7e45d864afd292dc9e999ce226 Reviewed-on: https://go-review.googlesource.com/c/go/+/204340 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Cherry Zhang <[email protected]>
Change https://golang.org/cl/208218 mentions this issue: |
Go 1.14 will handle loop preemption far better than previous versions, but there are still some loose ends to tie up in 1.15, so I'm bumping this issue to 1.15. |
Actually, rather than bump this to 1.15, since tight loops are preemptible now, I've created a new issue to keep track of loose ends (#36365) and will go ahead and close this issue as fixed. |
Change https://golang.org/cl/213837 mentions this issue: |
On Windows, we implement asynchronous preemption using SuspendThread to suspend other threads in our process. However, SuspendThread is itself actually asynchronous (it enqueues a kernel "asynchronous procedure call" and returns). Unfortunately, Windows' ExitProcess API kills all threads except the calling one and then runs APCs. As a result, if SuspendThread and ExitProcess are called simultaneously, the exiting thread can be suspended and the suspending thread can be exited, leaving behind a ghost process consisting of a single thread that's suspended. We've already protected against the runtime's own calls to ExitProcess, but if Go code calls external code, there's nothing stopping that code from calling ExitProcess. For example, in #35775, our own call to racefini leads to C code calling ExitProcess and occasionally causing a deadlock. This CL fixes this by introducing synchronization between calling external code on Windows and preemption. It adds an atomic field to the M that participates in a simple CAS-based synchronization protocol to prevent suspending a thread running external code. We use this to protect cgocall (which is used for both cgo calls and system calls on Windows) and racefini. Tested by running the flag package's TestParse test compiled in race mode in a loop. Before this change, this would reliably deadlock after a few minutes. Fixes #35775. Updates #10958, #24543. Change-Id: I50d847abcdc2688b4f71eee6a75eca0f2fee892c Reviewed-on: https://go-review.googlesource.com/c/go/+/213837 Run-TryBot: Austin Clements <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Cherry Zhang <[email protected]> Reviewed-by: David Chase <[email protected]>
Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.
In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.
One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.
This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.
@RLH
The text was updated successfully, but these errors were encountered: