runtime: scanobject can use incorrect span metadata #11267

Closed
aclements opened this issue Jun 18, 2015 · 7 comments

@aclements (Member)

When scanobject looks up the span for a pointer it read from the object it's scanning, it's possible for it to get incorrect information for the span, causing it to ignore the pointer, or, much worse, attempt to mark uninitialized memory.

  1. scanobject loads a pointer slot out of the object it's scanning, and this happens to be one of the special pointers from the heap into a stack. Call the pointer p and suppose it points into X's stack.
  2. X, running on another thread, grows its stack and frees its old stack.
  3. The old stack happens to be large or was the last stack in its span, so X frees this span, setting it to state _MSpanFree.
  4. scanobject calls heapBitsForObject, which loads the span containing p, which is now in state _MSpanFree, so it ignores it. (There's currently a disabled test here that would panic; see issue #10591, "runtime: reenable bad pointer check in GC".)

or

  1. The span gets reused as a heap span.
  2. scanobject calls heapBitsForObject, which loads the span containing p, which is now in state _MSpanInUse, but doesn't necessarily have an object at p. The not-object at p gets marked, and at this point all sorts of things can go wrong.
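
Concretely, both scenarios exploit the same check-then-use window. Here's a minimal, self-contained sketch of that window (the types, the spans map, and lookupObject below are simplified stand-ins, not the runtime's actual code):

```go
package main

// Simplified stand-ins for the runtime's span bookkeeping.
type spanState int

const (
	mSpanFree  spanState = iota // span returned to the heap
	mSpanInUse                  // span holds heap objects
	mSpanStack                  // span holds a goroutine stack
)

type mspan struct {
	base, limit uintptr
	state       spanState
}

// spans stands in for the runtime's page-indexed h_spans table.
var spans map[uintptr]*mspan

func spanOf(p uintptr) *mspan { return spans[p>>13] } // 13 ~ _PageShift

// lookupObject mimics the shape of heapBitsForObject's span check.
func lookupObject(p uintptr) *mspan {
	s := spanOf(p)
	// RACE WINDOW: p was loaded from a heap object earlier. Between
	// that load and this check, another thread can grow its stack and
	// free the old one. If the span is now mSpanFree (scenario 1),
	// the pointer is silently ignored; if the span was reused for the
	// heap (scenario 2), the check passes but p names a not-object.
	if s == nil || p < s.base || p >= s.limit || s.state != mSpanInUse {
		return nil
	}
	return s
}

func main() {}
```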

This is quite easy to reproduce with a well-placed sleep:

```diff
diff --git a/src/runtime/mbitmap.go b/src/runtime/mbitmap.go
index 146ffbf..fe36f3a 100644
--- a/src/runtime/mbitmap.go
+++ b/src/runtime/mbitmap.go
@@ -190,6 +190,9 @@ func heapBitsForObject(p uintptr) (base uintptr, hbits heapBits, s *mspan) {
        // Consult the span table to find the block beginning.
        k := p >> _PageShift
        s = h_spans[idx]
+       if s != nil && s.state == _MSpanStack {
+               usleep(100)
+       }
        if s == nil || pageID(k) < s.start || p >= s.limit || s.state != mSpanInUse {
                if s == nil || s.state == _MSpanStack {
                        // If s is nil, the virtual address has never been part of the heap.
@@ -201,7 +204,7 @@ func heapBitsForObject(p uintptr) (base uintptr, hbits heapBits, s *mspan) {
                // The following ensures that we are rigorous about what data
                // structures hold valid pointers.
                // TODO(rsc): Check if this still happens.
-               if false {
+               if true {
                        // Still happens sometimes. We don't know why.
                        printlock()
                        print("runtime:objectstart Span weird: p=", hex(p), " k=", hex(k))
```

With this sleep, GOGC=10 ./runtime.test -test.short crashes reliably for me.

@RLH @rsc

@aclements (Member, Author)

I can think of a few possible ways to fix this.

  1. Somehow synchronize stack growth and scanobject. I don't think this is a good idea unless we can find a way that doesn't slow down scanobject.
  2. Delay freeing stack spans during a GC until the end of that GC. This would give stack spans the same monotonic property that makes heapBitsForObject safe on heap spans. I think we do something like this for some runtime data structure, though I forget what it is. This has the downside that, if the user is creating goroutines at a rapid clip during a GC, they'll eat up memory faster than they do now.
  3. Replace all heap-to-stack pointers with uintptrs. The garbage collector simply won't trace these, and it can even blow up if it finds a pointer into a stack span.

3 seems like the cleanest solution, assuming it doesn't have other ramifications.
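
For illustration, option 3 at a use site might look like this sketch (gobufLike and its field are hypothetical stand-ins, not the runtime's actual declarations):

```go
package main

import "unsafe"

// gobufLike is a hypothetical stand-in for a runtime structure that
// today holds a real pointer into a goroutine's stack.
type gobufLike struct {
	// Stored as uintptr rather than as a typed pointer: the garbage
	// collector does not trace uintptr fields, so scanobject would
	// never follow this value into a stack span.
	sp uintptr
}

func main() {
	var local int
	buf := gobufLike{sp: uintptr(unsafe.Pointer(&local))}
	_ = buf
}
```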

@aclements (Member, Author)

> ... and it can even blow up if it finds a pointer into a stack span.

... from a heap object, that is. This code path is also used for scanning stacks, which can of course contain pointers into themselves (and, in the runtime, maybe pointers into another stack). We don't want to blow up on those. But if we're not scanning a stack, we shouldn't see a stack pointer.

@randall77 (Contributor)

We already do #2. See stack1.go:620-628.
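
A compilable sketch of the shape of that code, for readers without the source at hand (the names and the queue representation are illustrative, not the actual stack1.go identifiers):

```go
package main

type stack struct{ lo, hi uintptr }

var stackFreeQueue []stack // illustrative stand-in for the deferred-free queue

func stackfree(s stack) { /* return the memory to the stack pools now */ }

// freeOldStack shows the asymmetry discussed below: the shrinking
// branch (shrinks are performed by the GC itself) defers the free
// until the end of mark termination, but the growing branch frees
// immediately, which is the branch that races with scanobject.
func freeOldStack(old stack, growing bool) {
	if growing {
		stackfree(old) // immediate free: the racy branch
	} else {
		stackFreeQueue = append(stackFreeQueue, old) // freed after GC
	}
}

func main() {}
```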

@aclements (Member, Author)

I see. Did we just miss the fact that the same logic applies equally to the growing branch of that if?

Still, the number of stacks that can shrink during a GC cycle is limited, since the GC is in charge of doing the shrink; these zombie stacks can't last more than the duration of mark termination. If we do the same for growing stacks, user code can leak an arbitrary amount of stack memory during a GC cycle, and we don't have a mechanism for pushing back on that like we do for pushing back on allocation. Hence, I'm still leaning toward solution 3. I think that may also eliminate the need for the delayed free of shrunk stacks.

@aclements (Member, Author)

I took a stab at 3 and quickly ran into pointers that can point into either the stack or the heap for deep reasons (like g.sched.ctxt), which throws a monkey wrench into this solution.

But there's a fourth solution that's like delayed freeing of stacks, but that I think solves its downsides: free stacks immediately, but delay freeing stack spans. It's perfectly fine to reuse the stacks. It's fine to reuse the stack spans. The danger is only the state transition from stack span to free span (and possibly to heap span after that), so that's all we need to prevent.

This looks easy to do for small stacks: just don't immediately free them in stackpoolfree when ref==0. At the end of mark termination, walk the stackpools and free spans with ref==0. For large stacks, I'm not sure what the best way to do this is that also lets us reuse that memory for new stacks.
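
A minimal sketch of that idea for the small-stack pools (gcRunning, the ref field, and the helper names are assumptions for illustration, not actual runtime code):

```go
package main

type mspan struct {
	ref  int // stacks still allocated out of this span
	next *mspan
}

var (
	stackpool *mspan // illustrative singly-linked pool of small-stack spans
	gcRunning bool   // stand-in for checking the GC phase
)

func freeSpanToHeap(s *mspan) { /* transition the span to _MSpanFree */ }

// stackpoolfree sketches the proposal: while GC is running, a span
// whose ref count hits zero stays in the pool, so its memory can still
// be reused for new stacks but never transitions to _MSpanFree under a
// concurrent scanobject.
func stackpoolfree(s *mspan) {
	s.ref--
	if s.ref == 0 && !gcRunning {
		freeSpanToHeap(s)
	}
}

// sweepStackPools runs at the end of mark termination and frees the
// spans that stackpoolfree held back.
func sweepStackPools() {
	for s := stackpool; s != nil; s = s.next {
		if s.ref == 0 {
			freeSpanToHeap(s)
		}
	}
}

func main() {}
```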

@aclements (Member, Author)

I chatted with @rsc about this. The plan is to simply return small stacks to the stack cache without freeing their spans until the end of GC. That way small stacks can be reused, so a program rapidly creating and exiting goroutines won't chew through memory. This should be a fairly small change. For large stacks, for now, we'll simply delay freeing them and not attempt to reuse that memory during a GC cycle. We should be able to reuse the existing delayed-free mechanism for this, so that should also be a low-risk change. When 1.6 opens, we can reconsider whether and how to reuse the memory of large stacks.
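
The large-stack half of that plan might look like the following sketch (the queue representation and names are assumptions for illustration; the actual change is in the CL mentioned below):

```go
package main

type stack struct{ lo, hi uintptr }

var largeStackFreeQueue []stack // flushed when the GC cycle ends

func releaseToHeap(s stack) { /* return the span to the heap */ }

// freeLargeStack sketches the plan for large stacks: during a GC cycle
// the memory is not reused at all, only queued; outside a cycle it is
// released immediately.
func freeLargeStack(s stack, gcRunning bool) {
	if gcRunning {
		largeStackFreeQueue = append(largeStackFreeQueue, s)
		return
	}
	releaseToHeap(s)
}

func main() {}
```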

@gopherbot (Contributor)

CL https://golang.org/cl/11502 mentions this issue.

golang locked and limited conversation to collaborators Jun 28, 2016