tracing: pool and reuse spans

This patch adds sync.Pool-ing and reuse to tracing spans, so that they don't need to be dynamically allocated on every span creation. This has a big effect on the heap footprint. Conceptually, a span is made available for reuse on Finish(). In practice, it's more complicated because there can still be references of the span used concurrently with Finish(), either internally in the tracing library or externally. The external ones are bugs by definition, but we want to avoid some particularly nasty consequences of such bugs. The BenchmarkTracing results below show that this saves around 10KB worth of heap allocations per simple query, when tracing is enabled (minimal tracing: TracingModeActiveSpansRegistry). In the single-node case, this is 25% of the query's allocations. As can be seen in the benchmarks below in the differences between the trace=on and trace=off rows, the impact of tracing is big on memory footprint; with this patch, there's not much impact. ``` name old alloc/op new alloc/op delta Tracing/1node/scan/trace=off-32 19.7kB ± 1% 19.7kB ± 1% ~ (p=0.768 n=10+5) Tracing/1node/scan/trace=on-32 29.2kB ± 0% 22.0kB ± 0% -24.85% (p=0.001 n=10+5) Tracing/1node/insert/trace=off-32 38.5kB ± 1% 38.4kB ± 1% ~ (p=0.440 n=10+5) Tracing/1node/insert/trace=on-32 45.5kB ± 1% 38.7kB ± 1% -15.03% (p=0.001 n=10+5) Tracing/3node/scan/trace=off-32 68.1kB ± 3% 67.9kB ± 3% ~ (p=0.768 n=10+5) Tracing/3node/scan/trace=on-32 86.8kB ± 2% 75.3kB ± 2% -13.21% (p=0.001 n=9+5) Tracing/3node/insert/trace=off-32 88.1kB ± 5% 90.8kB ± 7% ~ (p=0.112 n=9+5) Tracing/3node/insert/trace=on-32 96.1kB ± 3% 89.0kB ± 2% -7.39% (p=0.001 n=9+5) ``` Unfortunately, pooling spans only saves on the size of allocations, not the number of allocations. This is because the Context in which a Span lives still needs to be allocated dynamically, as it does not have a clear lifetime and so it cannot be re-used (plus it's immutable, etc). Before this patch, the code was optimized to allocate a Span and a Context together, through trickery (we had a dedicated Context type, which we now get rid of). So, this patch replaces an allocation for Span+Context with just a Context allocation, which is a win because Spans are big and Contexts are small. Micro-benchmarks only show minor improvements in the time/op, but the memory improvements are so large that I think they must translate into sufficient GC pressure wins to be worth doing. This patch brings us very close to enabling the TracingModeActiveSpansRegistry tracing mode by default in production - which would give us a registry of all in-flight spans/operations in the system. ### Interactions with use-after-Finish detection Span reuse interacts with the recently-introduced span-use-after-Finish detection. Spans are made available for reuse on Finish (technically, when certain references to the span have been drained; see below). When a span is reused in between Finish() and an erroneous use of the Finish()ed span, this bug cannot be detected and results in the caller operating on an unintended span. This can result in the wrong log message apearing in the wrong span, and such. Care has been taken so that use-after-Finish bugs do not result in more structural problems, such as loops in the parent-child relationships. ### Technical details The mechanism used for making spans available for reuse is reference counting; the release of a span to the pool is deferred to the release of the last counted reference. Counted references are held: - internally: children hold references to the parent and the parent holds references to the children. - externally: the WithParent(sp) option takes a reference on sp. Apart from WithParent, clients using Spans do not track their references because it'd be too burdensome to require all the references to have a clearly defined life cycle, and to be explicitly released when they're no longer in use. For clients, the contract is that a span can be used until Finish(). WithParent is special, though; see below. Different alternatives to reference counting were explored. In particular, instead of a deferred-release scheme, an alternative was a fat-pointer scheme where references that can outlive a span are tagged with the span's "generation". That works for the internal use cases, but the problem with this scheme is that WithParent(s) ends up allocating - which I want to avoid. Right now, WithParent(s) returns a pointer as an interface, which doesn't allocate. But if the pointer gets fat, it no longer fits into the class of things that can be put in interfaces without allocating. The reference counter is an atomic; it is not protected by the span's lock because a parent's counter needs to be accessed under both the parent's lock and the children's locks. In details, the reference counter serves a couple of purposes: 1) Prevent re-allocation of a Span while child spans are still operating on it. In particular, this ensures that races between Finish()ing a parent and a child cannot result in the child operating on a re-allocated parent. Because of the span's lock ordering convention, a child cannot hold its lock while operating on the parent. During Finish(), the child drops its lock and informs the parent that it is Finish()ing. If the parent Finish()es at the same time, that call could erroneously conclude that the parent can be made available for re-use, even through the child goroutine has a pending call into the parent. 2) Prevent re-allocation of child spans while a Finish()ing parent is in the process of transforming the children into roots and inserting them into the active spans registry. Operating on the registry is done without holding any span's lock, so a race with a child's Finish() could result in the registry operating on a re-allocated span. 3) Prevent re-allocation of a Span in between the time that WithParent(s) captures a reference to s and the time when the parent option is used to create the child. Such an inopportune reuse of a span could only happen is the span is Finish()ed concurrently with the creation of its child, which is illegal. Still, we optionally tolerate use-after-Finishes, and this use cannot be tolerated with the reference count protection. Without this protection, the tree of spans in a trace could degenerate into a graph through the introduction of loops. A loop could lead to deadlock due to the fact that we lock multiple spans at once. The lock ordering convention is that the parent needs to be locked before the child, which ensures deadlock-freedom in the absence of loops. For example: 1. parent := tr.StartSpan() 2. parentOpt := WithParent(parent) 3. parent.Finish() 4. child := tr.StartSpan(parentOpt) If "parent" would be re-allocated as "child", then child would have itself as a parent. The use of parentOpt in step 4) after parent was finished in step 3) is a use-after-Finish of parent; it is illegal and, if detection is enabled, it might be detected as such. However, if span pooling and re-use is enabled, then the detection is not realiable (it does not catch cases where a span is re-used before a reference to it taken before the prior Finish() is used). A span having itself as a parent is just the trivial case of the problem; loops of arbitrary length are also possible. For example, for a loop of length 2: 1. Say we have span A as a parent with span B as a child (A -> B). 2. parentA := WithParent(A) 3. parentB := WithParent(B) 4. A.Finish(); B.Finish(); 5. X := tr.StartSpan(parentA); Y := tr.StartSpan(parentB); If B is re-used as X, and A is re-used as Y, we get the following graph: B<-┐ | | └->A We avoid these hazards by having WithParent(s) increment s' reference count, so spans are not re-used while the creation of a child is pending. Spans can be Finish()ed while the creation of the child is pending, in which case the creation of the child will reliably detect the use-after-Finish (and turn it into a no-op if configured to tolerate such illegal uses). Release note: None
cockroachdb · Dec 27, 2021 · 77ab456 · 77ab456
1 parent bb5588d
commit 77ab456
Show file tree

Hide file tree

Showing 12 changed files with 872 additions and 169 deletions.
diff --git a/pkg/util/constants.go b/pkg/util/constants.go
@@ -95,7 +95,7 @@ func init() {
 
 // ConstantWithMetamorphicTestRange is like ConstantWithMetamorphicTestValue
 // except instead of returning a single metamorphic test value, it returns a
-// random test value in a range.
+// random test value in the semi-open range [min, max).
 //
 // The given name is used for logging.
 func ConstantWithMetamorphicTestRange(name string, defaultValue, min, max int) int {

diff --git a/pkg/util/tracing/BUILD.bazel b/pkg/util/tracing/BUILD.bazel
@@ -20,6 +20,7 @@ go_library(
     visibility = ["//visibility:public"],
     deps = [
         "//pkg/settings",
+        "//pkg/util",
         "//pkg/util/buildutil",
         "//pkg/util/envutil",
         "//pkg/util/iterutil",
@@ -61,17 +62,27 @@ go_test(
     srcs = [
         "bench_test.go",
         "grpc_interceptor_test.go",
+        "main_test.go",
         "span_test.go",
         "tags_test.go",
+        "tracer_external_test.go",
         "tracer_test.go",
     ],
     embed = [":tracing"],
     deps = [
+        "//pkg/base",
+        "//pkg/security",
+        "//pkg/security/securitytest",
+        "//pkg/server",
         "//pkg/testutils",
         "//pkg/testutils/grpcutils",
+        "//pkg/testutils/serverutils",
+        "//pkg/testutils/skip",
+        "//pkg/testutils/sqlutils",
         "//pkg/util",
         "//pkg/util/iterutil",
         "//pkg/util/leaktest",
+        "//pkg/util/log",
         "//pkg/util/stop",
         "//pkg/util/timeutil",
         "//pkg/util/tracing/tracingpb",
@@ -86,5 +97,6 @@ go_test(
         "@org_golang_google_grpc//:go_default_library",
         "@org_golang_google_grpc//metadata",
         "@org_golang_x_net//trace",
+        "@org_golang_x_sync//errgroup",
     ],
 )
diff --git a/pkg/util/tracing/context.go b/pkg/util/tracing/context.go
@@ -29,31 +29,20 @@ func SpanFromContext(ctx context.Context) *Span {
 	return nil
 }
 
-// optimizedContext is an implementation of context.Context special
-// cased to carry a Span under activeSpanKey{}. By making an explicit
-// type we unlock optimizations that save allocations by allocating
-// the optimizedContext together with the Span it eventually carries.
-type optimizedContext struct {
-	context.Context
-	sp *Span
-}
-
-func (ctx *optimizedContext) Value(k interface{}) interface{} {
-	if k == (interface{}(activeSpanKey{})) {
-		return ctx.sp
-	}
-	return ctx.Context.Value(k)
-}
-
 // maybeWrapCtx returns a Context wrapping the Span, with two exceptions:
 // 1. if ctx==noCtx, it's a noop
 // 2. if ctx contains the noop Span, and sp is also the noop Span, elide
 //    allocating a new Context.
 //
-// If a non-nil octx is passed in, it forms the returned Context. This can
-// avoid allocations if the caller is able to allocate octx together with
-// the Span, as is commonly possible when StartSpanCtx is used.
-func maybeWrapCtx(ctx context.Context, octx *optimizedContext, sp *Span) (context.Context, *Span) {
+// NOTE(andrei): Our detection of Span use-after-Finish() is not reliable
+// because spans are reused through a sync.Pool; we fail to detect someone
+// holding a reference to a Span (e.g. a Context referencing the span) from
+// before reuse and then using it after span reuse. We could make the detection
+// more reliable by storing the generation number of the span in the Context
+// along with the Span and checking it every time the Span is retrieved from the
+// Context. We'd have to implement our own Context struct so that the Context
+// and the generation number can be allocated together.
+func maybeWrapCtx(ctx context.Context, sp *Span) (context.Context, *Span) {
 	if ctx == noCtx {
 		return noCtx, sp
 	}
@@ -72,16 +61,11 @@ func maybeWrapCtx(ctx context.Context, octx *optimizedContext, sp *Span) (contex
 			return ctx, sp
 		}
 	}
-	if octx != nil {
-		octx.Context = ctx
-		octx.sp = sp
-		return octx, sp
-	}
 	return context.WithValue(ctx, activeSpanKey{}, sp), sp
 }
 
 // ContextWithSpan returns a Context wrapping the supplied Span.
 func ContextWithSpan(ctx context.Context, sp *Span) context.Context {
-	ctx, _ = maybeWrapCtx(ctx, nil /* octx */, sp)
+	ctx, _ = maybeWrapCtx(ctx, sp)
 	return ctx
 }
diff --git a/pkg/util/tracing/crdbspan.go b/pkg/util/tracing/crdbspan.go
@@ -50,13 +50,15 @@ type crdbSpan struct {
 	// tag's key to a user.
 	logTags *logtags.Buffer
 
-	// Locking rules: if locking both a parent and a child, the parent must be
-	// locked first. In practice, children don't take the parent's lock.
+	// Locking rules:
+	// - If locking both a parent and a child, the parent must be locked first. In
+	// practice, children don't take the parent's lock.
+	// - The active spans registry's lock must be acquired before this lock.
 	mu crdbSpanMu
 }
 
 type childRef struct {
-	*crdbSpan
+	spanRef
 	// collectRecording is set if this child's recording should be included in the
 	// parent's recording. This is usually the case, except for children created
 	// with the WithDetachedRecording() option.
@@ -74,9 +76,15 @@ type crdbSpanMu struct {
 	// Note that parent is mutable; a span can start by having a parent but then,
 	// if the parent finishes before the child does (which is uncommon), the
 	// child's parent is set to nil.
-	parent *crdbSpan
+	//
+	// While parent is set, this child is holding a reference in the parent's
+	// reference counter. The parent's ref count is decremented when this child
+	// Finish()es, or otherwise when this pointer is nil'ed (i.e. on parent
+	// Finish()).
+	parent spanRef
 
 	finished bool
+
 	// duration is initialized to -1 and set on Finish().
 	duration time.Duration
 
@@ -151,32 +159,42 @@ type sizeLimitedBuffer struct {
 	limit int64 // in bytes
 }
 
+func (buf *sizeLimitedBuffer) Reset() {
+	buf.Buffer.Reset()
+	buf.size = 0
+}
+
 // finish marks the span as finished. Further operations on the span are not
 // allowed. Returns false if the span was already finished.
 //
-// TODO(andrei): The intention is for a span to be available for reuse (e.g.
-// through a sync.Pool) after finish(), although we're not currently taking
-// advantage of this. I think there might be users that collect a span's
-// recording after finish(), which should be illegal. For now, use-after-finish
-// is generally tolerated - and that's also why this method returns false when
-// called a second time.
+// Calling finish() a second time is illegal, as is any use-after-finish().
+// Still, the Tracer can be configured to tolerate such uses. If the Tracer was
+// configured to not tolerate use-after-Finish, we would have crashed before
+// calling this.
 func (s *crdbSpan) finish() bool {
-	var children []*crdbSpan
-	var parent *crdbSpan
-	var needRegistryChange bool
+	// Finishing involves the following steps:
+	// 1) Take the lock and capture a reference to the parent.
+	// 2) Operate on the parent outside of the lock.
+	// 3) Take the lock again, operate on the children under the lock, and also
+	//    capture references to the children for further operations outside of the
+	//    lock.
+	// 4) Insert the children into the active spans registry outside of the lock.
+	//
+	// We could reorder things such that the lock is only taken once, but it
+	// results in more awkward code because operating on the s' parent expects to
+	// find s' children in place, to collect their recordings.
+
+	var parent spanRef
 	{
 		s.mu.Lock()
 		if s.mu.finished {
-			// Already finished.
+			// Already finished (or at least in the process of finish()ing). This
+			// check ensures that only one caller performs cleanup for this span. We
+			// don't want the span to be re-allocated while finish() is running.
 			s.mu.Unlock()
 			return false
 		}
 		s.mu.finished = true
-		// If the span is not part of the registry now, it never will be. So, we'll
-		// need to remove it from the registry only if it currently does not have a
-		// parent. We'll also need to manipulate the registry if there are open
-		// children (they'll need to be added to the registry).
-		needRegistryChange = s.mu.parent == nil || len(s.mu.openChildren) > 0
 
 		if s.recordingType() != RecordingOff {
 			duration := timeutil.Since(s.startTime)
@@ -186,26 +204,58 @@ func (s *crdbSpan) finish() bool {
 			s.mu.duration = duration
 		}
 
-		// Shallow-copy the children so they can be processed outside the lock.
-		children = make([]*crdbSpan, len(s.mu.openChildren))
-		for i, c := range s.mu.openChildren {
-			children[i] = c.crdbSpan
-		}
-
-		// We'll operate on the parent outside of the child's lock.
-		parent = s.mu.parent
-
+		// We'll operate on the parent below, outside the child's lock, as per the
+		// lock ordering convention between parents and children. The parent might
+		// get Finish()ed by the time we call parent.childFinished(s) on it below;
+		// that's OK because we're going to hold on taking a reference in the
+		// parent's reference counter. Notice that we move the reference out of
+		// s.mu.parent; leaving it there would not work because s.mu.parent can be
+		// released by s.parentFinished() after we drop our lock.
+		parent = s.mu.parent.move()
 		s.mu.Unlock()
 	}
 
-	if parent != nil {
-		parent.childFinished(s)
+	// Operate on the parent outside the child (our current receiver) lock.
+	// childFinished() might call back into the child and acquire the child's
+	// lock.
+	if !parent.empty() {
+		parent.Span.i.crdb.childFinished(s)
+		parent.release()
 	}
 
-	// Deal with the orphaned children - make them roots.
-	for _, c := range children {
-		c.parentFinished()
+	// Operate on children.
+	var children []spanRef
+	var needRegistryChange bool
+	{
+		s.mu.Lock()
+
+		// If the span is not part of the registry now, it never will be. So, we'll
+		// need to remove it from the registry only if it currently does not have a
+		// parent. We'll also need to manipulate the registry if there are open
+		// children (they'll need to be added to the registry).
+		needRegistryChange = s.mu.parent.empty() || len(s.mu.openChildren) > 0
+
+		// Deal with the orphaned children - make them roots. We call into the
+		// children while holding the parent's lock. As per the span locking
+		// convention, that's OK (but the reverse isn't).
+		//
+		// We also shallow-copy the children for operating on them outside the lock.
+		children = make([]spanRef, len(s.mu.openChildren))
+		for i, c := range s.mu.openChildren {
+			c.parentFinished()
+			// Move ownership of the child reference, and also nil out the pointer to
+			// the child, making it available for GC.
+			children[i] = c.spanRef.move()
+		}
+		// Make the openChildren array available for GC if it has grown from the
+		// initial capacity provided by the spanAllocHelper. We don't want to reuse
+		// it in case it has grown too much, so that we don't accumulate many large
+		// arrays.
+		s.mu.openChildren = nil
+
+		s.mu.Unlock()
 	}
+
 	if needRegistryChange {
 		// Atomically replace s in the registry with all of its still-open children.
 		s.tracer.activeSpansRegistry.swap(s.spanID, children)
@@ -297,7 +347,8 @@ func (s *crdbSpan) getVerboseRecording(includeDetachedChildren bool, finishing b
 
 	for _, child := range s.mu.openChildren {
 		if child.collectRecording || includeDetachedChildren {
-			result = append(result, child.getVerboseRecording(includeDetachedChildren, false /* finishing */)...)
+			sp := child.Span.i.crdb
+			result = append(result, sp.getVerboseRecording(includeDetachedChildren, false /* finishing */)...)
 		}
 	}
 	s.mu.Unlock()
@@ -331,7 +382,8 @@ func (s *crdbSpan) getStructuredRecording(includeDetachedChildren bool) Recordin
 	}
 	for _, c := range s.mu.openChildren {
 		if c.collectRecording || includeDetachedChildren {
-			buffer = c.getStructuredEventsRecursively(buffer, includeDetachedChildren)
+			sp := c.Span.i.crdb
+			buffer = sp.getStructuredEventsRecursively(buffer, includeDetachedChildren)
 		}
 	}
 
@@ -510,7 +562,8 @@ func (s *crdbSpan) getStructuredEventsRecursively(
 	buffer = s.getStructuredEventsLocked(buffer)
 	for _, c := range s.mu.openChildren {
 		if c.collectRecording || includeDetachedChildren {
-			buffer = c.getStructuredEventsRecursively(buffer, includeDetachedChildren)
+			sp := c.Span.i.crdb
+			buffer = sp.getStructuredEventsRecursively(buffer, includeDetachedChildren)
 		}
 	}
 	for _, c := range s.mu.recording.finishedChildren {
@@ -629,7 +682,7 @@ func (s *crdbSpan) getRecordingNoChildrenLocked(
 // The adding fails if the receiver has already Finish()ed. This should never
 // happen, since using a Span after Finish() is illegal. But still, we
 // defensively return false.
-func (s *crdbSpan) addChildLocked(child *crdbSpan, collectChildRec bool) bool {
+func (s *crdbSpan) addChildLocked(child *Span, collectChildRec bool) bool {
 	s.mu.AssertHeld()
 
 	if s.mu.finished {
@@ -638,39 +691,54 @@ func (s *crdbSpan) addChildLocked(child *crdbSpan, collectChildRec bool) bool {
 
 	s.mu.openChildren = append(
 		s.mu.openChildren,
-		childRef{crdbSpan: child, collectRecording: collectChildRec},
+		childRef{spanRef: makeSpanRef(child), collectRecording: collectChildRec},
 	)
 	return true
 }
 
 // childFinished is called when a child is Finish()ed. Depending on the
 // receiver's recording mode, the child is atomically removed and replaced it
-// with its recording. This allows the child span to be reused (since the parent
-// no longer references it).
+// with its recording.
 //
 // child is the child span that just finished.
 func (s *crdbSpan) childFinished(child *crdbSpan) {
 	s.mu.Lock()
 	defer s.mu.Unlock()
+
+	if s.mu.finished {
+		return
+	}
+
 	var childIdx int
 	found := false
 	for i, c := range s.mu.openChildren {
-		if c.crdbSpan == child {
+		sp := c.Span.i.crdb
+		if sp == child {
 			childIdx = i
 			found = true
 			break
 		}
 	}
 	if !found {
+		// The child is expected to be found. In particular, this parent span is not
+		// supposed to be reused while children are holding a reference to it
+		// (courtesy of the parent's reference counter).
 		panic("child not present in parent")
 	}
 
 	collectChildRec := s.mu.openChildren[childIdx].collectRecording
+	// Drop the child's reference.
+	if s.mu.openChildren[childIdx].decRef() {
+		// We're going to use the child below, so we don't want it to be
+		// re-allocated yet. It shouldn't be re-allocated, because each span holds a
+		// reference to itself that's only dropped at the end of Span.Finish() (and
+		// the child at this point is in the middle of its Finish() call).
+		panic(fmt.Sprintf("span's reference count unexpectedly dropped to zero: %s", child.operation))
+	}
 
 	// Unlink the child.
 	l := len(s.mu.openChildren)
 	s.mu.openChildren[childIdx] = s.mu.openChildren[l-1]
-	s.mu.openChildren[l-1].crdbSpan = nil // Make the child available to GC.
 	s.mu.openChildren = s.mu.openChildren[:l-1]
 
 	// Collect the child's recording.
@@ -715,10 +783,7 @@ func (s *crdbSpan) childFinished(child *crdbSpan) {
 func (s *crdbSpan) parentFinished() {
 	s.mu.Lock()
 	defer s.mu.Unlock()
-	if s.mu.finished {
-		return
-	}
-	s.mu.parent = nil
+	s.mu.parent.release()
 }
 
 // SetVerbose is part of the RegistrySpan interface.