Merge #73883

73883: tracing: pool and reuse spans r=andreimatei a=andreimatei This patch adds sync.Pool-ing and reuse to tracing spans, so that they don't need to be dynamically allocated on every span creation. This has a big effect on the heap footprint. Conceptually, a span is made available for reuse on Finish(). In practice, it's more complicated because there can still be references of the span used concurrently with Finish(), either internally in the tracing library or externally. The external ones are bugs by definition, but we want to avoid some particularly nasty consequences of such bugs. The BenchmarkTracing results below show that this saves around 10KB worth of heap allocations per simple query, when tracing is enabled (minimal tracing: TracingModeActiveSpansRegistry). I believe the span allocations go from being serviced from the shared heap to being serviced from CPU-local freelists, since they become small enough. In the single-node case, this is 25% of the query's allocations. As can be seen in the benchmarks below in the differences between the trace=on and trace=off rows, the impact of tracing is big on memory footprint; with this patch, there's not much impact. ``` name old alloc/op new alloc/op delta Tracing/1node/scan/trace=off-32 19.7kB ± 1% 19.7kB ± 1% ~ (p=0.768 n=10+5) Tracing/1node/scan/trace=on-32 29.2kB ± 0% 22.0kB ± 0% -24.85% (p=0.001 n=10+5) Tracing/1node/insert/trace=off-32 38.5kB ± 1% 38.4kB ± 1% ~ (p=0.440 n=10+5) Tracing/1node/insert/trace=on-32 45.5kB ± 1% 38.7kB ± 1% -15.03% (p=0.001 n=10+5) Tracing/3node/scan/trace=off-32 68.1kB ± 3% 67.9kB ± 3% ~ (p=0.768 n=10+5) Tracing/3node/scan/trace=on-32 86.8kB ± 2% 75.3kB ± 2% -13.21% (p=0.001 n=9+5) Tracing/3node/insert/trace=off-32 88.1kB ± 5% 90.8kB ± 7% ~ (p=0.112 n=9+5) Tracing/3node/insert/trace=on-32 96.1kB ± 3% 89.0kB ± 2% -7.39% (p=0.001 n=9+5) ``` Unfortunately, pooling spans only saves on the size of allocations, not the number of allocations. This is because the Context in which a Span lives still needs to be allocated dynamically, as it does not have a clear lifetime and so it cannot be re-used (plus it's immutable, etc). Before this patch, the code was optimized to allocate a Span and a Context together, through trickery (we had a dedicated Context type, which we now get rid of). So, this patch replaces an allocation for Span+Context with just a Context allocation, which is a win because Spans are big and Contexts are small. BenchmarkTracing (which runs SQL queries) only show minor improvements in the time/op, but the memory improvements are so large that I think they must translate into sufficient GC pressure wins to be worth doing. Micro-benchmarks from the tracing package show major time/op wins. ``` name old time/op new time/op delta Tracer_StartSpanCtx/opts=none-32 537ns ± 1% 275ns ± 2% -48.73% (p=0.008 n=5+5) Tracer_StartSpanCtx/opts=real-32 537ns ± 2% 273ns ± 2% -49.16% (p=0.008 n=5+5) Tracer_StartSpanCtx/opts=real,logtag-32 565ns ± 1% 278ns ± 1% -50.81% (p=0.008 n=5+5) Tracer_StartSpanCtx/opts=real,autoparent-32 879ns ±29% 278ns ± 5% -68.36% (p=0.008 n=5+5) Tracer_StartSpanCtx/opts=real,manualparent-32 906ns ±26% 289ns ± 2% -68.08% (p=0.008 n=5+5) Span_GetRecording/root-only-32 11.1ns ± 2% 11.6ns ± 4% ~ (p=0.056 n=5+5) Span_GetRecording/child-only-32 11.1ns ± 4% 11.7ns ± 2% +5.44% (p=0.016 n=5+5) Span_GetRecording/root-child-32 18.9ns ± 3% 19.5ns ± 1% +3.55% (p=0.008 n=5+5) RecordingWithStructuredEvent-32 1.37µs ± 2% 1.17µs ± 2% -14.22% (p=0.008 n=5+5) SpanCreation/detached-child=false-32 1.84µs ± 2% 0.96µs ± 0% -47.56% (p=0.008 n=5+5) SpanCreation/detached-child=true-32 2.01µs ± 1% 1.14µs ± 1% -43.32% (p=0.008 n=5+5) name old alloc/op new alloc/op delta Tracer_StartSpanCtx/opts=none-32 768B ± 0% 48B ± 0% -93.75% (p=0.008 n=5+5) Tracer_StartSpanCtx/opts=real-32 768B ± 0% 48B ± 0% -93.75% (p=0.008 n=5+5) Tracer_StartSpanCtx/opts=real,logtag-32 768B ± 0% 48B ± 0% -93.75% (p=0.008 n=5+5) Tracer_StartSpanCtx/opts=real,autoparent-32 768B ± 0% 48B ± 0% -93.75% (p=0.008 n=5+5) Tracer_StartSpanCtx/opts=real,manualparent-32 768B ± 0% 48B ± 0% -93.75% (p=0.008 n=5+5) Span_GetRecording/root-only-32 0.00B 0.00B ~ (all equal) Span_GetRecording/child-only-32 0.00B 0.00B ~ (all equal) Span_GetRecording/root-child-32 0.00B 0.00B ~ (all equal) RecordingWithStructuredEvent-32 1.54kB ± 0% 0.77kB ± 0% -49.86% (p=0.008 n=5+5) SpanCreation/detached-child=false-32 4.62kB ± 0% 0.29kB ± 0% ~ (p=0.079 n=4+5) SpanCreation/detached-child=true-32 5.09kB ± 0% 0.77kB ± 0% -84.87% (p=0.008 n=5+5) ``` This patch brings us very close to enabling the TracingModeActiveSpansRegistry tracing mode by default in production - which would give us a registry of all in-flight spans/operations in the system. ### Interactions with use-after-Finish detection Span reuse interacts with the recently-introduced span-use-after-Finish detection. Spans are made available for reuse on Finish (technically, when certain references to the span have been drained; see below). When a span is reused in between Finish() and an erroneous use of the Finish()ed span, this bug cannot be detected and results in the caller operating on an unintended span. This can result in the wrong log message apearing in the wrong span, and such. Care has been taken so that use-after-Finish bugs do not result in more structural problems, such as loops in the parent-child relationships. ### Technical details The mechanism used for making spans available for reuse is reference counting; the release of a span to the pool is deferred to the release of the last counted reference. Counted references are held: - internally: children hold references to the parent and the parent holds references to the children. - externally: the WithParent(sp) option takes a reference on sp. Apart from WithParent, clients using Spans do not track their references because it'd be too burdensome to require all the references to have a clearly defined life cycle, and to be explicitly released when they're no longer in use. For clients, the contract is that a span can be used until Finish(). WithParent is special, though; see below. Different alternatives to reference counting were explored. In particular, instead of a deferred-release scheme, an alternative was a fat-pointer scheme where references that can outlive a span are tagged with the span's "generation". That works for the internal use cases, but the problem with this scheme is that WithParent(s) ends up allocating - which I want to avoid. Right now, WithParent(s) returns a pointer as an interface, which doesn't allocate. But if the pointer gets fat, it no longer fits into the class of things that can be put in interfaces without allocating. The reference counter is an atomic; it is not protected by the span's lock because a parent's counter needs to be accessed under both the parent's lock and the children's locks. In details, the reference counter serves a couple of purposes: 1) Prevent re-allocation of a Span while child spans are still operating on it. In particular, this ensures that races between Finish()ing a parent and a child cannot result in the child operating on a re-allocated parent. Because of the span's lock ordering convention, a child cannot hold its lock while operating on the parent. During Finish(), the child drops its lock and informs the parent that it is Finish()ing. If the parent Finish()es at the same time, that call could erroneously conclude that the parent can be made available for re-use, even through the child goroutine has a pending call into the parent. 2) Prevent re-allocation of child spans while a Finish()ing parent is in the process of transforming the children into roots and inserting them into the active spans registry. Operating on the registry is done without holding any span's lock, so a race with a child's Finish() could result in the registry operating on a re-allocated span. 3) Prevent re-allocation of a Span in between the time that WithParent(s) captures a reference to s and the time when the parent option is used to create the child. Such an inopportune reuse of a span could only happen is the span is Finish()ed concurrently with the creation of its child, which is illegal. Still, we optionally tolerate use-after-Finishes, and this use cannot be tolerated with the reference count protection. Without this protection, the tree of spans in a trace could degenerate into a graph through the introduction of loops. A loop could lead to deadlock due to the fact that we lock multiple spans at once. The lock ordering convention is that the parent needs to be locked before the child, which ensures deadlock-freedom in the absence of loops. For example: 1. parent := tr.StartSpan() 2. parentOpt := WithParent(parent) 3. parent.Finish() 4. child := tr.StartSpan(parentOpt) If "parent" would be re-allocated as "child", then child would have itself as a parent. The use of parentOpt in step 4) after parent was finished in step 3) is a use-after-Finish of parent; it is illegal and, if detection is enabled, it might be detected as such. However, if span pooling and re-use is enabled, then the detection is not realiable (it does not catch cases where a span is re-used before a reference to it taken before the prior Finish() is used). A span having itself as a parent is just the trivial case of the problem; loops of arbitrary length are also possible. For example, for a loop of length 2: 1. Say we have span A as a parent with span B as a child (A -> B). 2. parentA := WithParent(A) 3. parentB := WithParent(B) 4. A.Finish(); B.Finish(); 5. X := tr.StartSpan(parentA); Y := tr.StartSpan(parentB); If B is re-used as X, and A is re-used as Y, we get the following graph: ``` B<-┐ | | └->A ``` We avoid these hazards by having WithParent(s) increment s' reference count, so spans are not re-used while the creation of a child is pending. Spans can be Finish()ed while the creation of the child is pending, in which case the creation of the child will reliably detect the use-after-Finish (and turn it into a no-op if configured to tolerate such illegal uses). Introducing this reference count, and only reusing spans with a reference count of zero, introduced the risk of leaking references if one does opt = WithParent(sp) and then discards the resulting opt without passing it to StartSpan(). This would cause a silent performance regression (not a memory leak though, as the GC is still there). This risk seems worth it for avoiding deadlocks in case of other buggy usage. Release note: None Co-authored-by: Andrei Matei <[email protected]>
cockroachdb · Jan 22, 2022 · 4fc0855 · 4fc0855
2 parents dc07599 + 2fca7ad
commit 4fc0855
Show file tree

Hide file tree

Showing 22 changed files with 1,358 additions and 295 deletions.
diff --git a/pkg/testutils/testcluster/testcluster.go b/pkg/testutils/testcluster/testcluster.go
@@ -16,6 +16,7 @@ import (
 	"fmt"
 	"net"
 	"reflect"
+	"runtime"
 	"strings"
 	"sync"
 	"testing"
@@ -155,6 +156,10 @@ func (tc *TestCluster) stopServers(ctx context.Context) {
 			return errors.Newf("%s", buf.String())
 		})
 	}
+	// Force a GC in an attempt to run finalizers. Some finalizers run sanity
+	// checks that panic on failure, and ideally we'd run them all before starting
+	// the next test.
+	runtime.GC()
 }
 
 // StopServer stops an individual server in the cluster.

diff --git a/pkg/util/constants.go b/pkg/util/constants.go
@@ -95,7 +95,7 @@ func init() {
 
 // ConstantWithMetamorphicTestRange is like ConstantWithMetamorphicTestValue
 // except instead of returning a single metamorphic test value, it returns a
-// random test value in a range.
+// random test value in the semi-open range [min, max).
 //
 // The given name is used for logging.
 func ConstantWithMetamorphicTestRange(name string, defaultValue, min, max int) int {

diff --git a/pkg/util/log/ambient_context_test.go b/pkg/util/log/ambient_context_test.go
@@ -75,6 +75,7 @@ func TestAnnotateCtxSpan(t *testing.T) {
 
 	ac.Tracer = tracer
 	ctx, sp := ac.AnnotateCtxWithSpan(context.Background(), "s")
+	defer sp.Finish()
 	require.Equal(t, sp, tracing.SpanFromContext(ctx))
 	require.NotNil(t, sp)
 	require.False(t, sp.IsVerbose())

diff --git a/pkg/util/ring/BUILD.bazel b/pkg/util/ring/BUILD.bazel
@@ -10,7 +10,14 @@ go_library(
 go_test(
     name = "ring_test",
     size = "small",
-    srcs = ["ring_buffer_test.go"],
+    srcs = [
+        "main_test.go",
+        "ring_buffer_test.go",
+    ],
     embed = [":ring"],
-    deps = ["@com_github_stretchr_testify//require"],
+    deps = [
+        "//pkg/util/leaktest",
+        "//pkg/util/randutil",
+        "@com_github_stretchr_testify//require",
+    ],
 )
diff --git a/pkg/util/ring/main_test.go b/pkg/util/ring/main_test.go
@@ -0,0 +1,25 @@
+// Copyright 2021 The Cockroach Authors.
+//
+// Use of this software is governed by the Business Source License
+// included in the file licenses/BSL.txt.
+//
+// As of the Change Date specified in that file, in accordance with
+// the Business Source License, use of this software will be governed
+// by the Apache License, Version 2.0, included in the file
+// licenses/APL.txt.
+
+package ring
+
+import (
+	"os"
+	"testing"
+
+	"github.com/cockroachdb/cockroach/pkg/util/randutil"
+)
+
+func TestMain(m *testing.M) {
+	randutil.SeedForTests()
+	os.Exit(m.Run())
+}
+
+//go:generate ../../util/leaktest/add-leaktest.sh *_test.go
diff --git a/pkg/util/ring/ring_buffer.go b/pkg/util/ring/ring_buffer.go
@@ -79,17 +79,40 @@ func (r *Buffer) GetLast() interface{} {
 	return r.buffer[(cap(r.buffer)+r.tail-1)%cap(r.buffer)]
 }
 
-func (r *Buffer) grow(n int) {
+func (r *Buffer) resize(n int) {
+	if n < r.Len() {
+		panic("resizing to fewer elements than current length")
+	}
+
+	if n == 0 {
+		r.Discard()
+		return
+	}
+
 	newBuffer := make([]interface{}, n)
+	r.copyTo(newBuffer)
+	r.tail = r.Len() % cap(newBuffer)
+	r.head = 0
+	r.buffer = newBuffer
+}
+
+// copyTo copies elements from r to dst. If len(dst) < r.Len(), only the first
+// len(dst) elements are copied.
+func (r *Buffer) copyTo(dst []interface{}) {
+	if !r.nonEmpty {
+		return
+	}
+	// Copy over the contents to dst.
 	if r.head < r.tail {
-		copy(newBuffer[:r.Len()], r.buffer[r.head:r.tail])
+		copy(dst, r.buffer[r.head:r.tail])
 	} else {
-		copy(newBuffer[:cap(r.buffer)-r.head], r.buffer[r.head:])
-		copy(newBuffer[cap(r.buffer)-r.head:r.Len()], r.buffer[:r.tail])
+		tailElements := r.buffer[r.head:]
+		copy(dst, tailElements)
+		// If there's space remaining, continue.
+		if len(dst) > len(tailElements) {
+			copy(dst[cap(r.buffer)-r.head:], r.buffer[:r.tail])
+		}
 	}
-	r.head = 0
-	r.tail = cap(r.buffer)
-	r.buffer = newBuffer
 }
 
 func (r *Buffer) maybeGrow() {
@@ -100,7 +123,7 @@ func (r *Buffer) maybeGrow() {
 	if n == 0 {
 		n = 1
 	}
-	r.grow(n)
+	r.resize(n)
 }
 
 // AddFirst add element to the front of the Buffer and doubles it's underlying
@@ -146,21 +169,63 @@ func (r *Buffer) RemoveLast() {
 	}
 }
 
-// Reserve reserves the provided number of elements in the Buffer. It is an
-// error to reserve a size less than the Buffer's current length.
+// Reserve reserves the provided number of elements in the Buffer. It is illegal
+// to reserve a size less than the r.Len().
+//
+// If the Buffer already has a capacity of n or larger, this is a no-op.
 func (r *Buffer) Reserve(n int) {
 	if n < r.Len() {
 		panic("reserving fewer elements than current length")
-	} else if n > cap(r.buffer) {
-		r.grow(n)
+	}
+	if n > cap(r.buffer) {
+		r.resize(n)
+	}
+}
+
+// Resize changes the Buffer's storage to be of the specified size. It is
+// illegal to resize to a size less than r.Len().
+//
+// This is a more general version of Reserve: Reserve only ever grows the
+// storage, whereas Resize() can also shrink it.
+//
+// Note that, if n != r.Len(), Resize always allocates new storage, even when n
+// is less than the current capacity. This can be useful to make the storage for
+// a buffer that used to be large available for GC, but it can also be wasteful.
+func (r *Buffer) Resize(n int) {
+	if n < r.Len() {
+		panic("resizing to fewer elements than current length")
+	}
+
+	if n != cap(r.buffer) {
+		r.resize(n)
 	}
 }
 
 // Reset makes Buffer treat its underlying memory as if it were empty. This
 // allows for reusing the same memory again without explicitly removing old
-// elements.
+// elements. Note that this does not nil out the elements, so they're not made
+// available to GC.
+//
+// See also Discard.
 func (r *Buffer) Reset() {
 	r.head = 0
 	r.tail = 0
 	r.nonEmpty = false
 }
+
+// Discard is like Reset, except it also does Resize(0) to nil out the
+// underlying slice. This makes the backing storage for the slice available to
+// GC if nobody else is referencing it. This is useful if r is still referenced,
+// but *r will be reassigned.
+//
+// See also Reset and Resize.
+func (r *Buffer) Discard() {
+	*r = Buffer{}
+}
+
+// all a slice with returns all the elements in the buffer.
+func (r *Buffer) all() []interface{} {
+	buf := make([]interface{}, r.Len())
+	r.copyTo(buf)
+	return buf
+}
diff --git a/pkg/util/ring/ring_buffer_test.go b/pkg/util/ring/ring_buffer_test.go
@@ -14,22 +14,23 @@ import (
 	"math/rand"
 	"testing"
 
+	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
 	"github.com/stretchr/testify/require"
 )
 
-const maxCount = 100
-
-func testRingBuffer(t *testing.T, count int) {
+func TestRingBuffer(t *testing.T) {
+	defer leaktest.AfterTest(t)()
+	const operationCount = 100
 	var buffer Buffer
-	naiveBuffer := make([]interface{}, 0, count)
-	for elementIdx := 0; elementIdx < count; elementIdx++ {
-		switch rand.Intn(4) {
+	naiveBuffer := make([]interface{}, 0, operationCount)
+	for i := 0; i < operationCount; i++ {
+		switch rand.Intn(5) {
 		case 0:
-			buffer.AddFirst(elementIdx)
-			naiveBuffer = append([]interface{}{elementIdx}, naiveBuffer...)
+			buffer.AddFirst(i)
+			naiveBuffer = append([]interface{}{i}, naiveBuffer...)
 		case 1:
-			buffer.AddLast(elementIdx)
-			naiveBuffer = append(naiveBuffer, elementIdx)
+			buffer.AddLast(i)
+			naiveBuffer = append(naiveBuffer, i)
 		case 2:
 			if len(naiveBuffer) > 0 {
 				buffer.RemoveFirst()
@@ -42,32 +43,23 @@ func testRingBuffer(t *testing.T, count int) {
 				buffer.RemoveLast()
 				naiveBuffer = naiveBuffer[:len(naiveBuffer)-1]
 			}
+		case 4:
+			// If there's extra capacity, resize to trim it.
+			require.LessOrEqual(t, len(naiveBuffer), buffer.Cap())
+			spareCap := buffer.Cap() - len(naiveBuffer)
+			if spareCap > 0 {
+				buffer.Resize(len(naiveBuffer) + rand.Intn(spareCap))
+			}
 		default:
 			t.Fatal("unexpected")
 		}
 
-		require.Equal(t, len(naiveBuffer), buffer.Len())
-		for pos, el := range naiveBuffer {
-			res := buffer.Get(pos)
-			require.Equal(t, el, res)
-		}
-		if len(naiveBuffer) > 0 {
-			require.Equal(t, naiveBuffer[0], buffer.GetFirst())
-			require.Equal(t, naiveBuffer[len(naiveBuffer)-1], buffer.GetLast())
-		}
-	}
-}
-
-func TestRingBuffer(t *testing.T) {
-	for count := 1; count <= maxCount; count++ {
-		t.Run("Parallel", func(t *testing.T) {
-			t.Parallel() // SAFE FOR TESTING
-			testRingBuffer(t, count)
-		})
+		require.Equal(t, naiveBuffer, buffer.all())
 	}
 }
 
 func TestRingBufferCapacity(t *testing.T) {
+	defer leaktest.AfterTest(t)()
 	var b Buffer
 
 	require.Panics(t, func() { b.Reserve(-1) })
@@ -116,4 +108,8 @@ func TestRingBufferCapacity(t *testing.T) {
 	b.Reserve(0)
 	require.Equal(t, 0, b.Len())
 	require.Equal(t, 9, b.Cap())
+
+	b.Resize(3)
+	require.Equal(t, 0, b.Len())
+	require.Equal(t, 3, b.Cap())
 }
diff --git a/pkg/util/stop/stopper_test.go b/pkg/util/stop/stopper_test.go
@@ -724,13 +724,14 @@ func TestStopperRunAsyncTaskTracing(t *testing.T) {
 		},
 			func(ctx context.Context) {
 				log.Event(ctx, "async 3")
-				sp := tracing.SpanFromContext(ctx)
-				if sp == nil {
+				sp1 := tracing.SpanFromContext(ctx)
+				if sp1 == nil {
 					errC <- errors.Errorf("missing span")
 					return
 				}
-				sp = tr.StartSpan("child", tracing.WithParent(sp))
-				if sp.TraceID() == traceID {
+				sp2 := tr.StartSpan("child", tracing.WithParent(sp1))
+				defer sp2.Finish()
+				if sp2.TraceID() == traceID {
 					errC <- errors.Errorf("expected different trace")
 				}
 				close(errC)

diff --git a/pkg/util/tracing/BUILD.bazel b/pkg/util/tracing/BUILD.bazel
@@ -9,6 +9,8 @@ go_library(
         "grpc_interceptor.go",
         "recording.go",
         "span.go",
+        "span_finalizer_race_off.go",
+        "span_finalizer_race_on.go",
         "span_inner.go",
         "span_options.go",
         "tags.go",
@@ -21,6 +23,7 @@ go_library(
     visibility = ["//visibility:public"],
     deps = [
         "//pkg/settings",
+        "//pkg/util",
         "//pkg/util/buildutil",
         "//pkg/util/envutil",
         "//pkg/util/iterutil",
@@ -62,17 +65,27 @@ go_test(
     srcs = [
         "bench_test.go",
         "grpc_interceptor_test.go",
+        "main_test.go",
         "span_test.go",
         "tags_test.go",
+        "tracer_external_test.go",
         "tracer_test.go",
     ],
     embed = [":tracing"],
     deps = [
+        "//pkg/base",
+        "//pkg/security",
+        "//pkg/security/securitytest",
+        "//pkg/server",
         "//pkg/testutils",
         "//pkg/testutils/grpcutils",
+        "//pkg/testutils/serverutils",
+        "//pkg/testutils/skip",
+        "//pkg/testutils/sqlutils",
         "//pkg/util",
         "//pkg/util/iterutil",
         "//pkg/util/leaktest",
+        "//pkg/util/log",
         "//pkg/util/stop",
         "//pkg/util/timeutil",
         "//pkg/util/tracing/tracingpb",
@@ -87,5 +100,6 @@ go_test(
         "@org_golang_google_grpc//:go_default_library",
         "@org_golang_google_grpc//metadata",
         "@org_golang_x_net//trace",
+        "@org_golang_x_sync//errgroup",
     ],
 )