Merge #73883
73883: tracing: pool and reuse spans r=andreimatei a=andreimatei

This patch adds sync.Pool-ing and reuse to tracing spans, so that they
don't need to be dynamically allocated on every span creation. This has
a big effect on the heap footprint.

Conceptually, a span is made available for reuse on Finish(). In
practice, it's more complicated because there can still be references of
the span used concurrently with Finish(), either internally in the
tracing library or externally. The external ones are bugs by definition,
but we want to avoid some particularly nasty consequences of such bugs.
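
As a rough illustration of the pooling scheme (not the actual tracer code; spanPool, pooledSpan, and the helpers are invented names, and this and the later sketches in this message are imagined as one hypothetical package):

```
package tracingsketch

import "sync"

// spanPool hands out recycled span objects so that StartSpan doesn't have to
// allocate a fresh one on every call.
var spanPool = sync.Pool{
	New: func() interface{} { return new(pooledSpan) },
}

type pooledSpan struct {
	operation string
	// ... recording buffers, tags, reference count, etc.
}

// getSpan draws a span from the pool and (re)initializes it.
func getSpan(op string) *pooledSpan {
	sp := spanPool.Get().(*pooledSpan)
	*sp = pooledSpan{operation: op}
	return sp
}

// releaseSpan zeroes the span and returns it to the pool. In the scheme
// described below, this happens only once the last counted reference is gone,
// not directly inside Finish().
func releaseSpan(sp *pooledSpan) {
	*sp = pooledSpan{}
	spanPool.Put(sp)
}
```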

The BenchmarkTracing results below show that this saves around 10KB
worth of heap allocations per simple query, when tracing is enabled
(minimal tracing: TracingModeActiveSpansRegistry). I believe the span
allocations go from being serviced from the shared heap to being
serviced from CPU-local freelists, since they become small enough. In
the single-node case, this is 25% of the query's allocations. As can be
seen in the benchmarks below in the differences between the trace=on and
trace=off rows, tracing has a big impact on the memory footprint; with
this patch, there's not much impact left.

```
name                               old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=off-32      19.7kB ± 1%    19.7kB ± 1%     ~     (p=0.768 n=10+5)
Tracing/1node/scan/trace=on-32       29.2kB ± 0%    22.0kB ± 0%  -24.85%  (p=0.001 n=10+5)
Tracing/1node/insert/trace=off-32    38.5kB ± 1%    38.4kB ± 1%     ~     (p=0.440 n=10+5)
Tracing/1node/insert/trace=on-32     45.5kB ± 1%    38.7kB ± 1%  -15.03%  (p=0.001 n=10+5)
Tracing/3node/scan/trace=off-32      68.1kB ± 3%    67.9kB ± 3%     ~     (p=0.768 n=10+5)
Tracing/3node/scan/trace=on-32       86.8kB ± 2%    75.3kB ± 2%  -13.21%  (p=0.001 n=9+5)
Tracing/3node/insert/trace=off-32    88.1kB ± 5%    90.8kB ± 7%     ~     (p=0.112 n=9+5)
Tracing/3node/insert/trace=on-32     96.1kB ± 3%    89.0kB ± 2%   -7.39%  (p=0.001 n=9+5)
```

Unfortunately, pooling spans only saves on the size of allocations, not
the number of allocations. This is because the Context in which a Span
lives still needs to be allocated dynamically, as it does not have a
clear lifetime and so it cannot be re-used (plus it's immutable, etc).
Before this patch, the code was optimized to allocate a Span and a
Context together, through trickery (we had a dedicated Context type,
which we now get rid of). So, this patch replaces an allocation for
Span+Context with just a Context allocation, which is a win because
Spans are big and Contexts are small.
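
To make the trade-off concrete, here is a rough sketch (illustrative names in the same hypothetical package as above, not the CockroachDB API) of why attaching a pooled span to a Context still costs one small allocation per span:

```
package tracingsketch

import "context"

type spanKeyType struct{}

var spanKey spanKeyType

// contextWithSpan attaches a span to a Context. context.WithValue always
// allocates a new Context node, so even with pooled spans each span creation
// still performs one small allocation - just no longer the big Span one.
func contextWithSpan(ctx context.Context, sp *pooledSpan) context.Context {
	return context.WithValue(ctx, spanKey, sp)
}

// spanFromContext retrieves the span, if any.
func spanFromContext(ctx context.Context) *pooledSpan {
	sp, _ := ctx.Value(spanKey).(*pooledSpan)
	return sp
}
```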

BenchmarkTracing (which runs SQL queries) only shows minor improvements
in the time/op, but the memory improvements are so large that I think
they must translate into sufficient GC pressure wins to be worth doing.
Micro-benchmarks from the tracing package show major time/op wins.

```
name                                           old time/op    new time/op    delta
Tracer_StartSpanCtx/opts=none-32                  537ns ± 1%     275ns ± 2%  -48.73%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real-32                  537ns ± 2%     273ns ± 2%  -49.16%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,logtag-32           565ns ± 1%     278ns ± 1%  -50.81%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,autoparent-32       879ns ±29%     278ns ± 5%  -68.36%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32     906ns ±26%     289ns ± 2%  -68.08%  (p=0.008 n=5+5)
Span_GetRecording/root-only-32                   11.1ns ± 2%    11.6ns ± 4%     ~     (p=0.056 n=5+5)
Span_GetRecording/child-only-32                  11.1ns ± 4%    11.7ns ± 2%   +5.44%  (p=0.016 n=5+5)
Span_GetRecording/root-child-32                  18.9ns ± 3%    19.5ns ± 1%   +3.55%  (p=0.008 n=5+5)
RecordingWithStructuredEvent-32                  1.37µs ± 2%    1.17µs ± 2%  -14.22%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-32             1.84µs ± 2%    0.96µs ± 0%  -47.56%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-32              2.01µs ± 1%    1.14µs ± 1%  -43.32%  (p=0.008 n=5+5)

name                                           old alloc/op   new alloc/op   delta
Tracer_StartSpanCtx/opts=none-32                   768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real-32                   768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,logtag-32            768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,autoparent-32        768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32      768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Span_GetRecording/root-only-32                    0.00B          0.00B          ~     (all equal)
Span_GetRecording/child-only-32                   0.00B          0.00B          ~     (all equal)
Span_GetRecording/root-child-32                   0.00B          0.00B          ~     (all equal)
RecordingWithStructuredEvent-32                  1.54kB ± 0%    0.77kB ± 0%  -49.86%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-32             4.62kB ± 0%    0.29kB ± 0%     ~     (p=0.079 n=4+5)
SpanCreation/detached-child=true-32              5.09kB ± 0%    0.77kB ± 0%  -84.87%  (p=0.008 n=5+5)
```

This patch brings us very close to enabling the
TracingModeActiveSpansRegistry tracing mode by default in production -
which would give us a registry of all in-flight spans/operations in the
system.

### Interactions with use-after-Finish detection

Span reuse interacts with the recently-introduced span-use-after-Finish
detection. Spans are made available for reuse on Finish (technically,
when certain references to the span have been drained; see below). When
a span is reused in between Finish() and an erroneous use of the
Finish()ed span, this bug cannot be detected and results in the caller
operating on an unintended span. This can result in the wrong log
message appearing in the wrong span, and the like. Care has been taken so
that use-after-Finish bugs do not result in more structural problems,
such as loops in the parent-child relationships.
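
A tiny sketch of why reuse defeats the detection (all names invented; same hypothetical package as above): the detector keys off a finished marker stored inside the span object, and recycling the object clears that marker.

```
package tracingsketch

type demoSpan struct {
	operation string
	finished  bool
	logs      []string
}

// Recordf catches use-after-Finish only while the object still carries the
// finished marker from the previous operation.
func (s *demoSpan) Recordf(msg string) {
	if s.finished {
		panic("use of Span after Finish!")
	}
	s.logs = append(s.logs, msg)
}

// reuseAs simulates the pool handing the same object to a new operation,
// which clears the marker and makes a stale reference undetectable.
func (s *demoSpan) reuseAs(op string) {
	*s = demoSpan{operation: op}
}
```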

### Technical details

The mechanism used for making spans available for reuse is reference
counting; the release of a span to the pool is deferred to the release
of the last counted reference.
Counted references are held:
- internally: children hold references to the parent and the parent holds
  references to the children.
- externally: the WithParent(sp) option takes a reference on sp.

Apart from WithParent, clients using Spans do not track their references
because it'd be too burdensome to require all the references to have a
clearly defined life cycle, and to be explicitly released when they're
no longer in use. For clients, the contract is that a span can be used
until Finish(). WithParent is special, though; see below.
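
Roughly, the deferred-release scheme can be sketched like this (same hypothetical package as above; refCountedSpan and its methods are invented names, and the real code differs in detail):

```
package tracingsketch

import "sync/atomic"

// refCountedSpan sketches the deferred-release scheme: the object is only
// made available for reuse when the last counted reference is dropped.
type refCountedSpan struct {
	operation string
	refCnt    int32 // atomic; deliberately not protected by the span's mutex
	finished  int32 // atomic; set by Finish()
}

func newRefCountedSpan(op string) *refCountedSpan {
	return &refCountedSpan{operation: op, refCnt: 1} // the span's own reference
}

// incRef is called when a child links to this span or when WithParent(s)
// captures it as a future parent.
func (s *refCountedSpan) incRef() { atomic.AddInt32(&s.refCnt, 1) }

// decRef releases one counted reference; dropping the last reference of a
// Finish()ed span makes it available for reuse.
func (s *refCountedSpan) decRef() {
	if atomic.AddInt32(&s.refCnt, -1) == 0 {
		// Only a Finish()ed span can reach zero, since Finish() is what drops
		// the span's own reference.
		*s = refCountedSpan{} // in the real code: hand the object back to a pool
	}
}

func (s *refCountedSpan) Finish() {
	atomic.StoreInt32(&s.finished, 1)
	s.decRef()
}
```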

Different alternatives to reference counting were explored. In
particular, instead of a deferred-release scheme, an alternative was a
fat-pointer scheme where references that can outlive a span are tagged
with the span's "generation". That works for the internal use cases, but
the problem with this scheme is that WithParent(s) ends up allocating -
which I want to avoid. Right now, WithParent(s) returns a pointer as an
interface, which doesn't allocate.  But if the pointer gets fat, it no
longer fits into the class of things that can be put in interfaces
without allocating.
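
The allocation behavior in question can be shown with a small sketch (hypothetical types, same package as above): a single pointer stored in an interface reuses the interface's data word, while a pointer-plus-generation struct has to be boxed on the heap.

```
package tracingsketch

// fatSpanRef is what a "generation"-tagged reference would look like.
type fatSpanRef struct {
	sp         *refCountedSpan
	generation uint64
}

// asOpt stands in for the interface-typed option that WithParent returns.
func asOpt(v interface{}) interface{} { return v }

func compareRefs(sp *refCountedSpan) {
	_ = asOpt(sp)                                // pointer fits in the interface word: no allocation
	_ = asOpt(fatSpanRef{sp: sp, generation: 3}) // struct must be boxed: allocates
}
```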

The reference counter is an atomic; it is not protected by the span's
lock because a parent's counter needs to be accessed under both the
parent's lock and the children's locks.

In detail, the reference counter serves a couple of purposes:
1) Prevent re-allocation of a Span while child spans are still operating on
it. In particular, this ensures that races between Finish()ing a parent and
a child cannot result in the child operating on a re-allocated parent.
Because of the span's lock ordering convention, a child cannot hold its
lock while operating on the parent. During Finish(), the child drops its
lock and informs the parent that it is Finish()ing. If the parent
Finish()es at the same time, that call could erroneously conclude that the
parent can be made available for re-use, even though the child goroutine
has a pending call into the parent.

2) Prevent re-allocation of child spans while a Finish()ing parent is in
the process of transforming the children into roots and inserting them into
the active spans registry. Operating on the registry is done without
holding any span's lock, so a race with a child's Finish() could result in
the registry operating on a re-allocated span.

3) Prevent re-allocation of a Span in between the time that WithParent(s)
captures a reference to s and the time when the parent option is used to
create the child. Such an inopportune reuse of a span could only happen if
the span is Finish()ed concurrently with the creation of its child, which
is illegal. Still, we optionally tolerate use-after-Finishes, and this use
can only be tolerated with the reference-count protection. Without this
protection, the tree of spans in a trace could degenerate into a graph
through the introduction of loops. A loop could lead to deadlock due to the
fact that we lock multiple spans at once. The lock ordering convention is
that the parent needs to be locked before the child, which ensures
deadlock-freedom in the absence of loops.
For example:
1. parent := tr.StartSpan()
2. parentOpt := WithParent(parent)
3. parent.Finish()
4. child := tr.StartSpan(parentOpt)
If "parent" would be re-allocated as "child", then child would have itself
as a parent. The use of parentOpt in step 4) after parent was finished in
step 3) is a use-after-Finish of parent; it is illegal and, if detection is
enabled, it might be detected as such. However, if span pooling and re-use
is enabled, then the detection is not reliable (it does not catch cases
where a span is re-used before a reference to it taken before the prior
Finish() is used).
A span having itself as a parent is just the trivial case of the problem;
loops of arbitrary length are also possible. For example, for a loop of
length 2:
1. Say we have span A as a parent with span B as a child (A -> B).
2. parentA := WithParent(A)
3. parentB := WithParent(B)
4. A.Finish(); B.Finish();
5. X := tr.StartSpan(parentA); Y := tr.StartSpan(parentB);
If B is re-used as X, and A is re-used as Y, we get the following graph:
```
 B<-┐
 |  |
 └->A
```

We avoid these hazards by having WithParent(s) increment s' reference
count, so spans are not re-used while the creation of a child is pending.
Spans can be Finish()ed while the creation of the child is pending, in
which case the creation of the child will reliably detect the
use-after-Finish (and turn it into a no-op if configured to tolerate
such illegal uses).
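
In terms of the sketch above (illustrative only, not the real SpanOption plumbing), the reference handoff looks roughly like this:

```
package tracingsketch

import "sync/atomic"

// parentOption models WithParent(s): it captures a counted reference so the
// parent cannot be recycled while the child's creation is pending.
type parentOption struct {
	parent *refCountedSpan
}

func withParent(sp *refCountedSpan) parentOption {
	sp.incRef()
	return parentOption{parent: sp}
}

// startSpan consumes the option and always releases the captured reference,
// whether or not the parent has been Finish()ed in the meantime.
func startSpan(op string, opt parentOption) *refCountedSpan {
	child := newRefCountedSpan(op)
	if atomic.LoadInt32(&opt.parent.finished) != 0 {
		// Use-after-Finish of the parent: reliably detected here; the child
		// is created as a root (or the call becomes a no-op).
	} else {
		// Link the child under the still-live parent (details elided).
	}
	opt.parent.decRef() // release the reference captured by withParent
	return child
}
```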

Introducing this reference count, and only reusing spans with a
reference count of zero, introduced the risk of leaking references if
one does opt = WithParent(sp) and then discards the resulting opt
without passing it to StartSpan(). This would cause a silent performance
regression (not a memory leak though, as the GC is still there). This
risk seems worth it for avoiding deadlocks in case of other buggy usage.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
craig[bot] and andreimatei committed Jan 22, 2022
2 parents dc07599 + 2fca7ad commit 4fc0855
Showing 22 changed files with 1,358 additions and 295 deletions.
5 changes: 5 additions & 0 deletions pkg/testutils/testcluster/testcluster.go
@@ -16,6 +16,7 @@ import (
"fmt"
"net"
"reflect"
"runtime"
"strings"
"sync"
"testing"
@@ -155,6 +156,10 @@ func (tc *TestCluster) stopServers(ctx context.Context) {
return errors.Newf("%s", buf.String())
})
}
// Force a GC in an attempt to run finalizers. Some finalizers run sanity
// checks that panic on failure, and ideally we'd run them all before starting
// the next test.
runtime.GC()
}

// StopServer stops an individual server in the cluster.
2 changes: 1 addition & 1 deletion pkg/util/constants.go
@@ -95,7 +95,7 @@ func init() {

// ConstantWithMetamorphicTestRange is like ConstantWithMetamorphicTestValue
// except instead of returning a single metamorphic test value, it returns a
// random test value in a range.
// random test value in the semi-open range [min, max).
//
// The given name is used for logging.
func ConstantWithMetamorphicTestRange(name string, defaultValue, min, max int) int {
1 change: 1 addition & 0 deletions pkg/util/log/ambient_context_test.go
@@ -75,6 +75,7 @@ func TestAnnotateCtxSpan(t *testing.T) {

ac.Tracer = tracer
ctx, sp := ac.AnnotateCtxWithSpan(context.Background(), "s")
defer sp.Finish()
require.Equal(t, sp, tracing.SpanFromContext(ctx))
require.NotNil(t, sp)
require.False(t, sp.IsVerbose())
11 changes: 9 additions & 2 deletions pkg/util/ring/BUILD.bazel
@@ -10,7 +10,14 @@ go_library(
go_test(
name = "ring_test",
size = "small",
srcs = ["ring_buffer_test.go"],
srcs = [
"main_test.go",
"ring_buffer_test.go",
],
embed = [":ring"],
deps = ["@com_github_stretchr_testify//require"],
deps = [
"//pkg/util/leaktest",
"//pkg/util/randutil",
"@com_github_stretchr_testify//require",
],
)
25 changes: 25 additions & 0 deletions pkg/util/ring/main_test.go
@@ -0,0 +1,25 @@
// Copyright 2021 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

package ring

import (
"os"
"testing"

"github.com/cockroachdb/cockroach/pkg/util/randutil"
)

func TestMain(m *testing.M) {
randutil.SeedForTests()
os.Exit(m.Run())
}

//go:generate ../../util/leaktest/add-leaktest.sh *_test.go
91 changes: 78 additions & 13 deletions pkg/util/ring/ring_buffer.go
@@ -79,17 +79,40 @@ func (r *Buffer) GetLast() interface{} {
return r.buffer[(cap(r.buffer)+r.tail-1)%cap(r.buffer)]
}

func (r *Buffer) grow(n int) {
func (r *Buffer) resize(n int) {
if n < r.Len() {
panic("resizing to fewer elements than current length")
}

if n == 0 {
r.Discard()
return
}

newBuffer := make([]interface{}, n)
r.copyTo(newBuffer)
r.tail = r.Len() % cap(newBuffer)
r.head = 0
r.buffer = newBuffer
}

// copyTo copies elements from r to dst. If len(dst) < r.Len(), only the first
// len(dst) elements are copied.
func (r *Buffer) copyTo(dst []interface{}) {
if !r.nonEmpty {
return
}
// Copy over the contents to dst.
if r.head < r.tail {
copy(newBuffer[:r.Len()], r.buffer[r.head:r.tail])
copy(dst, r.buffer[r.head:r.tail])
} else {
copy(newBuffer[:cap(r.buffer)-r.head], r.buffer[r.head:])
copy(newBuffer[cap(r.buffer)-r.head:r.Len()], r.buffer[:r.tail])
tailElements := r.buffer[r.head:]
copy(dst, tailElements)
// If there's space remaining, continue.
if len(dst) > len(tailElements) {
copy(dst[cap(r.buffer)-r.head:], r.buffer[:r.tail])
}
}
r.head = 0
r.tail = cap(r.buffer)
r.buffer = newBuffer
}

func (r *Buffer) maybeGrow() {
@@ -100,7 +123,7 @@ func (r *Buffer) maybeGrow() {
if n == 0 {
n = 1
}
r.grow(n)
r.resize(n)
}

// AddFirst add element to the front of the Buffer and doubles it's underlying
@@ -146,21 +169,63 @@ func (r *Buffer) RemoveLast() {
}
}

// Reserve reserves the provided number of elements in the Buffer. It is an
// error to reserve a size less than the Buffer's current length.
// Reserve reserves the provided number of elements in the Buffer. It is illegal
// to reserve a size less than the r.Len().
//
// If the Buffer already has a capacity of n or larger, this is a no-op.
func (r *Buffer) Reserve(n int) {
if n < r.Len() {
panic("reserving fewer elements than current length")
} else if n > cap(r.buffer) {
r.grow(n)
}
if n > cap(r.buffer) {
r.resize(n)
}
}

// Resize changes the Buffer's storage to be of the specified size. It is
// illegal to resize to a size less than r.Len().
//
// This is a more general version of Reserve: Reserve only ever grows the
// storage, whereas Resize() can also shrink it.
//
// Note that, if n != r.Len(), Resize always allocates new storage, even when n
// is less than the current capacity. This can be useful to make the storage for
// a buffer that used to be large available for GC, but it can also be wasteful.
func (r *Buffer) Resize(n int) {
if n < r.Len() {
panic("resizing to fewer elements than current length")
}

if n != cap(r.buffer) {
r.resize(n)
}
}

// Reset makes Buffer treat its underlying memory as if it were empty. This
// allows for reusing the same memory again without explicitly removing old
// elements.
// elements. Note that this does not nil out the elements, so they're not made
// available to GC.
//
// See also Discard.
func (r *Buffer) Reset() {
r.head = 0
r.tail = 0
r.nonEmpty = false
}

// Discard is like Reset, except it also does Resize(0) to nil out the
// underlying slice. This makes the backing storage for the slice available to
// GC if nobody else is referencing it. This is useful if r is still referenced,
// but *r will be reassigned.
//
// See also Reset and Resize.
func (r *Buffer) Discard() {
*r = Buffer{}
}

// all returns a slice with all the elements in the buffer.
func (r *Buffer) all() []interface{} {
buf := make([]interface{}, r.Len())
r.copyTo(buf)
return buf
}
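
For orientation, a short usage sketch of the Buffer API documented above (hypothetical values; not part of the commit):

```
package main

import "github.com/cockroachdb/cockroach/pkg/util/ring"

func main() {
	var b ring.Buffer
	b.Reserve(8)      // pre-size the storage; no-op if capacity is already >= 8
	b.AddLast("a")
	b.AddFirst("z")
	b.Resize(b.Len()) // trim spare capacity (allocates a smaller backing array)
	b.Reset()         // logically empty; old storage and elements stay referenced
	b.Discard()       // *b = Buffer{}: lets GC reclaim the backing slice
}
```
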
52 changes: 24 additions & 28 deletions pkg/util/ring/ring_buffer_test.go
@@ -14,22 +14,23 @@ import (
"math/rand"
"testing"

"github.com/cockroachdb/cockroach/pkg/util/leaktest"
"github.com/stretchr/testify/require"
)

const maxCount = 100

func testRingBuffer(t *testing.T, count int) {
func TestRingBuffer(t *testing.T) {
defer leaktest.AfterTest(t)()
const operationCount = 100
var buffer Buffer
naiveBuffer := make([]interface{}, 0, count)
for elementIdx := 0; elementIdx < count; elementIdx++ {
switch rand.Intn(4) {
naiveBuffer := make([]interface{}, 0, operationCount)
for i := 0; i < operationCount; i++ {
switch rand.Intn(5) {
case 0:
buffer.AddFirst(elementIdx)
naiveBuffer = append([]interface{}{elementIdx}, naiveBuffer...)
buffer.AddFirst(i)
naiveBuffer = append([]interface{}{i}, naiveBuffer...)
case 1:
buffer.AddLast(elementIdx)
naiveBuffer = append(naiveBuffer, elementIdx)
buffer.AddLast(i)
naiveBuffer = append(naiveBuffer, i)
case 2:
if len(naiveBuffer) > 0 {
buffer.RemoveFirst()
@@ -42,32 +43,23 @@ func testRingBuffer(t *testing.T, count int) {
buffer.RemoveLast()
naiveBuffer = naiveBuffer[:len(naiveBuffer)-1]
}
case 4:
// If there's extra capacity, resize to trim it.
require.LessOrEqual(t, len(naiveBuffer), buffer.Cap())
spareCap := buffer.Cap() - len(naiveBuffer)
if spareCap > 0 {
buffer.Resize(len(naiveBuffer) + rand.Intn(spareCap))
}
default:
t.Fatal("unexpected")
}

require.Equal(t, len(naiveBuffer), buffer.Len())
for pos, el := range naiveBuffer {
res := buffer.Get(pos)
require.Equal(t, el, res)
}
if len(naiveBuffer) > 0 {
require.Equal(t, naiveBuffer[0], buffer.GetFirst())
require.Equal(t, naiveBuffer[len(naiveBuffer)-1], buffer.GetLast())
}
}
}

func TestRingBuffer(t *testing.T) {
for count := 1; count <= maxCount; count++ {
t.Run("Parallel", func(t *testing.T) {
t.Parallel() // SAFE FOR TESTING
testRingBuffer(t, count)
})
require.Equal(t, naiveBuffer, buffer.all())
}
}

func TestRingBufferCapacity(t *testing.T) {
defer leaktest.AfterTest(t)()
var b Buffer

require.Panics(t, func() { b.Reserve(-1) })
@@ -116,4 +108,8 @@ func TestRingBufferCapacity(t *testing.T) {
b.Reserve(0)
require.Equal(t, 0, b.Len())
require.Equal(t, 9, b.Cap())

b.Resize(3)
require.Equal(t, 0, b.Len())
require.Equal(t, 3, b.Cap())
}
9 changes: 5 additions & 4 deletions pkg/util/stop/stopper_test.go
@@ -724,13 +724,14 @@ func TestStopperRunAsyncTaskTracing(t *testing.T) {
},
func(ctx context.Context) {
log.Event(ctx, "async 3")
sp := tracing.SpanFromContext(ctx)
if sp == nil {
sp1 := tracing.SpanFromContext(ctx)
if sp1 == nil {
errC <- errors.Errorf("missing span")
return
}
sp = tr.StartSpan("child", tracing.WithParent(sp))
if sp.TraceID() == traceID {
sp2 := tr.StartSpan("child", tracing.WithParent(sp1))
defer sp2.Finish()
if sp2.TraceID() == traceID {
errC <- errors.Errorf("expected different trace")
}
close(errC)
14 changes: 14 additions & 0 deletions pkg/util/tracing/BUILD.bazel
@@ -9,6 +9,8 @@ go_library(
"grpc_interceptor.go",
"recording.go",
"span.go",
"span_finalizer_race_off.go",
"span_finalizer_race_on.go",
"span_inner.go",
"span_options.go",
"tags.go",
Expand All @@ -21,6 +23,7 @@ go_library(
visibility = ["//visibility:public"],
deps = [
"//pkg/settings",
"//pkg/util",
"//pkg/util/buildutil",
"//pkg/util/envutil",
"//pkg/util/iterutil",
@@ -62,17 +65,27 @@ go_test(
srcs = [
"bench_test.go",
"grpc_interceptor_test.go",
"main_test.go",
"span_test.go",
"tags_test.go",
"tracer_external_test.go",
"tracer_test.go",
],
embed = [":tracing"],
deps = [
"//pkg/base",
"//pkg/security",
"//pkg/security/securitytest",
"//pkg/server",
"//pkg/testutils",
"//pkg/testutils/grpcutils",
"//pkg/testutils/serverutils",
"//pkg/testutils/skip",
"//pkg/testutils/sqlutils",
"//pkg/util",
"//pkg/util/iterutil",
"//pkg/util/leaktest",
"//pkg/util/log",
"//pkg/util/stop",
"//pkg/util/timeutil",
"//pkg/util/tracing/tracingpb",
Expand All @@ -87,5 +100,6 @@ go_test(
"@org_golang_google_grpc//:go_default_library",
"@org_golang_google_grpc//metadata",
"@org_golang_x_net//trace",
"@org_golang_x_sync//errgroup",
],
)