Skip to content

Commit

Permalink
tracing: pool and reuse spans
Browse files Browse the repository at this point in the history
This patch adds sync.Pool-ing and reuse to tracing spans, so that they
don't need to be dynamically allocated on every span creation. This has
a big effect on the heap footprint.

Conceptually, a span is made available for reuse on Finish(). In
practice, it's more complicated because there can still be references of
the span used concurrently with Finish(), either internally in the
tracing library or externally. The external ones are bugs by definition,
but we want to avoid some particularly nasty consequences of such bugs.

The BenchmarkTracing results below show that this saves around 10KB
worth of heap allocations per simple query, when tracing is enabled
(minimal tracing: TracingModeActiveSpansRegistry). In the single-node
case, this is 25% of the query's allocations. As can be seen in the
benchmarks below in the differences between the trace=on and trace=off
rows, the impact of tracing is big on memory footprint; with this patch,
there's not much impact.

```
name                               old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=off-32      19.7kB ± 1%    19.7kB ± 1%     ~     (p=0.768 n=10+5)
Tracing/1node/scan/trace=on-32       29.2kB ± 0%    22.0kB ± 0%  -24.85%  (p=0.001 n=10+5)
Tracing/1node/insert/trace=off-32    38.5kB ± 1%    38.4kB ± 1%     ~     (p=0.440 n=10+5)
Tracing/1node/insert/trace=on-32     45.5kB ± 1%    38.7kB ± 1%  -15.03%  (p=0.001 n=10+5)
Tracing/3node/scan/trace=off-32      68.1kB ± 3%    67.9kB ± 3%     ~     (p=0.768 n=10+5)
Tracing/3node/scan/trace=on-32       86.8kB ± 2%    75.3kB ± 2%  -13.21%  (p=0.001 n=9+5)
Tracing/3node/insert/trace=off-32    88.1kB ± 5%    90.8kB ± 7%     ~     (p=0.112 n=9+5)
Tracing/3node/insert/trace=on-32     96.1kB ± 3%    89.0kB ± 2%   -7.39%  (p=0.001 n=9+5)
```

Unfortunately, pooling spans only saves on the size of allocations, not
the number of allocations. This is because the Context in which a Span
lives still needs to be allocated dynamically, as it does not have a
clear lifetime and so it cannot be re-used (plus it's immutable, etc).
Before this patch, the code was optimized to allocate a Span and a
Context together, through trickery (we had a dedicated Context type,
which we now get rid of). So, this patch replaces an allocation for
Span+Context with just a Context allocation, which is a win because
Spans are big and Contexts are small.

Micro-benchmarks only show minor improvements in the time/op, but the
memory improvements are so large that I think they must translate into
sufficient GC pressure wins to be worth doing.

This patch brings us very close to enabling the
TracingModeActiveSpansRegistry tracing mode by default in production -
which would give us a registry of all in-flight spans/operations in the
system.

 ### Interactions with use-after-Finish detection

Span reuse interacts with the recently-introduced span-use-after-Finish
detection. Spans are made available for reuse on Finish (technically,
when certain references to the span have been drained; see below). When
a span is reused in between Finish() and an erroneous use of the
Finish()ed span, this bug cannot be detected and results in the caller
operating on an unintended span. This can result in the wrong log
message apearing in the wrong span, and such. Care has been taken so
that use-after-Finish bugs do not result in more structural problems,
such as loops in the parent-child relationships.

 ### Technical details

The mechanism used for making spans available for reuse is reference
counting; the release of a span to the pool is deferred to the release
of the last counted reference.
Counted references are held:
- internally: children hold references to the parent and the parent holds
  references to the children.
- externally: the WithParent(sp) option takes a reference on sp.

Apart from WithParent, clients using Spans do not track their references
because it'd be too burdensome to require all the references to have a
clearly defined life cycle, and to be explicitly released when they're
no longer in use. For clients, the contract is that a span can be used
until Finish(). WithParent is special, though; see below.

Different alternatives to reference counting were explored. In
particular, instead of a deferred-release scheme, an alternative was a
fat-pointer scheme where references that can outlive a span are tagged
with the span's "generation". That works for the internal use cases, but
the problem with this scheme is that WithParent(s) ends up allocating -
which I want to avoid. Right now, WithParent(s) returns a pointer as an
interface, which doesn't allocate.  But if the pointer gets fat, it no
longer fits into the class of things that can be put in interfaces
without allocating.

The reference counter is an atomic; it is not protected by the span's
lock because a parent's counter needs to be accessed under both the
parent's lock and the children's locks.

In details, the reference counter serves a couple of purposes:
1) Prevent re-allocation of a Span while child spans are still operating on
it. In particular, this ensures that races between Finish()ing a parent and
a child cannot result in the child operating on a re-allocated parent.
Because of the span's lock ordering convention, a child cannot hold its
lock while operating on the parent. During Finish(), the child drops its
lock and informs the parent that it is Finish()ing. If the parent
Finish()es at the same time, that call could erroneously conclude that the
parent can be made available for re-use, even through the child goroutine
has a pending call into the parent.

2) Prevent re-allocation of child spans while a Finish()ing parent is in
the process of transforming the children into roots and inserting them into
the active spans registry. Operating on the registry is done without
holding any span's lock, so a race with a child's Finish() could result in
the registry operating on a re-allocated span.

3) Prevent re-allocation of a Span in between the time that WithParent(s)
captures a reference to s and the time when the parent option is used to
create the child. Such an inopportune reuse of a span could only happen is
the span is Finish()ed concurrently with the creation of its child, which
is illegal. Still, we optionally tolerate use-after-Finishes, and this use
cannot be tolerated with the reference count protection. Without this
protection, the tree of spans in a trace could degenerate into a graph
through the introduction of loops. A loop could lead to deadlock due to the
fact that we lock multiple spans at once. The lock ordering convention is
that the parent needs to be locked before the child, which ensures
deadlock-freedom in the absence of loops.
For example:
1. parent := tr.StartSpan()
2. parentOpt := WithParent(parent)
3. parent.Finish()
4. child := tr.StartSpan(parentOpt)
If "parent" would be re-allocated as "child", then child would have itself
as a parent. The use of parentOpt in step 4) after parent was finished in
step 3) is a use-after-Finish of parent; it is illegal and, if detection is
enabled, it might be detected as such. However, if span pooling and re-use
is enabled, then the detection is not realiable (it does not catch cases
where a span is re-used before a reference to it taken before the prior
Finish() is used).
A span having itself as a parent is just the trivial case of the problem;
loops of arbitrary length are also possible. For example, for a loop of
length 2:
1. Say we have span A as a parent with span B as a child (A -> B).
2. parentA := WithParent(A)
3. parentB := WithParent(B)
4. A.Finish(); B.Finish();
5. X := tr.StartSpan(parentA); Y := tr.StartSpan(parentB);
If B is re-used as X, and A is re-used as Y, we get the following graph:
 B<-┐
 |  |
 └->A

We avoid these hazards by having WithParent(s) increment s' reference
count, so spans are not re-used while the creation of a child is pending.
Spans can be Finish()ed while the creation of the child is pending, in
which case the creation of the child will reliably detect the
use-after-Finish (and turn it into a no-op if configured to tolerate
such illegal uses).

Release note: None
  • Loading branch information
andreimatei committed Dec 27, 2021
1 parent bb5588d commit 77ab456
Show file tree
Hide file tree
Showing 12 changed files with 872 additions and 169 deletions.
2 changes: 1 addition & 1 deletion pkg/util/constants.go
Original file line number Diff line number Diff line change
Expand Up @@ -95,7 +95,7 @@ func init() {

// ConstantWithMetamorphicTestRange is like ConstantWithMetamorphicTestValue
// except instead of returning a single metamorphic test value, it returns a
// random test value in a range.
// random test value in the semi-open range [min, max).
//
// The given name is used for logging.
func ConstantWithMetamorphicTestRange(name string, defaultValue, min, max int) int {
Expand Down
12 changes: 12 additions & 0 deletions pkg/util/tracing/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ go_library(
visibility = ["//visibility:public"],
deps = [
"//pkg/settings",
"//pkg/util",
"//pkg/util/buildutil",
"//pkg/util/envutil",
"//pkg/util/iterutil",
Expand Down Expand Up @@ -61,17 +62,27 @@ go_test(
srcs = [
"bench_test.go",
"grpc_interceptor_test.go",
"main_test.go",
"span_test.go",
"tags_test.go",
"tracer_external_test.go",
"tracer_test.go",
],
embed = [":tracing"],
deps = [
"//pkg/base",
"//pkg/security",
"//pkg/security/securitytest",
"//pkg/server",
"//pkg/testutils",
"//pkg/testutils/grpcutils",
"//pkg/testutils/serverutils",
"//pkg/testutils/skip",
"//pkg/testutils/sqlutils",
"//pkg/util",
"//pkg/util/iterutil",
"//pkg/util/leaktest",
"//pkg/util/log",
"//pkg/util/stop",
"//pkg/util/timeutil",
"//pkg/util/tracing/tracingpb",
Expand All @@ -86,5 +97,6 @@ go_test(
"@org_golang_google_grpc//:go_default_library",
"@org_golang_google_grpc//metadata",
"@org_golang_x_net//trace",
"@org_golang_x_sync//errgroup",
],
)
36 changes: 10 additions & 26 deletions pkg/util/tracing/context.go
Original file line number Diff line number Diff line change
Expand Up @@ -29,31 +29,20 @@ func SpanFromContext(ctx context.Context) *Span {
return nil
}

// optimizedContext is an implementation of context.Context special
// cased to carry a Span under activeSpanKey{}. By making an explicit
// type we unlock optimizations that save allocations by allocating
// the optimizedContext together with the Span it eventually carries.
type optimizedContext struct {
context.Context
sp *Span
}

func (ctx *optimizedContext) Value(k interface{}) interface{} {
if k == (interface{}(activeSpanKey{})) {
return ctx.sp
}
return ctx.Context.Value(k)
}

// maybeWrapCtx returns a Context wrapping the Span, with two exceptions:
// 1. if ctx==noCtx, it's a noop
// 2. if ctx contains the noop Span, and sp is also the noop Span, elide
// allocating a new Context.
//
// If a non-nil octx is passed in, it forms the returned Context. This can
// avoid allocations if the caller is able to allocate octx together with
// the Span, as is commonly possible when StartSpanCtx is used.
func maybeWrapCtx(ctx context.Context, octx *optimizedContext, sp *Span) (context.Context, *Span) {
// NOTE(andrei): Our detection of Span use-after-Finish() is not reliable
// because spans are reused through a sync.Pool; we fail to detect someone
// holding a reference to a Span (e.g. a Context referencing the span) from
// before reuse and then using it after span reuse. We could make the detection
// more reliable by storing the generation number of the span in the Context
// along with the Span and checking it every time the Span is retrieved from the
// Context. We'd have to implement our own Context struct so that the Context
// and the generation number can be allocated together.
func maybeWrapCtx(ctx context.Context, sp *Span) (context.Context, *Span) {
if ctx == noCtx {
return noCtx, sp
}
Expand All @@ -72,16 +61,11 @@ func maybeWrapCtx(ctx context.Context, octx *optimizedContext, sp *Span) (contex
return ctx, sp
}
}
if octx != nil {
octx.Context = ctx
octx.sp = sp
return octx, sp
}
return context.WithValue(ctx, activeSpanKey{}, sp), sp
}

// ContextWithSpan returns a Context wrapping the supplied Span.
func ContextWithSpan(ctx context.Context, sp *Span) context.Context {
ctx, _ = maybeWrapCtx(ctx, nil /* octx */, sp)
ctx, _ = maybeWrapCtx(ctx, sp)
return ctx
}
157 changes: 111 additions & 46 deletions pkg/util/tracing/crdbspan.go
Original file line number Diff line number Diff line change
Expand Up @@ -50,13 +50,15 @@ type crdbSpan struct {
// tag's key to a user.
logTags *logtags.Buffer

// Locking rules: if locking both a parent and a child, the parent must be
// locked first. In practice, children don't take the parent's lock.
// Locking rules:
// - If locking both a parent and a child, the parent must be locked first. In
// practice, children don't take the parent's lock.
// - The active spans registry's lock must be acquired before this lock.
mu crdbSpanMu
}

type childRef struct {
*crdbSpan
spanRef
// collectRecording is set if this child's recording should be included in the
// parent's recording. This is usually the case, except for children created
// with the WithDetachedRecording() option.
Expand All @@ -74,9 +76,15 @@ type crdbSpanMu struct {
// Note that parent is mutable; a span can start by having a parent but then,
// if the parent finishes before the child does (which is uncommon), the
// child's parent is set to nil.
parent *crdbSpan
//
// While parent is set, this child is holding a reference in the parent's
// reference counter. The parent's ref count is decremented when this child
// Finish()es, or otherwise when this pointer is nil'ed (i.e. on parent
// Finish()).
parent spanRef

finished bool

// duration is initialized to -1 and set on Finish().
duration time.Duration

Expand Down Expand Up @@ -151,32 +159,42 @@ type sizeLimitedBuffer struct {
limit int64 // in bytes
}

func (buf *sizeLimitedBuffer) Reset() {
buf.Buffer.Reset()
buf.size = 0
}

// finish marks the span as finished. Further operations on the span are not
// allowed. Returns false if the span was already finished.
//
// TODO(andrei): The intention is for a span to be available for reuse (e.g.
// through a sync.Pool) after finish(), although we're not currently taking
// advantage of this. I think there might be users that collect a span's
// recording after finish(), which should be illegal. For now, use-after-finish
// is generally tolerated - and that's also why this method returns false when
// called a second time.
// Calling finish() a second time is illegal, as is any use-after-finish().
// Still, the Tracer can be configured to tolerate such uses. If the Tracer was
// configured to not tolerate use-after-Finish, we would have crashed before
// calling this.
func (s *crdbSpan) finish() bool {
var children []*crdbSpan
var parent *crdbSpan
var needRegistryChange bool
// Finishing involves the following steps:
// 1) Take the lock and capture a reference to the parent.
// 2) Operate on the parent outside of the lock.
// 3) Take the lock again, operate on the children under the lock, and also
// capture references to the children for further operations outside of the
// lock.
// 4) Insert the children into the active spans registry outside of the lock.
//
// We could reorder things such that the lock is only taken once, but it
// results in more awkward code because operating on the s' parent expects to
// find s' children in place, to collect their recordings.

var parent spanRef
{
s.mu.Lock()
if s.mu.finished {
// Already finished.
// Already finished (or at least in the process of finish()ing). This
// check ensures that only one caller performs cleanup for this span. We
// don't want the span to be re-allocated while finish() is running.
s.mu.Unlock()
return false
}
s.mu.finished = true
// If the span is not part of the registry now, it never will be. So, we'll
// need to remove it from the registry only if it currently does not have a
// parent. We'll also need to manipulate the registry if there are open
// children (they'll need to be added to the registry).
needRegistryChange = s.mu.parent == nil || len(s.mu.openChildren) > 0

if s.recordingType() != RecordingOff {
duration := timeutil.Since(s.startTime)
Expand All @@ -186,26 +204,58 @@ func (s *crdbSpan) finish() bool {
s.mu.duration = duration
}

// Shallow-copy the children so they can be processed outside the lock.
children = make([]*crdbSpan, len(s.mu.openChildren))
for i, c := range s.mu.openChildren {
children[i] = c.crdbSpan
}

// We'll operate on the parent outside of the child's lock.
parent = s.mu.parent

// We'll operate on the parent below, outside the child's lock, as per the
// lock ordering convention between parents and children. The parent might
// get Finish()ed by the time we call parent.childFinished(s) on it below;
// that's OK because we're going to hold on taking a reference in the
// parent's reference counter. Notice that we move the reference out of
// s.mu.parent; leaving it there would not work because s.mu.parent can be
// released by s.parentFinished() after we drop our lock.
parent = s.mu.parent.move()
s.mu.Unlock()
}

if parent != nil {
parent.childFinished(s)
// Operate on the parent outside the child (our current receiver) lock.
// childFinished() might call back into the child and acquire the child's
// lock.
if !parent.empty() {
parent.Span.i.crdb.childFinished(s)
parent.release()
}

// Deal with the orphaned children - make them roots.
for _, c := range children {
c.parentFinished()
// Operate on children.
var children []spanRef
var needRegistryChange bool
{
s.mu.Lock()

// If the span is not part of the registry now, it never will be. So, we'll
// need to remove it from the registry only if it currently does not have a
// parent. We'll also need to manipulate the registry if there are open
// children (they'll need to be added to the registry).
needRegistryChange = s.mu.parent.empty() || len(s.mu.openChildren) > 0

// Deal with the orphaned children - make them roots. We call into the
// children while holding the parent's lock. As per the span locking
// convention, that's OK (but the reverse isn't).
//
// We also shallow-copy the children for operating on them outside the lock.
children = make([]spanRef, len(s.mu.openChildren))
for i, c := range s.mu.openChildren {
c.parentFinished()
// Move ownership of the child reference, and also nil out the pointer to
// the child, making it available for GC.
children[i] = c.spanRef.move()
}
// Make the openChildren array available for GC if it has grown from the
// initial capacity provided by the spanAllocHelper. We don't want to reuse
// it in case it has grown too much, so that we don't accumulate many large
// arrays.
s.mu.openChildren = nil

s.mu.Unlock()
}

if needRegistryChange {
// Atomically replace s in the registry with all of its still-open children.
s.tracer.activeSpansRegistry.swap(s.spanID, children)
Expand Down Expand Up @@ -297,7 +347,8 @@ func (s *crdbSpan) getVerboseRecording(includeDetachedChildren bool, finishing b

for _, child := range s.mu.openChildren {
if child.collectRecording || includeDetachedChildren {
result = append(result, child.getVerboseRecording(includeDetachedChildren, false /* finishing */)...)
sp := child.Span.i.crdb
result = append(result, sp.getVerboseRecording(includeDetachedChildren, false /* finishing */)...)
}
}
s.mu.Unlock()
Expand Down Expand Up @@ -331,7 +382,8 @@ func (s *crdbSpan) getStructuredRecording(includeDetachedChildren bool) Recordin
}
for _, c := range s.mu.openChildren {
if c.collectRecording || includeDetachedChildren {
buffer = c.getStructuredEventsRecursively(buffer, includeDetachedChildren)
sp := c.Span.i.crdb
buffer = sp.getStructuredEventsRecursively(buffer, includeDetachedChildren)
}
}

Expand Down Expand Up @@ -510,7 +562,8 @@ func (s *crdbSpan) getStructuredEventsRecursively(
buffer = s.getStructuredEventsLocked(buffer)
for _, c := range s.mu.openChildren {
if c.collectRecording || includeDetachedChildren {
buffer = c.getStructuredEventsRecursively(buffer, includeDetachedChildren)
sp := c.Span.i.crdb
buffer = sp.getStructuredEventsRecursively(buffer, includeDetachedChildren)
}
}
for _, c := range s.mu.recording.finishedChildren {
Expand Down Expand Up @@ -629,7 +682,7 @@ func (s *crdbSpan) getRecordingNoChildrenLocked(
// The adding fails if the receiver has already Finish()ed. This should never
// happen, since using a Span after Finish() is illegal. But still, we
// defensively return false.
func (s *crdbSpan) addChildLocked(child *crdbSpan, collectChildRec bool) bool {
func (s *crdbSpan) addChildLocked(child *Span, collectChildRec bool) bool {
s.mu.AssertHeld()

if s.mu.finished {
Expand All @@ -638,39 +691,54 @@ func (s *crdbSpan) addChildLocked(child *crdbSpan, collectChildRec bool) bool {

s.mu.openChildren = append(
s.mu.openChildren,
childRef{crdbSpan: child, collectRecording: collectChildRec},
childRef{spanRef: makeSpanRef(child), collectRecording: collectChildRec},
)
return true
}

// childFinished is called when a child is Finish()ed. Depending on the
// receiver's recording mode, the child is atomically removed and replaced it
// with its recording. This allows the child span to be reused (since the parent
// no longer references it).
// with its recording.
//
// child is the child span that just finished.
func (s *crdbSpan) childFinished(child *crdbSpan) {
s.mu.Lock()
defer s.mu.Unlock()

if s.mu.finished {
return
}

var childIdx int
found := false
for i, c := range s.mu.openChildren {
if c.crdbSpan == child {
sp := c.Span.i.crdb
if sp == child {
childIdx = i
found = true
break
}
}
if !found {
// The child is expected to be found. In particular, this parent span is not
// supposed to be reused while children are holding a reference to it
// (courtesy of the parent's reference counter).
panic("child not present in parent")
}

collectChildRec := s.mu.openChildren[childIdx].collectRecording
// Drop the child's reference.
if s.mu.openChildren[childIdx].decRef() {
// We're going to use the child below, so we don't want it to be
// re-allocated yet. It shouldn't be re-allocated, because each span holds a
// reference to itself that's only dropped at the end of Span.Finish() (and
// the child at this point is in the middle of its Finish() call).
panic(fmt.Sprintf("span's reference count unexpectedly dropped to zero: %s", child.operation))
}

// Unlink the child.
l := len(s.mu.openChildren)
s.mu.openChildren[childIdx] = s.mu.openChildren[l-1]
s.mu.openChildren[l-1].crdbSpan = nil // Make the child available to GC.
s.mu.openChildren = s.mu.openChildren[:l-1]

// Collect the child's recording.
Expand Down Expand Up @@ -715,10 +783,7 @@ func (s *crdbSpan) childFinished(child *crdbSpan) {
func (s *crdbSpan) parentFinished() {
s.mu.Lock()
defer s.mu.Unlock()
if s.mu.finished {
return
}
s.mu.parent = nil
s.mu.parent.release()
}

// SetVerbose is part of the RegistrySpan interface.
Expand Down
Loading

0 comments on commit 77ab456

Please sign in to comment.