
tracing: pool and reuse spans #73883

Merged
merged 5 commits
Jan 22, 2022

Conversation

andreimatei
Contributor

@andreimatei andreimatei commented Dec 16, 2021

This patch adds sync.Pool-ing and reuse to tracing spans, so that they
don't need to be dynamically allocated on every span creation. This has
a big effect on the heap footprint.

Conceptually, a span is made available for reuse on Finish(). In
practice, it's more complicated because there can still be references of
the span used concurrently with Finish(), either internally in the
tracing library or externally. The external ones are bugs by definition,
but we want to avoid some particularly nasty consequences of such bugs.

The BenchmarkTracing results below show that this saves around 10KB
worth of heap allocations per simple query, when tracing is enabled
(minimal tracing: TracingModeActiveSpansRegistry). I believe the span
allocations go from being serviced from the shared heap to being
serviced from CPU-local freelists, since they become small enough. In
the single-node case, this is 25% of the query's allocations. As can be
seen in the benchmarks below from the differences between the trace=on and
trace=off rows, tracing has a big impact on memory footprint without this
patch; with it, there's not much impact.

name                               old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=off-32      19.7kB ± 1%    19.7kB ± 1%     ~     (p=0.768 n=10+5)
Tracing/1node/scan/trace=on-32       29.2kB ± 0%    22.0kB ± 0%  -24.85%  (p=0.001 n=10+5)
Tracing/1node/insert/trace=off-32    38.5kB ± 1%    38.4kB ± 1%     ~     (p=0.440 n=10+5)
Tracing/1node/insert/trace=on-32     45.5kB ± 1%    38.7kB ± 1%  -15.03%  (p=0.001 n=10+5)
Tracing/3node/scan/trace=off-32      68.1kB ± 3%    67.9kB ± 3%     ~     (p=0.768 n=10+5)
Tracing/3node/scan/trace=on-32       86.8kB ± 2%    75.3kB ± 2%  -13.21%  (p=0.001 n=9+5)
Tracing/3node/insert/trace=off-32    88.1kB ± 5%    90.8kB ± 7%     ~     (p=0.112 n=9+5)
Tracing/3node/insert/trace=on-32     96.1kB ± 3%    89.0kB ± 2%   -7.39%  (p=0.001 n=9+5)

Unfortunately, pooling spans only saves on the size of allocations, not
the number of allocations. This is because the Context in which a Span
lives still needs to be allocated dynamically, as it does not have a
clear lifetime and so it cannot be re-used (plus it's immutable, etc).
Before this patch, the code was optimized to allocate a Span and a
Context together, through trickery (we had a dedicated Context type,
which we now get rid of). So, this patch replaces an allocation for
Span+Context with just a Context allocation, which is a win because
Spans are big and Contexts are small.

BenchmarkTracing (which runs SQL queries) only shows minor improvements
in the time/op, but the memory improvements are so large that I think
they must translate into sufficient GC-pressure wins to be worth doing.
Micro-benchmarks from the tracing package show major time/op wins.

name                                           old time/op    new time/op    delta
Tracer_StartSpanCtx/opts=none-32                  537ns ± 1%     275ns ± 2%  -48.73%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real-32                  537ns ± 2%     273ns ± 2%  -49.16%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,logtag-32           565ns ± 1%     278ns ± 1%  -50.81%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,autoparent-32       879ns ±29%     278ns ± 5%  -68.36%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32     906ns ±26%     289ns ± 2%  -68.08%  (p=0.008 n=5+5)
Span_GetRecording/root-only-32                   11.1ns ± 2%    11.6ns ± 4%     ~     (p=0.056 n=5+5)
Span_GetRecording/child-only-32                  11.1ns ± 4%    11.7ns ± 2%   +5.44%  (p=0.016 n=5+5)
Span_GetRecording/root-child-32                  18.9ns ± 3%    19.5ns ± 1%   +3.55%  (p=0.008 n=5+5)
RecordingWithStructuredEvent-32                  1.37µs ± 2%    1.17µs ± 2%  -14.22%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-32             1.84µs ± 2%    0.96µs ± 0%  -47.56%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-32              2.01µs ± 1%    1.14µs ± 1%  -43.32%  (p=0.008 n=5+5)

name                                           old alloc/op   new alloc/op   delta
Tracer_StartSpanCtx/opts=none-32                   768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real-32                   768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,logtag-32            768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,autoparent-32        768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32      768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Span_GetRecording/root-only-32                    0.00B          0.00B          ~     (all equal)
Span_GetRecording/child-only-32                   0.00B          0.00B          ~     (all equal)
Span_GetRecording/root-child-32                   0.00B          0.00B          ~     (all equal)
RecordingWithStructuredEvent-32                  1.54kB ± 0%    0.77kB ± 0%  -49.86%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-32             4.62kB ± 0%    0.29kB ± 0%     ~     (p=0.079 n=4+5)
SpanCreation/detached-child=true-32              5.09kB ± 0%    0.77kB ± 0%  -84.87%  (p=0.008 n=5+5)

This patch brings us very close to enabling the
TracingModeActiveSpansRegistry tracing mode by default in production -
which would give us a registry of all in-flight spans/operations in the
system.

Interactions with use-after-Finish detection

Span reuse interacts with the recently-introduced span-use-after-Finish
detection. Spans are made available for reuse on Finish (technically,
when certain references to the span have been drained; see below). When
a span is reused in between Finish() and an erroneous use of the
Finish()ed span, this bug cannot be detected and results in the caller
operating on an unintended span. This can result in, for example, a log
message appearing in the wrong span. Care has been taken so that
use-after-Finish bugs do not result in more structural problems, such as
loops in the parent-child relationships.

Technical details

The mechanism used for making spans available for reuse is reference
counting; the release of a span to the pool is deferred to the release
of the last counted reference.
Counted references are held:

  • internally: children hold references to the parent and the parent holds
    references to the children.
  • externally: the WithParent(sp) option takes a reference on sp.

Apart from WithParent, clients using Spans do not track their references
because it'd be too burdensome to require all the references to have a
clearly defined life cycle, and to be explicitly released when they're
no longer in use. For clients, the contract is that a span can be used
until Finish(). WithParent is special, though; see below.

Different alternatives to reference counting were explored. In
particular, instead of a deferred-release scheme, an alternative was a
fat-pointer scheme where references that can outlive a span are tagged
with the span's "generation". That works for the internal use cases, but
the problem with this scheme is that WithParent(s) ends up allocating -
which I want to avoid. Right now, WithParent(s) returns a pointer as an
interface, which doesn't allocate. But if the pointer gets fat, it no
longer fits into the class of things that can be put in interfaces
without allocating.

The reference counter is an atomic; it is not protected by the span's
lock because a parent's counter needs to be accessed under both the
parent's lock and the children's locks.

In detail, the reference counter serves a couple of purposes:

  1. Prevent re-allocation of a Span while child spans are still operating on
    it. In particular, this ensures that races between Finish()ing a parent and
    a child cannot result in the child operating on a re-allocated parent.
    Because of the span's lock ordering convention, a child cannot hold its
    lock while operating on the parent. During Finish(), the child drops its
    lock and informs the parent that it is Finish()ing. If the parent
    Finish()es at the same time, that call could erroneously conclude that the
    parent can be made available for re-use, even though the child goroutine
    has a pending call into the parent.

  2. Prevent re-allocation of child spans while a Finish()ing parent is in
    the process of transforming the children into roots and inserting them into
    the active spans registry. Operating on the registry is done without
    holding any span's lock, so a race with a child's Finish() could result in
    the registry operating on a re-allocated span.

  3. Prevent re-allocation of a Span in between the time that WithParent(s)
    captures a reference to s and the time when the parent option is used to
    create the child. Such an inopportune reuse of a span could only happen if
    the span is Finish()ed concurrently with the creation of its child, which
    is illegal. Still, we optionally tolerate use-after-Finishes, and this use
    cannot be tolerated without the reference count protection. Without this
    protection, the tree of spans in a trace could degenerate into a graph
    through the introduction of loops. A loop could lead to deadlock, because
    we lock multiple spans at once. The lock ordering convention is that the
    parent needs to be locked before the child, which ensures
    deadlock-freedom in the absence of loops.
    For example:

  1. parent := tr.StartSpan()
  2. parentOpt := WithParent(parent)
  3. parent.Finish()
  4. child := tr.StartSpan(parentOpt)

    If "parent" were re-allocated as "child", then child would have itself
    as a parent. The use of parentOpt in step 4) after parent was finished in
    step 3) is a use-after-Finish of parent; it is illegal and, if detection is
    enabled, it might be detected as such. However, if span pooling and re-use
    is enabled, then the detection is not reliable (it does not catch cases
    where a span is re-used before a reference to it, taken before the prior
    Finish(), is used).
    A span having itself as a parent is just the trivial case of the problem;
    loops of arbitrary length are also possible. For example, for a loop of
    length 2:

  1. Say we have span A as a parent with span B as a child (A -> B).
  2. parentA := WithParent(A)
  3. parentB := WithParent(B)
  4. A.Finish(); B.Finish();
  5. X := tr.StartSpan(parentA); Y := tr.StartSpan(parentB);

    If B is re-used as X, and A is re-used as Y, we get the following graph:

 B<-┐
 |  |
 └->A

We avoid these hazards by having WithParent(s) increment s' reference
count, so spans are not re-used while the creation of a child is pending.
Spans can be Finish()ed while the creation of the child is pending, in
which case the creation of the child will reliably detect the
use-after-Finish (and turn it into a no-op if configured to tolerate
such illegal uses).

Introducing this reference count, and only reusing spans with a
reference count of zero, introduced the risk of leaking references if
one does opt = WithParent(sp) and then discards the resulting opt
without passing it to StartSpan(). This would cause a silent performance
regression (not a memory leak though, as the GC is still there). This
risk seems worth it for avoiding deadlocks in case of other buggy usage.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@andreimatei andreimatei changed the title Tracing.pool spans tracing: pool and reuse spans Dec 16, 2021
@andreimatei andreimatei force-pushed the tracing.pool-spans branch 2 times, most recently from 7c41605 to 05be2c1 on December 16, 2021 02:07
@andreimatei andreimatei marked this pull request as ready for review December 16, 2021 02:08
@andreimatei
Copy link
Contributor Author

@nvanbenschoten thanks for agreeing to this review. I promise one of equal or greater value.

@andreimatei andreimatei force-pushed the tracing.pool-spans branch 3 times, most recently from e90830e to 211157c on December 20, 2021 23:58
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 23, 2021
This commit contains a collection of small cleanups that I stumbled upon
while getting familiar with this package, in preparation to give cockroachdb#73883
a code review.

Most of the changes are minor, though there is one more involved tweak
to use the `optsPool` in more places.

During the code walkthrough, I also noticed two different possible areas
to improve performance. These were based on code inspection alone, not on
any runtime measurements, so take them with a grain of salt.
1. the `TraceInfo_OtelInfo` struct incurs unnecessary cost by storing
   its fixed-size `TraceID` and `SpanID` in byte slices. Can these be
   inlined into the `TraceInfo_OtelInfo` struct as arrays? Ironically,
   the slice headers (24-bytes) are actually larger than the slices
   themselves, so this would actually reduce the size of `TraceInfo_OtelInfo`
   and avoid any indirection or extra heap allocation.
2. the `WithParent` function does a nice job avoiding heap allocations.
   Unfortunately, this is not the case for the `WithRemoteParent` function,
   which does heap allocate. Can we avoid this? Taking a `SpanMeta` by pointer
   seems to just shift the problem, because now we need a heap-allocated
   `SpanMeta`. But maybe we can address this by lazily decoding `SpanMeta`
   objects. Instead of decoding a SpanMeta and passing that in to
   `WithRemoteParent`, we could introduce a `SpanMetaSource` interface that
   is implemented on `*tracingpb.TraceInfo` and `*Tracer`. We can then pass
   these heap-allocated references in to `WithRemoteParent`.

In general, the code in the package is clean and I was impressed by the
degree to which we try to batch heap allocations in `startSpanGeneric`.
Member

@nvanbenschoten nvanbenschoten left a comment


thanks for agreeing to this review. I promise one of equal or greater value.

It just so happens that I finally got around to reviewing this on December 23rd, so consider it an early Christmas present. No repayment necessary.

Reviewed 8 of 8 files at r3, 13 of 13 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


-- commits, line 16 at r3:
s/distructive/destructive/


-- commits, line 66 at r4:
s/and/are/


-- commits, line 172 at r4:
We discussed this before, but I think it's worth being explicit about what we're trading off. We're protecting against the risk of use-after-free bugs, but by increasing the potential for leaking a parent's reference count in a very subtle manner. This is the case when WithParent is called, but the result is not used to create a new child span for one reason or another.

I agree that we should err on the side of caution because a deadlock caused by a use-after-free bug is more severe than a leaked reference count — we still have a GC, after all. On the other hand, I have some mild concerns that a leaked reference count will introduce silent performance regressions, and we have little ability to assert against it or notice the regression. Meanwhile, at least we can blow up loudly on newly introduced use-after-free bugs.

Have you thought more about this? Perhaps we could use finalizers in some of our testing variants to assert against reference count leaks? Pebble does something similar.


pkg/util/tracing/crdbspan.go, line 190 at r4 (raw file):

	{
		s.mu.Lock()
		if s.mu.finished {

Make a note here about how this initial check ensures that only one caller tries to perform cleanup for any given span.


pkg/util/tracing/crdbspan.go, line 217 at r4 (raw file):

	// Operate on the parent outside the child's lock. childFinished() might call
	// back into the child and acquire the child's lock.

"the child (our current receiver)"


pkg/util/tracing/crdbspan.go, line 227 at r4 (raw file):

	var needRegistryChange bool
	{
		s.mu.Lock()

Do we like that we drop and then immediately re-acquire the lock when the span has no parent?


pkg/util/tracing/crdbspan.go, line 250 at r4 (raw file):

		// initial capacity provided by the spanAllocHelper. We don't want to reuse
		// it in case it has grown too much, so that we don't accumulate many large
		// arrays.

Add another sentence here that mentions that the next span will use the spanAllocHelper's small child alloc slice again.


pkg/util/tracing/crdbspan.go, line 728 at r4 (raw file):

	collectChildRec := s.mu.openChildren[childIdx].collectRecording
	// Remember to drop the child's reference at the end.
	childPtr := s.mu.openChildren[childIdx].move()

Can't we just s.mu.openChildren[childIdx].release() directly?


pkg/util/tracing/span.go, line 200 at r4 (raw file):

func (sp *Span) decRef() {
	refCnt := atomic.AddInt32(&sp.refCnt, -1)
	// If this was the last reference, make the span available for re-use.

Want to add an assertion that this didn't drop below 0?


pkg/util/tracing/span.go, line 514 at r4 (raw file):

}

// release decrements the parent's reference counter. If there is no parent, or

Is "parent" the right word here?


pkg/util/tracing/span.go, line 517 at r4 (raw file):

// if release() has already been called, a release() call is a no-op.
//
// Note that release() is not thread-safe. When a spanRef is shared between

This is subtle and may be misinterpreted as saying that two different spanRefs need to synchronize when each individually being released. Consider driving home that point somewhere. Or just say that spanRefs shouldn't be shared across different threads and that, instead, each thread should hold its own spanRef.


pkg/util/tracing/span_inner.go, line 93 at r4 (raw file):

	// Copy references that we might need later out of the span. The span can't be
	// used any more after s.crdb.finish() returns.

Explain why this is. It's not very obvious because it looks like you're just calling finish on one of the fields in spanInner. The circular reference through the alloc struct is hidden from view.


pkg/util/tracing/span_inner.go, line 103 at r4 (raw file):

	if otelSpan != nil {
		s.otelSpan.End()

Should this line be otelSpan.End() and the one below be netTr.Finish()?


pkg/util/tracing/span_options.go, line 54 at r4 (raw file):

	// If parent is set, its refCnt is assumed to have been incremented with the
	// reference held here (see WithParent()). StartSpan() will decrement refCnt.
	// Because of this, a spanOptions with Parent set cannot be reused.

Also mention (in some form) that a spanOption with Parent set can't be discarded without being used. So it must be used exactly once. Maybe down below on WithParent.


pkg/util/tracing/span_options.go, line 223 at r4 (raw file):

	// and we can add our own reference. Once we've added our reference, this
	// reference will keep the span good to use until we eventually release it. On
	// the other hand, if we ever see a zero refCnt, then we can't use the span.

Would this logic be worth centralizing in a method on Span? I imagine we may want to use it in other places as well. Something like (*Span) tryIncRef() bool.


pkg/util/tracing/span_options.go, line 253 at r4 (raw file):

	// which will own it (it will release() it when either the child or the parent
	// are Finish()ed).
	ref := makeSpanRef(sp)

The double incRef then decRef is strange. Can we avoid it somehow? Maybe the tryIncRef method I suggested above should return a spanRef if it succeeds.


pkg/util/tracing/span_options.go, line 266 at r4 (raw file):

	}

	return (parentOption)(ref.move())

Doesn't this heap allocate? I thought you tried hard to avoid that, which was why you didn't want the fat pointers.

I thought what you wanted was something like:

type Span struct {
	// ...
}

func (Span) incRef() {}
func (Span) decRef() {}

type spanRef Span

func makeSpanRef(sp *Span) *spanRef {
	sp.incRef()
	return (*spanRef)(sp)
}

func moveSpanRef(sr **spanRef) *spanRef {
	cpy := *sr
	*sr = nil
	return cpy
}

func (sr *spanRef) release() {
	if sr == nil {
		return
	}
}

type spanOptions struct {
	// ...
}
type SpanOption interface {
	apply(spanOptions) spanOptions
}

type parentOption spanRef

func WithParent(sp *Span) SpanOption {
	// ...
	ref := makeSpanRef(sp)
	defer ref.release()

	// ...
	return (*parentOption)(moveSpanRef(&ref))
}

func (p *parentOption) apply(opts spanOptions) spanOptions {
	// ...
	return opts
}

pkg/util/tracing/tracer.go, line 172 at r4 (raw file):

// Finish()ed. See Tracer.reuseSpans for details.
var reuseSpans = buildutil.CrdbTestBuild ||
	envutil.EnvOrDefaultBool("COCKROACH_REUSE_TRACING_SPANS", false)

You're intentionally defaulting to false for now? Also, is this even used?


pkg/util/tracing/tracer.go, line 487 at r4 (raw file):

			h.crdbSpan.tracer = t
			h.span.i.crdb = &h.crdbSpan
			h.crdbSpan.mu.tags = h.tagsAlloc[:]

Should this be [:0]?

Also, are we no longer using the childrenAlloc?

EDIT: or is all of this unnecessary because it's handled by reset? In which case, should it be removed?


pkg/util/tracing/tracer.go, line 845 at r4 (raw file):

	if &sp.helper.span != sp {
		panic("span inconsistent with its spanAllocHelper")
	}

Do we want to clear out the Span and all of its inner references before placing it back in the pool? Could this be holding on to other references that we'd like to GC?


pkg/util/tracing/tracer.go, line 992 at r4 (raw file):

	s := t.newSpan()
	s.reset(

nit: reset is a strange name for the method called immediately after allocating something. Should it be init?


pkg/util/tracing/span_test.go, line 563 at r4 (raw file):

	child := tr.StartSpan("child", WithParent(parent))
	rec := parent.FinishAndGetRecording(RecordingVerbose)
	child.Finish()

nit: consider moving this after the assertion. Right now, it looks like it's doing more than (I think) it actually is.

Contributor Author

@andreimatei andreimatei left a comment


Didn't finish addressing all your comments yet, but I wanted to reply to your WithParent allocation discussion.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


-- commits, line 16 at r3:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/distructive/destructive/

done


-- commits, line 66 at r4:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/and/are/

done


pkg/util/tracing/crdbspan.go, line 190 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Make a note here about how this initial check ensures that only one caller tries to perform cleanup for any given span.

see now


pkg/util/tracing/crdbspan.go, line 217 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"the child (our current receiver)"

done


pkg/util/tracing/crdbspan.go, line 728 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can't we just s.mu.openChildren[childIdx].release() directly?

We can. I think I was reticent to do that because doing it relies on this not being the last reference to the child. But that's a reasonable assumption, as I've now explained in an assertion here. PTAL.


pkg/util/tracing/span.go, line 200 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Want to add an assertion that this didn't drop below 0?

done


pkg/util/tracing/span.go, line 514 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is "parent" the right word here?

no, changed


pkg/util/tracing/span.go, line 517 at r4 (raw file):
Added

Different spanRefs can reference the same span though, and they can be used concurrently.


pkg/util/tracing/span_inner.go, line 103 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this line be otelSpan.End() and the one below be netTr.Finish()?

yes, good catch


pkg/util/tracing/span_options.go, line 266 at r4 (raw file):

Doesn't this heap allocate?

It does not seem to allocate, no. I imagine it's because the compiler is smart enough to equate

type spanRef struct {
	*Span
}

to a pointer. spanRef can't be a direct alias for *Span (i.e. type spanRef *Span), because then the pointer receiver on the release() method would be illegal.

FWIW, the compiler is saying the following about the sp argument escaping:

./span_options.go:193:17: parameter sp leaks to {storage for parentOption(ref.move())} with derefs=0:
./span_options.go:193:17:   flow: sp = sp:
./span_options.go:193:17:     from sp := sp (assign-pair) at ./span_options.go:253:20
./span_options.go:193:17:   flow: ~R0 = sp:
./span_options.go:193:17:     from spanRef{...} (struct literal element) at ./span_options.go:253:20
./span_options.go:193:17:     from ~R0 = spanRef{...} (assign-pair) at ./span_options.go:253:20
./span_options.go:193:17:   flow: ref = ~R0:
./span_options.go:193:17:     from ref := spanRef(~R0) (assign) at ./span_options.go:253:6
./span_options.go:193:17:   flow: {storage for parentOption(ref.move())} = ref:
./span_options.go:193:17:     from ref (address-of) at ./span_options.go:266:27
./span_options.go:193:17:     from ref.move() (call parameter) at ./span_options.go:266:32
./span_options.go:193:17:     from parentOption(ref.move()) (interface-converted) at ./span_options.go:266:23
./span_options.go:193:17: leaking param: sp

I've verified through a benchmark that WithParent(sp) doesn't allocate. If we were to add any other fields to spanRef, it would allocate.
Given that, I think the current form is cleaner than what you're suggesting. Sounds good?
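This claim can be spot-checked with testing.AllocsPerRun; a minimal sketch of the same shape (stand-in types, not the actual package code): a struct wrapping a single pointer is "pointer-shaped", so the compiler stores it in an interface directly, without boxing.

```go
package main

import (
	"fmt"
	"testing"
)

type Span struct{ operation string }

// spanRef wraps exactly one pointer, so it is pointer-shaped: it fits
// into an interface value directly, with no heap allocation.
type spanRef struct{ sp *Span }

type SpanOption interface{ apply() }

type parentOption spanRef

func (p parentOption) apply() {}

// withParent mirrors the WithParent shape: convert the one-pointer
// struct into the SpanOption interface.
func withParent(sp *Span) SpanOption {
	return parentOption(spanRef{sp: sp})
}

func main() {
	sp := &Span{operation: "root"}
	allocs := testing.AllocsPerRun(1000, func() {
		_ = withParent(sp)
	})
	// Zero: the one-pointer struct is stored in the interface directly.
	fmt.Println(allocs)
}
```

Adding any second field to spanRef would make it non-pointer-shaped, and the interface conversion would start allocating, as noted above.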


pkg/util/tracing/tracer.go, line 172 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You're intentionally defaulting to false for now? Also, is this even used?

used now. It's a bit complicated because this is used in conjunction with the spanReusePercent metamorphic constant.

And yeah, it's intentionally off for starters. I wanna start with tests, then roachprod, then prod.

Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


-- commits, line 172 at r4:

We discussed this before, but I think it's worth being explicit about what we're trading off. We're protecting against the risk of use-after-free bugs, but by increasing the potential for leaking a parent's reference count in a very subtle manner. This is the case when WithParent is called, but the result is not used to create a new child span for one reason or another.

Added such words to the commit msg.

On the other hand, I have some mild concerns that a leaked reference count will introduce silent performance regressions, and we have little ability to assert against it or notice the regression.

Well, I did add some tests checking that there is at least some span reuse. But I guess you're not impressed.

Have you thought more about this? Perhaps we could use finalizers in some of our testing variants to assert against reference count leaks? Pebble does something similar.

Yeah, I do like the finalizer idea in principle. How exactly would you do it? Install the finalizer on all spans in unit tests? Or on some percentage? Or try on all of them and see how expensive it is?


pkg/util/tracing/crdbspan.go, line 227 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we like that we drop and then immediately re-acquire the lock when the span has no parent?

meh, I guess not. I've changed it to avoid doing that. What do you think?


pkg/util/tracing/crdbspan.go, line 250 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add another sentence here that mentions that the next span will use the spanAllocHelper's small child alloc slice again.

done


pkg/util/tracing/span_inner.go, line 93 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Explain why this is. It's not very obvious because it looks like you're just calling finish on one of the fields in spanInner. The circular reference through the alloc struct is hidden from view.

actually, I think this stanza is not necessary, so I've removed it. s can still be used after s.crdb.finish(). There's no circular reference at play here. I think I used to have one at some point, but I've reorganized things.


pkg/util/tracing/span_inner.go, line 103 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

yes, good catch

actually, I think these lines were fine. I think copying the references above was unnecessary; I must have needed that in an intermediary version of the code. s.netTr and s.otelSpan are still usable after s.crdb.finish().


pkg/util/tracing/span_options.go, line 54 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Also mention (in some form) that a spanOption with Parent set can't be discarded without being used. So it must be used exactly once. Maybe down below on WithParent.

WithParent() already talks about this. I've added more words here too.


pkg/util/tracing/span_options.go, line 223 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would this logic be worth centralizing in a method on Span? I imagine we may want to use it in other places as well. Something like (*Span) tryIncRef() bool.

I don't think this is needed in other places. Apart from this, there are two other places that increment the refCount, and I think neither of them needs it.

  1. Span.reset(). This is called when the span is being created, so its refcount has to be zero at that point.
  2. crdbSpan.addChildLocked increments the child's counter. This is called when the child is being created, so nobody should have a reference to it.

And so if nobody else needs it, I think it's better for this logic to live in the one place that needs it. Thoughts?


pkg/util/tracing/span_options.go, line 253 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The double incRef then decRef is strange. Can we avoid it somehow? Maybe the tryIncRef method I suggested above should return a spanRef if it succeeds.

A bit strange, but not the worst in my opinion. I've slightly improved the comment.
If we end up with a tryIncRef(), then I'll do what you suggest. But if we don't, I'd leave it alone - so let's take the discussion above.


pkg/util/tracing/tracer.go, line 487 at r4 (raw file):
I've kinda hesitated with exactly what to put in reset() and what to put in this New. I wanted to put things that don't need to be repeated on every re-allocation in New.
As written, reset() does:

		c.mu.openChildren = sp.helper.childrenAlloc[:0]
		c.mu.tags = c.mu.tags[:0]

Notice that openChildren storage comes from the allocHelper, and the tags come from the span itself. So, as written, assigning the tags here to something is necessary.
The asymmetry is motivated (to the extent that it's rational) by the fact that the children are more dynamic than the tags: a span can end up with a ton of children, but not with a ton of tags (at least that's the expectation). As such, we want to GC the backing array for long lists of children and start with a pre-allocated small list for every span (re-)allocation, but for tags we can keep reusing the array that the previous generation of the span has used.
Do you buy this or do you think symmetrical treatment for the two arrays would be better?

Should this be [:0]?

Given what reset does, I think you can argue either way between h.tagsAlloc[:] vs [:0]. I've now switched it to [:0].
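For what it's worth, the difference between `h.tagsAlloc[:]` and `[:0]` comes down to Go's slice semantics: `[:0]` drops the elements but keeps the backing array and its capacity, so the next generation of the span can append without allocating. A tiny standalone illustration:

```go
package main

import "fmt"

func main() {
	// Re-slicing to [:0] resets the length but keeps the backing array,
	// so subsequent appends reuse the old storage.
	tags := make([]string, 0, 4)
	tags = append(tags, "k1=v1", "k2=v2")
	before := cap(tags)
	tags = tags[:0]
	fmt.Println(len(tags), cap(tags) == before) // 0 true
}
```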


pkg/util/tracing/tracer.go, line 845 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we want to clear out the Span and all of its inner references before placing it back in the pool? Could this be holding on to other references that we'd like to GC?

This is also something I hesitated on. The span might have accumulated a couple of things:

  • a net.Trace reference. I don't really care about it; in fact I hope to delete that functionality.
  • an otel Span reference. I don't really care about it.
  • the tags array; this one is reused on reset(), as we're discussing above.
  • the logs buffer and the structured logs buffer. These ones are usually unused with default settings. But when they're used, they can grow arbitrarily, so they are potential problems perhaps.
  • note that the children array is cleared in crdbSpan.finish().

I'm noticing now that Buffer.reset() does not nil out elements, nor does it have any provision for regressing to a smaller array. I'm thinking I'll add this functionality and call Reset() on the buffers eagerly as we're releasing a span to the pool, but otherwise I won't do a brutal overwrite with a Span zero value. What do you think?
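A hedged sketch of what such a Reset() could look like — the names (`buffer`, `targetCap`) and the shrink threshold are made up, not the actual util Buffer API. The two behaviors discussed are: nil out elements so the GC can reclaim what they point to, and regress to a small backing array when one large trace has grown the buffer far past its target size:

```go
package main

import "fmt"

type buffer struct {
	els []interface{}
}

const targetCap = 8 // illustrative target size for a recycled buffer

// Reset empties the buffer. If the backing array has grown far beyond the
// target, it is discarded so a single huge trace doesn't pin memory forever;
// otherwise the elements are nil'ed out so the GC can reclaim what they
// reference, and the array is kept for reuse.
func (b *buffer) Reset() {
	if cap(b.els) > 4*targetCap {
		b.els = make([]interface{}, 0, targetCap)
		return
	}
	for i := range b.els {
		b.els[i] = nil
	}
	b.els = b.els[:0]
}

func main() {
	b := &buffer{}
	for i := 0; i < 100; i++ {
		b.els = append(b.els, i)
	}
	b.Reset()
	fmt.Println(len(b.els), cap(b.els)) // shrunk back down: 0 8
}
```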


pkg/util/tracing/tracer.go, line 992 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: reset is a strange name for the method called immediately after allocating something. Should it be init?

Well, I think reset is appropriate given that the spans in the pool are not fully zeroed out. So let's see where that ends up. No?


pkg/util/tracing/span_test.go, line 563 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider moving this after the assertion. Right now, it looks like it's doing more than (I think) it actually is.

done

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 7 of 12 files at r6, 4 of 5 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


-- commits, line 172 at r4:

Well, I did add some testing checking that there is at least some span reuse. But I guess you're not impressed.

But this isn't just the responsibility of the library, it's the responsibility of every user of the library. So a unit test here can't assert against the absence of abuse.

For instance, can't I write the following code in the middle of pkg/kv/kvclient and leak a pooled object?

parent := tr.StartSpan()
parentOpt := tracing.WithParent(parent)
if err := ... {
    return err
}
child := tr.StartSpan(parentOpt)

Yeah, I do like the finalizer idea in principle. How exactly would you do it? Install the finalizer on all spans in unit tests? Or on some percentage? Or try on all of them and see how expensive it is?

I don't quite know. You can take a look at how pebble does this, as it might provide some inspiration.


pkg/util/tracing/crdbspan.go, line 217 at r7 (raw file):

		//
		// If there is no parent, we avoid releasing and then immediately
		// re-acquiring the lock.

Mention that this is just a performance optimization and that this method does not assume continuous mutual exclusion between the two blocks.


pkg/util/tracing/crdbspan.go, line 224 at r7 (raw file):

	}

	hasParent := !parent.empty()

Want to set hasParent in the previous code block and use that to optionally unlock? That way there's parity between the two conditions that control the locking.


pkg/util/tracing/span_options.go, line 223 at r4 (raw file):

And so if nobody else needs it, I think it's better for this logic to leave in the one place that needs it. Thoughts?

I don't necessarily agree, mainly because it feels strange to have such specific reference counting logic in this top-level function that's in a different file than everything else. Reference counting is hard. Atomic reference counting is harder. Let's keep all of that centralized and let the business logic here work with a reasonable API.


pkg/util/tracing/span_options.go, line 253 at r4 (raw file):

If we end up with a tryIncRef(), then I'll do what you suggest. But if we don't, I'd leave it alone - so let's take the discussion above.

Based on the discussion above, want to type this out and see how you feel about it?


pkg/util/tracing/span_options.go, line 266 at r4 (raw file):

I imagine it's because the compiler is smart enough to represent a single-pointer struct in an interface as just the pointer.

Huh, TIL. Do you mind confirming this is the case by adding another field to SpanRef like you mentioned and verifying that this does create allocations in your benchmarks? I just want to make sure we're not fooling ourselves here.


pkg/util/tracing/tracer.go, line 487 at r4 (raw file):

Do you buy this or do you think symmetrical treatment for the two arrays would be better?

I think this all makes me wonder why we have tagsAlloc at all. If we're going to let the tags re-allocate and then recycle this new backing array on each subsequent use of a pooled spanAllocHelper, then what does the tagsAlloc give us? My reasoning may be flawed here.


pkg/util/tracing/tracer.go, line 845 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This is also something I hesitated on. The span might have accumulated a couple of things:

  • a net.Trace reference. I don't really care about it; in fact I hope to delete that functionality.
  • an otel Span reference. I don't really care about it.
  • the tags array; this one is reused on reset(), as we're discussing above.
  • the logs buffer and the structured logs buffer. These ones are usually unused with default settings. But when they're used, they can grow arbitrarily, so they are potential problems perhaps.
  • note that childrens array is cleared in crdbSpans.finish().

I'm noticing now that Buffer.reset() does not nil out elements, nor does it have any provisions from regressing to a smaller array. I'm thinking I'll add this functionality, and call Reset() on the buffers eagerly as we're releasing a span to the pool, but otherwise I won't make a brutal overwrite with a Span zero value. What do you think?

I would be explicit about the inner references that are retained across recycled Spans. Even if you've thought through this now, things may change when we add a new field to Span. We often use a pattern like the following to achieve this behavior:

```go
*sp = Span{
    fieldToRecycle: sp.fieldToRecycle,
}
```
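As a runnable illustration of that pattern (field names hypothetical): every field not explicitly listed reverts to its zero value, so a newly added field is dropped by default rather than silently retained across recycled spans:

```go
package main

import "fmt"

type Span struct {
	tags []string    // deliberately recycled across generations
	otel interface{} // stand-in for a reference we do NOT want to retain
}

// release zeroes everything except the fields we deliberately recycle.
func (sp *Span) release() {
	*sp = Span{
		tags: sp.tags[:0],
	}
}

func main() {
	sp := &Span{tags: make([]string, 0, 8), otel: "ref"}
	sp.tags = append(sp.tags, "a")
	before := cap(sp.tags)
	sp.release()
	fmt.Println(sp.otel == nil, cap(sp.tags) == before) // true true
}
```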

@andreimatei andreimatei force-pushed the tracing.pool-spans branch 2 times, most recently from 1194de3 to bc28cfb Compare January 11, 2022 01:27
Contributor Author

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


-- commits, line 172 at r4:

Well, I did add some testing checking that there is at least some span reuse. But I guess you're not impressed.

But this isn't just the responsibility of the library, it's the responsibility of every user of the library. So a unit test here can't assert against the absence of abuse.

Well, but for the record, I didn't write a "unit test". I wrote an "integration test", if you will, that runs some SQL queries and checks something. But of course that "something" is admittedly far from proving that there are no leaks.

Added detection with finalizers in CrdbTestBuild. Verified that it doesn't seem to affect test running time - presumably because I clear the finalizers in the happy path.


pkg/util/tracing/crdbspan.go, line 217 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mention that this is just a performance optimization and that this method does not assume continuous mutual exclusion between the two blocks.

done


pkg/util/tracing/crdbspan.go, line 224 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Want to set hasParent in the previous code block and use that to optionally unlock? That way there's parity between the two conditions that control the locking.

done


pkg/util/tracing/span_options.go, line 223 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

And so if nobody else needs it, I think it's better for this logic to leave in the one place that needs it. Thoughts?

I don't necessarily agree, mainly because it feels strange to have such specific reference counting logic in this top-level function that's in a different file than everything else. Reference counting is hard. Atomic reference counting is harder. Let's keep all of that centralized and let the business logic here work with a reasonable API.

ok, fair enough. I introduced a tryMakeSpanRef, although I struggled a bit to explain its point without tying it to WithParent. PTAL.


pkg/util/tracing/span_options.go, line 253 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we end up with a tryIncRef(), then I'll do what you suggest. But if we don't, I'd leave it alone - so let's take the discussion above.

Based on the discussion above, want to type this out and see how you feel about it?

done


pkg/util/tracing/span_options.go, line 266 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I imagine it's because the compiler is smart enough to represent a single-pointer struct in an interface as just the pointer.

Huh, TIL. Do you mind confirming this is the case by adding another field to SpanRef like you mentioned and verifying that this does create allocations in your benchmarks? I just want to make sure we're not fooling ourselves here.

With an extra field on SpanRef, this happens:

name                                           old allocs/op  new allocs/op  delta
Tracer_StartSpanCtx/opts=real,autoparent-32        1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32      1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.008 n=5+5)
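This matches Go's interface representation rules: a value whose layout is exactly one pointer can be stored directly in the interface data word, while anything fatter must be boxed on the heap. A hedged demonstration with illustrative types (not SpanRef itself):

```go
package main

import (
	"fmt"
	"testing"
)

type slim struct{ p *int } // pointer-shaped: fits directly in the interface word

type fat struct { // an extra field makes it too big to store inline
	p   *int
	gen int
}

var sink interface{} // package-level sink to defeat dead-code elimination

func main() {
	x := new(int)
	slimAllocs := testing.AllocsPerRun(1000, func() { sink = slim{x} })
	fatAllocs := testing.AllocsPerRun(1000, func() { sink = fat{x, 1} })
	// Typically 0 allocations for slim and 1 per conversion for fat.
	fmt.Println(slimAllocs, fatAllocs)
}
```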

pkg/util/tracing/tracer.go, line 487 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you buy this or do you think symmetrical treatment for the two arrays would be better?

I think this all makes me wonder why we have tagsAlloc at all. If we're going to let the tags re-allocate and then recycle this new backing array on each subsequent use of a pooled spanAllocHelper, then what does the tagsAlloc give us? My reasoning may be flawed here.

Please see now; I've reworked this. Now, all the buffers are nilled in Span.Finish() and re-assigned to the spanAllocHelper in reset().


pkg/util/tracing/tracer.go, line 845 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I would be explicit about the inner references that are retained across recycled Spans. Even if you've thought through this now, things may change when we add a new field to Span. We often use a pattern like to achieve this behavior:

*sp = Span{
    fieldToRecycle: sp.fieldToRecycle,
}
``

please see now

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm:

Reviewed 2 of 15 files at r8, 2 of 2 files at r10, 2 of 4 files at r11, 16 of 16 files at r12, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/util/tracing/span.go, line 527 at r12 (raw file):

		h := sp.helper
		//// Zero out the buffers in the spanAllocHelper, to make their elements (if

Did you want to clean up this comment?


pkg/util/tracing/span_options.go, line 266 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

With an extra field on SpanRef, this happens:

name                                           old allocs/op  new allocs/op  delta
Tracer_StartSpanCtx/opts=real,autoparent-32        1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32      1.00 ± 0%      2.00 ± 0%  +100.00%  (p=0.008 n=5+5)

Thanks for checking.

Contributor Author

@andreimatei andreimatei left a comment


TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)


pkg/util/tracing/span.go, line 527 at r12 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you want to clean up this comment?

yes; done

Contributor Author

@andreimatei andreimatei left a comment


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)

@craig
Contributor

craig bot commented Jan 20, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 21, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 21, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Contributor

craig bot commented Jan 21, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 22, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Jan 22, 2022

Build failed:

This test was bizarre: it performed a series of random operations on a
Buffer, and it was doing so by spinning up one goroutine doing 1
operation, a second doing 2 operations, etc. Instead, just do 100
operations. The code was also broken because the test's parallel
goroutines were capturing the operation count, and so in effect the
tests were not getting the number of operations they wanted. Also, the
random seed was not initialized, so there was less randomness than this
test wanted.

Release note: None
FinishAndGetRecording() needs to "atomically" finish the span and
collect its recording. The atomic part means that the recording should
not have an "_unfinished" tag in it. Avoiding the presence of that tag
was a bit delicate because, naively, if the recording is collected
before the span is "finished", then the tag would be present and, on the
other hand, the reverse ordering is not easily possible because, if the
span was first finished and then the recording would be collected, that
would be a use-after-Finish() and the recording would not be collected.

Before this patch, FinishAndGetRecording() was cheating a bit by
technically still collecting the recording after finish, but eliding the
use-after-Finish() protection. This shortcut is becoming problematic,
because I need to make Finish() more destructive at lower levels
(working towards the ability to reuse spans), and so even this
"internal" use-after-finish will not work any more. Instead, this patch
moves recording collection before finish, but plumbs down information
about whether a span is in the process of finishing when its recording
is collected, in order to avoid the "_unfinished" tag.

Release note: None
Before this patch, a Buffer could only grow. Now, it learns to shrink on
demand, and also to discard all its contents. Also the code is
simplified; I found the arithmetic hard to follow before.
The ability to discard contents is used in the next patch to help reset
tracing Spans before they are reused. The shrinking is not used yet, but it
seems like a small and reasonable addition nonetheless.

Release note: None
Make the tracing tests Finish() all their spans. It's generally the
civilized thing to do, and a future patch introduces checking that all
spans are finished in the form of finalizers.

Release note: None
This patch adds sync.Pool-ing and reuse to tracing spans, so that they
don't need to be dynamically allocated on every span creation. This has
a big effect on the heap footprint.

Conceptually, a span is made available for reuse on Finish(). In
practice, it's more complicated because there can still be references of
the span used concurrently with Finish(), either internally in the
tracing library or externally. The external ones are bugs by definition,
but we want to avoid some particularly nasty consequences of such bugs.

The BenchmarkTracing results below show that this saves around 10KB
worth of heap allocations per simple query, when tracing is enabled
(minimal tracing: TracingModeActiveSpansRegistry). I believe the span
allocations go from being serviced from the shared heap to being
serviced from CPU-local freelists, since they become small enough. In
the single-node case, this is 25% of the query's allocations. As can be
seen in the benchmarks below in the differences between the trace=on and
trace=off rows, the impact of tracing is big on memory footprint; with
this patch, there's not much impact.

```
name                               old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=off-32      19.7kB ± 1%    19.7kB ± 1%     ~     (p=0.768 n=10+5)
Tracing/1node/scan/trace=on-32       29.2kB ± 0%    22.0kB ± 0%  -24.85%  (p=0.001 n=10+5)
Tracing/1node/insert/trace=off-32    38.5kB ± 1%    38.4kB ± 1%     ~     (p=0.440 n=10+5)
Tracing/1node/insert/trace=on-32     45.5kB ± 1%    38.7kB ± 1%  -15.03%  (p=0.001 n=10+5)
Tracing/3node/scan/trace=off-32      68.1kB ± 3%    67.9kB ± 3%     ~     (p=0.768 n=10+5)
Tracing/3node/scan/trace=on-32       86.8kB ± 2%    75.3kB ± 2%  -13.21%  (p=0.001 n=9+5)
Tracing/3node/insert/trace=off-32    88.1kB ± 5%    90.8kB ± 7%     ~     (p=0.112 n=9+5)
Tracing/3node/insert/trace=on-32     96.1kB ± 3%    89.0kB ± 2%   -7.39%  (p=0.001 n=9+5)
```

Unfortunately, pooling spans only saves on the size of allocations, not
the number of allocations. This is because the Context in which a Span
lives still needs to be allocated dynamically, as it does not have a
clear lifetime and so it cannot be re-used (plus it's immutable, etc).
Before this patch, the code was optimized to allocate a Span and a
Context together, through trickery (we had a dedicated Context type,
which we now get rid of). So, this patch replaces an allocation for
Span+Context with just a Context allocation, which is a win because
Spans are big and Contexts are small.
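A minimal sketch of the overall scheme this commit message describes (illustrative names; the real code layers reference counting on top): Finish() returns the span to a sync.Pool, and span creation draws from the pool instead of allocating:

```go
package main

import (
	"fmt"
	"sync"
)

type span struct {
	op   string
	tags []string
}

var spanPool = sync.Pool{New: func() interface{} { return &span{} }}

// startSpan pulls a span from the pool; the tags backing array from the
// previous generation is reused via [:0].
func startSpan(op string) *span {
	sp := spanPool.Get().(*span)
	sp.op = op
	sp.tags = sp.tags[:0]
	return sp
}

// Finish makes the span available for reuse. In the real code this is
// deferred until the last counted reference is released.
func (sp *span) Finish() {
	sp.op = ""
	spanPool.Put(sp)
}

func main() {
	a := startSpan("first")
	a.Finish()
	b := startSpan("second") // likely the same object, recycled
	fmt.Println(b.op)
}
```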

BenchmarkTracing (which runs SQL queries) only show minor improvements
in the time/op, but the memory improvements are so large that I think
they must translate into sufficient GC pressure wins to be worth doing.
Micro-benchmarks from the tracing package show major time/op wins.

```
name                                           old time/op    new time/op    delta
Tracer_StartSpanCtx/opts=none-32                  537ns ± 1%     275ns ± 2%  -48.73%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real-32                  537ns ± 2%     273ns ± 2%  -49.16%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,logtag-32           565ns ± 1%     278ns ± 1%  -50.81%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,autoparent-32       879ns ±29%     278ns ± 5%  -68.36%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32     906ns ±26%     289ns ± 2%  -68.08%  (p=0.008 n=5+5)
Span_GetRecording/root-only-32                   11.1ns ± 2%    11.6ns ± 4%     ~     (p=0.056 n=5+5)
Span_GetRecording/child-only-32                  11.1ns ± 4%    11.7ns ± 2%   +5.44%  (p=0.016 n=5+5)
Span_GetRecording/root-child-32                  18.9ns ± 3%    19.5ns ± 1%   +3.55%  (p=0.008 n=5+5)
RecordingWithStructuredEvent-32                  1.37µs ± 2%    1.17µs ± 2%  -14.22%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-32             1.84µs ± 2%    0.96µs ± 0%  -47.56%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-32              2.01µs ± 1%    1.14µs ± 1%  -43.32%  (p=0.008 n=5+5)

name                                           old alloc/op   new alloc/op   delta
Tracer_StartSpanCtx/opts=none-32                   768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real-32                   768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,logtag-32            768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,autoparent-32        768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32      768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Span_GetRecording/root-only-32                    0.00B          0.00B          ~     (all equal)
Span_GetRecording/child-only-32                   0.00B          0.00B          ~     (all equal)
Span_GetRecording/root-child-32                   0.00B          0.00B          ~     (all equal)
RecordingWithStructuredEvent-32                  1.54kB ± 0%    0.77kB ± 0%  -49.86%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-32             4.62kB ± 0%    0.29kB ± 0%     ~     (p=0.079 n=4+5)
SpanCreation/detached-child=true-32              5.09kB ± 0%    0.77kB ± 0%  -84.87%  (p=0.008 n=5+5)
```

This patch brings us very close to enabling the
TracingModeActiveSpansRegistry tracing mode by default in production -
which would give us a registry of all in-flight spans/operations in the
system.

 ### Interactions with use-after-Finish detection

Span reuse interacts with the recently-introduced span-use-after-Finish
detection. Spans are made available for reuse on Finish (technically,
when certain references to the span have been drained; see below). When
a span is reused in between Finish() and an erroneous use of the
Finish()ed span, this bug cannot be detected and results in the caller
operating on an unintended span. This can result in the wrong log
message appearing in the wrong span, and such. Care has been taken so
that use-after-Finish bugs do not result in more structural problems,
such as loops in the parent-child relationships.

 ### Technical details

The mechanism used for making spans available for reuse is reference
counting; the release of a span to the pool is deferred to the release
of the last counted reference.
Counted references are held:
- internally: children hold references to the parent and the parent holds
  references to the children.
- externally: the WithParent(sp) option takes a reference on sp.

Apart from WithParent, clients using Spans do not track their references
because it'd be too burdensome to require all the references to have a
clearly defined life cycle, and to be explicitly released when they're
no longer in use. For clients, the contract is that a span can be used
until Finish(). WithParent is special, though; see below.

Different alternatives to reference counting were explored. In
particular, instead of a deferred-release scheme, an alternative was a
fat-pointer scheme where references that can outlive a span are tagged
with the span's "generation". That works for the internal use cases, but
the problem with this scheme is that WithParent(s) ends up allocating -
which I want to avoid. Right now, WithParent(s) returns a pointer as an
interface, which doesn't allocate.  But if the pointer gets fat, it no
longer fits into the class of things that can be put in interfaces
without allocating.

The reference counter is an atomic; it is not protected by the span's
lock because a parent's counter needs to be accessed under both the
parent's lock and the children's locks.

In details, the reference counter serves a couple of purposes:
1) Prevent re-allocation of a Span while child spans are still operating on
it. In particular, this ensures that races between Finish()ing a parent and
a child cannot result in the child operating on a re-allocated parent.
Because of the span's lock ordering convention, a child cannot hold its
lock while operating on the parent. During Finish(), the child drops its
lock and informs the parent that it is Finish()ing. If the parent
Finish()es at the same time, that call could erroneously conclude that the
parent can be made available for re-use, even though the child goroutine
has a pending call into the parent.

2) Prevent re-allocation of child spans while a Finish()ing parent is in
the process of transforming the children into roots and inserting them into
the active spans registry. Operating on the registry is done without
holding any span's lock, so a race with a child's Finish() could result in
the registry operating on a re-allocated span.

3) Prevent re-allocation of a Span in between the time that WithParent(s)
captures a reference to s and the time when the parent option is used to
create the child. Such an inopportune reuse of a span could only happen if
the span is Finish()ed concurrently with the creation of its child, which
is illegal. Still, we optionally tolerate use-after-Finishes, and this use
cannot be tolerated without the reference count protection. Without this
protection, the tree of spans in a trace could degenerate into a graph
through the introduction of loops. A loop could lead to deadlock due to the
fact that we lock multiple spans at once. The lock ordering convention is
that the parent needs to be locked before the child, which ensures
deadlock-freedom in the absence of loops.
For example:
1. parent := tr.StartSpan()
2. parentOpt := WithParent(parent)
3. parent.Finish()
4. child := tr.StartSpan(parentOpt)
If "parent" would be re-allocated as "child", then child would have itself
as a parent. The use of parentOpt in step 4) after parent was finished in
step 3) is a use-after-Finish of parent; it is illegal and, if detection is
enabled, it might be detected as such. However, if span pooling and re-use
is enabled, then the detection is not realiable (it does not catch cases
where a span is re-used before a reference to it taken before the prior
Finish() is used).
A span having itself as a parent is just the trivial case of the problem;
loops of arbitrary length are also possible. For example, for a loop of
length 2:
1. Say we have span A as a parent with span B as a child (A -> B).
2. parentA := WithParent(A)
3. parentB := WithParent(B)
4. A.Finish(); B.Finish();
5. X := tr.StartSpan(parentA); Y := tr.StartSpan(parentB);
If B is re-used as X, and A is re-used as Y, we get the following graph:
 B<-┐
 |  |
 └->A

We avoid these hazards by having WithParent(s) increment s' reference
count, so spans are not re-used while the creation of a child is pending.
Spans can be Finish()ed while the creation of the child is pending, in
which case the creation of the child will reliably detect the
use-after-Finish (and turn it into a no-op if configured to tolerate
such illegal uses).

Introducing this reference count, and only reusing spans with a
reference count of zero, introduced the risk of leaking references if
one does opt = WithParent(sp) and then discards the resulting opt
without passing it to StartSpan(). This would cause a silent performance
regression (not a memory leak though, as the GC is still there). This
risk seems worth it for avoiding deadlocks in case of other buggy usage.

Release note: None
@andreimatei
Contributor Author

bors r+

@craig
Contributor

craig bot commented Jan 22, 2022

Build succeeded:

@craig craig bot merged commit 4fc0855 into cockroachdb:master Jan 22, 2022
@andreimatei andreimatei deleted the tracing.pool-spans branch January 22, 2022 22:05
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Jan 23, 2022
In cockroachdb#73883 we started reusing tracing spans through a sync.Pool. This
causes the mutexes in the spans to be acquired by a goroutine
repeatedly, in a random order. Multiple spans can be locked at the same
time, but the ordering is well defined (a child can be locked while the
parent is also locked). This erroneously looks like a possible deadlock
to the deadlock detector. The following test demonstrates the problem -
it's flagged by the deadlock detector even though the code is
deadlock-free.

This patch fixes the issue by disabling span reuse under deadlock
builds.

I guess the moral of the story is that recycling structs with locks
through sync.Pool is incompatible with the deadlock detector? One
option, I guess, is to contribute a way to mark some mutexes as excluded
from the deadlock detector.

```
func TestXXX(t *testing.T) {
	p := sync.Pool{New: func() interface{} {
		return new(syncutil.Mutex)
	},
	}

	var m1, m2 *syncutil.Mutex
	m1 = p.Get().(*syncutil.Mutex)
	m2 = p.Get().(*syncutil.Mutex)

	m1.Lock()
	m2.Lock()
	m2.Unlock()
	m1.Unlock()

	p.Put(m1)
	p.Put(m2)
	m2 = p.Get().(*syncutil.Mutex)
	m1 = p.Get().(*syncutil.Mutex)

	m1.Lock()
	m2.Lock()
	m2.Unlock()
	m1.Unlock()
}
```

Fixes cockroachdb#75351 and a million others.

Release note: None
craig bot pushed a commit that referenced this pull request Jan 23, 2022
75423: util/tracing: disable span reuse under deadlock detector r=andreimatei a=andreimatei


Co-authored-by: Andrei Matei <[email protected]>
cameronnunez pushed a commit to cameronnunez/cockroach that referenced this pull request Jan 24, 2022
gtr pushed a commit to gtr/cockroach that referenced this pull request Jan 24, 2022