73883: tracing: pool and reuse spans r=andreimatei a=andreimatei

This patch adds sync.Pool-ing and reuse to tracing spans, so that they don't need to be dynamically allocated on every span creation. This has a big effect on the heap footprint.

Conceptually, a span is made available for reuse on Finish(). In practice, it's more complicated because there can still be references of the span used concurrently with Finish(), either internally in the tracing library or externally. The external ones are bugs by definition, but we want to avoid some particularly nasty consequences of such bugs.

The BenchmarkTracing results below show that this saves around 10KB worth of heap allocations per simple query, when tracing is enabled (minimal tracing: TracingModeActiveSpansRegistry). I believe the span allocations go from being serviced from the shared heap to being serviced from CPU-local freelists, since they become small enough. In the single-node case, this is 25% of the query's allocations. As can be seen in the benchmarks below in the differences between the trace=on and trace=off rows, the impact of tracing is big on memory footprint; with this patch, there's not much impact.

```
name                               old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=off-32      19.7kB ± 1%    19.7kB ± 1%     ~     (p=0.768 n=10+5)
Tracing/1node/scan/trace=on-32       29.2kB ± 0%    22.0kB ± 0%  -24.85%  (p=0.001 n=10+5)
Tracing/1node/insert/trace=off-32    38.5kB ± 1%    38.4kB ± 1%     ~     (p=0.440 n=10+5)
Tracing/1node/insert/trace=on-32     45.5kB ± 1%    38.7kB ± 1%  -15.03%  (p=0.001 n=10+5)
Tracing/3node/scan/trace=off-32      68.1kB ± 3%    67.9kB ± 3%     ~     (p=0.768 n=10+5)
Tracing/3node/scan/trace=on-32       86.8kB ± 2%    75.3kB ± 2%  -13.21%  (p=0.001 n=9+5)
Tracing/3node/insert/trace=off-32    88.1kB ± 5%    90.8kB ± 7%     ~     (p=0.112 n=9+5)
Tracing/3node/insert/trace=on-32     96.1kB ± 3%    89.0kB ± 2%   -7.39%  (p=0.001 n=9+5)
```

Unfortunately, pooling spans only saves on the size of allocations, not the number of allocations.
This is because the Context in which a Span lives still needs to be allocated dynamically, as it does not have a clear lifetime and so it cannot be re-used (plus it's immutable, etc). Before this patch, the code was optimized to allocate a Span and a Context together, through trickery (we had a dedicated Context type, which we now get rid of). So, this patch replaces an allocation for Span+Context with just a Context allocation, which is a win because Spans are big and Contexts are small.

BenchmarkTracing (which runs SQL queries) only shows minor improvements in the time/op, but the memory improvements are so large that I think they must translate into sufficient GC pressure wins to be worth doing.

Micro-benchmarks from the tracing package show major time/op wins.

```
name                                           old time/op    new time/op    delta
Tracer_StartSpanCtx/opts=none-32                  537ns ± 1%     275ns ± 2%  -48.73%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real-32                  537ns ± 2%     273ns ± 2%  -49.16%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,logtag-32           565ns ± 1%     278ns ± 1%  -50.81%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,autoparent-32       879ns ±29%     278ns ± 5%  -68.36%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32     906ns ±26%     289ns ± 2%  -68.08%  (p=0.008 n=5+5)
Span_GetRecording/root-only-32                   11.1ns ± 2%    11.6ns ± 4%     ~     (p=0.056 n=5+5)
Span_GetRecording/child-only-32                  11.1ns ± 4%    11.7ns ± 2%   +5.44%  (p=0.016 n=5+5)
Span_GetRecording/root-child-32                  18.9ns ± 3%    19.5ns ± 1%   +3.55%  (p=0.008 n=5+5)
RecordingWithStructuredEvent-32                  1.37µs ± 2%    1.17µs ± 2%  -14.22%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-32             1.84µs ± 2%    0.96µs ± 0%  -47.56%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-32              2.01µs ± 1%    1.14µs ± 1%  -43.32%  (p=0.008 n=5+5)

name                                           old alloc/op   new alloc/op   delta
Tracer_StartSpanCtx/opts=none-32                   768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real-32                   768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,logtag-32            768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,autoparent-32        768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Tracer_StartSpanCtx/opts=real,manualparent-32      768B ± 0%       48B ± 0%  -93.75%  (p=0.008 n=5+5)
Span_GetRecording/root-only-32                    0.00B          0.00B          ~     (all equal)
Span_GetRecording/child-only-32                   0.00B          0.00B          ~     (all equal)
Span_GetRecording/root-child-32                   0.00B          0.00B          ~     (all equal)
RecordingWithStructuredEvent-32                  1.54kB ± 0%    0.77kB ± 0%  -49.86%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-32             4.62kB ± 0%    0.29kB ± 0%     ~     (p=0.079 n=4+5)
SpanCreation/detached-child=true-32              5.09kB ± 0%    0.77kB ± 0%  -84.87%  (p=0.008 n=5+5)
```

This patch brings us very close to enabling the TracingModeActiveSpansRegistry tracing mode by default in production, which would give us a registry of all in-flight spans/operations in the system.

### Interactions with use-after-Finish detection

Span reuse interacts with the recently-introduced span-use-after-Finish detection. Spans are made available for reuse on Finish (technically, when certain references to the span have been drained; see below). When a span is reused in between Finish() and an erroneous use of the Finish()ed span, this bug cannot be detected and results in the caller operating on an unintended span. This can result, for example, in a log message appearing in the wrong span. Care has been taken so that use-after-Finish bugs do not result in more structural problems, such as loops in the parent-child relationships.

### Technical details

The mechanism used for making spans available for reuse is reference counting; the release of a span to the pool is deferred to the release of the last counted reference. Counted references are held:
- internally: children hold references to the parent and the parent holds references to the children.
- externally: the WithParent(sp) option takes a reference on sp.
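
The deferred-release idea above can be illustrated with a minimal, self-contained sketch. This is not the code from this patch: the `span` type and the `startSpan`, `incRef`, `decRef` and `poolReturns` names are hypothetical, and the real implementation also deals with locking, children and the active spans registry. The sketch only shows that Finish() drops one counted reference, and that the span goes back to the sync.Pool when the last reference is dropped.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// span is a hypothetical, stripped-down stand-in for a tracing span.
type span struct {
	op       string
	refCnt   int32 // counted references; accessed only through sync/atomic
	finished int32 // set to 1 by Finish()
}

// spanPool recycles span structs once their last counted reference is gone.
var spanPool = sync.Pool{New: func() interface{} { return new(span) }}

// poolReturns counts how many spans went back to the pool (demo only).
var poolReturns int32

// startSpan takes a span from the pool. The returned span carries one
// counted reference, owned by the caller until Finish().
func startSpan(op string) *span {
	s := spanPool.Get().(*span)
	s.op = op
	atomic.StoreInt32(&s.finished, 0)
	atomic.StoreInt32(&s.refCnt, 1)
	return s
}

// incRef takes an additional counted reference, as a pending parent option
// or a child pointing at its parent would (assumption of this sketch).
func (s *span) incRef() { atomic.AddInt32(&s.refCnt, 1) }

// decRef drops a counted reference. Whoever drops the last one returns the
// span to the pool. It reports whether the span was released.
func (s *span) decRef() bool {
	if atomic.AddInt32(&s.refCnt, -1) != 0 {
		return false
	}
	atomic.AddInt32(&poolReturns, 1)
	spanPool.Put(s)
	return true
}

// Finish marks the span finished and drops the creator's reference. The
// span becomes reusable only if nothing else still references it.
func (s *span) Finish() bool {
	atomic.StoreInt32(&s.finished, 1)
	return s.decRef()
}

func main() {
	parent := startSpan("parent")
	parent.incRef() // e.g. a pending parent option holding a reference

	fmt.Println("released by Finish:", parent.Finish())      // false
	fmt.Println("released by last decRef:", parent.decRef()) // true
}
```

In this sketch the counter is a plain atomic, mirroring the point made below that the counter is not protected by the span's lock. A second Finish() on the same span would drive the counter negative; guarding against that is left out for brevity.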

Apart from WithParent, clients using Spans do not track their references because it'd be too burdensome to require all the references to have a clearly defined life cycle, and to be explicitly released when they're no longer in use. For clients, the contract is that a span can be used until Finish(). WithParent is special, though; see below.

Different alternatives to reference counting were explored. In particular, instead of a deferred-release scheme, an alternative was a fat-pointer scheme where references that can outlive a span are tagged with the span's "generation". That works for the internal use cases, but the problem with this scheme is that WithParent(s) ends up allocating, which I want to avoid. Right now, WithParent(s) returns a pointer as an interface, which doesn't allocate. But if the pointer gets fat, it no longer fits into the class of things that can be put in interfaces without allocating.

The reference counter is an atomic; it is not protected by the span's lock because a parent's counter needs to be accessed under both the parent's lock and the children's locks.

In detail, the reference counter serves a couple of purposes:

1) Prevent re-allocation of a Span while child spans are still operating on it. In particular, this ensures that races between Finish()ing a parent and a child cannot result in the child operating on a re-allocated parent. Because of the span's lock ordering convention, a child cannot hold its lock while operating on the parent. During Finish(), the child drops its lock and informs the parent that it is Finish()ing. If the parent Finish()es at the same time, that call could erroneously conclude that the parent can be made available for re-use, even though the child goroutine has a pending call into the parent.

2) Prevent re-allocation of child spans while a Finish()ing parent is in the process of transforming the children into roots and inserting them into the active spans registry. Operating on the registry is done without holding any span's lock, so a race with a child's Finish() could result in the registry operating on a re-allocated span.

3) Prevent re-allocation of a Span in between the time that WithParent(s) captures a reference to s and the time when the parent option is used to create the child. Such an inopportune reuse of a span could only happen if the span is Finish()ed concurrently with the creation of its child, which is illegal. Still, we optionally tolerate use-after-Finishes, and this use cannot be tolerated without the reference count protection. Without this protection, the tree of spans in a trace could degenerate into a graph through the introduction of loops. A loop could lead to deadlock due to the fact that we lock multiple spans at once. The lock ordering convention is that the parent needs to be locked before the child, which ensures deadlock-freedom in the absence of loops. For example:

   1. parent := tr.StartSpan()
   2. parentOpt := WithParent(parent)
   3. parent.Finish()
   4. child := tr.StartSpan(parentOpt)

   If "parent" were re-allocated as "child", then child would have itself as a parent. The use of parentOpt in step 4) after parent was finished in step 3) is a use-after-Finish of parent; it is illegal and, if detection is enabled, it might be detected as such. However, if span pooling and re-use is enabled, then the detection is not reliable: it does not catch cases where the span has already been re-used by the time a stale reference, taken before the Finish(), is used.

   A span having itself as a parent is just the trivial case of the problem; loops of arbitrary length are also possible. For example, for a loop of length 2:

   1. Say we have span A as a parent with span B as a child (A -> B).
   2. parentA := WithParent(A)
   3. parentB := WithParent(B)
   4. A.Finish(); B.Finish();
   5. X := tr.StartSpan(parentA); Y := tr.StartSpan(parentB);

   If B is re-used as X, and A is re-used as Y, we get the following graph:

   ```
   B<-┐
   |  |
   └->A
   ```

We avoid these hazards by having WithParent(s) increment s' reference count, so spans are not re-used while the creation of a child is pending. Spans can be Finish()ed while the creation of the child is pending, in which case the creation of the child will reliably detect the use-after-Finish (and turn it into a no-op if configured to tolerate such illegal uses).

Introducing this reference count, and only reusing spans with a reference count of zero, introduces the risk of leaking references if one does opt = WithParent(sp) and then discards the resulting opt without passing it to StartSpan(). This would cause a silent performance regression (not a memory leak though, as the GC is still there). This risk seems worth it for avoiding deadlocks in case of other buggy usage.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>