-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tracing: rework StartSpan API #56109
Conversation
8cb1bae
to
7f815c5
Compare
7f815c5
to
51c9f36
Compare
Ready for review! 🙌 |
51c9f36
to
a8bfd7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)
pkg/util/tracing/shadow.go, line 68 at r2 (raw file):
// It is valid to call this on a nil shadowTracer; it will // return zero values in this case. func (st *shadowTracer) Typ() (string, bool) {
[nit] can be Type
pkg/util/tracing/shadow.go, line 79 at r2 (raw file):
} type otLogTagsOption logtags.Buffer
[nit] could use a small comment
pkg/util/tracing/span_options.go, line 21 at r2 (raw file):
// typically not used directly. Instead, the methods mentioned on each // field comment below are invoked as arguments to `Tracer.StartSpan`. type SpanOptions struct {
does this need to be exported?
pkg/util/tracing/span_options.go, line 99 at r2 (raw file):
// // When no local Span is available, WithRemoteParent should be used. func WithParent(sp *Span) SpanOption {
[supernit] It would be easier to follow the code if this method (with its comment) was before, not after the internal type definition and implementation (same for other options)
pkg/util/tracing/span_options.go, line 122 at r2 (raw file):
type tagsOption []opentracing.Tag var _ SpanOption = (*tagsOption)(nil)
remove the *
(it doesn't match the actual usage)
pkg/util/tracing/tracer.go, line 182 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Heh, the answer is simple: I missed the line. It's right there:
$ ./scripts/goescape ./pkg/util/tracing/tracer.go | grep tracer.go:181 ./tracer.go:181:6: moved to heap: opts
Sorry about the noise.
Changing the API to take and return by value seems like the cleanest solution here. It's not that big of a struct and we only expect one or two options in most cases.
The tracer previously offered three APIs for starting spans. First, there was `StartSpan`, mandated by the `opentracing` interfaces, which was unnecessarily allocation-heavy, in particular when the caller was holding on to a span directly (rather than having received information about a parent span from the RPC). Offsetting the inefficiencies, we had added two separate methods, `StartRootSpan` and `StartChildSpan`, that took in various options directly (and which were not conforming to the `opentracing` specs). These latter two in particular had become a hodge podge of settings that were added organically over time. Now, with the opentracing interfaces out of the picture, it was time to make some clarifications. The key change in this PR is that StartSpan now becomes the *only* way to start traces, and allows doing so efficiently in all cases. In particular, we provide two separate options that specify a parent span: - `WithParent`, for the case in which the caller has a `*Span` at their disposal, i.e. we're creating a new trace span locally, and - `WithRemoteParent`, for the case in which a span is to be derived from metadata about a parent span on the other side of an RPC. The implementations are now deliberate about which case they're in, which I have found greatly helps the clarity of the implementation. In particular, `*SpanContext`, which was previously used for both local and remote parents, and as a result was only partially populated in the remote case, now only contains the fields corresponding to data coming in over the wire, and is only ever used in the remote case. Reflecting this, it was renamed to `RemoteSpan`. Its usage has decreased dramatically due to the `WithParent` option, with only a single call site to `WithRemoteParent` remaining outside of the `tracing` package (to propagate trace information on the Raft streams, something difficult to wrap in an RPC middleware due to batching). The explicit distinction between remote and local parents has also clarified the current semantics of recording inheritance. Remember that (explicit opt-out aside) a child span created from a local parent shares its recording with that parent (which transitively shares with any local grandparent). "Obviously" this does not work across RPC boundaries without an additional mechanism to feed the spans back into the caller's span, but in the old API there really wasn't an obvious way to simulate the "remote" case (since calling `.Context()` on a local span and deriving from that locally would correspond to what is now `WithParent(span)`, i.e. the local case). Now, in tests such as `TestRecordingInRecording`, we clearly see the distinction in behavior. As an aside, the current thinking is to do away with the concept of sharing recordings and also the concept of inserting a recording into an existing span. Instead, we will endow the tracer with a registry that is handed any recordings received from remote RPCs. The registry is in charge of maintaining the partial trace tree and releasing it the moment the local span is `Finished()`, and allows retrieving the aggregate recording before this point in time. This is both conceptually simpler (spans never share recordings, so in particular the knob to control the behavior goes away), and also allows for straightforward generalizations towards true distributed tracing. The `Recordable` option was renamed to `WithForceRealSpan` for clarity. While technically the old name was apt, the dichotomy between noopSpan vs recordable span has been hard for me to internalize and I assume this holds for others as well. "real span" vs "noop span" is a more natural dichotomy to remember. Of course, ultimately the goal of moving to always-on tracing means that the concept of a noop span goes away (or at least stops being the case to optimize for), so more changes here will take place in the future. There is still cruft in the interfaces, but with PR we're within striking distance and will be able to adapt the semantics to the requirements of always-on tracing soon. Also, we are probably allocating more than we were before. However, there's nothing we can't fix. I am planning to polish cockroachdb#54738 soon and carry out a first round of improvements within. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR, Radu!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @nvanbenschoten, and @RaduBerinde)
pkg/util/tracing/span_options.go, line 99 at r2 (raw file):
Previously, RaduBerinde wrote…
[supernit] It would be easier to follow the code if this method (with its comment) was before, not after the internal type definition and implementation (same for other options)
Something deep down inside of me does not want to return the type before it's defined, but I a) removed the type assertions (the WithX
methods return the type as the interface, so they already imply the assertion), b) moved each WithX
method immediately behind the type def (i.e. before apply
).
pkg/util/tracing/tracer.go, line 182 at r1 (raw file):
Previously, RaduBerinde wrote…
Changing the API to take and return by value seems like the cleanest solution here. It's not that big of a struct and we only expect one or two options in most cases.
I just went and did that.
a8bfd7d
to
5ba137c
Compare
bors r=RaduBerinde |
Build succeeded: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @knz, @nvanbenschoten, @RaduBerinde, and @tbg)
pkg/util/tracing/tracer.go, line 490 at r3 (raw file):
} tr := sp.Tracer() newSpan := tr.StartSpan(opName, WithParent(sp), WithCtxLogTags(ctx))
This used to be a follows-from relationship, now it's child-of. Did you intend this? I think you want to add a WithFollowsFrom()
option here.
Btw, is there a reason why one has to do WithParent(sp), WithFollowsFrom()
instead of WithFollowsFrom(sp)
?
"Shadow traces" (OpenTracing) were broken at node boundaries because the server shadow span was not using the remote span context. I believe this was broken in cockroachdb#56109. Fixes cockroachdb#62702 Release note (bug fix): OpenTracing traces work correctly again across nodes. The client and server spans for RPCs are once again part of the same trace instead of the server being a root span erroneously.
62703: tracing: fix OpenTracing traces across nodes r=tbg a=andreimatei "Shadow traces" (OpenTracing) were broken at node boundaries because the server shadow span was not using the remote span context. I believe this was broken in #56109. Fixes #62702 Release note (bug fix): OpenTracing traces work correctly again across nodes. The client and server spans for RPCs are once again part of the same trace instead of the server being a root span erroneously. Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Tobias Grieger <[email protected]>
The tracer previously offered three APIs for starting spans. First, there was
StartSpan
, mandated by theopentracing
interfaces, which was unnecessarily allocation-heavy, in particular when the caller was holding on to a span directly (rather than having received information about a parent span from the RPC). Offsetting the inefficiencies, we had added two separate methods,StartRootSpan
andStartChildSpan
, that took in various options directly (and which were not conforming to theopentracing
specs). These latter two in particular had become a hodge podge of settings that were added organically over time.Now, with the opentracing interfaces out of the picture, it was time to make some clarifications. The key change in this PR is that StartSpan now becomes the only way to start traces, and allows doing so efficiently in all cases. In particular, we provide two separate options that specify a parent span:
WithParent
, for the case in which the caller has a*Span
at their disposal, i.e. we're creating a new trace span locally, andWithRemoteParent
, for the case in which a span is to be derived from metadata about a parent span on the other side of an RPC.The implementations are now deliberate about which case they're in, which I have found greatly helps the clarity of the implementation. In particular,
*SpanContext
, which was previously used for both local and remote parents, and as a result was only partially populated in the remote case, now only contains the fields corresponding to data coming in over the wire, and is only ever used in the remote case. Reflecting this, it was renamed toRemoteSpan
. Its usage has decreased dramatically due to theWithParent
option, with only a single call site toWithRemoteParent
remaining outside of thetracing
package (to propagate trace information on the Raft streams, something difficult to wrap in an RPC middleware due to batching).The explicit distinction between remote and local parents has also clarified the current semantics of recording inheritance. Remember that (explicit opt-out aside) a child span created from a local parent shares its recording with that parent (which transitively shares with any local grandparent). "Obviously" this does not work across RPC boundaries without an additional mechanism to feed the spans back into the caller's span, but in the old API there really wasn't an obvious way to simulate the "remote" case (since calling
.Context()
on a local span and deriving from that locally would correspond to what is nowWithParent(span)
, i.e. the local case). Now, in tests such asTestRecordingInRecording
, we clearly see the distinction in behavior.As an aside, the current thinking is to do away with the concept of sharing recordings and also the concept of inserting a recording into an existing span. Instead, we will endow the tracer with a registry that is handed any recordings received from remote RPCs. The registry is in charge of maintaining the partial trace tree and releasing it the moment the local span is
Finished()
, and allows retrieving the aggregate recording before this point in time. This is both conceptually simpler (spans never share recordings, so in particular the knob to control the behavior goes away), and also allows for straightforward generalizations towards true distributed tracing.The
Recordable
option was renamed toWithForceRealSpan
for clarity. While technically the old name was apt, the dichotomy between noopSpan vs recordable span has been hard for me to internalize and I assume this holds for others as well. "real span" vs "noop span" is a more natural dichotomy to remember. Of course, ultimately the goal of moving to always-on tracing means that the concept of a noop span goes away (or at least stops being the case to optimize for), so more changes here will take place in the future.There is still cruft in the interfaces, but with PR we're within striking distance and will be able to adapt the semantics to the requirements of always-on tracing soon.
Also, we are probably allocating more than we were before. However,
there's nothing we can't fix. I am planning to polish #54738 soon
and carry out a first round of improvements within.
Release note: None