Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: rework StartSpan API #56109

Merged
merged 1 commit into from
Oct 30, 2020
Merged

Conversation

tbg
Copy link
Member

@tbg tbg commented Oct 29, 2020

The tracer previously offered three APIs for starting spans. First, there was StartSpan, mandated by the opentracing interfaces, which was unnecessarily allocation-heavy, in particular when the caller was holding on to a span directly (rather than having received information about a parent span from the RPC). Offsetting the inefficiencies, we had added two separate methods, StartRootSpan and StartChildSpan, that took in various options directly (and which were not conforming to the opentracing specs). These latter two in particular had become a hodge podge of settings that were added organically over time.

Now, with the opentracing interfaces out of the picture, it was time to make some clarifications. The key change in this PR is that StartSpan now becomes the only way to start traces, and allows doing so efficiently in all cases. In particular, we provide two separate options that specify a parent span:

  • WithParent, for the case in which the caller has a *Span at their disposal, i.e. we're creating a new trace span locally, and
  • WithRemoteParent, for the case in which a span is to be derived from metadata about a parent span on the other side of an RPC.

The implementations are now deliberate about which case they're in, which I have found greatly helps the clarity of the implementation. In particular, *SpanContext, which was previously used for both local and remote parents, and as a result was only partially populated in the remote case, now only contains the fields corresponding to data coming in over the wire, and is only ever used in the remote case. Reflecting this, it was renamed to RemoteSpan. Its usage has decreased dramatically due to the WithParent option, with only a single call site to WithRemoteParent remaining outside of the tracing package (to propagate trace information on the Raft streams, something difficult to wrap in an RPC middleware due to batching).

The explicit distinction between remote and local parents has also clarified the current semantics of recording inheritance. Remember that (explicit opt-out aside) a child span created from a local parent shares its recording with that parent (which transitively shares with any local grandparent). "Obviously" this does not work across RPC boundaries without an additional mechanism to feed the spans back into the caller's span, but in the old API there really wasn't an obvious way to simulate the "remote" case (since calling .Context() on a local span and deriving from that locally would correspond to what is now WithParent(span), i.e. the local case). Now, in tests such as TestRecordingInRecording, we clearly see the distinction in behavior.
As an aside, the current thinking is to do away with the concept of sharing recordings and also the concept of inserting a recording into an existing span. Instead, we will endow the tracer with a registry that is handed any recordings received from remote RPCs. The registry is in charge of maintaining the partial trace tree and releasing it the moment the local span is Finished(), and allows retrieving the aggregate recording before this point in time. This is both conceptually simpler (spans never share recordings, so in particular the knob to control the behavior goes away), and also allows for straightforward generalizations towards true distributed tracing.

The Recordable option was renamed to WithForceRealSpan for clarity. While technically the old name was apt, the dichotomy between noopSpan vs recordable span has been hard for me to internalize and I assume this holds for others as well. "real span" vs "noop span" is a more natural dichotomy to remember. Of course, ultimately the goal of moving to always-on tracing means that the concept of a noop span goes away (or at least stops being the case to optimize for), so more changes here will take place in the future.

There is still cruft in the interfaces, but with PR we're within striking distance and will be able to adapt the semantics to the requirements of always-on tracing soon.

Also, we are probably allocating more than we were before. However,
there's nothing we can't fix. I am planning to polish #54738 soon
and carry out a first round of improvements within.

Release note: None

@tbg tbg requested a review from a team October 29, 2020 08:57
@tbg tbg requested review from a team as code owners October 29, 2020 08:57
@tbg tbg requested review from pbardea and removed request for a team October 29, 2020 08:57
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg force-pushed the tracing-start-api branch from 8cb1bae to 7f815c5 Compare October 29, 2020 11:06
pkg/util/tracing/tracer.go Outdated Show resolved Hide resolved
@tbg tbg force-pushed the tracing-start-api branch from 7f815c5 to 51c9f36 Compare October 29, 2020 14:59
@tbg tbg requested review from irfansharif and knz and removed request for a team and pbardea October 29, 2020 16:02
@tbg tbg changed the title [wip] tracing: rework StartSpan API tracing: rework StartSpan API Oct 29, 2020
@tbg
Copy link
Member Author

tbg commented Oct 29, 2020

Ready for review! 🙌

@tbg tbg force-pushed the tracing-start-api branch from 51c9f36 to a8bfd7d Compare October 29, 2020 16:25
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @knz, and @tbg)


pkg/util/tracing/shadow.go, line 68 at r2 (raw file):

// It is valid to call this on a nil shadowTracer; it will
// return zero values in this case.
func (st *shadowTracer) Typ() (string, bool) {

[nit] can be Type


pkg/util/tracing/shadow.go, line 79 at r2 (raw file):

}

type otLogTagsOption logtags.Buffer

[nit] could use a small comment


pkg/util/tracing/span_options.go, line 21 at r2 (raw file):

// typically not used directly. Instead, the methods mentioned on each
// field comment below are invoked as arguments to `Tracer.StartSpan`.
type SpanOptions struct {

does this need to be exported?


pkg/util/tracing/span_options.go, line 99 at r2 (raw file):

//
// When no local Span is available, WithRemoteParent should be used.
func WithParent(sp *Span) SpanOption {

[supernit] It would be easier to follow the code if this method (with its comment) was before, not after the internal type definition and implementation (same for other options)


pkg/util/tracing/span_options.go, line 122 at r2 (raw file):

type tagsOption []opentracing.Tag

var _ SpanOption = (*tagsOption)(nil)

remove the * (it doesn't match the actual usage)


pkg/util/tracing/tracer.go, line 182 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Heh, the answer is simple: I missed the line. It's right there:

$ ./scripts/goescape ./pkg/util/tracing/tracer.go | grep tracer.go:181
./tracer.go:181:6: moved to heap: opts

Sorry about the noise.

Changing the API to take and return by value seems like the cleanest solution here. It's not that big of a struct and we only expect one or two options in most cases.

The tracer previously offered three APIs for starting spans. First,
there was `StartSpan`, mandated by the `opentracing` interfaces, which
was unnecessarily allocation-heavy, in particular when the caller was
holding on to a span directly (rather than having received information
about a parent span from the RPC). Offsetting the inefficiencies, we had
added two separate methods, `StartRootSpan` and `StartChildSpan`, that
took in various options directly (and which were not conforming to the
`opentracing` specs). These latter two in particular had become a hodge
podge of settings that were added organically over time.

Now, with the opentracing interfaces out of the picture, it was time to
make some clarifications. The key change in this PR is that StartSpan
now becomes the *only* way to start traces, and allows doing so
efficiently in all cases. In particular, we provide two separate options
that specify a parent span:

- `WithParent`, for the case in which the caller has a `*Span` at their
  disposal, i.e. we're creating a new trace span locally, and
- `WithRemoteParent`, for the case in which a span is to be derived from
  metadata about a parent span on the other side of an RPC.

The implementations are now deliberate about which case they're in,
which I have found greatly helps the clarity of the implementation. In
particular, `*SpanContext`, which was previously used for both local and
remote parents, and as a result was only partially populated in the
remote case, now only contains the fields corresponding to data coming
in over the wire, and is only ever used in the remote case. Reflecting
this, it was renamed to `RemoteSpan`. Its usage has decreased
dramatically due to the `WithParent` option, with only a single call
site to `WithRemoteParent` remaining outside of the `tracing` package
(to propagate trace information on the Raft streams, something difficult
to wrap in an RPC middleware due to batching).

The explicit distinction between remote and local parents has also
clarified the current semantics of recording inheritance. Remember that
(explicit opt-out aside) a child span created from a local parent shares
its recording with that parent (which transitively shares with any local
grandparent). "Obviously" this does not work across RPC boundaries
without an additional mechanism to feed the spans back into the caller's
span, but in the old API there really wasn't an obvious way to simulate
the "remote" case (since calling `.Context()` on a local span and
deriving from that locally would correspond to what is now
`WithParent(span)`, i.e. the local case).
Now, in tests such as `TestRecordingInRecording`, we clearly see the
distinction in behavior.  As an aside, the current thinking is to do
away with the concept of sharing recordings and also the concept of
inserting a recording into an existing span. Instead, we will endow the
tracer with a registry that is handed any recordings received from
remote RPCs. The registry is in charge of maintaining the partial trace
tree and releasing it the moment the local span is `Finished()`, and
allows retrieving the aggregate recording before this point in time.
This is both conceptually simpler (spans never share recordings, so in
particular the knob to control the behavior goes away), and also allows
for straightforward generalizations towards true distributed tracing.

The `Recordable` option was renamed to `WithForceRealSpan` for clarity.
While technically the old name was apt, the dichotomy between noopSpan
vs recordable span has been hard for me to internalize and I assume this
holds for others as well. "real span" vs "noop span" is a more natural
dichotomy to remember. Of course, ultimately the goal of moving to
always-on tracing means that the concept of a noop span goes away (or at
least stops being the case to optimize for), so more changes here will
take place in the future.

There is still cruft in the interfaces, but with PR we're within
striking distance and will be able to adapt the semantics to the
requirements of always-on tracing soon.

Also, we are probably allocating more than we were before. However,
there's nothing we can't fix. I am planning to polish cockroachdb#54738 soon
and carry out a first round of improvements within.

Release note: None
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR, Radu!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @nvanbenschoten, and @RaduBerinde)


pkg/util/tracing/span_options.go, line 99 at r2 (raw file):

Previously, RaduBerinde wrote…

[supernit] It would be easier to follow the code if this method (with its comment) was before, not after the internal type definition and implementation (same for other options)

Something deep down inside of me does not want to return the type before it's defined, but I a) removed the type assertions (the WithX methods return the type as the interface, so they already imply the assertion), b) moved each WithX method immediately behind the type def (i.e. before apply).


pkg/util/tracing/tracer.go, line 182 at r1 (raw file):

Previously, RaduBerinde wrote…

Changing the API to take and return by value seems like the cleanest solution here. It's not that big of a struct and we only expect one or two options in most cases.

I just went and did that.

@tbg tbg force-pushed the tracing-start-api branch from a8bfd7d to 5ba137c Compare October 29, 2020 21:27
@tbg
Copy link
Member Author

tbg commented Oct 30, 2020

bors r=RaduBerinde

@craig
Copy link
Contributor

craig bot commented Oct 30, 2020

Build succeeded:

@craig craig bot merged commit 5f414d3 into cockroachdb:master Oct 30, 2020
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @knz, @nvanbenschoten, @RaduBerinde, and @tbg)


pkg/util/tracing/tracer.go, line 490 at r3 (raw file):

		}
		tr := sp.Tracer()
		newSpan := tr.StartSpan(opName, WithParent(sp), WithCtxLogTags(ctx))

This used to be a follows-from relationship, now it's child-of. Did you intend this? I think you want to add a WithFollowsFrom() option here.
Btw, is there a reason why one has to do WithParent(sp), WithFollowsFrom() instead of WithFollowsFrom(sp) ?

andreimatei added a commit to andreimatei/cockroach that referenced this pull request Mar 28, 2021
"Shadow traces" (OpenTracing) were broken at node boundaries because the
server shadow span was not using the remote span context.
I believe this was broken in cockroachdb#56109.

Fixes cockroachdb#62702

Release note (bug fix): OpenTracing traces work correctly again across
nodes. The client and server spans for RPCs are once again part of the
same trace instead of the server being a root span erroneously.
craig bot pushed a commit that referenced this pull request Mar 29, 2021
62703: tracing: fix OpenTracing traces across nodes r=tbg a=andreimatei

"Shadow traces" (OpenTracing) were broken at node boundaries because the
server shadow span was not using the remote span context.
I believe this was broken in #56109.

Fixes #62702

Release note (bug fix): OpenTracing traces work correctly again across
nodes. The client and server spans for RPCs are once again part of the
same trace instead of the server being a root span erroneously.

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants