Clarify the interpretation of SpanKind #337

Merged
merged 6 commits into from
Nov 4, 2019
Changes from 1 commit
44 changes: 33 additions & 11 deletions specification/api-tracing.md
@@ -530,19 +530,13 @@ Returns true if the canonical code of this `Status` is `Ok`, otherwise false.

## SpanKind

Depending on the `Span` position in a `Trace` and application components
boundaries, it can play a different role. This role often defines how `Span`
will be processed and visualized by various backends. So it is important to
record this "hint" whenever possible to the best of the caller's knowledge.
`SpanKind` describes the relationship between the Span, its parents,
and its children in a Trace as a hint to the system while rendering
traces. Several conventional SpanKind values are defined:

These are the possible SpanKinds:

* `INTERNAL` Default value. Indicates that the span represents an internal
operation within an application, as opposed to an operation happening at the
boundaries.
* `SERVER` Indicates that the span covers server-side handling of an RPC or
other remote request.
* `CLIENT` Indicates that the span describes a request to some remote service.
other request.

Member

"remote" seemed relevant, why was it removed?

Contributor Author

There's nothing to prevent a service from calling itself. Why does the "remote" aspect matter?

Member

Because semantically we (I think) want "client" kind to mean "network call", not in process function call.

* `CLIENT` Indicates that the span describes a request to some service.
* `PRODUCER` Indicates that the span describes a producer sending a message to a
broker. Unlike client and server, there is often no direct critical path
latency relationship between producer and consumer spans. A `Producer` span ends
@@ -551,3 +545,31 @@ These are the possible SpanKinds:
* `CONSUMER` Indicates that the span describes a consumer receiving a message from
a broker. As for the `PRODUCER` kind, there is often no direct critical
path latency relationship between producer and consumer spans.

`SpanKind` serves as an additional annotation, where existing semantic
conventions do not fully specify a relationship. For example, we
could infer from a change of `host.name` and `service.name` resources
that a child has contacted a remote host and crossed a service
boundary, but we cannot be sure whether it is a CLIENT-to-SERVER
relationship or a PRODUCER-to-CONSUMER relationship. Use the
`SpanKind` attribute to specify this level of detail.
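
To make the ambiguity concrete, here is a minimal Python sketch. The `Span` dataclass, its field names, and the helper functions are hypothetical names chosen for illustration; they are not part of any OpenTelemetry API. Comparing resources can show that a boundary was crossed, but only an explicit kind on both spans says which relationship applies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span: only the fields needed for this illustration.
    name: str
    host_name: str
    service_name: str
    kind: str = ""  # empty string means "no description"
    parent: Optional["Span"] = None


def crossed_service_boundary(child: Span) -> bool:
    """True when the child's host or service differs from its parent's."""
    parent = child.parent
    if parent is None:
        return False
    return (child.host_name, child.service_name) != (
        parent.host_name,
        parent.service_name,
    )


def relationship(child: Span) -> str:
    """Describe the parent-to-child relationship as far as it can be known."""
    parent = child.parent
    if parent is None or not crossed_service_boundary(child):
        return "same-service"
    if parent.kind and child.kind:
        return f"{parent.kind}-to-{child.kind}"
    # Resources alone cannot distinguish an RPC from a message hand-off.
    return "remote (CLIENT-to-SERVER or PRODUCER-to-CONSUMER, unknown)"
```

With kinds recorded, `relationship` can name the pair; without them it can only report that some remote hop happened.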

`SpanKind` values are strings. Implementations MUST accept any value
of `SpanKind`, to accommodate future versions of this specification.

Applications may use non-conventional values for `SpanKind` to provide
additional description that could be useful for offline analysis.
Tracing vendors SHOULD display the `SpanKind` as additional
description while rendering traces, even for unconventional values.
Depending on experience, future versions of this specification could
include new conventional `SpanKind` values, so users and implementors
are free to invent new kinds of span as appropriate (e.g.,
SENDER/RECEIVER, INGRESS/EGRESS).

`SpanKind` may be left empty, to indicate no description.
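
A small Python sketch of how a consumer of span data might honor the rules above: accept any string, treat the five conventional values as merely well known, and still show unconventional values when rendering. The names `CONVENTIONAL_KINDS`, `is_conventional`, and `render_label` are assumptions for illustration, and the label format is arbitrary.

```python
CONVENTIONAL_KINDS = frozenset(
    {"INTERNAL", "SERVER", "CLIENT", "PRODUCER", "CONSUMER"}
)


def is_conventional(kind: str) -> bool:
    """True for the kinds this proposal defines; other strings are still valid."""
    return kind in CONVENTIONAL_KINDS


def render_label(span_name: str, kind: str) -> str:
    """Build a display label that shows any non-empty kind, conventional or not."""
    if not kind:
        # An empty kind means "no description", so nothing is appended.
        return span_name
    return f"{span_name} [{kind}]"
```

Nothing here rejects an unknown kind, so a value such as `EGRESS` passes through to the rendered label unchanged.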

`SpanKind` is associated with the activity that started the span,
which helps resolve ambiguity. For example, a server span that
directly calls another server could be described as both a SERVER and

Member

Not a fan of such an example in the spec, because it goes against the traditional instrumentation guideline that this scenario should have both server and client spans in the same service.

Contributor Author

I wrote that example to illustrate what I see as an ambiguity and a problem with SpanKind generally. I am not familiar with the guidelines you refer to, but it feels odd if we're suggesting that a server should start a child span in order to issue an RPC. This just raises the cost of tracing.

As far as I've been able to determine, there are a few uses for SpanKind. Currently, I believe SpanKind is ambiguous at best, so I'd like a better understanding.

One thing which was part of OpenTracing was this ChildOf vs FollowsFrom distinction. I believe the choice of "producer/consumer" is associated with FollowsFrom and that "client/server" is associated with ChildOf. The important question is whether the tracing system should assume that the child span will complete before the parent span. At LightStep we call client/server spans "well-formed", whereas FollowsFrom spans are not. We can't use FollowsFrom spans for critical path analysis.

Another thing I see SpanKind being used for is to detect spans which either start with a remote parent or are themselves a remote parent. At LightStep we refer to these as "ingress" and "egress" spans. They are more interesting for monitoring purposes than internal spans.

My understanding is that SpanKind is sort of doing both of these things, telling us whether a child should complete before its parent and whether a parent and child used context propagation rather than an internal relationship.
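
One way to read the two roles described above is as a lookup table from kind to two independent properties. The Python sketch below encodes that reading. `KindProperties`, `KIND_PROPERTIES`, and `lookup` are hypothetical names, and the table reflects this comment's interpretation rather than anything the spec text states.

```python
from typing import NamedTuple, Optional


class KindProperties(NamedTuple):
    synchronous: bool    # ChildOf-style: the remote child should finish first
    remote_parent: bool  # the span was started from a propagated context
    remote_child: bool   # the span's context is propagated to another process


# Assumed mapping: client/server pairs are synchronous, producer/consumer are not.
KIND_PROPERTIES = {
    "CLIENT": KindProperties(synchronous=True, remote_parent=False, remote_child=True),
    "SERVER": KindProperties(synchronous=True, remote_parent=True, remote_child=False),
    "PRODUCER": KindProperties(synchronous=False, remote_parent=False, remote_child=True),
    "CONSUMER": KindProperties(synchronous=False, remote_parent=True, remote_child=False),
}


def lookup(kind: str) -> Optional[KindProperties]:
    """Return the assumed properties, or None for INTERNAL and unknown kinds."""
    return KIND_PROPERTIES.get(kind)
```

The table only holds if each span has exactly one role, which is the instrumentation guideline raised in this thread.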

I don't see SpanKind actually accomplishing these things, unless the guidelines you refer to are met. I would prefer if we had less ambiguous ways to do this, which do not require extra spans where they are not necessary.

Some solutions that I would prefer.

Spans could count the number of times their context is Injected. Spans with >0 injected contexts ought to have remote children. These are "egress" spans. Spans that are started from an Extracted context ought to have a remote parent. These are "ingress" spans.
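
The counting idea can be sketched in a few lines of Python. `SpanRecord` and `classify` are hypothetical names, and the field names are deliberately neutral about which propagation call is involved: one field counts how often the span's context was sent to another process, the other records whether the span began from a context received from another process.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class SpanRecord:
    # Hypothetical bookkeeping a tracer could keep per span.
    name: str
    started_from_propagated_context: bool = False
    times_context_propagated_out: int = 0


def classify(span: SpanRecord) -> Set[str]:
    """Label a span as ingress, egress, both, or internal."""
    roles = set()
    if span.started_from_propagated_context:
        roles.add("ingress")  # it ought to have a remote parent
    if span.times_context_propagated_out > 0:
        roles.add("egress")   # it ought to have remote children
    return roles or {"internal"}
```

Under this scheme a server span that issues an RPC directly would be classified as both ingress and egress, without needing a separate child span.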

The ChildOf vs FollowsFrom distinction ought to be a Link property, since a span can have multiple children. We do not currently use a Link to store the parent span relationship, but conceptually each Link should have an attribute to say whether it is a well-formed child or an asynchronous one.
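
A sketch of what such a Link attribute could look like, in Python. `Relationship`, `Link`, and `critical_path_links` are hypothetical names for illustration; no such attribute exists in the API under review.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Relationship(Enum):
    CHILD_OF = "child_of"          # well-formed: child should finish before the parent
    FOLLOWS_FROM = "follows_from"  # asynchronous: no completion ordering is implied


@dataclass(frozen=True)
class Link:
    # Hypothetical link from a span to a related (for example, parent) span.
    linked_span_id: str
    relationship: Relationship


def critical_path_links(links: List[Link]) -> List[Link]:
    """Keep only the links that are safe to use for critical path analysis."""
    return [link for link in links if link.relationship is Relationship.CHILD_OF]
```

Because the attribute lives on each link, one span can have a mix of well-formed and asynchronous relationships.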

Member

> it feels odd if we're suggesting that a server should start a child span in order to issue an RPC. This just raises the cost of tracing.

Starting a child "client" span for an outbound network call is how every instrumentation I ever came across operates. If you don't do that, you cannot add any tags describing the outbound call either, because they would end up on the parent "server" span where they don't belong.

The cost of tracing can be controlled by not recording those spans, but if instrumentation doesn't even create them then it is simply not describing the semantics of the distributed transaction.
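
The guideline described in this comment can be sketched with a toy span model in Python. `Span`, `start_span`, `handle_request`, and the attribute keys are hypothetical and only illustrate where tags end up. This is not the OpenTelemetry API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Span:
    # Toy span: just enough structure to show parent/child and tags.
    name: str
    kind: str
    parent: Optional["Span"] = field(default=None, repr=False)
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def start_span(name: str, kind: str, parent: Optional[Span] = None) -> Span:
    """Create a span and register it under its parent, if any."""
    span = Span(name, kind, parent)
    if parent is not None:
        parent.children.append(span)
    return span


def handle_request() -> Span:
    """A server handler that makes one outbound call."""
    server = start_span("GET /checkout", "SERVER")
    server.attributes["request.path"] = "/checkout"  # describes the inbound request
    # The outbound call gets its own CLIENT span, so its tags stay off the server span.
    client = start_span("POST /charge", "CLIENT", parent=server)
    client.attributes["peer.name"] = "payments"      # describes the outbound call
    return server
```

Each span then has a single role, and tags describing the outbound call never land on the span describing the inbound request.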

Contributor Author

I see, with this guideline in place the meaning of SpanKind makes sense. This guideline should be included in the explanation for SpanKind. I haven't seen it written down anywhere.

I think a server with high fanout shouldn't be required to create a child span for each call it makes. I could add an Event to the parent span to describe the call being made. The child span will have the tags I'm interested in, and I don't feel any semantics are lost.
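
As a rough Python sketch of the alternative proposed here, the parent span records one lightweight event per outbound call instead of starting a child span for each. `Event`, `Span`, `add_event`, and `fan_out` are hypothetical names, and the event name and attribute key are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Span:
    # Toy span that can carry events.
    name: str
    kind: str
    events: List[Event] = field(default_factory=list)

    def add_event(self, name: str, **attributes: str) -> None:
        """Attach a timestamp-free toy event to this span."""
        self.events.append(Event(name, dict(attributes)))


def fan_out(server_span: Span, shard_ids: List[str]) -> None:
    """Record each outbound call as an event on the server span."""
    for shard in shard_ids:
        server_span.add_event("outbound call", shard=shard)
```

The trade-off raised in the reply still applies: tags describing each call live in event attributes on the server span rather than on a dedicated client span.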

I will update this PR with the guideline and explain how to interpret these values.

Member

@yurishkuro, Oct 25, 2019

> I think a server with high fanout shouldn't be required to create a child span for each call it makes.

Yes, high fanout does occasionally present problems (e.g. 100s of calls to Redis). If the application itself knows about it and is conscious of the performance overhead of those tiny spans, then it makes sense for it to just log events. But if the API itself is efficient enough, then the optimization may still be done by the tracer, rather than the application, e.g. by combining some of those spans into summaries.
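
One possible shape for the tracer-side optimization mentioned above, sketched in Python: spans shorter than a threshold are folded into one summary per span name. `FinishedSpan`, `Summary`, `summarize`, and the default threshold are all assumptions for illustration; the comment does not specify how a tracer would combine spans.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float


@dataclass
class Summary:
    name: str
    count: int
    total_ms: float
    max_ms: float


def summarize(
    spans: List[FinishedSpan], threshold_ms: float = 1.0
) -> Tuple[List[FinishedSpan], List[Summary]]:
    """Keep spans at or above the threshold; fold the rest into per-name summaries."""
    kept: List[FinishedSpan] = []
    groups: Dict[str, List[FinishedSpan]] = defaultdict(list)
    for span in spans:
        if span.duration_ms < threshold_ms:
            groups[span.name].append(span)
        else:
            kept.append(span)
    summaries = [
        Summary(
            name=name,
            count=len(group),
            total_ms=sum(s.duration_ms for s in group),
            max_ms=max(s.duration_ms for s in group),
        )
        for name, group in groups.items()
    ]
    return kept, summaries
```

The application still creates a span per call, and the tracer decides afterwards how much of that detail to export.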

a CLIENT. In this case, it should use the SERVER `SpanKind` because
it was started to service an RPC or other request.