Skip to content
This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Add support for more expressive span status API #123

Closed
wants to merge 4 commits into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
303 changes: 303 additions & 0 deletions text/0123-improve-span-status-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,303 @@
# Improve Span Status API

Allow the Span Status API to represent more kinds of status

## Motivation

Right now OpenTelemetry Status is defined as an enumeration of gRPC status
codes. Although I couldn't find design criteria written down for this API
I fear it is too narrowly defined to be useful across the full breadth of
scenarios OpenTelemetry targets.

OpenTelemetry allows Spans to be created to represent any operation, including
those which don't involve communication with another component (Kind =
Internal). These underlying operations can have native status representations
from a particular domain or a language such as POSIX status codes, HRESULTs,
many variations of exceptions, error messages, HTTP status, or gRPC status.
However to capture the status as part of an OpenTelemetry span it must first be
mapped to something in OpenTelemetry's object model and this mapping has the
potential to create a few problems:

- **Inconsistency** - If the mapping from native representation to
OpenTelemetry representation isn't well-defined then API users or SDK
implementations are unlikely to choose the same mapping. This makes collected
data hard to work with because it can no longer be treated uniformly.
- **Loss of fidelity** - Mapping from a status representation that
distinguishes a large number of different results to one that only
distinguishes a few is inherently lossy. This prevents users from isolating
Copy link
Contributor

@jmacd jmacd Jul 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of the span status code is slightly different than the summary in this document. I see the span status code as a categorization of the status meant to augment our understanding of the span's condition. I would not characterize the span status code as a lossy transformation: I'd describe it as additive.

As an example, suppose we have an HTTP server and it encounters some kind of condition. In the HTTP domain we have well-defined status codes, which can be encoded in a non-lossy way using the http.status_code semantic convention.

I believe one of the sources of confusion here is that in order to categorize an error into one of the canonical span status codes, the author has to take a position, and that position will not be unique. Depending on positionality, different users will categorize the various HTTP status codes differently. Accordingly, I would define the span status code as the span actor's perspective on the activity it performed, how it would categorize its own status. Other actors in a system may witness consequential failures (e.g., a client receiving a HTTP 500 code) and handle them in a way that disagrees with the self-assessment used by a particular span. E.g., a server may decide that a 500 status is maps to Internal errors, and the client that receives a 500 may decide to retry, thus suppressing the 500. That 500 could become a span event with http.status_code and http.status_text attributes, and still the client could have a successful retry and set the span status to OK.

Summarizing, I don't see a problem with Status code as it stands, but think we need to recognize that the Status code is not a unique position.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thoughtful feedback!

Summarizing, I don't see a problem with Status code as it stands

I'm not sure how to relate your opinion on the status code concept to the OTEP overall. How do you feel about the goals outlined in the goals section of the OTEP:
a) They aren't useful and don't need to be pursued?
b) They are useful but the existing Span API already satisfies them?
c) They are useful, and the existing Span API doesn't satisfy them. However we shouldn't frame the proposal as a change to the 'Status' API because that API is already fulfilling a different important purpose
d) Something else?

I believe one of the sources of confusion here is that in order to categorize an error into one of the canonical span status codes, the author has to take a position, and that position will not be unique

That sounds like the problem I named Inconsistency in the text, yes? "If the mapping from native representation to
OpenTelemetry representation isn't well-defined then API users or SDK implementations are unlikely to choose the same mapping."

different status results they care about. It can also prevent UI from showing
useful status information because users don't relate to the reduced
representation and the transformation isn't reversible.
- **Conversion difficulty** - If the conversions are non-trivial then they
are unlikely to be implemented correctly or perhaps at all. Past feedback
suggests end users want to spend little to no effort on this task. SDK
implementers may be more diligent but are constrained to native status
representations that are known a-priori.

These are challenges for any design of the status API, not solely the current
one. We will need to evaluate these issues as a matter of degree and make a
judgement call about what is sufficient.

### Goals

In order to determine what status information is useful we first need a common
understanding what we expect the collected data to be used for. These are
tasks I anticipate tool vendors and end-users would like to be possible with
OpenTelemetry status information:

1. **Viewing** - Developers diagnosing a specific distributed trace want to
understand the status of spans that occurred while it ran. To do this they want
to see status information annotated in the trace, ideally with progressive
levels of detail as the focus of investigation narrows. The viewed status
information should be easy to correlate back to the domain that generated it
and diagnostically useful.

2. **Searching/Filtering** - Developers suspect a particular status condition
might be occurring due to customer feedback, some behavior they observed
locally, in another trace, or perhaps code review. They want to search
collected telemetry to determine if and how often that status condition
occurs. If it occurs they want to explore example traces producing it to
better understand when it manifests and how it impacts their system. The
search terms should be intuitive given the developers initial knowledge
about the status condition they are searching for and the domain it arose from.

3. **Grouping and metrics** - Developers monitoring a service want to
understand what kinds of status conditions are occurring and how frequently.
To do this they want the monitoring tool to provide UI that buckets results
into categories and show the top categories ranked by count of occurrence.
They may also want metrics that track category counts over time to identify
trends and deviations from the trend. Many useful groupings are based on
sharing a common problem symptom or common root cause, but other more coarse
groupings may be useful in trend analysis. At the most coarse spans can be
divided into some definition of "successful" and "failed" but there is no
consensus on how some status results should be bucketed (for example http
4xx results).

### Scope

I hope to define only the tracing API here, not an SDK API nor a wire
protocol. I assume that a simple SDK API will follow relatively automatically
if we first have agreement on the tracing API and OpenTelemetry can still be
successful in an initial release without having standardized a wire protocol.

The operation represented by a span might be composed of sub-operations (for
example a function that calls into other lower level functions) and each of
those sub-operations might have a notion of status or error. If those
sub-operations aren't represented with a child span then it is out of scope
how their status is captured.

## Explanation

#### Data representation of Span Status

We should aim for representations that require minimal-to-no end-developer
translation effort from the native representation and capture at least the key
numeric/string data developers are likely to search for or relate to. I suggest
status is this set of information:

1. StatusType - The name of the type of error such as "HTTP", "gRPC",
"LanguageException", "HRESULT", "POSIX", or "ErrorMessage". The list is
end-user-extensible but common status type names should be standardized.
(Perhaps there is already some standardization we could borrow?)
2. StatusData - A discriminated union of:
- An integer, a string, or a tuple of integer and string. These options can
be used for:
- Enumerated status codes: For example an http status code could be
represented as 404, "Not Found", or both. In the case of common status codes
OpenTelemetry SDK or backend could optionally assist in filling out the
remainder of a partially specified enumeration value. For enumerations that
aren't well-known the community of users is responsible for determining any
conventions.
- Free-form error messages
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would a named error with a separate message be captured? I would treat these separately and not combine them in one free-form string. This would allow for searching, filtering and grouping by the error name alone but still preserve the message. A backend could, for example, display the number of all EntityNotFound errors in one place and filter for them in another place where users can inspect the messages bearing the entities that were not found.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current proposal wouldn't be able to capture this, but we could change the proposal to allow it. I've added some text to the trade-offs and mitigations section mentioning the idea. I'd be fine adding it, not sure if we'll see additional opinions about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'd add it right from the start. I'd expect users want to capture the information in a way that most closely resembles how it's reported by their respective language or framework, rather than having to combine both in one string that potentially has to be parsed/decomposed for analysis later again.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original intent wasn't that users would combine strings, but that they would provide either an enumerated status code OR a message, not both. If we anticipate it is common to provide both I agree I'd rather provide an API that easily facilitates that vs. an API that encourages users to create private conventions to merge and split text.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect it to be common enough to support it in the API. Often it will indeed be a type of error that can be enumerated but will not necessarily have an integer representation like we have in HTTP but rather a name of the error (like "Entity not found" or an exception name) with a (detailed) description (e.g. "The entity with ID '7583' was not found in the table 'products'" or an exception message).
Therefore I'm suggesting to have two strings called, for example, name, brief or kind and a separate string called details, message or description.
Happy to hear others' opinions on that.

- Exception object - whatever the SDK language's default exception datatype
is, if it has one.
- void
3. SuccessHint - An optional boolean that represents the span author's best
guess whether this status represents a successful or failed operation, however
they choose to define those terms. For well-known status types I'd suggest the
hint be ignored but for user-defined status types this is likely the only clue
whether the span should be surfaced in a UI as being abnormal or failed.

Although UI creators are free experiment with how the data is presented I
expect most presentations would either be the StatusData alone, or the
StatusData qualified with the StatusType and some separator character. For
example StatusData alone might create names like "FileNotFoundException",
"503", "E_FAIL (0x80004005)", "SyntaxError on line 405: Did you forget a
semicolon?", and "BadQuery".

Exceptions could have progressive level of detail drilling into messages,
stack traces, inner exceptions, links to source, etc if the exporter serialized
sufficient data but how and whether that occurs is out-of-scope in this design.

#### API to capture this data

I would suggest an API called Span.SetStatus(...) that takes all the arguments
above, and optionally overloads or default parameters that make common calls
easier. For example in C#:

````C#
void SetStatus(Exception spanException, string statusType = "LanguageException", bool? successHint = null);
Copy link
Member

@Oberon00 Oberon00 Jul 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capturing an exception with traceback and storing it until export is potentially extremely memory heavy (e.g. in Python the exception has a reference to a traceback, and the traceback has references to all local variables on the stack, preventing them from being GCd).

So the solution to this would probably be to call into some exporter-specific code already when calling this API.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I added a comment about this issue and potential solution to the internal details section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple exceptions can occur during the lifetime of a span and experiencing an exception may or may not result in a non-ok status depending on the circumstances. How can multiple exceptions can handled using this API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple functions returning error codes may be called during the lifetime of a span and an error return may result in a non-ok status or not.

I think it's up to the user/instrumentation to decide which exception was the semantically important one. Most likely the one that went uncaugth.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @Oberon00. I also attempted to capture this concept in the scoping section above but perhaps the wording wasn't good?

The operation represented by a span might be composed of sub-operations (for
example a function that calls into other lower level functions) and each of
those sub-operations might have a notion of status or error. If those
sub-operations aren't represented with a child span then it is out of scope
how their status is captured.

void SetStatus(string enumNameOrMessage, string statusType = "ErrorMessage", bool? successHint = null);
void SetStatus(int enumValue, string statusType, bool? successHint = null );
void SetStatus(int enumValue, string enumName, string statusType, bool? successHint = null);
void SetStatus(string statusType, bool? successHint);
````



## Internal details

I expect a basic and reasonable implementation would be to define fields on the
Span object, set them in the setter and then implement some getter for the
exporters to use. A second possibility is to store the data into attributes or
events. The choice of storage may have some modest effects on memory usage
but primarily I expect the choice would be driven by the SDK API we want to
read stored data back.

It is also possible for the SDK to destructure the Exception data into simpler
serializable types though I'd expect serialization is typically the domain of
the exporter and there is a fair amount of policy involved in terms of what
data is captured and how it is formatted for transport. There are definite
risks that the end-to-end scenario will be less functional or less performant
if SDKs intercede here.

In some languages storing an exception with traceback could be very
memory intensive. Python prevents locals in the callstack from being
GC'ed. One suggestion is to use a synchronous callback to the exporter
allowing it process the data immediately in some way that would lower
the memory usage. For example the exporter could indicate the
exception could be disposed after optionally extracting a fraction of the
information to serialize.


## Trade-offs and mitigations

As mentioned in the motivation section, the issues of inconsistency, loss of
fidelity and conversion difficulty are all on a sliding scale. I expect this
design improves each of these issues at the expensive of some increased
object-model/API complexity. Making the OpenTelemetry status representation
more expressive also could cause inconsistency problems for the opposite reason
- now there might be multiple reasonable representations for a status and the
user becomes unclear which one to pick. In general I expect this to be mitigated
with documented conventions, API defaults, and the potentially canonicalizing
data anywhere in the processing pipeline.

One place I neglected to go further was defining additional types of error data
format beyond string/int/Exception. This might encapsulate things such as
key/value pairs (ala structured logging) or more complex or niche status types
(COM IErrorInfo). There is nothing inherently problematic with them but I felt
the increased expressiveness was getting diminishing returns. One mitigation to
key/value pair data in particular would be to put auxilliary data in the span
attributes using some convention, for example "Error.UserName"="Bob" might be
used together with an error message string "Failed to find user {UserName}".
Another mitigation might be adding language specific overloads that handle
additional error types.

The current proposal only gives one string which can be used either for a
freeform message or a textual status code. Adding a 2nd string to the
StatusData would allow both to be collected side by side. This is another example of
increasing expressivity at the cost of some complexity. I'd be happy to see
this added if the community agreed.

If all the data collected by the API is transmitted to the back-end this also
increases the size of transmitted telemetry, but the exporter authors always
retain the freedom to drop or reduce any information they don't believe is
valuable with potentially a slight

## Prior art and alternatives

#### Prior discussion

There have been a few past attempts to make improvements here:

- [open-telemetry/oteps#69](https://github.com/open-telemetry/oteps/pull/69)
- [#427](https://github.com/open-telemetry/opentelemetry-specification/pull/427)
- [#432](https://github.com/open-telemetry/opentelemetry-specification/pull/432)
- [#521](https://github.com/open-telemetry/opentelemetry-specification/pull/521)
- [#599](https://github.com/open-telemetry/opentelemetry-specification/issues/599)
- https://gitter.im/open-telemetry/error-events-wg

There are also some links to further prior art within those links, sorry
I didn't organize it all nicely here : )

#### Alternatives

**Status information using logging** - It is possible to collect status and
error message via log messages and correlate it to a span via trace context.
While I have no objection to improved logging capabilities or better
integration between distributed tracing and logging I don't believe this is
a problem that distributed tracing should abdicate for various reasons:

- Logging isn't a simple or cheap dependency to take in the implementation.
While OpenTelemetry may solidify a logging offering for API and SDK components,
an end-to-end scenario still requires snooping the log stream to determine
relevant messages for a given span. Regardless whether this is done
client-side, at the database or in the UI layer it potentially involves
performance overhead of handling an order of magnitude more data.
- It requires establishing conventions that designate which log message
represents 'the status' for a Span rather than one of potentially many results
or errors that were recorded during the Span's duration. This likely means this
status is a special case on the logging API if it isn't a special case on the
distributed trace API.
- I expect developers both at the time they are emitting trace data and when
viewing that trace data would find it is idiomatic to include the status of
the Span's workload together with the description of the Span. Identifying
failed or abnormal spans is a typical APM operation.

**Closed vs. open-ended status descriptions** - We could make status
represented by a fixed set of options such as the current gRPC codes
or any larger/different set. It likely is simpler for tools consuming
the data to only operate on fixed options and it is pleasant to imagine that
all (or most) status results will be reasonably categorized into the fixed
list. However I do not expect most users to take up the responsibility of
mapping their status results if OpenTelemetry/back-end tools don't provide
a clear representation. Instead these users will not set status or choose
options that are generic and inconsistent rendering the data useless to
everyone. For a fixed list to work I believe we either have to curate a very
large list of options or accept that significant portions of the developer
audience will not find the status representation useful.

**API using attribute semantic conventions** - It is also possible to do this
via semantic conventions on attributes and although I think the strongly typed
API is preferable I don't have that strong of an opinion. Semantic conventions
are likely to have higher performance overhead, higher risk of error in key
names and are less discoverable/refactorable in IDEs. The advantages are that
there is some past precedent for doing this specific to http codes and new
conventions can be added easily. If we go semantic conventions it does imply
that Exception becomes a type that can be directly passed as an argument to
SetAttribute(). Requiring the user to destructure the exception into a list of
key value pairs would be overly onerous and error-prone for a common usage
scenario. If desired the SDK or exporter could destructure it, but that can be
determined independently from API design and I'd like to keep it out of scope.

**API using Span event semantic conventions** - Most of the rationale for attribute semantic
conventions also applies here, events are effectively another key-value store
mechanism. The timestamp that is attached to an event appears to hold little
value as status is probably produced at the same approximate time the span end
timestamp is recorded. Similar to attribute conventions it sounds like there is
precedent for storing some errors as events.

**Move the API to a non-core package** - It is possible to have the Tracing
API expose status using Attribute or Event APIs, and then have a 2nd library
that exposes a strongly-typed API that converts the status information into
attrbute/event updates. This adds some complexity over a strongly typed API
declared directly on Span but if we identify this as an area that needs to be
more decoupled/versionable than other Span tracing APIs perhaps it would be
valuable.

**Represent an error message in addition to a string error name**

## Open questions

1. Although I specified that common status types could be given standardized
names, I didn't define what that list was. We would need to define what
criteria makes a status type common enough to be on the list and maintain it
over time.

2. Above the design mentioned that the SDK might fill in the name or integer
value of a well-known status code when only one of the two was specified by a
user. We'd have to decide if that is functionality we want, and which values
are included in mapping tables.
3. I left the SDK API out-of-scope, but we will need a way to retrieve the
stored data via the SDK before adding data to span has any value in an
end-to-end scenario.