-
Notifications
You must be signed in to change notification settings - Fork 164
Proposal: Expand and Improve Span Events #69
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a philosophical question - does this information really belong to a trace? An exception may happen during application start-up or some other activity that is not necessarily modeled with a trace, and the data format needed to capture such exception will be identical to this. There are standalone solutions dedicated to capturing exceptions (e.g. Sentry), which have rich data format and instrumentation libraries already. It seems it would be more useful if that separate exception capture pipeline could be enriched with trace/span IDs so that the data could be merged on the backend.
repeated exceptions = 10; | ||
|
||
// Method argument values in use when the error occurred. | ||
repeated AttributeKeyValue arguments = 11; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be a part of StackFrame? I remember in Sentry for Python I could inspect the exception and look at local variables (not just method args) at every frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to support by stack frame. In many cases, the exception may be caught and recorded a substantial distance from where it was thrown. In those cases, the only values available may be parameter values submitted by the external system. Those values need to be recorded also because oftentimes they prove valuable in diagnosing issues.
StackTrace stack = 5; | ||
} | ||
|
||
// Collection of exceptions which triggered the error or fault. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear what a "collection" means semantically. Usually exceptions are nested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I just copied the AWS X-Ray approach with parent ids denoting nesting. I will look at improving comprehensibility.
Why can't this be handled with semantic conventions? There was discussion about having "typed Span" APIs (e.g. provinding Also, regarding exceptions, there is discussion on that topic in open-telemetry/opentelemetry-specification#306. |
@kbrockhoff I think generally our approach has been that we favour semantic conventions over specialized fields. We do this for Span attributes and for Resource labels. Is there a good reason to deviate from this general approach for Events? Why not have semantic conventions for Event attributes instead and have uniform approach across Spans, Resources and Events? |
In fact, we already do have some semantic conventions for events, see the semantic conventions for gRPC. |
I think semantic conventions is a preferable approach if complex and/or large values are supported. I don't think it is workable to record full exception info without having data types specifically designed to do so. I will do some experimenting with AttributeKeyValue to see if it makes sense to expand the allowed values to hold an exception data structure and also large message content stored as a byte array or string. |
Just a clarification on wording. I am not proposing typed spans. Only typed events. |
I think that currently this is certainly the case as semantic attributes don't support nesting (although in theory you could do things like |
There are a couple options here:
|
I am going to repost my comment, as it seems to have gotten lost in the discussion.I have a philosophical question - does this information really belong to a trace? An exception may happen during application start-up or some other activity that is not necessarily modeled with a trace, and the data format needed to capture such exception will be identical to this. There are standalone solutions dedicated to capturing exceptions (e.g. Sentry), which have rich data format and instrumentation libraries already. It seems it would be more useful if that separate exception capture pipeline could be enriched with trace/span IDs so that the data could be merged on the backend. |
@yurishkuro I think it can belong to a trace but you are right in that it may also need to be represented without association with a trace. That's one of the areas I plan to explore in the near future - the standalone Events (or Logs if you will). I think the Event concept inside the span is intended for use cases like the one that @kbrockhoff describes although I do not think that Event need to be specialized that way. In my opinion the generic key/value list (possibly extended with arrays and nesting) supported in the Event is the right level of abstraction. This is so also because it allows to avoid the need for OpenTelemetry spec to tell whether that particular information belongs to a trace. If the end user wants to associate that particular information with a trace then so be it, OpenTelemetry does not prohibit it and provides a generic storage that can contain it. |
@yurishkuro I like your philosophical question, but I want to flip it around. If you had a data format to capture events like you described, then wouldn't it be natural to represent the Span as a simple aggregation of events having common { span_id, trace_id } properties? This was how the Golang streaming-Tracer prototype (removed in open-telemetry/opentelemetry-go#221) behaved. Even though we answer "yes" to this question, you will still need a Span struct to represent aggregated events (one START, one END, some ERRORs, some MESSAGE_SENTs, some MESSAGE_RECEIVEDs, ...), and these events will still belong to those Span structs. I'm in agreement with @Oberon00, all kinds of event could be described by semantic conventions. I'd prefer if these conventions were gathered somewhere, perhaps as a part of schema.org. I've written up some examples of how an event-based format can encode schematic data, along with speculative remarks about how in a traditional log (i.e., where you read a sequence of events from the start) you can compress events by referring backwards in the log--it includes a semantic convention for referring to earlier events. Take note--this is super sketchy--I cut it out of context from an internal document! https://docs.google.com/document/d/1IDolP2gAzJBQ2hRByhxWw_y6mTkCi_DYqz5up5YtqD8 |
@jmacd yes, but I would still avoid reporting exception this way, because some exceptions may not have an associated trace/span ID, so why should they go to a different backend because of that? |
I expect to have APIs to record events via Span.addEvent as well as via a standalone logger/publisher. If recorded via the standalone, then the SDK should assign to a Span if one is in progress. It is the job of the SDK to make sure all events associated with a span are included in the event collection of exported spans as well as routing all events no matter how recorded to event exporters. After all the above comments I still think it is necessary to have specific data structures for recording error information. Maps of maps with semantic convention names will be no more future proof or flexible than having defined types. This is not an open-ended problem. The APIs of Sentry, Rollbar, Stacktrace and X-Ray are very similar. It will be straight forward to define datatypes that can easily be translated into any of them. After thinking about the problem some more, I think there have to be two hash values as required fields.
To make this work there will have to be very specific specifications on a language by language basis on how to calculate the hashes. |
On the subject of defining event publishing and/or logging API what exactly is the objective? I saw this line below in a code review I did several days ago. It was written by one of the tech leads here. logger.info("*******WorkItem executed successfully for ***" + workUnit.getWorkType() + " workItemId=" + workUnit.getWorkItemId()); The documentation for the SLF4J API is 4 paragraphs long yet 90%+ of all enterprise Java developers have never read them. I pulled the table below off my Slack feed from yesterday's performance test. It was generated by searching for ERROR level logs and doing group by in Kibana. (some details changed) log:"com.example.business.SystemBusiness - testPurposeEmailAddress property in systemtype.properties is either null or of zero length" 8,494 Of the 9,231 total ERRORs, 8,506 are noise from issues that have zero impact on the functionality or performance of the application. This is from one of the LEAST dysfunctional organziations in the world. Most enterprise software is much worse. So again what kind of events and logs do you expect to get recorded via the OpenTelemetry API and where will it be routed to? |
Another thing to consider is making assumptions about what is a trace. One of my team members is currently developing a Spring Framework ApplicationContextEvent listener which will start a trace and parent span when the context lifecycle initializes and ends the trace when the application is ready to accept work. Another trace will cover the context shutdown lifecycle. This will make it easy to find all the logs related to application startup and shutdown as well as monitoring total application startup time and any remote calls made during startup. |
@yurishkuro I'm comfortable sending exceptions to a tracing system. I have prior experience telling me that users want to find traces for (rare, interesting) exceptions. It could be the exception-monitoring system requesting traces for rare errors, but a rare or interesting exception is only one of many signals that might tell a tracing system what to collect, and the tracing system has to be wary of its total resource budget. I think an integrated tracing system stands a better chance of finding exceptional traces than two separate systems. I would also be comfortable (as a tracing system) accepting spans with 0 values for trace_id and span_id containing exceptions that happen in an untraced context. Simply generate identifiers for them and report them as 0-duration root spans. It's not unreasonable, I mean. |
I have started a project to test the design of the message types needed for recording errors. In addition to testing population, I plan to verify the data can populate all needed fields in various expection monitoring system APIs. https://github.com/kbrockhoff/opentelemetry-spec-research |
Missing from "prior art" is OpenCensus stacktrace https://github.com/census-instrumentation/opencensus-proto/blob/master/src/opencensus/proto/trace/v1/trace.proto#L346 |
|
||
## Internal details | ||
|
||
This section specifies data format in Protocol Buffers for `TimedEvent` messages within the overall OTLP. It follows and expands on the flattened data structure in [#59 OTLP Trace Data Format](https://github.com/open-telemetry/oteps/pull/59). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The data format has evolved after OTEP acceptance. The current entity is named Event
: https://github.com/open-telemetry/opentelemetry-proto/blob/839beca3755237c22fb950ae4ca5e1305dd85015/opentelemetry/proto/trace/v1/trace.proto#L148
@@ -0,0 +1,150 @@ | |||
# Typed Events Specification | |||
|
|||
Add enumerated subtype and subtype-specific data structures to the existing generic tracing `TimedEvent`. Initial supported subtypes would be `ANNOTATION` (attributes only), `MESSAGE_SENT`, `MESSAGE_RECEIVED`, `ERROR`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The choice of subtypes is not clear. Why this specific list? What is the reasoning? Do these subtypes unambiguously map to event subtypes commonly used by event generation, transmission and recording systems and protocols? What is the proposed mapping approach for subtypes that do not fit one of these?
|
||
The error reporting APIs of AWS X-Ray, Google Stackdriver Error Reporting and Rollbar were analyzed to determine the data supported by each. This specification includes structures for providing most or all of the data these systems support. The AWS X-Ray SDKs for various languages provide an example of how this data can be populated by SDK implementations. | ||
|
||
The annotation and sent/received message type data structures are from the OpenCensus data protocol. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When proposing an Events data model I believe it is important to consider a wider range of prior art, including legacy formats and protocols (e.g. SYSLOG RFC5424, SNMP, Windows Events, etc). In my opinion it should be a goal to unambiguously represent legacy formats. This is important in order to build tools that understand all telemetry data types and can correlate telemetry data generated in legacy formats with telemetry data generated according to newer standards. This will make the data model applicable to wider range of real-world systems.
My note from the meeting this morning: It would be great if the proto message were kept in a separate proto file to make clear it can in the future be that exceptions and stacktraces are reported to a service separately from a trace. If not an exception/error reporting then maybe as part of a structured logging api eventually. |
I believe this applies to Events generally, not just errors. Some events may be associated with Spans and be contained in Span data, some other events may have no association with Spans. For the proto I would support Span Events (what we have now) and standalone Events. |
@kbrockhoff I'm going to close this, as we have taken a different approach by adding events concept to logs, which can also be associated with spans. Please reopen or file a new OTEP if you feel there is more to address here. |
Provide specific subtypes of TimedEvent for messages and errors with data structures to hold relevant information.