This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Proposal: Expand and Improve Span Events #69

Closed
wants to merge 5 commits

Conversation

kbrockhoff
Member

Provide specific subtypes of TimedEvent for messages and errors with data structures to hold relevant information.

Member

@yurishkuro left a comment


I have a philosophical question - does this information really belong to a trace? An exception may happen during application start-up or some other activity that is not necessarily modeled with a trace, and the data format needed to capture such exception will be identical to this. There are standalone solutions dedicated to capturing exceptions (e.g. Sentry), which have rich data format and instrumentation libraries already. It seems it would be more useful if that separate exception capture pipeline could be enriched with trace/span IDs so that the data could be merged on the backend.

repeated Exception exceptions = 10;

// Method argument values in use when the error occurred.
repeated AttributeKeyValue arguments = 11;
Member


shouldn't this be a part of StackFrame? I remember in Sentry for Python I could inspect the exception and look at local variables (not just method args) at every frame.

Member Author


Good idea to support this per stack frame. In many cases, the exception may be caught and recorded a substantial distance from where it was thrown. In those cases, the only values available may be the parameter values submitted by the external system. Those values need to be recorded as well, because they often prove valuable in diagnosing issues.

StackTrace stack = 5;
}

// Collection of exceptions which triggered the error or fault.
Member


It's not clear what a "collection" means semantically. Usually exceptions are nested.

Member Author


Good point. I just copied the AWS X-Ray approach with parent ids denoting nesting. I will look at improving comprehensibility.

@Oberon00
Member

Why can't this be handled with semantic conventions? There was discussion about having "typed Span" APIs (e.g. providing HttpSpan.setStatusCode(int code) mapping to Span.setAttribute("http.status_code", code)) to improve type safety at the API level without requiring us to lock into certain data structures in a backwards-incompatible way, see #25.
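A minimal sketch of what such a typed wrapper could look like; the class shape, the `Span` interface, and the delegation below are illustrative assumptions rather than an agreed API:

```java
// Hypothetical typed wrapper: type-safe setters that delegate to the generic
// attribute API, so the wire format stays a plain key/value list.
final class HttpSpan {
    // Minimal span surface assumed by this sketch.
    interface Span {
        void setAttribute(String key, long value);
        void setAttribute(String key, String value);
    }

    private final Span delegate;

    HttpSpan(Span delegate) {
        this.delegate = delegate;
    }

    // Maps to the semantic-convention attribute "http.status_code".
    HttpSpan setStatusCode(int code) {
        delegate.setAttribute("http.status_code", code);
        return this;
    }

    // Maps to the semantic-convention attribute "http.method".
    HttpSpan setMethod(String method) {
        delegate.setAttribute("http.method", method);
        return this;
    }
}
```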

Also, regarding exceptions, there is discussion on that topic in open-telemetry/opentelemetry-specification#306.

@tigrannajaryan
Member

@kbrockhoff I think generally our approach has been that we favour semantic conventions over specialized fields. We do this for Span attributes and for Resource labels. Is there a good reason to deviate from this general approach for Events? Why not have semantic conventions for Event attributes instead and have uniform approach across Spans, Resources and Events?
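For illustration, an error could then be captured as an ordinary event whose payload is a flat attribute map following such a convention. A minimal sketch, with placeholder `error.*` key names that are not settled conventions:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: an error recorded as an ordinary event whose payload is a
// flat key/value map following a hypothetical semantic convention, rather than
// a specialized ERROR subtype on the event itself.
final class ErrorEventExample {
    static Map<String, String> errorEventAttributes(Throwable t) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("error.type", t.getClass().getName());
        attrs.put("error.message", String.valueOf(t.getMessage()));
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        attrs.put("error.stack", sw.toString());
        return attrs;
    }
}
```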

@Oberon00
Member

Oberon00 commented Nov 20, 2019

In fact, we already do have some semantic conventions for events, see the semantic conventions for gRPC.

@kbrockhoff
Member Author

I think semantic conventions are a preferable approach if complex and/or large values are supported. I don't think it is workable to record full exception info without having data types specifically designed to do so. I will do some experimenting with AttributeKeyValue to see if it makes sense to expand the allowed values to hold an exception data structure, as well as large message content stored as a byte array or string.

@kbrockhoff
Member Author

Just a clarification on wording. I am not proposing typed spans. Only typed events.

@Oberon00
Member

I don't think it is workable to record full exception info without having data types specifically designed to do so

I think that currently this is certainly the case as semantic attributes don't support nesting (although in theory you could do things like "exception.0.cause": 1 with 1 referring to the "exception.1.*" attributes, but I think that would be an awkward hack). However, I think it would be better to resolve open-telemetry/opentelemetry-specification#341 in such a way that exception information can be encoded properly.

@tigrannajaryan
Member

There are a couple options here:

  1. Wait until there is clarity around multivalue attributes and nested attributes. I am going to work on a multivalue attributes proposal next week. Nested attributes may come after that. Once we have these in the spec I intend to add support in the proto as well.
  2. If waiting is not an option, arbitrarily complex nested structures can be represented today as JSON, YAML, etc. strings (assuming a tree-like structure). The format will be part of the semantic convention.
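As a rough illustration of option 2, a nested cause chain could be collapsed into a single string attribute whose JSON shape is fixed by the convention. The shape below is an assumption, not a proposed format:

```java
// Illustrative only: encode a nested cause chain as a JSON string so it fits
// into today's flat string attributes. The escaping is deliberately minimal
// for the sketch; a real implementation would use a proper JSON library.
final class ExceptionJson {
    static String toJson(Throwable t) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"").append(escape(t.getClass().getName())).append('"');
        sb.append(",\"message\":\"").append(escape(String.valueOf(t.getMessage()))).append('"');
        if (t.getCause() != null && t.getCause() != t) {
            sb.append(",\"cause\":").append(toJson(t.getCause()));
        }
        return sb.append('}').toString();
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```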

@yurishkuro
Member

I am going to repost my comment, as it seems to have gotten lost in the discussion.

I have a philosophical question - does this information really belong to a trace? An exception may happen during application start-up or some other activity that is not necessarily modeled with a trace, and the data format needed to capture such exception will be identical to this. There are standalone solutions dedicated to capturing exceptions (e.g. Sentry), which have rich data format and instrumentation libraries already. It seems it would be more useful if that separate exception capture pipeline could be enriched with trace/span IDs so that the data could be merged on the backend.

@tigrannajaryan
Member

I have a philosophical question - does this information really belong to a trace?

@yurishkuro I think it can belong to a trace but you are right in that it may also need to be represented without association with a trace. That's one of the areas I plan to explore in the near future - the standalone Events (or Logs if you will).

I think the Event concept inside the span is intended for use cases like the one that @kbrockhoff describes, although I do not think that Events need to be specialized that way. In my opinion the generic key/value list (possibly extended with arrays and nesting) supported in the Event is the right level of abstraction. This is also because it allows us to avoid the need for the OpenTelemetry spec to decide whether that particular information belongs to a trace. If the end user wants to associate that information with a trace then so be it; OpenTelemetry does not prohibit it and provides generic storage that can contain it.

@jmacd
Contributor

jmacd commented Nov 20, 2019

@yurishkuro I like your philosophical question, but I want to flip it around. If you had a data format to capture events like you described, then wouldn't it be natural to represent the Span as a simple aggregation of events having common { span_id, trace_id } properties? This was how the Golang streaming-Tracer prototype (removed in open-telemetry/opentelemetry-go#221) behaved. Even though we answer "yes" to this question, you will still need a Span struct to represent aggregated events (one START, one END, some ERRORs, some MESSAGE_SENTs, some MESSAGE_RECEIVEDs, ...), and these events will still belong to those Span structs.
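A rough sketch of that view, with a span reconstructed as nothing more than the events that share a `{ span_id, trace_id }` pair; all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a span reconstructed as the list of events that carry the
// same (traceId, spanId) pair, including the START and END events.
final class EventAggregationExample {
    record Event(String traceId, String spanId, String type, long timestampNanos) {}

    record SpanKey(String traceId, String spanId) {}

    static Map<SpanKey, List<Event>> groupIntoSpans(List<Event> stream) {
        Map<SpanKey, List<Event>> spans = new HashMap<>();
        for (Event e : stream) {
            spans.computeIfAbsent(new SpanKey(e.traceId(), e.spanId()),
                                  k -> new ArrayList<>()).add(e);
        }
        return spans;
    }
}
```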

I'm in agreement with @Oberon00: all kinds of events could be described by semantic conventions. I'd prefer if these conventions were gathered somewhere, perhaps as part of schema.org. I've written up some examples of how an event-based format can encode schematic data, along with speculative remarks about how, in a traditional log (i.e., where you read a sequence of events from the start), you can compress events by referring backwards in the log; it includes a semantic convention for referring to earlier events.

Take note--this is super sketchy--I cut it out of context from an internal document!

https://docs.google.com/document/d/1IDolP2gAzJBQ2hRByhxWw_y6mTkCi_DYqz5up5YtqD8

@yurishkuro
Member

then wouldn't it be natural to represent the Span as a simple aggregation of events having common { span_id, trace_id } properties?

@jmacd yes, but I would still avoid reporting exceptions this way, because some exceptions may not have an associated trace/span ID, so why should they go to a different backend because of that?

@kbrockhoff
Member Author

I expect to have APIs to record events via Span.addEvent as well as via a standalone logger/publisher. If recorded via the standalone path, the SDK should assign the event to a Span if one is in progress. It is the job of the SDK to make sure all events associated with a span are included in the event collection of exported spans, as well as to route all events, no matter how they were recorded, to event exporters.
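A minimal sketch of that routing logic, with hypothetical interfaces standing in for the real API:

```java
// Illustrative only: events recorded through a standalone publisher are
// attached to the in-progress span if there is one, and are always handed
// to the event exporters.
final class EventPublisherExample {
    interface Event {}

    interface Span {
        void addEvent(Event event);
    }

    interface EventExporter {
        void export(Event event);
    }

    static void record(Event event, Span currentSpanOrNull, EventExporter exporter) {
        if (currentSpanOrNull != null) {
            currentSpanOrNull.addEvent(event);   // becomes part of the exported span
        }
        exporter.export(event);                  // always routed to event exporters
    }
}
```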

After all the above comments I still think it is necessary to have specific data structures for recording error information. Maps of maps with semantic convention names will be no more future-proof or flexible than having defined types. This is not an open-ended problem. The APIs of Sentry, Rollbar, Stacktrace and X-Ray are very similar. It will be straightforward to define datatypes that can easily be translated into any of them.

After thinking about the problem some more, I think there have to be two hash values as required fields.

  1. Hash of the specific exception instance. If the application has multiple instrumented libraries a single exception may get recorded more than once. The SDK needs to deduplicate these.
  2. Hash of the exception type and location it was thrown. This would dramatically cut down on the amount of data that needs to be transported as well as making aggregated exception counts easy.

To make this work, there will have to be very specific specifications, on a language-by-language basis, for how to calculate the hashes.
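As an illustration only, the two hashes might be derived along these lines in Java; the exact inputs and algorithm would be the subject of those per-language specifications:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative only: one hash identifies the specific exception instance (for
// de-duplication across instrumented libraries in the same process), the other
// identifies the exception type plus throw site (for aggregation).
final class ExceptionHashExample {
    static String instanceHash(Throwable t) {
        // Identity-based: stable for the lifetime of the instance within one
        // process, which is enough to de-duplicate repeated recordings.
        return Integer.toHexString(System.identityHashCode(t));
    }

    static String typeAndLocationHash(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        String throwSite = frames.length > 0 ? frames[0].toString() : "unknown";
        return sha256(t.getClass().getName() + "|" + throwSite);
    }

    private static String sha256(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```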

@kbrockhoff
Member Author

On the subject of defining an event publishing and/or logging API, what exactly is the objective?

I saw this line below in a code review I did several days ago. It was written by one of the tech leads here.

logger.info("*******WorkItem executed successfully for ***" + workUnit.getWorkType() + " workItemId=" + workUnit.getWorkItemId());

The documentation for the SLF4J API is 4 paragraphs long, yet 90%+ of all enterprise Java developers have never read it.

I pulled the table below off my Slack feed from yesterday's performance test. It was generated by searching for ERROR level logs and doing group by in Kibana. (some details changed)

log:"com.example.business.SystemBusiness - testPurposeEmailAddress property in systemtype.properties is either null or of zero length" 8,494
log:"org.postgresql.util.PSQLException: ERROR: deadlock detected" 496
log:"u0009org.postgresql.util.PSQLException: The connection attempt failed" 184
log:"Connection to perf1-db.example.com:5432 refused" 31
log:"com.example.telemetry.servlet.TelemetryHttpServletFilter - request failed" 14
log:"breaking change introduced in apache client" 8
log:"]: still saving" 4

Of the 9,231 total ERRORs, 8,506 are noise from issues that have zero impact on the functionality or performance of the application.

This is from one of the LEAST dysfunctional organizations in the world. Most enterprise software is much worse.

So again, what kinds of events and logs do you expect to be recorded via the OpenTelemetry API, and where will they be routed?

@kbrockhoff
Member Author

Another thing to consider is the assumptions we make about what constitutes a trace. One of my team members is currently developing a Spring Framework ApplicationContextEvent listener which will start a trace and parent span when the context lifecycle initializes and end the trace when the application is ready to accept work. Another trace will cover the context shutdown lifecycle. This will make it easy to find all the logs related to application startup and shutdown, as well as to monitor total application startup time and any remote calls made during startup.
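A rough sketch of such a listener, using a stand-in tracer interface; treating `ContextRefreshedEvent` as the "ready to accept work" signal is an assumption of the sketch:

```java
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextRefreshedEvent;

// Illustrative only: the Tracer and Span types are stand-ins, not a real API.
final class StartupTraceListener implements ApplicationListener<ContextRefreshedEvent> {
    interface Span { void end(); }
    interface Tracer { Span startRootSpan(String name); }

    private final Span startupSpan;

    StartupTraceListener(Tracer tracer) {
        // The listener bean is created early in context initialization,
        // so the startup span is opened here.
        this.startupSpan = tracer.startRootSpan("application-startup");
    }

    @Override
    public void onApplicationEvent(ContextRefreshedEvent event) {
        // Context refresh has completed; treat this as "ready to accept work".
        startupSpan.end();
    }
}
```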

@jmacd
Contributor

jmacd commented Nov 26, 2019

@yurishkuro I'm comfortable sending exceptions to a tracing system. I have prior experience telling me that users want to find traces for (rare, interesting) exceptions. It could be the exception-monitoring system requesting traces for rare errors, but a rare or interesting exception is only one of many signals that might tell a tracing system what to collect, and the tracing system has to be wary of its total resource budget. I think an integrated tracing system stands a better chance of finding exceptional traces than two separate systems.

I would also be comfortable (as a tracing system) accepting spans with 0 values for trace_id and span_id containing exceptions that happen in an untraced context. Simply generate identifiers for them and report them as 0-duration root spans. It's not unreasonable, I mean.
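A sketch of how a receiving system could do that; the record layout and ID sizes are assumptions of the sketch:

```java
import java.security.SecureRandom;
import java.util.HexFormat;

// Illustrative only: an exception event arriving with all-zero trace/span IDs
// is wrapped in a synthetic zero-duration root span with freshly generated IDs.
final class UntracedExceptionExample {
    record SpanRecord(String traceId, String spanId, String parentSpanId,
                      long startNanos, long endNanos, String eventName) {}

    private static final SecureRandom RANDOM = new SecureRandom();

    static SpanRecord wrapUntracedException(String eventName, long timestampNanos) {
        return new SpanRecord(randomHex(16), randomHex(8), "",
                timestampNanos, timestampNanos, eventName); // zero duration
    }

    private static String randomHex(int numBytes) {
        byte[] bytes = new byte[numBytes];
        RANDOM.nextBytes(bytes);
        return HexFormat.of().formatHex(bytes);
    }
}
```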

@kbrockhoff
Member Author

I have started a project to test the design of the message types needed for recording errors. In addition to testing population, I plan to verify the data can populate all needed fields in various exception monitoring system APIs. https://github.com/kbrockhoff/opentelemetry-spec-research


## Internal details

This section specifies data format in Protocol Buffers for `TimedEvent` messages within the overall OTLP. It follows and expands on the flattened data structure in [#59 OTLP Trace Data Format](https://github.com/open-telemetry/oteps/pull/59).

@@ -0,0 +1,150 @@
# Typed Events Specification

Add enumerated subtype and subtype-specific data structures to the existing generic tracing `TimedEvent`. Initial supported subtypes would be `ANNOTATION` (attributes only), `MESSAGE_SENT`, `MESSAGE_RECEIVED`, `ERROR`.
Member


The choice of subtypes is not clear. Why this specific list? What is the reasoning? Do these subtypes unambiguously map to event subtypes commonly used by event generation, transmission and recording systems and protocols? What is the proposed mapping approach for subtypes that do not fit one of these?


The error reporting APIs of AWS X-Ray, Google Stackdriver Error Reporting and Rollbar were analyzed to determine the data supported by each. This specification includes structures for providing most or all of the data these systems support. The AWS X-Ray SDKs for various languages provide an example of how this data can be populated by SDK implementations.

The annotation and sent/received message type data structures are from the OpenCensus data protocol.
Member


When proposing an Events data model I believe it is important to consider a wider range of prior art, including legacy formats and protocols (e.g. SYSLOG RFC5424, SNMP, Windows Events, etc). In my opinion it should be a goal to unambiguously represent legacy formats. This is important in order to build tools that understand all telemetry data types and can correlate telemetry data generated in legacy formats with telemetry data generated according to newer standards. This will make the data model applicable to a wider range of real-world systems.

@tsloughter
Member

My note from the meeting this morning: it would be great if the proto message were kept in a separate proto file, to make clear that in the future exceptions and stack traces could be reported to a service separately from a trace.

If not for exception/error reporting, then maybe eventually as part of a structured logging API.

@tigrannajaryan
Member

My note from the meeting this morning: it would be great if the proto message were kept in a separate proto file, to make clear that in the future exceptions and stack traces could be reported to a service separately from a trace.

If not for exception/error reporting, then maybe eventually as part of a structured logging API.

I believe this applies to Events generally, not just errors. Some events may be associated with Spans and be contained in Span data, some other events may have no association with Spans. For the proto I would support Span Events (what we have now) and standalone Events.

@kbrockhoff requested a review from arminru as a code owner on March 21, 2020
@tedsuo
Contributor

tedsuo commented Feb 13, 2023

@kbrockhoff I'm going to close this, as we have taken a different approach by adding the events concept to logs, which can also be associated with spans. Please reopen or file a new OTEP if you feel there is more to address here.

@tedsuo closed this on Feb 13, 2023