This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Proposal: Expand and Improve Span Events #69

Closed
wants to merge 5 commits

Conversation

kbrockhoff
Member

Provide specific subtypes of TimedEvent for messages and errors with data structures to hold relevant information.

Member

@yurishkuro left a comment


I have a philosophical question - does this information really belong to a trace? An exception may happen during application start-up or some other activity that is not necessarily modeled with a trace, and the data format needed to capture such exception will be identical to this. There are standalone solutions dedicated to capturing exceptions (e.g. Sentry), which have rich data format and instrumentation libraries already. It seems it would be more useful if that separate exception capture pipeline could be enriched with trace/span IDs so that the data could be merged on the backend.

repeated Exception exceptions = 10;

// Method argument values in use when the error occurred.
repeated AttributeKeyValue arguments = 11;
Member


shouldn't this be a part of StackFrame? I remember in Sentry for Python I could inspect the exception and look at local variables (not just method args) at every frame.

Member Author


Good idea to support this per stack frame. In many cases, the exception may be caught and recorded a substantial distance from where it was thrown. In those cases, the only values available may be the parameter values submitted by the external system. Those values need to be recorded as well, because they often prove valuable in diagnosing issues.

StackTrace stack = 5;
}

// Collection of exceptions which triggered the error or fault.
Member


It's not clear what a "collection" means semantically. Usually exceptions are nested.

Member Author


Good point. I just copied the AWS X-Ray approach with parent ids denoting nesting. I will look at improving comprehensibility.

@Oberon00
Member

Why can't this be handled with semantic conventions? There was discussion about having "typed Span" APIs (e.g. providing HttpSpan.setStatusCode(int code) mapping to Span.setAttribute("http.status_code", code)) to improve type safety at the API level without requiring us to lock into certain data structures in a backwards-incompatible way, see #25.
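A minimal sketch of what such a typed wrapper could look like; the class shape, the `Span` interface, and the delegation below are illustrative assumptions rather than an agreed API:

```java
// Hypothetical typed wrapper: type-safe setters that delegate to the generic
// attribute API, so the wire format stays a plain key/value list.
final class HttpSpan {
    // Minimal span surface assumed by this sketch.
    interface Span {
        void setAttribute(String key, long value);
        void setAttribute(String key, String value);
    }

    private final Span delegate;

    HttpSpan(Span delegate) {
        this.delegate = delegate;
    }

    // Maps to the semantic-convention attribute "http.status_code".
    HttpSpan setStatusCode(int code) {
        delegate.setAttribute("http.status_code", code);
        return this;
    }

    // Maps to the semantic-convention attribute "http.method".
    HttpSpan setMethod(String method) {
        delegate.setAttribute("http.method", method);
        return this;
    }
}
```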

Also, regarding exceptions, there is discussion on that topic in open-telemetry/opentelemetry-specification#306.

@tigrannajaryan
Member

@kbrockhoff I think generally our approach has been that we favour semantic conventions over specialized fields. We do this for Span attributes and for Resource labels. Is there a good reason to deviate from this general approach for Events? Why not have semantic conventions for Event attributes instead and have uniform approach across Spans, Resources and Events?
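For illustration, an error could then be captured as an ordinary event whose payload is a flat attribute map following such a convention. A minimal sketch, with placeholder `error.*` key names that are not settled conventions:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: an error recorded as an ordinary event whose payload is a
// flat key/value map following a hypothetical semantic convention, rather than
// a specialized ERROR subtype on the event itself.
final class ErrorEventExample {
    static Map<String, String> errorEventAttributes(Throwable t) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("error.type", t.getClass().getName());
        attrs.put("error.message", String.valueOf(t.getMessage()));
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        attrs.put("error.stack", sw.toString());
        return attrs;
    }
}
```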

@Oberon00
Member

Oberon00 commented Nov 20, 2019

In fact, we already do have some semantic conventions for events, see the semantic conventions for gRPC.

@kbrockhoff
Member Author

I think semantic conventions are a preferable approach if complex and/or large values are supported. I don't think it is workable to record full exception info without having data types specifically designed to do so. I will do some experimenting with AttributeKeyValue to see if it makes sense to expand the allowed values to hold an exception data structure, as well as large message content stored as a byte array or string.

@kbrockhoff
Member Author

Just a clarification on wording. I am not proposing typed spans. Only typed events.

@Oberon00
Member

I don't think it is workable to record full exception info without having data types specifically designed to do so

I think that currently this is certainly the case as semantic attributes don't support nesting (although in theory you could do things like "exception.0.cause": 1 with 1 referring to the "exception.1.*" attributes, but I think that would be an awkward hack). However, I think it would be better to resolve open-telemetry/opentelemetry-specification#341 in such a way that exception information can be encoded properly.

@tigrannajaryan
Member

There are a couple options here:

  1. Wait until there is clarity around multivalue attributes and nested attributes. I am going to work on a multivalue attributes proposal next week. Nested attributes may come after that. Once we have these in the spec I intend to add support in the proto as well.
  2. If waiting is not an option, arbitrarily complex nested structures can be represented today as JSON, YAML, etc. strings (assuming a tree-like structure). The format will be part of the semantic convention.
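As a rough illustration of option 2, a nested cause chain could be collapsed into a single string attribute whose JSON shape is fixed by the convention. The shape below is an assumption, not a proposed format:

```java
// Illustrative only: encode a nested cause chain as a JSON string so it fits
// into today's flat string attributes. The escaping is deliberately minimal
// for the sketch; a real implementation would use a proper JSON library.
final class ExceptionJson {
    static String toJson(Throwable t) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"type\":\"").append(escape(t.getClass().getName())).append('"');
        sb.append(",\"message\":\"").append(escape(String.valueOf(t.getMessage()))).append('"');
        if (t.getCause() != null && t.getCause() != t) {
            sb.append(",\"cause\":").append(toJson(t.getCause()));
        }
        return sb.append('}').toString();
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```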

@yurishkuro
Member

I am going to repost my comment, as it seems to have gotten lost in the discussion.

I have a philosophical question - does this information really belong to a trace? An exception may happen during application start-up or some other activity that is not necessarily modeled with a trace, and the data format needed to capture such exception will be identical to this. There are standalone solutions dedicated to capturing exceptions (e.g. Sentry), which have rich data format and instrumentation libraries already. It seems it would be more useful if that separate exception capture pipeline could be enriched with trace/span IDs so that the data could be merged on the backend.

@tigrannajaryan
Member

I have a philosophical question - does this information really belong to a trace?

@yurishkuro I think it can belong to a trace but you are right in that it may also need to be represented without association with a trace. That's one of the areas I plan to explore in the near future - the standalone Events (or Logs if you will).

I think the Event concept inside the span is intended for use cases like the one that @kbrockhoff describes, although I do not think that Events need to be specialized that way. In my opinion the generic key/value list (possibly extended with arrays and nesting) supported in the Event is the right level of abstraction. This is also because it allows us to avoid the need for the OpenTelemetry spec to decide whether that particular information belongs to a trace. If the end user wants to associate that information with a trace then so be it; OpenTelemetry does not prohibit it and provides generic storage that can contain it.

@jmacd
Contributor

jmacd commented Nov 20, 2019

@yurishkuro I like your philosophical question, but I want to flip it around. If you had a data format to capture events like you described, then wouldn't it be natural to represent the Span as a simple aggregation of events having common { span_id, trace_id } properties? This was how the Golang streaming-Tracer prototype (removed in open-telemetry/opentelemetry-go#221) behaved. Even though we answer "yes" to this question, you will still need a Span struct to represent aggregated events (one START, one END, some ERRORs, some MESSAGE_SENTs, some MESSAGE_RECEIVEDs, ...), and these events will still belong to those Span structs.
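A rough sketch of that view, with a span reconstructed as nothing more than the events that share a `{ span_id, trace_id }` pair; all names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a span reconstructed as the list of events that carry the
// same (traceId, spanId) pair, including the START and END events.
final class EventAggregationExample {
    record Event(String traceId, String spanId, String type, long timestampNanos) {}

    record SpanKey(String traceId, String spanId) {}

    static Map<SpanKey, List<Event>> groupIntoSpans(List<Event> stream) {
        Map<SpanKey, List<Event>> spans = new HashMap<>();
        for (Event e : stream) {
            spans.computeIfAbsent(new SpanKey(e.traceId(), e.spanId()),
                                  k -> new ArrayList<>()).add(e);
        }
        return spans;
    }
}
```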

I'm in agreement with @Oberon00: all kinds of events could be described by semantic conventions. I'd prefer if these conventions were gathered somewhere, perhaps as part of schema.org. I've written up some examples of how an event-based format can encode schematic data, along with speculative remarks about how, in a traditional log (i.e., where you read a sequence of events from the start), you can compress events by referring backwards in the log; it includes a semantic convention for referring to earlier events.

Take note--this is super sketchy--I cut it out of context from an internal document!

https://docs.google.com/document/d/1IDolP2gAzJBQ2hRByhxWw_y6mTkCi_DYqz5up5YtqD8

@yurishkuro
Member

then wouldn't it be natural to represent the Span as a simple aggregation of events having common { span_id, trace_id } properties?

@jmacd yes, but I would still avoid reporting exceptions this way, because some exceptions may not have an associated trace/span ID, so why should they go to a different backend because of that?

@kbrockhoff
Member Author

I expect to have APIs to record events via Span.addEvent as well as via a standalone logger/publisher. If recorded via the standalone path, the SDK should assign the event to a Span if one is in progress. It is the job of the SDK to make sure all events associated with a span are included in the event collection of exported spans, as well as to route all events, no matter how they were recorded, to event exporters.
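A minimal sketch of that routing logic, with hypothetical interfaces standing in for the real API:

```java
// Illustrative only: events recorded through a standalone publisher are
// attached to the in-progress span if there is one, and are always handed
// to the event exporters.
final class EventPublisherExample {
    interface Event {}

    interface Span {
        void addEvent(Event event);
    }

    interface EventExporter {
        void export(Event event);
    }

    static void record(Event event, Span currentSpanOrNull, EventExporter exporter) {
        if (currentSpanOrNull != null) {
            currentSpanOrNull.addEvent(event);   // becomes part of the exported span
        }
        exporter.export(event);                  // always routed to event exporters
    }
}
```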

After all the above comments I still think it is necessary to have specific data structures for recording error information. Maps of maps with semantic convention names will be no more future-proof or flexible than having defined types. This is not an open-ended problem. The APIs of Sentry, Rollbar, Stacktrace and X-Ray are very similar. It will be straightforward to define datatypes that can easily be translated into any of them.

After thinking about the problem some more, I think there have to be two hash values as required fields.

  1. Hash of the specific exception instance. If the application has multiple instrumented libraries a single exception may get recorded more than once. The SDK needs to deduplicate these.
  2. Hash of the exception type and location it was thrown. This would dramatically cut down on the amount of data that needs to be transported as well as making aggregated exception counts easy.

To make this work, there will have to be very specific specifications, on a language-by-language basis, for how to calculate the hashes.
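As an illustration only, the two hashes might be derived along these lines in Java; the exact inputs and algorithm would be the subject of those per-language specifications:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative only: one hash identifies the specific exception instance (for
// de-duplication across instrumented libraries in the same process), the other
// identifies the exception type plus throw site (for aggregation).
final class ExceptionHashExample {
    static String instanceHash(Throwable t) {
        // Identity-based: stable for the lifetime of the instance within one
        // process, which is enough to de-duplicate repeated recordings.
        return Integer.toHexString(System.identityHashCode(t));
    }

    static String typeAndLocationHash(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        String throwSite = frames.length > 0 ? frames[0].toString() : "unknown";
        return sha256(t.getClass().getName() + "|" + throwSite);
    }

    private static String sha256(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```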

@kbrockhoff
Member Author

On the subject of defining an event publishing and/or logging API, what exactly is the objective?

I saw this line below in a code review I did several days ago. It was written by one of the tech leads here.

logger.info("*******WorkItem executed successfully for ***" + workUnit.getWorkType() + " workItemId=" + workUnit.getWorkItemId());

The documentation for the SLF4J API is 4 paragraphs long, yet 90%+ of all enterprise Java developers have never read it.

I pulled the table below off my Slack feed from yesterday's performance test. It was generated by searching for ERROR level logs and doing group by in Kibana. (some details changed)

log:"com.example.business.SystemBusiness - testPurposeEmailAddress property in systemtype.properties is either null or of zero length" 8,494
log:"org.postgresql.util.PSQLException: ERROR: deadlock detected" 496
log:"u0009org.postgresql.util.PSQLException: The connection attempt failed" 184
log:"Connection to perf1-db.example.com:5432 refused" 31
log:"com.example.telemetry.servlet.TelemetryHttpServletFilter - request failed" 14
log:"breaking change introduced in apache client" 8
log:"]: still saving" 4

Of the 9,231 total ERRORs, 8,506 are noise from issues that have zero impact on the functionality or performance of the application.

This is from one of the LEAST dysfunctional organizations in the world. Most enterprise software is much worse.

So again, what kinds of events and logs do you expect to be recorded via the OpenTelemetry API, and where will they be routed?

@kbrockhoff
Member Author

Another thing to consider is the assumptions we make about what constitutes a trace. One of my team members is currently developing a Spring Framework ApplicationContextEvent listener which will start a trace and parent span when the context lifecycle initializes and end the trace when the application is ready to accept work. Another trace will cover the context shutdown lifecycle. This will make it easy to find all the logs related to application startup and shutdown, as well as to monitor total application startup time and any remote calls made during startup.
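A rough sketch of such a listener, using a stand-in tracer interface; treating `ContextRefreshedEvent` as the "ready to accept work" signal is an assumption of the sketch:

```java
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextRefreshedEvent;

// Illustrative only: the Tracer and Span types are stand-ins, not a real API.
final class StartupTraceListener implements ApplicationListener<ContextRefreshedEvent> {
    interface Span { void end(); }
    interface Tracer { Span startRootSpan(String name); }

    private final Span startupSpan;

    StartupTraceListener(Tracer tracer) {
        // The listener bean is created early in context initialization,
        // so the startup span is opened here.
        this.startupSpan = tracer.startRootSpan("application-startup");
    }

    @Override
    public void onApplicationEvent(ContextRefreshedEvent event) {
        // Context refresh has completed; treat this as "ready to accept work".
        startupSpan.end();
    }
}
```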

@jmacd
Contributor

jmacd commented Nov 26, 2019

@yurishkuro I'm comfortable sending exceptions to a tracing system. I have prior experience telling me that users want to find traces for (rare, interesting) exceptions. It could be the exception-monitoring system requesting traces for rare errors, but a rare or interesting exception is only one of many signals that might tell a tracing system what to collect, and the tracing system has to be wary of its total resource budget. I think an integrated tracing system stands a better chance of finding exceptional traces than two separate systems.

I would also be comfortable (as a tracing system) accepting spans with 0 values for trace_id and span_id containing exceptions that happen in an untraced context. Simply generate identifiers for them and report them as 0-duration root spans. It's not unreasonable, I mean.
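A sketch of how a receiving system could do that; the record layout and ID sizes are assumptions of the sketch:

```java
import java.security.SecureRandom;
import java.util.HexFormat;

// Illustrative only: an exception event arriving with all-zero trace/span IDs
// is wrapped in a synthetic zero-duration root span with freshly generated IDs.
final class UntracedExceptionExample {
    record SpanRecord(String traceId, String spanId, String parentSpanId,
                      long startNanos, long endNanos, String eventName) {}

    private static final SecureRandom RANDOM = new SecureRandom();

    static SpanRecord wrapUntracedException(String eventName, long timestampNanos) {
        return new SpanRecord(randomHex(16), randomHex(8), "",
                timestampNanos, timestampNanos, eventName); // zero duration
    }

    private static String randomHex(int numBytes) {
        byte[] bytes = new byte[numBytes];
        RANDOM.nextBytes(bytes);
        return HexFormat.of().formatHex(bytes);
    }
}
```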

@kbrockhoff
Member Author

I have started a project to test the design of the message types needed for recording errors. In addition to testing population, I plan to verify the data can populate all needed fields in various exception monitoring system APIs. https://github.com/kbrockhoff/opentelemetry-spec-research


## Internal details

This section specifies data format in Protocol Buffers for `TimedEvent` messages within the overall OTLP. It follows and expands on the flattened data structure in [#59 OTLP Trace Data Format](https://github.com/open-telemetry/oteps/pull/59).

@@ -0,0 +1,150 @@
# Typed Events Specification

Add enumerated subtype and subtype-specific data structures to the existing generic tracing `TimedEvent`. Initial supported subtypes would be `ANNOTATION` (attributes only), `MESSAGE_SENT`, `MESSAGE_RECEIVED`, `ERROR`.
Member


The choice of subtypes is not clear. Why this specific list? What is the reasoning? Do these subtypes unambiguously map to event subtypes commonly used by event generation, transmission and recording systems and protocols? What is the proposed mapping approach for subtypes that do not fit one of these?


The error reporting APIs of AWS X-Ray, Google Stackdriver Error Reporting and Rollbar were analyzed to determine the data supported by each. This specification includes structures for providing most or all of the data these systems support. The AWS X-Ray SDKs for various languages provide an example of how this data can be populated by SDK implementations.

The annotation and sent/received message type data structures are from the OpenCensus data protocol.
Member


When proposing an Events data model I believe it is important to consider a wider range of prior art, including legacy formats and protocols (e.g. SYSLOG RFC5424, SNMP, Windows Events, etc). In my opinion it should be a goal to unambiguously represent legacy formats. This is important in order to build tools that understand all telemetry data types and can correlate telemetry data generated in legacy formats with telemetry data generated according to newer standards. This will make the data model applicable to a wider range of real-world systems.

@tsloughter
Member

My note from the meeting this morning: it would be great if the proto message were kept in a separate proto file, to make clear that in the future exceptions and stack traces could be reported to a service separately from a trace.

If not for exception/error reporting, then maybe eventually as part of a structured logging API.

@tigrannajaryan
Member

My note from the meeting this morning: it would be great if the proto message were kept in a separate proto file, to make clear that in the future exceptions and stack traces could be reported to a service separately from a trace.

If not for exception/error reporting, then maybe eventually as part of a structured logging API.

I believe this applies to Events generally, not just errors. Some events may be associated with Spans and be contained in Span data, some other events may have no association with Spans. For the proto I would support Span Events (what we have now) and standalone Events.

@kbrockhoff requested a review from arminru as a code owner on March 21, 2020
@tedsuo
Contributor

tedsuo commented Feb 13, 2023

@kbrockhoff I'm going to close this, as we have taken a different approach by adding the events concept to logs, which can also be associated with spans. Please reopen or file a new OTEP if you feel there is more to address here.

@tedsuo closed this on Feb 13, 2023