Add OTLP Trace Data Format specification #59

Merged
371 changes: 371 additions & 0 deletions text/0059-otlp-trace-data-format.md
# OTLP Trace Data Format

_Author: Tigran Najaryan, Splunk_

**Status:** `approved`

The OTLP Trace Data Format specification describes the structure of the trace data transported by the OpenTelemetry Protocol (RFC0035).

## Motivation

This document is a continuation of the OpenTelemetry Protocol RFC0035 and is a necessary part of the OTLP specification.

## Explanation

The OTLP Trace Data Format is primarily inherited from the OpenCensus protocol. Several changes are introduced with the goal of more efficient serialization. Notable differences from the OpenCensus protocol are:

1. Removed `Node` as a concept.
2. Extended `Resource` to better describe the source of the telemetry data.
3. Replaced attribute maps with lists of key/value pairs.
4. Eliminated unnecessary additional nesting in various values.

Changes 1-2 are conceptual; changes 3-4 improve performance.

## Internal details

This section specifies the data format in Protocol Buffers.

### Resource

```protobuf
// Resource information. This describes the source of telemetry data.
message Resource {
  // labels is a collection of attributes that describe the resource. See OpenTelemetry
  // specification semantic conventions for standardized label names:
  // https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/data-resource-semantic-conventions.md
  repeated AttributeKeyValue labels = 1;

  // dropped_labels_count is the number of dropped labels. If the value is 0, then
  // no labels were dropped.
  int32 dropped_labels_count = 2;
}
```
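The relationship between `labels` and `dropped_labels_count` can be illustrated with a minimal Python sketch. It is not part of the specification: the `MAX_LABELS` limit, the `make_resource` helper, the simplification of labels to string pairs, and the label names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical limit chosen for this example; the RFC does not define one.
MAX_LABELS = 3


@dataclass
class Resource:
    # Stands in for `repeated AttributeKeyValue labels`, simplified to (key, value) pairs.
    labels: List[Tuple[str, str]] = field(default_factory=list)
    # Stands in for `int32 dropped_labels_count`; 0 means no labels were dropped.
    dropped_labels_count: int = 0


def make_resource(labels: List[Tuple[str, str]], max_labels: int = MAX_LABELS) -> Resource:
    """Keep the first `max_labels` labels and record how many were dropped."""
    kept = labels[:max_labels]
    return Resource(labels=kept, dropped_labels_count=len(labels) - len(kept))


# Illustrative label names, not taken from the semantic conventions.
resource = make_resource([
    ("service", "checkout"),
    ("version", "1.2.0"),
    ("host", "web-1"),
    ("zone", "zone-a"),
])
print(len(resource.labels), resource.dropped_labels_count)
```

A receiver that sees a non-zero `dropped_labels_count` knows the label list is incomplete, even though it cannot tell which labels are missing.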

### Span

```protobuf
// Span represents a single operation within a trace. Spans can be
// nested to form a trace tree. Spans may also be linked to other spans
// from the same or different trace and form graphs. Often, a trace
// contains a root span that describes the end-to-end latency, and one
// or more subspans for its sub-operations. A trace can also contain
// multiple root spans, or none at all. Spans do not need to be
// contiguous - there may be gaps or overlaps between spans in a trace.
//
// The next field id is 18.
message Span {
  // trace_id is the unique identifier of a trace. All spans from the same trace share
  // the same `trace_id`. The ID is a 16-byte array. An ID with all zeroes
  // is considered invalid.
  //
  // This field is semantically required. If an empty or invalid trace_id is received:
  // - The receiver MAY reject the invalid data and respond with the appropriate error
  //   code to the sender.
  // - The receiver MAY accept the invalid data and attempt to correct it.
  bytes trace_id = 1;

  // span_id is a unique identifier for a span within a trace, assigned when the span
  // is created. The ID is an 8-byte array. An ID with all zeroes is considered
  // invalid.
  //
  // This field is semantically required. If an empty or invalid span_id is received:
  // - The receiver MAY reject the invalid data and respond with the appropriate error
  //   code to the sender.
  // - The receiver MAY accept the invalid data and attempt to correct it.
  bytes span_id = 2;

  // TraceStateEntry is the entry that is repeated in the tracestate field (see below).
  message TraceStateEntry {
    // key must begin with a lowercase letter, and can only contain
    // lowercase letters 'a'-'z', digits '0'-'9', underscores '_', dashes
    // '-', asterisks '*', and forward slashes '/'.
    string key = 1;

    // value is an opaque string of up to 256 printable ASCII (RFC0020)
    // characters (i.e., the range 0x20 to 0x7E), except ',' and '='.
    // Note that this also excludes tabs, newlines, carriage returns, etc.
    string value = 2;
  }

  // tracestate conveys information about request position in multiple distributed tracing graphs.
  // It is a collection of TraceStateEntry with a maximum of 32 members in the collection.
  //
  // See https://github.com/w3c/distributed-tracing for more details about this field.
  repeated TraceStateEntry tracestate = 3;

  // parent_span_id is the `span_id` of this span's parent span. If this is a root span, then this
  // field must be omitted. The ID is an 8-byte array.
  bytes parent_span_id = 4;

  // resource that is associated with this span. Optional. If not set, this span
  // should be part of a ResourceSpans message that does include the resource information,
  // unless resource information is unknown.
  Resource resource = 5;

  // name describes the span's operation.
  //
  // For example, the name can be a qualified method name or a file name
  // and a line number where the operation is called. A best practice is to use
  // the same display name at the same call point in an application.
  // This makes it easier to correlate spans in different traces.
  //
  // This field is semantically required to be set to a non-empty string.
  //
  // This field is required.
  string name = 6;

  // SpanKind is the type of span. Can be used to specify additional relationships between spans
  // in addition to a parent/child relationship.
  enum SpanKind {
    // Unspecified. Do NOT use as default.
    // Implementations MAY assume SpanKind to be INTERNAL when receiving UNSPECIFIED.
    SPAN_KIND_UNSPECIFIED = 0;

    // Indicates that the span represents an internal operation within an application,
    // as opposed to an operation happening at the boundaries. Default value.
    INTERNAL = 1;

    // Indicates that the span covers server-side handling of an RPC or other
    // remote network request.
    SERVER = 2;

    // Indicates that the span describes a request to some remote service.
    CLIENT = 3;

    // Indicates that the span describes a producer sending a message to a broker.
    // Unlike CLIENT and SERVER, there is often no direct critical path latency relationship
    // between producer and consumer spans. A PRODUCER span ends when the message was accepted
    // by the broker while the logical processing of the message might span a much longer time.
    PRODUCER = 4;

    // Indicates that the span describes a consumer receiving a message from a broker.
    // Like the PRODUCER kind, there is often no direct critical path latency relationship
    // between producer and consumer spans.
    CONSUMER = 5;
  }

  // kind field distinguishes between spans generated in a particular context. For example,
  // two spans with the same name may be distinguished using `CLIENT` (caller)
  // and `SERVER` (callee) to identify network latency associated with the span.
  SpanKind kind = 7;

  // start_time_unixnano is the start time of the span. On the client side, this is the time
  // kept by the local machine where the span execution starts. On the server side, this
  // is the time when the server's application handler starts running.
  //
  // This field is semantically required and it is expected that end_time >= start_time.
  //
  // This field is required.
  int64 start_time_unixnano = 8;

  // end_time_unixnano is the end time of the span. On the client side, this is the time
  // kept by the local machine where the span execution ends. On the server side, this
  // is the time when the server application handler stops running.
  //
  // This field is semantically required and it is expected that end_time >= start_time.
  //
  // This field is required.
  int64 end_time_unixnano = 9;

  // attributes is a collection of key/value pairs. The value can be a string,
  // an integer, a double or the Boolean values `true` or `false`. Note, global attributes
  // like server name can be set using the resource API. Examples of attributes:
  //
  //     "/http/user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
  //     "/http/server_latency": 300
  //     "abc.com/myattribute": true
  //     "abc.com/score": 10.239
  repeated AttributeKeyValue attributes = 10;

  // dropped_attributes_count is the number of attributes that were discarded. Attributes
  // can be discarded because their keys are too long or because there are too many
  // attributes. If this value is 0, then no attributes were dropped.
  int32 dropped_attributes_count = 11;

  // TimedEvent is a time-stamped annotation of the span, consisting of either
  // user-supplied key-value pairs, or details of a message sent/received between Spans.
  message TimedEvent {
    // time_unixnano is the time the event occurred.
    int64 time_unixnano = 1;

    // name is a user-supplied description of the event.
    string name = 2;

    // attributes is a collection of attribute key/value pairs on the event.
    repeated AttributeKeyValue attributes = 3;

    // dropped_attributes_count is the number of dropped attributes. If the value is 0,
    // then no attributes were dropped.
    int32 dropped_attributes_count = 4;
  }

  // timed_events is a collection of TimedEvent items.
  repeated TimedEvent timed_events = 12;

  // dropped_timed_events_count is the number of dropped timed events. If the value is 0,
  // then no events were dropped.
  int32 dropped_timed_events_count = 13;

  // Link is a pointer from the current span to another span in the same trace or in a
  // different trace. For example, this can be used in batching operations,
  // where a single batch handler processes multiple requests from different
  // traces or when the handler receives a request from a different project.
  // See also Links specification:
  // https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/overview.md#links-between-spans
  message Link {
    // trace_id is a unique identifier of a trace that this linked span is part of.
    // The ID is a 16-byte array.
    bytes trace_id = 1;

    // span_id is a unique identifier for the linked span. The ID is an 8-byte array.
    bytes span_id = 2;

    // tracestate is the trace state associated with the link.
    repeated TraceStateEntry tracestate = 3;

    // attributes is a collection of attribute key/value pairs on the link.
    repeated AttributeKeyValue attributes = 4;

    // dropped_attributes_count is the number of dropped attributes. If the value is 0,
    // then no attributes were dropped.
    int32 dropped_attributes_count = 5;
  }

  // links is a collection of Links, which are references from this span to a span
  // in the same or different trace.
  repeated Link links = 14;

  // dropped_links_count is the number of dropped links after the maximum size was
  // enforced. If this value is 0, then no links were dropped.
  int32 dropped_links_count = 15;

  // status is an optional final status for this span. Semantically, when status
  // is not set, the span is assumed to have ended without errors, i.e. Status.Ok (code = 0).
  Status status = 16;

  // child_span_count is an optional number of local child spans that were generated while this
  // span was active. If set, allows an implementation to detect missing child spans.
  int32 child_span_count = 17;
}

// The Status type defines a logical error model that is suitable for different
// programming environments, including REST APIs and RPC APIs.
message Status {

  // StatusCode mirrors the codes defined at
  // https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#statuscanonicalcode
  enum StatusCode {
    Ok = 0;
    Cancelled = 1;
    UnknownError = 2;
    InvalidArgument = 3;
    DeadlineExceeded = 4;
    NotFound = 5;
    AlreadyExists = 6;
    PermissionDenied = 7;
    ResourceExhausted = 8;
    FailedPrecondition = 9;
    Aborted = 10;
    OutOfRange = 11;
    Unimplemented = 12;
    InternalError = 13;
    Unavailable = 14;
    DataLoss = 15;
    Unauthenticated = 16;
  };

  // The status code. This is an optional field. It is safe to assume 0 (OK)
  // when not set.
  StatusCode code = 1;

  // A developer-facing human readable error message.
  string message = 2;
}
```
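The field comments in `Span` imply a handful of checks a receiver could run on incoming data. The following Python sketch collects them under stated assumptions: the function names and the error strings are hypothetical, and the sketch only reports problems. Whether a receiver rejects or attempts to correct invalid data is left open by the specification.

```python
import re
from typing import List, Tuple

TRACE_ID_SIZE = 16               # bytes
SPAN_ID_SIZE = 8                 # bytes
MAX_TRACESTATE_ENTRIES = 32
MAX_TRACESTATE_VALUE_LEN = 256

# Key: a lowercase letter followed by lowercase letters, digits, '_', '-', '*' or '/'.
TRACESTATE_KEY_RE = re.compile(r"[a-z][a-z0-9_\-*/]*")


def is_valid_id(value: bytes, size: int) -> bool:
    """Valid means: exactly `size` bytes and not all zeroes."""
    return len(value) == size and any(value)


def is_valid_tracestate_entry(key: str, value: str) -> bool:
    """Check one TraceStateEntry against the key and value rules."""
    value_ok = len(value) <= MAX_TRACESTATE_VALUE_LEN and all(
        0x20 <= ord(ch) <= 0x7E and ch not in ",=" for ch in value
    )
    return TRACESTATE_KEY_RE.fullmatch(key) is not None and value_ok


def validate_span(
    trace_id: bytes,
    span_id: bytes,
    name: str,
    start_time_unixnano: int,
    end_time_unixnano: int,
    tracestate: List[Tuple[str, str]],
) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not is_valid_id(trace_id, TRACE_ID_SIZE):
        problems.append("trace_id must be 16 bytes and not all zeroes")
    if not is_valid_id(span_id, SPAN_ID_SIZE):
        problems.append("span_id must be 8 bytes and not all zeroes")
    if not name:
        problems.append("name must be a non-empty string")
    if end_time_unixnano < start_time_unixnano:
        problems.append("end_time_unixnano must be >= start_time_unixnano")
    if len(tracestate) > MAX_TRACESTATE_ENTRIES:
        problems.append("tracestate must have at most 32 entries")
    for key, value in tracestate:
        if not is_valid_tracestate_entry(key, value):
            problems.append(f"invalid tracestate entry: {key!r}")
    return problems


good = validate_span(bytes(range(1, 17)), bytes(range(1, 9)), "GET /users",
                     1_000, 2_000, [("vendor/key", "opaque-value")])
bad = validate_span(bytes(16), b"", "", 2_000, 1_000, [("Vendor", "a=b")])
print(good)
print(bad)
```

Returning a list of problems rather than raising on the first one keeps the decision to reject or repair with the caller, which matches the two MAY clauses on `trace_id` and `span_id`.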

### AttributeKeyValue

```protobuf
// AttributeKeyValue is a key-value pair that is used to store Span attributes, Resource
// labels, etc.
message AttributeKeyValue {
  // ValueType is the enumeration of possible types that value can have.
  enum ValueType {
    STRING = 0;
    BOOL = 1;
    INT64 = 2;
    DOUBLE = 3;
  };

  // key part of the key-value pair.
  string key = 1;

  // The type of the value.
  ValueType type = 2;

  // Only one of the following fields is supposed to contain data (determined by `type` field value).
  // This is deliberately not using Protobuf `oneof` for performance reasons (verified by benchmarks).

  // A string value.
  string string_value = 3;
  // A 64-bit signed integer.
  int64 int64_value = 4;
  // A Boolean value represented by `true` or `false`.
  bool bool_value = 5;
  // A double value.
  double double_value = 6;
}
```
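Because `AttributeKeyValue` does not use `oneof`, the `type` field is the only indication of which value field carries data. A minimal Python sketch of that convention follows; the dataclass mirrors the message fields, while `make_attribute` and `attribute_value` are hypothetical helpers, not part of any OpenTelemetry library.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Union


class ValueType(IntEnum):
    STRING = 0
    BOOL = 1
    INT64 = 2
    DOUBLE = 3


@dataclass
class AttributeKeyValue:
    key: str
    type: ValueType = ValueType.STRING
    # Only the field selected by `type` is meaningful; the others keep defaults.
    string_value: str = ""
    int64_value: int = 0
    bool_value: bool = False
    double_value: float = 0.0


def make_attribute(key: str, value: Union[str, bool, int, float]) -> AttributeKeyValue:
    """Build an attribute, setting `type` and the single matching value field."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return AttributeKeyValue(key, ValueType.BOOL, bool_value=value)
    if isinstance(value, int):
        return AttributeKeyValue(key, ValueType.INT64, int64_value=value)
    if isinstance(value, float):
        return AttributeKeyValue(key, ValueType.DOUBLE, double_value=value)
    if isinstance(value, str):
        return AttributeKeyValue(key, ValueType.STRING, string_value=value)
    raise TypeError(f"unsupported attribute value type: {type(value).__name__}")


def attribute_value(attr: AttributeKeyValue) -> Union[str, bool, int, float]:
    """Read back the value from the field that `type` points at."""
    if attr.type == ValueType.STRING:
        return attr.string_value
    if attr.type == ValueType.BOOL:
        return attr.bool_value
    if attr.type == ValueType.INT64:
        return attr.int64_value
    return attr.double_value


# Attribute examples taken from the Span.attributes comment.
attrs = [
    make_attribute("/http/server_latency", 300),
    make_attribute("abc.com/myattribute", True),
    make_attribute("abc.com/score", 10.239),
]
for attr in attrs:
    print(attr.key, attr.type.name, attribute_value(attr))
```

The cost of this layout is that nothing in the schema prevents two value fields from being set at once, so readers must always dispatch on `type` rather than on which field looks non-empty.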

## Trade-offs and mitigations

Timestamps were changed from google.protobuf.Timestamp to an int64 representation in Unix epoch nanoseconds. This change reduces type safety, but benchmarks show a 15-20% gain in encoding/decoding CPU speed for small spans. This is the right trade-off to make because encoding/decoding CPU consumption tends to dominate many workloads (particularly in OpenTelemetry Service).
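To make the two representations concrete: `google.protobuf.Timestamp` carries a `seconds` field and a `nanos` field, while the fields in this specification carry a single nanosecond count. The Python sketch below converts between them; the function names are hypothetical and chosen for this example.

```python
from typing import Tuple

NANOS_PER_SECOND = 1_000_000_000


def to_unix_nano(seconds: int, nanos: int) -> int:
    """Collapse a (seconds, nanos) pair into a single Unix epoch nanosecond count."""
    return seconds * NANOS_PER_SECOND + nanos


def from_unix_nano(unix_nano: int) -> Tuple[int, int]:
    """Split a Unix epoch nanosecond count back into (seconds, nanos)."""
    return divmod(unix_nano, NANOS_PER_SECOND)


start_time_unixnano = to_unix_nano(1_570_000_000, 250_000_000)
print(start_time_unixnano)
print(from_unix_nano(start_time_unixnano))
```

The single integer avoids a nested message per timestamp, which is where the encoding and decoding savings come from, at the price of the value no longer being self-describing.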

## Prior art and alternatives

OpenCensus and Jaeger protocol buffer data schemas were used as the inspiration for this specification. OpenCensus was the starting point; Jaeger provided performance improvement ideas.

## Open questions

A follow-up RFC is required to define the data format for metrics.

One of the original aspirational goals for OTLP was to _"support very fast pass-through mode (when no modifications to the data are needed), fast augmenting or tagging of data and partial inspection of data"_. This particular goal was not met directly (although performance improvements over OpenCensus encoding make OTLP more suitable for these tasks). This goal remains a good direction for future research and improvement.

## Appendix A - Benchmarking

The following shows [benchmarking of encoding/decoding in Go](https://github.com/tigrannajaryan/exp-otelproto/) using various schemas.

Legend:
- OpenCensus - OpenCensus protocol schema.
- OTLP/AttrAsMap - OTLP schema using map for attributes.
- OTLP/AttrAsList - OTLP schema using list of key/values for attributes and with reduced nesting for values.
- OTLP/AttrAsList/TimeWrapped - Same as OTLP/AttrAsList, except using google.protobuf.Timestamp instead of int64 for timestamps.

Suffixes:
- Attributes - a span with 3 attributes.
- TimedEvent - a span with 3 timed events.

```
BenchmarkEncode/OpenCensus/Attributes-8 10 605614915 ns/op
BenchmarkEncode/OpenCensus/TimedEvent-8 10 1025026687 ns/op
BenchmarkEncode/OTLP/AttrAsMap/Attributes-8 10 519539723 ns/op
BenchmarkEncode/OTLP/AttrAsMap/TimedEvent-8 10 841371163 ns/op
BenchmarkEncode/OTLP/AttrAsList/Attributes-8 50 128790429 ns/op
BenchmarkEncode/OTLP/AttrAsList/TimedEvent-8 50 175874878 ns/op
BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/Attributes-8 50 153184772 ns/op
BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8 30 232705272 ns/op
BenchmarkDecode/OpenCensus/Attributes-8 10 644103382 ns/op
BenchmarkDecode/OpenCensus/TimedEvent-8 5 1132059855 ns/op
BenchmarkDecode/OTLP/AttrAsMap/Attributes-8 10 529679038 ns/op
BenchmarkDecode/OTLP/AttrAsMap/TimedEvent-8 10 867364162 ns/op
BenchmarkDecode/OTLP/AttrAsList/Attributes-8 50 228834160 ns/op
BenchmarkDecode/OTLP/AttrAsList/TimedEvent-8 20 321160309 ns/op
BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/Attributes-8 30 277597851 ns/op
BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8 20 443386880 ns/op
```

The benchmark encodes/decodes 1000 batches of 100 spans, each span containing 3 attributes or 3 timed events. The total uncompressed, encoded size of each batch is around 20 KB.

The results show OTLP/AttrAsList is roughly 4.7-5.8 times faster than OpenCensus in encoding and roughly 2.8-3.5 times faster in decoding.

Using google.protobuf.Timestamp instead of int64-encoded unix timestamp results in 1.18-1.32 times slower encoding and 1.21-1.38 times slower decoding (depending on what the span contains).
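The ratios quoted above can be recomputed from the `ns/op` column of the benchmark output. The short Python script below does that arithmetic; the dictionary layout and the `slowdown` helper are choices made for this example, and the numbers are copied from the listing.

```python
# ns/op figures copied from the benchmark output above.
NS_PER_OP = {
    ("Encode", "OpenCensus", "Attributes"): 605614915,
    ("Encode", "OpenCensus", "TimedEvent"): 1025026687,
    ("Encode", "OTLP/AttrAsList", "Attributes"): 128790429,
    ("Encode", "OTLP/AttrAsList", "TimedEvent"): 175874878,
    ("Encode", "OTLP/AttrAsList/TimeWrapped", "Attributes"): 153184772,
    ("Encode", "OTLP/AttrAsList/TimeWrapped", "TimedEvent"): 232705272,
    ("Decode", "OpenCensus", "Attributes"): 644103382,
    ("Decode", "OpenCensus", "TimedEvent"): 1132059855,
    ("Decode", "OTLP/AttrAsList", "Attributes"): 228834160,
    ("Decode", "OTLP/AttrAsList", "TimedEvent"): 321160309,
    ("Decode", "OTLP/AttrAsList/TimeWrapped", "Attributes"): 277597851,
    ("Decode", "OTLP/AttrAsList/TimeWrapped", "TimedEvent"): 443386880,
}


def slowdown(op: str, slower: str, faster: str, payload: str) -> float:
    """How many times slower `slower` is than `faster` for one benchmark case."""
    return NS_PER_OP[(op, slower, payload)] / NS_PER_OP[(op, faster, payload)]


for op in ("Encode", "Decode"):
    for payload in ("Attributes", "TimedEvent"):
        vs_opencensus = slowdown(op, "OpenCensus", "OTLP/AttrAsList", payload)
        vs_wrapped = slowdown(op, "OTLP/AttrAsList/TimeWrapped", "OTLP/AttrAsList", payload)
        print(f"{op}/{payload}: OpenCensus {vs_opencensus:.2f}x, TimeWrapped {vs_wrapped:.2f}x")
```

Each ratio compares a single pair of runs, so small differences in the second decimal place should not be over-interpreted.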