From f913f9922a4f4a9d015042fe06c8f91347da0f9c Mon Sep 17 00:00:00 2001
From: Tigran Najaryan <tigran@omnition.io>
Date: Fri, 11 Oct 2019 10:07:48 -0400
Subject: [PATCH] Add OTLP Trace Data Format specification

This is a continuation of OTLP RFC proposal https://github.com/open-telemetry/oteps/pull/35

This change defines the data format used by Span and Resource messages.

The data format is the result of research into prior art (primarily OpenCensus and Jaeger),
as well as experimentation and benchmarking done as part of the OTLP RFC proposal.

Go benchmark source code is available at https://github.com/tigrannajaryan/exp-otelproto
(use the `make benchmark-encoding` target). Benchmarking shows that, depending on the payload
composition, this data format is about 4.7x-5.8x faster in encoding and 2.8x-3.5x faster in
decoding equivalent data than the OpenCensus data format (all benchmarks in Go).

Notable differences from OpenCensus:

- Attribute key/value pairs are represented as a list rather than as a map.
  This results in significant performance gains and at the same time changes
  the semantics of attributes, because it is now possible to have multiple attributes
  with the same key. This is also in line with Jaeger's tags representation.

- Removed unnecessary wrappers such as google.protobuf.Timestamp which resulted in
  significant performance improvements for certain payload compositions (e.g. lots of
  TimedEvents).

- Resource labels use the same data type as Span attributes, which allows labels
  of other data types (OpenCensus only allowed strings).
---
 text/0000-otlp-trace-data-format.md | 361 ++++++++++++++++++++++++++++
 1 file changed, 361 insertions(+)
 create mode 100644 text/0000-otlp-trace-data-format.md

diff --git a/text/0000-otlp-trace-data-format.md b/text/0000-otlp-trace-data-format.md
new file mode 100644
index 000000000..12fbf179b
--- /dev/null
+++ b/text/0000-otlp-trace-data-format.md
@@ -0,0 +1,361 @@
+# OTLP Trace Data Format
+
+_Author: Tigran Najaryan, Splunk_
+
+**Status:** `proposed`
+
+The OTLP Trace Data Format specification describes the structure of the trace data transported by the OpenTelemetry Protocol (RFC0035).
+
+## Motivation
+
+This document is a continuation of OpenTelemetry Protocol RFC0035 and is a necessary part of the OTLP specification.
+
+## Explanation
+
+The OTLP Trace Data Format is primarily inherited from the OpenCensus protocol. Several changes are introduced with the goal of more efficient serialization. Notable differences from the OpenCensus protocol are:
+
+1. Removed `Node` as a concept.
+2. Extended `Resource` to better describe the source of the telemetry data.
+3. Replaced attribute maps with lists of key/value pairs.
+4. Eliminated unnecessary additional nesting in various values.
+
+Changes 1-2 are conceptual, changes 3-4 improve performance.
+
+## Internal details
+
+This section specifies the data format in Protocol Buffers.
+
+### Resource
+
+```
+// Resource information. This describes the source of telemetry data.
+message Resource {
+  // Set of labels that describe the resource. See OpenTelemetry specification
+  // semantic conventions for standardized label names:
+  // https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/data-semantic-conventions.md
+
+  repeated AttributeKeyValue labels = 3;
+  int32 dropped_labels_count = 11;
+}
+```
+
+### Span
+
+```
+// A span represents a single operation within a trace. Spans can be
+// nested to form a trace tree. Spans may also be linked to other spans
+// from the same or a different trace, forming graphs. Often, a trace
+// contains a root span that describes the end-to-end latency, and one
+// or more subspans for its sub-operations. A trace can also contain
+// multiple root spans, or none at all. Spans do not need to be
+// contiguous - there may be gaps or overlaps between spans in a trace.
+//
+// The next id is 16.
+message Span {
+  // A unique identifier for a trace. All spans from the same trace share
+  // the same `trace_id`. The ID is a 16-byte array. An ID with all zeroes
+  // is considered invalid.
+  //
+  // This field is semantically required. The receiver should generate a new
+  // random trace_id if an empty or invalid trace_id is received.
+  //
+  // This field is required.
+  bytes trace_id = 1;
+
+  // A unique identifier for a span within a trace, assigned when the span
+  // is created. The ID is an 8-byte array. An ID with all zeroes is considered
+  // invalid.
+  //
+  // This field is semantically required. The receiver should generate a new
+  // random span_id if an empty or invalid span_id is received.
+  //
+  // This field is required.
+  bytes span_id = 2;
+
+  // This field conveys information about the request's position in multiple distributed tracing graphs.
+  // It is a list of Tracestate.Entry with a maximum of 32 members in the list.
+  //
+  // See https://github.com/w3c/distributed-tracing for more details about this field.
+  message Tracestate {
+    message Entry {
+      // The key must begin with a lowercase letter, and can only contain
+      // lowercase letters 'a'-'z', digits '0'-'9', underscores '_', dashes
+      // '-', asterisks '*', and forward slashes '/'.
+      string key = 1;
+
+      // The value is an opaque string of up to 256 printable ASCII (RFC0020)
+      // characters (i.e., the range 0x20 to 0x7E) except ',' and '='.
+      // Note that this also excludes tabs, newlines, carriage returns, etc.
+      string value = 2;
+    }
+
+    // A list of entries that represent the Tracestate.
+    repeated Entry entries = 1;
+  }
+
+  // The Tracestate on the span.
+  Tracestate tracestate = 3;
+
+  // The `span_id` of this span's parent span. If this is a root span, then this
+  // field must be empty. The ID is an 8-byte array.
+  bytes parent_span_id = 4;
+
+  // An optional resource that is associated with this span. If not set, this span
+  // should be part of a ResourceSpan that does include the resource information, unless resource
+  // information is unknown.
+  Resource resource = 5;
+
+  // A description of the span's operation.
+  //
+  // For example, the name can be a qualified method name or a file name
+  // and a line number where the operation is called. A best practice is to use
+  // the same display name at the same call point in an application.
+  // This makes it easier to correlate spans in different traces.
+  //
+  // This field is semantically required to be set to a non-empty string.
+  // When a null or empty string is received, the receiver may use the string
+  // "name" as a replacement. The receiver may also implement smarter
+  // algorithms to fix an empty span name.
+  //
+  // This field is required.
+  string name = 6;
+
+  // Type of span. Can be used to specify additional relationships between spans
+  // in addition to a parent/child relationship.
+  enum SpanKind {
+    // Unspecified. Do NOT use as default.
+    // Implementations MAY assume SpanKind to be INTERNAL when receiving UNSPECIFIED.
+    SPAN_KIND_UNSPECIFIED = 0;
+
+    // Indicates that the span is used internally. Default value.
+    INTERNAL = 1;
+
+    // Indicates that the span covers server-side handling of an RPC or other
+    // remote network request.
+    SERVER = 2;
+
+    // Indicates that the span covers the client-side wrapper around an RPC or
+    // other remote request.
+    CLIENT = 3;
+
+    // Indicates that the span describes a producer sending a message to a broker.
+    // Unlike client and server, there is no direct critical path latency relationship
+    // between producer and consumer spans.
+    PRODUCER = 4;
+
+    // Indicates that the span describes a consumer receiving a message from a broker.
+    // Unlike client and server, there is no direct critical path latency relationship
+    // between producer and consumer spans.
+    CONSUMER = 5;
+  }
+
+  // Distinguishes between spans generated in a particular context. For example,
+  // two spans with the same name may be distinguished using `CLIENT` (caller)
+  // and `SERVER` (callee) to identify queueing latency associated with the span.
+  SpanKind kind = 7;
+
+  // The start time of the span. On the client side, this is the time kept by
+  // the local machine where the span execution starts. On the server side, this
+  // is the time when the server's application handler starts running.
+  //
+  // This field is semantically required. When it is not set on receive, the
+  // receiver should set it to the value of the end_time field if that was
+  // set, or to the current time if neither was set. It is important to
+  // keep end_time >= start_time for consistency.
+  //
+  // This field is required.
+  int64 start_time_unixnano = 8;
+
+  // The end time of the span. On the client side, this is the time kept by
+  // the local machine where the span execution ends. On the server side, this
+  // is the time when the server application handler stops running.
+  //
+  // This field is semantically required. When it is not set on receive, the
+  // receiver should set it to the start_time value. It is important to
+  // keep end_time >= start_time for consistency.
+  //
+  // This field is required.
+  int64 end_time_unixnano = 9;
+
+  // The set of attributes. The value can be a string, an integer, a double,
+  // a Boolean (`true` or `false`) or a binary value. Note, global attributes like
+  // server name can be set as tags using the resource API. Examples of attributes:
+  //
+  //     "/http/user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
+  //     "/http/server_latency": 300
+  //     "abc.com/myattribute": true
+  //     "abc.com/score": 10.239
+  repeated AttributeKeyValue attributes = 10;
+
+  // The number of attributes that were discarded. Attributes can be discarded
+  // because their keys are too long or because there are too many attributes.
+  // If this value is 0, then no attributes were dropped.
+  int32 dropped_attributes_count = 11;
+
+  // A time-stamped event in the Span.
+  message TimedEvent {
+    // The time the event occurred.
+    int64 time_unixnano = 1;
+
+    // A user-supplied name describing the event.
+    string name = 2;
+
+    // A set of attributes on the event.
+    repeated AttributeKeyValue attributes = 3;
+
+    int32 dropped_attributes_count = 4;
+  }
+
+  // A collection of `TimedEvent`s. A `TimedEvent` is a time-stamped annotation
+  // on the span, consisting of either user-supplied key-value pairs, or
+  // details of a message sent/received between Spans.
+  message TimedEvents {
+    // A collection of `TimedEvent`s.
+    repeated TimedEvent timed_event = 1;
+
+    // The number of dropped timed events. If the value is 0, then no events were dropped.
+    int32 dropped_timed_events_count = 2;
+  }
+
+  // The included timed events.
+  TimedEvents timed_events = 12;
+
+  // A pointer from the current span to another span in the same trace or in a
+  // different trace. For example, this can be used in batching operations,
+  // where a single batch handler processes multiple requests from different
+  // traces or when the handler receives a request from a different project.
+  message Link {
+    // A unique identifier of a trace that this linked span is part of. The ID is a
+    // 16-byte array.
+    bytes trace_id = 1;
+
+    // A unique identifier for the linked span. The ID is an 8-byte array.
+    bytes span_id = 2;
+
+    // The Tracestate associated with the link.
+    Tracestate tracestate = 3;
+
+    // A set of attributes on the link.
+    repeated AttributeKeyValue attributes = 4;
+
+    int32 dropped_attributes_count = 5;
+  }
+
+  // A collection of links, which are references from this span to a span
+  // in the same or different trace.
+  message Links {
+    // A collection of links.
+    repeated Link link = 1;
+
+    // The number of dropped links after the maximum size was enforced. If
+    // this value is 0, then no links were dropped.
+    int32 dropped_links_count = 2;
+  }
+
+  // The included links.
+  Links links = 13;
+
+  // An optional final status for this span. Semantically, when Status
+  // is not set, it means the span ended without errors and receivers should
+  // assume Status.Ok (code = 0).
+  Status status = 14;
+
+  // An optional number of child spans that were generated while this span
+  // was active. If set, allows an implementation to detect missing child spans.
+  google.protobuf.UInt32Value child_span_count = 15;
+}
+
+// The `Status` type defines a logical error model that is suitable for different
+// programming environments, including REST APIs and RPC APIs. This proto's fields
+// are a subset of those of
+// [google.rpc.Status](https://github.com/googleapis/googleapis/blob/master/google/rpc/status.proto),
+// which is used by [gRPC](https://github.com/grpc).
+message Status {
+  // The status code. This is an optional field. It is safe to assume 0 (OK)
+  // when not set.
+  int32 code = 1;
+
+  // A developer-facing error message, which should be in English.
+  string message = 2;
+}
+```
+
+### AttributeKeyValue
+
+```
+message AttributeKeyValue {
+  enum ValueType {
+    STRING  = 0;
+    BOOL    = 1;
+    INT64   = 2;
+    FLOAT64 = 3;
+    BINARY  = 4;
+  };
+
+  string key = 1;
+  // The type of the value.
+  ValueType type = 2;
+  // A string up to 256 bytes long.
+  string string_value = 3;
+  // A 64-bit signed integer.
+  int64 int_value = 4;
+  // A Boolean value represented by `true` or `false`.
+  bool bool_value = 5;
+  // A double value.
+  double double_value = 6;
+  // A binary value of bytes.
+  bytes binary_value = 7;
+}
+
+```
+
+## Trade-offs and mitigations
+
+Timestamps were changed from google.protobuf.Timestamp to an int64 representation in Unix epoch nanoseconds. This change reduces type safety, but benchmarks show that for small spans there is a 15-20% encoding/decoding CPU speed gain. This is the right trade-off to make because encoding/decoding CPU consumption tends to dominate many workloads (particularly in OpenTelemetry Service).
+
+## Prior art and alternatives
+
+OpenCensus and Jaeger protocol buffer data schemas were used as the inspiration for this specification. OpenCensus was the starting point; Jaeger provided performance improvement ideas.
+
+## Open questions
+
+A follow-up RFC is required to define the data format for metrics.
+
+## Appendix A - Benchmarking
+
+The following shows [benchmarking of encoding/decoding in Go](https://github.com/tigrannajaryan/exp-otelproto/) using various schemas.
+
+Legend:
+- OpenCensus - OpenCensus protocol schema.
+- OTLP/AttrAsMap - OTLP schema using a map for attributes.
+- OTLP/AttrAsList - OTLP schema using a list of key/value pairs for attributes and with reduced nesting for values.
+- OTLP/AttrAsList/TimeWrapped - Same as OTLP/AttrAsList, except using google.protobuf.Timestamp instead of int64 for timestamps.
+
+Suffixes:
+- Attributes - a span with 3 attributes.
+- TimedEvent - a span with 3 timed events.
+
+```
+BenchmarkEncode/OpenCensus/Attributes-8         	      10	 605614915 ns/op
+BenchmarkEncode/OpenCensus/TimedEvent-8         	      10	1025026687 ns/op
+BenchmarkEncode/OTLP/AttrAsMap/Attributes-8     	      10	 519539723 ns/op
+BenchmarkEncode/OTLP/AttrAsMap/TimedEvent-8     	      10	 841371163 ns/op
+BenchmarkEncode/OTLP/AttrAsList/Attributes-8    	      50	 128790429 ns/op
+BenchmarkEncode/OTLP/AttrAsList/TimedEvent-8    	      50	 175874878 ns/op
+BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/Attributes-8         	      50	 153184772 ns/op
+BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8         	      30	 232705272 ns/op
+BenchmarkDecode/OpenCensus/Attributes-8                          	      10	 644103382 ns/op
+BenchmarkDecode/OpenCensus/TimedEvent-8                          	       5	1132059855 ns/op
+BenchmarkDecode/OTLP/AttrAsMap/Attributes-8                      	      10	 529679038 ns/op
+BenchmarkDecode/OTLP/AttrAsMap/TimedEvent-8                      	      10	 867364162 ns/op
+BenchmarkDecode/OTLP/AttrAsList/Attributes-8                     	      50	 228834160 ns/op
+BenchmarkDecode/OTLP/AttrAsList/TimedEvent-8                     	      20	 321160309 ns/op
+BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/Attributes-8         	      30	 277597851 ns/op
+BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8         	      20	 443386880 ns/op
+```
+
+The benchmark encodes/decodes 1000 batches of 100 spans, each span containing 3 attributes or 3 timed events. The total uncompressed, encoded size of each batch is around 20 KB.
+
+The results show OTLP/AttrAsList is about 4.7-5.8 times faster than OpenCensus in encoding and about 2.8-3.5 times faster in decoding.
+
+Using google.protobuf.Timestamp instead of an int64-encoded Unix timestamp makes encoding 1.18-1.32 times slower and decoding 1.21-1.38 times slower (depending on what the span contains).