# Address #337 #344

Closed - wants to merge 3 commits
**spec/20-http_header_format.md** (16 additions, 6 deletions)
@@ -102,14 +102,24 @@ trace-flags = 2HEXDIGLC ; 8 bit flags. Currently, only one bit is used. S

This is the ID of the whole trace forest and is used to uniquely identify a [distributed trace](https://w3c.github.io/trace-context/#dfn-distributed-traces) through a system. It is represented as a 16-byte array, for example, `4bf92f3577b34da6a3ce929d0e0e4736`. All bytes as zero (`00000000000000000000000000000000`) is considered an invalid value.


A vendor SHOULD generate globally unique values for `trace-id`. Many unique identification generation algorithms create IDs where one part of the value is constant (often time- or host-based), and the other part is a randomly generated value. Because tracing systems may make sampling decisions based on the value of `trace-id`, for increased interoperability vendors MUST keep the random part of `trace-id` on the left side.


When a system operates with a `trace-id` that is shorter than 16 bytes, it SHOULD fill in the extra bytes with random values rather than zeroes. Let's say the system works with an 8-byte `trace-id` like `a3ce929d0e0e4736`. Instead of setting the `trace-id` value to `0000000000000000a3ce929d0e0e4736`, it SHOULD generate a value like `4bf92f3577b34da6a3ce929d0e0e4736`, where `4bf92f3577b34da6` is a random value or a function of time and host value.
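
A minimal sketch of the fill-in described above, assuming a Go implementation; `expandTraceID`, the choice of an 8-byte native ID, and the error handling are illustrative and not part of the specification:

```go
// Sketch only: widening an 8-byte native trace ID to the 16-byte trace-id
// required by traceparent, with the extra (leftmost) bytes filled randomly
// rather than with zeroes.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// expandTraceID keeps the native ID on the right and fills the left half with
// random bytes, so samplers that weigh the leftmost bytes still see entropy.
func expandTraceID(native [8]byte) [16]byte {
	var full [16]byte
	if _, err := rand.Read(full[:8]); err != nil { // random left half
		panic(err) // a real implementation would handle this error
	}
	copy(full[8:], native[:]) // native 8-byte ID kept on the right
	return full
}

func main() {
	native := [8]byte{0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36} // a3ce929d0e0e4736
	full := expandTraceID(native)
	fmt.Println(hex.EncodeToString(full[:])) // e.g. 4bf92f3577b34da6a3ce929d0e0e4736 (left half varies)
}
```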


**Note**: Even though a system may operate with a shorter `trace-id` for [distributed trace](https://w3c.github.io/trace-context/#dfn-distributed-traces) reporting, the full `trace-id` MUST be propagated to conform to the specification.
When a system operates with a `trace-id` that is shorter than 16 bytes, on new
`trace-id` generation it SHOULD fill in the extra bytes with random values
rather than zeroes. Let's say the system works with an 8-byte `trace-id` like
`23ce929d0e0e4736`. Instead of setting the `trace-id` value to
`23ce929d0e0e47360000000000000000` (note that the random part is kept on the
left side, as mentioned one paragraph above), it SHOULD generate a value like
`23ce929d0e0e47364bf92f3577b34da6`, where `4bf92f3577b34da6` is a random value
or a function of time and host value. Note that on receiving a `trace-id` which
is longer than what the system operates with, even though the `trace-id` may be
recorded with the shorter id, the entire `trace-id` MUST be propagated to the
downstream components. In situations when it is absolutely impossible to
propagate the entire `trace-id` to the downstream components, and only a subset
of the original `trace-id` will be propagated, the system SHOULD fill up the
extra bytes with zeroes. This MAY be used as an indication for the downstream
service that special logic can be applied to correlate the [distributed
trace](https://w3c.github.io/trace-context/#dfn-distributed-traces).

**Review thread:**

> As mentioned by others, this example is backwards and doesn't match the comment in parentheses.

**Member Author:**

> random part is kept to the left. Why is it not matching?

**@nicmunroe** (Oct 28, 2019):

> I think the big problem here is the words "random part" can be interpreted multiple ways. The text says that the "random part is kept on the left" but I think it means to say "the original 8-byte trace ID is kept on the left". Yes/no? This section is discussing backfilling with random values vs. zeros, so when you say "random part is kept on the left" I'm reading that as the random backfilling on the left. But you're using the original 8-byte trace ID on the left, not random-backfill values, and the right side is all zeros, which nobody here is arguing should be done.
>
> And looking at it from the zeros point of view: why is the `0000000000000000` on the right here? The text says "Instead of setting trace-id value to `23ce929d0e0e47360000000000000000`". Who would ever set it to `23ce929d0e0e47360000000000000000`? If it was zero-backfilled wouldn't it be `000000000000000023ce929d0e0e4736`? i.e. the backfilled zeros would be on the left.

If the `trace-id` value is invalid (for example if it contains non-allowed characters or all zeros), vendors MUST ignore the `traceparent`.
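
For illustration, a minimal validity check along the lines of the rule above might look as follows (Go; `validTraceID` is a hypothetical helper, and the lowercase-hex form is assumed from the header's ABNF):

```go
// Sketch only: rejects a trace-id that is not 32 lowercase hex characters or
// that is all zeros; on rejection the whole traceparent header is ignored.
package sketch

import "regexp"

var traceIDRe = regexp.MustCompile(`^[0-9a-f]{32}$`)

// validTraceID reports whether the trace-id field may be used.
func validTraceID(s string) bool {
	return traceIDRe.MatchString(s) && s != "00000000000000000000000000000000"
}
```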

**spec/21-http_header_format_rationale.md** (97 additions, 0 deletions)
@@ -16,6 +16,103 @@ Making `trace-flags` optional doesn't save a lot, but makes specification more c

We were using the term `span-id` in the `traceparent`, but not all tracing systems are built around a span model; for example, X-Trace, Canopy, and SolarWinds are built around an event model, which is considered more expressive than the span model. There is nothing in the spec that actually requires the model to be span-based, and passing the ID of the happened-before "thing" should work for both types of trace models. We considered the names `call-id` and `request-id`. However, out of all the replacements, `parent-id` is probably the best name. First, it matches the header name. Second, it indicates a difference between caller and callee. Discussing AMQP, we realized that the `message-id` header defined by AMQP refers to an individual message and is semantically not the same as `traceparent`. A message id can be used to dedup messages on the server, while `traceparent` only defines the source this message came from.

## Trace ID size

In high-load apps, 64 bits are not enough to guarantee uniqueness of `trace-id`
over a typical period of time - say, 72 hours. (Roughly speaking, by the
birthday approximation, a system generating tens of billions of trace IDs over
such a window should expect several 64-bit collisions.) That said, a 128-bit
`trace-id` may provide excessive randomness for smaller apps. However, in the
modern world many apps use cloud services and SaaS components that may be
shared by numerous smaller apps. So if those apps use a smaller `trace-id`,
cloud services may not be able to correlate incoming requests from those apps
to the proper distributed trace inside the cloud component.

Thus, for improved interoperability, this specification defines `trace-id` as a
16-byte (128-bit) array.

## Trace ID and interoperability with 64bit systems

There are systems today using 64-bit `trace-id`s. It is not always easy for
these systems to switch to a longer `trace-id` to conform to this
specification's requirement. The cost of this switch can be prohibitively
expensive, both in backend capacity and indexes and in in-process propagation
limits.

When addressing interoperability with these systems, the following
requirements are taken into consideration:

1. The main objective of the specification is to promote interoperability of
   various vendors and platforms.
2. The specification needs to suggest best practices that will improve
   interoperability.
3. There must be a way forward to implement the Trace Context protocol in
   systems that don't support long trace identifiers.

**Review thread:**

> If these points are true, then shouldn't the recommendation be to backfill with zeros, not another random 64 bits? Wouldn't that promote better interoperability with brown field ecosystems that are currently on 64 bit, with the way forward being that those systems would organically move to 128 bit as they can? Currently, by recommending backfilling with random 64 bits, the spec is encouraging breakage (and it's just surprising/confusing/unexpected to see backfilling with randomness - I would expect something deterministic, and the brown field has chosen the deterministic path of backfilling with zeros).

**Member Author:**

> Further in the doc there is a note that backfilling with random numbers will encourage you to test that those numbers will be propagated. And a 64-bit system will not break the 128-bit system passing a trace through it. The spec is encouraging interoperability of different systems, which don't know about each other.

**Reply**, quoting "encourage you to test":

> This doesn't seem like a good enough reason to break compatibility with some 64-bit systems. The spec can encourage you to test by stating so in the spec. I'm not sure I can see many spec implementors who would only test because the random numbers encouraged testing, and would fail to test if it was simply stated in the spec.

### How 64bit systems may switch to Trace Context

#### Absolute minimum

The absolute minimum requirement from the specification's perspective is to
receive and send a valid `traceparent` header. Systems operating with a shorter
`trace-id` may use only a subset of the `trace-id` bytes to read and set
`trace-id`. This behavior will break interoperability with vendors and
platforms operating with the longer identifiers. If only a subset of `trace-id`
bytes is read on incoming requests and sent with the outgoing call, then for
systems operating with longer identifiers the incoming and outgoing `trace-id`
will not match. Those systems will identify this situation as a restarted
trace.

Note that vendors and platforms may implement special logic to interoperate
with systems like this.
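
To make the mismatch concrete, here is a hypothetical sketch (Go; the names and the zero-fill choice are illustrative, not prescribed by the specification) of the absolute-minimum behavior described above: only 8 bytes of the incoming `trace-id` survive, so the outgoing value cannot equal the caller's.

```go
// Sketch only: a tracer that keeps just 8 bytes of the incoming trace-id.
// Whatever it later puts in the other 8 bytes of the outgoing header (zeros or
// random fill), the 16-byte value no longer matches what the caller sent, so
// systems with longer identifiers see a restarted trace.
package sketch

import "encoding/hex"

// readTraceID keeps only the rightmost 8 bytes of the incoming 16-byte trace-id.
func readTraceID(incoming [16]byte) [8]byte {
	var short [8]byte
	copy(short[:], incoming[8:])
	return short
}

// writeTraceID rebuilds an outgoing trace-id from the 8 bytes that were kept,
// zero-filling the rest (one possible choice; the mismatch exists either way).
func writeTraceID(short [8]byte) string {
	var out [16]byte
	copy(out[8:], short[:])
	return hex.EncodeToString(out[:])
}
```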

The specification uses "SHOULD" language when asking to fill up the extra bytes
with random numbers instead of zeros. Typically, tracing systems that will not
propagate the extra bytes of an incoming `trace-id` will not follow this
recommendation and will fill up the extra bytes with zeros.

**Review thread:**

> Maybe, but an upstream system that follows the SHOULD advice to fill up the left side with random 64 bits will break a downstream that is using brown field 64 bit tracers. Wouldn't it be better to have the SHOULD advice recommend filling with zeros to maximize default interoperability?

**Member Author:**

> Systems that fail to propagate extra bytes of trace id are already breaking a couple of MUSTs of the spec =). This comment is simply saying that IF a system uses 64 bits, has no way to propagate extra bits unchanged, but still wants to follow the spec, while pretty confident that all components are controlled by this system - zero backfill is what this system will implement.

#### Linking to the longer trace-id

As a minor step to improve interoperability between tracing systems, a system
that operates with shorter identifiers may record a longer incoming `trace-id`
as a property of the telemetry item representing the incoming request.

#### Propagating extra bytes of trace-id

Preserving the `trace-id` unchanged is a major improvement in interoperability
of tracing systems using different numbers of `trace-id` bytes. It is typical
that tracing systems can propagate extra information from an incoming request
to outgoing calls, even when recording these extra bytes to the tracing
system's backend is not possible.

If a tracing system can propagate these extra bytes, then it MUST do so.
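
A minimal sketch of this capability, assuming a hypothetical Go tracer whose backend stores only 8-byte IDs; the type and method names are illustrative:

```go
// Sketch only: the full incoming trace-id is kept byte-for-byte for outgoing
// propagation, even though the backend records a shorter ID.
package sketch

import "encoding/hex"

// IncomingContext holds the trace-id exactly as received so that outgoing
// calls can forward it unchanged, per the MUST above.
type IncomingContext struct {
	FullTraceID [16]byte // propagated unchanged to downstream calls
}

// RecordedTraceID returns the shorter ID the backend actually stores; using
// the rightmost 8 bytes here is purely an illustration.
func (c IncomingContext) RecordedTraceID() string {
	return hex.EncodeToString(c.FullTraceID[8:])
}

// OutgoingTraceID rebuilds the trace-id field of the outgoing header from the
// untouched incoming bytes, not from the recorded (truncated) ID.
func (c IncomingContext) OutgoingTraceID() string {
	return hex.EncodeToString(c.FullTraceID[:])
}
```

The important property is that the outgoing header is derived from the stored incoming bytes rather than from whatever truncated form the backend records.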

When a tracing system has this capability, the specification suggests that the
extra bytes are filled out with random numbers instead of zeros. This
requirement helps to validate that tracing systems implement propagation from
incoming requests to outgoing calls correctly.

**Review thread:**

> What use is this validation? Isn't it better to have de-facto interoperability by default? Why is this validation more beneficial, and does it outweigh default interoperability with existing brown field systems?

**Member Author:**

> I think the key here is that the spec is addressing interoperability of different tracing systems. So if a 64-bit system receives a 128-bit trace-id, it must do the best effort to propagate the entire 128 bits. The best way to ensure that this is implemented is to ask to fill up the spaces with random numbers.

**Reply:**

> No, actual users are telling you that the best way to ensure that is to backfill with zeros.

#### Propagating tracestate as well

Note that even greater interoperability will be achieved by propagating the
`tracestate` header as well. Dropping this header may break vendor-specific
distributed tracing scenarios, but this behavior conforms to the specification,
and older tracing systems may do it.

As noted previously, these tracing systems must make a best effort to propagate
this header even if it will not be recorded to the tracing system's backend.

## Trace ID randomization "left padding"

The specification explains the "left padding" requirement for random `trace-id`
generation. Tracing systems will implement various algorithms that use the
`trace-id` as a base for a sampling decision. Typically, it will be some
variation of a hash calculation. Those algorithms may give different "weight"
to different bytes of a `trace-id`, so the requirement to keep the randomness
to the left helps interoperability between tracing systems by suggesting which
bytes carry a bigger weight in the hash calculation.

Practically, there will be tracing systems filling the first bytes with zeros
(see the section "How 64bit systems may switch to Trace Context") or not
following this guidance. Tracing systems must account for these violations.
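
For concreteness, here is a sketch of the kind of `trace-id`-based sampler described above (Go; `SampleByTraceID` and its threshold semantics are illustrative, not taken from any particular vendor). Because it weighs only the leftmost 8 bytes, zero-filling those bytes would push every trace into the same sampling bucket:

```go
// Sketch only: a probability sampler driven entirely by the leftmost bytes of
// the trace-id, which is why those bytes need to carry the randomness.
package sketch

import "encoding/binary"

// SampleByTraceID keeps roughly `ratio` of traces by interpreting the leftmost
// 8 bytes of the trace-id as an unsigned integer and comparing it to a cutoff.
func SampleByTraceID(traceID [16]byte, ratio float64) bool {
	left := binary.BigEndian.Uint64(traceID[0:8]) // leftmost bytes carry all the weight
	return float64(left) < ratio*float64(1<<64)   // uniform left bytes keep ~ratio of traces
}
```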

**Review thread:**

> Ahhhh ... is this the main reason for recommending random 64 bits for backfill? I can understand how this would be beneficial for such sampling algorithms, but I would suggest that interoperability-by-default with existing brown field systems is a much better tradeoff.
>
> (Also, how is "typically" determined here? How common is this really? How many major large-userbase tracers have this functionality? How many users actually use it? I'd be surprised if the userbase using this kind of sampling algorithm is (1) really big enough to warrant breaking default interoperability with 64 bit systems, and (2) would mostly use the leftmost bytes such that they would actually be negatively affected by backfilling with zeros.)

**Member:**

> It seems it's the only reason, and it doesn't hold water for me. Someone mentioned elsewhere that some systems may use UUID algorithms for generating trace IDs, and the left-most bytes could be composed of a timestamp - not random at all.
>
> If systems want to avoid RNG by reusing randomness in the trace ID, an easier way to do that is to XOR the left and right 8 bytes of the full 16-byte ID and stop caring about which side of it contains more randomness.

**Member Author**, quoting "Ahhhh ... is this the main reason for recommending":

> We need to keep this rationale file up to date. It really helps, I think.

**Member Author:**

> As for the recommendation - we had this issue in Microsoft with the "bad ordering" of randomness, and I saw a few implementations which only look at 4 bytes to calculate sampling priority. However, I personally don't feel strongly about this clause, exactly because of these reasons - it is hard to enforce randomness.

**Review thread:**

> Breaking compatibility by default for a "may be" seems wrong. How common is this really?

**Review thread:**

> It appears that this is prioritizing a suggestion of interoperability with a much smaller number of systems over the definite breakage with a large volume of brown field systems.
>
> (I'm assuming numbers on hash-based-left-bytes sampling algorithms are small compared to brown field 64 bit systems. I'm open to being proven wrong if anyone has solid numbers - I could just be out of the loop.)

**Member Author:**

> Do you mean that many systems use a trace-ID where the right-most part is random, not the left-most? Or are we still talking about zero backfill? If backfill - why not backfill the other bytes?

**Reply:**

> Zero-backfill.
>
> Because of all the reasons multiple users have stated in this thread. They are telling you that backfilling with zeros will provide better compatibility and interoperability than backfilling with randomness.

**Review thread:**

> I would much rather see the specification reverse expectations here. IMO systems SHOULD backfill with zeros, and there could be a note that since it's only a SHOULD and not a requirement, systems could choose to do the random 64-bit backfill if they feel they need to due to a hash-based-left-bytes sampling algorithm.
>
> In other words, default to maximum interoperability, and point out where there might be valid use cases for doing something else. As it stands, this feels backwards.
>
> The reality is that implementors will see SHOULD, and the language around 64-bit trace IDs being "violations", and they will blindly backfill with random 64 bits. As a result we'll see breakage by default instead of interoperability by default.

**Member Author:**

> I still don't follow what system will break. And why would the randomness requirement break it? It feels almost like you are suggesting all systems only fill up 64 bits.
>
> In the doc there are three types of systems described. First, one which simply ignores the extra bytes. These systems definitely will be better off by filling up with zeros. And they can - they are already breaking a couple of MUSTs of this spec. Ignoring the second. Third, one that only records 64 bits but respects the others and propagates 128. For these tracing systems - why not ask to fill up the extra bytes with random numbers? What difference does it make? It still will do it when some external 128-bit system calls into it and the extra bytes need to be propagated.

**Reply**, quoting "why not ask to fill up extra bytes with random numbers?":

> Try looking at this from the other direction - why not fill up with zeros? Multiple people are telling you that backfilling with zeros will allow for greater compatibility with 64-bit systems.

## Ordering of keys in `tracestate`

The specification calls for ordering of values in `tracestate`. This requirement allows better interoperability between tracing vendors.