# Address #337 #344

Closed - wants to merge 3 commits
**spec/20-http_header_format.md** (16 additions, 6 deletions)
@@ -102,14 +102,24 @@ trace-flags = 2HEXDIGLC ; 8 bit flags. Currently, only one bit is used. S

This is the ID of the whole trace forest and is used to uniquely identify a [distributed trace](https://w3c.github.io/trace-context/#dfn-distributed-traces) through a system. It is represented as a 16-byte array, for example, `4bf92f3577b34da6a3ce929d0e0e4736`. All bytes as zero (`00000000000000000000000000000000`) is considered an invalid value.


A vendor SHOULD generate globally unique values for `trace-id`. Many unique identification generation algorithms create IDs where one part of the value is constant (often time- or host-based), and the other part is a randomly generated value. Because tracing systems may make sampling decisions based on the value of `trace-id`, for increased interoperability vendors MUST keep the random part of `trace-id` on the left side.


When a system operates with a `trace-id` that is shorter than 16 bytes, it SHOULD fill in the extra bytes with random values rather than zeroes. Let's say the system works with an 8-byte `trace-id` like `a3ce929d0e0e4736`. Instead of setting the `trace-id` value to `0000000000000000a3ce929d0e0e4736`, it SHOULD generate a value like `4bf92f3577b34da6a3ce929d0e0e4736`, where `4bf92f3577b34da6` is a random value or a function of time and host value.
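
A minimal sketch of the fill-in described above, assuming a Go implementation; `expandTraceID`, the choice of an 8-byte native ID, and the error handling are illustrative and not part of the specification:

```go
// Sketch only: widening an 8-byte native trace ID to the 16-byte trace-id
// required by traceparent, with the extra (leftmost) bytes filled randomly
// rather than with zeroes.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// expandTraceID keeps the native ID on the right and fills the left half with
// random bytes, so samplers that weigh the leftmost bytes still see entropy.
func expandTraceID(native [8]byte) [16]byte {
	var full [16]byte
	if _, err := rand.Read(full[:8]); err != nil { // random left half
		panic(err) // a real implementation would handle this error
	}
	copy(full[8:], native[:]) // native 8-byte ID kept on the right
	return full
}

func main() {
	native := [8]byte{0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36} // a3ce929d0e0e4736
	full := expandTraceID(native)
	fmt.Println(hex.EncodeToString(full[:])) // e.g. 4bf92f3577b34da6a3ce929d0e0e4736 (left half varies)
}
```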


**Note**: Even though a system may operate with a shorter `trace-id` for [distributed trace](https://w3c.github.io/trace-context/#dfn-distributed-traces) reporting, the full `trace-id` MUST be propagated to conform to the specification.
When a system operates with a `trace-id` that is shorter than 16 bytes, on new
`trace-id` generation it SHOULD fill in the extra bytes with random values
rather than zeroes. Let's say the system works with an 8-byte `trace-id` like
`23ce929d0e0e4736`. Instead of setting the `trace-id` value to
`23ce929d0e0e47360000000000000000` (note that the random part is kept on the
left side, as mentioned one paragraph above), it SHOULD generate a value like
`23ce929d0e0e47364bf92f3577b34da6`, where `4bf92f3577b34da6` is a random value
or a function of time and host value. Note that on receiving a `trace-id` which
is longer than what the system operates with, even though the `trace-id` may be
recorded with the shorter id, the entire `trace-id` MUST be propagated to the
downstream components. In situations when it is absolutely impossible to
propagate the entire `trace-id` to the downstream components, and only a subset
of the original `trace-id` will be propagated, the system SHOULD fill up the
extra bytes with zeroes. This MAY be used as an indication for the downstream
service that special logic can be applied to correlate the [distributed
trace](https://w3c.github.io/trace-context/#dfn-distributed-traces).

**Review thread:**

> As mentioned by others, this example is backwards and doesn't match the comment in parentheses.

**Member Author:**

> random part is kept to the left. Why is it not matching?

**@nicmunroe** (Oct 28, 2019):

> I think the big problem here is the words "random part" can be interpreted multiple ways. The text says that the "random part is kept on the left" but I think it means to say "the original 8-byte trace ID is kept on the left". Yes/no? This section is discussing backfilling with random values vs. zeros, so when you say "random part is kept on the left" I'm reading that as the random backfilling on the left. But you're using the original 8-byte trace ID on the left, not random-backfill values, and the right side is all zeros, which nobody here is arguing should be done.
>
> And looking at it from the zeros point of view: why is the `0000000000000000` on the right here? The text says "Instead of setting trace-id value to `23ce929d0e0e47360000000000000000`". Who would ever set it to `23ce929d0e0e47360000000000000000`? If it was zero-backfilled wouldn't it be `000000000000000023ce929d0e0e4736`? i.e. the backfilled zeros would be on the left.

If the `trace-id` value is invalid (for example if it contains non-allowed characters or all zeros), vendors MUST ignore the `traceparent`.
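
For illustration, a minimal validity check along the lines of the rule above might look as follows (Go; `validTraceID` is a hypothetical helper, and the lowercase-hex form is assumed from the header's ABNF):

```go
// Sketch only: rejects a trace-id that is not 32 lowercase hex characters or
// that is all zeros; on rejection the whole traceparent header is ignored.
package sketch

import "regexp"

var traceIDRe = regexp.MustCompile(`^[0-9a-f]{32}$`)

// validTraceID reports whether the trace-id field may be used.
func validTraceID(s string) bool {
	return traceIDRe.MatchString(s) && s != "00000000000000000000000000000000"
}
```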

**spec/21-http_header_format_rationale.md** (97 additions, 0 deletions)
@@ -16,6 +16,103 @@ Making `trace-flags` optional doesn't save a lot, but makes specification more c

We were using the term `span-id` in the `traceparent`, but not all tracing systems are built around a span model; for example, X-Trace, Canopy, and SolarWinds are built around an event model, which is considered more expressive than the span model. There is nothing in the spec that actually requires the model to be span-based, and passing the ID of the happened-before "thing" should work for both types of trace models. We considered the names `call-id` and `request-id`. However, out of all the replacements, `parent-id` is probably the best name. First, it matches the header name. Second, it indicates a difference between caller and callee. Discussing AMQP, we realized that the `message-id` header defined by AMQP refers to an individual message and is semantically not the same as `traceparent`. A message id can be used to dedup messages on the server, while `traceparent` only defines the source this message came from.

## Trace ID size

In high-load apps, 64 bits are not enough to guarantee uniqueness of `trace-id`
over a typical period of time - say, 72 hours. (Roughly speaking, by the
birthday approximation, a system generating tens of billions of trace IDs over
such a window should expect several 64-bit collisions.) That said, a 128-bit
`trace-id` may provide excessive randomness for smaller apps. However, in the
modern world many apps use cloud services and SaaS components that may be
shared by numerous smaller apps. So if those apps use a smaller `trace-id`,
cloud services may not be able to correlate incoming requests from those apps
to the proper distributed trace inside the cloud component.

Thus, for improved interoperability, this specification defines `trace-id` as a
16-byte (128-bit) array.

## Trace ID and interoperability with 64bit systems

There are systems today using 64-bit `trace-id`s. It is not always easy for
these systems to switch to a longer `trace-id` to conform to this
specification's requirement. The cost of this switch can be prohibitively
expensive, both in backend capacity and indexes and in in-process propagation
limits.

When addressing interoperability with these systems, the following
requirements are taken into consideration:

1. The main objective of the specification is to promote interoperability of
   various vendors and platforms.
2. The specification needs to suggest best practices that will improve
   interoperability.
3. There must be a way forward to implement the Trace Context protocol in
   systems that don't support long trace identifiers.

**Review thread:**

> If these points are true, then shouldn't the recommendation be to backfill with zeros, not another random 64 bits? Wouldn't that promote better interoperability with brown field ecosystems that are currently on 64 bit, with the way forward being that those systems would organically move to 128 bit as they can? Currently, by recommending backfilling with random 64 bits, the spec is encouraging breakage (and it's just surprising/confusing/unexpected to see backfilling with randomness - I would expect something deterministic, and the brown field has chosen the deterministic path of backfilling with zeros).

**Member Author:**

> Further in the doc there is a note that backfilling with random numbers will encourage you to test that those numbers will be propagated. And a 64-bit system will not break the 128-bit system passing a trace through it. The spec is encouraging interoperability of different systems, which don't know about each other.

**Reply**, quoting "encourage you to test":

> This doesn't seem like a good enough reason to break compatibility with some 64-bit systems. The spec can encourage you to test by stating so in the spec. I'm not sure I can see many spec implementors who would only test because the random numbers encouraged testing, and would fail to test if it was simply stated in the spec.

### How 64bit systems may switch to Trace Context

#### Absolute minimum

The absolute minimum requirement from the specification's perspective is to
receive and send a valid `traceparent` header. Systems operating with a shorter
`trace-id` may use only a subset of the `trace-id` bytes to read and set
`trace-id`. This behavior will break interoperability with vendors and
platforms operating with the longer identifiers. If only a subset of `trace-id`
bytes is read on incoming requests and sent with the outgoing call, then for
systems operating with longer identifiers the incoming and outgoing `trace-id`
will not match. Those systems will identify this situation as a restarted
trace.

Note that vendors and platforms may implement special logic to interoperate
with systems like this.
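
To make the mismatch concrete, here is a hypothetical sketch (Go; the names and the zero-fill choice are illustrative, not prescribed by the specification) of the absolute-minimum behavior described above: only 8 bytes of the incoming `trace-id` survive, so the outgoing value cannot equal the caller's.

```go
// Sketch only: a tracer that keeps just 8 bytes of the incoming trace-id.
// Whatever it later puts in the other 8 bytes of the outgoing header (zeros or
// random fill), the 16-byte value no longer matches what the caller sent, so
// systems with longer identifiers see a restarted trace.
package sketch

import "encoding/hex"

// readTraceID keeps only the rightmost 8 bytes of the incoming 16-byte trace-id.
func readTraceID(incoming [16]byte) [8]byte {
	var short [8]byte
	copy(short[:], incoming[8:])
	return short
}

// writeTraceID rebuilds an outgoing trace-id from the 8 bytes that were kept,
// zero-filling the rest (one possible choice; the mismatch exists either way).
func writeTraceID(short [8]byte) string {
	var out [16]byte
	copy(out[8:], short[:])
	return hex.EncodeToString(out[:])
}
```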

The specification uses "SHOULD" language when asking to fill up the extra bytes
with random numbers instead of zeros. Typically, tracing systems that will not
propagate the extra bytes of an incoming `trace-id` will not follow this
recommendation and will fill up the extra bytes with zeros.

**Review thread:**

> Maybe, but an upstream system that follows the SHOULD advice to fill up the left side with random 64 bits will break a downstream that is using brown field 64 bit tracers. Wouldn't it be better to have the SHOULD advice recommend filling with zeros to maximize default interoperability?

**Member Author:**

> Systems that fail to propagate extra bytes of trace id are already breaking a couple of MUSTs of the spec =). This comment is simply saying that IF a system uses 64 bits, has no way to propagate extra bits unchanged, but still wants to follow the spec, while pretty confident that all components are controlled by this system - zero backfill is what this system will implement.

#### Linking to the longer trace-id

As a minor step to improve interoperability between tracing systems, a system
that operates with shorter identifiers may record a longer incoming `trace-id`
as a property of the telemetry item representing the incoming request.

#### Propagating extra bytes of trace-id

Preserving the `trace-id` unchanged is a major improvement in interoperability
of tracing systems using different numbers of `trace-id` bytes. It is typical
that tracing systems can propagate extra information from an incoming request
to outgoing calls, even when recording these extra bytes to the tracing
system's backend is not possible.

If a tracing system can propagate these extra bytes, then it MUST do so.
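
A minimal sketch of this capability, assuming a hypothetical Go tracer whose backend stores only 8-byte IDs; the type and method names are illustrative:

```go
// Sketch only: the full incoming trace-id is kept byte-for-byte for outgoing
// propagation, even though the backend records a shorter ID.
package sketch

import "encoding/hex"

// IncomingContext holds the trace-id exactly as received so that outgoing
// calls can forward it unchanged, per the MUST above.
type IncomingContext struct {
	FullTraceID [16]byte // propagated unchanged to downstream calls
}

// RecordedTraceID returns the shorter ID the backend actually stores; using
// the rightmost 8 bytes here is purely an illustration.
func (c IncomingContext) RecordedTraceID() string {
	return hex.EncodeToString(c.FullTraceID[8:])
}

// OutgoingTraceID rebuilds the trace-id field of the outgoing header from the
// untouched incoming bytes, not from the recorded (truncated) ID.
func (c IncomingContext) OutgoingTraceID() string {
	return hex.EncodeToString(c.FullTraceID[:])
}
```

The important property is that the outgoing header is derived from the stored incoming bytes rather than from whatever truncated form the backend records.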

When a tracing system has this capability, the specification suggests that the
extra bytes are filled out with random numbers instead of zeros. This
requirement helps to validate that tracing systems implement propagation from
incoming requests to outgoing calls correctly.

**Review thread:**

> What use is this validation? Isn't it better to have de-facto interoperability by default? Why is this validation more beneficial, and does it outweigh default interoperability with existing brown field systems?

**Member Author:**

> I think the key here is that the spec is addressing interoperability of different tracing systems. So if a 64-bit system receives a 128-bit trace-id, it must do the best effort to propagate the entire 128 bits. The best way to ensure that this is implemented is to ask to fill up the spaces with random numbers.

**Reply:**

> No, actual users are telling you that the best way to ensure that is to backfill with zeros.

#### Propagating tracestate as well

Note that even greater interoperability will be achieved by propagating the
`tracestate` header as well. Dropping this header may break vendor-specific
distributed tracing scenarios, but this behavior conforms to the specification,
and older tracing systems may do it.

As noted previously, these tracing systems must make a best effort to propagate
this header even if it will not be recorded to the tracing system's backend.

## Trace ID randomization "left padding"

The specification explains the "left padding" requirement for random `trace-id`
generation. Tracing systems will implement various algorithms that use the
`trace-id` as a base for a sampling decision. Typically, it will be some
variation of a hash calculation. Those algorithms may give different "weight"
to different bytes of a `trace-id`, so the requirement to keep the randomness
to the left helps interoperability between tracing systems by suggesting which
bytes carry a bigger weight in the hash calculation.

Practically, there will be tracing systems filling the first bytes with zeros
(see the section "How 64bit systems may switch to Trace Context") or not
following this guidance. Tracing systems must account for these violations.
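
For concreteness, here is a sketch of the kind of `trace-id`-based sampler described above (Go; `SampleByTraceID` and its threshold semantics are illustrative, not taken from any particular vendor). Because it weighs only the leftmost 8 bytes, zero-filling those bytes would push every trace into the same sampling bucket:

```go
// Sketch only: a probability sampler driven entirely by the leftmost bytes of
// the trace-id, which is why those bytes need to carry the randomness.
package sketch

import "encoding/binary"

// SampleByTraceID keeps roughly `ratio` of traces by interpreting the leftmost
// 8 bytes of the trace-id as an unsigned integer and comparing it to a cutoff.
func SampleByTraceID(traceID [16]byte, ratio float64) bool {
	left := binary.BigEndian.Uint64(traceID[0:8]) // leftmost bytes carry all the weight
	return float64(left) < ratio*float64(1<<64)   // uniform left bytes keep ~ratio of traces
}
```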

**Review thread:**

> Ahhhh ... is this the main reason for recommending random 64 bits for backfill? I can understand how this would be beneficial for such sampling algorithms, but I would suggest that interoperability-by-default with existing brown field systems is a much better tradeoff.
>
> (Also, how is "typically" determined here? How common is this really? How many major large-userbase tracers have this functionality? How many users actually use it? I'd be surprised if the userbase using this kind of sampling algorithm is (1) really big enough to warrant breaking default interoperability with 64 bit systems, and (2) would mostly use the leftmost bytes such that they would actually be negatively affected by backfilling with zeros.)

**Member:**

> It seems it's the only reason, and it doesn't hold water for me. Someone mentioned elsewhere that some systems may use UUID algorithms for generating trace IDs, and the left-most bytes could be composed of a timestamp - not random at all.
>
> If systems want to avoid RNG by reusing randomness in the trace ID, an easier way to do that is to XOR the left and right 8 bytes of the full 16-byte ID and stop caring about which side of it contains more randomness.

**Member Author**, quoting "Ahhhh ... is this the main reason for recommending":

> We need to keep this rationale file up to date. It really helps, I think.

**Member Author:**

> As for the recommendation - we had this issue in Microsoft with the "bad ordering" of randomness, and I saw a few implementations which only look at 4 bytes to calculate sampling priority. However, I personally don't feel strongly about this clause, exactly because of these reasons - it is hard to enforce randomness.

**Review thread:**

> Breaking compatibility by default for a "may be" seems wrong. How common is this really?

**Review thread:**

> It appears that this is prioritizing a suggestion of interoperability with a much smaller number of systems over the definite breakage with a large volume of brown field systems.
>
> (I'm assuming numbers on hash-based-left-bytes sampling algorithms are small compared to brown field 64 bit systems. I'm open to being proven wrong if anyone has solid numbers - I could just be out of the loop.)

**Member Author:**

> Do you mean that many systems use a trace-ID where the right-most part is random, not the left-most? Or are we still talking about zero backfill? If backfill - why not backfill the other bytes?

**Reply:**

> Zero-backfill.
>
> Because of all the reasons multiple users have stated in this thread. They are telling you that backfilling with zeros will provide better compatibility and interoperability than backfilling with randomness.

**Review thread:**

> I would much rather see the specification reverse expectations here. IMO systems SHOULD backfill with zeros, and there could be a note that since it's only a SHOULD and not a requirement, systems could choose to do the random 64-bit backfill if they feel they need to due to a hash-based-left-bytes sampling algorithm.
>
> In other words, default to maximum interoperability, and point out where there might be valid use cases for doing something else. As it stands, this feels backwards.
>
> The reality is that implementors will see SHOULD, and the language around 64-bit trace IDs being "violations", and they will blindly backfill with random 64 bits. As a result we'll see breakage by default instead of interoperability by default.

**Member Author:**

> I still don't follow what system will break. And why would the randomness requirement break it? It feels almost like you are suggesting all systems only fill up 64 bits.
>
> In the doc there are three types of systems described. First, one which simply ignores the extra bytes. These systems definitely will be better off by filling up with zeros. And they can - they are already breaking a couple of MUSTs of this spec. Ignoring the second. Third, one that only records 64 bits but respects the others and propagates 128. For these tracing systems - why not ask to fill up the extra bytes with random numbers? What difference does it make? It still will do it when some external 128-bit system calls into it and the extra bytes need to be propagated.

**Reply**, quoting "why not ask to fill up extra bytes with random numbers?":

> Try looking at this from the other direction - why not fill up with zeros? Multiple people are telling you that backfilling with zeros will allow for greater compatibility with 64-bit systems.

## Ordering of keys in `tracestate`

The specification calls for ordering of values in `tracestate`. This requirement allows better interoperability between tracing vendors.