This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Non-power-of-two consistent tail probability sampling in TraceState #226

Closed
117 changes: 99 additions & 18 deletions text/trace/0226-sampling-random-traceids.md

## Motivation

The existing, experimental [specification for probability sampling
using
TraceState](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md)
supporting Span-to-Metrics pipelines is limited to powers-of-two
probabilities and is designed to work without making assumptions about
TraceID randomness. The existing mechanism could only achieve
non-power-of-two sampling using interpolation between powers of two,
which was only possible at head sampling time. It could not be used
with non-power-of-two sampling probabilities for span sampling in the
rest of the collection path. This proposal aims to address these two
limitations for a couple of reasons:

1. Certain customers want support for non-power-of-two probabilities
(e.g., a 10% or 75% sampling rate), and it should be possible to
support these cleanly regardless of where the sampling is happening.
2. There is a need for consistent sampling in the collection path
(outside of the head-sampling paths) and using the inherent
randomness in the traceID is a less-expensive solution than
referencing a custom "r-value" from the tracestate in every span.

In this proposal, we will cover how this new mechanism can be used in
both head-based sampling and different forms of tail-based sampling.

The term "Tail sampling" is in common use to describe _various_ forms
of sampling that take place after a span starts. The term "Tail"
distinguishes these techniques from head sampling; however, the term
is only broadly descriptive.

Head sampling requires the use of TraceState to propagate context
about sampling decisions from parent spans to child spans. With sampling
information included in the TraceState, spans can be labeled with their
effective adjusted count, making it possible to count spans as they
arrive at their destination in real time, meaning before assembling
This proposal makes use of the [draft-standard W3C tracecontext
`random`
flag](https://w3c.github.io/trace-context/#random-trace-id-flag),
which is an indicator that 56 bits of true randomness are available
for probability sampler decisions. This inherently random value can
be used by intermediate span samplers to make _consistent_ sampling
decisions, at lower cost than the earlier approach of looking up the
r-value from the tracestate of each span.

This proposes to create a specification with support for 56-bit
precision consistent Head and Intermediate Span sampling. Because
with equivalent use and interpretation as the (W3C trace-context)
TraceState field. It would be appropriate to name this field
`LogState`.

This proposal makes r-value an optional 56-bit number as opposed
to a required 6-bit number. When the r-value is supplied, it acts as
an alternative source of randomness which allows tail-samplers to
support versions of tracecontext without the `random` bit as well as
more advanced use-cases. For example, independent traces can be
consistently sampled by starting them with identical r-values.

This proposal deprecates the experimental p-value. For existing
stored data, the specification may recommend replacing `p:X` with an
equivalent t-value; for example, `p:2` can be replaced by `t:4` and
`p:20` can be replaced by `t:0x1p-20`.
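The replacement rule above can be sketched as follows. The helper
name (`p_to_t`) and the cutoff for choosing between the integer and
hex-float encodings are illustrative assumptions; the text only gives
the two sample mappings.

```python
def p_to_t(p_value: int) -> str:
    """Suggest a t-value member equivalent to a legacy p-value.

    A p-value of X denotes sampling probability 2**-X, i.e. an
    adjusted count of 2**X.  Small adjusted counts read naturally as
    integers (p:2 -> t:4); larger ones as hexadecimal floating-point
    probabilities (p:20 -> t:0x1p-20).  The cutoff below is
    illustrative, not taken from this proposal.
    """
    if p_value <= 4:
        return f"t:{1 << p_value}"
    return f"t:0x1p-{p_value}"

print(p_to_t(2))   # t:4
print(p_to_t(20))  # t:0x1p-20
```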

## Explanation

This document proposes a new OpenTelemetry-specific tracestate value
called t-value. The t-value encodes either the sampling probability
(a floating-point value) directly or the "adjusted count" of a span
(an integer). The letter "t" is shorthand for "threshold": the
encoded value can be mapped to a threshold that a sampler compares
against a value formed from the rightmost 7 bytes of the TraceID.

The syntax of the r-value changes in this proposal, as it contains 56
bits of information. The recommended syntax is to use 14 hexadecimal
characters (e.g., `r:1a2b3c4d5e6f78`). The specification will
recommend samplers drop invalid r-values, so that existing
implementations of r-value are not mistakenly sampled.
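As a sketch of this validation rule (the helper name is hypothetical;
the proposal specifies only the syntax, and lowercase hex is assumed
from the example above):

```python
import re

# 14 lowercase hexadecimal characters encode exactly 56 bits.
_R_VALUE_RE = re.compile(r"[0-9a-f]{14}")

def parse_r_value(value: str):
    """Return the 56-bit integer for a valid r-value, else None.

    Returning None (i.e., dropping the value) means legacy 6-bit
    r-values such as "14" are ignored rather than misinterpreted.
    """
    if _R_VALUE_RE.fullmatch(value) is None:
        return None
    return int(value, 16)
```

For example, `parse_r_value("1a2b3c4d5e6f78")` yields the 56-bit
integer `0x1a2b3c4d5e6f78`, while the legacy form `parse_r_value("14")`
yields `None` and would be dropped.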

Like the existing specification, r-values will be synthesized as
necessary. However, the specification will recommend that r-values
not be synthesized automatically when the W3C tracecontext `random`
flag is set. To achieve the advanced use-case involving multiple
traces with the same r-value, users should set the `r-value` in the
tracestate before starting correlated trace root spans.

### Detailed design

Let's look at the details of how this threshold can be calculated.
This proposal defines the sampling "threshold" as a 7-byte string used
to make consistent sampling decisions, as follows.

1. When the r-value is present and parses as a 56-bit random value,
   use it; otherwise, bytes 10-16 of the TraceID (counting from 1,
   i.e., the rightmost 7 of the 16 TraceID bytes) are interpreted as
   a 56-bit random value in big-endian byte order.
2. The sampling probability (in the range `[0x1p-56, 1]`, i.e., from
   2^-56 up to and including 1) is multiplied by `0x1p+56` (i.e.,
   2^56), yielding an unsigned Threshold value in the range
   `[1, 0x1p+56]`.
3. If the unsigned TraceID random value (range `[0, 0x1p+56)`) is
   less than the sampling Threshold, the span is sampled; otherwise
   it is discarded.
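The steps above can be sketched as follows; the function names are
hypothetical, and this is an illustration of the arithmetic rather
than a specified API:

```python
def sampling_threshold(probability: float) -> int:
    """Map a probability in [2**-56, 1] to a threshold in [1, 2**56]."""
    if not (2.0 ** -56 <= probability <= 1.0):
        raise ValueError("probability out of range")
    return round(probability * (1 << 56))

def should_sample(trace_id: bytes, probability: float, r_value=None) -> bool:
    """Consistent sampling decision for a 16-byte W3C TraceID.

    When a valid r-value (a 56-bit int) is supplied, it is preferred;
    otherwise the rightmost 7 bytes of the TraceID supply the 56-bit
    random value, read in big-endian order.
    """
    if r_value is not None:
        randomness = r_value
    else:
        randomness = int.from_bytes(trace_id[9:16], "big")
    return randomness < sampling_threshold(probability)
```

For instance, `sampling_threshold(0.1)` yields `7205759403792794`, the
value derived later in this document, and a TraceID whose rightmost 7
bytes are all `0xff` is sampled only when the probability is 1.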


For head samplers, there is an opportunity to synthesize a new r-value
when the tracecontext does not set the `random` bit (as the existing
specification recommends synthesizing r-values for head samplers
whenever there is none). However, this opportunity is not available
to tail samplers.

To calculate the Sampling threshold, we begin with an IEEE-754
standard double-precision floating-point number. With a 52-bit
mantissa (which can represent 53-bit significands, except for
subnormal values) and a floating exponent, the probability value used
to
to machine precision) the adjusted count of each span. For example,
given a sampling probability encoded as "0.1", we first compute the
nearest base-2 floating point, which is exactly 0x1.999999999999ap-04,
which is approximately 0.10000000000000000555. The exact quantity in
this example, 0x1.999999999999ap-04, is multiplied by `0x1p+56` and
rounded to an unsigned integer (7205759403792794). This specification
says that to carry out sampling probability "0.1", we should keep
Traces whose least-significant 56 bits form an unsigned value less
threshold and compared against the new threshold. There are two cases:
Sampler's threshold, the span passes through with the current
sampler's t-value, otherwise the span is discarded.

## S-value encoding for non-consistent adjusted counts

There are cases where sampling does not need to be consistent or is
intentionally not consistent. Existing samplers often apply a simple
probability test, for example. This specification recommends
introducing a new tracestate member `s-value` for conveying the
accumulation of adjusted count due to independent sampling stages.

Unlike resampling with `t-value`, independent non-consistent samplers
will multiply the effect of their sampling into `s-value`.
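The accumulation rule can be sketched as follows (the helper name is
hypothetical):

```python
def combine_s_value(existing, stage_probability: float) -> float:
    """Fold an independent, non-consistent sampling stage into s-value.

    `existing` is the incoming s-value (None when absent); the result
    is the product of all stages' sampling probabilities.
    """
    if existing is None:
        return stage_probability
    return existing * stage_probability

s = combine_s_value(None, 0.3)  # first stage records ot=s:0.3
s = combine_s_value(s, 0.1)     # a second stage yields ot=s:0.03
```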

## Examples

### 10% head sampling with parent-based downstream sampling

An upstream participant samples at 10% probability, so `ot=t:0.1` is
sent as part of the tracestate. A downstream participant does
parent-based sampling: it uses the sampled flag to make its decision,
takes the t-value from the parent context, and emits `ot=t:0.1` in
the tracestate sent to further downstream participants.

### 10% head sampling followed by 5% consistent downstream sampling

An upstream participant samples at 10% probability, so `ot=t:0.1` is
sent as part of the tracestate. A downstream participant samples at
5% probability: it calculates a threshold from its sampling rate,
compares it with the last 7 bytes of the TraceID to make the sampling
decision, and sends `ot=t:20` in the tracestate. A further downstream
participant does parent-based sampling, using the sampled flag and
propagating the t-value from the parent context.

### 90% consistent intermediate span sampling

A span that has been sampled at 90% by an intermediate processor will
have `ot=t:0.9` added to its TraceState field in the Span record. The
sampling threshold is `0.9 * 0x1p+56`.

### 90% consistent head sampling

A span that has been sampled at 90% by a head sampler will add
`ot=t:0.9` to the TraceState context propagated to its children and
record the same in its Span record. The sampling threshold is `0.9 *
0x1p+56`.

### 1-in-3 consistent sampling

The tracestate value `ot=t:3` corresponds with 1-in-3 sampling. The
sampling threshold is `1/3 * 0x1p+56`.
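Since `1/3 * 0x1p+56` is not an integer, some rounding is implied. The
sketch below assumes floor rounding and uses integer arithmetic, which
stays exact above 2^53 where double-precision arithmetic would lose
integer precision; the rounding rule itself is not fixed by this text.

```python
def one_in_n_threshold(n: int) -> int:
    """Threshold for 1-in-n consistent sampling (floor rounding assumed)."""
    return (1 << 56) // n

t = one_in_n_threshold(3)  # within one unit of 2**56 / 3
```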

### 30% simple probability sampling

The tracestate value `ot=s:0.3` corresponds with 30% sampling by one
or more sampling stages. This would be the tracestate recorded by
`probabilisticsampler` when using a `HashSeed` configuration instead
of the consistent approach.

### 10% probability sampling twice

The tracestate value `ot=s:0.01` corresponds with 10% sampling by one
stage and then 10% sampling by a second stage: the first stage records
`ot=s:0.1`, and the second stage multiplies its own probability into
the s-value, yielding `ot=s:0.01`.

## Trade-offs and mitigations

Support for encoding t-value as either a probability or an adjusted