Non-power-of-two consistent tail probability sampling in TraceState #226
@oertl Thank you. I accept your suggestions and wonder, would you be interested in drafting blocks of replacement text for your scheme? I'll be glad to work out the pseudocode snippets from there, but I would like your words for "hex/binary threshold equals number of spans dropped out of 2^56", for explaining the 52-bit vs. 56-bit issue, and for the use of greater-or-equal, I take it, for filtering. I would then be glad to work on writing a spec to deprecate p-value in favor of t-value, where, as you say, the threshold "0" equals "00000000000000", indicating all spans with 7 bytes >= "00000000000000". |
As alternative to the t-value, the p-value definition could be extended. For example,
The resulting value corresponds to the number of kept spans out of 2^56, minus 1. The example above would correspond to a sampling probability of (28853590294527 + 1)/2^56 = 0.000400424. With this definition |
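To make the arithmetic concrete, here is a minimal Go sketch of the extended p-value definition described above. The function name `probabilityFromKept` is hypothetical and not part of any specification; the sketch only assumes that the encoded value is the number of kept spans out of 2^56, minus 1.

```go
package main

import (
	"fmt"
	"math"
)

// probabilityFromKept converts the extended p-value (kept spans out of
// 2^56, minus 1) into a sampling probability. Hypothetical helper.
func probabilityFromKept(p uint64) float64 {
	// (p + 1) / 2^56; Ldexp scales by a power of two without extra rounding.
	return math.Ldexp(float64(p+1), -56)
}

func main() {
	// The value from the comment above.
	fmt.Printf("%.9f\n", probabilityFromKept(28853590294527))
}
```

With this encoding the largest 56-bit value maps to probability 1, and probability 0 cannot be expressed.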
By the way, in my prototype I used the syntax As for whether we overload how to parse |
This proposes to extend that specification with support for 56-bit
precision sampling probability. This is seen as particularly
important for implementation of probabilistic tail samplers (e.g., in
It may be good to elaborate a bit more on the motivation for this requirement of higher precision sampling probability.
If p-value is not used currently in any implementations (or if they are okay with a breaking change since it is still experimental as you call out), yes it looks like extending p-value to also encode non-powers-of-two sampling probabilities is a good idea by @oertl. Reasoning: Since p-value and t-value are conceptually for the same purpose (to encode sampling probabilities), ideally it will be good to just extend the p-value concept - it will make it a bit easier to understand for folks who are already familiar with the purpose of p-value, rather than trying to think of t-value as a new concept. |
With respect to obsoleting the r-value, I reported a new issue 3307 today. If we agree that the issue is real (I'm not entirely sure), the solution I proposed would make consistent probability samplers work differently if there's no r-value. |
This was discussed in yesterday's Sampling SIG. Since the initial feedback, I had come to the following rough idea to use all the bits of the TraceID, to consistently decide how to interpolate between powers of two. Considering an 8-bit example with 75% sampling:
cc @oertl @PeterF778 |
Unfortunately, after rethinking this proposal, I have concluded that it would not allow correct estimation of trace quantities (e.g. estimating the number of traces touching one service A and another service B) as described in my paper. It would only work for span estimates (e.g. estimating the number of spans of service A). The proposal violates the basic assumption that the choice of the sampling probability is independent of the shared randomness (trace ID or p-value). |
Let's take one step back. The Consistent Probability Sampling schema already allows consistent head and tail sampling if the sampling probability is a power of 2. Now we want to extend it so that consistency (head and tail) is preserved even for non-power-of-2 probabilities. It is obvious, I think, that we need to use the random bits of trace-id somehow, if we want to get consistent sampling with any given probability, meaning multiple instances of (head or tail) samplers making the same sampling decisions wrt spans belonging to the same trace.

From the past discussions, it looks like we have two categories of possible extensions:

A. Allow (approximation of) any non-power-of-2 sampling probabilities in TraceState.
B. Continue to restrict the sampling probabilities in TraceState to power-of-2 values.

and we can always have

C. Drop Consistent Probability Sampling as it has been proposed and replace it with something else.

Now we could try to summarize the benefits and disadvantages of these approaches.
while the potential issues are:
Respectively, approach B is the reverse of that, with the drawbacks being:
and the advantages are:
With respect to C, it is a very wide open field, but let's keep in mind that for expressing sampling probabilities we do not need high precision, we need wide range of values. That's why the logarithmic scale of r-values and p-values is so powerful and efficient. If we try to replace that with a linear model based on 56 random bits in trace-id, we quickly run out of bits. |
Thanks @PeterF778 for consolidating the tradeoffs - overall the summary of the tradeoffs makes sense to me. I didn't quite understand the below two points - can you please clarify/elaborate?
and this final point about running out of bits:
|
03f693c Contains an update based on recent discussions. |
@kentquirk new draft: |
A very nice and clean proposal! |
user's intended sampling probability without floating point conversion
loss.

## Prior art and alternatives
Towards the end, we may want to call out that one benefit of the r-value based randomness was that it could be used to get consistent sampling across multiple traces (e.g., all traces started within a time window by a participant) - it would be good to call out that it should be possible to support it in the future as a complement to the current proposal.
If we decide to use arbitrary sampling probabilities, we should not use the current definition of the r-value. It makes no sense to have different discretizations for the r-value (powers of two) and for the t-value (56-bit values). Therefore, the r-value should rather be a 14-digit hex value that overrides the random bits of the trace ID, if present. This way we could also handle traces where the random flag is not set in the trace context. If the flag is not set and there is also no r-value, we could require consistent samplers to set the r-value by generating a 56-bit random value.
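The selection order suggested in this comment can be sketched in Go as follows. This is only an illustration under stated assumptions: `randomness56` and `mask56` are hypothetical names, the r-value is assumed to be exactly 14 hex digits, and the trace ID randomness is assumed to be its last 7 bytes in big-endian order.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
	"strconv"
)

// mask56 keeps the low 56 bits of a 64-bit value.
const mask56 = uint64(1)<<56 - 1

// randomness56 returns the 56 bits of shared randomness for a span and
// reports whether a fresh value was generated (which would then have to
// be written back as a new r-value). All names here are hypothetical.
func randomness56(rValue string, traceID [16]byte, randomFlag bool) (uint64, bool, error) {
	// An r-value of exactly 14 hex digits overrides the trace ID.
	if len(rValue) == 14 {
		if v, err := strconv.ParseUint(rValue, 16, 64); err == nil {
			return v, false, nil
		}
	}
	// Otherwise use the last 7 bytes of the trace ID, but only when the
	// trace context marks them as random.
	if randomFlag {
		return binary.BigEndian.Uint64(traceID[8:]) & mask56, false, nil
	}
	// No usable randomness: generate 56 random bits.
	var buf [8]byte
	if _, err := rand.Read(buf[:]); err != nil {
		return 0, false, err
	}
	return binary.BigEndian.Uint64(buf[:]) & mask56, true, nil
}

func main() {
	var id [16]byte
	for i := range id {
		id[i] = byte(i + 1)
	}
	v, generated, err := randomness56("", id, true)
	fmt.Printf("%014x %v %v\n", v, generated, err)
}
```

Returning the "generated" flag separately keeps the decision about mutating tracestate with the caller, which is where propagation rules would apply.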
Sampler's threshold, the span passes through with the current
sampler's t-value, otherwise the span is discarded.

## Examples
It would be good to add two more examples that show how consistent probability sampling can be achieved across multiple participants.
Example 1:
- Upstream participant samples at 10% probability (ot=t:0.1 is sent as part of tracestate)
- Downstream participant does parent-based sampling. It uses the sampled flag to make the decision, gets the t-value from the parent context and emits it as part of its context (ot=t:0.1 is sent as part of tracestate to further downstream participants)
Example 2:
- Upstream participant samples at 10% probability (ot=t:0.1 is sent as part of tracestate)
- Downstream participant samples at 5% probability - it calculates a threshold based on its sampling rate and compares with the traceID last 7 bytes to make the sampling decision (ot=t:20 is sent as part of tracestate).
- Downstream participant does parent-based sampling (uses the sampled flag to make the decision, gets the t-value from the parent context and emits it as part of its context)
These examples sound good to me! Will do.
Traces whose least-significant 56 bits form an unsigned value less
than 7205759403792794.
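The constant in this example can be reproduced with a short Go sketch. The names `thresholdFor` and `keep` are hypothetical, and the sketch assumes the "less than" comparison quoted above and rounding to the nearest integer.

```go
package main

import (
	"fmt"
	"math"
)

// thresholdFor returns the sampling probability scaled to 56 bits,
// rounded to the nearest integer. Hypothetical helper.
func thresholdFor(probability float64) uint64 {
	return uint64(math.Round(math.Ldexp(probability, 56)))
}

// keep reports whether a span is sampled: its 56 random bits must be
// strictly less than the threshold.
func keep(randomness, threshold uint64) bool {
	return randomness < threshold
}

func main() {
	t := thresholdFor(0.1)
	fmt.Println(t)
	fmt.Println(keep(7205759403792793, t), keep(7205759403792794, t))
}
```

Because every sampler compares the same 56 bits of a trace against its own threshold, a sampler with a lower probability keeps a subset of what a sampler with a higher probability keeps.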

## T-value encoding for adjusted counts
It will be good to define the mutation rules and propagation rules for t-value. E.g., something on the lines of:
- if a participant is doing parent-based sampling, it should propagate the t-value from its parent.
- if a participant is doing consistent probability sampling using its own sampling rate, it should mutate the t-value to set the new adjusted count / sampling rate.
Not quite answering your question, but I've prototyped open-telemetry/opentelemetry-collector-contrib#22058 with a different sort of answer to your question.
In this case referring to span data records, where there are multiple collectors in a pipeline. The first collector may sample at 1/10; when a subsequent collector samples at 1/20, the t-value of the selected spans will be updated. If the subsequent collector samples at 1/2, however, it is being less selective than the first collector, so it should not modify the t-value. That is to say that t-value adjusted counts should not fall and t-valued probabilities should not rise.
See the logic here: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/22058/files#diff-33f10350e2875f926dd2be6fc4c6bb88cfd8043cf6ac6d100295cf654771d90dR210-R219
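The rule stated in this comment (adjusted counts should not fall) can be sketched in a few lines of Go. This is not the logic from the linked pull request; `nextAdjustedCount` is a hypothetical name and the sketch assumes adjusted counts are whole numbers such as 10 for 1-in-10.

```go
package main

import "fmt"

// nextAdjustedCount returns the adjusted count a collector should record
// on spans it keeps: the larger of the incoming count and its own.
// Hypothetical helper for illustration.
func nextAdjustedCount(incoming, own uint64) uint64 {
	if own > incoming {
		return own // this collector is more selective; update the t-value
	}
	return incoming // less selective; leave the t-value unchanged
}

func main() {
	fmt.Println(nextAdjustedCount(10, 20)) // 1/10 followed by 1/20
	fmt.Println(nextAdjustedCount(10, 2))  // 1/10 followed by 1/2
}
```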
I think there's a problem with such sampling behavior. Let's assume that the previous collector in chain sampled all traces with errors with probability 1, and all remaining traces with 1/100. If the next collector in chain is configured with 1/10, it will not touch the healthy traces, but will decimate the traces with errors. So any stratified sampling logic must be known and repeated by all collectors in the pipeline. Even if we prohibit stratified sampling, to set up a collector sampling probability in any meaningful way we have to know the minimum sampling probability of all the preceding collectors.
I would like to see how to provide algorithms for tail sampling based on the t-value, for the following use cases:
The potential challenge I see is that the algorithms may require multiplying t-values and/or sorting the spans according to their t-value. |
I believe collecting Therefore we need
In summary, it takes
spans. The average processing costs per span are therefore
which is constant. I am not 100% sure yet what t-value needs to be assigned to the surviving spans to get unbiased counts. Probably it is sufficient to replace the current t-value, if it is larger, by the threshold
If there was no previous sampling stage, we need to estimate the incoming rate from spans in the past (e.g., with exponential smoothing). Using this estimate and the desired rate, we can calculate a sampling probability that is used as the threshold. If there were previous sampling stages, the random values are no longer uniformly distributed and the actual distribution must be considered when choosing the threshold. For example, to achieve a 50% reduction, the threshold would need to be set to the median of all random values. To obtain a sampling probability of |
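For the first case above (no previous sampling stage), the idea of estimating the incoming rate with exponential smoothing and deriving a probability from it could look roughly like this in Go. The type `rateEstimator`, its fields, and the choice of smoothing factor are assumptions made for this sketch, not part of the proposal.

```go
package main

import "fmt"

// rateEstimator tracks an exponentially smoothed span rate.
// Hypothetical type for illustration only.
type rateEstimator struct {
	alpha float64 // smoothing factor in (0, 1]
	rate  float64 // smoothed spans per second
	init  bool    // whether at least one observation was seen
}

// observe folds a new spans-per-second measurement into the estimate.
func (e *rateEstimator) observe(spansPerSecond float64) {
	if !e.init {
		e.rate, e.init = spansPerSecond, true
		return
	}
	e.rate = e.alpha*spansPerSecond + (1-e.alpha)*e.rate
}

// probability returns the sampling probability needed to reduce the
// estimated rate to the target rate, capped at 1.
func (e *rateEstimator) probability(target float64) float64 {
	if !e.init || e.rate <= target {
		return 1
	}
	return target / e.rate
}

func main() {
	e := &rateEstimator{alpha: 0.5}
	e.observe(1000)
	e.observe(3000)
	fmt.Println(e.rate, e.probability(500))
}
```

The resulting probability would then be converted to a threshold. The case with previous sampling stages is not covered here, since it needs the actual distribution of the random values.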
I'm not sure I understood your algorithm for static sampling correctly, @oertl. Correct me, if I'm wrong, but it looks like the selection of spans to survive is affected heavily by their t-value, rather than the source of randomness (trace-id), which introduces a bias. |
It depends on what is meant by bias. My understanding is the following: The expected adjusted count is equal to 1. This is ensured by setting the adjusted count to the inverse of the sampling probability. This should be the case with the algorithm described above. @PeterF778, I think your concern is that the algorithm balances the sampling probabilities of the previous sampling stages. Depending on what you want to estimate, this sampling strategy may or may not be beneficial. I haven't checked, but I think the algorithm is similar to VarOpt sampling (see here), which minimizes variance when estimating arbitrary subset sums. This algorithm makes sense if you have a sampling stage that combines samples collected in earlier sampling stages with different probabilities for no good reason, e.g., due to unbalanced load or short-term load fluctuations. The situation is different if you have already identified certain classes of spans (e.g., spans with errors) that should be sampled with a higher probability. In this case, you want to sample these spans more frequently than others on purpose. I think the correct term for this is stratified sampling, where weights are defined for different classes of spans and the sampling algorithm tries to sample them so that the ratios of the weights are reflected by the corresponding ratios of the t-values. To complicate matters further, different stages of sampling would define different weights for the same span. Early stages may not be able to assess the importance of a span, while later sampling stages have a more holistic view of the trace and therefore might assign a different weight to a span, for example, if the child span has an error. Anyway, discussion of these sampling strategies takes us a bit off topic. The same issues must be solved when using power-of-two sampling probabilities. |
Yes, I imagine "static sampling" to be applied to aging data, after one or many stratified sampling steps were performed.
If we want to compare the two competing consistent probability sampling mechanisms, we have to understand what they will entail in all processing stages. I believe the re-sampling algorithms will be very similar in principle, but there could be differences in complexity or accuracy of the results. |
Yes, this is true. For example, for "dynamic sampling" it is probably much simpler to estimate quantiles if there is a power-of-two discretization, as you could simply aggregate into a histogram because there are only a small number of relevant values. However, any sampling stage is free to use just power-of-two sampling probabilities, if it is more efficient. Even if spans come with non-power-of-two t-values, they could be easily downsampled to the next power of two in a first step. |
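As a rough illustration of the downsampling step mentioned here, the largest power of two not exceeding a given probability can be computed as below. `floorPowerOfTwo` is a hypothetical helper, and the sketch assumes probabilities in the range (0, 1].

```go
package main

import (
	"fmt"
	"math"
)

// floorPowerOfTwo returns the largest power of two that is less than or
// equal to p, for p > 0. Hypothetical helper for illustration.
func floorPowerOfTwo(p float64) float64 {
	if p <= 0 {
		return 0
	}
	// Frexp writes p as frac * 2^exp with frac in [0.5, 1), so the
	// power of two at or below p is 0.5 * 2^exp.
	_, exp := math.Frexp(p)
	return math.Ldexp(0.5, exp)
}

func main() {
	fmt.Println(floorPowerOfTwo(0.3), floorPowerOfTwo(0.25), floorPowerOfTwo(0.1))
}
```

A stage could then apply the threshold for this lower probability to the same random bits, so spans that survive remain a consistent subset of the incoming ones.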
New developments discussed in the Sampling SIG today. We proposed "s-value" as a mechanism to encode the accumulation of independent non-consistent sampling stage adjusted counts. t-value and s-value would be separate fields, both included in tracestate for consistency. Existing vendor-specific sampling probabilities with unknown-and-presumed-independent sampling mechanisms will encode probability or adjusted count (as with t-value encoding) using tracestate s-value. In the coming week I will resolve the conversations above. The plan is to draft a proposed change to https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md and document/justify the changes in this OTEP. Most of the existing specification will be re-used. |
Co-authored-by: J. Kalyana Sundaram <[email protected]>
Co-authored-by: J. Kalyana Sundaram <[email protected]>
I've updated the document making "r-value" optional (and 56-bits) and making "s-value" for independent non-consistent sampling. I've incorporated some helpful feedback from @kalyanaj. Thanks. |
TraceState field. It would be appropriate to name this field
`LogState`.

This proposal does makes r-value an optional 56-bit number as opposed
typo: This proposal makes...

To calculate the Sampling threshold, we began with an IEEE-754
standard double-precision floating point number. With 52-bits of
significand and a floating exponent, the probability value used to
With 52-bits of significand...
Double-precision floating-point values have a 52-bit mantissa but are able to represent 53-bit significands (except for subnormal values). See https://cs.stackexchange.com/a/152267/102560.
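The 53-bit point is easy to check empirically. The following Go sketch (with a hypothetical helper `roundTrips`) tests whether an integer survives a conversion to `float64` and back; it is meant for values of at most 56 bits, as in this discussion.

```go
package main

import "fmt"

// roundTrips reports whether n is exactly representable as a float64.
// Intended for values well below 2^64. Hypothetical helper.
func roundTrips(n uint64) bool {
	return uint64(float64(n)) == n
}

func main() {
	fmt.Println(roundTrips(1<<53 - 1)) // largest 53-bit integer
	fmt.Println(roundTrips(1<<53 + 1)) // needs 54 significant bits
	fmt.Println(roundTrips(1<<56 - 1)) // largest 56-bit integer
}
```

This is why not every 56-bit threshold can be produced from a double-precision probability: only thresholds with at most 53 significant bits are reachable.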
value in big-endian byte order.
2. The sampling probability (range `[0x1p-56, 1]`) is multiplied by
1. When the r-value is present and parses as a 56-bit random value,
use it, otherwise bytes 10-16 of the TraceID are interpreted as a
It is worth specifying whether this counts from 0 or 1, or, even better, including an annotated traceID here, just for clarity.
1. When the r-value is present and parses as a 56-bit random value,
use it, otherwise bytes 10-16 of the TraceID are interpreted as a
56-bit random value in big-endian byte order
2. The sampling probability (range `[0x1p-56, 1]`) is multiplied by
I think most readers will be unfamiliar with floating point hex notation (I was) and this is probably needlessly terse. One way to express it would be `(0, 1]`, but that also might be too confusing. Perhaps "greater than 0 and less than or equal to 1" or even `0 < n <= 1`?
Similarly below, I might say 2^56 rather than using the hex notation.
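For readers unfamiliar with the notation, `0x1p-56` is a hexadecimal floating-point literal meaning 1 × 2^-56. Go accepts such literals directly (since Go 1.13), so the range in the quoted text can be checked with a small sketch. The constant and function names below are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

const (
	minProbability = 0x1p-56 // 2^-56, the smallest probability in the quoted range
	scale          = 0x1p+56 // 2^56, the number of distinct 56-bit values
)

// validProbability reports whether p lies in the range [2^-56, 1].
func validProbability(p float64) bool {
	return p >= minProbability && p <= 1
}

func main() {
	fmt.Println(minProbability == math.Ldexp(1, -56))
	fmt.Println(scale == float64(uint64(1)<<56))
	fmt.Println(validProbability(0), validProbability(0.25))
}
```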

### 10% probability sampling twice

The tracestate value `ot=s:0.01` corresponds with 10% sampling by one
Maybe expand this to show how the tracestate would be modified at each stage?
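One possible way to show the per-stage changes is sketched below in Go. It assumes, based only on the `ot=s:0.01` example in the quoted text, that the s-value carries the product of independent sampling probabilities as a decimal number. The function names and the rounding to six significant digits are choices made for this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// stageProbability combines an incoming probability with an independent
// stage's own probability by multiplying them.
func stageProbability(incoming, own float64) float64 {
	return incoming * own
}

// formatSValue renders a probability as a tracestate entry, rounded to
// six significant digits to hide floating point noise. Hypothetical format.
func formatSValue(p float64) string {
	return "ot=s:" + strconv.FormatFloat(p, 'g', 6, 64)
}

func main() {
	p := 1.0 // no sampling applied yet
	for stage := 1; stage <= 2; stage++ {
		p = stageProbability(p, 0.1)
		fmt.Printf("after stage %d: %s\n", stage, formatSValue(p))
	}
}
```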
@jmacd, @kalyanaj, @oertl, @PeterF778 As promised in the SIG meeting, I have implemented a proposal for how to calculate threshold, sampling rate, and sampling probability in several different languages (Go, JavaScript, and Python). The proposal is simple -- here's the Go version:

```go
// threshold maps a t-value to a 56-bit threshold. Values below 1 are
// treated as probabilities, values of 1 or more as "1 in N" rates.
func threshold(tValue float64) int64 {
	const k = 0x1p+56
	if tValue < 1.0 {
		return int64(k*tValue + 0.5)
	}
	return int64(k/tValue + 0.5)
}

// samplingRate returns N for "1 in N" sampling, rounded to the nearest integer.
func samplingRate(tValue float64) int64 {
	if tValue < 1.0 {
		return int64(1.0/tValue + 0.5)
	}
	return int64(tValue + 0.5)
}

// samplingProbability returns the probability in the range (0, 1].
func samplingProbability(tValue float64) float64 {
	if tValue < 1.0 {
		return tValue
	}
	return 1.0 / tValue
}
```

All 3 languages seem to deliver identical results in the first 2 functions. I was unable to convince all 3 languages to format floating point values identically (I probably could have by installing various libraries but I wanted to use the standard libraries). But they seemed to be identical through the first 14 digits. Thoughts? |
I wonder if we should prefer a canonical lossless (e.g. hex-encoded and only values less than or equal to 1) representation of the t-value that does not depend on platform- or language-dependent behavior (like the r-value). This would make encoding and parsing much simpler and less costly. For scenarios with multiple sampling stages, the t-value needs to be parsed and encoded frequently, and therefore this should be done as efficiently as possible. |
I understand your point, but I'm personally much more concerned about human usability. What I see from users in real sampling configurations are sampling rates like 1 in 10, 1 in 6, and 1 in 1000.
All of those representations lose the user's intent, while There is also the problem that as far as I have found, many languages do not support hex representation in floating point. Or if they do, they do so differently. I strongly feel that we should bias toward comprehensible and easily implementable human representations. |
I also understand your point. But the limited number of bits simply does not allow sampling exactly 1 out of 3 consistently. In my opinion, it is more important to forward the actual sampling probability applied, not the user's intent. If you know that all your sampling configurations are of the kind "1 out of X", it is relatively easy to reconstruct X from the applied sampling probability to make the user believe that it really was sampled as originally intended. Maybe an additional flag indicating that the reported t-value comes from a "1 out of X" rule could be a compromise. In large systems, sampling probabilities are typically chosen automatically (e.g. based on rate limits), and it is more important that the parsing/encoding overhead is small.
It is easy to specify a lossless hex representation for the t-value. It could be defined as the integer threshold value used when comparing with the 56 random bits of the trace ID (or the optional r-value). If the t-value is an integer, it is straightforward to find a platform/language-independent hex representation. This definition would also reduce floating point operations as the sampling decision is simply the result of comparing the t-value with the random bits. |
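A lossless hex t-value of this kind could be handled as in the following Go sketch. The function names are hypothetical. The sketch also assumes the convention from earlier in this thread, where the threshold counts spans dropped out of 2^56 and a span is kept when its random bits are greater than or equal to the threshold, so that "00000000000000" keeps everything.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// encodeThreshold renders a 56-bit threshold as 14 lowercase hex digits.
func encodeThreshold(t uint64) string {
	return fmt.Sprintf("%014x", t)
}

// parseThreshold parses a 14-digit hex t-value back into an integer.
func parseThreshold(s string) (uint64, error) {
	if len(s) != 14 {
		return 0, errors.New("t-value must be exactly 14 hex digits")
	}
	return strconv.ParseUint(s, 16, 64)
}

// sampled is the whole sampling decision: one integer comparison
// (assumed convention: keep when randomness >= threshold).
func sampled(randomness, threshold uint64) bool {
	return randomness >= threshold
}

func main() {
	th, err := parseThreshold("80000000000000") // half of all spans dropped
	fmt.Println(th, err, encodeThreshold(th))
	fmt.Println(sampled(0x7fffffffffffff, th), sampled(0x80000000000000, th))
}
```

No floating point operation is involved in parsing, encoding, or deciding, which is the efficiency argument made in the comment above.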
This OTEP has been well-replaced by #235. Thanks @kentquirk! |
This is meant to address open-telemetry/opentelemetry-specification#1413.
Follows https://github.com/open-telemetry/oteps/blob/main/text/trace/0170-sampling-probability.md and https://github.com/open-telemetry/oteps/blob/main/text/trace/0168-sampling-propagation.md.
Cc: @oertl @kalyanaj @PeterF778