Review approach & specify algorithm for TraceIdRatioBasedSampler (ProbabilitySampler) #1413
Comments
I've seen implementations that use an "inflationary sampling probability", and the algorithm is important for them. For example, take Service A (sampling rate 0.2) -> Service B (sampling rate 0.4), where you want to end up with about 0.2 full traces and another 0.4 at the Service B level (including the Service A traces). The algorithm that I've seen (I think @jmacd was one of the authors) works as follows: if the same algorithm is used across services (which means across languages as well), is deterministic, and ensures that every trace sampled at a lower rate is also sampled at a higher rate, then you can achieve that.
traceId1 = "..." // sampled at 0.2 rate
ServiceASampler = new TraceIdBasedSampler(0.2):
ServiceBSampler = new TraceIdBasedSampler(0.4):
|
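To make the property described above concrete, here is a minimal Go sketch, not taken from any SDK. The names `upperBound` and `keep` are hypothetical, and the 63-bit value `x` stands in for whatever number a sampler derives from the trace ID. Because both services compare the same value against a bound that grows with the ratio, anything kept at 0.2 is also kept at 0.4.

```go
package main

import (
	"fmt"
	"math/rand"
)

// upperBound maps a sampling ratio in [0, 1] to a bound on a 63-bit value.
func upperBound(ratio float64) uint64 {
	switch {
	case ratio <= 0:
		return 0
	case ratio >= 1:
		return 1 << 63
	}
	return uint64(ratio * (1 << 63))
}

// keep reports whether the 63-bit value x (assumed to be derived from the
// trace ID) is sampled at the given ratio. The decision is deterministic:
// it uses no state and no extra randomness.
func keep(x uint64, ratio float64) bool {
	return x < upperBound(ratio)
}

func main() {
	rng := rand.New(rand.NewSource(1)) // fixed seed, a stand-in for trace IDs
	var atA, atB, onlyA int
	for i := 0; i < 100000; i++ {
		x := rng.Uint64() >> 1
		a, b := keep(x, 0.2), keep(x, 0.4)
		if a {
			atA++
		}
		if b {
			atB++
		}
		if a && !b {
			onlyA++ // would break the "lower rate implies higher rate" property
		}
	}
	fmt.Println("kept at 0.2:", atA, "kept at 0.4:", atB, "kept only at 0.2:", onlyA)
}
```

The count of traces kept only at the lower rate is always zero here, which is the property the comment relies on.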
FYI for the record this wasn't my doing. This "inflationary" sampling technique predated me on that project. 😀 |
The Go implementation of this algorithm has led to an incompatibility with Amazon X-Ray trace IDs. The first 4 bytes of X-Ray trace IDs are time-based, and the Go ratio sampler expects these bytes to be random. It would be advantageous to the Go SIG, and likely others, if we could resolve this issue. That way, when we change our algorithm, we will only need to do so once. |
@Aneurysm9 had mentioned that we could switch to "hashing" the middle part of the trace ID to make a sampling decision. |
Current Go implementation for reference:
func (ts traceIDRatioSampler) ShouldSample(p SamplingParameters) SamplingResult {
	psc := trace.SpanContextFromContext(p.ParentContext)
	// Read the first 8 bytes of the trace ID and drop the lowest bit,
	// leaving a 63-bit value.
	x := binary.BigEndian.Uint64(p.TraceID[0:8]) >> 1
	// Sample when the value falls below the sampler's upper bound.
	if x < ts.traceIDUpperBound {
		return SamplingResult{
			Decision:   RecordAndSample,
			Tracestate: psc.TraceState(),
		}
	}
	return SamplingResult{
		Decision:   Drop,
		Tracestate: psc.TraceState(),
	}
} |
@MrAlias that is why the W3C specification is actually moving to make it explicit which bytes should be random. You can see the draft here: https://github.com/w3c/trace-context/blob/main/spec/20-http_request_header_format.md#trace-id The reason we used 7 and not 8 bytes is that the 8th byte contains the sign bit. This is the same reason Go has a method to generate a 63-bit random number. |
Ah, super helpful! Thanks 🙏 |
Is the plan for OTel to just adopt this ^ when it lands? |
@dyladan does this mean that the AWS X-Ray trace ID, which uses a non-pseudo-random value for the left-most 4 bytes (based on the time), would not be W3C compliant? Looking into switching the Go implementation to the mentioned algorithm, I think the X-Ray spans would continue to be sampled in a non-random manner given this static prefix. cc @Aneurysm9 |
Trace IDs are 16 bytes, so there should be no issue.
X-Ray IDs will have non-random data in the region read by the current code: x := binary.BigEndian.Uint64(p.TraceID[0:8]) >> 1. Instead, we could use the region: x := binary.BigEndian.Uint64(p.TraceID[9:]). This would also make the sampler safe to use with 64-bit trace IDs still generated by some legacy systems. |
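One practical detail when reading the tail of the ID: `binary.BigEndian.Uint64` needs an 8-byte slice, so a 7-byte slice such as `TraceID[9:]` cannot be passed to it directly. A minimal sketch of one way to get the rightmost 7 bytes follows. The function name `randomPart` and both example IDs are made up for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// randomPart returns the least-significant 7 bytes (56 bits) of a 16-byte
// trace ID as an integer. It reads the last 8 bytes and masks off the top one.
func randomPart(traceID [16]byte) uint64 {
	return binary.BigEndian.Uint64(traceID[8:16]) & 0x00ffffffffffffff
}

func main() {
	// Hypothetical X-Ray-style ID: the first 4 bytes hold a timestamp.
	xray := [16]byte{0x64, 0x2f, 0x1a, 0x00, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
	// Hypothetical legacy 64-bit ID, left-padded with zeros to 16 bytes.
	legacy := [16]byte{8: 0xde, 9: 0xad, 10: 0xbe, 11: 0xef, 12: 1, 13: 2, 14: 3, 15: 4}

	fmt.Printf("%014x\n", randomPart(xray))
	fmt.Printf("%014x\n", randomPart(legacy))
}
```

Neither the time-based prefix nor the zero padding influences the extracted value, since only bytes 9 through 15 are used.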
I am expecting us to use the W3C trace context "random" flag to address this issue: |
We spoke about this at the w3c meeting yesterday actually. I was going to bring it up at the next maintainers meeting. The level 2 spec for trace context was delayed by some extended summer vacations but is about to go into wide review for publication as a recommendation. Obviously until something is an official recommendation the working group can't guarantee anything, but we do not expect any major changes. Here are the important points:
No. The randomness requirement only applies to traceparents where the edit: Missed that @Aneurysm9 clarified the region in question is random.
I would feel safe using the rightmost 7 bytes as my random number, and the fewer rightmost bytes that are used the safer I would feel. Hashing the random part of the ID or the whole ID would be another way to guarantee safety, but also comes at the cost of implementation complexity (and ensuring all implementations are the same). I would probably recommend to restrict to inverse power of 2 sampling probabilities (1/2, 1/4, 1/8, etc) which would allow you to use the minimum number of bits without hashing. For example, a 50% sampling rate only needs the single rightmost bit, a 25% sampling rate needs only the rightmost 2 bits, etc. This has also been discussed to have other benefits with respect to @jmacd's and @oertl's probability propagation proposal. edit: alternative to power of 2 restriction would be restricting probability to a whole number percent. Only 7 bits are required to represent every whole number from 1 to 99 |
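The power-of-two idea above can be sketched in a few lines of Go. This is an illustration only: `sampledPow2` is a hypothetical name, and "keep when the k rightmost bits are all zero" is one of several equivalent conventions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// sampledPow2 keeps a trace with probability 2^-k by requiring the k
// rightmost bits of the trace ID to be zero. k is assumed to be in [0, 56].
func sampledPow2(traceID [16]byte, k uint) bool {
	low := binary.BigEndian.Uint64(traceID[8:16])
	mask := uint64(1)<<k - 1 // k low bits set
	return low&mask == 0
}

func main() {
	id := [16]byte{15: 0x04} // last byte is binary 00000100
	for k := uint(0); k <= 3; k++ {
		fmt.Printf("k=%d (rate 1/%d): %v\n", k, 1<<k, sampledPow2(id, k))
	}
}
```

A trace kept at rate 2^-k has its k lowest bits clear, so it is also kept at every higher rate 2^-j with j < k. Only k bits of the ID are ever inspected.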
@dyladan Thank you. I am pleased to see that we are nearly ready to adopt the random bit for w3c traceparent. Those familiar with the current tracestate-based proposal for probability sampling may remember that we would be able to eliminate the If we have 7-bytes of randomness, it means we can agree to a consistent method to evaluate TraceIDRatioBased sampling policy like the OTel-Go example in #1413 (comment). |
P value propagation, I hope, is next in level 3. For now it was agreed to leave it out of the level 2 spec because the r value was less controversial and less likely to get held up in wider review. |
We discussed this in the 9/22 Sampling SIG. The action items were loosely discussed and will continue in the next SIG (10/6). My opinions, roughly, are:
The same decision would be returned by a Consistent Probability Sampler as in the experimental specification here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md. Moreover, the current experimental specification can be updated to rely on the W3C randomness bit, which is a huge improvement for us (thank you W3C TraceContext group, thank you @dyladan!) as follows:
|
Thanks for the update. @jmacd mind sharing what the reasoning was behind using 56 bits? Is it just because that is the number the W3C already has in the draft spec or was there some need for sampling thresholds with that level of randomness? |
I do not believe anyone requires 56-bits of sampling precision. I'm interested in what others think is a good value, maybe 16 or 20 bits will do. |
I am not sure if 16 bits are enough if really arbitrary sample rates should be supported. With 16 bits, the smallest possible sampling rate would be 1/2^16 = 0.00001525878 and the second smallest possible sampling rate would be 2/2^16 = 0.00003051757. There is a large relative gap between these two sampling rates. |
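To put a number on that gap, the sketch below rounds a requested rate to the nearest rate that an n-bit threshold can express. The helper `effectiveRate` and the target rate 0.00002 are hypothetical, chosen only to illustrate the point.

```go
package main

import (
	"fmt"
	"math"
)

// effectiveRate returns the sampling rate actually obtained when target is
// rounded to the nearest threshold representable with the given number of bits.
func effectiveRate(target float64, bits int) float64 {
	scale := math.Ldexp(1, bits) // 2^bits
	return math.Round(target*scale) / scale
}

func main() {
	target := 0.00002
	for _, bits := range []int{16, 32, 56} {
		got := effectiveRate(target, bits)
		relErr := math.Abs(got-target) / target
		fmt.Printf("%2d bits: effective rate %.6g, relative error %.2g\n", bits, got, relErr)
	}
}
```

With 16 bits the requested 0.00002 collapses to 1/2^16, roughly 24% below the target, while wider thresholds track the target far more closely.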
If the hashing algorithm is not specified, the SDKs can simply use any random number. Using the trace ID is not very useful in this case. |
For an assessment of the lower end of needed probabilities, let's look at one example. Google handles about 100 billion requests daily. If we design long-term storage for traces which decreases trace cardinality as the data gets older, and keeps only 1000 traces per day (for example, for data older than 5 years), we need a probability of about 1000 / 10^11 = 10^-8, or roughly 2^-27. So, I'd say, we need at least 31 random bits, and this is still playing with chances. |
When the new random flag is NOT set, couldn't we still require SDKs to use the SAME hashing algorithm (instead of an unspecified algorithm)? That is, a best effort that treats the same last set of bytes in the TraceID as random. Reasoning: My understanding is that the TraceIDs generated by many systems today, though not required by Level 1 of the W3C TraceContext spec, have their rightmost bytes randomly generated. So, shouldn't we do a "best effort" consistent probability sampling, as adoption of the new flag can take a while (the W3C TraceContext level 2 spec has to get to the recommendation stage, implementations have to adopt it, etc.)? |
This was discussed by me, @oertl, @PeterF778, @kalyanaj, and @kentquirk in the Sampling SIG today. Notes:
The W3C
We debated whether 56 bits, 48 bits, or 32 bits of randomness would be preferred. There is not a strong preference between 48 and 56, but we think 32 bits is not sufficient. There was a brief question of whether we might wish to reserve some (e.g., 8) bits of the TraceID for future/alternative use on the assumption that 16 bytes is more than sufficient for global uniqueness (provided 48 bits are truly random bits). For example, we could directly encode today's powers-of-two-sampling
We discussed hashing approaches and were reminded why we don't like them (they're expensive, faulty, and not portable, see https://github.com/rurban/smhasher).
We discussed how to test for suitably random TraceIDs; it could follow this previous work: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md#appendix-statistical-test-requirements
We arrived at the following recommendation to address this issue. Assume that the least-significant 7 bytes of the TraceID are random as though the anticipated W3C
This is compatible with X-Ray and we believe it is compatible with all existing OTel SDKs.
The work remaining for this issue, if the proposal is accepted, will be to update the Trace SDK specification with details. The TraceIDRatioBasedSampler.ShouldSample() logic uses
The group reasons that this is no worse than doing nothing at all, and assuming that the W3C proposal does not change this is also forward-compatible. This |
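The exact comparison proposed for ShouldSample() is not spelled out in the text above, so the following is only one possible reading of "assume the least-significant 7 bytes are random", written as a standalone Go sketch. The names `threshold` and `shouldSample`, the rounding choice, and the strict less-than comparison are all assumptions, not the specified algorithm.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

const randomBits = 56 // the 7 least-significant bytes of the trace ID

// threshold maps a ratio in [0, 1] to a bound on the 56-bit random value.
func threshold(ratio float64) uint64 {
	switch {
	case ratio <= 0:
		return 0
	case ratio >= 1:
		return 1 << randomBits
	}
	return uint64(math.Round(ratio * (1 << randomBits)))
}

// shouldSample keeps the trace when its 56 random bits fall below the bound.
// The leading 9 bytes of the ID are ignored entirely.
func shouldSample(traceID [16]byte, ratio float64) bool {
	r := binary.BigEndian.Uint64(traceID[8:16]) & (1<<randomBits - 1)
	return r < threshold(ratio)
}

func main() {
	// Hypothetical ID whose random part is exactly half of the 56-bit range.
	id := [16]byte{0: 0x64, 1: 0x2f, 9: 0x80}
	for _, ratio := range []float64{0.25, 0.5, 0.75, 1} {
		fmt.Printf("ratio %.2f: %v\n", ratio, shouldSample(id, ratio))
	}
}
```

Ratios 0 and 1 are handled explicitly so that "never" and "always" stay exact regardless of floating-point rounding.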
During our group meeting on Oct 20, @PeterF778, @kalyanaj, @spencerwilson and @kentquirk identified some use cases which are easier to handle if sampling decisions are based on an r-value generated independently from the trace ID. One is tracking user sessions and the other is linked traces.
Assuming a non-instrumented browser, user sessions involve several requests to some backend, each generating a new trace. It is beneficial to keep all these traces consistently sampled. This can be achieved by generating the r-value for the first request as usual, but reusing it for all traces belonging to the same session. This requires a mapping from the session ID to the r-value, which can be technically challenging, but should be feasible.
Linked traces can be used in a number of ways, but one typical use case is when one trace leaves a message in a queue to be picked up by another trace hours or days later. The root span of the consuming trace links itself to the producing trace. Again, it is beneficial to make the same sampling decisions for both traces. This can be helped by the consuming trace cloning the r-value from the producing trace. |
@PeterF778 would you agree that it would be possible for an SDK to do what you described by fixing the 7 bytes of random TraceID and then generating multiple correlated TraceIDs from the single random source? Can you see any problems that might result from avoiding r-value, in that case? |
In the SDK, the trace ID is generated automatically when the root span is created. There are no mechanisms that could be used to customize this behavior. In contrast, the r-values are created within calls to Sampler.shouldSample() where span attributes are available. Reusing the 7 random bytes of the trace ID remains a theoretical possibility, but it would be very hard to implement. Even if we could somehow get it to work, such a change could break some vendors' features if they assume uniqueness of these 7 bytes. |
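The session use case above boils down to a lookup table from session ID to a random value that is drawn once and then reused. Below is a minimal Go sketch under loud assumptions: the type `sessionRandomness` and its methods are hypothetical, the stored value is a plain 56-bit random number used as a stand-in (the encoding of the r-value in the experimental specification is not reproduced here), and eviction of old sessions is left out.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// sessionRandomness hands out one random value per session ID so that every
// trace in a session reaches the same sampling decision.
type sessionRandomness struct {
	mu     sync.Mutex
	values map[string]uint64
}

func newSessionRandomness() *sessionRandomness {
	return &sessionRandomness{values: make(map[string]uint64)}
}

// valueFor returns the value for sessionID, drawing it on first use.
func (s *sessionRandomness) valueFor(sessionID string) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.values[sessionID]
	if !ok {
		v = rand.Uint64() >> 8 // 56 random bits (assumed width)
		s.values[sessionID] = v
	}
	return v
}

func main() {
	store := newSessionRandomness()
	first := store.valueFor("session-42")
	second := store.valueFor("session-42")
	fmt.Println("same value reused:", first == second)
}
```

A sampler that compares this value against its threshold, instead of bits taken from the trace ID, would keep or drop all traces of a session together. The linked-trace case would copy the value from the producing trace rather than look it up by session.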
…29720) **Description:** This is the `pkg/sampling` portion of #24811. **Link to tracking Issue:** #29738 open-telemetry/opentelemetry-specification#1413 **Testing:** Complete. **Documentation:** New README added. --------- Co-authored-by: Juraci Paixão Kröhling <[email protected]> Co-authored-by: Kent Quirk <[email protected]>
@jmacd Is this something for sampling SIG? |
See also discussion on #1412 (comment)
The sampling algorithm for TraceIdRatioBasedSampler is unspecified. As a result, trace IDs that are sampled by some implementations might get non-sampled or re-sampled by SDKs in other languages, even though they have the same or a higher probability than the parent.
TODO list for this issue:
If it is a problem, since trace IDs can come from untrusted, non-random sources, do we open up a DDoS/security/performance issue when using trace IDs as the sole, deterministic input for our sampling algorithm? Do we need to put a warning there? Let's assume it is no problem for this issue; this should be handled in "Support restarting the trace with a different trace ID" #1188.