-
Notifications
You must be signed in to change notification settings - Fork 164
Probability sampling basics for telemetry events #148
Conversation
I had an off-line meeting with @oertl, who mentioned that Dynatrace has a field named |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like this proposal, thank you!
The described approach solves two particular use cases that I think come up quite often:
- Allowing a tracing system to easily estimate the size of a population (in visualizations, counts, etc)
- Allowing clients (probably through tail sampling) to make dynamic sampling decisions
I've made some suggested changes, one that should make the linters happy and a couple of minor edits.
Co-authored-by: Paul Osman <[email protected]> Co-authored-by: Steven E. Harris <[email protected]>
Maybe we should also consider allowing only discrete sampling rates. I'm thinking in particular of powers of 1/2 (1/2^0, 1/2^1, 1/2^2,1/2^3,...). In my opinion, this is not a big limitation in practice, but would have some really nice benefits:
|
Please take another look. @oertl I included your recipe for encoding inclusion probability as a good option. This OTEP no longer makes a specific recommendation about how to encode this information, only that when this information is encoded we know what it means. |
Dynatrace has published a paper on partial trace sampling with a focus on the unbiased estimation from incomplete trace data https://arxiv.org/abs/2107.07703. It provides arguments for limiting the sampling rate to powers of 1/2 (see section "2.8 Practical Considerations"). |
@yurishkuro Do you feel that we can merge this OTEP? I believe I addressed your concerns, and any remaining concerns or matters of opinion can be ironed out as we move on to update the specification. The recent sampling SIG meeting, summarized in open-telemetry/opentelemetry-specification#1819, found little objection over the contents of this OTEP. We seem to have reached a consensus about the use of TraceIDRatio sampling. |
@oertl Thank you for posting your research. Partial Sampling is a fantastic addition to the state of the art, and now I understand why you've been proposing power-of-two sampling rates. The "negative base-2 logarithm" topic is mentioned in this OTEP already, and I can add more current information if that will help us merge this and move on. I take it you would like to see the specification support a directly-encoded span probability, using this approach. It would be an unsigned integer field in the protocol containing the base-2 logarithm of the adjusted count (i.e., the negative base-2 logarithm of the inclusion probability): 0: The span's adjusted count is 1 So, perhaps the However, this field will not support tail sampling at arbitrary rates, which is an application with known potential and existing uses. In that sense, the proposed use of a span attributr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yurishkuro Do you feel that we can merge this OTEP?
So I have two concerns:
- The propagation story is not completely clear, see comments inline.
- There is only one approval so far. I think this topic needs a lot more attention, especially from different vendors who already had to deal with sampling.
@yurishkuro Thank you. I will address your questions with one more round of work on this PR, however I want to highlight that the result of this OTEP is largely negative on propagating probability, and while I have given a proposed/draft specification for it I do not expect we will move forward with a specification that involves propagating anything. We are left with a desire to complete the TraceIDRatio sampler (which is to achieve consistent pseudo-randomness) and (independently) to know when traces are complete. I believe this OTEP has met the standard for an OTEP, discussed here. This OTEP is now too long to review and another PR is needed. The next step will be, I think, another PR to revise this document, to focus specifically on the questions you've raised. Another piece of text is needed to fully document how to count spans based on the adjusted count that is recorded in the span.
I don't see any part of this proposal that would prevent legacy counting strategies. If you agree that this proposal needs more editing, please approve so we can merge it and open a new PR. Thanks! |
Not sure I agree with that. The process we agreed for OTEPs is that we do not capture OTEP status in the document, but use GitHub status as a proxy. I.e. approved & merged PR means approved OTEP, which in turn means it is the official position for the project and only pending mechanical translation into Spec changes. The link you provided does not say that a PR can be merged with intention for revisions via another PR, it actually says the opposite that the old PR should be closed if revisions are needed. Otherwise we are left with officially looking document in OTEPs that does not reflect the agreements. Concretely, if you do not believe that we should proceed with implementing probability propagation, then I would move that text into a Discussion area with pros/cons, and not include it in the normative portion that recommends the actual changes to the spec (which, incidentally, most of my comments are about since the proposed changes are inconsistent). |
If I remove the entire proposed specification section of this document, would you approve? The goal was to outline our options, and I included the specification text as an example. It states ("For example:") before that section of text. It's the least interesting part of this document to me, what it really did was show us how we do not want to propagate probability. I'm interested in updating this OTEP with a minimal summary of what we concluded and merging it, not continuing to address minor points in what is ultimately not a specification document. |
If you want to remove the proposed normative changes section, then it ceases to be an enhancement proposal (OTEP) imo. But I wouldn't mind merging it as it's a great read that can be referenced from other places. |
I think I will prefer to close and re-open a new PR. I will leave this open until then. |
@yurishkuro I wonder if we can salvage the bulk of this PR by removing the normative text that I had and specifying much less. My original goal with this OTEP was to specify a foundation, which is the basic idea that we can record an adjusted count when sampling to convey probability, and that adjusted count is a good way to convey that because users intuitively understand how to count but do not intuitively understand how to count inverses. Thus, I've replaced all the text with a proposal for a new Originally I had proposed only an Parent sampling, unknown probability:
Parent sampling, known probability (for 0.1 probability)
TraceIDRatio sampling (for 0.1 probability)
And for no sampler, no attributes are needed. Spans carrying these attributes can be correctly counted, the only problem is for Parent sampling with unknown probability. At least now we have clearly identified when the adjusted count is missing. See the replacement text here: c4c06cd I will take all the discussion about how to modify Samplers to produce these attributes as well as how to optionally propagate inclusion probability into another PR. Thanks. |
The AlwaysOn sampler is (afaik) equivalent to no sampler, so I don't see any use for the name in that case. The adjusted count is 1 with or without the AlwaysOn sampler.
Yes, I see this point and feel ambivalent about it. I do not see any viable uses for the existing Description other than to configure a composite sampler in the SDK. It has the appearance of something that can be logged and parsed, but I wouldn't use it. If the sampler description is "jaeger_remote", for example, it tells me nothing useful. The piece of information that is needed is not the composite policy that was configured as a Sampler, it is the effective "leaf" Sampler that was selected. If the Jaeger remote sampler selects a TraceIDRatio policy, that's what I want to know. For the TraceIDRatio description, it encodes a floating point probability which is also (IMO) not a great representation for conveying adjusted count. The specification talks about how much precision should be logged for example, which only adds to the confusion. If I am logging 1 in 1024 spans, which @oertl would prefer we represent using the number 10 (i.e., 1 byte to say sampling probability is 1/2^10), the TraceIDRatio description will read "0.000977" (i.e., 8 bytes to encode a number with the addition of a 0.05% error, since 1/0.000977 = 1023.5). As a vendor, we aren't actually concerned with knowing the sampler name. As this OTEP hopes to convince us, we can count spans without anything more than an adjusted count. I had two reasons to include it here:
|
See the related OTEP #168 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I will re-open this OTEP with a fresh PR. This is too long to review. |
This is replaced by #170. |
This OTEP defines a foundation for probability sampling in OpenTelemetry.
This drafts a specification for how to encode the sampling probabilities in a span, to enable statistical summarization from sampled traces using the
sampler.name
andsampler.adjusted_count
attributes.