Skip to content
This repository has been archived by the owner on Dec 6, 2024. It is now read-only.

Probability sampling basics for telemetry events #148

Closed
wants to merge 74 commits into from

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Mar 6, 2021

This OTEP defines a foundation for probability sampling in OpenTelemetry.

This drafts a specification for how to encode the sampling probabilities in a span, to enable statistical summarization from sampled traces using the sampler.name and sampler.adjusted_count attributes.

@jmacd jmacd requested a review from a team March 6, 2021 08:02
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
@jmacd
Copy link
Contributor Author

jmacd commented Mar 11, 2021

I had an off-line meeting with @oertl, who mentioned that Dynatrace has a field named multiplicity used to convey sampling information. This is a term I had not considered, and one that I like.

Copy link
Member

@paulosman paulosman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this proposal, thank you!

The described approach solves two particular use cases that I think come up quite often:

  1. Allowing a tracing system to easily estimate the size of a population (in visualizations, counts, etc)
  2. Allowing clients (probably through tail sampling) to make dynamic sampling decisions

I've made some suggested changes, one that should make the linters happy and a couple of minor edits.

text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
Co-authored-by: Paul Osman <[email protected]>
Co-authored-by: Steven E. Harris <[email protected]>
@oertl
Copy link
Contributor

oertl commented Mar 20, 2021

Maybe we should also consider allowing only discrete sampling rates. I'm thinking in particular of powers of 1/2 (1/2^0, 1/2^1, 1/2^2,1/2^3,...). In my opinion, this is not a big limitation in practice, but would have some really nice benefits:

  1. A single byte for the exponent would be sufficient to represent the sampling rate.
  2. The extrapolation factor (sample_count) would always be an integer. When extrapolating integer quantities such as counts, floating-point operations could be completely avoided and the estimate would be an integer as well.
  3. Trace ID ratio based sampling would be much simpler. The sampling decision could be made simply based on the number of leading zeros (NLZ) of the trace ID. If the NLZ is equal to or greater than the exponent of the sampling rate, the span would be sampled. Only a few cheap CPU operations would be necessary for a sampling decision. Furthermore, by avoiding floating-point operations, sampling decisions would become more consistent across different platforms.

@jmacd
Copy link
Contributor Author

jmacd commented May 4, 2021

Please take another look. @oertl I included your recipe for encoding inclusion probability as a good option. This OTEP no longer makes a specific recommendation about how to encode this information, only that when this information is encoded we know what it means.

Joshua MacDonald added 2 commits May 4, 2021 12:31
@oertl
Copy link
Contributor

oertl commented Jul 20, 2021

Dynatrace has published a paper on partial trace sampling with a focus on the unbiased estimation from incomplete trace data https://arxiv.org/abs/2107.07703. It provides arguments for limiting the sampling rate to powers of 1/2 (see section "2.8 Practical Considerations").

@jmacd
Copy link
Contributor Author

jmacd commented Jul 21, 2021

@yurishkuro Do you feel that we can merge this OTEP? I believe I addressed your concerns, and any remaining concerns or matters of opinion can be ironed out as we move on to update the specification.

The recent sampling SIG meeting, summarized in open-telemetry/opentelemetry-specification#1819, found little objection over the contents of this OTEP. We seem to have reached a consensus about the use of TraceIDRatio sampling.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 21, 2021

@oertl Thank you for posting your research. Partial Sampling is a fantastic addition to the state of the art, and now I understand why you've been proposing power-of-two sampling rates. The "negative base-2 logarithm" topic is mentioned in this OTEP already, and I can add more current information if that will help us merge this and move on.

I take it you would like to see the specification support a directly-encoded span probability, using this approach. It would be an unsigned integer field in the protocol containing the base-2 logarithm of the adjusted count (i.e., the negative base-2 logarithm of the inclusion probability):

0: The span's adjusted count is 1
1: The span counts for 2
2: The span counts for 4
3: The span counts for 8

So, perhaps the Span field should be named log2_adjusted_count? I will support that as an optimization, if there is a general agreement. Setting the log2_adjusted_count field to X is equivalent to setting the proposed span attribute sampling.adjusted_count to 2^X.

However, this field will not support tail sampling at arbitrary rates, which is an application with known potential and existing uses. In that sense, the proposed use of a span attributr sampling.adjusted_count seems less contentious and more flexible (e.g., supports integer and non-integer adjusted counts).

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro Do you feel that we can merge this OTEP?

So I have two concerns:

  1. The propagation story is not completely clear, see comments inline.
  2. There is only one approval so far. I think this topic needs a lot more attention, especially from different vendors who already had to deal with sampling.

text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
text/0148-sampling-probability.md Outdated Show resolved Hide resolved
@jmacd
Copy link
Contributor Author

jmacd commented Jul 22, 2021

@yurishkuro Thank you. I will address your questions with one more round of work on this PR, however I want to highlight that the result of this OTEP is largely negative on propagating probability, and while I have given a proposed/draft specification for it I do not expect we will move forward with a specification that involves propagating anything. We are left with a desire to complete the TraceIDRatio sampler (which is to achieve consistent pseudo-randomness) and (independently) to know when traces are complete.

I believe this OTEP has met the standard for an OTEP, discussed here. This OTEP is now too long to review and another PR is needed. The next step will be, I think, another PR to revise this document, to focus specifically on the questions you've raised. Another piece of text is needed to fully document how to count spans based on the adjusted count that is recorded in the span.

Extrapolation from re-assembled trace is possible as long as the count is captured in the root.
should degrade nicely into this legacy mode

I don't see any part of this proposal that would prevent legacy counting strategies. If you agree that this proposal needs more editing, please approve so we can merge it and open a new PR. Thanks!

@yurishkuro
Copy link
Member

I believe this OTEP has met the standard for an OTEP, discussed here.

Not sure I agree with that. The process we agreed for OTEPs is that we do not capture OTEP status in the document, but use GitHub status as a proxy. I.e. approved & merged PR means approved OTEP, which in turn means it is the official position for the project and only pending mechanical translation into Spec changes. The link you provided does not say that a PR can be merged with intention for revisions via another PR, it actually says the opposite that the old PR should be closed if revisions are needed. Otherwise we are left with officially looking document in OTEPs that does not reflect the agreements.

Concretely, if you do not believe that we should proceed with implementing probability propagation, then I would move that text into a Discussion area with pros/cons, and not include it in the normative portion that recommends the actual changes to the spec (which, incidentally, most of my comments are about since the proposed changes are inconsistent).

@jmacd jmacd closed this Jul 22, 2021
@jmacd jmacd reopened this Jul 22, 2021
@jmacd
Copy link
Contributor Author

jmacd commented Jul 22, 2021

If I remove the entire proposed specification section of this document, would you approve? The goal was to outline our options, and I included the specification text as an example. It states ("For example:") before that section of text. It's the least interesting part of this document to me, what it really did was show us how we do not want to propagate probability. I'm interested in updating this OTEP with a minimal summary of what we concluded and merging it, not continuing to address minor points in what is ultimately not a specification document.

@yurishkuro
Copy link
Member

If you want to remove the proposed normative changes section, then it ceases to be an enhancement proposal (OTEP) imo. But I wouldn't mind merging it as it's a great read that can be referenced from other places.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 22, 2021

I think I will prefer to close and re-open a new PR. I will leave this open until then.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 23, 2021

@yurishkuro I wonder if we can salvage the bulk of this PR by removing the normative text that I had and specifying much less. My original goal with this OTEP was to specify a foundation, which is the basic idea that we can record an adjusted count when sampling to convey probability, and that adjusted count is a good way to convey that because users intuitively understand how to count but do not intuitively understand how to count inverses.

Thus, I've replaced all the text with a proposal for a new trace/semantic_conventions/sampling.md file with two attributes, sampler.adjusted_count and sampler.name. I think this is something we could all agree on, but it leaves a lot to be specified. It doesn't tell anyone how to modify the Samplers, but at least it would let users of OTLP who are not yet using OTel SDKs convey sampling information.

Originally I had proposed only an adjusted_count attribute, not the name attribute, but @oertl has shown why knowing which sampler computed the probability matters in some cases. Moreover, knowing the sampler name addresses an ambiguity that was discussed above: If the probability is not propagated then we cannot know the inclusion probability of a span written by the Parent sampler. We also should not presume the adjusted count is 1 in this case, so we have several possibilities:

Parent sampling, unknown probability:

sampler.name=Parent

Parent sampling, known probability (for 0.1 probability)

sampler.name=Parent
sampler.adjusted_count=10

TraceIDRatio sampling (for 0.1 probability)

sampler.name=TraceIDRatio
sampler.adjusted_count=10

And for no sampler, no attributes are needed. Spans carrying these attributes can be correctly counted, the only problem is for Parent sampling with unknown probability. At least now we have clearly identified when the adjusted count is missing. See the replacement text here: c4c06cd

I will take all the discussion about how to modify Samplers to produce these attributes as well as how to optionally propagate inclusion probability into another PR. Thanks.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 23, 2021

Why not always record sampler name?

The AlwaysOn sampler is (afaik) equivalent to no sampler, so I don't see any use for the name in that case. The adjusted count is 1 with or without the AlwaysOn sampler.

avoid inventing new naming scheme here

Yes, I see this point and feel ambivalent about it. I do not see any viable uses for the existing Description other than to configure a composite sampler in the SDK. It has the appearance of something that can be logged and parsed, but I wouldn't use it.

If the sampler description is "jaeger_remote", for example, it tells me nothing useful. The piece of information that is needed is not the composite policy that was configured as a Sampler, it is the effective "leaf" Sampler that was selected. If the Jaeger remote sampler selects a TraceIDRatio policy, that's what I want to know.

For the TraceIDRatio description, it encodes a floating point probability which is also (IMO) not a great representation for conveying adjusted count. The specification talks about how much precision should be logged for example, which only adds to the confusion. If I am logging 1 in 1024 spans, which @oertl would prefer we represent using the number 10 (i.e., 1 byte to say sampling probability is 1/2^10), the TraceIDRatio description will read "0.000977" (i.e., 8 bytes to encode a number with the addition of a 0.05% error, since 1/0.000977 = 1023.5).

As a vendor, we aren't actually concerned with knowing the sampler name. As this OTEP hopes to convince us, we can count spans without anything more than an adjusted count. I had two reasons to include it here:

  1. @oertl gave a convincing reason why knowing that TraceIDRatio was used for example, as opposed to a tail-based sampling scheme
  2. We had identified a gap for the Parent sampler, where we may or may not know its probability. Knowing the sampler avoids any ambiguity here and leaves the door open for specifying how to propagate the Parent probability.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 23, 2021

See the related OTEP #168

Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 27, 2021

I will re-open this OTEP with a fresh PR. This is too long to review.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 27, 2021

This is replaced by #170.

@jmacd jmacd deleted the jmacd/sample branch July 27, 2021 19:36
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.