-
Notifications
You must be signed in to change notification settings - Fork 164
Introduce sampling score and propagate it with the trace #135
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the overall direction of this, and I got inspired by one idea from this OTEP.
We can have only one "ProbabilitySampler" that allows to configure how the score is calculated:
- Using a deterministic hash as the current TraceIdRatioBased
- Using a
probability.score
generate at the root of the trace and propagated via tracestate.
What I would do different?
- I would encourage to rewrite the initial motivation part to clarify why using the "TraceId" as the source of the score is not good enough.
- Propose to have a "ProbabilitySampler" that allows customizing how the score is calculated and propagated (will support the
TraceIdRatioBased
as well as the new proposed way).
text/trace/0107-sampling-score.md
Outdated
Score is also exposed through span attributes. Vendors can leverage it | ||
to sort traces based on their completeness: the lower the value of score is, | ||
the higher the chance it was sampled it by each component. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we always need to have a score? How do we calculate the score if a custom sampler is used like always on? How does the score interact with custom samplers that do rate based sampling 1 trace every 1 second? How does the backend know when score was used to do sampling decision or just ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't always need to have a score. We need it when there are multiple possible sampling algorithms in the same app.
Example when we need it:
Service A uses Azure Monitor SDK, uses hash1(trace-id), assigns score -> Service B uses OTel with hash2(trace-id)
-> Service C uses AzureMontior with hash1(trace-id)
Services A and C use Azure Monitor SDK for years and it's unreasonable to ask them to upgrade to OTel.
It is reasonable to ask them to update to the new version of Azure Monitor that supports score.
We don't need score when only OTel is used everywhere (or the same algorithm).
So, we'll tell our customers to configure ExternalScoreSampler
and fallback to OTel TraceIdRationBased algorithm if score is not in the tracestate. Default case for OTel users does not change, this is opt-in behavior.
Custom samplers that use 1 trace every 1 second are not compatible with score - they don't pursuit consistent sampling goal anyway and applying them together with ExternalScoreSampler
seems to be a misconfiguration.
text/trace/0107-sampling-score.md
Outdated
- if it's not there, invokes `ProbabilitySampler`, which calculates score | ||
and populates it on the attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I've tried to follow this approach in the new version. To make it clean, I suggest separating score generation from
sampling.
Samplers can use score generation, and also attach attributes, change tracestate.
If we allow falling back to another sampler:
- we need to do score propagation back from one sampler to the ExtenralScore one
- coordinate possible multiple changes of tracestate and attributes
If we have a layer responsible for score calculation (random, deterministic of any sort), we can make it much simpler and cleaner.
thanks for the review @bogdandrutu. I believe I addressed your questions/comments in the spec. |
I think
which you already created open-telemetry/opentelemetry-specification#856 for should be done separately. That change is IMHO the most important one (breaking in many interface-based languages), it is also the simplest one and the other parts can be added later. |
I agree that other parts could be added later implementation-wise. |
@specs-trace-approvers what could be the next steps to move this spec forward? |
I think we should definitely do open-telemetry/opentelemetry-specification#856 before GA but I suspect that most don't want to take the time to read and understand the whole OTEP before GA. CC @open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers |
@lmolkova You could maybe move this forward by making a spec PR for open-telemetry/opentelemetry-specification#856. Then after that is merged, you can make the OTEP a bit shorter 😃 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I am ok with the direction of this OTEP
Fixes #856 ## Changes Added `Tracestate` to `SamplingResult` Related [oteps](https://github.com/open-telemetry/oteps) open-telemetry/oteps#135
Incorporating changes from open-telemetry/opentelemetry-specification#988
`tracestate` so downstream services can re-use it to make their sampling | ||
decisions *instead of* re-calculating score as a function of trace-id | ||
(or trace-flags). This allows to configure sampling algorithm on the first | ||
service ans avoid coordination of algorithms when multiple tracing tools are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: ans -> and
I was wondering what the status of this is? @lmolkova @bogdandrutu , is it blocked by #148 ? |
@ericmustin I remember the history of this thread, but I believe @lmolkova has moved on from OpenTelemetry. The conclusion from #148 is that encoding inclusion probability (how we count sampled events) is different from how we ensure that traces are complete when sampled by some scheme. OTel's current specification includes only one scheme for ensuring that sampled traces are complete, the W3C "is-sampled" flag (part of TraceFlags, in SpanContext). May we close this issue? |
@jmacd I prefer to keep this issue open. Propagating the sample rate or adjusted count from the root span, where the sampling decision was made, allows to set the adjusted count for a span properly as already mentioned in #148 (comment). This would allow extrapolation of individual spans without having to retrieve this information from the root span. |
@oertl OK, let's keep discussing. My understanding of this proposal in this PR is that it does not propagate inclusion probability. I am definitely in favor of solutions for propagating inclusion probability. The solution in this PR talks about the use of an explicit random number for use coordinating different-sampling schemes w/o addressing inclusion probability. I'm not familiar with the system @lmolkova describes here. |
@jmacd, you are right this is a different issue. However, it tackles a still unsolved problem with regard to sampling. The trace ID ratio based sampler, only works properly if the trace ID is uniformly distributed, which is AFAIK, not well specified. In order to get sampling right, if you do not know the distribution of the trace ID and if you just know that the trace ID is unique, you need to hash the trace ID. A high-quality hash function, is able to transform the trace ID into a uniformly distributed hash value which can then be used for ratio-based sampling. A list of fast, non-cryptographic, high-quality hash function can be found at https://github.com/rurban/smhasher#summary. Apart from the overhead, a further problem is that the hash function needs to be consistently implemented for all supported languages. To save hashing costs, the root span may calculate the hash value and propagate it with the trace ID, such that child spans do not need to hash the trace ID again and again. If the hash value is only calculated once at the root span and never again for child spans, because they are always using the same precalculated hash value from the parent, one could replace the hash value computed at the root span by any random number. This is exactly what is proposed here where the generated random value is called sampling score. It is essentially nothing else than a secondary trace ID that is uniformly distributed. By the way, if ratio-based sampling is restricted to sample rates that are powers of one half as proposed (see #148 (comment)), it would be sufficient to propagate just the number of leading zeros of the sampling score, because this is the only information needed for the sampling decision in this case. |
I'm going to close this OTEP as I'm not working on sampling anymore and it doesn't look like there is much interest. If anyone is interested in it, feel free to take any parts of it, and happy to share any context. |
This goals of this OTEP are, I believe, addressed in #168. |
This is a reincarnation of #107.
The spec described consistent sampling between existing tracing tools that use vendor-specific sampling algorithms and OTel and enables an upgrade path that vendors may use to onboard customers on otel without breaking cross-tracing-tool traces.
The delta OTEP introduces is relatively small:
SamplingResult.Tracestate
field: sampler should be able to assign anew tracestate for to-be-created span
sampling.score
attribute on span (let's discuss attribute vs field).[Update] after review
SamplingScoreGenerator
that is capable of calculating float score from sampling parameters.It has
TraceIdRatioGenerator
,RandomGenerator
, and possible other implementations.TraceIdRatioBased
sampler to use corresponding generator and serve as generic probability sampler with configurable score generation approach.ExternalScoreSampler
implementation ofSampler
. It's created with probability value and implementation ofSamplingScoreGenerator
.With this OTEP we are trying to come up with long-term plan for interop between Microsoft SDKs and services and OpenTelemetry-enabled apps. We want to make sure this solution exists, while implementation can wait.
The specification changes can wait till post-GA, however, I've heard several requests for p1 before GA (ability to modify
tracestate
by Sampler) - open-telemetry/opentelemetry-specification#856.