-
Notifications
You must be signed in to change notification settings - Fork 779
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTEL_TRACES_SAMPLER
settings should apply in real time
#5737
Comments
Not a direct answer to this issue, but sharing some related thoughts.
|
@cijothomas those options would be interesting indeed. They might not solve the issue for us necessarily but I could see we potentially leveraging them in some scenarios. The Collector reference is something I forgot to mention. We are using an OpenTelemetry Collector in our infrastructure, but currently we are only doing sampling at the application level and not in the collector itself. For now, we've opted into keeping sampling in the apps as that is substantially more efficient than having everything pushed into the collector only for it to be ignored there. |
Just FYI for anyone landing on this PR @samsp-msft contributed a RateLimitingSampler over on contrib. |
That's a great addition! It would be cool if there were further ways to configure the rate limiting aspect. For example, having a means to prioritize or "isolate" categories that are rate limited independently of each other. As an example, I don't want to sample lots of success traces only to then have some error ones ignored by the rate limiting. Ideally, it would be nice to be able to configure custom "categories" or "filters" that would group traces together and apply rate limiting to those groups in an independent manner. Another common scenario I could see is regarding errors. Imagine I get a burst of the same error in quick succession, but among those "same" errors there are a few outliers of different, interesting errors that I'd really like to not be ignored. General rate limiting would not be aware of those differences and might just use all the duplicate errors to rate limit the other different errors from the stream. TL;DR, there should be some sort of mechanism to decide whether a given trace is a "good representative" or not: if there are a bunch of nearly identical traces already, then it would be considered "not representative", whereas if it was a rare error, it would be considered a "good representative" and not be filtered out. |
Such advanced sampling is generally not possible at OTel Sampler level, as it is head-based - i.e at the time sampling decision is made, it is unknown if the trace will succeed/fail/fail-terribly! Tail-sampling can be used to achieve such advanced capabilities. (There are pros/cons in doing it inside process vs doing it in a local collector/agent) There are few example in the repo showing this advanced trail-sampling in the sdk itself. There is no OTel prescribed "correct" way - OTel offers various options, leaving it up to end users to figure out the right combination for their needs. |
@julealgon, as @cijothomas says, what you are looking for is tail sampling. However to be most effective, I believe its best implemented out of proc in something like the OTel collector. The reason is that by its nature, tracing is distributed. A decision in process A should be influenced by the decisions in processes B, C, D & E, and the retention or dropping of the trace is ideally a uniform decision and should affect the results from all the processes. Trying to coordinate that between them is not really possible today. The solution is to use the OTel collector, send all the traces to it, and then have it apply rules for which traces to keep or discard. The last time I looked, the rules for the collector were pretty flexible, but I was disappointed that there wasn't a coordinated story for dealing with log entries - IMHO you want to filter trace associated log entries in coordination with the traces, making the same retain or drop decision as for the traces. |
@samsp-msft since you brought this up, I do have a question (perhaps off-topic but I'll go for it anyways): wouldn't this approach to sampling be substantially less performant than sampling at the application level? It basically forces us to completely disable sampling in all apps, then send full payloads to the collector (including the network overhead and all that), only to then ignore a significant portion of those spans in the collector. I've been a bit afraid of switching to collector-based sampling because of this fear, although we've not yet tried it. I suspect it will also make it much harder to control sampling percentages based on different projects, as we'd have to constantly update the collector's yaml config file to change sampling rules, instead of just changing an Azure AppConfiguration value to do so for a specific project without any deployment or code changes.
I assume you mean that the sampling processor is configured independently for traces and logs, right? For now, that wouldn't affect us particularly, since we don't intend to enforce any sampling for logs. But I do get what you mean. |
There are pros and cons for the different sampling techniques and you're hitting on some of them there. In debugging there are 2 main types of data you want to keep/sample, slow/interesting requests and those with errors. Unfortunately neither of those are ones you can determine at root span creation time (I.e. head sampling). Using an out of band sampler that buffers spans and makes the decision based on all the spans in the trace is the best way to do this (i.e. tail sampling), but it has drawbacks. The biggest is memory, in order to tail sample, you need to buffer a lot of data and therefore store a lot in memory. This means the samplers can be large. The next is bandwidth, you need to send all spans to the same instance. In AWS, cross-AZ transfer isn't free, so that can get very costly. In Azure that's free, but cross-region is not. So architecture this right is crucial (I've written posts about that). The final one is the delay. In order to get all the spans in the trace to make the decision across all services, you need to wait for that. Because this is generally non-deterministic, you end up with 15-30 second delays. In platforms like AppInsights that have a 90 second ingest SLA, that can be mostly non-noticeable, in the backends with seconds ingest it's a potential issue. Head sampling is generally used in combination with tail sampling at the larger end of traffic. We have a customer that's currently sampling at 1/20 at the head, then 1/200000 at the tail for instance because their volumes are that high. Depending on your scale, and how you do tail sampling (rules based like the collector or more dynamic/adaptive) you can get statistical accuracy with much smaller volumes, but that does require tail sampling. I think ultimately, this isn't something that we would implement in .NET, so I think this issue should probably be closed out. |
@martinjt I was with you the whole time until this very last statement.
I don't see how tail vs head sampling has anything to do with this proposal here. I would still want to be able to configure head sampling dynamically regardless of all the other points in the discussion. The fact that I have to restart my applications today for the sampler to take effect is less than ideal, particularly when I just want to change the sampling percentage. |
I'm not sure how you would change environment variables of a running process to be honest, is that actually possible? The samplers that are part of the otel sdk don't suggest this is something that you can do, so I don't think that it's something we would attempt in core. For contrib, I could see someone creating something a bit like the Jaeger remote sampler that implements this kind of functionality. What I'm saying is that it wouldn't be part of core. |
Settings on the .NET implementation are also honored via the |
I did add a version of the Jaeger rate based sampler to contrib - https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Extensions#ratelimitingsampler. This is typically used with the parent-based sampler to set the Recorded bit for whether a span should be emitted or not. That flows through the chain of It takes a rate at startup, but could easily be modified so that you could adjust the rate dynamically, or go a step further and have it pull the value from config and listen to config changes. |
The bigger issue I think is that it would deviate from other implementation in other languages/frameworks. I know that .NET has already deviated in a lot of ways IConfiguration support is definitely an interesting one, but it's kept true to treating it the same as the other implementations. I think that adding in auto-changing this based on the standard settings would be an interesting step to take, but ultimately a deviation and something that would probably take a lot of work to build and maintain. There is nothing stopping someone adding it to contrib though, as a "auto updating" version of the same code. |
I don't think it would be as complex as you imply there. Basically, it would need to have some lifetimes changed a bit and use Of course, all up to the team. I may be able to contribute but that's not guaranteed. |
My worry would be that this is the hotpath, and massively multi-threaded, but further to that, considering the developer experience. From the top of my head, it would be whether this is what someone actually expects if they come from other languages, and whether there's a consistency issue across multiple service (I.e. how you'd monitor whether it's propagated to everywhere if you're using key vault, vault etc.) I think there's probably a lot to consider since it's deviating from the norm. My point is, if the maintainers won't accept the responsibility to accept this here, we should close this down. However building a variant in contrib that you maintain is probably valuable. |
This is a totally fair concern. There's been complains in the
I don't think this should be a concern. .NET has its own mechanisms that the community is used to, namely in this case
Now you are listing a separate concern that is independent from OpenTelemetry IMHO. Anyone who consumes settings via I don't think this should be a concern at all to a specific usage/library. It's up to the consumer how/if they want to leverage configuration reload and each
I don't agree it "deviates from the norm". It would be fully idiomatic .NET. It might "deviate" from how it is handled in another language, but that's because that other language perhaps doesn't have a widely used, native dynamic configuration mechanism that supports setting reloads without restarting the application. This to me again is only taking advantage of what .NET has to offer.
Fair. I think it would still make sense as part of the standard implementations to take configuration reloading into account, but this is a subjective matter for sure. |
Your points are valid if we consider this as a purely .NET solution, used by developers who work in one language. It becomes more nuanced when you start to think more about how this would differ in a polyglot system where there are multiple entry points across .net, python, js etc. When .NET's configuration updates, but none of the others do. To be clear, those were thoughts off the top of my head as just some initial thoughts that would need to be considered if this was to be taken forward. In regards to being idiomatic for C# devs, that's a thing I've always got behind. I would definitely argue that configuration updating unchange is not in anyway an idiomatic .net behaviour, for over(?) 2 decades changing config files caused application restarts, nevermind it not updating the inmemory representation, and most libraries and applications I've encountered don't implement Further to this, I think the idea of changing things while they're running may become part of the Otel configuration file working group (perhaps even part of OpAmp). That will likely cover whether updating samplers should happen without app restarts. All that to say that unless it's going to be accepted by the maintainers, it's a fruitless debate. If it is, then I'll happily spend thought on what potential gotchas there might be. |
Package
OpenTelemetry
Is your feature request related to a problem?
The
OTEL_TRACES_SAMPLER
andOTEL_TRACES_SAMPLER_ARG
configuration settings are currently only taken into consideration when theTracerProviderSdk
is first built, which only happens a single time in an application's lifetime.This means that if one wants to change sampling percentage, or switch to a different sampler, an application restart is required for the settings to be picked up.
We have many applications today (dozens) that are leveraging OpenTelemetry and we want to fine tune sampler/percentage due to cost concerns with our observability tooling, but requiring a full restart of the applications is prohibitive for us as it results in downtime to consumers in most cases. This means we are currently forced to wait for deployment windows (which happen every 2 weeks or so by default in our process structure) to perform such fine tuning. This severely limits us and doesn't allow us to react to any problems quick enough.
What is the expected behavior?
We'd like for both of the
OTEL_TRACES_SAMPLER
andOTEL_TRACES_SAMPLER_ARG
settings to be monitored for, and reapply in real time when changes occur: if theARG
setting changes, the current sampler should take it into account for any future sampling calls after that, and if theSAMPLER
setting changes, it should complete any outstanding calls on the current sampler and swap to the new sampler as soon as possible.Which alternative solutions or features have you considered?
We don't have any alternative at the moment. Only a full application restart triggers a new evaluation and application of the settings.
Technically, I believe it would be possible to achieve real-time switching of the sampler today from the outside by implementing this logic as part of a custom sampler decorator that relies on
IOptionsMonitor
,IOptionsSnapshot
or change tokens to monitor for changes in the settings, but unfortunately the implementation today doesn't rely on theIOptions
framework, so this would require quite a bit of custom code to support and would force us to basically reimplement the checks against these variables ourselves. We've not pursued that option yet, but will have to consider it in case there is pushback from the OTEL team in making it real-time natively.Additional context
The OTEL specification doesn't explicitly mention whether or not the settings should apply immediately, but they also don't mention anything on the contrary.
I believe it is also unintuitive for the settings to require a restart today, as that is not mentioned anywhere. Sampling decisions are made on a per activity basis, so swapping the sampler should be doable.
Additionally, this goes against the design of metrics, which can be toggled on and off dynamically today via configuration and reflect instantaneously. Having tracing-related configs also reflect in real-time would be a consistency upgrade there.
FYI @erricrice
The text was updated successfully, but these errors were encountered: