-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to set duration threshold for spans #259
Comments
An alternative solution that we have implemented in the RUM agent is to merge small and similar spans into a single one. We also capture the number of merged spans in the span name, i.e. The advantage of this approach is that if the performance of the application is affected by many small spans, this would be visible in the UI, while just discarding these spans might become misleading. |
I think this is an excellent idea, though I agree with @jahtalab that it would be good to call out these dropped spans in some way. The user may want to know that those spans occurred and were truncated, in much the same way that we display truncations in local variable collection for errors. I think it would probably be sufficient to just include a transaction metadata flag telling the APM Server that we did drop spans due to minimum duration, and display that fact to the user in the UI. Since the feature will be off by default, the user will immediately understand what's going on. |
I like the feature in theory, so +1. I think for Go the async handling will probably require a lot more manual handling than what's described, but I think it's still achievable.
We already have I also like @jahtalab's described approach for combining spans. I think it's common in logging UIs (among others) to have something like:
I'd treat that as a separate issue, but I think it would be ideal if we first did this combination and then applied |
Could you elaborate a bit? Would your idea help the Java agent to discard spans whose context is propagated to another thread but no child span is created?
++ What about
The core assumption of backend agents that spans can be sent right after they have finished might not be as easy to change. It also helps to reduce memory utilization. Having said that, I could imagine us buffering spans as long as their direct parent is not ended. It might be a problem for very long executing transactions but otherwise, it might be a reasonable tradeoff. It would allow to identify and de-duplicate N+1 queries, for example.
++
++ I don't see @jahtalab's suggesting contradicting with this setting. However, it seems like it's more complex, especially for backend agents that currently don't buffer. What the de-duplication approach would not solve, for example, is when there are lots of fast spans with alternating names or types. I'd suggest moving forward with this now and creating an issue the de-duplication approach. |
I don't have a specific idea of how it would work yet, I just know that detecting transfer of context between threads is not a viable approach in Go. This bit is pretty language specific I think. For Go I would probably make it so that spans are assumed to be synchronous by default, and callers of the API could explicitly mark them as async.
Sounds fine to me. Maybe just
Another approach: just buffer one span per thread. When the next span is ended, then we would either combine it with the previous one, or send the previous span and buffer the newly ended one. When the transaction is finally ended, send the buffered span if there is one, and disable buffering/span combination thereafter.
++ |
I like |
Thinking about it more, there's an advantage of using We already show this message in the UI And link to the dropped spans docs These are our main options:
My vote goes for 1. as it's the most simple one (does not require changes in APM Server and Kibana) and we can always do step 2 later. |
@felixbarny ah, good catch. I concur: 2 sounds good to me, but I think it would be fine to start without it. |
Superseded by #314 |
Description of the issue
Currently, the experience of our APM solution is not ideal when monitoring an application that ends up generating thousands of spans per request. In elastic/apm-agent-java#1094, a feature has been requested to drop fast executing spans.
We currently have a default limit of 500 spans in the agents and 1000 spans in the UI to prevent overwhelming the system with too many spans. However, that just cuts off at a certain point without considering which spans are important and which are not.
Proposal
Introduce configuration option
span_min_duration
Default: 0ms (don't discard any spans)
Sets the minimum duration of spans. Spans that execute faster than this threshold are attempted to be discarded. The attempt fails if they lead up to a span that can't be discarded.
Spans that propagate the trace context to downstream services, such as outgoing HTTP requests, can't be discarded. Additionally, spans that lead to an error or that may be a parent of async operations can't be discarded.
However, external calls that don't propagate context, such as calls to a database, can be discarded using this threshold.
The metrics for discarded spans are still considered in breakdown metrics.
Deprecating other threshold settings
A note on the Java agent-specific
trace_methods_duration_threshold
andprofiling_inferred_spans_min_duration
options. These options can be deprecated as the new option can also be applied to spans created viatrace_methods
or inferred spans.Limitations
Spans that propagate context to downstream services can't be discarded
We only know whether to discard after the call has ended. At that point, the trace has already continued on the downstream service. Discarding the span for the external request would orphan the transaction of the downstream call.
An argument could be made that this is not a big problem as the trace view then just won't show the downstream transaction. But as this would introduce inconsistencies (e.g. the transaction can be seen in the transaction details view of the service but when viewing the full trace it disappears) I suggest not allowing this for now.
Intermediate spans
Discarding a single span that has both a parent and a child span is not possible as it would lead to orphaned child spans.
However, a whole subtree of spans can get discarded if all spans within that tree are requesting to be discarded. This means that if a leaf of the tree can't be discarded because it propagates context downstream, all spans leading up to it can't be discarded. If the leaf is a non-context propagating span, such as a manually created span or a SQL call, the subtree can be discarded.
Async spans
If the context of a span is propagated to another thread, it may not be discarded. That is because the other thread might create child spans of the first span even if it has already ended.
What this is not about
For now, we won't provide fine-grained control over what kind of spans are affected by the threshold as this gets quite complicated for both users and implementers. For example, to set a threshold only for
db.sql
spans or to set different thresholds for Elasticsearch spans and spans created by manual instrumentation. This could look likespan_min_duration: *=10ms,db=5ms,db.elasticsearch=0ms
to discard spans faster than 10ms, DB spans faster than 5ms, except for MySQL spans that should always be kept.What we are voting on
Whether the option and its semantics are sensible
This does not have to be implemented right away by all agents. Prioritizing the implementation is a decision each agent can take on their own but the Java agent plans to implement this soon.
Vote
The text was updated successfully, but these errors were encountered: