Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to set duration threshold for spans #259

Closed
felixbarny opened this issue Apr 20, 2020 · 9 comments
Closed

Add option to set duration threshold for spans #259

felixbarny opened this issue Apr 20, 2020 · 9 comments

Comments

@felixbarny
Copy link
Member

felixbarny commented Apr 20, 2020

Description of the issue

Currently, the experience of our APM solution is not ideal when monitoring an application that ends up generating thousands of spans per request. In elastic/apm-agent-java#1094, a feature has been requested to drop fast executing spans.

We currently have a default limit of 500 spans in the agents and 1000 spans in the UI to prevent overwhelming the system with too many spans. However, that just cuts off at a certain point without considering which spans are important and which are not.

Proposal

Introduce configuration option span_min_duration
Default: 0ms (don't discard any spans)

Sets the minimum duration of spans. Spans that execute faster than this threshold are attempted to be discarded. The attempt fails if they lead up to a span that can't be discarded.

Spans that propagate the trace context to downstream services, such as outgoing HTTP requests, can't be discarded. Additionally, spans that lead to an error or that may be a parent of async operations can't be discarded.

However, external calls that don't propagate context, such as calls to a database, can be discarded using this threshold.

The metrics for discarded spans are still considered in breakdown metrics.

Deprecating other threshold settings

A note on the Java agent-specific trace_methods_duration_threshold and profiling_inferred_spans_min_duration options. These options can be deprecated as the new option can also be applied to spans created via trace_methods or inferred spans.

Limitations

Spans that propagate context to downstream services can't be discarded

We only know whether to discard after the call has ended. At that point, the trace has already continued on the downstream service. Discarding the span for the external request would orphan the transaction of the downstream call.

An argument could be made that this is not a big problem as the trace view then just won't show the downstream transaction. But as this would introduce inconsistencies (e.g. the transaction can be seen in the transaction details view of the service but when viewing the full trace it disappears) I suggest not allowing this for now.

Intermediate spans

Discarding a single span that has both a parent and a child span is not possible as it would lead to orphaned child spans.

However, a whole subtree of spans can get discarded if all spans within that tree are requesting to be discarded. This means that if a leaf of the tree can't be discarded because it propagates context downstream, all spans leading up to it can't be discarded. If the leaf is a non-context propagating span, such as a manually created span or a SQL call, the subtree can be discarded.

Async spans

If the context of a span is propagated to another thread, it may not be discarded. That is because the other thread might create child spans of the first span even if it has already ended.

What this is not about

For now, we won't provide fine-grained control over what kind of spans are affected by the threshold as this gets quite complicated for both users and implementers. For example, to set a threshold only for db.sql spans or to set different thresholds for Elasticsearch spans and spans created by manual instrumentation. This could look like span_min_duration: *=10ms,db=5ms,db.elasticsearch=0ms to discard spans faster than 10ms, DB spans faster than 5ms, except for MySQL spans that should always be kept.

What we are voting on

Whether the option and its semantics are sensible

This does not have to be implemented right away by all agents. Prioritizing the implementation is a decision each agent can take on their own but the Java agent plans to implement this soon.

Vote

Agent Yes No Indifferent N/A Link to agent issue
.NET
Go
Java
elastic/apm-agent-java#1094
Node.js
Python
Ruby
RUM
@hmdhk
Copy link
Contributor

hmdhk commented Apr 21, 2020

An alternative solution that we have implemented in the RUM agent is to merge small and similar spans into a single one. We also capture the number of merged spans in the span name, i.e. 7x span-name. The algorithm looks at the span name, span type and the ratio of the span duration to the transaction duration. This, of course, requires buffering the spans which the RUM agent needs to do regardless of this.

The advantage of this approach is that if the performance of the application is affected by many small spans, this would be visible in the UI, while just discarding these spans might become misleading.

@basepi
Copy link
Contributor

basepi commented Apr 21, 2020

I think this is an excellent idea, though I agree with @jahtalab that it would be good to call out these dropped spans in some way. The user may want to know that those spans occurred and were truncated, in much the same way that we display truncations in local variable collection for errors.

I think it would probably be sufficient to just include a transaction metadata flag telling the APM Server that we did drop spans due to minimum duration, and display that fact to the user in the UI. Since the feature will be off by default, the user will immediately understand what's going on.

@axw
Copy link
Member

axw commented Apr 22, 2020

I like the feature in theory, so +1. I think for Go the async handling will probably require a lot more manual handling than what's described, but I think it's still achievable.

I think it would probably be sufficient to just include a transaction metadata flag telling the APM Server that we did drop spans due to minimum duration, and display that fact to the user in the UI. Since the feature will be off by default, the user will immediately understand what's going on.

We already have transaction.span_count.dropped. Seems reasonable to go a little more fine-grained, so we could say something like "N fast spans dropped".

I also like @jahtalab's described approach for combining spans. I think it's common in logging UIs (among others) to have something like:

  • foo
  • bar (repeated 999 times)
  • baz

I'd treat that as a separate issue, but I think it would be ideal if we first did this combination and then applied span_min_duration to the combined span. That way if you have something like an N+1 SQL pattern where the individual spans are very short but together take a lot of time, we would keep the combined SQL span.

@felixbarny
Copy link
Member Author

I think for Go the async handling will probably require a lot more manual handling than what's described, but I think it's still achievable.

Could you elaborate a bit? Would your idea help the Java agent to discard spans whose context is propagated to another thread but no child span is created?

We already have transaction.span_count.dropped. Seems reasonable to go a little more fine-grained, so we could say something like "N fast spans dropped".

++

What about transaction.span_count.dropped_fast_spans? So that both properties begin with dropped which might make it more discoverable.

This, of course, requires buffering the spans

The core assumption of backend agents that spans can be sent right after they have finished might not be as easy to change. It also helps to reduce memory utilization.

Having said that, I could imagine us buffering spans as long as their direct parent is not ended. It might be a problem for very long executing transactions but otherwise, it might be a reasonable tradeoff. It would allow to identify and de-duplicate N+1 queries, for example.

The advantage of this approach is that if the performance of the application is affected by many small spans, this would be visible in the UI, while just discarding these spans might become misleading.

++

I'd treat that as a separate issue

++

I don't see @jahtalab's suggesting contradicting with this setting. However, it seems like it's more complex, especially for backend agents that currently don't buffer.

What the de-duplication approach would not solve, for example, is when there are lots of fast spans with alternating names or types.

I'd suggest moving forward with this now and creating an issue the de-duplication approach.

@axw
Copy link
Member

axw commented Apr 22, 2020

I think for Go the async handling will probably require a lot more manual handling than what's described, but I think it's still achievable.

Could you elaborate a bit? Would your idea help the Java agent to discard spans whose context is propagated to another thread but no child span is created?

I don't have a specific idea of how it would work yet, I just know that detecting transfer of context between threads is not a viable approach in Go. This bit is pretty language specific I think. For Go I would probably make it so that spans are assumed to be synchronous by default, and callers of the API could explicitly mark them as async.

What about transaction.span_count.dropped_fast_spans? So that both properties begin with dropped which might make it more discoverable.

Sounds fine to me. Maybe just dropped_fast since the parent is span_count (though it kinda sounds like they were dropped quickly? 😄). Alternatively, span_count.dropped_min_duration to tie it back to the configuration.

Having said that, I could imagine us buffering spans as long as their direct parent is not ended. It might be a problem for very long executing transactions but otherwise, it might be a reasonable tradeoff. It would allow to identify and de-duplicate N+1 queries, for example.

Another approach: just buffer one span per thread. When the next span is ended, then we would either combine it with the previous one, or send the previous span and buffer the newly ended one. When the transaction is finally ended, send the buffered span if there is one, and disable buffering/span combination thereafter.

However, it seems like it's more complex ... I'd suggest moving forward with this now and creating an issue the de-duplication approach.

++

@felixbarny
Copy link
Member Author

I like span_count.dropped_min_duration

@felixbarny
Copy link
Member Author

Thinking about it more, there's an advantage of using span_count.dropped instead of span_count.dropped_min_duration.

We already show this message in the UI

Screen Shot 2020-04-24 at 11 47 49

And link to the dropped spans docs

These are our main options:

  1. Use span_count.dropped for both transaction_max_spans and span_min_duration and adjust the dropped spans docs.
  2. Do 1. but additionally, introduce span_count.dropped_min_duration alongside a more targeted call out. We could think about introducing span_count.dropped_max_spans.
  3. Introduce span_count.dropped_min_duration. Spans dropped due to span_min_duration don't affect span_count.dropped. Add another call out to Kibana.
    Using a new agent that sets span_count.dropped_min_duration with an older stack version would not show any call out at all.

My vote goes for 1. as it's the most simple one (does not require changes in APM Server and Kibana) and we can always do step 2 later.

@axw
Copy link
Member

axw commented Apr 24, 2020

@felixbarny ah, good catch. I concur: 2 sounds good to me, but I think it would be fine to start without it.

@felixbarny
Copy link
Member Author

Superseded by #314

@felixbarny felixbarny linked a pull request Aug 26, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants