-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add span_min_duration spec #314
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally 👍 , just a couple small questions
docs/agents/span-min-duration.md
Outdated
such as a manually created span or a SQL call, | ||
the subtree can be discarded. | ||
|
||
### Spans that lead up to async spans can't be discarded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This limitation might make the proposal not very useful for Node.js or Go/goroutines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also @axw's comment from #259 (comment)
I don't have a specific idea of how it would work yet, I just know that detecting transfer of context between threads is not a viable approach in Go. This bit is pretty language specific I think. For Go I would probably make it so that spans are assumed to be synchronous by default, and callers of the API could explicitly mark them as async.
💔 Build Failed
Expand to view the summary
Build stats
Steps errors
Expand to view the steps failures
|
@beniwohli no rush but could you either resolve the conversations or let me know if anything is still missing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the discussion issue the RUM agent handles the original problem (handling many small spans) in a different way that results in merging small spans into one and show the count. I don't think it's necessary to introduce this behavior in the RUM agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
### Determining whether to report a span | ||
|
||
If the span is requested to be discarded and discardable, | ||
the `span_count.dropped` count is incremeted and the span will not be reported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that spans that are eventually dropped because of span_min_duration
should not be counted in span_count.started
. What implementation approach do you suggest?
- Increment
span_count.started
at the start of each span and then decrement it for the spans that are dropped because ofspan_min_duration
- Increment
span_count.started
at the end of each span when it's already known if the span is going to be dropped.
It seems that approach (1) has a disadvantage of potentially under-utilizing transaction_max_spans
quota.
But approach (2) is somewhat counteractive and can also lead to over-utilizing transaction_max_spans
quota.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All started spans count into that, no matter whether they are discarded.
We don’t have to decrement when discarding as we also count discarded spans in another variable. See the section „ Impact on transaction_max_spans
“ for more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR spans dropped because of transaction_max_spans are not counted in span_count.started. Only spans actually sent to APM Server are counted in span_count.started. So there is no overlap between span_count.started and span_count.dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only spans actually sent to APM Server are counted in span_count.started.
That's news to me. Do you have any sources for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was a bit confused it seems. You're right, the started count does not include dropped spans. The invariant total = started + dropped
should always hold true.
IMHO, started
is a bit of a misnomer. Internally in the Java agent, the variable is named reported
. It is incremented on span end if and only if the span is not discarded.
So it's the approach (2).
Note that the started
count has no influence on transaction_max_spans
. Only total
and dropped
is considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean by
Note that the started count has no influence on transaction_max_spans. Only total and dropped is considered.
Isn't it the case that started
should not be larger than transaction_max_spans
?
If you increment started
at the end of the span, you can get into situation where started
is already equal to transaction_max_spans
by the time the span ends so if you have to report the span (for example because its ID is referenced by an event that is already reported) you will have to increment started
beyond transaction_max_spans
- this is what I meant by
over-utilizing
transaction_max_spans
quota
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the limit is reached if this condition is true:
span_count.total - span_count.dropped >= transaction_max_spans
The total
count is incremented for every span, even for those that are dropped (due to transaction_max_spans
or span_min_duration
).
This condition does not use the started
count, hence it has no (direct) influence on transaction_max_spans
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If
The invariant
total = started + dropped
should always hold true.
and
the limit is reached if this condition is true:
span_count.total - span_count.dropped >= transaction_max_spans
doesn't it follow that the limit is reached if this condition is true?
span_count.started >= transaction_max_spans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but we probably won't be able to implement this in the Go agent without breaking changes - so likely not for a while.
@axw which part would require a breaking change in the Go agent? Is it even feasible to Go do fulfill the |
Not automatically AFAICS. We would need to encode that into the API. You would need to explicitly mark the span as asynchronous (or synchronous), making them non-discardable. This is the bit that makes it a breaking change, as otherwise we potentially start discarding things that shouldn't be. The other option is we assume all spans are non-discardable by default, and one always has to mark spans as discardable explicitly; we would then implement this for select auto-instrumented spans like database queries. Not ideal, but better than nothing. We could do this sooner rather than later. |
I'd like to suggest an edit to maintain all SQL and external spans in addition to context propagating spans no matter how quick those are. They are use in determining the connectivity/dependencies of a service and taking them away will lead to partial information(would service maps even show there was an external/db call if we drop the span?). The only spans that we should filter out are spans that represent some internal processing and are below a certain threshold. |
Additionally, spans that lead to an error or that may be a parent of an async operation can't be discarded. | ||
|
||
However, external calls that don't propagate context, | ||
such as calls to a database, can be discarded using this threshold. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion we should keep database calls even if they are below the threshold for the fast spans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I heard some feedback from users where they wanted a similar feature and in those cases it was all database spans. ORM frameworks sometimes create multiple db calls and in general I had the impression that db will be the one causing "span explosion" a lot. I feel not including db spans here would limit the usefulness of this config.
But good point about the service map - is it unacceptable to just say that this influences servicemap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was an explicit design goal and a request from you know who to reduce the vast amount of short DB calls from the waterfall.
I see that this can be problematic when trying to determine metrics for the service map. But then we might have to tackle the issue in a completely different way.
For example similar to how the RUM agent does it and fold multiple repetitive spans that have the same context into one and display as 10 x SELECT FROM users
, for example.
Or we have to track the metrics in the agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixbarny :) I now remember that request.
We should factor in a way to get the Database connections right. The Service Maps should show accurate connectivity data regardless of how fast it is being executed. In the future, we'd also want to add a dependencies/Database page to track all the database queries for a particular service - if we drop these spans now we will have to add back some logic in to add these back. Also, omitting fast SQL queries may hinder a user from detecting a n+1 query problem all together(especially if those queries are quick).
I like the example you provided to fold multiple repetitive spans with the same context to display them. That would solve both the too many spans issue while keeping the maps accurate and allowing for any future development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good points, Neha. span_min_duration seems to be more of a band-aid fix. Re-implementing that for other agents probably doesn't make sense as it's not a good solution long-term.
Maybe we need a solution that tackles the problem of excessive data collection more holistically. Some thoughts on this:
Spans that exceed transaction_max_spans
are currently discarded the same way as spans faster than span_min_duration
. Maybe we need a way of attaching information to the transaction about how many spans were discarded by span type and destination. We could use this information both in the waterfall view (we've omitted 42 DB spans to a MySQL DB that took 1.234s in total) and to calculate edge metrics to DBs accurately in case we dropped data. We already collect breakdown metrics that contain similar data but we don't group them by destination and don't attach them to transactions.
Another thing to reconsider is trying to deduplicate spans that just contain repetitive information, such as n+1 queries or 100s of Redis cache GET
calls in a row.
Do you have a sense of the priority of this? Should we remove span_min_duration
from 7.11? Should we try to get to a holistic solution within the 7.11 timeframe?
such as outgoing HTTP requests, | ||
can't be discarded. | ||
Additionally, spans that lead to an error or that may be a parent of an async operation can't be discarded. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that may be a parent of an async operation can't be discarded.
Repeating what was mentioned elsewhere -- this requirement means that for the Node.js Agent "in the real world" that no spans will be eligible for discarding.
The culture around Node.js heavily encourages the use of asynchronous programming patterns -- specifically asynchronous callback APIs and Promise based APIs. For example -- the typical handling of a HTTP(s) request in Node.js will look like
- Register a callback function to handle a specific route/url
- This callback makes a request to the application's data layer (db query, orm, etc) for some data, which occurs asynchronously, and registers a second callback function
- This route handling callback finishes execution.
- When the asynchronous request for data finishes, node calls the second callback. This callback typically handles sending the HTTP response
While it's possible to construct a route handling callback that works synchronously in Node.js -- it's not very common in the real world.
@nehaduggal (since @alex-fedotyev is currently on leave) -- the above means that the span_min_duration
feature, as speced, wouldn't help any Node.js users with their too many spans problems.
Closing this as the proposal has several shortcomings (see #314 (comment)). Instead, we'll discuss alternatives and a more holistic solution in a working group. |
@felixbarny has the holistic solution been discussed? If so, where can we track the progress of this feature? Thanks. |
We did not end up implementing the ability to drop arbitrary spans across agents as async context propagation makes it hard to determine whether a span can be safely dropped. Instead, we focussed on dropping fast exit spans: #496. |
Discussion issue: #259
Note that the implementation of this spec also includes adding the option to Kibana.
Open questions: