More efficient way to send/store repeated spans? #280
For Python on Opbeat we would group traces together based on their name in order to avoid this problem. That had its own problem: it's very hard to group traces together in a good way, and you sometimes want to see the full picture. And as you mentioned, if we go back to grouping, querying will become more complex.

Here's my proposal for an initial fix: we introduce a trace count limit in the agents. A good default could be 1000 traces.

- When the agent hits 1000 traces for a single transaction, it sets a property on the transaction to indicate that it hit the limit.
- Starting more traces in that transaction becomes a noop.
- The full duration of the transaction is still recorded.
- Using the "limit hit" property on the transaction, we can show clearly in the UI that the transaction went over the limit.

The idea is that if you have many traces, many of them will be essentially the same, they will take up the bulk of the time, and you'd want to fix that before diving into other performance problems for that transaction. So you'll have maybe 200 regular traces and the last 800 would just be the same trace over and over, which is something you'd want to fix. This way we still highlight the problem without complicating the ES queries or the intake API, and we avoid having not-quite-right grouping in the agents.

The UI that shows the timeline view will have an empty space between hitting the limit and the end of the transaction. To make the UI better, we could consider starting a big trace when we hit the limit; that trace would run to the end of the transaction and be named something like …
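As a rough illustration of the proposal above (not the actual agent implementation), the agent-side limit could look something like this; all names here are hypothetical:

```python
# Illustrative sketch only: a per-transaction trace limit as proposed above.
# Class, attribute, and constant names are hypothetical, not the real agent API.

MAX_TRACES_PER_TRANSACTION = 1000  # proposed default; could be made configurable


class Transaction:
    def __init__(self):
        self.traces = []
        self.trace_limit_hit = False  # the "limit hit" property reported to the server/UI

    def start_trace(self, name, duration=0.0, stacktrace=None):
        if len(self.traces) >= MAX_TRACES_PER_TRANSACTION:
            # Over the limit: starting more traces is a noop; we only remember
            # that the limit was hit. The transaction duration is unaffected.
            self.trace_limit_hit = True
            return None
        trace = {"name": name, "duration": duration, "stacktrace": stacktrace}
        self.traces.append(trace)
        return trace
```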
That seems like a nice compromise! I like!
I think your suggestion sounds very reasonable 👍 The only thing it doesn't take into account is if the trace count is high on purpose. And I'm fine with just not supporting that for now. 1000 seems like it's already way over any regular use (?)
Yeah, I think a configurable limit makes sense here.
Is there any possibility we could include rollup metrics for the traces on the source transaction? E.g. number of traces ignored, average trace time, percentile metrics, etc.? I can't immediately think of uses of the latter two, but the former would be good to know. Even being able to plot the number of traces per transaction is actually potentially interesting; in Django, for example, it may point to poor use of the framework.
@gingerwizard I like that idea. We could have a "dropped_traces_count" on the transaction or similar. Taking it further, it would possibly be nice to know the types of traces that were dropped.
I would file that under nice-to-have though.
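For illustration only, such a rollup on the transaction might be shaped roughly like the following; none of these field names are an agreed schema:

```python
# Hypothetical shape of a per-transaction rollup of dropped traces,
# broken down by trace type. All field names here are made up.
dropped_traces = {
    "total": 800,
    "by_type": {
        "db.postgresql.query": 780,
        "ext.http": 20,
    },
}
```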
yes
On the frontend we try to group traces that are "continuously similar and insignificant" and show them as one trace (which contains the end time of the last trace and the count of traces grouped together). We consider the trace's …

@roncohen, I like the compromise you suggested, and I think it also makes sense to have a limit on the frontend even though we already have a grouping algorithm.
@simitt will you write up some high-level notes on how you plan to implement this?
Following the discussions and looking into the current structure, what do you think about the following?

Intake API Transaction:

Elasticsearch Transaction Document:

I added a … I am aware that this would mean quite some changes in the agents, server and UI, but I think adding the …
@simitt good draft.
I feel like this is a fairly invasive change, less than 10 work days before GA feature freeze (and also BC1 IIRC). And it's not like we don't have anything else to do. What speaks against @roncohen's suggestion in #280 (comment)? The only change would be to add an optional property on the transaction.
yeah, I've also had second thoughts overnight.
Based on an offline discussion we will only add a simple count for now. Suggestion:

Intake API Transaction:

Elasticsearch Transaction Document:
thanks @simitt!
Awesome! I'll try to have a first implementation of this in the Python agent, and a view that goes over the limit in opbeans-python so the Kibana team has a test case.
I'm not sure what to think of the property name. How about simply:

```json
"dropped_spans": {
  "total": 42
}
```
These deeply nested structures are only slightly awkward in Python 😁
First stab at implementing this: elastic/apm-agent-python#127. I'm also OK with @watson's suggestion, no strong feelings either way.
Adds count to index. closes elastic#280.
@gingerwizard brought up an interesting problem. In a purposely inefficient view, he queries a list of 15,000 items one by one instead of with a single query, to demonstrate a problem that can be detected and analyzed with APM.
The problem is that, due to the huge number of traces, the payload surpasses our payload size limit and the transaction is lost.
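For illustration, the view described above is a classic N+1 query pattern. A hypothetical Django sketch (the app, model, and view names are made up) might look like this:

```python
# Hypothetical example of the purposely inefficient view described above:
# it fetches ~15000 items one query at a time, producing one span per query,
# which can blow past the agent's payload size limit.
from django.http import JsonResponse

from myapp.models import Item  # hypothetical app and model


def inefficient_item_list(request):
    names = []
    for pk in Item.objects.values_list("pk", flat=True):
        # One extra query per item instead of a single bulk query.
        names.append(Item.objects.get(pk=pk).name)
    return JsonResponse({"items": names})
```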
In the Opbeat agents, we did some optimization by only storing the stacktrace once, but that's not done in the new Elastic APM agents.
A very simplistic approach could be to check whether the just-ended trace is equivalent to the preceding trace (same trace name, same stack trace, maybe even same context). If so, increment a counter field on the preceding trace and add the duration to it instead of creating a new trace.
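A minimal sketch of that idea, assuming a simplified trace structure rather than the agent's real internals:

```python
# Rough sketch of collapsing consecutive equivalent traces into one entry
# with a repetition count. The Trace class and its fields are illustrative.
from dataclasses import dataclass


@dataclass
class Trace:
    name: str
    stacktrace: tuple
    duration: float
    count: int = 1  # how many consecutive repetitions this entry represents


def add_trace(traces, new_trace):
    """Append new_trace, or fold it into the preceding trace if equivalent."""
    if traces:
        prev = traces[-1]
        if (prev.name, prev.stacktrace) == (new_trace.name, new_trace.stacktrace):
            prev.count += 1
            prev.duration += new_trace.duration
            return
    traces.append(new_trace)
```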
Even this simplistic approach has some drawbacks though. We lose some information (the duration of each individual repetition), and the UI would need to expose the count somehow. Also, things like "Top N queries" would get more complicated to implement.
Any ideas?