More efficient way to send/store repeated spans? #280

Closed
beniwohli opened this issue Nov 2, 2017 · 21 comments · Fixed by #448
@beniwohli
Contributor

@gingerwizard brought up an interesting problem. In a deliberately inefficient view, he queries a list of 15,000 items one by one instead of with a single query, to demonstrate a problem that can be detected and analyzed with APM.

The problem is that, due to the huge number of traces, the payload surpasses our payload size limit and the transaction is lost.

In the Opbeat agents, we did some optimization by only storing the stacktrace once, but that's not done in the new Elastic APM agents.

A very simplistic approach could be to check whether the just-ending trace is equivalent to the preceding trace (same trace name, same stack trace, maybe even same context). If so, increment a counter field and add the duration to the preceding trace instead of creating a new trace.
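
A minimal sketch of that idea in Python (the Span model, its field names and the helper below are hypothetical, not the agent's actual API):

from dataclasses import dataclass

@dataclass
class Span:  # hypothetical, simplified span model
    name: str
    stacktrace: tuple
    duration: float
    repeat_count: int = 1

def record_span(spans, finishing):
    """Fold `finishing` into the previous span if it repeats the same operation."""
    if spans and spans[-1].name == finishing.name \
            and spans[-1].stacktrace == finishing.stacktrace:
        spans[-1].repeat_count += 1
        spans[-1].duration += finishing.duration  # individual durations are lost
    else:
        spans.append(finishing)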

Even this simplistic approach has some drawbacks though. We lose some information (the duration of each individual repetition), and the UI would need to expose the count somehow. Also, things like "Top N queries" would get more complicated to implement.

Any ideas?

@roncohen
Contributor

roncohen commented Nov 2, 2017

For Python on Opbeat we grouped traces together based on their name in order to avoid this problem. That introduced another problem: it's very hard to group traces together in a good way, and you sometimes want to see the full picture. And as you mentioned, if we go back to grouping, querying will become more complex.

Here's my proposal for an initial fix:

We introduce a trace count limit in the agents. A good default could be 1000 traces. When the agent hits 1000 traces for a single transaction, it sets a property on the transaction to indicate that it hit the limit, and starting more traces in that transaction becomes a no-op. The full duration of the transaction is still recorded. Using the "limit hit" property on the transaction, we can show clearly in the UI that the transaction went over the limit. The idea is that if you have that many traces, most of them will be essentially the same, they will take up the bulk of the time, and you'd want to fix that before diving into other performance problems for that transaction. So you might have 200 regular traces while the last 800 are just the same trace over and over, and that's what you'd want to fix first. This way we still highlight the problem without complicating ES queries or the intake API, and we avoid not-quite-right grouping in the agents.

The UI that shows the timeline view will have an empty space between hitting the limit and ending the transaction. To make the UI better, we could consider starting a big trace when we hit the limit. That trace would run to the end of the transaction and be named something like "trace limit hit" to indicate that more went on but we didn't record it.
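
As a rough illustration of the agent side of this proposal, the limit could work along these lines (a sketch only; the class shape and names like dropped_spans and span_limit_hit are made up, and the limit would presumably be configurable):

MAX_SPANS_PER_TRANSACTION = 1000  # hypothetical default, one knob per agent

class Transaction:  # simplified sketch, not the real agent class
    def __init__(self):
        self.spans = []
        self.dropped_spans = 0
        self.span_limit_hit = False

    def start_span(self, name, span_type):
        # Once the limit is reached, starting more spans is a no-op;
        # we only flag the transaction and count what we drop.
        if len(self.spans) >= MAX_SPANS_PER_TRANSACTION:
            self.span_limit_hit = True
            self.dropped_spans += 1
            return None
        span = {"name": name, "type": span_type}
        self.spans.append(span)
        return span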

@beniwohli
Contributor Author

That seems like a nice compromise! I like!

@roncohen
Contributor

roncohen commented Nov 8, 2017

@mikker @watson @jahtalab WDYT?

@mikker

mikker commented Nov 8, 2017

I think your suggestion sounds very reasonable 👍

The only thing it doesn't take into account is whether a high trace count is intentional. And I'm fine with just not supporting that for now; 1000 seems like it's already way over any regular use (?)

@watson
Contributor

watson commented Nov 9, 2017

Yeah, I think a configurable maxTracesPerTransaction option is a good idea. Whether the default should be 1000 or maybe 500 I'm not sure; I guess it depends on the language?

@gingerwizard

Is there any possibility we could include rollup metrics for the traces on the source transaction? E.g. the number of traces ignored, average trace time, percentile metrics, etc.? I can't immediately think of uses for the latter two, but the former would be good to know. Even being able to plot the number of traces per transaction is potentially interesting; in Django, for example, it may point to poor use of the framework.

@roncohen
Contributor

roncohen commented Nov 9, 2017

@gingerwizard I like that idea. We could have a "dropped_traces_count" on the transaction or similar.

Taking it further, it would possibly be nice to know the types of traces that were dropped:

dropped_traces_counts: {
  db: 23,
  cache: 940,
}

I would file that under nice-to-have though.

@gingerwizard

Yes, dropped_traces_counts and total_traces_counts would be sufficient initially.

@hmdhk
Contributor

hmdhk commented Nov 13, 2017

On the frontend we try to group traces that are "continuously similar and insignificant" and show them as one trace (which carries the end time of the last trace and the count of traces grouped together). We consider the trace's type, its signature, how small it is in the context of the whole transaction, and how far away it is (time-wise) from the last trace we grouped.
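
For illustration only, the gist of such a grouping heuristic might look like this in Python (the RUM agent itself is JavaScript, and the field names and thresholds here are invented):

def should_group(prev, span, transaction_duration,
                 size_threshold=0.02, gap_threshold_ms=50):
    """Return True if `span` should be folded into the group ending with `prev`.

    `prev` and `span` are dicts with hypothetical keys: type, signature,
    start and end (milliseconds); the thresholds are illustrative only.
    """
    similar = (span["type"] == prev["type"]
               and span["signature"] == prev["signature"])
    insignificant = (span["end"] - span["start"]) / transaction_duration < size_threshold
    close_in_time = span["start"] - prev["end"] < gap_threshold_ms
    return similar and insignificant and close_in_time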

@roncohen, I like the compromise you suggested and I think it also makes sense to have a limit on the frontend even though we have a grouping algorithm already.

@simitt simitt added the v6.2 label Jan 3, 2018
@graphaelli graphaelli changed the title from "More efficient way to send/store repeated traces?" to "More efficient way to send/store repeated spans?" Jan 3, 2018
@simitt simitt self-assigned this Jan 3, 2018
@roncohen
Contributor

roncohen commented Jan 3, 2018

@simitt will you write up some high-level notes on how you plan to implement this?

@simitt
Contributor

simitt commented Jan 3, 2018

Following the discussions and looking into the current structure, what do you think about the following?

Intake API Transaction

{
    "service": {},
    "system": {},
    "transactions": [
        {
            "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
            "name": "GET /api/types",
            "context": { ... },
            "spans": {
                "sampled": [
                    {
                        "id": 0,
                        "group": "database",
                        "parent": null,
                        "name": "SELECT FROM product_types",
                        ...
                    },
                    ...
                ],
                "count": {
                    "recorded": {
                        "overall": 291,
                        "database": 45,
                        "cache": 231,
                        "misc": 15
                    },
                    "sampled":{
                        "overall": 223,
                        "database": 22,
                        "cache": 186,
                        "misc": 15
                    }
                }
            }
        }
    ]
}

Elasticsearch Transaction Document

{
    "context": { ... },
    "processor": {
        "event": "transaction",
        "name": "transaction"
    },
    "transaction": {
        "duration": {
            "us": 32592
        },
        "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
        "name": "GET /api/types",
        "result": "success",
        "type": "request",
        "spans": {
            "count": {
                "recorded": {
                    "overall": 291,
                    "database": 45,
                    "cache": 231,
                    "misc": 15
                },
                "sampled":{
                    "overall": 223,
                    "database": 22,
                    "cache": 186,
                    "misc": 15
                },
                "dropped":{
                    "overall": 68,
                    "database": 23,
                    "cache": 45
                }
            }
        }
        ...
    }
}

I added a group attribute to every span that is used for splitting the counts, e.g. database, cache, etc. This reflects the per-type counts @roncohen mentioned as a nice-to-have; it could be added by the agents in the future, and for now the agents could send only the overall counts.

I am aware that this would mean quite a few changes in the agents, server and UI, but I think adding the count information as an attribute of a spans object would be the cleanest thing to do.
In case sampling and dropping is also added to stacktraces at some point, we could reuse the same structure and terminology.
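
To make the per-group counts concrete, an agent could assemble the proposed spans.count object roughly like this (a sketch against the draft above; the span list shape and group names are illustrative):

from collections import Counter

def build_span_counts(recorded_spans, sampled_spans):
    """Assemble the proposed spans.count object from two lists of span dicts.

    Each span dict is assumed to carry a "group" key, e.g. "database" or "cache".
    """
    def per_group(spans):
        counts = Counter(span["group"] for span in spans)
        return {"overall": len(spans), **counts}

    return {
        "recorded": per_group(recorded_spans),
        "sampled": per_group(sampled_spans),
    }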

@roncohen
Contributor

roncohen commented Jan 4, 2018

@simitt good draft.

  • I think adding sampled here (in both places) could get confusing when we also have it on transactions. Also, I didn't think of this as sampling spans, but more as a sort of protection against edge cases with exceptionally many spans, where we'll need to drop some of them. However, I'm having a really hard time coming up with better names. I'm not super keen on items or list either. Any ideas?
  • For counts, I suggest we start with just spans.count.dropped.total, both in the intake API and in the Elasticsearch doc.

@beniwohli
Contributor Author

I feel like this is a fairly invasive change, less than 10 work days before the GA feature freeze (and also BC1, IIRC). And it's not like we don't have anything else to do.

What speaks against @roncohen's suggestion in #280 (comment)? The only change would be to add an optional property on the transaction.

@roncohen
Contributor

roncohen commented Jan 4, 2018

Yeah, I've also had second thoughts overnight.
@simitt how do you feel about only adding something like dropped_spans.total to the transaction?

@simitt
Contributor

simitt commented Jan 4, 2018

Based on an offline discussion, we will only add a simple count for now:

Suggestion:

Intake API Transaction

{
    "service": {},
    "system": {},
    "transactions": [
        {
            "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
            "name": "GET /api/types",
            "span_count": {
                "dropped": {
                    "total": 2
                }
            },
            "context": { ... },
            "spans": [...]
        }
    ]
}

Elasticsearch Transaction Document

{
    "context": { ... },
    "processor": {
        "event": "transaction",
        "name": "transaction"
    },
    "transaction": {
        "duration": {
            "us": 32592
        },
        "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
        "name": "GET /api/types",
        "result": "success",
        "type": "request",
        "span_count": {
             "dropped": {
                 "total": 2
             }
        },
        ...
    }
}

@roncohen
Contributor

roncohen commented Jan 4, 2018

Thanks, @simitt!

@beniwohli
Contributor Author

Awesome! I'll try to have a first implementation of this in the Python agent, and a view that goes over the limit in opbeans-python so the Kibana team has a test case.

@watson
Contributor

watson commented Jan 4, 2018

I'm not sure what to think of the property name span_count. I get that the nested dropped.total object is there so that we can add more properties later without making it a breaking change, but I'm not sure what purpose span_count itself serves.

How about simply

"dropped_spans": {
    "total": 42
}

@beniwohli
Contributor Author

These deeply nested structures are only slightly awkward in Python 😁

result['span_count'] = {'dropped': {'total': self.dropped_spans}}

@beniwohli
Contributor Author

First stab at implementing this: elastic/apm-agent-python#127

I'm also OK with @watson's suggestion; no strong feelings either way.

@roncohen
Contributor

roncohen commented Jan 4, 2018

@watson the idea is that we can add the total recorded span count later. See @simitt's initial example.

simitt added a commit to simitt/apm-server that referenced this issue Jan 4, 2018
simitt added a commit to simitt/apm-server that referenced this issue Jan 4, 2018
simitt added a commit to simitt/apm-server that referenced this issue Jan 5, 2018
simitt added a commit to simitt/apm-server that referenced this issue Jan 5, 2018
simitt added a commit that referenced this issue Jan 5, 2018
simitt added a commit to simitt/apm-server that referenced this issue Jan 8, 2018