More efficient way to send/store repeated spans? #280

Closed
beniwohli opened this issue Nov 2, 2017 · 21 comments · Fixed by #448
@beniwohli
Contributor

@gingerwizard brought up an interesting problem. In a deliberately inefficient view, he queries a list of 15,000 items one by one instead of with a single query, to demonstrate a problem that can be detected and analyzed with APM.

The problem is that, due to the huge number of traces, the payload surpasses our payload size limit and the transaction is lost.

In the Opbeat agents, we did some optimization by only storing the stacktrace once, but that's not done in the new Elastic APM agents.

A very simplistic approach could be to check whether the just-ending trace is equivalent to the preceding trace (same trace name, same stack trace, maybe even same context). If so, increment a counter field and add the duration to the preceding trace instead of creating a new trace.
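
A minimal sketch of that idea in Python (the Span model, its field names and the helper below are hypothetical, not the agent's actual API):

from dataclasses import dataclass

@dataclass
class Span:  # hypothetical, simplified span model
    name: str
    stacktrace: tuple
    duration: float
    repeat_count: int = 1

def record_span(spans, finishing):
    """Fold `finishing` into the previous span if it repeats the same operation."""
    if spans and spans[-1].name == finishing.name \
            and spans[-1].stacktrace == finishing.stacktrace:
        spans[-1].repeat_count += 1
        spans[-1].duration += finishing.duration  # individual durations are lost
    else:
        spans.append(finishing)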

Even this simplistic approach has some drawbacks though. We lose some information (the duration of each individual repetition), and the UI would need to expose the count somehow. Also, things like "Top N queries" would get more complicated to implement.

Any ideas?

@roncohen
Contributor

roncohen commented Nov 2, 2017

For Python on Opbeat we grouped traces together based on their name in order to avoid this problem. That introduced another problem: it's very hard to group traces together in a good way, and you sometimes want to see the full picture. And as you mentioned, if we go back to grouping, querying will become more complex.

Here's my proposal for an initial fix:

We introduce a trace count limit in the agents. A good default could be 1000 traces. When the agent hits 1000 traces for a single transaction, it sets a property on the transaction to indicate that it hit the limit, and starting more traces in that transaction becomes a no-op. The full duration of the transaction is still recorded. Using the "limit hit" property on the transaction, we can show clearly in the UI that the transaction went over the limit. The idea is that if you have that many traces, most of them will be essentially the same, they will take up the bulk of the time, and you'd want to fix that before diving into other performance problems for that transaction. So you might have 200 regular traces while the last 800 are just the same trace over and over, and that's what you'd want to fix first. This way we still highlight the problem without complicating ES queries or the intake API, and we avoid not-quite-right grouping in the agents.

The UI that shows the timeline view will have an empty space between hitting the limit and ending the transaction. To make the UI better, we could consider starting a big trace when we hit the limit. That trace would run to the end of the transaction and be named something like "trace limit hit" to indicate that more went on but we didn't record it.
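
As a rough illustration of the agent side of this proposal, the limit could work along these lines (a sketch only; the class shape and names like dropped_spans and span_limit_hit are made up, and the limit would presumably be configurable):

MAX_SPANS_PER_TRANSACTION = 1000  # hypothetical default, one knob per agent

class Transaction:  # simplified sketch, not the real agent class
    def __init__(self):
        self.spans = []
        self.dropped_spans = 0
        self.span_limit_hit = False

    def start_span(self, name, span_type):
        # Once the limit is reached, starting more spans is a no-op;
        # we only flag the transaction and count what we drop.
        if len(self.spans) >= MAX_SPANS_PER_TRANSACTION:
            self.span_limit_hit = True
            self.dropped_spans += 1
            return None
        span = {"name": name, "type": span_type}
        self.spans.append(span)
        return span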

@beniwohli
Contributor Author

That seems like a nice compromise! I like!

@roncohen
Contributor

roncohen commented Nov 8, 2017

@mikker @watson @jahtalab WDYT?

@mikker

mikker commented Nov 8, 2017

I think your suggestion sounds very reasonable 👍

The only thing it doesn't take into account is whether a high trace count is intentional. And I'm fine with just not supporting that for now; 1000 seems like it's already way over any regular use (?)

@watson
Contributor

watson commented Nov 9, 2017

Yeah, I think a configurable maxTracesPerTransaction option is a good idea. Whether the default should be 1000 or maybe 500 I'm not sure; I guess it depends on the language?

@gingerwizard

Is there any possibility we could include rollup metrics for the traces on the source transaction? E.g. the number of traces ignored, average trace time, percentile metrics, etc.? I can't immediately think of uses for the latter two, but the former would be good to know. Even being able to plot the number of traces per transaction is potentially interesting; in Django, for example, it may point to poor use of the framework.

@roncohen
Contributor

roncohen commented Nov 9, 2017

@gingerwizard I like that idea. We could have a "dropped_traces_count" on the transaction or similar.

Taking it further, it would possibly be nice to know the types of traces that were dropped:

dropped_traces_counts: {
  db: 23,
  cache: 940,
}

I would file that under nice-to-have though.

@gingerwizard

Yes, dropped_traces_counts and total_traces_counts would be sufficient initially.

@hmdhk
Contributor

hmdhk commented Nov 13, 2017

On the frontend we try to group traces that are "continuously similar and insignificant" and show them as one trace (which carries the end time of the last trace and the count of traces grouped together). We consider the trace's type, its signature, how small it is in the context of the whole transaction, and how far away it is (time-wise) from the last trace we grouped.
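
For illustration only, the gist of such a grouping heuristic might look like this in Python (the RUM agent itself is JavaScript, and the field names and thresholds here are invented):

def should_group(prev, span, transaction_duration,
                 size_threshold=0.02, gap_threshold_ms=50):
    """Return True if `span` should be folded into the group ending with `prev`.

    `prev` and `span` are dicts with hypothetical keys: type, signature,
    start and end (milliseconds); the thresholds are illustrative only.
    """
    similar = (span["type"] == prev["type"]
               and span["signature"] == prev["signature"])
    insignificant = (span["end"] - span["start"]) / transaction_duration < size_threshold
    close_in_time = span["start"] - prev["end"] < gap_threshold_ms
    return similar and insignificant and close_in_time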

@roncohen, I like the compromise you suggested and I think it also makes sense to have a limit on the frontend even though we have a grouping algorithm already.

@simitt simitt added the v6.2 label Jan 3, 2018
@graphaelli graphaelli changed the title from "More efficient way to send/store repeated traces?" to "More efficient way to send/store repeated spans?" Jan 3, 2018
@simitt simitt self-assigned this Jan 3, 2018
@roncohen
Contributor

roncohen commented Jan 3, 2018

@simitt will you write up some high-level notes on how you plan to implement this?

@simitt
Contributor

simitt commented Jan 3, 2018

Following the discussions and looking into the current structure, what do you think about the following?

Intake API Transaction

{
    "service": {},
    "system": {},
    "transactions": [
        {
            "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
            "name": "GET /api/types",
            "context": { ... },
            "spans": {
                "sampled": [
                    {
                        "id": 0,
                        "group": "database",
                        "parent": null,
                        "name": "SELECT FROM product_types",
                        ...
                    },
                    ...
                ],
                "count": {
                    "recorded": {
                        "overall": 291,
                        "database": 45,
                        "cache": 231,
                        "misc": 15
                    },
                    "sampled":{
                        "overall": 223,
                        "database": 22,
                        "cache": 186,
                        "misc": 15
                    }
                }
            }
        }
    ]
}

Elasticsearch Transaction Document

{
    "context": { ... },
    "processor": {
        "event": "transaction",
        "name": "transaction"
    },
    "transaction": {
        "duration": {
            "us": 32592
        },
        "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
        "name": "GET /api/types",
        "result": "success",
        "type": "request",
        "spans": {
            "count": {
                "recorded": {
                    "overall": 291,
                    "database": 45,
                    "cache": 231,
                    "misc": 15
                },
                "sampled":{
                    "overall": 223,
                    "database": 22,
                    "cache": 186,
                    "misc": 15
                },
                "dropped":{
                    "overall": 68,
                    "database": 23,
                    "cache": 45
                }
            }
        }
        ...
    }
}

I added a group attribute to every span that is used for splitting the counts, e.g. database, cache, etc. This reflects the per-type counts @roncohen mentioned as a nice-to-have; it could be added by the agents in the future, and for now the agents could send only the overall counts.

I am aware that this would mean quite a few changes in the agents, server and UI, but I think adding the count information as an attribute of a spans object would be the cleanest thing to do.
In case sampling and dropping is also added to stacktraces at some point, we could reuse the same structure and terminology.
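
To make the per-group counts concrete, an agent could assemble the proposed spans.count object roughly like this (a sketch against the draft above; the span list shape and group names are illustrative):

from collections import Counter

def build_span_counts(recorded_spans, sampled_spans):
    """Assemble the proposed spans.count object from two lists of span dicts.

    Each span dict is assumed to carry a "group" key, e.g. "database" or "cache".
    """
    def per_group(spans):
        counts = Counter(span["group"] for span in spans)
        return {"overall": len(spans), **counts}

    return {
        "recorded": per_group(recorded_spans),
        "sampled": per_group(sampled_spans),
    }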

@roncohen
Contributor

roncohen commented Jan 4, 2018

@simitt good draft.

  • I think adding sampled here (in both places) could get confusing when we also have it on transactions. Also, I didn't think of this as sampling spans, but more as a sort of protection against edge cases with exceptionally many spans, where we'll need to drop some of them. However, I'm having a really hard time coming up with better names. I'm not super keen on items or list either. Any ideas?
  • For counts, I suggest we start with just spans.count.dropped.total, both in the intake API and in the Elasticsearch doc.

@beniwohli
Contributor Author

I feel like this is a fairly invasive change, less than 10 work days before the GA feature freeze (and also BC1, IIRC). And it's not like we don't have anything else to do.

What speaks against @roncohen's suggestion in #280 (comment)? The only change would be to add an optional property on the transaction.

@roncohen
Contributor

roncohen commented Jan 4, 2018

Yeah, I've also had second thoughts overnight.
@simitt how do you feel about only adding something like dropped_spans.total to the transaction?

@simitt
Contributor

simitt commented Jan 4, 2018

Based on an offline discussion, we will only add a simple count for now:

Suggestion:

Intake API Transaction

{
    "service": {},
    "system": {},
    "transactions": [
        {
            "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
            "name": "GET /api/types",
            "span_count": {
                "dropped": {
                    "total": 2
                }
            },
            "context": { ... },
            "spans": [...]
        }
    ]
}

Elasticsearch Transaction Document

{
    "context": { ... },
    "processor": {
        "event": "transaction",
        "name": "transaction"
    },
    "transaction": {
        "duration": {
            "us": 32592
        },
        "id": "945254c5-67a5-417e-8a4e-aa29efcbfb79",
        "name": "GET /api/types",
        "result": "success",
        "type": "request",
        "span_count": {
             "dropped": {
                 "total": 2
             }
        },
        ...
    }
}

@roncohen
Contributor

roncohen commented Jan 4, 2018

Thanks, @simitt!

@beniwohli
Contributor Author

Awesome! I'll try to have a first implementation of this in the Python agent, and a view that goes over the limit in opbeans-python so the Kibana team has a test case.

@watson
Contributor

watson commented Jan 4, 2018

I'm not sure what to think of the property name span_count. I get that the nested dropped.total object is there so that we can add more properties later without making it a breaking change, but I'm not sure what purpose span_count itself serves.

How about simply

"dropped_spans": {
    "total": 42
}

@beniwohli
Contributor Author

These deeply nested structures are only slightly awkward in Python 😁

result['span_count'] = {'dropped': {'total': self.dropped_spans}}

@beniwohli
Contributor Author

First stab at implementing this: elastic/apm-agent-python#127

I'm also OK with @watson's suggestion; no strong feelings either way.

@roncohen
Contributor

roncohen commented Jan 4, 2018

@watson the idea is that we can add the total recorded span count later. See @simitt's initial example.

simitt added a commit to simitt/apm-server that referenced this issue Jan 4, 2018
simitt added a commit to simitt/apm-server that referenced this issue Jan 4, 2018
simitt added a commit to simitt/apm-server that referenced this issue Jan 5, 2018
simitt added a commit to simitt/apm-server that referenced this issue Jan 5, 2018
simitt added a commit that referenced this issue Jan 5, 2018
simitt added a commit to simitt/apm-server that referenced this issue Jan 8, 2018