
Request Sample Rate needed - not scalable #151

Closed
bradennapier opened this issue Sep 20, 2019 · 10 comments
Comments

@bradennapier

So I love the ES APM, but I believe we are going to have to remove it because there is no way to limit the sampling of requests.

Currently 100% of requests are sent no matter what, each carrying a fairly significant amount of data.

This can be limited by changing the transaction sample rate, but that only reduces the data required to be indexed by a small percentage.

We get millions of requests a minute (mostly rate-limited spam) and are constantly getting socket hang-up and queue-full errors, so many requests are being lost anyway.

My estimate is that properly handling our request volume would cost tens of thousands of dollars a month, given that our current Elastic Cloud deployment is already upwards of $2,000 a month and still can't keep up. That would be as much as or more than it costs to run our API.

Not to mention I have to clear out the entire system every 48 hours because it fills up with over a TB every couple of days; we have a hot/warm architecture, but even that fills up completely.

We need to be able to sample a percentage of actual requests in the agent, then potentially apply a multiplier on the other end to provide estimated metrics.

For example, I want APM to do anything at all for only maybe 10% of requests made. This would reduce our capacity requirements by 90% while still giving a general picture of what's going on.

Note that I have already set the transaction sample rate down to something like 0.01 and disabled stack trace capture, etc.; however, it's still far too much for APM to handle.
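For reference, the setup described above corresponds roughly to this sketch (option names are from the Node agent documentation; the service name is a placeholder):

```javascript
// Sketch of the agent configuration described in this comment.
// Option names follow the elastic-apm-node docs; 'my-api' is hypothetical.
require('elastic-apm-node').start({
  serviceName: 'my-api',
  transactionSampleRate: 0.01,   // keep full detail for only 1% of transactions
  captureSpanStackTraces: false  // skip span stack trace capture
})
```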

@bradennapier
Author

On another note, it is interesting that when these socket hang-up issues occur, it is fairly difficult to know they are happening at all. Nothing in the monitoring gives any indication of it.

It is also not easy to tell whether the APM Server queue is continually filling up and dropping requests.

@axw
Member

axw commented Sep 23, 2019

@bradennapier I think #104 is relevant here. It's not an immediate solution for you, but it sounds essentially like what you want - do you agree?

One of the problems with throwing away data at the agent is that you can then no longer calculate percentiles properly. As described in the linked issue, this is a trade-off you would have to make until/unless Elasticsearch supports pre-aggregated histograms.

In the meantime, there are a couple of things you could do that might help:

> We get millions of requests a minute (mostly rate limited spam) and are constantly getting socket hangup and queue full errors so many requests are being lost anyway.

Do you care about monitoring those? Perhaps you could drop them in the agent if you're either not interested in them at all, or they're drowning out the most valuable performance data.
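Dropping that spam in the agent might look like the following sketch. The `payload.result` values are assumptions about the transaction payload shape, and the falsy-return-drops convention should be verified against your agent version:

```javascript
// Sketch: a filter that drops rate-limited transactions before they are
// queued for sending. Field names and result strings are assumptions
// about the Node agent's transaction payload shape.
function dropRateLimited (payload) {
  if (payload.result === 'HTTP 429' || payload.result === 'HTTP 4xx') {
    return false // a falsy return value drops the payload
  }
  return payload
}

// Hypothetical registration with the agent:
//   apm.addFilter(dropRateLimited)
```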

> We need to be able to sample a percent of actual requests on the agent then potentially add a multiplier on the other end to provide estimated metrics.
>
> For example, I want to only have apm do anything for maybe 10% of requests made. This would allow us to reduce capacity requirements by 90% while still getting a general picture of what’s going on.

One thing you could do is completely drop the non-sampled transactions, using a filter in the agent: https://www.elastic.co/guide/en/apm/agent/nodejs/current/agent-api.html#apm-add-filter

apm.addFilter(function (payload) {
  if (payload.sampled === false) {
    return
  }
  return payload
})

Bear in mind that there will be no multiplier reported, so the count and rate metrics will all be off. The histogram you see will also only be based on sampled data, but it sounds like you would be OK with that.
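The manual multiplier mentioned here could be as simple as the following sketch (a hypothetical post-hoc correction applied on the query side, not an agent feature):

```javascript
// Scale a count observed from sampled data back up by the inverse of
// the sample rate to estimate the true total.
function estimateTotal (sampledCount, sampleRate) {
  return Math.round(sampledCount / sampleRate)
}

// e.g. 1200 sampled transactions at a 10% sample rate -> 12000 estimated actual
```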

@bradennapier
Author

Ahh yeah, if I can filter out what it reports, I can implement this myself; I would just need to apply the multiplier manually, which sucks for now, but it at least gives us a short-term fix!

Ideally, if the agent and the server could agree on a sampling rate, then all values could be multiplied for the user automatically - but that does have trade-offs (such as what happens if you later change the rate - then all your data is skewed), so I definitely get that it is a difficult thing to get right!

However, the filtering is essentially what I would be doing on my end if I got rid of APM anyway!

Thanks!

@Qard

Qard commented Sep 24, 2019

I think this might be more an issue for @elastic/apm-server with optimizing how we index things. Perhaps we should transfer the issue?

@simitt
Contributor

simitt commented Sep 24, 2019

I think @axw 's suggestion to avoid sending data to the server makes a lot of sense in this case. Of course we can also discuss further improvements on the server and an additional config option for some kind of server-side sampling. However, I think this requires a more holistic discussion, so I suggest transferring this to https://github.com/elastic/apm.

@watson watson transferred this issue from elastic/apm-agent-nodejs Sep 24, 2019
@Qard

Qard commented Sep 24, 2019

My thinking was just that maybe APM Server could somehow aggregate the unsampled transaction data better. Currently, even with the sample rate turned way down, we still send a transaction event for every single request. In a high-traffic application, that could mean millions of transaction records per minute, or possibly even per second. Those transactions may be unique across routes, but at significant scale they quickly lose uniqueness across traffic, so we are likely storing a huge amount of duplicate data in those cases.
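The kind of aggregation described here might look like the following client-side sketch (this is not an existing agent feature; the names are illustrative):

```javascript
// Sketch: collapse unsampled transactions into per-route buckets of
// count and total duration, so one small document per route per
// interval could be sent instead of millions of near-duplicate events.
const buckets = new Map()

function record (name, durationMs) {
  const bucket = buckets.get(name) || { count: 0, totalMs: 0 }
  bucket.count += 1
  bucket.totalMs += durationMs
  buckets.set(name, bucket)
}

record('GET /users', 12)
record('GET /users', 18)
record('POST /orders', 40)
// buckets now holds { count: 2, totalMs: 30 } for 'GET /users'
```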

@bradennapier
Author

Needless to say, this matches the observations I made in my OP 👍

@bradennapier
Author

bradennapier commented Sep 24, 2019

@axw FYI, that filter broke the whole API. It appears that returning no value from a filter causes errors.

15|api   | TypeError: Cannot read property 'id' of undefined
15|api   |     at send (/home/ubuntu/project/node_modules/elastic-apm-node/lib/agent.js:386:68)
15|api   |     at prepareError (/home/ubuntu/project/node_modules/elastic-apm-node/lib/agent.js:378:7)
15|api   |     at /home/ubuntu/project/node_modules/elastic-apm-node/lib/agent.js:294:7
15|api   |     at /home/ubuntu/project/node_modules/elastic-apm-node/lib/parsers.js:80:7
15|api   |     at /home/ubuntu/project/node_modules/after-all-results/index.js:20:25
15|api   |     at process._tickCallback (internal/process/next_tick.js:61:11)

This seems to be because it is not possible to filter out errors, or transactions with a `transaction.type` of `request`? See elastic/apm-agent-nodejs#1385.
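Given that agent issue, a safer variant might scope the drop to transaction payloads only, so error payloads are never returned as `undefined`. This is a sketch: `apm.addTransactionFilter` is part of the Node agent API, but the exact behavior of a falsy return may vary by agent version:

```javascript
// Transaction-only filter: error payloads are never touched, avoiding
// the "Cannot read property 'id' of undefined" crash above.
function dropUnsampled (payload) {
  if (payload.sampled === false) {
    return false // a falsy return drops the payload
  }
  return payload
}

// Hypothetical registration:
//   apm.addTransactionFilter(dropUnsampled)
```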

@axw
Member

axw commented Sep 25, 2019

@bradennapier sorry about that, and thanks again for the analysis in elastic/apm-agent-nodejs#1385.

There are a couple of issues here:

  1. Either the agent or server, or both, cannot keep up with the rate of data. This can probably be improved with optimisations, but there will always be a limit to what the agent and server can process. We can address this by providing an option to aggregate non-sampled transactions in the agent.

  2. The storage cost is too high. Currently we store individual events and aggregate them at query time, which is what enables calculating percentiles while filtering on high-cardinality fields, e.g. user names, origin country names, etc. By aggregating at the agent we can reduce the storage cost by storing only aggregated data; this will mean throwing away some of that high-cardinality data.

Both of these issues could be addressed by #104. We could alternatively, as you say, just not send the non-sampled transactions and instead introduce a multiplier, but this introduces other issues. In particular, when some transaction names/types are much less common than others, sampling may bias the results.
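To illustrate the bias concern with hypothetical traffic numbers:

```javascript
// At a uniform 1% sample rate, a rare route contributes so few sampled
// events that its metrics become very noisy, while a high-volume route
// is still well represented. Traffic counts here are made up.
const rate = 0.01
const trafficPerMinute = { 'GET /health': 500000, 'POST /checkout': 300 }
const expectedSampled = Object.fromEntries(
  Object.entries(trafficPerMinute).map(([route, n]) => [route, n * rate])
)
// 'POST /checkout' is expected to contribute only ~3 sampled events per
// minute, so its latency percentiles rest on a handful of data points.
```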

I think we should close this issue, and focus our efforts towards #104. What do you think?

@bradennapier
Author

Closed
