
Request Sample Rate needed - not scalable #151

Closed
bradennapier opened this issue Sep 20, 2019 · 10 comments
Comments

@bradennapier

So I love the ES APM, but I believe we are going to have to remove it because there is no way to limit the sampling of requests.

Currently 100% of requests are sent no matter what, each carrying a fairly significant amount of data.

This can be limited by changing the transaction sample rate, but that only reduces the data required to be indexed by a small percentage.

We get millions of requests a minute (mostly rate-limited spam) and are constantly getting socket hang-up and queue-full errors, so many requests are being lost anyway.

My estimate is that properly handling our request volume would cost tens of thousands of dollars a month, given that our current Elastic Cloud deployment is already upwards of $2,000 a month and still can't keep up. That would be as much as or more than it costs to run our API.

Not to mention I have to clear out the entire system every 48 hours because it fills up with over a TB every couple of days; we have a hot/warm architecture, but even that fills up completely.

We need to be able to sample a percentage of actual requests in the agent, then potentially apply a multiplier on the other end to provide estimated metrics.

For example, I want APM to do anything at all for only maybe 10% of requests made. This would reduce our capacity requirements by 90% while still giving a general picture of what's going on.

Note that I have already set the transaction sample rate down to something like 0.01 and disabled stack trace capture, etc.; however, it's still far too much for APM to handle.
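For reference, the setup described above corresponds roughly to this sketch (option names are from the Node agent documentation; the service name is a placeholder):

```javascript
// Sketch of the agent configuration described in this comment.
// Option names follow the elastic-apm-node docs; 'my-api' is hypothetical.
require('elastic-apm-node').start({
  serviceName: 'my-api',
  transactionSampleRate: 0.01,   // keep full detail for only 1% of transactions
  captureSpanStackTraces: false  // skip span stack trace capture
})
```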

@bradennapier
Author

On another note, it is interesting that when these socket hang-up issues occur, it is fairly difficult to know they are happening at all. Nothing in the monitoring gives any indication of it.

It is also not easy to tell whether the APM Server queue is continually filling up and dropping requests.

@axw
Member

axw commented Sep 23, 2019

@bradennapier I think #104 is relevant here. It's not an immediate solution for you, but it sounds essentially like what you want - do you agree?

One of the problems with throwing away data at the agent is that you can then no longer calculate percentiles properly. As described in the linked issue, this is a trade-off you would have to make until/unless Elasticsearch supports pre-aggregated histograms.

In the meantime, there are a couple of things you could do that might help:

> We get millions of requests a minute (mostly rate limited spam) and are constantly getting socket hangup and queue full errors so many requests are being lost anyway.

Do you care about monitoring those? Perhaps you could drop them in the agent if you're either not interested in them at all, or they're drowning out the most valuable performance data.
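Dropping that spam in the agent might look like the following sketch. The `payload.result` values are assumptions about the transaction payload shape, and the falsy-return-drops convention should be verified against your agent version:

```javascript
// Sketch: a filter that drops rate-limited transactions before they are
// queued for sending. Field names and result strings are assumptions
// about the Node agent's transaction payload shape.
function dropRateLimited (payload) {
  if (payload.result === 'HTTP 429' || payload.result === 'HTTP 4xx') {
    return false // a falsy return value drops the payload
  }
  return payload
}

// Hypothetical registration with the agent:
//   apm.addFilter(dropRateLimited)
```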

> We need to be able to sample a percent of actual requests on the agent then potentially add a multiplier on the other end to provide estimated metrics.
>
> For example, I want to only have apm do anything for maybe 10% of requests made. This would allow us to reduce capacity requirements by 90% while still getting a general picture of what’s going on.

One thing you could do is completely drop the non-sampled transactions, using a filter in the agent: https://www.elastic.co/guide/en/apm/agent/nodejs/current/agent-api.html#apm-add-filter

apm.addFilter(function (payload) {
  if (payload.sampled === false) {
    return
  }
  return payload
})

Bear in mind that there will be no multiplier reported, so the count and rate metrics will all be off. The histogram you see will also only be based on sampled data, but it sounds like you would be OK with that.
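The manual multiplier mentioned here could be as simple as the following sketch (a hypothetical post-hoc correction applied on the query side, not an agent feature):

```javascript
// Scale a count observed from sampled data back up by the inverse of
// the sample rate to estimate the true total.
function estimateTotal (sampledCount, sampleRate) {
  return Math.round(sampledCount / sampleRate)
}

// e.g. 1200 sampled transactions at a 10% sample rate -> 12000 estimated actual
```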

@bradennapier
Author

Ahh yeah, if I can filter out what it reports, I can implement this myself; I would just need to apply the multiplier manually, which sucks for now, but it at least gives us a short-term fix!

Ideally, if the agent and the server could agree on a sampling rate, then all values could be multiplied for the user automatically - but that does have trade-offs (such as what happens if you later change the rate - then all your data is skewed), so I definitely get that it is a difficult thing to get right!

However, the filtering is essentially what I would be doing on my end if I got rid of APM anyway!

Thanks!

@Qard

Qard commented Sep 24, 2019

I think this might be more an issue for @elastic/apm-server with optimizing how we index things. Perhaps we should transfer the issue?

@simitt
Contributor

simitt commented Sep 24, 2019

I think @axw 's suggestion to avoid sending data to the server makes a lot of sense in this case. Of course we can also discuss further improvements on the server and an additional config option for some kind of server-side sampling. However, I think this requires a more holistic discussion, so I suggest transferring this to https://github.com/elastic/apm.

@watson watson transferred this issue from elastic/apm-agent-nodejs Sep 24, 2019
@Qard

Qard commented Sep 24, 2019

My thinking was just that maybe APM Server could somehow aggregate the unsampled transaction data better. Currently, even with the sample rate turned way down, we still send a transaction event for every single request. In a high-traffic application, that could mean millions of transaction records per minute, or possibly even per second. Those transactions may be unique across routes, but at significant scale they quickly lose uniqueness across traffic, so we are likely storing a huge amount of duplicate data in those cases.
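The kind of aggregation described here might look like the following client-side sketch (this is not an existing agent feature; the names are illustrative):

```javascript
// Sketch: collapse unsampled transactions into per-route buckets of
// count and total duration, so one small document per route per
// interval could be sent instead of millions of near-duplicate events.
const buckets = new Map()

function record (name, durationMs) {
  const bucket = buckets.get(name) || { count: 0, totalMs: 0 }
  bucket.count += 1
  bucket.totalMs += durationMs
  buckets.set(name, bucket)
}

record('GET /users', 12)
record('GET /users', 18)
record('POST /orders', 40)
// buckets now holds { count: 2, totalMs: 30 } for 'GET /users'
```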

@bradennapier
Author

Needless to say, this matches the observations I made in my OP 👍

@bradennapier
Author

bradennapier commented Sep 24, 2019

@axw FYI, that filter broke the whole API. It appears that returning no value from a filter causes errors.

15|api   | TypeError: Cannot read property 'id' of undefined
15|api   |     at send (/home/ubuntu/project/node_modules/elastic-apm-node/lib/agent.js:386:68)
15|api   |     at prepareError (/home/ubuntu/project/node_modules/elastic-apm-node/lib/agent.js:378:7)
15|api   |     at /home/ubuntu/project/node_modules/elastic-apm-node/lib/agent.js:294:7
15|api   |     at /home/ubuntu/project/node_modules/elastic-apm-node/lib/parsers.js:80:7
15|api   |     at /home/ubuntu/project/node_modules/after-all-results/index.js:20:25
15|api   |     at process._tickCallback (internal/process/next_tick.js:61:11)

This seems to be because it is not possible to filter out errors, or transactions with a `transaction.type` of `request`? See elastic/apm-agent-nodejs#1385.
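Given that agent issue, a safer variant might scope the drop to transaction payloads only, so error payloads are never returned as `undefined`. This is a sketch: `apm.addTransactionFilter` is part of the Node agent API, but the exact behavior of a falsy return may vary by agent version:

```javascript
// Transaction-only filter: error payloads are never touched, avoiding
// the "Cannot read property 'id' of undefined" crash above.
function dropUnsampled (payload) {
  if (payload.sampled === false) {
    return false // a falsy return drops the payload
  }
  return payload
}

// Hypothetical registration:
//   apm.addTransactionFilter(dropUnsampled)
```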

@axw
Member

axw commented Sep 25, 2019

@bradennapier sorry about that, and thanks again for the analysis in elastic/apm-agent-nodejs#1385.

There are a couple of issues here:

  1. Either the agent or server, or both, cannot keep up with the rate of data. This can probably be improved with optimisations, but there will always be a limit to what the agent and server can process. We can address this by providing an option to aggregate non-sampled transactions in the agent.

  2. The storage cost is too high. Currently we store individual events and aggregate them at query time, which is what enables calculating percentiles while filtering on high-cardinality fields, e.g. user names, origin country names, etc. By aggregating at the agent we can reduce the storage cost by storing only aggregated data; this will mean throwing away some of that high-cardinality data.

Both of these issues could be addressed by #104. We could alternatively, as you say, just not send the non-sampled transactions and instead introduce a multiplier, but this introduces other issues. In particular, when some transaction names/types are much less common than others, sampling may bias the results.
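To illustrate the bias concern with hypothetical traffic numbers:

```javascript
// At a uniform 1% sample rate, a rare route contributes so few sampled
// events that its metrics become very noisy, while a high-volume route
// is still well represented. Traffic counts here are made up.
const rate = 0.01
const trafficPerMinute = { 'GET /health': 500000, 'POST /checkout': 300 }
const expectedSampled = Object.fromEntries(
  Object.entries(trafficPerMinute).map(([route, n]) => [route, n * rate])
)
// 'POST /checkout' is expected to contribute only ~3 sampled events per
// minute, so its latency percentiles rest on a handful of data points.
```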

I think we should close this issue, and focus our efforts towards #104. What do you think?

@bradennapier
Author

Closed
