Request Sample Rate needed - not scaleable #151
Comments
On another note, it is interesting that when these socket hangup issues occur, it is fairly difficult to know that they are happening at all. Nothing in the monitoring indicates it. It is also not easy to tell whether the APM Server queue is continually filling up and dropping requests.
@bradennapier I think #104 is relevant here. It's not an immediate solution for you, but it sounds like essentially what you want - do you agree? One of the problems with throwing away data at the agent is that you can then no longer calculate percentiles properly. As described in the linked issue, this is a trade-off you would have to make until/unless Elasticsearch supports pre-aggregated histograms. In the meantime, there may be a couple of things you could do that could help:
Do you care about monitoring those? Perhaps you could drop them in the agent if you're either not interested in them at all, or they're drowning out the most valuable performance data.
One thing you could do is completely drop the non-sampled transactions, using a filter in the agent: https://www.elastic.co/guide/en/apm/agent/nodejs/current/agent-api.html#apm-add-filter

    apm.addFilter(function (payload) {
      if (payload.sampled === false) {
        return
      }
      return payload
    })

Bear in mind that there will be no multiplier reported, so the count and rate metrics will all be off. The histogram you see will also only be based on sampled data, but it sounds like you would be OK with that.
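To illustrate the missing multiplier mentioned above, here is a minimal sketch of how counts and rates could be rescaled by hand when only sampled transactions are stored. The helper `estimateTotals` and its parameters are hypothetical and not part of the APM agent or server. It assumes sampling is uniform across requests, which may not hold in practice.

```javascript
// Hypothetical helper (not part of elastic-apm-node): estimate true totals
// from the number of sampled events seen in a time window.
function estimateTotals (sampledCount, sampleRate, windowSeconds) {
  if (!(sampleRate > 0 && sampleRate <= 1)) {
    throw new RangeError('sampleRate must be in (0, 1]')
  }
  // Each stored event stands in for 1 / sampleRate real requests.
  const multiplier = 1 / sampleRate
  const estimatedCount = sampledCount * multiplier
  return {
    multiplier,
    estimatedCount,
    estimatedRatePerSecond: estimatedCount / windowSeconds
  }
}

// Example: 50 sampled events over 100 seconds at a sample rate of 0.25.
console.log(estimateTotals(50, 0.25, 100))
```

The estimate gets noisier as the sample rate drops, and it says nothing about transaction names that happened not to be sampled at all.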
Ahh yeah, if I can filter out what it reports, I can implement this myself. I would just need to apply the multiplier manually, which sucks for now, but it would at least be a short-term fix! Ideally, if the agent and the server could agree on the sampling rate, all values could be multiplied automatically for the user. That does have trade-offs (for example, if you change the rate, all your data is skewed), so I definitely get that it is a difficult problem! However, the filtering is essentially what I would be doing on my end if I got rid of APM anyway! Thanks!
I think this might be more of an issue for @elastic/apm-server, about optimizing how we index things. Perhaps we should transfer the issue?
I think @axw's suggestion to avoid sending data to the server makes a lot of sense in this case. Of course, we can also discuss further improvements on the server and an additional config option for some kind of server-side sampling. However, I think this would require a more holistic discussion, so I suggest transferring the issue to https://github.com/elastic/apm.
My thinking was just that maybe APM Server could somehow aggregate the unsampled transaction data better. Currently, even with the sample rate turned way down, we still send the transaction event for every single request. In a high-traffic application there could be millions of transaction records generated per minute, or possibly even per second. The structure of those transactions might be unique across routes, but they quickly lose uniqueness across traffic at significant scale. We're likely storing a huge amount of duplicate data in those cases.
Needless to say, this matches the observations I made in my OP 👍
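As a rough illustration of the aggregation idea above, the sketch below collapses many near-identical transaction records into one summary per transaction name. This is not how APM Server works. The function `aggregateTransactions` and the `name` and `duration` fields are assumptions made for the example.

```javascript
// Hypothetical sketch: collapse per-request records into one summary per name.
// Assumes each record looks like { name: string, duration: number }.
function aggregateTransactions (transactions) {
  const byName = new Map()
  for (const { name, duration } of transactions) {
    const entry = byName.get(name) || { count: 0, totalDuration: 0, maxDuration: 0 }
    entry.count += 1
    entry.totalDuration += duration
    entry.maxDuration = Math.max(entry.maxDuration, duration)
    byName.set(name, entry)
  }
  return byName
}

// Example: three records become two summaries.
const summary = aggregateTransactions([
  { name: 'GET /orders', duration: 10 },
  { name: 'GET /orders', duration: 30 },
  { name: 'GET /health', duration: 5 }
])
console.log(summary.get('GET /orders'))
```

A summary like this keeps counts and averages cheap to store, but percentiles cannot be recovered from it, which is why a histogram-style aggregate would be needed instead.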
@axw FYI, that filter broke the whole API. It appears that returning no value from a filter causes errors.
This seems to be due to the fact it is not possible to filter out errors or |
@bradennapier sorry about that, and thanks again for the analysis in elastic/apm-agent-nodejs#1385. There's a couple of issues here:
Both of these issues could be addressed by #104. We could alternatively, as you say, just not send the non-sampled transactions and instead introduce a multiplier, but this introduces other issues. In particular, when some transaction names/types are much less common than others, sampling may bias the results. I think we should close this issue, and focus our efforts towards #104. What do you think?
Closed
So I love the ES APM, but I believe we are going to have to remove it because there is no way to limit the sampling of requests.
Currently 100% of requests will be sent no matter what, each with a fairly significant amount of data.
This can be limited by changing the transaction sample rate, but that only removes a small percentage of the data that has to be indexed.
We get millions of requests a minute (mostly rate-limited spam) and are constantly getting socket hangup and queue full errors, so many requests are being lost anyway.
My estimate is that properly handling our requests would probably cost us tens of thousands of dollars a month, considering that the current setup can't handle the load and our Elastic Cloud bill is already upwards of $2,000 a month. That would be as much as or more than it costs to run our API.
Not to mention that I have to clear out the entire system every 48 hours because it fills up over a TB every couple of days. We have a hot/warm architecture, but that fills up completely.
We need to be able to sample a percentage of actual requests in the agent, then potentially apply a multiplier on the other end to provide estimated metrics.
For example, I want APM to do anything at all for only maybe 10% of requests made. This would allow us to reduce capacity requirements by 90% while still getting a general picture of what's going on.
Note that I have already set the transaction sample rate to about 0.01 and disabled capturing stack traces, etc.; however, it's still far too much for APM to handle.
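A minimal sketch of the kind of agent-side sampling requested here might look like the following. The factory `makeSamplingFilter` is hypothetical. It assumes a filter can drop a payload by returning a falsy value, which was reported elsewhere in this thread to cause errors in the agent version in use, so treat it as an illustration rather than a working fix.

```javascript
// Hypothetical factory: returns a filter that keeps roughly `keepRate` of payloads.
// `random` is injectable so the decision can be made deterministic in tests.
function makeSamplingFilter (keepRate, random = Math.random) {
  if (!(keepRate >= 0 && keepRate <= 1)) {
    throw new RangeError('keepRate must be in [0, 1]')
  }
  return function samplingFilter (payload) {
    // Keep the payload with probability keepRate; otherwise return undefined
    // to ask for it to be dropped (assumed agent behaviour, see note above).
    return random() < keepRate ? payload : undefined
  }
}

// Possible usage, assuming `apm` is a started elastic-apm-node agent:
// apm.addFilter(makeSamplingFilter(0.1))
```

With a keep rate of 0.1, any counts derived from the stored data would need to be multiplied by 10 to estimate real traffic, and rare routes may be under-represented or missing entirely.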