Sampling by fingerprint #27884
Routing to @getsentry/owners-ingest for triage.
We have features that seem like they can help with your problem:
Not sure if you already saw these and your proposal is in addition to the existing solutions, or if you were unable to find these resources (hence sharing them here).
Because determining the fingerprint for a particular event depends on resolving sourcemaps and debug symbols, which is the most expensive thing we do with an event on the server. We don't want people to send us large amounts of traffic, have us process all of it, and pay none of the cost it incurs. All the existing sampling and filtering you can do today is cheap to execute because it's based on data that does not need to be computed by the server first. This doesn't apply to languages like Python/Ruby, but it does to JS, for example.
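A minimal sketch of the point being made here, with entirely hypothetical names: today's sampling decision needs no event data at all, so it can run before any processing. A fingerprint-based decision would first require the symbolicated stack, which is the expensive step.

```python
# Illustration (not Sentry code): a plain sample-rate decision is an
# O(1) coin flip that needs nothing from the event, which is why the
# existing sampling/filtering is cheap to execute.
import random

def random_sampling_decision(sample_rate: float) -> bool:
    # No sourcemap resolution, no symbolication -- just a random draw.
    return random.random() < sample_rate

random.seed(0)  # deterministic for the demo
kept = sum(random_sampling_decision(0.25) for _ in range(10_000))
print(0.2 < kept / 10_000 < 0.3)  # roughly 25% of events kept
```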
@roberttod to clarify, when you say "fingerprint" do you mean:
Thanks @BYK - I've seen these features, but spike protection is only useful for unexpected spikes rather than long-running high-frequency errors, and the filtering needs manual intervention for each error type and will stop reporting an error that might still need to be addressed.
@jan-auer ideally the sampling could use the server-side fingerprint logic, but if that's expensive, even a simpler fingerprint like the one calculated on the client side could be useful. I was expecting a server-side solution, because then sampling could be applied across many clients. I wasn't aware server-side fingerprint generation was so expensive; I expected most of the cost to be in storing, sorting, and searching through errors on the server. If we could apply some simple fingerprint sampling, that would be very beneficial. Would the more complex server fingerprinting in most cases join more errors together anyway? In that case there might not be much of a difference.
Thanks for the additional input! In short, we're aware of this challenge and evaluating options at the moment (including client-side fingerprinting). I think this generally makes sense to add.
It's less about being expensive; rather, server-side fingerprinting and grouping run quite late in the pipeline for functional reasons. They require information from pretty much every prior step of data ingestion, including server-side processing to resolve actual function names from JavaScript sourcemaps or debug information files. Server-side sampling and related functionality run rather early, in contrast. We're even pushing the sampling decision down to the client in some cases, to increase performance and reduce bandwidth requirements.
Generally speaking, the result of server-side fingerprinting can be completely different from what could be done with information available on the client. In many cases, it would not even be possible to group errors into issues correctly on the client side.
Understood, that all makes a lot of sense. I think some rudimentary fingerprint of the exact error message plus an exact stack match would do for most cases we've seen - it wouldn't group anything it shouldn't, and it would still massively reduce the error count. If done on the server instead of the client (I believe sampling is client-side right now), it would also help when a large number of users hit the same error. Looking at the docs, I had always assumed sampling was done client-side since it's an SDK option, but I realized it isn't specified. Is it done server-side or client-side?
A different idea from the simple solution above: if your fingerprint logic runs further toward the end of the pipeline, I wonder if something like this would work.
I am probably missing some context here, so I'm not sure if this would work - I was assuming all of the info could be gathered from just the incoming error message (which seems reasonable) and that each raw fingerprint could map directly to one server-side fingerprint.
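The "rudimentary fingerprint" described above could be sketched like this (the function name is hypothetical, not a Sentry API): hash the exception type, the exact message, and the raw, unsymbolicated stack locations, so that nothing is ever grouped that shouldn't be.

```python
# Sketch: a cheap client-side fingerprint from data the client already
# has -- no sourcemap or debug-symbol resolution required.
import hashlib
import traceback

def rudimentary_fingerprint(exc: BaseException) -> str:
    # Exact type + exact message + raw frame locations. Over-splitting
    # is possible, but distinct errors are never merged.
    frames = traceback.extract_tb(exc.__traceback__)
    parts = [type(exc).__name__, str(exc)]
    parts += [f"{f.filename}:{f.lineno}:{f.name}" for f in frames]
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()

def boom():
    raise ValueError("bad input")

def capture() -> str:
    try:
        boom()
    except ValueError as e:
        return rudimentary_fingerprint(e)

fp1 = capture()
fp2 = capture()
# The same error from the same call site yields the same fingerprint.
print(fp1 == fp2)
```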
We have a server-side filtering and sampling product as well and are not too worried about its scalability as it sits at the front of the event processing pipeline. Client-side sampling is more widely used.
Yes, that can work:
You have to ensure that n in the first relationship does not get too large, as otherwise you will end up sending/syncing a large number of raw fingerprints to Relay. So the rudimentary fingerprint logic can't just liberally add all event data that could ever be involved in grouping; it will have to be refined just for the scalability of this setup. One could potentially make the rudimentary fingerprint logic easier to maintain by porting the regular fingerprint logic to Rust first, so it can be run from both Python/Sentry and Rust/Relay. IIRC this is as far as we got in internal discussions 1-2 months ago.
+1
Problem Statement
Projects with a lot of traffic tend to be much more expensive to integrate with Sentry, because during error conditions many times more errors are sent. Roughly speaking, there's a linear correlation between traffic and the dollar cost of Sentry errors. Of course, sampling can reduce this cost, but then you lose critical information, because errors that happen at a lower rate than the sample rate can be cut out if other errors are happening at high frequency.
One could argue that errors happening at high frequency should be addressed, deleted, or filtered. But in my experience that's not realistic - frontend projects with many users frequently get into this state, and by the time some thought has been put into mitigation you may already have wasted many dollars or missed a load of new errors that were sampled out.
Why can't there be a sample rate per event fingerprint? I understand that would still incur higher costs for high-traffic projects (for example, you'd need to store those fingerprints somewhere), but the cost-to-error-rate relationship would no longer be linear, and you'd barely lose any functionality by applying the sampling. You could even continue counting the errors correctly, while not storing the full error event information.
Solution Brainstorm
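One possible shape for "count everything, store only a sample per fingerprint", sketched with entirely hypothetical names: the counter always increments, but the full event is kept only for the first N occurrences of each fingerprint, so cost stops growing linearly with error rate while counts stay exact.

```python
# Sketch: per-fingerprint sampling that keeps counts correct. Only the
# first `keep_first` events per fingerprint are stored in full.
from collections import defaultdict

class PerFingerprintSampler:
    def __init__(self, keep_first: int = 5):
        self.keep_first = keep_first
        self.counts = defaultdict(int)

    def should_store_full_event(self, fingerprint: str) -> bool:
        self.counts[fingerprint] += 1  # every occurrence is counted
        return self.counts[fingerprint] <= self.keep_first

s = PerFingerprintSampler(keep_first=2)
stored = [s.should_store_full_event("fp-x") for _ in range(5)]
# Full events are stored twice, but all five occurrences are counted.
print(stored, s.counts["fp-x"])
```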