Endpoint set on query is causing increase in gRPC error rate #4699
Comments
I'm not using any custom flags here and since v0.23.0 I get spammed all day with these same false positives. I'm just changing the rule for now to ignore `Unimplemented`.
receive
as a store
This has for now been fixed in |
@matej-g I am running Thanos |
Hey @sherifkayad this should be in |
@matej-g yes, that's correct all components are upgraded to |
@matej-g Attached is a screenshot of the metric used in the alerts, along with the associated values. For the sake of completeness, my alert is configured as follows:

- alert: ThanosQueryGrpcClientErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcclienterrorrate
    summary: Thanos Query is failing to send requests.
  expr: |
    (
      sum by (job) (rate(grpc_client_handled_total{grpc_code!~"OK", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: warning
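To see which gRPC codes and methods drive the numerator of an alert expression like this one, it can help to break the non-OK rate down by label. The following is only a sketch as a recording rule, assuming the client metrics carry the usual go-grpc-prometheus labels (`grpc_code`, `grpc_service`, `grpc_method`); the group name and the recorded series name are hypothetical.

```yaml
groups:
  - name: thanos-query-grpc-debug  # hypothetical group name
    rules:
      # Hypothetical recording rule: per-code, per-method rate of non-OK
      # client calls made by the query component.
      - record: job_code:grpc_client_handled_errors:rate5m
        expr: |
          sum by (job, grpc_code, grpc_service, grpc_method) (
            rate(grpc_client_handled_total{grpc_code!="OK", job=~".*thanos-query.*"}[5m])
          )
```

If one code on a single method dominates the result, the alert is most likely reacting to that call pattern rather than to failing queries.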
@matej-g I think I found the error .. I was using |
Great to hear that all is working as expected then! 😊
We have recently switched to Thanos `v0.23.0-rc.0` for all our components on all our environments, which includes the introduction of the endpoint set flow for query. In our `query`, we are specifying the `--store` flags by leveraging the `dnssrv+` directive. It points to a number of different components: `store`, `rule` and `receive`.

Not long after, we noticed an increase in alerts firing for ThanosQueryGrpcClientErrorRate. After an investigation, we found that most of these errors have the error code `Unimplemented`. After looking at the code, we suspect this is coming from the `Info` gRPC call, which is part of the attempt to obtain metadata. However, some of the components specified in the stores do not yet implement the `Info` gRPC service, so they return the `Unimplemented` error when the gRPC call is attempted, and only then does the query fall back to the Store API to obtain the data.

For illustration, here's the graph of the `Unimplemented` error rate we've been seeing since deploying `v0.23.0-rc.0` on 9th September:

[Graph: `Unimplemented` error rate]

All of this is causing 'false positives': the `Unimplemented` errors raise the error rate and trigger the alerts. I guess after the Info API is implemented for all components in #4282 (comment) this should no longer be a problem, but I'm wondering whether there is any interim measure. For example, we could do the `Info` call only on components which are known to implement it at this time, and otherwise go straight to the back-up solution in the form of a Store API call.
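One interim measure on the alerting side, mentioned in the comments, is to make the rule ignore `Unimplemented`. A minimal sketch of that workaround, assuming the alert otherwise matches the configuration quoted in the thread; this is not the upstream mixin rule, and the only change is the `grpc_code` matcher:

```yaml
- alert: ThanosQueryGrpcClientErrorRate
  annotations:
    summary: Thanos Query is failing to send requests.
  expr: |
    (
      # PromQL regex matchers are fully anchored, so this excludes exactly
      # the codes OK and Unimplemented from the error numerator.
      sum by (job) (rate(grpc_client_handled_total{grpc_code!~"OK|Unimplemented", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: warning
```

The trade-off is that genuine `Unimplemented` failures are hidden as well. The denominator also still counts the `Info` attempts, so the reported percentage is slightly diluted.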