Endpoint set on query is causing increase in gRPC error rate #4699

Closed
matej-g opened this issue Sep 24, 2021 · 8 comments

@matej-g
Collaborator

matej-g commented Sep 24, 2021

We recently switched to Thanos v0.23.0-rc.0 for all our components in all our environments, which includes the introduction of the endpoint set flow for query. In our query, we specify the --store flags using the dnssrv+ directive. They point to a number of different components: store, rule and receive.

Not long after, we noticed an increase in alerts firing for ThanosQueryGrpcClientErrorRate. After investigating, we found that most of these errors have the error code Unimplemented. After looking at the code, we suspect this is coming from the Info gRPC call, which is part of the attempt to obtain metadata. Since some of the components specified as stores do not yet implement the Info gRPC service, they return the Unimplemented error when the gRPC call is attempted (and only then does query fall back to the Store API to obtain the data).

For illustration, here's the graph of the Unimplemented error rate we've been seeing since deploying v0.23.0-rc.0 on 9th September:
image

The Unimplemented errors are 'false positives': they raise the error rate and trigger the alerts. I guess that after the Info API is implemented for all components in #4282 (comment) this should no longer be a problem, but I'm wondering whether there is any interim measure. For example, we could make the Info call only on components which are known to implement it at this time, and otherwise go straight to the fallback in the form of a Store API call.

@jr0dd

jr0dd commented Sep 29, 2021

I'm not using any custom flags here and since v0.23.0 I get spammed all day with these same false positives. I'm just changing the rule for now to ignore Unimplemented.
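
For anyone wanting the same interim workaround, here is a minimal sketch of such a rule change. It assumes an alert expression like the one posted later in this thread and changes only the `grpc_code` matcher, so that `Unimplemented` is excluded alongside `OK`. The `job` matcher, threshold and `for` duration are assumptions to adapt to your own setup.

```yaml
# Sketch only: same alert as in this thread, with Unimplemented excluded
# from the error count. Prometheus regex matchers are fully anchored, so
# "OK|Unimplemented" matches exactly those two codes.
- alert: ThanosQueryGrpcClientErrorRate
  expr: |
    (
      sum by (job) (rate(grpc_client_handled_total{grpc_code!~"OK|Unimplemented", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: warning
```

Note that this also hides any genuine Unimplemented errors, so it is probably best treated as temporary until all components are on a fixed version.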

@matej-g changed the title from "Endpoint set on query is causing increase in gRPC error rate when specifying receive as a store" to "Endpoint set on query is causing increase in gRPC error rate" on Sep 30, 2021
@matej-g
Collaborator Author

matej-g commented Nov 12, 2021

This has for now been fixed in v0.23.1, so we can close it.

@matej-g matej-g closed this as completed Nov 12, 2021
@sherifkayad

sherifkayad commented Jan 31, 2022

@matej-g I am running Thanos v0.24.0 and still seeing this issue. Has the fix been backported to v0.24.x?

@matej-g
Collaborator Author

matej-g commented Jan 31, 2022

Hey @sherifkayad, this should be in 0.24.0 as well. Do you have any more details? Are you getting alerts? Are they caused by the Unimplemented error? I assume you have all your components updated to 0.24.0, right?

@sherifkayad

@matej-g yes, that's correct: all components are upgraded to v0.24.0 and I am still getting the alert due to Unimplemented errors. I re-checked the source code for v0.24.0 and your changes seem to be present there, e.g. https://github.com/thanos-io/thanos/blob/v0.24.0/pkg/query/endpointset.go. However, the same error you initially reported is still happening. I am out of ideas here.

@sherifkayad

sherifkayad commented Jan 31, 2022

@matej-g Attached a screenshot of the metric used in the alerts as well as the values associated ..
image

For the sake of completeness, my alert is configured as follows:

- alert: ThanosQueryGrpcClientErrorRate
  annotations:
    description: Thanos Query {{$labels.job}} is failing to send {{$value | humanize}}%
      of requests.
    runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanosquerygrpcclienterrorrate
    summary: Thanos Query is failing to send requests.
  expr: |
    (
      sum by (job) (rate(grpc_client_handled_total{grpc_code!~"OK", job=~".*thanos-query.*"}[5m]))
    /
      sum by (job) (rate(grpc_client_started_total{job=~".*thanos-query.*"}[5m]))
    ) * 100 > 5
  for: 5m
  labels:
    severity: warning

And to confirm that Unimplemented was the cause:
image
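
For readers who cannot see the screenshot, one way to run a similar check is to break the failing client calls down by code and service. The following is a rough sketch as a recording rule, assuming the same `job` matcher as the alert above. The group name and record name are hypothetical; `grpc_code` and `grpc_service` are labels the thread's metric `grpc_client_handled_total` normally carries.

```yaml
# Sketch only: non-OK gRPC client responses seen by Thanos Query,
# split by status code and by the gRPC service that was called.
groups:
  - name: thanos-query-grpc-debug  # hypothetical group name
    rules:
      - record: job_grpc_code_grpc_service:grpc_client_handled:rate5m  # hypothetical record name
        expr: |
          sum by (job, grpc_code, grpc_service) (
            rate(grpc_client_handled_total{job=~".*thanos-query.*", grpc_code!="OK"}[5m])
          )
```

If most of the resulting series have `grpc_code="Unimplemented"`, that points to the situation described at the top of this issue rather than to real request failures.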

@sherifkayad

@matej-g I think I found the error: I was using v0.23.1 for the sidecars, and these were causing it. Sorry for the confusion on my end. I think we are all good now.
image

@matej-g
Collaborator Author

matej-g commented Feb 2, 2022

Great to hear that all is working as expected then! 😊
