Thanos Query stop responding if not queried for some time #705
Thanks for the report. |
Here is the related Salt state used to deploy query and store:
It happened again just now: on the stores panel everything is green and the health checks still respond correctly, but I can't, for example, run a simple query. Still no errors in the logs |
Interesting. No immediate clue - we cannot repro it. What is really weird in your setup is that you have a 5-replica HA setup for Prometheus. Does that mean all 5 instances scrape the same data? Because that's what replicas would mean. Nevertheless, what might be happening now is that one of those StoreAPIs hangs for some reason - potentially the store gateway? One hanging leaf can cause Thanos Query to time out after some time. I would suggest:
|
Each instance monitors its own system (site or project); they don't share metrics data (except for monitoring each other's monitoring stack: each other's Prometheus, Alertmanager, sidecars, etc.). Should I remove the replica global label from my configuration? I've checked the store logs but there is nothing. What I see is that, for example, queries like the ones above are affected. Looking at this leaf's log, nothing has failed recently (just the classic upload of a new block). When restarting this leaf (and not Thanos Query like before) it works, but I suppose it will happen again in a few hours, as it had been restarted shortly before this ticket when I updated the version. |
Yes, they are definitely not replicas then - they are shards (: I would not remove the label, but rather change it. Not sure if that is the reason for your weird query timeouts, but it's worth a try. Have you tried turning off some leaf then? When the Querier hangs, it may be worth turning them off one by one. Also, in case of hanging, it's worth checking the Thanos Querier metrics.
|
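For context on the replica-label discussion above, a minimal sketch with placeholder store addresses; the flag assumed here is `--query.replica-label`, which tells the Querier which external label marks true replicas for deduplication:

```
# Sketch only, not the reporter's setup: series differing only in the label
# named below are deduplicated as copies of the same data, so a label that
# distinguishes shards (instances scraping different data) should not be
# declared here.
thanos query \
  --query.replica-label=replica \
  --store=sidecar-1.example.internal:10901 \
  --store=sidecar-2.example.internal:10901
```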
I shut down the only one that had the issue. The query limit is 20 everywhere, but I can see that thanos-query sometimes reaches more than 60 concurrent queries; I think this is when I try to open Grafana and it hangs forever because of the issue. If I change the label (I suppose it needs to be changed everywhere at the same time), will that affect the ability to read the previous metrics? Any other consequences? |
Doesn't this sound like your issue here? Can you try changing max-concurrent to 100, for example? BTW, what is running on 10.1.0.2:10904? Well, the result will have different labels; otherwise, for your case it should be fine (: |
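For reference, a hedged sketch of raising that limit on the Querier; the addresses below are placeholders, and the flag assumed is `--query.max-concurrent` (default 20):

```
# Hypothetical invocation; --query.max-concurrent is the only relevant change.
thanos query \
  --http-address=0.0.0.0:10902 \
  --grpc-address=0.0.0.0:10901 \
  --query.max-concurrent=100 \
  --store=sidecar-1.example.internal:10901
```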
|
I just pushed the change, I will let you know if it happens again. BTW, maybe it would be interesting to sync the flag to have the same name as on Prometheus? |
I think it's up to your deployment/config management to sync those. From the Querier's standpoint, we don't know what a given StoreAPI actually is. (: |
So far, setting the limit to 100 everywhere (verified via the metric) didn't fix the issue; I just tried right now and I'm getting the same behaviour. By syncing I meant using the same flag name as Prometheus, not syncing the value :) |
Ah yes - fully agree. Have you checked resource consumption for the components during the issue? Also, were you able to repro this issue with just the |
So far it's not related to the store: I tried restarting the store and it didn't fix the issue (and Query doesn't reconnect to it until Query itself is restarted), so it looks like it's bound to something else. CPU usage is almost 0 when doing the query. No errors in any log, and the network looks OK as heartbeats go through fine everywhere. |
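One hedged way to check whether the Querier has picked the restarted Store back up, assuming the default HTTP port and a placeholder hostname, is to look at its own metrics page:

```
# Narrow the Querier's /metrics output to store-related series; exact metric
# names differ between Thanos versions, so this only filters the output.
curl -s http://querier.example.internal:10902/metrics | grep -i store
```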
Wonder if it's similar to this: https://improbable-eng.slack.com/archives/CA4UWKEEN/p1546867840210100 |
I could not get the link to work; it always gives me the end of the #thanos channel 😓
There is some discussion in #thanos about sidecar slowness. Might be related, might not. Hm.. in your case, I would try debugging the sidecar in some way: checking its metrics for how many connections/requests it handles and what resources it uses, and in the end.. debugging the code flow more. (: Will try to repro something similar on our stack soon as well. |
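A sketch of the kind of sidecar inspection suggested above, assuming the standard gRPC server metrics are exposed on the sidecar's HTTP port (10902 by default); the hostname is a placeholder:

```
# Compare how many StoreAPI RPCs were started vs. handled to completion;
# a growing gap would point at hanging calls. Metric names come from the
# common gRPC Prometheus middleware.
curl -s http://sidecar.example.internal:10902/metrics \
  | grep -E 'grpc_server_(started|handled)_total'
```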
Thanks :) Let me know if you need any more testing from me |
Cannot repro, sorry. I literally performed a demo on the latest version and it's super reliable. Now that I've looked at it again, I can see you are still using |
I've hit this case too with v0.4.0 and have collected all pprof profiles from Thanos Querier. |
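For anyone collecting the same data: Thanos components expose the standard Go pprof handlers on their HTTP port, so a goroutine dump can be grabbed roughly like this (hostname and port are placeholders):

```
# Full goroutine stacks (debug=2) from the Querier's HTTP endpoint.
curl -s 'http://querier.example.internal:10902/debug/pprof/goroutine?debug=2' \
  > querier-goroutines.txt
```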
Ok turns out I was running |
Digging through the goroutine dump, I found that many goroutines wait forever on
|
Another good question is what the Thanos Store API call goroutines were doing, because they hadn't returned data in 3880 min (~64.66 h) |
Other super long goroutines
|
Ok, this data is a bit too confusing, as I am running an older version which didn't have the fix for Thanos Query |
We decided we need to rewrite that part to make it simpler, so this will be a bit of a bigger effort. |
Looks like your findings are consistent with what I'm seeing, too. I currently have 3 queries stuck in the queue (based on the metrics). PS: thanks for your work on this! Let me know if there's anything that I can help with :)
|
Never mind. Looks like this was fixed in #1082. I'll run it in prod over the weekend and see if things are resolved. |
Following up on my previous comment, this is definitely resolved by #1082. No stale goroutines or Prom queries have been observed in the 10 days since I started running a newer release of Thanos. This issue can be closed. |
Thanks to everyone involved! It was quite painful to dig into (:
Thanos, Prometheus and Golang version used
Docker tags:
What happened
When thanos-query is not used for some time, it stops answering when we try to query again and only works after a restart.
We noticed this when we shut down our persistent Grafana dashboard (shown on screens in the office); now, every time we go to Grafana or to the thanos-query interface, the queries hang forever until we restart thanos-query.
CPU usage on all the Thanos-related processes is low, so it doesn't look like they are doing anything.
All the stores are up in the thanos-query dashboard but all queries are failing (even `up`). After restarting `thanos query`, everything works fine for a few hours (I don't know the exact time).
What you expected to happen
Query working fine
How to reproduce it (as minimally and precisely as possible):
Don't run any queries for a few hours, then try to query.
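A hedged example of what "try to query" can look like against the Querier's Prometheus-compatible HTTP API (hostname and port are placeholders):

```
# An instant query for "up"; when the issue occurs this hangs until the
# client gives up instead of returning a result.
curl -s 'http://querier.example.internal:10902/api/v1/query?query=up'
```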
Full logs to relevant components
Nothing in the logs except context cancelled errors, because the HTTP queries time out on the Grafana side
Anything else we need to know
Environment:
uname -a: Linux gc-euw1-prometheus-central-1 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux