Sidecar stops responding when querying over multiple days #773
Digging deeper into this:
Thanks for the report! So a randomized replica label is quite bad, yes - any compaction/downsampling would not work, but you don't use those. For your setup, I think the issue you have is not because of the replica label. Questions:
Thanks for the info. We want to use compaction/downsampling in the near future. You are correct that the labels were not the issue. With a fixed replica label I'm hitting the same error. The IPs look off because Prometheus and all monitoring components are in a separate network inside our Docker swarm cluster. Thanos Query is only in the "monitoring" network, as it only needs a connection to Prometheus. Prometheus itself is in both the common and monitoring networks, because it has to reach the services it's supposed to monitor. AFAIK the sidecar process sets its advertise IP wrong, but this only matters for gossip, right? We rely on DNS discovery and static entries, so this should not be a problem. When triggering the error the process becomes completely unresponsive. All ports are closed (also from inside the container). I also started the sidecar with
Sidecar logs with
Interesting! Could you compare the DNS results of
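The exact names to compare are not captured above. As a rough sketch, comparing what swarm DNS returns from inside the query container could look like the following, assuming a hypothetical service name `prometheus` and that `nslookup` is available in the image:

```
# Hypothetical service name. The bare service name resolves to the service VIP,
# while tasks.<service> resolves to the individual task (container) IPs.
docker exec <query-container> nslookup prometheus
docker exec <query-container> nslookup tasks.prometheus
```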
In one of our environments, Prometheus/sidecar outputs the following logs:
Starting prometheus with
EDIT: Also tried configuring store endpoints statically on Query - same result as before. I was also able to reproduce this with only one Prometheus instance running. I have now completely reset all volumes and data for Prometheus on our staging system. Maybe this problem only occurs if Sidecar is reading data not stored in the WAL. Will have a look at it as soon as Prometheus writes its first full block to disk.
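For context, configuring store endpoints statically on Query means passing each sidecar's gRPC address via a repeated `--store` flag. A minimal sketch with placeholder addresses (the actual command used in this setup is not shown in this thread):

```
# Placeholder addresses; every --store entry points at a sidecar's gRPC endpoint.
thanos query \
  --http-address=0.0.0.0:10902 \
  --grpc-address=0.0.0.0:10901 \
  --query.replica-label=replica \
  --store=prometheus-1:10901 \
  --store=prometheus-2:10901
```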
Prometheus has now written multiple blocks to disk and all are accessible over Thanos. Maybe I'm experiencing something similar to #396?
@mreichardt95 Thanks for your report. Maybe you can try upgrading Thanos to the master head version and redoing your query request. Meanwhile, please check the health of the Prometheus instance while your query is getting no response.
Sidecar stopped again... Prometheus is fully working and healthy when sidecar goes down. I reconfigured our Prometheus base image to build thanos from master:
I tested around a bit and was not able to break sidecar again. I only experienced some delays while querying a lot of data, but I guess that is expected :). Will keep an eye on it. Thanks for your help!
👍
Sidecar just crashed again. I think this might not only be related to the queries executed, but also to how long sidecar has been running. After a restart I can query everything as expected without any problems.
Ran into it again in the morning. This time Prometheus itself crashed too. I guess this is caused when sidecar does not receive queries for some time and then crashes as soon as it receives the next query. I have spun up a Rule component that just evals
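The exact expression is not captured above. Purely as an illustration, a Rule component evaluating a trivial constant (acting as a keepalive against the query path; the file name, label, and addresses below are hypothetical) could be wired up roughly like this:

```
# Hypothetical keepalive rule; vector(1) just produces a constant sample each eval.
cat > /etc/thanos/keepalive.yml <<'EOF'
groups:
  - name: keepalive
    rules:
      - record: thanos_keepalive
        expr: vector(1)
EOF

thanos rule \
  --data-dir=/var/thanos/rule \
  --rule-file=/etc/thanos/keepalive.yml \
  --eval-interval=30s \
  --query=query:10902 \
  --label='replica="rule-0"'
```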
Have you checked whether sidecar is not just OOMing? (: If the query is large it can consume a lot of memory. #488 will help with this, but in the meantime you could check whether you have enough memory for sidecar to operate.
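A quick way to check this on a swarm node - container and service names below are placeholders - is to look at live memory usage, check the kernel log for OOM kills, and raise the service's memory limit if needed:

```
# Show current memory usage of the running containers.
docker stats --no-stream

# Look for OOM-killer activity on the host.
dmesg | grep -i "out of memory"

# Raise the memory limit of the swarm service (placeholder name and value).
docker service update --limit-memory=8g prometheus
```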
Thanks for your input! Yes, the container was OOMing when Prometheus crashed yesterday, but that does not explain the crashes of sidecar itself. Since I fired up the Rule component we haven't had a single crash or hiccup in any of our environments. I will watch this further, but the issue is probably just being worked around by constantly querying the sidecar, which stops it from crashing.
Not sure, but might be. Let's close this issue and track and discuss it on #488 - essentially improving memory consumption for both sidecar and Prometheus.
Thanos, Prometheus and Golang version used
Thanos 0.2.0 & 0.2.1, Prometheus 2.6.1 & 2.6.0, Golang go1.11.4
What happened
I tried querying data for multiple days. Example query: `sum(irate(nginx_http_requests_total{env="staging"}[1m])) by (host)`. Time period requested: 1 week. After the request, Thanos Sidecar becomes unresponsive and the container needs to be restarted.
What you expected to happen
Get data for multiple days
How to reproduce it (as minimally and precisely as possible):
Full logs to relevant components
Sidecar start command
Thanos Query Logs
Thanos Query Command
Anything else we need to know
We're currently running on Docker swarm. The sidecar process runs directly inside the Prometheus container. We don't use any bucket storage yet; everything is kept on disk for now (30d retention).
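For reference, a disk-only setup like the one described (no object storage configured) generally only needs the TSDB path and the Prometheus URL on the sidecar side. The sketch below uses placeholder paths and is not the start command from this report, which is omitted above:

```
# Generic disk-only sketch; no object storage flags, paths are placeholders.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention=30d &

thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902
```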