Query: possible goroutine leak seen between queriers on v0.35.1 #7553
Comments
I have taken some goroutine dumps from our queriers:
- goroutine dump from one of our thanos-querier pods
- goroutine dump from one of our thanos-querier-tls pods

We also see a correlation between memory usage and goroutines, shown in these dashboards.
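For anyone trying to reproduce this, a minimal sketch of how a similar dump can be collected, assuming pprof is exposed on the querier's default HTTP port (10902) and using hypothetical pod/namespace names:

```bash
# Forward the querier's HTTP port locally (pod and namespace names are examples).
kubectl -n monitoring port-forward pod/thanos-querier-0 10902:10902 &

# Full goroutine dump with stacks from Go's pprof handler.
curl -s "http://localhost:10902/debug/pprof/goroutine?debug=2" > thanos-querier.goroutines.txt
```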
Maybe related to #6948? cc @thibaultmg, any chance there could be something missing when chaining queriers?
Hey, indeed I did some changes on that code path. Looking quickly at the code, I would suspect bad context management.
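For illustration only, not the actual Thanos code: the usual shape of this kind of leak is a sender goroutine that blocks forever because the caller returns early and the shared context is never cancelled. A minimal Go sketch of that pattern and the fix:

```go
package main

import (
	"context"
	"time"
)

// fetchSeries simulates a downstream store call that streams a result back
// over a channel. Without the ctx.Done() case, an early-returning caller
// would leave this goroutine blocked on the send forever: a goroutine leak.
func fetchSeries(ctx context.Context) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		select {
		case out <- "series-data": // blocks until someone receives...
		case <-ctx.Done(): // ...or the caller cancels the context.
		}
	}()
	return out
}

func main() {
	// Derive a cancellable context and always cancel it when the caller
	// returns, including on the give-up-early path below.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	select {
	case <-fetchSeries(ctx):
		// Got data in time.
	case <-time.After(100 * time.Millisecond):
		// Caller gave up early; the deferred cancel unblocks the sender.
	}
}
```

In a chained-querier setup the same idea applies at each hop: whichever layer gives up first has to cancel the context it passed downstream, otherwise the downstream goroutines stay parked.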
We are now seeing this issue resolved after the fix released in v0.36.1: #7618
Awesome, I'll close the issue then!
Thanos, Prometheus and Golang version used:
Thanos version: 0.35.1
Prometheus version: 2.48.1
Object Storage Provider:
AWS S3
What happened:
We are running Thanos using query chaining between tls and non-tls endpoints.
On Thanos v0.35.1, we are seeing what looks like a goroutine leak between our two thanos querier deployments. This behavior was not seen until we implemented query chaining on 6/18. From then on, you can see the steady rise and sawtooth pattern emerge in the goroutines. Along with this, we have seen an increase in timeouts and latency in our queries.
This was reproduced in our dev environment; when downgrading to Thanos v0.35.0 (still using the exact same configuration with query chaining), goroutines return to normal.
Possibly related to this: when testing queries on v0.35.0, I see that certain queries hit our `--store.response-timeout` of 10s between our non-tls querier and tls querier. This results in the following error when the load time is over 10s.

Interestingly, on Thanos v0.35.1, instead of failing after 10s when we hit the `--store.response-timeout`, the queries hang indefinitely until we hit our `response_header_timeout` in the query-frontend. This is paired with the buildup of goroutines shown earlier.
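For context on where these two timeouts live, a rough sketch with placeholder values and URLs (flag names as we understand them from the docs, not our production config):

```bash
# Querier side: per-StoreAPI response timeout (the 10s we hit above).
thanos query --store.response-timeout=10s ...

# Query-frontend side: response_header_timeout sits in the downstream tripper
# config (inline YAML here; URL and value are placeholders).
thanos query-frontend \
  --query-frontend.downstream-url=http://thanos-querier.monitoring.svc:9090 \
  --query-frontend.downstream-tripper-config='response_header_timeout: 5m'
```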
What you expected to happen:
We expected to be able to utilize query chaining without the large, steady increase in goroutines. Additionally, we are wondering what could be causing the `--store.response-timeout` to be hit (or not hit) in this situation.

How to reproduce it (as minimally and precisely as possible):
- Thanos querier config that has the sharded stores and the thanos tls querier added
- Thanos tls querier config with the Prometheus sidecar endpoints in different clusters
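Not our actual manifests; a rough, hypothetical sketch of the flag layout for a chained setup like this (addresses, shard names, and TLS paths are made up):

```bash
# Outer (non-tls) querier: fans out to the sharded stores plus the tls querier.
thanos query \
  --store.response-timeout=10s \
  --endpoint=thanos-store-shard-0.monitoring.svc:10901 \
  --endpoint=thanos-store-shard-1.monitoring.svc:10901 \
  --endpoint=thanos-querier-tls.monitoring.svc:10901

# Inner (tls) querier: reaches the Prometheus sidecars in other clusters over TLS.
thanos query \
  --grpc-client-tls-secure \
  --grpc-client-tls-ca=/etc/certs/ca.crt \
  --grpc-client-tls-cert=/etc/certs/tls.crt \
  --grpc-client-tls-key=/etc/certs/tls.key \
  --endpoint=prometheus-sidecar.cluster-a.example.com:10901 \
  --endpoint=prometheus-sidecar.cluster-b.example.com:10901
```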
With the above configuration on Thanos v0.35.1, we see the described goroutine behavior, and queries hang even though the `--store.response-timeout` seems to be hit. On v0.35.0, goroutines return to normal and the same query times out at 10s. We see this mainly on larger clusters, against our largest Prometheus, which has around 30 million active series.