Query: possible goroutine leak seen between queriers on v0.35.1 #7553
Comments
I have taken some goroutine dumps from our queriers:
- goroutine dump from one of our thanos-querier pods
- goroutine dump from one of our thanos-querier-tls pods

We also see a correlation between memory usage and goroutines, shown in these dashboards.
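For anyone trying to reproduce this, a minimal sketch of how a similar dump can be collected, assuming pprof is exposed on the querier's default HTTP port (10902) and using hypothetical pod/namespace names:

```bash
# Forward the querier's HTTP port locally (pod and namespace names are examples).
kubectl -n monitoring port-forward pod/thanos-querier-0 10902:10902 &

# Full goroutine dump with stacks from Go's pprof handler.
curl -s "http://localhost:10902/debug/pprof/goroutine?debug=2" > thanos-querier.goroutines.txt
```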
Maybe related to #6948? cc @thibaultmg, any chance there could be something missing when chaining queriers?
Hey, indeed I did some changes on that code path. Looking quickly at the code, I would suspect bad context management.
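For illustration only, not the actual Thanos code: the usual shape of this kind of leak is a sender goroutine that blocks forever because the caller returns early and the shared context is never cancelled. A minimal Go sketch of that pattern and the fix:

```go
package main

import (
	"context"
	"time"
)

// fetchSeries simulates a downstream store call that streams a result back
// over a channel. Without the ctx.Done() case, an early-returning caller
// would leave this goroutine blocked on the send forever: a goroutine leak.
func fetchSeries(ctx context.Context) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		select {
		case out <- "series-data": // blocks until someone receives...
		case <-ctx.Done(): // ...or the caller cancels the context.
		}
	}()
	return out
}

func main() {
	// Derive a cancellable context and always cancel it when the caller
	// returns, including on the give-up-early path below.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	select {
	case <-fetchSeries(ctx):
		// Got data in time.
	case <-time.After(100 * time.Millisecond):
		// Caller gave up early; the deferred cancel unblocks the sender.
	}
}
```

In a chained-querier setup the same idea applies at each hop: whichever layer gives up first has to cancel the context it passed downstream, otherwise the downstream goroutines stay parked.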
We are now seeing this issue resolved after the fix released in v0.36.1: #7618
Awesome, I'll close the issue then!
Thanos, Prometheus and Golang version used:
Thanos version: 0.35.1
Prometheus version: 2.48.1
Object Storage Provider:
AWS S3
What happened:
We are running Thanos using query chaining between tls and non-tls endpoints.
On Thanos v0.35.1, we are seeing what looks like a goroutine leak between our two thanos querier deployments. This behavior was not seen until we implemented query chaining on 6/18. From then on, you can see the steady rise and sawtooth pattern emerge in the goroutines. Along with this, we have seen an increase in timeouts and latency in our queries.
This was reproduced in our dev environment; when downgrading to Thanos v0.35.0 (still using the exact same configuration with query chaining), goroutines return to normal.
Possibly related to this: when testing queries on v0.35.0, I see that certain queries hit our `--store.response-timeout` of 10s between our non-tls querier and tls querier. This results in the following error when the load time is over 10s.

Interestingly, on Thanos v0.35.1, instead of failing after 10s when we hit the `--store.response-timeout`, the queries hang indefinitely until we hit our `response_header_timeout` in the query-frontend. This is paired with the buildup of goroutines shown earlier.
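For context on where these two timeouts live, a rough sketch with placeholder values and URLs (flag names as we understand them from the docs, not our production config):

```bash
# Querier side: per-StoreAPI response timeout (the 10s we hit above).
thanos query --store.response-timeout=10s ...

# Query-frontend side: response_header_timeout sits in the downstream tripper
# config (inline YAML here; URL and value are placeholders).
thanos query-frontend \
  --query-frontend.downstream-url=http://thanos-querier.monitoring.svc:9090 \
  --query-frontend.downstream-tripper-config='response_header_timeout: 5m'
```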
What you expected to happen:
We expected to be able to utilize query chaining without the large, steady increase in goroutines. Additionally, we are wondering what could be causing the `--store.response-timeout` to be hit (or not hit) in this situation.

How to reproduce it (as minimally and precisely as possible):
- Thanos querier config that has the sharded stores and the thanos tls querier added
- Thanos tls querier config with the Prometheus sidecar endpoints in different clusters
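Not our actual manifests; a rough, hypothetical sketch of the flag layout for a chained setup like this (addresses, shard names, and TLS paths are made up):

```bash
# Outer (non-tls) querier: fans out to the sharded stores plus the tls querier.
thanos query \
  --store.response-timeout=10s \
  --endpoint=thanos-store-shard-0.monitoring.svc:10901 \
  --endpoint=thanos-store-shard-1.monitoring.svc:10901 \
  --endpoint=thanos-querier-tls.monitoring.svc:10901

# Inner (tls) querier: reaches the Prometheus sidecars in other clusters over TLS.
thanos query \
  --grpc-client-tls-secure \
  --grpc-client-tls-ca=/etc/certs/ca.crt \
  --grpc-client-tls-cert=/etc/certs/tls.crt \
  --grpc-client-tls-key=/etc/certs/tls.key \
  --endpoint=prometheus-sidecar.cluster-a.example.com:10901 \
  --endpoint=prometheus-sidecar.cluster-b.example.com:10901
```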
With the above configuration on Thanos v0.35.1, we see the described goroutine behavior, and queries hang even though the `--store.response-timeout` seems to be hit. On v0.35.0, goroutines return to normal and the same query times out at 10s. We see this mainly on larger clusters, against our largest Prometheus, which has around 30 million active series.