Items are not always returned back to the chunk pool #7883

Closed
harry671003 opened this issue Nov 4, 2024 · 2 comments

@harry671003
Contributor

Issue

After pulling #7821 into Cortex, we noticed that not all items were returned to the chunk pool. As a result, usedTotal in the pool kept increasing until it reached maxTotal (30 GB). After that point, store gateways could not process any requests.
We were able to isolate the root cause to the removal of defer blockClient.Close() from Series() in bucket.go. After putting it back, the issue didn’t occur.
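
To illustrate why missing puts eventually block all requests, here is a minimal sketch of a pool with a byte budget. It is not the Thanos or Cortex implementation: boundedPool, errPoolExhausted and serve are hypothetical names, and only usedTotal and maxTotal are taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPoolExhausted is a hypothetical error used only in this sketch.
var errPoolExhausted = errors.New("pool exhausted")

// boundedPool is a toy stand-in for a chunk bytes pool with a byte budget.
type boundedPool struct {
	mu        sync.Mutex
	usedTotal uint64
	maxTotal  uint64
}

// Get reserves sz bytes from the budget and returns a buffer of that capacity.
func (p *boundedPool) Get(sz int) (*[]byte, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.usedTotal+uint64(sz) > p.maxTotal {
		return nil, errPoolExhausted
	}
	p.usedTotal += uint64(sz)
	b := make([]byte, 0, sz)
	return &b, nil
}

// Put gives the buffer's capacity back to the budget.
func (p *boundedPool) Put(b *[]byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.usedTotal -= uint64(cap(*b))
}

// serve simulates n requests that each take one buffer of sz bytes.
// When release is false the buffers are never returned, which models the leak.
// It returns the number of requests that obtained a buffer.
func serve(p *boundedPool, n, sz int, release bool) int {
	ok := 0
	for i := 0; i < n; i++ {
		b, err := p.Get(sz)
		if err != nil {
			continue // budget exhausted, request fails
		}
		ok++
		if release {
			p.Put(b)
		}
	}
	return ok
}

func main() {
	leaky := &boundedPool{maxTotal: 1024}
	fmt.Println("leaky:", serve(leaky, 10, 256, false), "served, usedTotal =", leaky.usedTotal)

	healthy := &boundedPool{maxTotal: 1024}
	fmt.Println("healthy:", serve(healthy, 10, 256, true), "served, usedTotal =", healthy.usedTotal)
}
```

In the leaky case usedTotal only ever grows, so once the budget is spent every later Get fails, which matches the behaviour described above.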

Metrics

In Cortex, the chunk pool has metrics to track its usage. See: https://github.com/cortexproject/cortex/blob/c25b18d514a191182a818c8f0c954564cf6ceaf4/pkg/storegateway/chunk_bytes_pool.go#L23

The following graphs are for one of the store-gateways in the cluster.

Chunk pool usedTotal growing

[Screenshot: chunk pool usedTotal graph, 2024-11-04 2:32:44 PM]

Chunk pool gets - puts

sum(rate(cortex_bucket_store_chunk_pool_operation_bytes_total{stats="cap", operation="get", instance="store-gateway-0"}[15m])) - sum(rate(cortex_bucket_store_chunk_pool_operation_bytes_total{stats="cap", operation="put", instance="store-gateway-0"}[15m]))
[Screenshot: chunk pool gets minus puts graph, 2024-11-04 2:42:43 PM]

This shows that more bytes are taken from the pool (gets) than are returned to it (puts).

Chunk pool growth after making the following change

Making the following change seems to have fixed the problem.

> git diff vendor/github.com/thanos-io/thanos/pkg/store/bucket.go
diff --git a/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go b/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go
index b3a4a72d..6385a664 100644
--- a/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go
+++ b/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go
@@ -1691,7 +1692,9 @@ func (s *BucketStore) Series(req *storepb.SeriesRequest, seriesSrv storepb.Store
        // Merge the sub-results from each selected block.
        tracing.DoInSpan(ctx, "bucket_store_merge_all", func(ctx context.Context) {
                begin := time.Now()
-               set := NewResponseDeduplicator(NewProxyResponseLoserTree(respSets...))
+               lt := NewProxyResponseLoserTree(respSets...)
+               defer lt.Close()
+               set := NewResponseDeduplicator(lt)
                i := 0
                for set.Next() {
                        i++
[Screenshot: chunk pool growth after the change, 2024-11-04 2:44:17 PM]
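
The pattern in the diff above (build the merged set, then defer its Close) can be sketched in isolation. Everything here is hypothetical and much simpler than the Thanos types: pool, respSet, mergedSet and consume are illustrative names, and the sketch assumes that buffers are handed back only when Close is called.

```go
package main

import "fmt"

// pool counts buffers that were handed out and not yet returned.
type pool struct{ outstanding int }

func (p *pool) Get() []byte  { p.outstanding++; return make([]byte, 16) }
func (p *pool) Put(_ []byte) { p.outstanding-- }

// respSet is a hypothetical sub-result that owns pooled buffers.
type respSet struct {
	p    *pool
	bufs [][]byte
	pos  int
}

func newRespSet(p *pool, n int) *respSet {
	s := &respSet{p: p}
	for i := 0; i < n; i++ {
		s.bufs = append(s.bufs, p.Get())
	}
	return s
}

func (s *respSet) Next() bool {
	if s.pos >= len(s.bufs) {
		return false
	}
	s.pos++
	return true
}

// Close returns every buffer to the pool; calling it twice is harmless.
func (s *respSet) Close() {
	for _, b := range s.bufs {
		s.p.Put(b)
	}
	s.bufs = nil
}

// mergedSet iterates over several sub-results in order.
type mergedSet struct {
	sets []*respSet
	cur  int
}

func (m *mergedSet) Next() bool {
	for m.cur < len(m.sets) {
		if m.sets[m.cur].Next() {
			return true
		}
		m.cur++
	}
	return false
}

// Close closes every sub-result, including ones that were never fully read.
func (m *mergedSet) Close() {
	for _, s := range m.sets {
		s.Close()
	}
}

// consume reads at most limit items and stops early, as a request might on
// a limit or an error. withClose toggles the deferred Close.
func consume(p *pool, limit int, withClose bool) int {
	m := &mergedSet{sets: []*respSet{newRespSet(p, 3), newRespSet(p, 3)}}
	if withClose {
		defer m.Close()
	}
	n := 0
	for m.Next() {
		n++
		if n == limit {
			break
		}
	}
	return n
}

func main() {
	leaky := &pool{}
	consume(leaky, 2, false)
	fmt.Println("without Close, outstanding buffers:", leaky.outstanding)

	fixed := &pool{}
	consume(fixed, 2, true)
	fmt.Println("with deferred Close, outstanding buffers:", fixed.outstanding)
}
```

Deferring Close on the outermost set means the buffers go back to the pool on every exit path, including an early return from the merge loop.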
fpetkovski added a commit to fpetkovski/thanos that referenced this issue Nov 5, 2024
Applies the fix described in thanos-io#7883.

Signed-off-by: Filip Petkovski
@harry671003
Contributor Author

We noticed another issue that persists even after cherry-picking the diff: pendingReaders is not decremented correctly for some blocks. Because of this, store-gateways are not able to sync blocks.

Goroutine stuck at:

b.pendingReaders.Wait()

@harry671003
Contributor Author

This can be closed by #7915
