Items are not always returned back to the chunk pool #7883

harry671003 · 2024-11-04T22:45:35Z

Issue

After pulling #7821 into Cortex, we noticed that not all items were returned to the chunk pool. This caused the pool's `usedTotal` to keep growing until it reached `maxTotal` (30 GB). After that, store gateways were unable to process any requests.
We were able to isolate the root cause to the removal of `defer blockClient.Close()` from `Series()` in bucket.go. After putting it back, the issue no longer occurred.
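
To illustrate the failure mode, here is a minimal sketch of a size-limited pool that tracks `usedTotal` against `maxTotal`. The `boundedPool` type, `errPoolExhausted`, and the `serve` helper are hypothetical names for this example only. They are not the Cortex or Thanos implementation, and the sketch is deliberately not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// errPoolExhausted is a hypothetical error used only in this sketch.
var errPoolExhausted = errors.New("pool exhausted")

// boundedPool is a simplified stand-in for a size-limited chunk pool.
// It only does the accounting; it does not actually reuse buffers.
type boundedPool struct {
	usedTotal uint64
	maxTotal  uint64
}

// Get reserves sz bytes, failing once the limit would be exceeded.
func (p *boundedPool) Get(sz int) ([]byte, error) {
	if p.usedTotal+uint64(sz) > p.maxTotal {
		return nil, errPoolExhausted
	}
	p.usedTotal += uint64(sz)
	return make([]byte, 0, sz), nil
}

// Put releases the reservation held by b.
func (p *boundedPool) Put(b []byte) {
	p.usedTotal -= uint64(cap(b))
}

// serve simulates n requests of 10 bytes each and returns how many failed.
// When release is false, buffers are never returned to the pool.
func serve(p *boundedPool, n int, release bool) (failed int) {
	for i := 0; i < n; i++ {
		b, err := p.Get(10)
		if err != nil {
			failed++
			continue
		}
		if release {
			p.Put(b)
		}
	}
	return failed
}

func main() {
	leaky := &boundedPool{maxTotal: 30}
	fmt.Println(serve(leaky, 5, false), leaky.usedTotal)

	healthy := &boundedPool{maxTotal: 30}
	fmt.Println(serve(healthy, 5, true), healthy.usedTotal)
}
```

In the leaky case every request after the third fails, because nothing is ever released. That matches the symptom above: once `usedTotal` reaches `maxTotal`, no further request can get a buffer.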

Metrics

In Cortex, the chunk pool has metrics to track its usage. See: https://github.com/cortexproject/cortex/blob/c25b18d514a191182a818c8f0c954564cf6ceaf4/pkg/storegateway/chunk_bytes_pool.go#L23

The following graphs are for one of the store-gateways in the cluster.

Chunk pool `usedTotal` growing

Chunk pool gets - puts

sum(rate(cortex_bucket_store_chunk_pool_operation_bytes_total{stats="cap", operation="get", instance="store-gateway-0"}[15m])) - sum(rate(cortex_bucket_store_chunk_pool_operation_bytes_total{stats="cap", operation="put", instance="store-gateway-0"}[15m]))

This shows that the rate of bytes taken from the pool (gets) exceeds the rate of bytes returned to it (puts).

Chunk pool growth after making the following change

Making the following change seems to have fixed the problem.

> git diff vendor/github.com/thanos-io/thanos/pkg/store/bucket.go
diff --git a/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go b/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go
index b3a4a72d..6385a664 100644
--- a/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go
+++ b/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go
@@ -1691,7 +1692,9 @@ func (s *BucketStore) Series(req *storepb.SeriesRequest, seriesSrv storepb.Store
        // Merge the sub-results from each selected block.
        tracing.DoInSpan(ctx, "bucket_store_merge_all", func(ctx context.Context) {
                begin := time.Now()
-               set := NewResponseDeduplicator(NewProxyResponseLoserTree(respSets...))
+               lt := NewProxyResponseLoserTree(respSets...)
+               defer lt.Close()
+               set := NewResponseDeduplicator(lt)
                i := 0
                for set.Next() {
                        i++
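
The effect of the `defer lt.Close()` line can be sketched in isolation. The example below assumes that each per-block response set holds a pooled buffer until it is closed, and that closing the merging iterator closes every underlying set. The `pool`, `blockSet`, and `merged` types and the `consume` helper are hypothetical stand-ins, not the Thanos types, and `merged` is a plain sequential merge rather than a loser tree.

```go
package main

import "fmt"

// pool counts buffers that have been handed out and not yet returned.
type pool struct{ outstanding int }

func (p *pool) get() []byte { p.outstanding++; return make([]byte, 16) }
func (p *pool) put(_ []byte) { p.outstanding-- }

// blockSet is a hypothetical per-block response set that owns one pooled buffer.
type blockSet struct {
	p    *pool
	buf  []byte
	left int
}

func newBlockSet(p *pool, n int) *blockSet {
	return &blockSet{p: p, buf: p.get(), left: n}
}

func (b *blockSet) Next() bool {
	if b.left == 0 {
		return false
	}
	b.left--
	return true
}

// Close returns the buffer to the pool; calling it twice is harmless.
func (b *blockSet) Close() {
	if b.buf != nil {
		b.p.put(b.buf)
		b.buf = nil
	}
}

// merged stands in for the merging iterator over all block sets.
type merged struct{ sets []*blockSet }

func (m *merged) Next() bool {
	for _, s := range m.sets {
		if s.Next() {
			return true
		}
	}
	return false
}

// Close closes every underlying set.
func (m *merged) Close() {
	for _, s := range m.sets {
		s.Close()
	}
}

// consume reads at most limit items, then stops early.
// Buffers are released only if closeOnExit is true.
func consume(p *pool, limit int, closeOnExit bool) int {
	m := &merged{sets: []*blockSet{newBlockSet(p, 3), newBlockSet(p, 3)}}
	if closeOnExit {
		defer m.Close()
	}
	n := 0
	for m.Next() {
		n++
		if n == limit {
			break // early exit, e.g. an error or a cancelled request
		}
	}
	return n
}

func main() {
	p := &pool{}
	consume(p, 2, false)
	fmt.Println("without Close:", p.outstanding)

	p = &pool{}
	consume(p, 2, true)
	fmt.Println("with deferred Close:", p.outstanding)
}
```

Deferring the close ties buffer release to leaving the function rather than to the iterator being fully drained, so early exits no longer leave buffers checked out.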

harry671003 · 2024-11-05T18:16:25Z

We also noticed another issue, even after cherry-picking the diff: `pendingReaders` for some blocks is not decremented correctly. As a result, store-gateways are unable to sync blocks.

Goroutine stuck at:

thanos/pkg/store/bucket.go

Line 2507 in 9bc3cc0

b.pendingReaders.Wait()

harry671003 · 2024-11-21T16:34:09Z

This can be closed by #7915

dosubot bot added bug component: bucket tools labels Nov 4, 2024

harry671003 added component: store and removed component: bucket tools labels Nov 4, 2024

fpetkovski added a commit to fpetkovski/thanos that referenced this issue Nov 5, 2024

Fix bug in Bucket Series

77bd9c0

Applies the fix described in thanos-io#7883. Signed-off-by: Filip Petkovski <[email protected]>

fpetkovski mentioned this issue Nov 5, 2024

Fix bug in Bucket Series #7885

Merged

2 tasks

yeya24 mentioned this issue Nov 17, 2024

Close block series client at the end to not reuse chunk buf #7915

Merged

2 tasks

harry671003 closed this as completed Nov 21, 2024
