
bulk: range cache miss during index backfill causes sustained slowdown #84290

Closed
nvanbenschoten opened this issue Jul 12, 2022 · 0 comments · Fixed by #86492
Assignees
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments

@nvanbenschoten
Member

nvanbenschoten commented Jul 12, 2022

During an index creation on a 6TB table in a cluster with 15 nodes (48 vCPU per instance), I noticed a surprising slowdown. After running for about 12 hours, the index creation slowed to a trickle (~1% of peak ingestion throughput). About an hour later, it sped up to about 10% of peak ingestion throughput.

[screenshot: Screen Shot 2022-07-10 at 5 16 33 AM]

The `distsender.rpc.addsstable.sent` metric revealed an interesting pattern. Most nodes had stopped sending AddSSTable requests entirely — they were finished. However, much earlier in the backfill, two of the nodes had fallen off dramatically in their rate of sending AddSSTable requests.

[screenshot: Screen Shot 2022-07-10 at 5 16 16 AM]

Interestingly, at different times, these two nodes eventually did pick up the rate of sending and quickly finished.

[screenshot: Screen Shot 2022-07-10 at 2 24 35 PM]

I was able to grab logs during the slow period from each of these nodes: cockroach.log

I was also able to grab logspy with high verbosity on the `sst_batcher.go` file: full_logspy.txt

We see an interesting pattern here. During the slowdown, the index backfiller appears to be in a linear regime where it repeatedly hits `SSTable cannot be added spanning range bounds` lines. These are meant to be rare.

In one (of dozens, I got lucky) of my logspy attempts, I was able to catch a pair of log lines that I think hints at the problem:

I220710 09:25:04.277698 12929447 kv/bulk/sst_batcher.go:327 ⋮ [n2,f‹57784e93›,job=0,indexBackfillerProcessor=120] 6954 ‹trade backfill› no cached range desc available to determine sst flush key
I220710 09:25:04.377969 12929447 kv/bulk/sst_batcher.go:407 ⋮ [n2,f‹57784e93›,job=0,indexBackfillerProcessor=120] 7031 ‹trade backfill› flushing 16 MiB SST due to size > 16 MiB

It seems that at some point, the SSTBatcher checked to see whether it should perform a rangeFlush:

r := b.rc.GetCached(ctx, k, false /* inverted */)
if r != nil {
	k := r.Desc().EndKey.AsRawKey()
	b.flushKey = k
	log.VEventf(ctx, 3, "%s building sstable that will flush before %v", b.name, k)
} else {
	log.VEventf(ctx, 2, "%s no cached range desc available to determine sst flush key", b.name)
}

When checking, it did not find a cached range descriptor for the first key in a batch. As a result, it decided not to perform a range flush for the current batch. Instead, it waited until the batch was 16 MiB in size before performing a size flush.
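
To restate the mechanics as I read them from the snippet above, here is a minimal sketch of the two flush triggers. The function, names, and constant are illustrative assumptions, not the real SSTBatcher code.

package bulksketch

import "bytes"

const sizeFlushLimit = 16 << 20 // 16 MiB, matching the size flush seen in the logs

// shouldFlush sketches the two flush triggers. flushKey is the cached range
// descriptor's EndKey when one was found, and nil on a range cache miss.
func shouldFlush(bufSize int, nextKey, flushKey []byte) (bool, string) {
	// Range flush: stop just before the buffer would cross a range boundary.
	if flushKey != nil && bytes.Compare(nextKey, flushKey) >= 0 {
		return true, "range boundary"
	}
	// Size flush: with no flushKey (the cache-miss case), we keep buffering
	// until the size limit, by which point the buffer can straddle many
	// ranges and must be split into one small AddSSTable per range.
	if bufSize >= sizeFlushLimit {
		return true, "size"
	}
	return false, ""
}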

This did not go well. At the time of this size flush, the index keyspace had 11,971 ranges. As a result, the 16 MiB buffer was split into chunks of about 5 KiB each (according to logspy). We then sent these small AddSSTables one by one. Each took about 50ms, so we sent about 20 AddSSTables per second, which meant we spent about 160 seconds flushing the 16 MiB buffer.
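
For reference, a back-of-envelope check of those numbers (a standalone sketch, not batcher code):

package main

import "fmt"

func main() {
	const (
		bufferBytes = 16 << 20 // 16 MiB size-flush buffer
		chunkBytes  = 5 << 10  // ~5 KiB per-range chunks seen in logspy
		secPerChunk = 0.050    // ~50ms per AddSSTable, sent one by one
	)
	chunks := float64(bufferBytes) / float64(chunkBytes)
	fmt.Printf("%.0f requests * %.0f ms each = ~%.0f s per flush\n",
		chunks, secPerChunk*1000, chunks*secPerChunk)
}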

I think the takeaway here is that a RangeCache miss can undermine #79967, which becomes increasingly important with large data sizes. Perhaps we should replace the call to RangeCache.GetCached with one to RangeCache.Lookup so that we perform a RangeCache lookup on a cache miss. We're going to send to the range anyway, so a cache lookup now vs. later shouldn't make a difference.
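
To make that concrete, here is roughly what the swap could look like, written in the style of the snippet above. This is a hedged sketch, not an actual patch: the exact RangeCache.Lookup signature and error handling are assumptions.

// Sketch only: on a cache miss, Lookup performs a meta lookup rather than
// returning nothing, so flushKey can still be set and range flushes keep
// happening even after a descriptor was evicted.
if entry, err := b.rc.Lookup(ctx, k); err != nil {
	// Whether to warn and fall back to a size flush here is an open question.
	log.VEventf(ctx, 2, "%s failed to determine sst flush key: %v", b.name, err)
} else {
	b.flushKey = entry.Desc().EndKey.AsRawKey()
	log.VEventf(ctx, 3, "%s building sstable that will flush before %v", b.name, b.flushKey)
}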

One thing I don't understand is why we kept hitting range cache misses once we were in this regime. Perhaps that hints at something else going wrong as well. Or perhaps there was a gap in our range cache that kept causing us to skip range flushes but was also never filled during a size flush.

cc. @dt

Jira issue: CRDB-17573

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery labels Jul 12, 2022
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Jul 12, 2022
@dt dt assigned nvanbenschoten and unassigned dt Aug 19, 2022
craig bot pushed a commit that referenced this issue Sep 20, 2022
85859: pgwire: Add support for cursors with special characters r=knz,rafiss a=rimadeodhar

In CockroachDB and Postgres, it is possible to declare
cursors with special characters enclosed within double
quotes, e.g. "1-2-3". Currently, we store the name as an
unescaped string, which causes problems during the pgwire
DESCRIBE step when looking up the cursor. We should be
using the tree.Name datatype for the cursor name when
storing and looking up cursors. This PR updates the code
to start using tree.Name instead of raw strings for
handling cursor names. This fixes the issue where the
pgwire DESCRIBE step fails while attempting to look up
cursors with names containing special characters.

Resolves #84261

Release note (bug fix): The pgwire DESCRIBE step no longer
fails with an error while attempting to look up cursors
declared with names containing special characters.
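
For illustration, here is a minimal standalone sketch of the distinction the PR description above draws. The quoteIdent helper is hypothetical, standing in for tree.Name-style identifier formatting; it is not the PR's code.

package main

import (
	"fmt"
	"strings"
)

// quoteIdent is a hypothetical stand-in for tree.Name-style formatting:
// identifiers that are not plain lower-case words get double-quoted, with
// embedded quotes doubled, so a cursor named 1-2-3 round-trips as "1-2-3".
func quoteIdent(s string) string {
	for _, r := range s {
		if (r < 'a' || r > 'z') && r != '_' {
			return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
		}
	}
	return s
}

func main() {
	// Keying the cursor map on one canonical identifier form avoids the
	// DESCRIBE-time mismatch between escaped and unescaped names.
	fmt.Println(quoteIdent("1-2-3")) // "1-2-3"
	fmt.Println(quoteIdent("abc"))   // abc
}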

86492: bulk: perform meta lookup on range cache miss during index backfill r=dt a=nvanbenschoten

Fixes #84290.

This commit addresses the sustained slowdown described in #84290 by replacing
the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to
`RangeCache.Lookup`. The former method checks the cache but returns no range
descriptor if the cache currently has no overlapping key. This is possible if
the descriptor was recently evicted because it was stale. The latter method
performs a meta lookup if the cache currently has no overlapping key, so it
should always return _some_ range descriptor.

There's a question of whether we should be logging a warning but proceeding if
this meta lookup fails. For now, I've decided not to change that behavior.

Release justification: None. Don't merge yet.

87885: kv: remove broken attempt to reject lease acquisitions on draining nodes r=nvanbenschoten a=nvanbenschoten

Related to #83261.

This commit removes "protection" that avoided lease acquisitions on draining nodes. This protection had already been effectively disabled by acc1ad1, which allowed Raft leaders to bypass the check. As the comment here (added in 5ffaa9e) explained, Raft followers are already unable to acquire the lease. If leaders bypass the check and followers (and candidates) don't need it, the check is useless, so we remove it.

The commit also removes `TestReplicaDrainLease`, which was supposed to test this behavior. We remove the test not because it started failing after the change, but because it did not. It must not have been testing anything real anymore after acc1ad1.

Release justification: low risk change related to release blocker.

Release note: None

88205: kvserver: (partially) deflake transfer-leases/drain-other-node r=irfansharif a=irfansharif

In #85629 we changed our lease transfer protocol to only ever transfer expiration-based leases, and have recipients later upgrade them to the more efficient epoch-based ones. This was done to limit the effects of ill-advised lease transfers, since the incoming leaseholder would need to recognize itself as such within a few seconds -- so we wanted this upgrade to happen only after the lease had been received.

In #83261, however, we noticed that the upgrade was not immediate -- we were waiting until the current lease's expiration was within its renewal duration -- 4.5s. When the lease was eventually renewed the upgrade did happen, but it was not immediate. We fix this here and remove the manual clock advancement in the supporting test that had masked this issue. The test now demonstrates that we're no longer relying on upgrades happening as part of the (slow) renewal process.

Release note: None

Co-authored-by: rimadeodhar <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@craig craig bot closed this as completed in 5a85950 Sep 21, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 21, 2022
Fixes #84290.

This commit addresses the sustained slowdown described in #84290 by replacing
the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to
`RangeCache.Lookup`. The former method checks the cache but returns no range
descriptor if the cache currently has no overlapping key. This is possible if
the descriptor was recently evicted because it was stale. The latter method
performs a meta lookup if the cache currently has no overlapping key, so it
should always return _some_ range descriptor.

There's a question of whether we should be logging a warning but proceeding if
this meta lookup fails. For now, I've decided not to change that behavior.

Release justification: None. Don't merge yet.
nvanbenschoten added a commit that referenced this issue Nov 16, 2022
Fixes #84290.

This commit addresses the sustained slowdown described in #84290 by replacing
the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to
`RangeCache.Lookup`. The former method checks the cache but returns no range
descriptor if the cache currently has no overlapping key. This is possible if
the descriptor was recently evicted because it was stale. The latter method
performs a meta lookup if the cache currently has no overlapping key, so it
should always return _some_ range descriptor.

There's a question of whether we should be logging a warning but proceeding if
this meta lookup fails. For now, I've decided not to change that behavior.

Release justification: None. Don't merge yet.