
bulk: range cache miss during index backfill causes sustained slowdown #84290

Closed
nvanbenschoten opened this issue Jul 12, 2022 · 0 comments · Fixed by #86492
Assignees
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments

@nvanbenschoten
Member

nvanbenschoten commented Jul 12, 2022

During an index creation on a 6TB table in a cluster with 15 nodes (48 vCPU per instance), I noticed a surprising slowdown. After running for about 12 hours, the index creation slowed to a trickle (~1% of peak ingestion throughput). About an hour later, it sped up to about 10% of peak ingestion throughput.

[screenshot: Screen Shot 2022-07-10 at 5 16 33 AM]

The `distsender.rpc.addsstable.sent` metric revealed an interesting pattern. Most nodes had stopped sending AddSSTable requests entirely — they were finished. However, much earlier in the backfill, two of the nodes had fallen off dramatically in their rate of sending AddSSTable requests.

[screenshot: Screen Shot 2022-07-10 at 5 16 16 AM]

Interestingly, at different times, these two nodes eventually did pick up the rate of sending and quickly finished.

[screenshot: Screen Shot 2022-07-10 at 2 24 35 PM]

I was able to grab logs during the slow period from each of these nodes: cockroach.log

I was also able to grab logspy with high verbosity on the `sst_batcher.go` file: full_logspy.txt

We see an interesting pattern here. During the slowdown, the index backfiller appears to be in a linear regime where it repeatedly hits `SSTable cannot be added spanning range bounds` lines. These are meant to be rare.

In one (of dozens, I got lucky) of my logspy attempts, I was able to catch a pair of log lines that I think hints at the problem:

I220710 09:25:04.277698 12929447 kv/bulk/sst_batcher.go:327 ⋮ [n2,f‹57784e93›,job=0,indexBackfillerProcessor=120] 6954 ‹trade backfill› no cached range desc available to determine sst flush key
I220710 09:25:04.377969 12929447 kv/bulk/sst_batcher.go:407 ⋮ [n2,f‹57784e93›,job=0,indexBackfillerProcessor=120] 7031 ‹trade backfill› flushing 16 MiB SST due to size > 16 MiB

It seems that at some point, the SSTBatcher checked to see whether it should perform a rangeFlush:

r := b.rc.GetCached(ctx, k, false /* inverted */)
if r != nil {
	k := r.Desc().EndKey.AsRawKey()
	b.flushKey = k
	log.VEventf(ctx, 3, "%s building sstable that will flush before %v", b.name, k)
} else {
	log.VEventf(ctx, 2, "%s no cached range desc available to determine sst flush key", b.name)
}

When checking, it did not find a cached range descriptor for the first key in a batch. As a result, it decided not to perform a range flush for the current batch. Instead, it waited until the batch was 16 MiB in size before performing a size flush.
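
To restate the mechanics as I read them from the snippet above, here is a minimal sketch of the two flush triggers. The function, names, and constant are illustrative assumptions, not the real SSTBatcher code.

package bulksketch

import "bytes"

const sizeFlushLimit = 16 << 20 // 16 MiB, matching the size flush seen in the logs

// shouldFlush sketches the two flush triggers. flushKey is the cached range
// descriptor's EndKey when one was found, and nil on a range cache miss.
func shouldFlush(bufSize int, nextKey, flushKey []byte) (bool, string) {
	// Range flush: stop just before the buffer would cross a range boundary.
	if flushKey != nil && bytes.Compare(nextKey, flushKey) >= 0 {
		return true, "range boundary"
	}
	// Size flush: with no flushKey (the cache-miss case), we keep buffering
	// until the size limit, by which point the buffer can straddle many
	// ranges and must be split into one small AddSSTable per range.
	if bufSize >= sizeFlushLimit {
		return true, "size"
	}
	return false, ""
}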

This did not go well. At the time of this size flush, the index keyspace had 11,971 ranges. As a result, the 16 MiB buffer was split into chunks of about 5 KiB each (according to logspy). We then sent these small AddSSTables one by one. Each took about 50ms, so we sent about 20 AddSSTables per second, which meant we spent about 160 seconds flushing the 16 MiB buffer.
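
For reference, a back-of-envelope check of those numbers (a standalone sketch, not batcher code):

package main

import "fmt"

func main() {
	const (
		bufferBytes = 16 << 20 // 16 MiB size-flush buffer
		chunkBytes  = 5 << 10  // ~5 KiB per-range chunks seen in logspy
		secPerChunk = 0.050    // ~50ms per AddSSTable, sent one by one
	)
	chunks := float64(bufferBytes) / float64(chunkBytes)
	fmt.Printf("%.0f requests * %.0f ms each = ~%.0f s per flush\n",
		chunks, secPerChunk*1000, chunks*secPerChunk)
}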

I think the takeaway here is that a RangeCache miss can undermine #79967, which becomes increasingly important with large data sizes. Perhaps we should replace the call to RangeCache.GetCached with one to RangeCache.Lookup so that we perform a RangeCache lookup on a cache miss. We're going to send to the range anyway, so a cache lookup now vs. later shouldn't make a difference.
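
To make that concrete, here is roughly what the swap could look like, written in the style of the snippet above. This is a hedged sketch, not an actual patch: the exact RangeCache.Lookup signature and error handling are assumptions.

// Sketch only: on a cache miss, Lookup performs a meta lookup rather than
// returning nothing, so flushKey can still be set and range flushes keep
// happening even after a descriptor was evicted.
if entry, err := b.rc.Lookup(ctx, k); err != nil {
	// Whether to warn and fall back to a size flush here is an open question.
	log.VEventf(ctx, 2, "%s failed to determine sst flush key: %v", b.name, err)
} else {
	b.flushKey = entry.Desc().EndKey.AsRawKey()
	log.VEventf(ctx, 3, "%s building sstable that will flush before %v", b.name, b.flushKey)
}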

One thing I don't understand is why we kept hitting range cache misses once we were in this regime. Perhaps that hints at something else going wrong as well. Or perhaps there was a gap in our range cache that kept causing us to skip range flushes but was also never filled during a size flush.

cc. @dt

Jira issue: CRDB-17573

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery labels Jul 12, 2022
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Jul 12, 2022
@dt dt assigned nvanbenschoten and unassigned dt Aug 19, 2022
craig bot pushed a commit that referenced this issue Sep 20, 2022
85859: pgwire: Add support for cursors with special characters r=knz,rafiss a=rimadeodhar

In CockroachDB and Postgres, it is possible to declare
cursors with special characters enclosed within double
quotes, e.g. "1-2-3". Currently, we store the name as an
unescaped string, which causes problems during the pgwire
DESCRIBE step when looking up the cursor. We should be
using the tree.Name datatype for the cursor name when
storing and looking up cursors. This PR updates the code
to start using tree.Name instead of raw strings for
handling cursor names. This fixes the issue where the
pgwire DESCRIBE step fails while attempting to look up
cursors with names containing special characters.

Resolves #84261

Release note (bug fix): The pgwire DESCRIBE step no longer
fails with an error while attempting to look up cursors
declared with names containing special characters.
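
For illustration, here is a minimal standalone sketch of the distinction the PR description above draws. The quoteIdent helper is hypothetical, standing in for tree.Name-style identifier formatting; it is not the PR's code.

package main

import (
	"fmt"
	"strings"
)

// quoteIdent is a hypothetical stand-in for tree.Name-style formatting:
// identifiers that are not plain lower-case words get double-quoted, with
// embedded quotes doubled, so a cursor named 1-2-3 round-trips as "1-2-3".
func quoteIdent(s string) string {
	for _, r := range s {
		if (r < 'a' || r > 'z') && r != '_' {
			return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
		}
	}
	return s
}

func main() {
	// Keying the cursor map on one canonical identifier form avoids the
	// DESCRIBE-time mismatch between escaped and unescaped names.
	fmt.Println(quoteIdent("1-2-3")) // "1-2-3"
	fmt.Println(quoteIdent("abc"))   // abc
}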

86492: bulk: perform meta lookup on range cache miss during index backfill r=dt a=nvanbenschoten

Fixes #84290.

This commit addresses the sustained slowdown described in #84290 by replacing
the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to
`RangeCache.Lookup`. The former method checks the cache but returns no range
descriptor if the cache currently has no overlapping key. This is possible if
the descriptor was recently evicted because it was stale. The latter method
performs a meta lookup if the cache currently has no overlapping key, so it
should always return _some_ range descriptor.

There's a question of whether we should be logging a warning but proceeding if
this meta lookup fails. For now, I've decided not to change that behavior.

Release justification: None. Don't merge yet.

87885: kv: remove broken attempt to reject lease acquisitions on draining nodes r=nvanbenschoten a=nvanbenschoten

Related to #83261.

This commit removes "protection" that avoided lease acquisitions on draining nodes. This protection had already been effectively disabled by acc1ad1, which allowed Raft leaders to bypass the check. As the comment here (added in 5ffaa9e) explained, Raft followers are already unable to acquire the lease. If leaders bypass the check and followers (and candidates) don't need it, the check is useless, so we remove it.

The commit also removes `TestReplicaDrainLease`, which was supposed to test this behavior. We remove the test not because it started failing after the change, but because it did not. It must not have been testing anything real anymore after acc1ad1.

Release justification: low risk change related to release blocker.

Release note: None

88205: kvserver: (partially) deflake transfer-leases/drain-other-node r=irfansharif a=irfansharif

In #85629 we changed our lease transfer protocol to only ever transfer expiration-based leases, and have recipients later upgrade them to the more efficient epoch-based ones. This was done to limit the effects of ill-advised lease transfers, since the incoming leaseholder would need to recognize itself as such within a few seconds -- so we wanted this upgrade to happen only after the lease had been received.

In #83261, however, we noticed that the upgrade was not immediate -- we were waiting until the current lease's expiration was within its renewal duration -- 4.5s. When the lease was eventually renewed the upgrade did happen, but it was not immediate. We fix this here and remove the manual clock advancement in the supporting test that had masked this issue. The test now demonstrates that we're no longer relying on upgrades happening as part of the (slow) renewal process.

Release note: None

Co-authored-by: rimadeodhar <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
@craig craig bot closed this as completed in 5a85950 Sep 21, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 21, 2022
Fixes #84290.

This commit addresses the sustained slowdown described in #84290 by replacing
the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to
`RangeCache.Lookup`. The former method checks the cache but returns no range
descriptor if the cache currently has no overlapping key. This is possible if
the descriptor was recently evicted because it was stale. The latter method
performs a meta lookup if the cache currently has no overlapping key, so it
should always return _some_ range descriptor.

There's a question of whether we should be logging a warning but proceeding if
this meta lookup fails. For now, I've decided not to change that behavior.

Release justification: None. Don't merge yet.
nvanbenschoten added a commit that referenced this issue Nov 16, 2022
Fixes #84290.

This commit addresses the sustained slowdown described in #84290 by replacing
the call in `SSTBatcher.flushIfNeeded` to `RangeCache.GetCached` with a call to
`RangeCache.Lookup`. The former method checks the cache but returns no range
descriptor if the cache currently has no overlapping key. This is possible if
the descriptor was recently evicted because it was stale. The latter method
performs a meta lookup if the cache currently has no overlapping key, so it
should always return _some_ range descriptor.

There's a question of whether we should be logging a warning but proceeding if
this meta lookup fails. For now, I've decided not to change that behavior.

Release justification: None. Don't merge yet.