kvserver: avoid or at least efficiently handle intent buildup #60585
Comments
FWIW I was playing around with

```sql
create table foo(v string);
insert into foo values('a'), ('b'), ('c'), ('d'), ('e');
set statement_timeout='50s';
insert into foo select * from foo; -- repeat until stmt_timeout hit
select now(), max(crdb_internal.range_stats(start_key)->'intent_count') from crdb_internal.ranges_no_leases;
```

The last insert leaves around ~600k intents, and despite hitting the statement timeout they clear out within a few minutes (at a couple thousand resolved a second).
Have we checked this metric? Also, the GC path is going to hit the semaphore limiter. Is it possible that the gc queue somehow isn't resolving intents because of this? cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go Lines 602 to 608 in 23263a3
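To make the semaphore concern above concrete, here is a minimal, self-contained sketch of the general pattern (hypothetical names and capacity, not the actual intent resolver code): async cleanup must grab a slot from a bounded semaphore, and when no slot is available the work falls back to the caller's context, so a short-lived caller can cause cleanup to be abandoned.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sem is a bounded semaphore limiting concurrent async cleanup tasks, loosely
// analogous to the limiter referenced above (the capacity here is made up).
var sem = make(chan struct{}, 2)

// cleanupIntentsAsync tries to run cleanup in a detached goroutine. If no
// semaphore slot is available, it falls back to running synchronously on the
// caller's context, so a short caller timeout can abort the cleanup entirely.
func cleanupIntentsAsync(ctx context.Context, cleanup func(context.Context) error) error {
	select {
	case sem <- struct{}{}:
		go func() {
			defer func() { <-sem }()
			// Detached from the caller: use a background-derived context.
			_ = cleanup(context.Background())
		}()
		return nil
	default:
		// Semaphore full: stay chained to the caller's (possibly short-lived) context.
		return cleanup(ctx)
	}
}

func main() {
	// Pretend two cleanup tasks are already in flight, exhausting the semaphore.
	sem <- struct{}{}
	sem <- struct{}{}

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	err := cleanupIntentsAsync(ctx, func(c context.Context) error {
		select {
		case <-time.After(time.Second): // resolving many intents is slow
			return nil
		case <-c.Done():
			return c.Err() // cleanup abandoned when the caller's context expires
		}
	})
	fmt.Println("cleanup result:", err) // context deadline exceeded
}
```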
Two Qs from your friendly SREs:
Relates to cockroachdb#60585. This commit updates the GC queue's `process` method to allow GC even if it determines that doing so would not advance the GC threshold. This check is retained in the queue's `shouldQueue` method, so this change does not impact automated queueing decisions. However, it does ensure that if someone manually enqueues a range in the GC queue with the `SkipShouldQueue` option, then the range is actually processed. This is important because the GC threshold is the first thing that the GC bumps when processing, meaning that the GC threshold alone is not a definitive indication that there is no more work to perform on a range. A processing pass that will not bump the threshold can still be useful, especially with the long context timeout associated with manual enqueuing. Release note (ui change): Manually enqueueing a range in the GC queue now properly respects the SkipShouldQueue option. This can be useful to force the GC of a specific Range.
I'd test whether we broke something in the cleanup path for dropped pgwire connections. I think a rollback for any ongoing txn is supposed to happen courtesy of this code, but maybe something rotted. cockroach/pkg/sql/conn_executor.go Line 826 in 4086c3e
Cleaning up a lot of abandoned locks can be expensive... My understanding is that most reads should be smart in cleaning up many locks at once (but notably the
Hmm. We'll be on the lookout for any prod issues that can be tied back to this issue. Thanks for the info!
This is the one that raises alarms for me. If there's no backpressure it's easy for things to get out of control (I believe #5219 was the one that introduced backpressure here and it was motivated by the real-world problem in #4925). That's especially true when network latency is high (which is the case for at least one customer experiencing this issue).
I think its roots are even older, in that same PR #5219. While it wasn't discussed as much in the linked issues, I don't think it was premature optimization; we were seeing thundering herd problems where many readers would try to push/resolve the same recently-committed transaction.
But at present, there are only two callers to this lock: the GC queue and the txn's own transaction record cleanup intent. The thundering herd wouldn't hit this path. The thundering herd goes here and has its own work avoidance (which uses the same kind of mechanism)
Either way, this certainly isn't making anything worse.
I did the same experiment last night and saw the same thing. The only way I was able to get abandoned intents was to restart the cluster while async intent resolution was underway. Given that standard async intent resolution does take minutes to clean up about 2 million intents and we've seen instances with many more abandoned intents, this may not be such a wild theory. We're very quick to roll nodes these days when other problems occur. So maybe we're just hitting this issue due to poorly timed restarts and #59704. Do we have an example instance of this issue that we could look at the historical timeseries data for to confirm or deny this? cc. @joshimhoff.

Regardless, we should also keep looking for other explanations. Something else that strikes me is that the GC queue hasn't been particularly helpful for these issues. Is this because the intents weren't 2 hours old, or because the GC queue is timing out before it can resolve the intents? If it's the latter, we should explore dynamically setting the GC queue's
In the most recent example I looked at, the GC queue ran to completion just fine, and did not consider any intents, meaning they must've been less than 2h old. But that customer also had just run a high-priority select, which either cancelled or pushed all of the intents, so it wasn't too surprising. What was surprising in that case was that there were still 50k+ intents on the range after the SELECT, which made me think that they were actively being created. This may not be the common case we're looking at here.
This makes sense to me - if we're really leaking intents mostly as a function of node restarts (tbd), then, and even independently of that, it makes sense that the GC queue would see it as its job to nuke them. We just have to make sure it's done in a way that avoids busy loops in pathological cases where a large number of (non-abandoned) intents is always present on the table. If GC is bad at removing lots of intents, the timeout func could help, or even a dynamic timeout that is based on making progress rather than a fixed timeout, which is hard to compute in this case.
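To illustrate the "dynamic timeout based on making progress" idea, a hedged sketch (made-up helper, not existing CockroachDB code): cancel the GC attempt only after a full idle window with no reported progress, instead of after a fixed wall-clock budget.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// runWithProgressTimeout cancels work only if no progress has been observed
// for idle, instead of imposing a fixed overall deadline. The work function
// reports progress by incrementing the counter (e.g. per batch of intents
// resolved). Names and thresholds here are illustrative.
func runWithProgressTimeout(parent context.Context, idle time.Duration,
	work func(ctx context.Context, progress *atomic.Int64) error) error {

	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	var progress atomic.Int64
	done := make(chan error, 1)
	go func() { done <- work(ctx, &progress) }()

	last := progress.Load()
	ticker := time.NewTicker(idle)
	defer ticker.Stop()
	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			cur := progress.Load()
			if cur == last {
				cancel() // no progress for a full idle window: give up
				return <-done
			}
			last = cur
		}
	}
}

func main() {
	err := runWithProgressTimeout(context.Background(), 100*time.Millisecond,
		func(ctx context.Context, progress *atomic.Int64) error {
			for i := 0; i < 5; i++ { // pretend to resolve 5 batches of intents
				select {
				case <-ctx.Done():
					return ctx.Err()
				case <-time.After(30 * time.Millisecond):
					progress.Add(1)
				}
			}
			return nil
		})
	fmt.Println("gc attempt:", err)
}
```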
FWIW I just did a test on a local roachprod cluster with

As a short-term mitigation, what do you think about me making
Yes. One of the CC free-tier clusters hit this issue, leading to issues with

If it helps, consider that we can backport additional observability into CC & quickly get it on all the 20.2 clusters. If we had metrics that tracked the number of intents in the "largest" ranges by number of intents, we could look actively for such clusters even ahead of impact (in order to help get a root cause).
On the free-tier cluster, no restarts for last 2-3 weeks, @nvanbenschoten: The (The little blip where there appears to be no nodes is likely to be a monitoring glitch.)
SGTM
60541: opt: inverted-index accelerate filters of the form j->'a' = '{"b": "c"}' r=angelazxu a=angelazxu

Previously, the optimizer did not plan inverted index scans for queries with the JSON fetch value operator `->` when an object or array was on the right side of the equality operator `=`, only when it was a boolean, string, number, or null. This change allows the inverted index to be used in these types of queries. It supports objects with multiple key/value pairs, nested objects, arrays, arrays nested within objects, and objects nested within arrays.

Fixes: #59605

Release note: None

60619: kv: allow manual GC that does not advance threshold r=nvanbenschoten a=nvanbenschoten

Relates to #60585. This commit updates the GC queue's `process` method to allow GC even if it determines that doing so would not advance the GC threshold. This check is retained in the queue's `shouldQueue` method, so this change does not impact automated queueing decisions. However, it does ensure that if someone manually enqueues a range in the GC queue with the `SkipShouldQueue` option, then the range is actually processed. This is important because the GC threshold is the first thing that the GC bumps when processing, meaning that the GC threshold alone is not a definitive indication that there is no more work to perform on a range. A processing pass that will not bump the threshold can still be useful, especially with the long context timeout associated with manual enqueuing.

Release note (ui change): Manually enqueueing a range in the GC queue now properly respects the SkipShouldQueue option. This can be useful to force the GC of a specific Range.

60638: sql: fix RENAME COLUMN for REGIONAL BY ROW r=ajstorm a=otan

We need to correct the column reference on the table descriptor if we are renaming the REGIONAL BY ROW column.

Resolves #59116

Release note: None

60698: vendor: bump pebble to 444296cf r=sumeerbhola a=sumeerbhola

444296cf Merge pull request #1066 from petermattis/pmattis/stability
59659388 update README with note about production readiness
a516e691 *: optimize SeekPrefixGE to use Next
327d2757 sstable: hoist blockSizes from testBytesIteratedWithCompression
69b09310 sstable: add benchmarks for zstd
cbe3a149 sstable: fix review comments
deb29d88 sstable: include zstd into the test cases
f13dea6f sstable: support zstd compression
73e8a3b0 vendor: added DataDog/zstd
ec81e4c4 sstable: add zstd-related constants

Release note: None

Co-authored-by: Angela Xu <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Relates to cockroachdb#60585. This commit updates the GC queue's `process` method to allow GC even if it determines that doing so would not advance the GC threshold. This check is retained in the queue's `shouldQueue` method, so this change does not impact automated queueing decisions. However, it does ensure that if someone manually enqueues a range in the GC queue with the `SkipShouldQueue` option, then the range is actually processed. This is important because the GC threshold is the first thing that the GC bumps when processing, meaning that the GC threshold alone is not a definitive indication that there is no more work to perform on a range. A processing pass that will not bump the threshold can still be useful, especially with the long context timeout associated with manual enqueuing. Release note (ui change): Manually enqueueing a range in the GC queue now properly respects the SkipShouldQueue option. This can be useful to force the GC of a specific Range.
I think I have a plausible explanation for the intent buildup. As mentioned earlier, when we roll back a txn and the client context is cancelled, we give the final Lines 756 to 758 in 0eba39a
But we didn't really think this should affect cleanup, because we kick off async tasks that are disconnected from the client's context as long as we haven't hit the async task quota: cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go Lines 418 to 422 in a4b7a43
But of course, if the 3-second context expires before it gets to the point where it can kick off the async task, then it isn't going to clean up anything at all since it gets aborted here: cockroach/pkg/kv/kvserver/replica_write.go Lines 291 to 297 in c81a2a3
I initially considered this fairly edge-casey, because I thought this only applied to clients submitting a

This means that every time a client disconnects during a txn or in-flight DML statement, the

This is kind of what #64770 is getting at, but a bit broader. I'll submit a PR tomorrow that tries harder to clean up in the face of client context cancellation.
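A heavily simplified sketch of the failure mode described in this comment (all names and the specific delays are illustrative, not the real kvserver code): the rollback runs under a short client-derived timeout, and intent cleanup survives only if the hand-off to a background-context async task happens before that timeout fires and before the async task quota is exhausted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// rollbackTimeout mirrors the short budget given to the final rollback once
// the client has already disconnected (3s in the scenario above).
const rollbackTimeout = 3 * time.Second

var errQuotaExceeded = errors.New("async task quota exceeded")

// tryAsyncCleanup models the hand-off: if an async slot is available, the
// cleanup continues on a background-derived context and survives client
// cancellation; otherwise it stays chained to the caller's context.
func tryAsyncCleanup(ctx context.Context, haveQuota bool, cleanup func(context.Context)) error {
	if haveQuota {
		go cleanup(context.Background()) // detached: survives ctx expiry
		return nil
	}
	cleanup(ctx) // synchronous: dies with the short rollback context
	return errQuotaExceeded
}

func rollbackAbandonedTxn(haveQuota bool, raftDelay time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), rollbackTimeout)
	defer cancel()

	// The EndTxn(commit=false) itself has to get through evaluation and Raft
	// before we even reach the cleanup hand-off.
	select {
	case <-time.After(raftDelay):
	case <-ctx.Done():
		fmt.Println("rollback aborted before cleanup could start:", ctx.Err())
		return
	}
	err := tryAsyncCleanup(ctx, haveQuota, func(c context.Context) {
		// Intent resolution would happen here, bounded by c.
		fmt.Println("cleaning up intents, ctx err:", c.Err())
	})
	fmt.Println("hand-off result:", err)
}

func main() {
	rollbackAbandonedTxn(true, 10*time.Millisecond) // happy path: detached cleanup
	rollbackAbandonedTxn(true, 4*time.Second)       // EndTxn too slow: no cleanup at all
	time.Sleep(100 * time.Millisecond)
}
```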
I think I found the reason why we're leaking txn records as well: cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go Lines 757 to 762 in a4b7a43
We spin off a separate async task to clean up the txn record, but use the surrounding context. In cases where intent resolution is synchronous, and thus is connected to the caller's context, the call can return before
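A minimal sketch of that race (made-up names, not the real `IntentResolver`): the async transaction-record GC inherits the surrounding context, so when intent resolution is synchronous and the caller returns and cancels that context, the record GC is cancelled with it; detaching it onto a background-derived context with its own timeout avoids this.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// gcTxnRecordAsync models spinning off removal of the txn record. With
// detach=false it inherits ctx (the racy behavior); with detach=true it uses
// its own background-derived context with a timeout (the fix).
func gcTxnRecordAsync(ctx context.Context, detach bool, done chan<- error) {
	if detach {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(context.Background(), 20*time.Second)
		_ = cancel // a real implementation would cancel when the task finishes
	}
	go func() {
		select {
		case <-time.After(50 * time.Millisecond): // pretend to delete the record
			done <- nil
		case <-ctx.Done():
			done <- ctx.Err()
		}
	}()
}

func cleanupTxnIntentsSync(detach bool) error {
	// Synchronous intent resolution is tied to the caller's context...
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	gcTxnRecordAsync(ctx, detach, done)
	// ...and the caller returns (and cancels) before the record GC finishes.
	cancel()
	return <-done
}

func main() {
	fmt.Println("inherited ctx:", cleanupTxnIntentsSync(false)) // context canceled
	fmt.Println("detached ctx: ", cleanupTxnIntentsSync(true))  // <nil>
}
```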
Another thing to add here is that we don't try very hard to use the

There's probably something we could do better here. We can't rely on a finalized transaction record to accurately reflect all of the intents that the transaction wrote because the transaction may have expired without ever being finalized cleanly. However, in cases where a transaction is finalized cleanly and the record includes its intent spans, we could do more. For instance,

Of course, there are some practical considerations here. For instance, we'd want to avoid thundering herds of cleanup for the same transaction immediately after it is finalized. We may also want to avoid redundant work as best as we can in cases where a transaction's intents are partially cleaned up - we don't currently checkpoint cleanup progress anywhere. There's also some concern about GC-ing the record of a possibly active transaction, as is discussed in #53146.
I think this is probably the right approach in general. When we're running some other operation, and happen to run into intents, we want to keep latency as low as possible and just deal with the intents that we have to.
We might want a separate GC queue for this rather than doing best-effort ad hoc cleanup. Since the txn records are stored in the range-local key space we can presumably scan them efficiently and clean up any finalized txn records. How to finalize abandoned txns is a separate concern (haven't looked into how this is currently done yet).
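A rough sketch of what such a queue's core loop might look like, under heavy assumptions (the types below are stand-ins, not the real kvserver API): scan the range-local transaction records, resolve the recorded lock spans of finalized records, then delete the records, leaving still-pending (possibly abandoned) transactions to a separate push/finalization path.

```go
package main

import "fmt"

// Stand-in types; real code would use roachpb transaction records and spans.
type txnStatus int

const (
	pending txnStatus = iota
	committed
	aborted
)

type txnRecord struct {
	id        string
	status    txnStatus
	lockSpans []string // key spans the txn recorded at commit/abort time
}

// gcFinalizedTxnRecords is the core loop a dedicated queue could run per
// range: resolve intents for finalized records, then remove the records.
// Abandoned (still-pending) records would need to be pushed/finalized first,
// which is the separate concern mentioned above.
func gcFinalizedTxnRecords(records []txnRecord,
	resolve func(txnID string, spans []string) error,
	deleteRecord func(txnID string) error) error {

	for _, rec := range records {
		if rec.status == pending {
			continue // not finalized; leave for the push/abandonment path
		}
		if err := resolve(rec.id, rec.lockSpans); err != nil {
			return err
		}
		if err := deleteRecord(rec.id); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	recs := []txnRecord{
		{id: "a", status: committed, lockSpans: []string{"/Table/52/1/1", "/Table/52/1/2"}},
		{id: "b", status: pending},
	}
	_ = gcFinalizedTxnRecords(recs,
		func(id string, spans []string) error { fmt.Println("resolve", id, spans); return nil },
		func(id string) error { fmt.Println("gc record", id); return nil },
	)
}
```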
64869: kvserver: improve intent cleanup for disconnected clients r=nvanbenschoten a=erikgrinaker

### kvserver: improve intent cleanup for disconnected clients

Whenever a client disconnects during an open transaction or in-flight DML statement, the server-side session will asynchronously clean up the transaction record and intents by rolling back the transaction. However, this had a very short timeout of 3 seconds, and even though the actual `IntentResolver.CleanupTxnIntentsAsync()` cleanup task runs asynchronously and independently of the client's context in the typical case, the short timeout could cause transaction cleanup to be cancelled if the `EndTxn` request took more than 3 seconds to get all the way through Raft execution or if the async task limit was exceeded such that the cleanup task kept using the client's context. This in turn could lead to intents building up over time.

This patch increases the timeout when rolling back transactions for disconnected clients to 1 minute, and also tries to perform transaction cleanup when a client disconnects while waiting for an `EndTxn` command to go through Raft.

Resolves #64770, touches #60585.

Release note (bug fix): improved transaction cleanup for disconnected clients, to reduce intent buildup.

### kvserver: fix race condition during synchronous txn record cleanup

Transaction records and intents are normally cleaned up asynchronously via `IntentResolver.CleanupTxnIntentsAsync()`, separately from the client's context. When the async task limit is exceeded, cleanup instead becomes synchronous and attached to the client context. However, the final `gcTxnRecord()` call to remove the transaction record is asynchronous even when intent cleanup is synchronous, to avoid holding onto the intent resolver task slot. This call will typically return to the caller before `gcTxnRecord()` completes, which may cause the caller to cancel the context (either via `defer cancel()` or a client disconnection) and in turn cancel the `gcTxnRecord()` call.

This patch gives the async `gcTxnRecord()` call a separate background context that's independent of the client's context even in the synchronous case, with a 20 second timeout to avoid goroutine leaks.

Resolves #64868, touches #60585.

Release note (bug fix): Fixed a race condition during transaction cleanup that could leave old transaction records behind until MVCC garbage collection.

/cc @cockroachdb/kv

Co-authored-by: Erik Grinaker <[email protected]>
I believe we may have plugged the main intent leak with #65592, which is being merged to

Also, we've added a section to the
65001: kvserver: make GC intent scoring more aggressive r=tbg a=erikgrinaker

### keys: rename RangeLastGCKey to RangeGCThresholdKey

The key `RangeLastGCKey` was not actually used to store the last GC timestamp, but rather the last GC threshold. This commit renames the key to the more appropriate `RangeGCThresholdKey`.

Release note: None

### kvserver: increase GC queue timeout

The MVCC GC queue timeout was left at the default of 1 minute. Since the `GCThreshold` is advanced and persisted at the very start of the GC run (to coordinate with other processes such as exports and rangefeeds before removing any data), this means that if a GC run takes more than 1 minute to complete it will be aborted due to the timeout and may not run again for some time. This patch increases the timeout to 10 minutes. The primary purpose of the timeout is to avoid head-of-line blocking in the case of unavailable ranges, and a larger timeout still satisfies this property while also allowing more expensive cleanup attempts to complete.

Release note (ops change): increased the timeout for range MVCC garbage collection from 1 minute to 10 minutes, to allow larger jobs to run to completion.

### kvserver: use PUSH_TOUCH during GC to push intent owners

Using `PUSH_ABORT` to push intent owners during intent GC can cause it to block on (or possibly abort) long-running transactions. Since we would prefer GC to continue with other intents rather than waiting for active transactions, this patch changes the GC queue to use `PUSH_TOUCH` instead. This will still clean up abandoned txns.

Resolves #65777.

Release note (bug fix): Intent garbage collection no longer waits for or aborts running transactions.

### kvserver: add GC queue cooldown timer

When the GC queue considered replicas for MVCC GC, it primarily considered the previous GC threshold and the amount of MVCC garbage that would be removed by advancing the threshold. We would like to also trigger GC based solely on intent stats. However, we can't know whether we'll actually be able to clean up any intents, since they may belong to an in-progress long-running transaction. There is therefore a risk that we will keep spinning on GC attempts due to high intent counts. This patch records the last GC queue processing timestamp, and adds a cooldown timer of 2 hours between each range GC attempt triggered by intent score alone to prevent spinning on GC attempts.

Release note: None

### kvserver: make GC intent scoring more aggressive

Users often experience buildup of intents due to the GC queue not being aggressive enough in cleaning them up. Currently, the GC queue will only trigger based on intents if the average intent age is 10 days and if there is also MVCC garbage that can be cleaned up. This patch makes the intent scoring more aggressive, triggering GC when the average intent age is 8 hours regardless of other MVCC garbage. The previous commit added a cooldown timer to prevent the GC queue from spinning on a replica if the intents couldn't be cleaned up (e.g. because they belong to an in-progress long-running transaction).

Resolves #64266. Touches #60585.

Release note (ops change): trigger MVCC and intent garbage collection when the average intent age is 8 hours, down from 10 days.

/cc @cockroachdb/kv

Co-authored-by: Erik Grinaker <[email protected]>
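To make the new scoring concrete, a hedged sketch of the decision described above (thresholds taken from the PR text, everything else simplified and hypothetical): trigger GC on intent stats alone once the average intent age exceeds 8 hours, but only if at least 2 hours have passed since the last intent-triggered attempt.

```go
package main

import (
	"fmt"
	"time"
)

const (
	intentAgeThreshold = 8 * time.Hour // avg intent age that triggers GC (was 10 days)
	intentCooldown     = 2 * time.Hour // minimum gap between intent-triggered GC attempts
)

// shouldQueueForIntents reflects the scoring change described above in a
// simplified form: trigger on intent age alone, guarded by a cooldown so we
// don't spin on ranges whose intents belong to long-running transactions.
func shouldQueueForIntents(avgIntentAge time.Duration, lastIntentGC, now time.Time) bool {
	if avgIntentAge < intentAgeThreshold {
		return false
	}
	return now.Sub(lastIntentGC) >= intentCooldown
}

func main() {
	now := time.Now()
	fmt.Println(shouldQueueForIntents(12*time.Hour, now.Add(-3*time.Hour), now))    // true
	fmt.Println(shouldQueueForIntents(12*time.Hour, now.Add(-30*time.Minute), now)) // false: cooling down
	fmt.Println(shouldQueueForIntents(time.Hour, time.Time{}, now))                 // false: intents too young
}
```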
We've now resolved the known issues around intent buildup and mitigation. The most important mitigation (better intent handling during backup export requests in #64131) has been backported, but the other changes carry more risk and we've decided not to backport these for now -- they are thus only available on
YAY! Thanks, Erik!
…uests

Fixes cockroachdb#50390. Related to cockroachdb#60585. Related to cockroachdb#103126. Closes cockroachdb#64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (defaultIntentResolutionBatchWait) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.

```
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN;
INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH;
SELECT count(*) FROM keys;
```

Without this change, the scan took 70.078s. With this change, the scan took 15.958s. This 78% speed up checks out. Each encountered intent is resolved serially (see cockroachdb#103126), so the per-intent latency drops from 7ms to 1.6ms. This improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after cockroachdb#102014, this is still not quite enough to resolve cockroachdb#103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in cockroachdb#49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in cockroachdb#49218, because the intents are each written by different transactions. The speedup for this scenario was presented in cockroachdb#64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.
103265: kv: resolve conflicting intents immediately for latency-sensitive requests r=nvanbenschoten a=nvanbenschoten

Fixes #50390. Related to #60585. Related to #103126. Closes #64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (`defaultIntentResolutionBatchWait`) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.

```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN;
INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH;
SELECT count(*) FROM keys;
```

Without this change, the scan took **70.078s**. With this change, the scan took **15.958s**. This **78%** speed-up checks out. Each encountered intent is resolved serially (see #103126), so the **per-intent latency** drops from **7ms** to **1.6ms**. This improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after #102014, this is still not quite enough to resolve #103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in #49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in #49218, because the intents are each written by different transactions. The speedup for this scenario was presented in #64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.
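A toy sketch of the trade-off the `sendImmediately` option addresses (simplified, not the real `IntentResolver`/`RequestBatcher` API): background callers enqueue resolution requests into a batcher that can wait up to ~10ms for more work, while a foreground waiter that sets the flag skips that wait entirely. The ~5ms per-intent improvement quoted above is roughly that batching delay no longer being paid by each serially resolved intent.

```go
package main

import (
	"fmt"
	"time"
)

const batchWait = 10 * time.Millisecond // cf. defaultIntentResolutionBatchWait in the text

type resolveOptions struct {
	sendImmediately bool // trade throughput for latency (foreground waiters)
}

// resolveIntent models the two paths: immediate send vs. waiting for a batch
// to fill up (here simplified to always waiting the full batchWait).
func resolveIntent(key string, opts resolveOptions, send func(keys []string)) time.Duration {
	start := time.Now()
	if opts.sendImmediately {
		send([]string{key})
	} else {
		time.Sleep(batchWait) // batching-induced delay before the request goes out
		send([]string{key})
	}
	return time.Since(start)
}

func main() {
	send := func(keys []string) { time.Sleep(2 * time.Millisecond) } // pretend RPC latency
	fmt.Println("batched:  ", resolveIntent("/Table/60/1/1", resolveOptions{}, send))
	fmt.Println("immediate:", resolveIntent("/Table/60/1/1", resolveOptions{sendImmediately: true}, send))
}
```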
103362: sql: validate primary / secondary region localities at end of txn r=fqazi a=fqazi

Previously, if a database was restored with skip_localities, there was no way to modify this database to set the primary region since validation would kick in too early during the statement. This meant fixing the regions in a restored database was impossible if the primary region was no longer valid. To address this, this patch delays locality validation till the end of the transaction.

Fixes: #103290

Release note (bug fix): SET PRIMARY REGION and SET SECONDARY REGION did not validate transactionally, which could prevent cleaning up removed regions after a restore.

103373: concurrency: small refactors in preparation for reservation removal r=nvanbenschoten a=arulajmani

See individual commits for details.

Informs: #103361

103538: go.mod: bump Pebble to 6f2788660198, rework shared storage wrapper r=RaduBerinde a=RaduBerinde

6f278866 shared: improve interface for more efficient reading
9eb2c407 db: log events to testing.T in unit tests
f32e7dc6 db: add reserved Pebblev4 sstable format
5a6b91b8 objstorage: improve test and add read ahead test
2bc4319e objstorage: remove genericFileReadable
8143ffb9 objstorage: fix readaheadState initialization
06d08888 db: add Compact.Duration metric
e7213de0 db: add Uptime metric
e9005aed db: don't delete files during ingest application
222b43ec internal/arenaskl: fix Skiplist doc typo

Release note: None

Epic: none

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Describe the problem
We have seen numerous instances in which large numbers of intents (hundreds of thousands and more) were able to build up on a range. While this is always theoretically possible - intent cleanup is async and best-effort, so an ill-timed crash can temporarily leave the intents behind - it appears that this is happening in scenarios where it is unexpected.
See for example (internal link): https://github.com/cockroachlabs/support/issues/817
The issue often leads to stuck backups because of #59704 (which we should independently fix).
To Reproduce
Reproducing these issues is part of the problem. So far we only strongly suspect that there are issues, but there is no concrete clue.
Expected behavior
Intents should only be able to build up when
Additional data / screenshots
Recap of the intent resolution behavior:
Intents are generally resolved when EndTxn(commit=true) has finished applying. We will then hit this code path:
cockroach/pkg/kv/kvserver/replica_write.go
Lines 188 to 192 in 218a5a3
which calls into this:
cockroach/pkg/kv/kvserver/intentresolver/intent_resolver.go
Lines 545 to 559 in 23263a3
One possibility is that the intent resolver erroneously skips large batches of txn intents because it is already tracking some smaller amount of work on behalf of the same txn, in which case we'd somehow end up in the `!locked` code path. In practice, this seems very unlikely, since we only invoke this method from the GC queue (which doesn't touch txns younger than 1h) and the `EndTxn(commit=true)` proposal result itself, of which there is only ever one. This raises the question why `lockInflightTxnCleanup` even exists. I think it may either be premature optimization or related to the GC queue code path. The code hasn't changed much since 2017, i.e. it is very old.

I did check that the cleanup intents are properly scoped to a new background-derived context, i.e. we're not accidentally tying these cleanup attempts to the short-lived caller's context. We're also not introducing an overly aggressive timeout; to the best of my knowledge there isn't a timeout in this particular path at all (?), as `runAsyncTask` derives from `context.Background()` and we don't add a timeout anywhere.

Also, isn't this code wrong:

cockroach/pkg/internal/client/requestbatcher/batcher.go
Lines 355 to 366 in 8c5253b

It seems that it would erase `ba.sendDeadline` if `len(ba.Reqs) > 0`, `ba.sendDeadline > 0`, and `!rHasDeadline`. However, if anything, this would swallow a context cancellation rather than cause a premature one.

Another thing worth noting is that parallel commits prevents clients from seeing the backpressure. When `EndTxn(commit=true)` runs, it may feel backpressure. However, that request in parallel commits is completely disjoint from the SQL client as it is fired-and-forgotten here:

cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go
Lines 431 to 455 in ab0b87a
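To illustrate the last point, a toy sketch (hypothetical names, not the txn interceptor code): with parallel commits, the `EndTxn(commit=true)` that triggers intent resolution is fired asynchronously on a background-derived context, so any backpressure it feels never reaches the SQL client, which has already been acknowledged.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// makeTxnCommitExplicit models the fire-and-forget step: the client's commit
// has already been acknowledged, and the follow-up EndTxn (which triggers
// intent resolution and may feel backpressure) runs detached from the client.
func makeTxnCommitExplicit(endTxn func(context.Context) error) {
	go func() {
		ctx := context.Background() // disjoint from the SQL client's context
		if err := endTxn(ctx); err != nil {
			fmt.Println("async explicit commit failed:", err)
		}
	}()
}

func main() {
	start := time.Now()
	makeTxnCommitExplicit(func(ctx context.Context) error {
		time.Sleep(200 * time.Millisecond) // backpressure during intent resolution
		return nil
	})
	// The client observes the commit immediately, regardless of the async work.
	fmt.Println("client sees commit after", time.Since(start).Round(time.Millisecond))
	time.Sleep(300 * time.Millisecond)
}
```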
Environment:
Additional context
Once we have the separated lock table, it might be much cheaper to be reactive to the problem.
Epic: CRDB-2554