
kvserver: avoid or at least efficiently handle intent buildup #60585

Closed
tbg opened this issue Feb 15, 2021 · 63 comments
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. A-kv-recovery C-investigation Further steps needed to qualify. C-label will change. N-followup Needs followup. O-postmortem Originated from a Postmortem action item. T-kv KV Team

Comments

@tbg
Member

tbg commented Feb 15, 2021

Describe the problem

We have received numerous reports of cases in which large numbers of intents (hundreds of thousands or more) were able to build up on a range. While this is always theoretically possible - intent cleanup is async and best-effort, so an ill-timed crash can temporarily leave intents behind - it appears to be happening in scenarios where it is unexpected.

See for example (internal link): https://github.com/cockroachlabs/support/issues/817

The issue often leads to stuck backups because of #59704 (which we should independently fix).

To Reproduce

Reproducing these issues is part of the problem. So far we only strongly suspect that something is wrong, but we have no concrete lead.

Expected behavior
Intents should only be able to build up when intent cleanup is legitimately interrupted (e.g. by an ill-timed crash), not in routine operation.

Additional data / screenshots
Recap of the intent resolution behavior:

Intents are generally resolved when EndTxn(commit=true) has finished applying. We will then hit this code path:

```go
if err := r.store.intentResolver.CleanupTxnIntentsAsync(
	ctx, r.RangeID, propResult.EndTxns, true, /* allowSync */
); err != nil {
	log.Warningf(ctx, "%v", err)
}
```

which calls into this:

```go
for i := range endTxns {
	et := &endTxns[i] // copy for goroutine
	if err := ir.runAsyncTask(ctx, allowSyncProcessing, func(ctx context.Context) {
		locked, release := ir.lockInFlightTxnCleanup(ctx, et.Txn.ID)
		if !locked {
			return
		}
		defer release()
		if err := ir.cleanupFinishedTxnIntents(
			ctx, rangeID, et.Txn, et.Poison, nil, /* onComplete */
		); err != nil {
			if ir.every.ShouldLog() {
				log.Warningf(ctx, "failed to cleanup transaction intents: %v", err)
			}
		}
```
One possibility is that the intent resolver erroneously skips large batches of txn intents because it is already tracking some smaller amount of work on behalf of the same txn, in which case we'd somehow end up in the !locked code path. In practice, this seems very unlikely, since we only invoke this method from the GC queue (which doesn't touch txns younger than 1h) and the EndTxn(commit=true) proposal result itself, of which there is only ever one.

This raises the question of why lockInFlightTxnCleanup even exists. I think it may either be a premature optimization or related to the GC queue code path. The code hasn't changed much since 2017, i.e. it is quite old.
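
For context, a minimal sketch of the kind of per-txn work avoidance that lockInFlightTxnCleanup appears to implement (stand-in types and names, not the actual implementation): a cleanup attempt for a txn that already has one in flight bails out rather than duplicating the work.

```go
package sketch

import "sync"

// inFlightCleanups tracks txn IDs that currently have a cleanup attempt in
// flight, so that a second attempt for the same txn can bail out early.
type inFlightCleanups struct {
	mu  sync.Mutex
	ids map[string]struct{}
}

// tryLock registers a cleanup attempt for txnID. It returns (false, nil) if
// one is already in flight; otherwise it returns true and a release func.
func (f *inFlightCleanups) tryLock(txnID string) (bool, func()) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.ids == nil {
		f.ids = make(map[string]struct{})
	}
	if _, busy := f.ids[txnID]; busy {
		return false, nil
	}
	f.ids[txnID] = struct{}{}
	release := func() {
		f.mu.Lock()
		defer f.mu.Unlock()
		delete(f.ids, txnID)
	}
	return true, release
}
```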

I did check that the cleanup tasks are properly scoped to a new background-derived context, i.e. we're not accidentally tying these cleanup attempts to the short-lived caller's context. We're also not introducing an overly aggressive timeout; to the best of my knowledge there isn't a timeout at all in this particular path (?), since runAsyncTask derives from context.Background() and we don't add a timeout anywhere.

Also, isn't this code wrong:

```go
// Update the deadline for the batch if this request's deadline is later
// than the current latest.
rDeadline, rHasDeadline := r.ctx.Deadline()
// If this is the first request or
if len(ba.reqs) == 0 ||
	// there are already requests and there is a deadline and
	(len(ba.reqs) > 0 && !ba.sendDeadline.IsZero() &&
		// this request either doesn't have a deadline or has a later deadline,
		(!rHasDeadline || rDeadline.After(ba.sendDeadline))) {
	// set the deadline to this request's deadline.
	ba.sendDeadline = rDeadline
}
```

It seems that it would erase ba.sendDeadline if len(ba.reqs) > 0, ba.sendDeadline is non-zero, and !rHasDeadline, since the whole condition is then satisfied and ba.sendDeadline is overwritten with the zero rDeadline. However, if anything, this would swallow a context cancellation rather than cause a premature one.
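
For reference, a minimal sketch of a deadline merge with the semantics the snippet appears to intend, where a zero time means "no deadline" (illustrative names, not the actual code):

```go
package sketch

import "time"

// widenSendDeadline merges a new request's context deadline into the batch's
// send deadline. A zero time means "no deadline". The batch must not fire
// before the request with the latest deadline would allow, so the result is
// the later of the two deadlines, and unbounded if either side is unbounded.
func widenSendDeadline(
	batchDeadline time.Time, batchEmpty bool, reqDeadline time.Time, reqHasDeadline bool,
) time.Time {
	switch {
	case !reqHasDeadline:
		// The new request can wait forever, so the batch must too.
		return time.Time{}
	case batchEmpty:
		// First request: it alone determines the deadline.
		return reqDeadline
	case batchDeadline.IsZero():
		// The batch is already unbounded; keep it that way.
		return time.Time{}
	case reqDeadline.After(batchDeadline):
		return reqDeadline
	default:
		return batchDeadline
	}
}
```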

Another thing worth noting is that parallel commits prevent clients from seeing the backpressure. When EndTxn(commit=true) runs, it may feel backpressure. However, under parallel commits that request is completely disjoint from the SQL client, as it is fire-and-forget here:

```go
// makeTxnCommitExplicitAsync launches an async task that attempts to move the
// transaction from implicitly committed (STAGING status with all intents
// written) to explicitly committed (COMMITTED status). It does so by sending a
// second EndTxnRequest, this time with no InFlightWrites attached.
func (tc *txnCommitter) makeTxnCommitExplicitAsync(
	ctx context.Context, txn *roachpb.Transaction, lockSpans []roachpb.Span, canFwdRTS bool,
) {
	// TODO(nvanbenschoten): consider adding tracing for this request.
	// TODO(nvanbenschoten): add a timeout to this request.
	// TODO(nvanbenschoten): consider making this semi-synchronous to
	// backpressure client writes when these start to slow down. This
	// would be similar to what we do for intent resolution.
	log.VEventf(ctx, 2, "making txn commit explicit: %s", txn)
	if err := tc.stopper.RunAsyncTask(
		context.Background(), "txnCommitter: making txn commit explicit", func(ctx context.Context) {
			tc.mu.Lock()
			defer tc.mu.Unlock()
			if err := makeTxnCommitExplicitLocked(ctx, tc.wrapped, txn, lockSpans, canFwdRTS); err != nil {
				log.Errorf(ctx, "making txn commit explicit failed for %s: %v", txn, err)
			}
		},
	); err != nil {
		log.VErrEventf(ctx, 1, "failed to make txn commit explicit: %v", err)
	}
}
```

Environment:

  • CockroachDB 20.1.8 and above (probably also below)

Additional context
Once we have the separated lock table, it might be much cheaper to be reactive to the problem.

Epic: CRDB-2554

@tbg tbg added A-kv Anything in KV that doesn't belong in a more specific category. C-investigation Further steps needed to qualify. C-label will change. labels Feb 15, 2021
@tbg
Member Author

tbg commented Feb 15, 2021

FWIW I was playing around with

create table foo(v string);
insert into foo values('a'), ('b'), ('c'), ('d'), ('e');
set statement_timeout='50s';
insert into foo select * from foo; -- repeat until stmt_timeout hit
select now(), max(crdb_internal.range_stats(start_key)->'intent_count') from crdb_internal.ranges_no_leases;

The last insert leaves around 600k intents, and despite hitting the statement timeout they clear out within a few minutes (at a couple of thousand resolved per second).

@ajwerner
Contributor

ir.Metrics.IntentResolverAsyncThrottled.Inc(1)

Have we checked this metric? Also, the GC path is going to hit the semaphore limiter. Is it possible that the GC queue somehow isn't resolving intents because of this?

```go
return ir.stopper.RunLimitedAsyncTask(
	// If we've successfully launched a background task,
	// dissociate this work from our caller's context and
	// timeout.
	ir.ambientCtx.AnnotateCtx(context.Background()),
	"processing txn intents",
	ir.sem,
```

@joshimhoff
Collaborator

Two Qs from your friendly SREs:

  1. Have we seen impact worse than BACKUP failures caused by this issue so far, maybe in on-prem? Have we seen DML unavailability or similar?

  2. If not, can we imagine this problem causing DML unavailability or similar?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 16, 2021
Relates to cockroachdb#60585.

This commit updates the GC queue's `process` method to allow GC even if
it determines that doing so would not advance the GC threshold. This
check is retained in the queue's `shouldQueue` method, so this change
does not impact automated queueing decisions. However, it does ensure
that if someone manually enqueues a range in the GC queue with the
`SkipShouldQueue` option, then the range is actually processed. This is
important because the GC threshold is the first thing that the
GC bumps when processing, meaning that the GC threshold alone is not a
definitive indication that there is no more work to perform on a range.
A processing pass that will not bump the threshold can still be useful,
especially with the long context timeout associated with manual
enqueuing.

Release note (ui change): Manually enqueueing a range in the GC queue
now properly respects the SkipShouldQueue option. This can be useful to
force the GC of a specific Range.
@andreimatei
Contributor

I'd test whether we broke something in the cleanup path for dropped pgwire connections. I think a rollback for any ongoing txn is supposed to happen courtesy of this code, but maybe something has rotted.

if err := ex.machine.ApplyWithPayload(ctx, ev, payload); err != nil {

If not, can we imagine this problem causing DML unavailability or similar?

Cleaning up a lot of abandoned locks can be expensive... My understanding is that most reads should be smart in cleaning up many locks at once (but notably the ExportRequests done by backups I understand are dumb and do it one by one), but I guess my answer to the question about whether this can cause "unavailability" is that I'd expect it to cause big problems, yeah.

@joshimhoff
Collaborator

Hmm. We'll be on the lookout for any prod issues that can be tied back to this issue. Thanks for the info!

@bdarnell
Contributor

Another thing worth noting is that parallel commits prevents clients from seeing the backpressure. When EndTxn(commit=true) runs, it may feel backpressure. However, that request in parallel commits is completely disjoint from the SQL client as it is fired-and-forgotten here:

This is the one that raises alarms for me. If there's no backpressure it's easy for things to get out of control (I believe #5219 was the one that introduced backpressure here and it was motivated by the real-world problem in #4925). That's especially true when network latency is high (which is the case for at least one customer experiencing this issue).

This raises the question why lockInflightTxnCleanup even exists. I think it may either be premature optimization or related to the GC queue code path. The code hasn't changed much since 2017, i.e. it is very old.

I think its roots are even older, in that same PR #5219. While it wasn't discussed as much in the linked issues, I don't think it was premature optimization; we were seeing thundering herd problems where many readers would try to push/resolve the same recently-committed transaction.

@tbg
Member Author

tbg commented Feb 17, 2021

I think its roots are even older, in that same PR #5219. While it wasn't discussed as much in the linked issues, I don't think it was premature optimization; we were seeing thundering herd problems where many readers would try to push/resolve the same recently-committed transaction.

But at present, there are only two callers to this lock: the GC queue and the txn's own transaction record cleanup attempt. The thundering herd wouldn't hit this path; it goes here instead and has its own work avoidance (which uses the same kind of mechanism):

pushedTxns, pErr := ir.MaybePushTransactions(ctx, pushTxns, h, pushType, skipIfInFlight)

Either way, this certainly isn't making anything worse.

@tbg
Member Author

tbg commented Feb 17, 2021

I'd test whether we broke something in the cleanup path for dropped pgwire connections. I think a rollback for any ongoing txn is supposed to happen courtesy of this code, but maybe something has rotted.

I did ./cockroach demo, connected in a separate shell, ran a large insert and CTRL-C'ed it half-way through. The intent count dropped to zero (gradually). I did the same with a kill -9 of the SQL cli and the behavior was the same.

@nvanbenschoten
Member

I did ./cockroach demo, connected in a separate shell, ran a large insert and CTRL-C'ed it half-way through. The intent count dropped to zero (gradually). I did the same with a kill -9 of the SQL cli and the behavior was the same.

I did the same experiment last night and saw the same thing. The only way I was able to get abandoned intents was to restart the cluster while async intent resolution was underway. Given that standard async intent resolution does take minutes to clean up about 2 million intents and we've seen instances with many more abandoned intents, this may not be such a wild theory. We're very quick to roll nodes these days when other problems occur. So maybe we're just hitting this issue due to poorly timed restarts and #59704. Do we have an example instance of this issue that we could look at the historical timeseries data for to confirm or deny this? cc. @joshimhoff.

Regardless, we should also keep looking for other explanations.

Something else that strikes me is that the GC queue hasn't been particularly helpful for these issues. Is this because the intents weren't 2 hours old, or because the GC queue is timing out before it can resolve the intents? If it's the latter, we should explore dynamically setting the GC queue's processTimeoutFunc.

@tbg
Member Author

tbg commented Feb 17, 2021

In the most recent example I looked at, the GC queue ran to completion just fine, and did not consider any intents, meaning they must've been less than 2h old. But that customer also had just run a high-priority SELECT, which either cancelled or pushed all of the intents, so it wasn't too surprising. What was surprising in that case was that there were still 50k+ intents on the range after the SELECT, which made me think that they were actively being created. This may not be the common case we're looking at here.

@tbg
Member Author

tbg commented Feb 17, 2021

Something else that strikes me is that the GC queue hasn't been particularly helpful for these issues. Is this because the intents weren't 2 hours old, or because the GC queue is timing out before it can resolve the intents? If it's the latter, we should explore dynamically setting the GC queue's processTimeoutFunc.

This makes sense to me - if we're really leaking intents mostly as a function of node restarts (TBD), then - and even independently of that - it makes sense for the GC queue to see it as its job to nuke them. We just have to make sure it's done in a way that avoids busy loops in pathological cases where a large number of (non-abandoned) intents is always present on the range. If GC is bad at removing lots of intents, the timeout func could help, or even a dynamic timeout based on making progress rather than a fixed one, which is hard to compute in this case.
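
To make the "dynamic timeout based on making progress" idea concrete, here is a generic, hedged sketch (not the GC queue's actual processTimeoutFunc; all names are illustrative): a watchdog that cancels the work only when the progress counter stops advancing.

```go
package sketch

import (
	"context"
	"sync/atomic"
	"time"
)

// runWithProgressTimeout cancels the work only if no progress has been
// observed for idleTimeout, rather than after a fixed wall-clock budget.
// The work function reports progress (e.g. intents resolved) by
// incrementing the counter.
func runWithProgressTimeout(
	parent context.Context,
	idleTimeout time.Duration,
	work func(ctx context.Context, progress *atomic.Int64) error,
) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	var progress atomic.Int64
	done := make(chan error, 1)
	go func() { done <- work(ctx, &progress) }()

	ticker := time.NewTicker(idleTimeout / 2)
	defer ticker.Stop()
	last, lastChange := int64(0), time.Now()
	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			if cur := progress.Load(); cur != last {
				last, lastChange = cur, time.Now()
			} else if time.Since(lastChange) > idleTimeout {
				cancel() // stalled: give up, work should return on ctx.Done()
			}
		}
	}
}
```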

@nvanbenschoten
Member

FWIW I just did a test on a local roachprod cluster with IntentAgeThreshold set artificially low and found that the GC queue is able to clean up about 56,000 intents per second on a range. For reference, a high-priority scan was cleaning up about 3,000 intents per second. So that helps confirm that when the GC queue does its job, it is still highly leveraged in cleaning up intents.
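
To put those rates in perspective (simple arithmetic): at 3,000 intents/s, a backlog of 2 million intents takes roughly 11 minutes to resolve, while at 56,000 intents/s it takes about 36 seconds.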

As a short-term mitigation, what do you think about me making IntentAgeThreshold a hidden cluster setting? Combined with #60619, this should give us a better toolbox to resolve these issues manually when they come up until we eliminate the problem entirely.

@joshimhoff
Collaborator

joshimhoff commented Feb 17, 2021

Do we have an example instance of this issue that we could look at the historical timeseries data for to confirm or deny this? cc. @joshimhoff.

Yes. One of the CC free-tier clusters hit this issue, leading to issues with BACKUP. Based on the lack of other alerts firing, I don't expect there was a high rate of restarts there, but I can check & report back.

If it helps, consider that we can backport additional observability into CC & quickly get it on all the 20.2 clusters. If we had a metric that tracked the number of intents in the "largest" ranges by intent count, we could actively look for such clusters even ahead of impact (in order to help find a root cause).

@joshimhoff
Collaborator

On the free-tier cluster, no restarts for last 2-3 weeks, @nvanbenschoten:

[image: graph of node counts/restarts over the past weeks]

The BACKUP problem occurred there this Friday IIRC.
The most recent set of restarts was likely a patch upgrade, so it was done gracefully.

(The little blip where there appears to be no nodes is likely to be a monitoring glitch.)

@tbg
Member Author

tbg commented Feb 17, 2021

As a short-term mitigation, what do you think about me making IntentAgeThreshold a hidden cluster setting?

SGTM

craig bot pushed a commit that referenced this issue Feb 17, 2021
60541: opt: inverted-index accelerate filters of the form j->'a' = '{"b": "c"}' r=angelazxu a=angelazxu

Previously, the optimizer did not plan inverted index scans for queries with
the JSON fetch value operator `->` when an object or array was on the right side
of the equality operator `=`, only when it was a boolean, string, number, or
null.

This change allows the inverted index to be used in these types of queries. It
supports objects with multiple key/value pairs, nested objects, arrays, arrays
nested within objects, and objects nested within arrays.

Fixes: #59605

Release note: None

60619: kv: allow manual GC that does not advance threshold r=nvanbenschoten a=nvanbenschoten

Relates to #60585.

This commit updates the GC queue's `process` method to allow GC even if
it determines that doing so would not advance the GC threshold. This
check is retained in the queue's `shouldQueue` method, so this change
does not impact automated queueing decisions. However, it does ensure
that if someone manually enqueues a range in the GC queue with the
`SkipShouldQueue` option, then the range is actually processed. This is
important because the GC threshold is the first thing that the
GC bumps when processing, meaning that the GC threshold alone is not a
definitive indication that there is no more work to perform on a range.
A processing pass that will not bump the threshold can still be useful,
especially with the long context timeout associated with manual
enqueuing.

Release note (ui change): Manually enqueueing a range in the GC queue
now properly respects the SkipShouldQueue option. This can be useful to
force the GC of a specific Range.

60638: sql: fix RENAME COLUMN for REGIONAL BY ROW r=ajstorm a=otan

We need to correct the column reference on the table descriptor if we
are renaming the REGIONAL BY ROW column.

Resolves #59116 

Release note: None



60698: vendor: bump pebble to 444296cf r=sumeerbhola a=sumeerbhola

444296cf Merge pull request #1066 from petermattis/pmattis/stability
59659388 update README with note about production readiness
a516e691 *: optimize SeekPrefixGE to use Next
327d2757 sstable: hoist blockSizes from testBytesIteratedWithCompression
69b09310 sstable: add benchmarks for zstd
cbe3a149 sstable: fix review comments
deb29d88 sstable: include zstd into the test cases
f13dea6f sstable: support zstd compression
73e8a3b0 vendor: added DataDog/zstd
ec81e4c4 sstable: add zstd-related constants

Release note: None

Co-authored-by: Angela Xu <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 18, 2021
Relates to cockroachdb#60585.

This commit updates the GC queue's `process` method to allow GC even if
it determines that doing so would not advance the GC threshold. This
check is retained in the queue's `shouldQueue` method, so this change
does not impact automated queueing decisions. However, it does ensure
that if someone manually enqueues a range in the GC queue with the
`SkipShouldQueue` option, then the range is actually processed. This is
important because the GC threshold is the first thing that the
GC bumps when processing, meaning that the GC threshold alone is not a
definitive indication that there is no more work to perform on a range.
A processing pass that will not bump the threshold can still be useful,
especially with the long context timeout associated with manual
enqueuing.

Release note (ui change): Manually enqueueing a range in the GC queue
now properly respects the SkipShouldQueue option. This can be useful to
force the GC of a specific Range.
@erikgrinaker
Contributor

erikgrinaker commented May 6, 2021

I think I have a plausible explanation for the intent buildup. As mentioned earlier, when we roll back a txn and the client context is cancelled, we give the final EndTxn(commit=false) a 3-second timeout and then process the EndTxn request with this 3-second context:

cockroach/pkg/kv/txn.go

Lines 756 to 758 in 0eba39a

```go
ba.Add(endTxnReq(false /* commit */, nil /* deadline */, false /* systemConfigTrigger */))
_ = contextutil.RunWithTimeout(ctx, "async txn rollback", 3*time.Second, func(ctx context.Context) error {
	if _, pErr := txn.Send(ctx, ba); pErr != nil {
```

But we didn't really think this should affect cleanup, because we kick off async tasks that are disconnected from the client's context as long as we haven't hit the async task quota:

```go
err := ir.stopper.RunLimitedAsyncTask(
	// If we've successfully launched a background task, dissociate
	// this work from our caller's context and timeout.
	ir.ambientCtx.AnnotateCtx(context.Background()),
	"storage.IntentResolver: processing intents",
```

But of course, if the 3-second context expires before it gets to the point where it can kick off the async task, then it isn't going to clean up anything at all since it gets aborted here:

```go
case <-ctxDone:
	// If our context was canceled, return an AmbiguousResultError,
	// which indicates to the caller that the command may have executed.
	abandon()
	log.VEventf(ctx, 2, "context cancellation after %0.1fs of attempting command %s",
		timeutil.Since(startTime).Seconds(), ba)
	return nil, nil, roachpb.NewError(roachpb.NewAmbiguousResultError(ctx.Err().Error()))
```

I initially considered this fairly edge-casey, because I thought this only applied to clients submitting a ROLLBACK and then disconnecting. However, when a client disconnects in the middle of a transaction (including an implicit transaction, i.e. a naked DML statement), of course the server-side session will call txn.rollback() to abort the client's open transaction -- and this is going to get called with a context that's already cancelled (because the client has already disconnected).

This means that every time a client disconnects during a txn or in-flight DML statement, the EndTxn(commit=false) has a 3-second deadline to call CleanupTxnIntentsAsync(). In this time, it has to go through RPC, concurrency control, dist-sender, Raft consensus, batch execution, and possibly other stuff. If it doesn't make it, it's not getting cleaned up. It doesn't seem too far-fetched to me that this might happen often enough.

This is kind of what #64770 is getting at, but a bit broader. I'll submit a PR tomorrow that tries harder to clean up in the face of client context cancellation.
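
A minimal sketch of the direction such a fix could take (illustrative only, with a stand-in sendEndTxn; not the actual patch): run the rollback on a context derived from the background context, with a generous timeout, instead of the client's already-cancelled context.

```go
package sketch

import (
	"context"
	"log"
	"time"
)

// rollbackDetached sends the final EndTxn(commit=false) on a context derived
// from the background context rather than the client's (possibly already
// cancelled) context, so a client disconnect cannot abort its own cleanup.
// sendEndTxn stands in for issuing the rollback batch; the 1-minute budget
// mirrors the timeout the eventual fix adopted.
func rollbackDetached(sendEndTxn func(ctx context.Context) error) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		defer cancel()
		if err := sendEndTxn(ctx); err != nil {
			log.Printf("async txn rollback failed: %v", err)
		}
	}()
}
```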

@erikgrinaker
Contributor

I think I found the reason why we're leaking txn records as well:

```go
// Run transaction record GC outside of ir.sem.
return ir.stopper.RunAsyncTask(
	ctx,
	"storage.IntentResolver: cleanup txn records",
	func(ctx context.Context) {
		err := ir.gcTxnRecord(ctx, rangeID, txn)
```

We spin off a separate async task to clean up the txn record, but use the surrounding context. In cases where intent resolution is synchronous, and thus is connected to the caller's context, the call can return before gcTxnRecord completes, which can cause the caller to cancel the context (either because of a defer cancel() or a client disconnect) and thus cancel the txn record cleanup as well.
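
A hedged sketch of the shape of a fix (stand-in functions, not the real API): even when intent resolution is synchronous and caller-scoped, the trailing record GC gets a bounded, background-derived context of its own.

```go
package sketch

import (
	"context"
	"time"
)

// cleanupFinishedTxnSync sketches the race described above and one way to
// avoid it: intent resolution runs synchronously on the caller's ctx, but the
// trailing txn record GC gets its own background-derived context (bounded to
// avoid goroutine leaks), so the caller returning and cancelling ctx cannot
// cancel the record cleanup. Both funcs are stand-ins for the real calls.
func cleanupFinishedTxnSync(
	ctx context.Context,
	resolveIntents func(context.Context) error,
	gcTxnRecord func(context.Context) error,
) error {
	if err := resolveIntents(ctx); err != nil { // caller-scoped, synchronous
		return err
	}
	go func() {
		gcCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
		defer cancel()
		_ = gcTxnRecord(gcCtx) // detached from the caller's ctx
	}()
	return nil
}
```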

@nvanbenschoten
Member

nvanbenschoten commented May 10, 2021

Another thing to add here is that we don't try very hard to use the LockSpans slice in a transaction record to clean up all of a transaction's intents and drive the transaction record cleanup to completion if the first attempt at intent resolution fails. Currently, the only places where we drive cleanup to completion are in response to a transaction's own client and in the GC queue (after an hour). All other conflicts will push a transaction to determine its status, but will then only resolve the single conflicting intent and will never return (synchronously or asynchronously) to clean up any other intents or GC the transaction record. We have some memory and some batching in the lockTable to avoid redundant pushes and to batch intent resolution for the same transaction, but this is all on a per-range basis.

There's probably something we could do better here. We can't rely on a finalized transaction record to accurately reflect all of the intents that the transaction wrote because the transaction may have expired without ever being finalized cleanly. However, in cases where a transaction is finalized cleanly and the record includes its intent spans, we could do more. For instance, PushTxnRequests and RecoverTxnRequests that find finalized transaction records could call into IntentResolver.CleanupTxnIntentsAsync.

Of course, there are some practical considerations here. For instance, we'd want to avoid thundering herds of cleanup for the same transaction immediately after it is finalized. We may also want to avoid redundant work as best as we can in cases where a transaction's intents are partially cleaned up - we don't currently checkpoint cleanup progress anywhere. There's also some concern about GC-ing the record of a possibly active transaction, as is discussed in #53146.
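
For illustration, a rough sketch of that push-triggered cleanup idea with the thundering-herd concern folded in (all names are stand-ins, not CockroachDB's real API):

```go
package sketch

import (
	"math/rand"
	"time"
)

// maybeCleanupAfterPush sketches the idea above: a conflicting reader that
// pushed a txn and found its record already finalized (and carrying its lock
// spans) opportunistically triggers full cleanup, rather than resolving only
// the single conflicting intent. The jittered delay is one way to blunt the
// thundering herd of many readers discovering the same finalized txn at once.
func maybeCleanupAfterPush(
	finalized bool, lockSpans []string,
	cleanupTxnIntentsAsync func(spans []string),
) {
	if !finalized || len(lockSpans) == 0 {
		return // unknown spans (e.g. expired txn): leave it to the GC queue
	}
	go func() {
		time.Sleep(time.Duration(rand.Intn(500)) * time.Millisecond)
		cleanupTxnIntentsAsync(lockSpans)
	}()
}
```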

@erikgrinaker
Contributor

Another thing to add here is that we don't try very hard to use the LockSpans slice in a transaction record to clean up all of a transaction's intents and drive the transaction record cleanup to completion if the first attempt at intent resolution fails. Currently, the only places where we drive cleanup to completion are in response to a transaction's own client and in the GC queue (after an hour). All other conflicts will push a transaction to determine its status, but will then only resolve the single conflicting intent and will never return (synchronously or asynchronously) to clean up any other intents or GC the transaction record.

I think this is probably the right approach in general. When we're running some other operation, and happen to run into intents, we want to keep latency as low as possible and just deal with the intents that we have to.

We can't rely on a finalized transaction record to accurately reflect all of the intents that the transaction wrote because the transaction may have expired without ever being finalized cleanly. However, in cases where a transaction is finalized cleanly and the record includes its intent spans, we could do more.

We might want a separate GC queue for this rather than doing best-effort ad hoc cleanup. Since the txn records are stored in the range-local key space we can presumably scan them efficiently and clean up any finalized txn records. How to finalize abandoned txns is a separate concern (haven't looked into how this is currently done yet).

craig bot pushed a commit that referenced this issue May 18, 2021
64869: kvserver: improve intent cleanup for disconnected clients r=nvanbenschoten a=erikgrinaker

### kvserver: improve intent cleanup for disconnected clients

Whenever a client disconnects during an open transaction or in-flight
DML statement, the server-side session will asynchronously clean up the
transaction record and intents by rolling back the transaction.
However, this had a very short timeout of 3 seconds, and even though the
actual `IntentResolver.CleanupTxnIntentsAsync()` cleanup task runs
asynchronously and independently of the client's context in the typical
case, the short timeout could cause transaction cleanup to be cancelled
if the `EndTxn` request took more than 3 seconds to get all the way
through Raft execution or if the async task limit was exceeded such that
the cleanup task kept using the client's context. This in turn could
lead to intents building up over time.

This patch increases the timeout when rolling back transactions for
disconnected clients to 1 minute, and also tries to perform transaction
cleanup when a client disconnects while waiting for an `EndTxn` command
to go through Raft.

Resolves #64770, touches #60585.

Release note (bug fix): improved transaction cleanup for disconnected
clients, to reduce intent buildup.

### kvserver: fix race condition during synchronous txn record cleanup

Transaction records and intents are normally cleaned up asynchronously
via `IntentResolver.CleanupTxnIntentsAsync()`, separately from the
client's context. When the async task limit is exceeded, cleanup instead
becomes synchronous and attached to the client context. However, the
final `gcTxnRecord()` call to remove the transaction record is
asynchronous even when intent cleanup is synchronous, to avoid holding
onto the intent resolver task slot. This call will typically return to
the caller before `gcTxnRecord()` completes, which may cause the caller to
cancel the context (either via `defer cancel()` or a client
disconnection) and in turn cancel the `gcTxnRecord()` call.

This patch gives the async `gcTxnRecord()` call a separate background
context that's independent of the client's context even in the
synchronous case, with a 20 second timeout to avoid goroutine leaks.

Resolves #64868, touches #60585.

Release note (bug fix): Fixed a race condition during transaction
cleanup that could leave old transaction records behind until MVCC
garbage collection.

/cc @cockroachdb/kv 

Co-authored-by: Erik Grinaker <[email protected]>
@erikgrinaker
Contributor

erikgrinaker commented May 22, 2021

For those following along, we've found a couple of issues in the transaction coordinator that seem even more likely to cause intent buildup, I'm working on fixes: #65458 #65587

@erikgrinaker
Contributor

I believe we may have plugged the main intent leak with #65592, which is being merged to master right now. However, it involves non-trivial changes to the main transaction code path, so we're going to let it sit on master for a few weeks before backporting it. #64131, which is already backported, should alleviate the problems with backups and intents.

Also, we've added a section to the BACKUP docs on how to diagnose and resolve hung backups due to intent buildup, since this keeps coming up:

https://www.cockroachlabs.com/docs/v21.1/backup.html#slow-or-hung-backups-and-queries-due-to-write-intent-buildup

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
craig bot pushed a commit that referenced this issue Jul 7, 2021
65001: kvserver: make GC intent scoring more aggressive r=tbg a=erikgrinaker

### keys: rename RangeLastGCKey to RangeGCThresholdKey

The key `RangeLastGCKey` was not actually used to store the last GC
timestamp, but rather the last GC threshold. This commit renames the key
to the more appropriate `RangeGCThresholdKey`.

Release note: None

### kvserver: increase GC queue timeout

The MVCC GC queue timeout was left at the default of 1 minute. Since the
`GCThreshold` is advanced and persisted at the very start of the GC run
(to coordinate with other processes such as exports and rangefeeds
before removing any data), this means that if a GC run takes more than 1
minute to complete it will be aborted due to the timeout and may not run
again for some time.

This patch increases the timeout to 10 minutes. The primary purpose of
the timeout is to avoid head-of-line blocking in the case of unavailable
ranges, and a larger timeout still satisfies this property while also
allowing more expensive cleanup attempts to complete.

Release note (ops change): increased the timeout for range MVCC
garbage collection from 1 minute to 10 minutes, to allow larger jobs to
run to completion.

### kvserver: use PUSH_TOUCH during GC to push intent owners

Using `PUSH_ABORT` to push intent owners during intent GC can cause it
to block on (or possibly abort) long-running transactions.

Since we would prefer GC to continue with other intents rather than
waiting for active transactions, this patch changes the GC queue to use
`PUSH_TOUCH` instead. This will still clean up abandoned txns.

Resolves #65777.

Release note (bug fix): Intent garbage collection no longer waits for
or aborts running transactions.

### kvserver: add GC queue cooldown timer

When the GC queue considered replicas for MVCC GC, it primarily
considered the previous GC threshold and the amount of MVCC garbage that
would be removed by advancing the threshold. We would like to also
trigger GC based solely on intent stats. However, we can't know whether
we'll actually be able to clean up any intents, since they may belong to
an in-progress long-running transaction. There is therefore a risk that
we will keep spinning on GC attempts due to high intent counts.

This patch records the last GC queue processing timestamp, and adds a
cooldown timer of 2 hours between each range GC attempt triggered by
intent score alone to prevent spinning on GC attempts.

Release note: None

### kvserver: make GC intent scoring more aggressive

Users often experience buildup of intents due to the GC queue not being
aggressive enough in cleaning them up. Currently, the GC queue will only
trigger based on intents if the average intent age is 10 days and if
there is also MVCC garbage that can be cleaned up.

This patch makes the intent scoring more aggressive, triggering GC when
the average intent age is 8 hours regardless of other MVCC garbage. The
previous commit added a cooldown timer to prevent the GC queue from
spinning on a replica if the intents couldn't be cleaned up (e.g.
because they belong to an in-progress long-running transaction).

Resolves #64266.
Touches  #60585.

Release note (ops change): trigger MVCC and intent garbage collection
when the average intent age is 8 hours, down from 10 days.

/cc @cockroachdb/kv 

Co-authored-by: Erik Grinaker <[email protected]>
@erikgrinaker
Contributor

We've now resolved the known issues around intent buildup and mitigation. The most important mitigation (better intent handling during backup export requests in #64131) has been backported, but the other changes carry more risk and we've decided not to backport these for now -- they are thus only available on master, scheduled for 21.2. We therefore won't know for certain whether the intent buildup problem is fully resolved until 21.2 is released and taken into widespread use, but I'm closing this for now and we can reopen it if we find any further issues.

@joshimhoff
Collaborator

YAY! Thanks, Erik!

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 14, 2023
…uests

Fixes cockroachdb#50390.
Related to cockroachdb#60585.
Related to cockroachdb#103126.

Closes cockroachdb#64500, which was an earlier attempt to solve this issue using a similar
approach. This commit addresses the comments on that PR, which focused on the
pagination of intent resolution when bypassing the request batcher. We still
don't try to solve this issue, and instead limit the cases where intent
resolution bypasses the request batcher to those where pagination is not
necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver`
`ResolveOptions`, which instructs the `IntentResolver` to send the provided
intent resolution requests immediately, instead of adding them to a batch and
waiting up to 10ms (defaultIntentResolutionBatchWait) for that batch to fill up
with other intent resolution requests. This can be used to avoid any
batching-induced latency and should be used only by foreground traffic that is
willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated
by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are
performed by background tasks that are more than happy to trade increased
latency for increased throughput, the `lockTableWaiter` calls into the
`IntentResolver` on behalf of a foreground operation. It wants intent resolution
to complete as soon as possible, so it is willing to trade reduced throughput
for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction
and then scanning over the table using a high-priority transaction. The test was
performed on a 3-node roachprod cluster to demonstrate the effect with realistic
network and disk latencies.
```
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH; SELECT count(*) FROM keys;
```

Without this change, the scan took 70.078s. With this change, the scan took
15.958s. This 78% speed up checks out. Each encountered intent is resolved
serially (see cockroachdb#103126), so the per-intent latency drops from 7ms to 1.6ms. This
improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`,
which delays each resolution request that passes through a RequestBatcher. With
this change, these resolve intent requests are issued immediately and this delay
is not experienced.

While this is a significant improvement and will be important for Read Committed
transactions after cockroachdb#102014, this is still not quite enough to resolve cockroachdb#103126.
For that, we need to batch the resolve intent requests themselves using a form
of deferred intent resolution like we added in cockroachdb#49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents
from many different transactions. This scenario bypasses the deferred intent
resolution path added in cockroachdb#49218, because the intents are each written by
different transactions. The speedup for this scenario was presented in cockroachdb#64500.

Release note (performance improvement): SQL statements that must clean up
intents from many different previously abandoned transactions now do so
moderately more efficiently.
craig bot pushed a commit that referenced this issue May 18, 2023
103265: kv: resolve conflicting intents immediately for latency-sensitive requests r=nvanbenschoten a=nvanbenschoten

Fixes #50390.
Related to #60585.
Related to #103126.

Closes #64500, which was an earlier attempt to solve this issue using a similar approach. This commit addresses the comments on that PR, which focused on the pagination of intent resolution when bypassing the request batcher. We still don't try to solve this issue, and instead limit the cases where intent resolution bypasses the request batcher to those where pagination is not necessary.

This commit adds a new `sendImmediately` option to the `IntentResolver` `ResolveOptions`, which instructs the `IntentResolver` to send the provided intent resolution requests immediately, instead of adding them to a batch and waiting up to 10ms (`defaultIntentResolutionBatchWait`) for that batch to fill up with other intent resolution requests. This can be used to avoid any batching-induced latency and should be used only by foreground traffic that is willing to trade off some throughput for lower latency.

The commit then sets this flag for single-key intent resolution requests initiated by the `lockTableWaiter`. Unlike most calls into the `IntentResolver`, which are performed by background tasks that are more than happy to trade increased latency for increased throughput, the `lockTableWaiter` calls into the `IntentResolver` on behalf of a foreground operation. It wants intent resolution to complete as soon as possible, so it is willing to trade reduced throughput for reduced latency.

I tested this out by writing 10,000 different intents in a normal-priority transaction and then scanning over the table using a high-priority transaction. The test was performed on a 3-node roachprod cluster to demonstrate the effect with realistic network and disk latencies.
```sql
-- session 1
CREATE TABLE keys (k BIGINT NOT NULL PRIMARY KEY);
BEGIN; INSERT INTO keys SELECT generate_series(1, 10000);

-- session 2
BEGIN PRIORITY HIGH; SELECT count(*) FROM keys;
```

Without this change, the scan took **70.078s**. With this change, the scan took **15.958s**. This **78%** speed-up checks out. Each encountered intent is resolved serially (see #103126), so the **per-intent latency** drops from **7ms** to **1.6ms.** This improvement by about 5ms agrees with the `defaultIntentResolutionBatchIdle`, which delays each resolution request that passes through a RequestBatcher. With this change, these resolve intent requests are issued immediately and this delay is not experienced.

While this is a significant improvement and will be important for Read Committed transactions after #102014, this is still not quite enough to resolve #103126. For that, we need to batch the resolve intent requests themselves using a form of deferred intent resolution like we added in #49218 (for finalized transactions).

A similar improvement is seen for scans that encounter many abandoned intents from many different transactions. This scenario bypasses the deferred intent resolution path added in #49218, because the intents are each written by different transactions. The speedup for this scenario was presented in #64500.

Release note (performance improvement): SQL statements that must clean up intents from many different previously abandoned transactions now do so moderately more efficiently.


103362: sql: validate primary / secondary region localities at end of txn r=fqazi a=fqazi

Previously, if a database was restored with skip_localities, there was no way to modify this database to set the primary region, since validation would kick in too early during the statement. This meant fixing the regions in a restored database was impossible if the primary region was no longer valid. To address this, this patch delays locality validation until the end of the transaction.

Fixes: #103290

Release note (bug fix): SET PRIMARY REGION and SET SECONDARY REGION did not validate transactionally, which could prevent cleaning up removed regions after a restore.

103373: concurrency: small refactors in preparation for reservation removal r=nvanbenschoten a=arulajmani

See individual commits for details. 

Informs: #103361

103538: go.mod: bump Pebble to 6f2788660198, rework shared storage wrapper r=RaduBerinde a=RaduBerinde

6f278866 shared: improve interface for more efficient reading
9eb2c407 db: log events to testing.T in unit tests
f32e7dc6 db: add reserved Pebblev4 sstable format
5a6b91b8 objstorage: improve test and add read ahead test
2bc4319e objstorage: remove genericFileReadable
8143ffb9 objstorage: fix readaheadState initialization
06d08888 db: add Compact.Duration metric
e7213de0 db: add Uptime metric
e9005aed db: don't delete files during ingest application
222b43ec internal/arenaskl: fix Skiplist doc typo

Release note: None
Epic: none

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>