release-20.1: kv/concurrency: avoid redundant txn pushes and batch intent resolution #49835
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Backport 2/2 commits from #49218.
/cc @cockroachdb/release
Fixes #48790.
Informs #36876.
Closes #31664.
This commit adds a per-Range LRU cache of transactions that are known to be aborted or committed. We use this cache in the lockTableWaiter for two purposes:
kv.lock_table.coordinator_liveness_push_delay
(10 ms) nor push the transactions record (RPC to leaseholder of pushee's txn record range).Together, these two changes make us much more effective as cleaning up after failed transactions that have abandoned a large number of intents. The following example demonstrates this:
Notice that we are still not as fast at cleaning up intents on the insertion path as we are at doing so on the retrieval path. This is because we only batch the resolution of intents observed by a single request at a time. For the scanning case, a single ScanRequest notices all 10,000 intents and cleans them all up together. For the insertion case, each of the 10,000 PutRequests notices a single intent, and each intent is cleaned up individually. So this case is only benefited by the first part of this change (no liveness delay or txn record push) and not the second part of this change (intent resolution batching).
For this reason, we still haven't solved all of #36876. To completely address that, we'll need to defer propagation of
WriteIntentError
during batch evaluation, as we do forWriteTooOldError
s. Or we can wait out the future LockTable changes - once we remove all cases where an intent is not "discovered", the changes here will effectively address #36876.This was a partial regression in v20.1, so we'll want to backport this to that release branch. This change is on the larger side, but I feel ok about it because the mechanics aren't too tricky. I'll wait a week before backporting just to see if anything falls out.
Release note (bug fix): Abandoned intents due to failed transaction coordinators are now cleaned up much faster. This resolves a regression in v20.1.0 compared to prior releases.
@irfansharif I'm adding you as a reviewer because there's not really anyone else on KV that knows this code, so we should change that.