
backupccl: ExportRequest with MaxUserPriority times out when it encounters an intent #83342

Closed · adityamaru opened this issue Jun 24, 2022 · 2 comments · Fixed by #83366

Labels: A-disaster-recovery, A-kv, C-bug, T-kv

Comments

adityamaru (Contributor) commented Jun 24, 2022

Context:

Backup gives each ExportRequest 5 minutes to complete before it cancels the corresponding context. The ExportRequest is initially sent with header.WaitPolicy = lock.WaitPolicyError, which means that it fails with a WriteIntentError when it encounters an intent written by another transaction. For one minute we retry the export whenever we see a WriteIntentError; after that, we send the request with header.UserPriority = roachpb.MaxUserPriority and header.WaitPolicy = lock.WaitPolicyBlock, which is expected to push any lower-priority transaction out of the way.
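
A rough sketch of this escalation logic, using hypothetical names (waitPolicy, sendExport, exportWithEscalation) rather than the actual backupccl code:

package sketch

import (
	"errors"
	"time"
)

type waitPolicy int

const (
	waitError waitPolicy = iota // fail fast with WriteIntentError on intents
	waitBlock                   // block on intents and push conflicting txns
)

var errWriteIntent = errors.New("conflicting intent encountered")

// sendExport stands in for issuing one ExportRequest with the given wait
// policy and (optionally) MaxUserPriority.
func sendExport(policy waitPolicy, maxPriority bool) error {
	return nil // placeholder
}

// exportWithEscalation retries with waitError for the first minute, then
// falls back to a single blocking, high-priority attempt, mirroring the
// behavior described above.
func exportWithEscalation(timeout time.Duration) error {
	deadline := time.Now().Add(timeout)       // overall per-request budget (5 minutes)
	escalateAt := time.Now().Add(time.Minute) // retry window before escalating
	for time.Now().Before(deadline) {
		if time.Now().Before(escalateAt) {
			err := sendExport(waitError, false)
			if !errors.Is(err, errWriteIntent) {
				return err // success or a non-intent error
			}
			time.Sleep(time.Second) // back off and retry while we keep hitting intents
			continue
		}
		// Final attempt: block on the intent at MaxUserPriority, which is
		// expected to push the lower-priority txn out of the way.
		return sendExport(waitBlock, true)
	}
	return errors.New("export timed out")
}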

Issue:

The ExportRequest sent with the higher priority does not appear to push the low-priority txn out of the way on its final attempt, and the backup therefore times out after 5 minutes. The following test reproduces this behavior:

func TestExportRequestHighPriority(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

	const numAccounts = 1
	tc, sqlDB, _, cleanupFn := backupRestoreTestSetup(t, singleNode, numAccounts, InitManualReplication)
	defer cleanupFn()

	_, t10sql := serverutils.StartTenant(t, tc.Server(0), base.TestTenantArgs{
		TenantID: roachpb.MakeTenantID(10),
	})

	_, err := t10sql.Exec(`CREATE TABLE foo (id INT)`)
	require.NoError(t, err)

	ctx := context.Background()
	g := ctxgroup.WithContext(ctx)
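	// Start a transaction in the secondary tenant that writes an intent to
	// foo and is never committed, so the intent is still unresolved when the
	// backup's ExportRequest reaches it.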
	g.Go(func() error {
		_, err := t10sql.Exec(`
BEGIN;
INSERT INTO foo VALUES (1);
`)
		return err
	})

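	// The backup should eventually escalate to MaxUserPriority and push the
	// conflicting intent out of its way; before the fix, this blocked until
	// the 5 minute ExportRequest timeout.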
	sqlDB.Exec(t, `BACKUP TENANT 10 INTO 'nodelocal://1/foo'`)
	require.NoError(t, g.Wait())
}

Jira issue: CRDB-17016

adityamaru added the C-bug, A-disaster-recovery, and A-kv labels Jun 24, 2022
blathers-crl bot commented Jun 24, 2022

cc @cockroachdb/bulk-io

blathers-crl bot added the T-kv label Jun 24, 2022
adityamaru (Contributor, Author)

@nvanbenschoten has a fix in mind, following which I'll check in a backup-level test. This issue might be what we are seeing in https://github.com/cockroachlabs/support/issues/1661 (v21.2.12) and potentially https://github.com/cockroachlabs/support/issues/1570.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 25, 2022
Fixes cockroachdb#83342.

This commit reworks the behavior of non-transactional requests in lock
wait-queues when the lock holder or the lock queue waiter has an extreme
priority (min or max priority). In such cases, we allow the lock queue
waiter to immediately push the lock holder out of its way, either by
moving its timestamp to resolve a read-write conflict or aborting it to
resolve a write-write conflict.

This handling was broken in two ways for non-transactional requests.
1. These requests' priorities were not consulted when deciding whether
   to immediately push instead of temporarily delaying while waiting in the
   lock wait-queue. This meant that a high-priority, non-txn request might
   still wait for 50ms (kv.lock_table.coordinator_liveness_push_delay)
   before pushing a lower priority lock holder out of its way.
2. Worse, it was possible that if these requests were not at the front
   of a lock wait-queue, they might never push. This was because we had
   logic that disabled a push if it was not needed for the purposes of
   checking liveness, detecting deadlocks, or enforcing timeouts.

This commit resolves both of these issues. It also improves the testing
of transaction priorities in the `kv/kvserver/concurrency` package.
Finally, it consolidates the determination of when a pusher should be
able to push/abort a pushee into a single location.

Release note (bug fix): a bug in transaction conflict resolution which
could allow backups to wait on long-running transactions has been
resolved.
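
A rough sketch of the priority rule the commit describes, using hypothetical names rather than the real kv/kvserver/concurrency code:

package sketch

// txnPriority stands in for the priority carried by a transaction or by a
// non-transactional request's header.
type txnPriority int32

const (
	minTxnPriority txnPriority = 0
	maxTxnPriority txnPriority = 1<<31 - 1
)

// pushType distinguishes the two conflict resolutions mentioned above.
type pushType int

const (
	pushTimestamp pushType = iota // read-write conflict: bump the holder's timestamp
	pushAbort                     // write-write conflict: abort the holder
)

// shouldPushImmediately reports whether a lock-queue waiter may push the
// lock holder right away instead of waiting out the liveness push delay.
// Extreme priorities on either side short-circuit the wait; the fix makes
// this apply to non-transactional requests as well, regardless of their
// position in the wait-queue.
func shouldPushImmediately(pusher, pushee txnPriority) bool {
	return pusher == maxTxnPriority || pushee == minTxnPriority
}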
craig bot pushed a commit that referenced this issue Jul 6, 2022
83366: kv: fix conflict resolution for high-priority, non-txn'al requests r=andreimatei a=nvanbenschoten

Fixes #83342.

83629: util/mon: pass reserved account by reference r=yuzefovich a=yuzefovich

**util/mon: pass reserved account by reference**

This commit makes it so that the `reserved` memory account is now passed by reference when starting up memory monitors. Previously, due to passing by value, a copy of the value was used when the monitor was stopped, so the actual "reserved" memory account would get out of sync with its user. This is now fixed. However, this bug doesn't have any real production impact, since we use this "reserved" feature in only a handful of places and these "reserved" memory accounts are not reused between different usages.

Additionally, this commit renames `MakeStandaloneBudget` to `NewStandaloneBudget` (since it now returns a reference) and adds a separate "start" method for callers that don't want to pre-reserve anything when starting up a monitor.

Release note: None
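
A minimal illustration of the by-value problem described above, using stand-in types rather than the real util/mon API:

package sketch

// account is a stand-in for a memory account with some reserved bytes.
type account struct{ used int64 }

// monitor holds a reference to the reserved account it was started with,
// mirroring the pass-by-reference change described above.
type monitor struct {
	reserved *account
}

// startMonitor takes the reserved account by reference; if it took the
// account by value, stop would release only a copy and the caller's
// account would stay out of sync.
func startMonitor(reserved *account) *monitor {
	return &monitor{reserved: reserved}
}

// stop releases the caller's reservation through the shared reference.
func (m *monitor) stop() {
	m.reserved.used = 0
}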

**sql: make sure to close the reserved account for each session**

This commit makes it clearer that the "reserved" memory account created for each connection is closed when that connection is closed. Previously, the account was already cleared (when the "session root" monitor is stopped, which happens when the connExecutor stops), so there was no leak in the accounting system, but this wasn't obvious; this commit makes it explicit.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot closed this as completed in 83ac099 Jul 6, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 6, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 6, 2022