
backupccl: ExportRequest with MaxUserPriority times out when it encounters an intent #83342

Closed · adityamaru opened this issue Jun 24, 2022 · 2 comments · Fixed by #83366

Labels: A-disaster-recovery, A-kv, C-bug, T-kv

Comments

adityamaru (Contributor) commented Jun 24, 2022

Context:

Backup gives each ExportRequest 5 minutes to complete before it cancels the corresponding context. The ExportRequest is initially sent with header.WaitPolicy = lock.WaitPolicyError, which means that it fails with a WriteIntentError when it encounters an intent written by another transaction. For one minute we retry the export whenever we see a WriteIntentError; after that, we send the request with header.UserPriority = roachpb.MaxUserPriority and header.WaitPolicy = lock.WaitPolicyBlock, which is expected to push any lower-priority transaction out of the way.
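
A rough sketch of this escalation logic, using hypothetical names (waitPolicy, sendExport, exportWithEscalation) rather than the actual backupccl code:

package sketch

import (
	"errors"
	"time"
)

type waitPolicy int

const (
	waitError waitPolicy = iota // fail fast with WriteIntentError on intents
	waitBlock                   // block on intents and push conflicting txns
)

var errWriteIntent = errors.New("conflicting intent encountered")

// sendExport stands in for issuing one ExportRequest with the given wait
// policy and (optionally) MaxUserPriority.
func sendExport(policy waitPolicy, maxPriority bool) error {
	return nil // placeholder
}

// exportWithEscalation retries with waitError for the first minute, then
// falls back to a single blocking, high-priority attempt, mirroring the
// behavior described above.
func exportWithEscalation(timeout time.Duration) error {
	deadline := time.Now().Add(timeout)       // overall per-request budget (5 minutes)
	escalateAt := time.Now().Add(time.Minute) // retry window before escalating
	for time.Now().Before(deadline) {
		if time.Now().Before(escalateAt) {
			err := sendExport(waitError, false)
			if !errors.Is(err, errWriteIntent) {
				return err // success or a non-intent error
			}
			time.Sleep(time.Second) // back off and retry while we keep hitting intents
			continue
		}
		// Final attempt: block on the intent at MaxUserPriority, which is
		// expected to push the lower-priority txn out of the way.
		return sendExport(waitBlock, true)
	}
	return errors.New("export timed out")
}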

Issue:

The ExportRequest sent with the higher priority does not appear to push the low-priority txn out of the way on its final attempt, and the backup therefore times out after 5 minutes. The following test reproduces this behavior:

func TestExportRequestHighPriority(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

	const numAccounts = 1
	tc, sqlDB, _, cleanupFn := backupRestoreTestSetup(t, singleNode, numAccounts, InitManualReplication)
	defer cleanupFn()

	_, t10sql := serverutils.StartTenant(t, tc.Server(0), base.TestTenantArgs{
		TenantID: roachpb.MakeTenantID(10),
	})

	_, err := t10sql.Exec(`CREATE TABLE foo (id INT)`)
	require.NoError(t, err)

	ctx := context.Background()
	g := ctxgroup.WithContext(ctx)
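	// Start a transaction in the secondary tenant that writes an intent to
	// foo and is never committed, so the intent is still unresolved when the
	// backup's ExportRequest reaches it.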
	g.Go(func() error {
		_, err := t10sql.Exec(`
BEGIN;
INSERT INTO foo VALUES (1);
`)
		return err
	})

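	// The backup should eventually escalate to MaxUserPriority and push the
	// conflicting intent out of its way; before the fix, this blocked until
	// the 5 minute ExportRequest timeout.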
	sqlDB.Exec(t, `BACKUP TENANT 10 INTO 'nodelocal://1/foo'`)
	require.NoError(t, g.Wait())
}

Jira issue: CRDB-17016

adityamaru added the C-bug, A-disaster-recovery, and A-kv labels Jun 24, 2022
blathers-crl bot commented Jun 24, 2022

cc @cockroachdb/bulk-io

blathers-crl bot added the T-kv label Jun 24, 2022
adityamaru (Contributor, Author)

@nvanbenschoten has a fix in mind, following which I'll check in a backup-level test. This issue might be what we are seeing in https://github.com/cockroachlabs/support/issues/1661 (v21.2.12) and potentially https://github.com/cockroachlabs/support/issues/1570.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 25, 2022
Fixes cockroachdb#83342.

This commit reworks the behavior of non-transactional requests in lock
wait-queues when the lock holder or the lock queue waiter has an extreme
priority (min or max priority). In such cases, we allow the lock queue
waiter to immediately push the lock holder out of its way, either by
moving its timestamp to resolve a read-write conflict or aborting it to
resolve a write-write conflict.

This handling was broken in two ways for non-transactional requests.
1. These requests' priorities were not consulted when deciding whether
   to immediately push instead of temporarily delaying while waiting in the
   lock wait-queue. This meant that a high-priority, non-txn request might
   still wait for 50ms (kv.lock_table.coordinator_liveness_push_delay)
   before pushing a lower priority lock holder out of its way.
2. Worse, it was possible that if these requests were not at the front
   of a lock wait-queue, they might never push. This was because we had
   logic that disabled a push if it was not needed for the purposes of
   checking liveness, detecting deadlocks, or enforcing timeouts.

This commit resolves both of these issues. It also improves the testing
of transaction priorities in the `kv/kvserver/concurrency` package.
Finally, it consolidates the determination of when a pusher should be
able to push/abort a pushee into a single location.

Release note (bug fix): a bug in transaction conflict resolution which
could allow backups to wait on long-running transactions has been
resolved.
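
A rough sketch of the priority rule the commit describes, using hypothetical names rather than the real kv/kvserver/concurrency code:

package sketch

// txnPriority stands in for the priority carried by a transaction or by a
// non-transactional request's header.
type txnPriority int32

const (
	minTxnPriority txnPriority = 0
	maxTxnPriority txnPriority = 1<<31 - 1
)

// pushType distinguishes the two conflict resolutions mentioned above.
type pushType int

const (
	pushTimestamp pushType = iota // read-write conflict: bump the holder's timestamp
	pushAbort                     // write-write conflict: abort the holder
)

// shouldPushImmediately reports whether a lock-queue waiter may push the
// lock holder right away instead of waiting out the liveness push delay.
// Extreme priorities on either side short-circuit the wait; the fix makes
// this apply to non-transactional requests as well, regardless of their
// position in the wait-queue.
func shouldPushImmediately(pusher, pushee txnPriority) bool {
	return pusher == maxTxnPriority || pushee == minTxnPriority
}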
craig bot pushed a commit that referenced this issue Jul 6, 2022
83366: kv: fix conflict resolution for high-priority, non-txn'al requests r=andreimatei a=nvanbenschoten

Fixes #83342.

83629: util/mon: pass reserved account by reference r=yuzefovich a=yuzefovich

**util/mon: pass reserved account by reference**

This commit makes it so that the `reserved` memory account is now passed by reference when starting up memory monitors. Previously, due to passing by value, a copy of the value was used when the monitor was stopped, so the actual "reserved" memory account would get out of sync with its user. This is now fixed. However, this bug doesn't have any real production impact, since we use this "reserved" feature in only a handful of places and these "reserved" memory accounts are not reused between different usages.

Additionally, this commit renames `MakeStandaloneBudget` to `NewStandaloneBudget` (since it now returns a reference) and adds a separate "start" method for callers that don't want to pre-reserve anything when starting up a monitor.

Release note: None
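
A minimal illustration of the by-value problem described above, using stand-in types rather than the real util/mon API:

package sketch

// account is a stand-in for a memory account with some reserved bytes.
type account struct{ used int64 }

// monitor holds a reference to the reserved account it was started with,
// mirroring the pass-by-reference change described above.
type monitor struct {
	reserved *account
}

// startMonitor takes the reserved account by reference; if it took the
// account by value, stop would release only a copy and the caller's
// account would stay out of sync.
func startMonitor(reserved *account) *monitor {
	return &monitor{reserved: reserved}
}

// stop releases the caller's reservation through the shared reference.
func (m *monitor) stop() {
	m.reserved.used = 0
}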

**sql: make sure to close the reserved account for each session**

This commit makes it clearer that the "reserved" memory account created for each connection is closed when that connection is closed. Previously, the account was already cleared (when the "session root" monitor is stopped, which happens when the connExecutor stops), so there was no leak in the accounting system, but this wasn't obvious; this commit makes it explicit.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot closed this as completed in 83ac099 Jul 6, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 6, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 6, 2022