-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
backupccl: ExportRequest with MaxUserPriority times out when it encounters an intent #83342
Labels
A-disaster-recovery
A-kv
Anything in KV that doesn't belong in a more specific category.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv
KV Team
Comments
adityamaru
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
A-disaster-recovery
A-kv
Anything in KV that doesn't belong in a more specific category.
labels
Jun 24, 2022
cc @cockroachdb/bulk-io |
@nvanbenschoten has a fix in mind, following which I'll check in a backup level test. This issue might be what we are seeing in https://github.com/cockroachlabs/support/issues/1661 (v21.2.12) and potentially https://github.com/cockroachlabs/support/issues/1570. |
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jun 25, 2022
Fixes cockroachdb#83342. This commit reworks the behavior of non-transactional requests in lock wait-queues when the lock holder or the lock queue waiter has an extreme priority (min or max priority). In such cases, we allow the lock queue waiter to immediately push the lock holder out of its way, either by moving its timestamp to resolve a read-write conflict or aborting it to resolve a write-write conflict. This handling was broken in two ways for non-transactional requests. 1. these requests' priorities were not consulted when deciding whether to immediately push instead of temporarily delaying while waiting in the lock wait-queue. This meant that a high-priority, non-txn request might still wait for 50ms (kv.lock_table.coordinator_liveness_push_delay) before pushing a lower priority lock holder out of its way. 2. worse, it was possible that if these requests were not in the front of a lock wait-queue, they might never push. This was because we had logic that disabled a push if it was not needed for the purposes of checking liveness, detecting deadlocks, or enforcing timeouts. This commit resolves both of these issues. It also improves the testing of transaction priorities in the `kv/kvserver/concurrency` package. Finally, it consolidates the determination of when a pusher should be able to push/abort a pushee into a single location. Release note (bug fix): a bug in transaction conflict resolution which could allow backups to wait on long-running transactions has been resolved.
craig bot
pushed a commit
that referenced
this issue
Jul 6, 2022
83366: kv: fix conflict resolution for high-priority, non-txn'al requests r=andreimatei a=nvanbenschoten Fixes #83342. This commit reworks the behavior of non-transactional requests in lock wait-queues when the lock holder or the lock queue waiter has an extreme priority (min or max priority). In such cases, we allow the lock queue waiter to immediately push the lock holder out of its way, either by moving its timestamp to resolve a read-write conflict or aborting it to resolve a write-write conflict. This handling was broken in two ways for non-transactional requests. 1. these requests' priorities were not consulted when deciding whether to immediately push instead of temporarily delaying while waiting in the lock wait-queue. This meant that a high-priority, non-txn request might still wait for 50ms (kv.lock_table.coordinator_liveness_push_delay) before pushing a lower priority lock holder out of its way. 2. worse, it was possible that if these requests were not in the front of a lock wait-queue, they might never push. This was because we had logic that disabled a push if it was not needed for the purposes of checking liveness, detecting deadlocks, or enforcing timeouts. This commit resolves both of these issues. It also improves the testing of transaction priorities in the `kv/kvserver/concurrency` package. Finally, it consolidates the determination of when a pusher should be able to push/abort a pushee into a single location. Release note (bug fix): a bug in transaction conflict resolution which could allow backups to wait on long-running transactions has been resolved. 83629: util/mon: pass reserved account by reference r=yuzefovich a=yuzefovich **util/mon: pass reserved account by reference** This commit makes it so that we now pass the `reserved` memory account when starting up memory monitors by reference. Previously, due to passing by value, when the monitor is stopped, the copy of the value would get used so that the actual "reserved" memory account would get out of sync with its user. This is now fixed. However, this bug doesn't really have any production impact since we use this "reserved" feature in a handful of places and these "reserved" memory accounts are not reused between different usages. Additionally, this commit renames `MakeStandaloneBudget` to `NewStandaloneBudget` (since it now returns a reference) and adds a separate "start" method when the caller doesn't want to pre-reserve anything when starting up a monitor. Release note: None **sql: make sure to close the reserved account for each session** This commit makes it more clear that the "reserved" memory account created for each connection is closed when that connection is closed. Previously, the account was already cleared (when "session root" monitor is stopped which happens when the connExecutor stops), so there was no leak in the accounting system because of it, yet it wasn't obvious - this commit makes it more obvious. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 6, 2022
Fixes cockroachdb#83342. This commit reworks the behavior of non-transactional requests in lock wait-queues when the lock holder or the lock queue waiter has an extreme priority (min or max priority). In such cases, we allow the lock queue waiter to immediately push the lock holder out of its way, either by moving its timestamp to resolve a read-write conflict or aborting it to resolve a write-write conflict. This handling was broken in two ways for non-transactional requests. 1. these requests' priorities were not consulted when deciding whether to immediately push instead of temporarily delaying while waiting in the lock wait-queue. This meant that a high-priority, non-txn request might still wait for 50ms (kv.lock_table.coordinator_liveness_push_delay) before pushing a lower priority lock holder out of its way. 2. worse, it was possible that if these requests were not in the front of a lock wait-queue, they might never push. This was because we had logic that disabled a push if it was not needed for the purposes of checking liveness, detecting deadlocks, or enforcing timeouts. This commit resolves both of these issues. It also improves the testing of transaction priorities in the `kv/kvserver/concurrency` package. Finally, it consolidates the determination of when a pusher should be able to push/abort a pushee into a single location. Release note (bug fix): a bug in transaction conflict resolution which could allow backups to wait on long-running transactions has been resolved.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Jul 6, 2022
Fixes cockroachdb#83342. This commit reworks the behavior of non-transactional requests in lock wait-queues when the lock holder or the lock queue waiter has an extreme priority (min or max priority). In such cases, we allow the lock queue waiter to immediately push the lock holder out of its way, either by moving its timestamp to resolve a read-write conflict or aborting it to resolve a write-write conflict. This handling was broken in two ways for non-transactional requests. 1. these requests' priorities were not consulted when deciding whether to immediately push instead of temporarily delaying while waiting in the lock wait-queue. This meant that a high-priority, non-txn request might still wait for 50ms (kv.lock_table.coordinator_liveness_push_delay) before pushing a lower priority lock holder out of its way. 2. worse, it was possible that if these requests were not in the front of a lock wait-queue, they might never push. This was because we had logic that disabled a push if it was not needed for the purposes of checking liveness, detecting deadlocks, or enforcing timeouts. This commit resolves both of these issues. It also improves the testing of transaction priorities in the `kv/kvserver/concurrency` package. Finally, it consolidates the determination of when a pusher should be able to push/abort a pushee into a single location. Release note (bug fix): a bug in transaction conflict resolution which could allow backups to wait on long-running transactions has been resolved.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-disaster-recovery
A-kv
Anything in KV that doesn't belong in a more specific category.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv
KV Team
Context:
Backup gives each ExportRequest 5 minutes to complete before it cancels the corresponding context. ExportRequest is initially sent with
header.WaitPolicy = lock.WaitPolicyError
which means that it fails with a WriteIntentError when it sees an intent written by another transaction. For a minute, we retry the export if we see a WriteIntentError, after which we send the request withheader.UserPriority = roachpb.MaxUserPriority
andheader.WaitPolicy = lock.WaitPolicyBlock
, which is expected to push any other lower priority transaction out of the way.Issue:
The ExportRequest sent with a higher priority does not seem to push the low priority txn on its final attempt, and therefore times out after 5 minutes. Following is a test reproducing this behavior:
Jira issue: CRDB-17016
The text was updated successfully, but these errors were encountered: