roachtest: restore/tpce/32TB/aws/nodes=15/cpus=16 failed #100379
Comments
@nvanbenschoten You had a fix for this, right?
Yes, this will be fixed by #98116. I'm planning to backport this to v23.1 and v22.2, but am moderately concerned about the fragility at this level, so I'd feel more comfortable letting the change bake for a few weeks before the backport. How does targeting v23.1.1 sound?
98116: kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable r=shralex a=nvanbenschoten

Fixes #96746.
Fixes #100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has been transferred the lease, added to the range, then removed from the range.

In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a `unable to find store` error.

Co-authored-by: Nathan VanBenschoten <[email protected]>
roachtest.restore/tpce/32TB/aws/nodes=15/cpus=16 failed with artifacts on master @ 99102ddf4b7602788b422366f1acc14b81c64d24:
Parameters: ROACHTEST_cloud=aws, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-26395