backupccl: restore fails when issuing an AdminSplit #96746
Comments
cc @cockroachdb/disaster-recovery
Adding additional debug logging reveals that the error is from the first error in
The fact that a split is failing with an error from a lease transfer indicates that this is likely fallout from #74077. To perform a range split, ranges first need to leave joint consensus configurations. This can require a lease transfer to the VOTER_INCOMING that was just added to the range.
If we can reproduce this, it should be possible to determine which case we are in and diagnose the issue. cc @shralex @adityamaru are we seeing this in a test? Are there reproduction steps?
We saw this error for the first time in a TPCE 10 million restore on a 96 node cluster 😅 It is quite reproducible at that scale, but we haven't seen it in any of our smaller restore roachtests yet.
If it helps, @rhu713 or I can run the restore with a patched binary and get you the logs.
Hiya @nvanbenschoten, just an FYI (and sorry if repetitive) that this is blocking our ability to deliver “slim manifests” to a customer. It would be great to get this resolved ASAP. Sounds like it’s in KV’s court. Please ping / ask questions to @rhu713 or @adityamaru.
Hi @shermanCRL, apologies for the delay. I was OOO last week. I'll raise this issue in the KV weekly today and find a path to getting it resolved. In the meantime, @adityamaru how would you prefer we work with you to test a potential fix? I suspect that the resolution will look something like nvanbenschoten@e48d7ba. Do you have an environment to test the restore with a patched binary up and running?
Aditya is OOO this week; @rhu713 has repro steps though.
I think you should be able to reproduce it often on master with the 96 node setup that I was using to test restore:
and
Though the reproduction seems pretty consistent, it can take anywhere from tens of minutes to hours for the error to occur and stop the restore.
… node

This patch adds a setting to control the parallelism for split and scatters in generativeSplitAndScatterProcessor, defaulting to 1. This is a workaround for cockroachdb#96746 as parallel split and scatters sometimes result in a "store error" that fails the restore.

In addition, for chunks that have failed to scatter, this patch routes the chunk to a random node instead of the current node. This is necessary as prior to the generative version, split and scatter processors were on every node, thus there was no imbalance introduced from routing chunks that have failed to scatter to the current node. The new generative split and scatter processor is only on 1 node, and thus would cause the same node to process all chunks that have failed to scatter.

Release note: None
I think the fix for this is #98116, but I'm not able to reproduce the issue to confirm because of https://cockroachlabs.slack.com/archives/C2C5FKPPB/p1678148317150499. I'll hold off on this issue until that's resolved.
…etryable

Fixes cockroachdb#96746.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has been transferred the lease, added to the range, then removed from the range.

In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with an `unable to find store` error.
98116: kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable r=shralex a=nvanbenschoten

Fixes #96746.
Fixes #100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has been transferred the lease, added to the range, then removed from the range.

In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with an `unable to find store` error.

Co-authored-by: Nathan VanBenschoten
When running a large restore we occasionally see the call to `AdminSplit` fail (cockroach/pkg/ccl/backupccl/split_and_scatter_processor.go, line 99 in 2f81142).

This error comes from the code that processes an `AdminTransferLease` (cockroach/pkg/kv/kvserver/replica_range_lease.go, line 881 in 3359d46; cockroach/pkg/kv/kvserver/replica_send.go, line 952 in 7de1273) … `AdminSplit` logic sends an `AdminTransferLease`. It is worth noting that there were changes made to how we split and scatter chunks in #94805, but it is not yet clear why we are seeing more of this error than before.

Jira issue: CRDB-24307