backupccl: restore fails when issuing an AdminSplit #96746

adityamaru · 2023-02-07T21:19:10Z

When running a large restore we occasionally see the call to AdminSplit -

cockroach/pkg/ccl/backupccl/split_and_scatter_processor.go

Line 99 in 2f81142

if err := s.db.AdminSplit(ctx, newSplitKey, expirationTime); err != nil {

failing with:

ERROR: importing 228177 ranges: splitting key /Table/134/1/200039096531068: unable to find store 83 in range r17080:/Table/134/1/2000390{55231406-60112830} [(n87,s87):6, (n46,s46):7, (n85,s85):8, next=9, gen=460, sticky=1675807501.216768519,0]

This error comes from the code that processes an AdminTransferLease -

cockroach/pkg/kv/kvserver/replica_range_lease.go

Line 881 in 3359d46

    
           return nil, nil, errors.Errorf("unable to find store %d in range %+v", target, desc)

when the target replica is not a replica of the range. The stack points to

cockroach/pkg/kv/kvserver/replica_send.go

Line 952 in 7de1273

    
           pErr = roachpb.NewError(r.AdminTransferLease(ctx, tArgs.Target, tArgs.BypassSafetyChecks))

being where we invoke the transfer from, but it is still unclear where the AdminSplit logic sends an AdminTransferLease. It is worth noting that there were changes made to how we split and scatter chunks in #94805 but it is not yet clear why we are seeing more of this error than before.

Jira issue: CRDB-24307

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-02-07T21:19:15Z

cc @cockroachdb/disaster-recovery

blathers-crl · 2023-02-07T21:19:15Z

cc @cockroachdb/disaster-recovery

rhu713 · 2023-02-08T20:33:51Z

Adding additional debug logging reveals that the error is from the first error in adminSplitWithDescriptor

nvanbenschoten · 2023-02-08T20:47:19Z

The fact that a split is failing with an error from a lease transfer indicates that this is likely fallout from #74077. To perform a range split, ranges first need to leave joint consensus configurations. This can require a lease transfer to the VOTER_INCOMING that was just added to the range.

unable to find store 83 in range r17080 is an interesting error to get in these cases. It likely indicates a form of race. Either store 83 was already removed from the range by this point or the leaseholder fielding the TransferLease request has not seen store 83 get added to the range. I'm not sure that the first case is possible, but the second seems to be.

If we can reproduce this, it should be possible to determine which case we are in and diagnose the issue. cc. @shralex

@adityamaru are we seeing this in a test? Are there reproduction steps.

adityamaru · 2023-02-08T20:54:09Z

are we seeing this in a test? Are there reproduction steps.

We saw this error for the first time in a TPCE 10 million restore on a 96 node cluster 😅 It is quite reproducible at that scale but we haven't seen it in any of our smaller restore roachtests yet.

adityamaru · 2023-02-08T20:59:27Z

If it helps @rhu713 or I can run the restore with a patched binary and get you the logs though.

shermanCRL · 2023-02-13T15:27:17Z

Hiya @nvanbenschoten just an FYI (and sorry if repetitive) that this is blocking our ability to deliver “slim manifests” to a customer, it would be great to get this resolved ASAP. Sounds like it’s in KV’s court. Please ping / ask questions to @rhu713 or @adityamaru.

nvanbenschoten · 2023-02-21T19:20:55Z

Hi @shermanCRL, apologies for the delay. I was OOO last week. I'll raise this issue in the KV weekly today and find a path to getting it resolved.

In the meantime, @adityamaru how would you prefer we work with you to test a potential fix? I suspect that the resolution will look something like nvanbenschoten@e48d7ba. Do you have an environment to test the restore with a patched binary up and running?

dt · 2023-02-21T19:30:53Z

Aditya is OOO this week; @rhu713 has repro steps though

rhu713 · 2023-02-22T01:29:26Z

I think you should be able reproduce often on master with the 96 node setup that I was using to test restore:

roachprod create $CLUSTER  -n 96 --gce-machine-type n2-standard-16  --local-ssd=false  --gce-pd-volume-size=4096

and

 RESTORE FROM latest IN 'gs://rui-backup-test/tpce/23.1-filesList/10M/3?AUTH=implicit';

Though the reproduction seems pretty consistent, it does take tens of minutes to maybe hours for the error to occur and stop the restore.

… node This patch adds a setting to control the parallelism for split and scatters in generativeSplitAndScatterProcessor, defaulting to 1. This is a workaround for cockroachdb#96746 as parallel split and scatters sometimes result in a "store error" that fails the restore. In addition, for chunks that have failed to scatter, this patch routes the chunk to a random node instead of the current node. This is necessary as prior to the generative version, split and scatter processors were on every node, thus there was no imbalance introduced from routing chunks that have failed to scatter to the current node. The new generative split and scatter processor is only on 1 node, and thus would cause the same node to process all chunks that have failed to scatter. Release note: None

nvanbenschoten · 2023-03-07T14:59:57Z

I think the fix for this is #98116, but I'm not able to reproduce the issue to confirm because of https://cockroachlabs.slack.com/archives/C2C5FKPPB/p1678148317150499. I'll hold off on this issue until that's resolved.

…etryable Fixes cockroachdb#96746. This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4(a). lease transfer request sent to replica that has not yet applied the replication change that added the voter_incoming to the range 4(b). lease transfer request delayed and delivered after voter_incoming has been transferred the lease, added to the range, then removed from the range. In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds. Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a `unable to find store` error.

98116: kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable r=shralex a=nvanbenschoten Fixes #96746. Fixes #100379. This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4(a). lease transfer request sent to replica that has not yet applied the replication change that added the voter_incoming to the range 4(b). lease transfer request delayed and delivered after voter_incoming has been transferred the lease, added to the range, then removed from the range. In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds. Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a `unable to find store` error. Co-authored-by: Nathan VanBenschoten <[email protected]>

…etryable Fixes #96746. Fixes #100379. This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4(a). lease transfer request sent to replica that has not yet applied the replication change that added the voter_incoming to the range 4(b). lease transfer request delayed and delivered after voter_incoming has been transferred the lease, added to the range, then removed from the range. In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds. Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a `unable to find store` error.

…etryable Fixes cockroachdb#96746. Fixes cockroachdb#100379. This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`. This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation: 1. issue AdminSplit 2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas) 3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4(a). lease transfer request sent to replica that has not yet applied the replication change that added the voter_incoming to the range 4(b). lease transfer request delayed and delivered after voter_incoming has been transferred the lease, added to the range, then removed from the range. In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds. Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a `unable to find store` error.

adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery A-kv Anything in KV that doesn't belong in a more specific category. labels Feb 7, 2023

blathers-crl bot added T-disaster-recovery labels Feb 7, 2023

adityamaru added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Feb 7, 2023

rhu713 added T-kv KV Team and removed T-disaster-recovery labels Feb 21, 2023

rhu713 mentioned this issue Feb 23, 2023

backupccl: send chunks with fail scatters to random node in generative ssp #97589

Merged

nvanbenschoten self-assigned this Mar 1, 2023

nvanbenschoten added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 2, 2023

nvanbenschoten mentioned this issue Mar 7, 2023

kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable #98116

Merged

craig bot closed this as completed in c3519fe Apr 7, 2023

blathers-crl bot mentioned this issue Apr 7, 2023

release-23.1: kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable #100955

Merged

nvanbenschoten mentioned this issue Apr 7, 2023

release-22.2: kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable #100957

Merged

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

backupccl: restore fails when issuing an AdminSplit #96746

backupccl: restore fails when issuing an AdminSplit #96746

adityamaru commented Feb 7, 2023 •

edited

Loading

blathers-crl bot commented Feb 7, 2023

blathers-crl bot commented Feb 7, 2023

rhu713 commented Feb 8, 2023

nvanbenschoten commented Feb 8, 2023

adityamaru commented Feb 8, 2023 •

edited

Loading

adityamaru commented Feb 8, 2023

shermanCRL commented Feb 13, 2023

nvanbenschoten commented Feb 21, 2023

dt commented Feb 21, 2023

rhu713 commented Feb 22, 2023

nvanbenschoten commented Mar 7, 2023

backupccl: restore fails when issuing an AdminSplit #96746

backupccl: restore fails when issuing an AdminSplit #96746

Comments

adityamaru commented Feb 7, 2023 • edited Loading

blathers-crl bot commented Feb 7, 2023

blathers-crl bot commented Feb 7, 2023

rhu713 commented Feb 8, 2023

nvanbenschoten commented Feb 8, 2023

adityamaru commented Feb 8, 2023 • edited Loading

adityamaru commented Feb 8, 2023

shermanCRL commented Feb 13, 2023

nvanbenschoten commented Feb 21, 2023

dt commented Feb 21, 2023

rhu713 commented Feb 22, 2023

nvanbenschoten commented Mar 7, 2023

adityamaru commented Feb 7, 2023 •

edited

Loading

adityamaru commented Feb 8, 2023 •

edited

Loading