Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: restore fails when issuing an AdminSplit #96746

Closed
adityamaru opened this issue Feb 7, 2023 · 11 comments · Fixed by #98116
Closed

backupccl: restore fails when issuing an AdminSplit #96746

adityamaru opened this issue Feb 7, 2023 · 11 comments · Fixed by #98116
Assignees
Labels
A-disaster-recovery A-kv Anything in KV that doesn't belong in a more specific category. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker T-kv KV Team

Comments

@adityamaru
Copy link
Contributor

adityamaru commented Feb 7, 2023

When running a large restore we occasionally see the call to AdminSplit -

if err := s.db.AdminSplit(ctx, newSplitKey, expirationTime); err != nil {
failing with:

ERROR: importing 228177 ranges: splitting key /Table/134/1/200039096531068: unable to find store 83 in range r17080:/Table/134/1/2000390{55231406-60112830} [(n87,s87):6, (n46,s46):7, (n85,s85):8, next=9, gen=460, sticky=1675807501.216768519,0]

This error comes from the code that processes an AdminTransferLease -

return nil, nil, errors.Errorf("unable to find store %d in range %+v", target, desc)
when the target replica is not a replica of the range. The stack points to
pErr = roachpb.NewError(r.AdminTransferLease(ctx, tArgs.Target, tArgs.BypassSafetyChecks))
being where we invoke the transfer from, but it is still unclear where the AdminSplit logic sends an AdminTransferLease. It is worth noting that there were changes made to how we split and scatter chunks in #94805 but it is not yet clear why we are seeing more of this error than before.

Jira issue: CRDB-24307

@adityamaru adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery A-kv Anything in KV that doesn't belong in a more specific category. labels Feb 7, 2023
@blathers-crl
Copy link

blathers-crl bot commented Feb 7, 2023

cc @cockroachdb/disaster-recovery

1 similar comment
@blathers-crl
Copy link

blathers-crl bot commented Feb 7, 2023

cc @cockroachdb/disaster-recovery

@adityamaru adityamaru added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Feb 7, 2023
@rhu713
Copy link
Contributor

rhu713 commented Feb 8, 2023

Adding additional debug logging reveals that the error is from the first error in adminSplitWithDescriptor

@nvanbenschoten
Copy link
Member

The fact that a split is failing with an error from a lease transfer indicates that this is likely fallout from #74077. To perform a range split, ranges first need to leave joint consensus configurations. This can require a lease transfer to the VOTER_INCOMING that was just added to the range.

unable to find store 83 in range r17080 is an interesting error to get in these cases. It likely indicates a form of race. Either store 83 was already removed from the range by this point or the leaseholder fielding the TransferLease request has not seen store 83 get added to the range. I'm not sure that the first case is possible, but the second seems to be.

If we can reproduce this, it should be possible to determine which case we are in and diagnose the issue. cc. @shralex

@adityamaru are we seeing this in a test? Are there reproduction steps.

@adityamaru
Copy link
Contributor Author

adityamaru commented Feb 8, 2023

are we seeing this in a test? Are there reproduction steps.

We saw this error for the first time in a TPCE 10 million restore on a 96 node cluster 😅 It is quite reproducible at that scale but we haven't seen it in any of our smaller restore roachtests yet.

@adityamaru
Copy link
Contributor Author

If it helps @rhu713 or I can run the restore with a patched binary and get you the logs though.

@shermanCRL
Copy link
Contributor

Hiya @nvanbenschoten just an FYI (and sorry if repetitive) that this is blocking our ability to deliver “slim manifests” to a customer, it would be great to get this resolved ASAP. Sounds like it’s in KV’s court. Please ping / ask questions to @rhu713 or @adityamaru.

@rhu713 rhu713 added T-kv KV Team and removed T-disaster-recovery labels Feb 21, 2023
@nvanbenschoten
Copy link
Member

Hi @shermanCRL, apologies for the delay. I was OOO last week. I'll raise this issue in the KV weekly today and find a path to getting it resolved.

In the meantime, @adityamaru how would you prefer we work with you to test a potential fix? I suspect that the resolution will look something like nvanbenschoten@e48d7ba. Do you have an environment to test the restore with a patched binary up and running?

@dt
Copy link
Member

dt commented Feb 21, 2023

Aditya is OOO this week; @rhu713 has repro steps though

@rhu713
Copy link
Contributor

rhu713 commented Feb 22, 2023

I think you should be able reproduce often on master with the 96 node setup that I was using to test restore:

roachprod create $CLUSTER  -n 96 --gce-machine-type n2-standard-16  --local-ssd=false  --gce-pd-volume-size=4096

and

 RESTORE FROM latest IN 'gs://rui-backup-test/tpce/23.1-filesList/10M/3?AUTH=implicit';

Though the reproduction seems pretty consistent, it does take tens of minutes to maybe hours for the error to occur and stop the restore.

rhu713 pushed a commit to rhu713/cockroach that referenced this issue Feb 23, 2023
… node

This patch adds a setting to control the parallelism for split and scatters in
generativeSplitAndScatterProcessor, defaulting to 1. This is a workaround
for cockroachdb#96746 as parallel split and scatters sometimes result in a "store error"
that fails the restore.

In addition, for chunks that have failed to scatter, this patch routes the
chunk to a random node instead of the current node. This is necessary as prior
to the generative version, split and scatter processors were on every node,
thus there was no imbalance introduced from routing chunks that have failed to
scatter to the current node. The new generative split and scatter processor is
only on 1 node, and thus would cause the same node to process all chunks that
have failed to scatter.

Release note: None
@nvanbenschoten nvanbenschoten self-assigned this Mar 1, 2023
@nvanbenschoten nvanbenschoten added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 2, 2023
@nvanbenschoten
Copy link
Member

I think the fix for this is #98116, but I'm not able to reproduce the issue to confirm because of https://cockroachlabs.slack.com/archives/C2C5FKPPB/p1678148317150499. I'll hold off on this issue until that's resolved.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 7, 2023
…etryable

Fixes cockroachdb#96746.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease
to be retryable replication change errors when thrown by lease transfer
requests. In doing so, these errors will be retried by the retry loop in
`Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral
replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the
      replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
      been transferred the lease, added to the range, then removed from the
      range.

In either case, retrying the AdminSplit operation on these errors will
ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs
to fail with a `unable to find store` error.
craig bot pushed a commit that referenced this issue Apr 7, 2023
98116: kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable r=shralex a=nvanbenschoten

Fixes #96746.
Fixes #100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming 4(a). lease transfer request sent to replica that has not yet applied the
      replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
      been transferred the lease, added to the range, then removed from the
      range.

In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with a `unable to find store` error.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in c3519fe Apr 7, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 7, 2023
…etryable

Fixes #96746.
Fixes #100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease
to be retryable replication change errors when thrown by lease transfer
requests. In doing so, these errors will be retried by the retry loop in
`Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral
replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the
      replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
      been transferred the lease, added to the range, then removed from the
      range.

In either case, retrying the AdminSplit operation on these errors will
ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs
to fail with a `unable to find store` error.
nvanbenschoten added a commit that referenced this issue Apr 24, 2023
…etryable

Fixes #96746.
Fixes #100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease
to be retryable replication change errors when thrown by lease transfer
requests. In doing so, these errors will be retried by the retry loop in
`Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral
replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the
      replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
      been transferred the lease, added to the range, then removed from the
      range.

In either case, retrying the AdminSplit operation on these errors will
ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs
to fail with a `unable to find store` error.
nvanbenschoten added a commit that referenced this issue Jun 1, 2023
…etryable

Fixes #96746.
Fixes #100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease
to be retryable replication change errors when thrown by lease transfer
requests. In doing so, these errors will be retried by the retry loop in
`Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral
replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the
      replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
      been transferred the lease, added to the range, then removed from the
      range.

In either case, retrying the AdminSplit operation on these errors will
ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs
to fail with a `unable to find store` error.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 1, 2023
…etryable

Fixes cockroachdb#96746.
Fixes cockroachdb#100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease
to be retryable replication change errors when thrown by lease transfer
requests. In doing so, these errors will be retried by the retry loop in
`Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral
replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the
      replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
      been transferred the lease, added to the range, then removed from the
      range.

In either case, retrying the AdminSplit operation on these errors will
ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs
to fail with a `unable to find store` error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery A-kv Anything in KV that doesn't belong in a more specific category. branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker T-kv KV Team
Projects
No open projects
Archived in project
5 participants