roachtest: restore/tpce/32TB/aws/nodes=15/cpus=16 failed #100379

Closed
cockroach-teamcity opened this issue Apr 1, 2023 · 3 comments · Fixed by #98116
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-disaster-recovery

Comments


cockroach-teamcity commented Apr 1, 2023

roachtest.restore/tpce/32TB/aws/nodes=15/cpus=16 failed with artifacts on master @ 99102ddf4b7602788b422366f1acc14b81c64d24:

test artifacts and logs in: /artifacts/restore/tpce/32TB/aws/nodes=15/cpus=16/run_1
(monitor.go:127).Wait: monitor failure: monitor task failed: output in run_052239.378670427_n1_cockroach-sql-insecu: ./cockroach sql --insecure -e "RESTORE  FROM LATEST IN 's3://cockroach-fixtures/backups/tpc-e/customers=2000000/v22.2.1/inc-count=48?AUTH=implicit' AS OF SYSTEM TIME '2023-01-11T23:45:00Z' " returned: COMMAND_PROBLEM: ssh verbose log retained in ssh_052239.382670106_n1_cockroach-sql-insecu.log: exit status 1

Parameters: ROACHTEST_cloud=aws , ROACHTEST_cpu=16 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery


Jira issue: CRDB-26395

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Apr 1, 2023
cockroach-teamcity added this to the 23.1 milestone Apr 1, 2023

dt commented Apr 3, 2023

ERROR: importing 56050 ranges: splitting key /Table/139/1/200020134191750: unable to find store 8 in range r27981:/Table/139/1/2000199{47440712-93909240} [(n15,s15):7, (n13,s13):5, (n1,s1):3, next=8, gen=625, sticky=1680330407.362469527,0]


dt commented Apr 3, 2023

@nvanbenschoten You had a fix for this, right?

dt added the GA-blocker label and removed the release-blocker label Apr 3, 2023
@nvanbenschoten

Yes, this will be fixed by #98116. I'm planning to backport this to v23.1 and v22.2, but am moderately concerned about the fragility at this level, so I'd feel more comfortable letting the change bake for a few weeks before the backport. How does targeting v23.1.1 sound?

dt removed the GA-blocker label Apr 5, 2023
craig bot pushed a commit that referenced this issue Apr 7, 2023
98116: kv: consider ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable r=shralex a=nvanbenschoten

Fixes #96746.
Fixes #100379.

This commit considers ErrReplicaNotFound and ErrReplicaCannotHoldLease to be retryable replication change errors when thrown by lease transfer requests. In doing so, these errors will be retried by the retry loop in `Replica.executeAdminCommandWithDescriptor`.

This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:
1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4(a). lease transfer request sent to replica that has not yet applied the
      replication change that added the voter_incoming to the range
4(b). lease transfer request delayed and delivered after voter_incoming has
      been transferred the lease, added to the range, then removed from the
      range.

In either case, retrying the AdminSplit operation on these errors will ensure that it eventually succeeds.

Release note (bug fix): Fixed a rare race that could allow large RESTOREs to fail with an `unable to find store` error.

Co-authored-by: Nathan VanBenschoten
craig bot closed this as completed in c3519fe Apr 7, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 7, 2023
nvanbenschoten added a commit that referenced this issue Apr 24, 2023
nvanbenschoten added a commit that referenced this issue Jun 1, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 1, 2023