kv/kvserver: TestLeaseExpirationBasedDrainTransferWithExtension failed #110715

Closed
cockroach-teamcity opened this issue Sep 15, 2023 · 9 comments
Labels: A-testing Testing tools and infrastructure · C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. · C-test-failure Broken test (automatically or manually discovered). · O-robot Originated from a bot. · T-kv KV Team
Milestone: 23.1

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Sep 15, 2023

kv/kvserver.TestLeaseExpirationBasedDrainTransferWithExtension failed with artifacts on release-23.1.11-rc @ b2012193c9558ed2f6541da27469a8c5ecca52ae:

=== RUN   TestLeaseExpirationBasedDrainTransferWithExtension
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestLeaseExpirationBasedDrainTransferWithExtension2127012733
    test_log_scope.go:79: use -show-logs to present logs inline
    client_replica_test.go:1806: condition failed to evaluate within 3m45s: [NotLeaseHolderError] lease held by different store; r64: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):2 seq=2 start=1694777961.282534834,0 exp=1694778193.624914971,0 pro=1694778187.624914971,0
    panic.go:540: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestLeaseExpirationBasedDrainTransferWithExtension2127012733
--- FAIL: TestLeaseExpirationBasedDrainTransferWithExtension (233.09s)

Parameters: TAGS=bazel,gss,race

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv


Jira issue: CRDB-31574

@cockroach-teamcity cockroach-teamcity added branch-release-23.1.11-rc C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Sep 15, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Sep 15, 2023
@andrewbaptist
Collaborator

@kvoli I'm removing the release blocker tag, but can you take a look at this when you get a chance? We have seen a few similar failures, so I just wanted to make sure this wasn't anything new.

@andrewbaptist andrewbaptist removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 15, 2023
@nvanbenschoten
Member

I can reproduce this under stress + race: 524 runs completed, 1 failure, over 8m18s.

dev test pkg/kv/kvserver -f=TestLeaseExpirationBasedDrainTransferWithExtension --race --stress

I'm taking a look at what's going wrong.

@nvanbenschoten
Member

We see the lease extension get blocked, unblocked, and then succeed:

I230925 14:41:54.292263 4082 kv/kvserver_test/client_replica_test.go:1544 ⋮ [T1,n2,s2,r64/2:‹/System{\x00aaa-test…-/NodeLive…}›] 215  filter blocking request: ‹header:<key:"\004\000aaa-testing" > lease:<start:<wall_time:1695652917282965205 > expiration:<wall_time:1695652923282965205 > replica:<node_id:2 store_id:2 replica_id:2 type:VOTER_FULL > proposed_ts:<wall_time:1695652917282965205 > > prev_lease:<start:<wall_time:1695652914275046965 > expiration:<wall_time:1695652920274444638 > replica:<node_id:2 store_id:2 replica_id:2 type:VOTER_FULL > proposed_ts:<wall_time:1695652914274444638 > sequence:2 acquisition_type:Transfer > min_proposed_ts:<>›
I230925 14:41:54.298729 4082 kv/kvserver_test/client_replica_test.go:1546 ⋮ [T1,n2,s2,r64/2:‹/System{\x00aaa-test…-/NodeLive…}›] 216  filter unblocking lease request

After this point, we see the drain get stuck waiting to transfer away the lease:

I230925 14:41:57.521738 4083 kv/kvserver/store.go:1823 ⋮ [drain] 257  waiting for 1 replicas to transfer their lease away
I230925 14:41:54.352268 3140 kv/kvserver/replica_proposal.go:276 ⋮ [T1,n2,s2,r64/2:‹/System{\x00aaa-test…-/NodeLive…}›,raft] 224  new range lease repl=(n2,s2):2 seq=2 start=1695652914.275046965,0 exp=1695652923.282965205,0 pro=1695652917.282965205,0 following repl=(n2,s2):2 seq=2 start=1695652914.275046965,0 exp=1695652920.274444638,0 pro=1695652914.274444638,0
...
I230925 14:41:57.521738 4083 kv/kvserver/store.go:1823 ⋮ [drain] 257  waiting for 1 replicas to transfer their lease away
I230925 14:41:57.897760 3142 kv/kvserver/replica_proposal.go:276 ⋮ [T1,n2,s2,r64/2:‹/System{\x00aaa-test…-/NodeLive…}›,raft] 258  new range lease repl=(n2,s2):2 seq=2 start=1695652914.275046965,0 exp=1695652926.883641757,0 pro=1695652920.883641757,0 following repl=(n2,s2):2 seq=2 start=1695652914.275046965,0 exp=1695652923.282965205,0 pro=1695652917.282965205,0

Eventually, the drain hits a 5s timeout:

I230925 14:41:59.316693 4083 kv/kvserver/store.go:1823 ⋮ [drain] 262  waiting for 1 replicas to transfer their lease away
W230925 14:41:59.356646 4083 kv/kvserver/store.go:1839 ⋮ [drain] 263  unable to drain cleanly within 5s (cluster setting ‹server.shutdown.lease_transfer_wait›), service might briefly deteriorate if the node is terminated: waiting for 1 replicas to transfer their lease away

At this point, we give up on the drain and the test is bound to eventually fail.
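For reference, here is a minimal, self-contained sketch (not the real store drain code; the names are made up) of the shape of that loop: retry the lease transfers until `server.shutdown.lease_transfer_wait` elapses, then log the warning above and shut down anyway:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// transferAllLeases stands in for one drain pass; it reports how many leases
// still need to move. Here it always fails, mimicking the behavior below.
func transferAllLeases() (remaining int, err error) {
	return 1, errors.New("no suitable transfer target found")
}

func main() {
	// Stand-in for the server.shutdown.lease_transfer_wait cluster setting
	// (5s by default, per the warning in the log).
	const leaseTransferWait = 5 * time.Second
	deadline := time.Now().Add(leaseTransferWait)

	for {
		remaining, err := transferAllLeases()
		if remaining == 0 {
			fmt.Println("drain complete")
			return
		}
		fmt.Printf("waiting for %d replicas to transfer their lease away: %v\n", remaining, err)
		if time.Now().After(deadline) {
			// Matches the warning above: give up on a clean drain and proceed.
			fmt.Printf("unable to drain cleanly within %s, service might briefly deteriorate\n", leaseTransferWait)
			return
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```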

The questions to answer are:

  • what's blocking the lease transfer from n2 to n1?
  • would the drain have ever completed if this timeout was higher?
  • why are we now occasionally hitting this 5s timeout?

@nvanbenschoten
Member

With additional logging (--vmodule=store=1), I see the following log lines repeating:

I230925 15:30:40.343542 4708 kv/kvserver/store.go:1736 ⋮ [T1,drain,n2,s2,r64/2:‹/System{\x00aaa-test…-/NodeLive…}›] 271  attempting to transfer lease repl=(n2,s2):2 seq=2 start=1695655839.861607363,0 exp=1695655849.138051962,0 pro=1695655843.138051962,0 for range r64:‹/System{\x00aaa-testing-/NodeLiveness}› [(n1,s1):1, (n2,s2):2, next=3, gen=3, sticky=9223372036.854775807,2147483647]
E230925 15:30:40.344498 4708 kv/kvserver/store.go:1767 ⋮ [T1,drain,n2,s2,r64/2:‹/System{\x00aaa-test…-/NodeLive…}›] 272  failed to transfer lease repl=(n2,s2):2 seq=2 start=1695655839.861607363,0 exp=1695655849.138051962,0 pro=1695655843.138051962,0 for range r64:‹/System{\x00aaa-testing-/NodeLiveness}› [(n1,s1):1, (n2,s2):2, next=3, gen=3, sticky=9223372036.854775807,2147483647] when draining: ‹no suitable transfer target found›

So the lease transfer repeatedly fails with a `no suitable transfer target found` error.

@nvanbenschoten
Member

This reminds me of #101885.

@nvanbenschoten
Member

n1 seems to be getting filtered out as a lease transfer target here:

candidates, _ = storePool.LiveAndDeadReplicas(
	candidates, false, /* includeSuspectAndDrainingStores */
)
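For context, here is a minimal sketch (hypothetical types and names, not the real store pool code) of what passing `false` for `includeSuspectAndDrainingStores` implies: suspect and draining stores are dropped from the live candidate set, so a suspect n1 leaves the draining n2 with no eligible transfer target:

```go
package main

import "fmt"

// Hypothetical, simplified store state used only for illustration.
type storeStatus int

const (
	statusLive storeStatus = iota
	statusSuspect
	statusDraining
	statusDead
)

type store struct {
	id     int
	status storeStatus
}

// liveAndDeadStores mirrors the shape of the LiveAndDeadReplicas call quoted
// above: with includeSuspectAndDraining=false, suspect and draining stores are
// excluded from the "live" result entirely.
func liveAndDeadStores(candidates []store, includeSuspectAndDraining bool) (live, dead []store) {
	for _, s := range candidates {
		switch s.status {
		case statusLive:
			live = append(live, s)
		case statusSuspect, statusDraining:
			if includeSuspectAndDraining {
				live = append(live, s)
			}
		case statusDead:
			dead = append(dead, s)
		}
	}
	return live, dead
}

func main() {
	candidates := []store{
		{id: 1, status: statusSuspect},  // n1: marked suspect after a manual clock jump
		{id: 2, status: statusDraining}, // n2: the draining leaseholder itself
	}
	live, _ := liveAndDeadStores(candidates, false)
	// Prints an empty list: "no suitable transfer target found".
	fmt.Println("eligible transfer targets:", live)
}
```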

@andrewbaptist
Collaborator

Was the store recently restarted, or was its liveness flapping?

@nvanbenschoten
Member

It wasn't recently restarted, but I think the liveness is artificially flapping because the test is messing with manual clocks and injecting large clock jumps.
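To make that concrete, here is a minimal sketch (hypothetical types, not the real liveness code) of why a manual clock jump looks like flapping: a liveness record is only valid until its expiration, so a jump past that expiration makes n1 appear non-live until its next heartbeat, after which the store pool treats it as suspect for `server.time_after_store_suspect`:

```go
package main

import (
	"fmt"
	"time"
)

// livenessRecord is a stand-in for a node liveness entry: a heartbeat that is
// only valid until its expiration.
type livenessRecord struct {
	nodeID     int
	expiration time.Time
}

func isLive(l livenessRecord, now time.Time) bool {
	return now.Before(l.expiration)
}

func main() {
	now := time.Now()
	n1 := livenessRecord{nodeID: 1, expiration: now.Add(4 * time.Second)}

	fmt.Println("before jump, n1 live:", isLive(n1, now)) // true

	// The test advances a manual clock by more than the liveness expiration
	// window; until n1 heartbeats again it looks dead, and once it recovers it
	// is treated as suspect for server.time_after_store_suspect (30s), far
	// longer than the 5s drain wait.
	jumped := now.Add(10 * time.Second)
	fmt.Println("after jump, n1 live:", isLive(n1, jumped)) // false
}
```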

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 25, 2023
Informs cockroachdb#110715, which will be fixed by a non-clean backport (see 42e45b4)
of this commit.

This commit deflakes TestLeaseExpirationBasedDrainTransferWithExtension by
disabling the node suspect timer in leaseTransferTest tests. These tests
manually control the clock and have a habit of inducing destabilizing clock
jumps. In this case, if n2 looked at liveness immediately after one of these
manual clock jumps, it would mark n1 as suspect and refuse to transfer it the
lease for the 30s server.time_after_store_suspect, which is longer than the 5s
server.shutdown.lease_transfer_wait. This would cause the test to fail.

Release note: None
@nvanbenschoten nvanbenschoten self-assigned this Sep 25, 2023
@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Sep 25, 2023
craig bot pushed a commit that referenced this issue Sep 25, 2023
…111232

110967: asim: enable random zone config event generation r=kvoli a=wenyihu6

Previously, zone config event generation used hardcoded span configurations.
This limited our ability to test the allocator thoroughly.

To improve this, this patch enables random span configs to be generated and
applied as part of the simulation. These configurations are generated by
randomly selecting the primary region, region vs. zone survival goal, and
leaseholder preference.

```
The following command is now supported:
"rand_events" [cycle_via_random_survival_goals]
```

Part of: #106192
Release Note: none
Epic: none

111192: bulk: allow caller-configurable priority in SSTBatcher r=adityamaru a=stevendanna

This adds the ability for some callers to use a higher admission priority in SSTBatcher. This is helpful for streaming where we want to run at a priority that isn't subject to the elastic admission regime.

Epic: none

Release note: None

111206: kv: fix (store|node) not found err checking r=erikgrinaker a=kvoli

`StoreNotFoundError` and `NodeNotFoundError` were moved to the `kvpb` package in #110374. As part of the move, the `crdb_internal` functions that checked whether an error was a `DescNotFoundError` were also updated so that node/store not found errors would be recognized, e.g.

```
errors.Is(kvpb.NewNodeNotFoundError(nodeID), &kvpb.DescNotFoundError{})
```

This didn't work because `errors.Is` matches against the specific reference error value given, not its type; the error does match the type. These assertions are updated to use `HasType` instead.
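To illustrate the distinction with the standard library (a generic sketch with a made-up error type, not the `kvpb` code): `errors.Is` only matches the exact target value (or whatever an `Is` method accepts), while a type-based check such as `errors.As`, analogous to `HasType` here, matches any error of that type:

```go
package main

import (
	"errors"
	"fmt"
)

// descNotFoundError is a hypothetical stand-in for kvpb.DescNotFoundError.
type descNotFoundError struct {
	id int
}

func (e *descNotFoundError) Error() string {
	return fmt.Sprintf("descriptor %d not found", e.id)
}

func main() {
	err := &descNotFoundError{id: 7}

	// errors.Is compares against a specific target value; a freshly
	// constructed reference error is a different value, so this prints false.
	fmt.Println(errors.Is(err, &descNotFoundError{}))

	// A type-based check (errors.As in the standard library, HasType in
	// cockroachdb/errors) matches any error of the type, so this prints true.
	var target *descNotFoundError
	fmt.Println(errors.As(err, &target))
}
```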

Resolves: #111084
Epic: none
Release note: None

111214: release: fix roachtest artifacts name r=srosenberg a=rail

This fixes the roachtest artifacts directory name.

Epic: none
Release note: None

111217: cloud/azure: Fix azure schemes r=benbardin a=benbardin

Part of: https://cockroachlabs.atlassian.net/browse/CRDB-31120

Release note (bug fix): Fixes Azure schemes in storage, KMS, and external connections.

111223: storage: populate timestamp field in lock table values r=nvanbenschoten a=nvanbenschoten

Informs #109645.

This commit starts writing the `Timestamp` field in lock table `MVCCMetadata` values for shared and exclusive locks. This mirrors the behavior of intent locks. This is not strictly needed, as the timestamp is always equal to `Txn.WriteTimestamp`, but it is cheap to do and helps unify some stats logic, which uses this field to compute "lock age".

Maybe we'll get rid of this for all lock strengths one day...

Release note: None

111230: authors: add xhesikam to authors r=xhesikam a=xhesikam

Release note: None
Epic: None

111231: backupccl: add missing ctx cancel check r=msbutler a=adityamaru

In #111159 we deduced from the stacks a situation in which the goroutine draining `spanCh` had exited due to a context cancelation, but the writer was not listening for a ctx cancelation. This manifests as a stuck restore when using the non-default "make simple import spans" implementation.
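The general Go pattern involved (a self-contained sketch with made-up types; only the `spanCh` name comes from the PR): a sender that writes with a bare `ch <- item` blocks forever once the receiving goroutine exits on context cancelation, so the sender must also select on `ctx.Done()`:

```go
package main

import (
	"context"
	"fmt"
)

// sendSpans shows the fixed form of the pattern: the writer listens for ctx
// cancelation, so it cannot block forever if the goroutine draining spanCh has
// already exited. A bare `spanCh <- s` with no select is the buggy form.
func sendSpans(ctx context.Context, spanCh chan<- int, spans []int) error {
	for _, s := range spans {
		select {
		case spanCh <- s:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	spanCh := make(chan int)

	// Simulate the consumer exiting early due to a context cancelation.
	go func() {
		<-spanCh
		cancel()
	}()

	err := sendSpans(ctx, spanCh, []int{1, 2, 3})
	fmt.Println("sender returned:", err) // "context canceled" instead of hanging forever
}
```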

Fixes: #111159
Release note: None

111232: kv: deflake TestLeaseExpirationBasedDrainTransferWithExtension r=nvanbenschoten a=nvanbenschoten

Informs #110715, which will be fixed by a non-clean backport (see 42e45b4) of this commit.

This commit deflakes `TestLeaseExpirationBasedDrainTransferWithExtension` by disabling the node suspect timer in leaseTransferTest tests. These tests manually control the clock and have a habit of inducing destabilizing clock jumps. In this case, if n2 looked at liveness immediately after one of these manual clock jumps, it would mark n1 as suspect and refuse to transfer it the lease for the 30 second `server.time_after_store_suspect`, which is longer than the 5 second `server.shutdown.lease_transfer_wait`. This would cause the test to fail.

Before this patch, the test would fail under stress race in about 8 minutes. Since the patch, it hasn't failed in over 30 minutes.
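As a rough illustration of the timing interaction the patch removes (a made-up function, not the actual allocator code): a store that went suspect only becomes an eligible transfer target again once `server.time_after_store_suspect` has elapsed, which at the default 30s can never happen within the 5s drain wait; zeroing the suspect window in the test makes n1 eligible immediately:

```go
package main

import (
	"fmt"
	"time"
)

// eligibleTransferTarget models the check at issue: a store that recently went
// suspect stays ineligible as a lease transfer target until
// server.time_after_store_suspect has elapsed since it recovered.
func eligibleTransferTarget(sinceRecovered, timeAfterStoreSuspect time.Duration) bool {
	return sinceRecovered >= timeAfterStoreSuspect
}

func main() {
	// n1 recovered from a clock-jump-induced liveness blip moments ago.
	sinceRecovered := 2 * time.Second

	// Defaults: a 30s suspect window vs. a 5s drain wait, so n1 never becomes
	// eligible before the drain gives up and the test flakes.
	fmt.Println("default settings:", eligibleTransferTarget(sinceRecovered, 30*time.Second))

	// With the suspect timer disabled in leaseTransferTest (the fix), n1 is
	// eligible immediately and the drain can transfer the lease.
	fmt.Println("suspect timer disabled:", eligibleTransferTarget(sinceRecovered, 0))
}
```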

Release note: None

Co-authored-by: wenyihu6 <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: xhesikam <[email protected]>
Co-authored-by: adityamaru <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Sep 26, 2023
Fixes cockroachdb#110715.

This commit deflakes TestLeaseExpirationBasedDrainTransferWithExtension by
disabling the node suspect timer in leaseTransferTest tests. These tests
manually control the clock and have a habit of inducing destabilizing clock
jumps. In this case, if n2 looked at liveness immediately after one of these
manual clock jumps, it would mark n1 as suspect and refuse to transfer it the
lease for the 30s server.time_after_store_suspect, which is longer than the 5s
server.shutdown.lease_transfer_wait. This would cause the test to fail.

Release note: None
@nvanbenschoten
Member

Fixed by #111254.
