kv/kvserver: TestLeaseExpirationBasedDrainTransferWithExtension failed #110715
Comments
@kvoli I'm removing the release blocker tag, but can you take a look at this when you get a chance? We have seen a few failures similar to this one, so I just wanted to make sure this isn't anything new. |
I can reproduce after
I'm taking a look at what's going wrong. |
We see the lease extension get blocked, unblocked, and then succeed:
After this point, we see the drain get stuck waiting to transfer away the lease:
Eventually, the drain hits a 5s timeout:
At this point, we give up on the drain and the test is bound to eventually fail. The questions to answer are:
|
With additional logging (
So the lease transfer repeatedly hits a |
This reminds me of #101885. |
n1 seems to be getting filtered out as a lease transfer target here: cockroach/pkg/kv/kvserver/allocator/allocatorimpl/allocator.go Lines 1986 to 1988 in b201219
|
Was the store recently started or flapping liveness? |
It wasn't recently restarted, but I think the liveness is artificially flapping because the test is messing with manual clocks and injecting large clock jumps. |
Informs cockroachdb#110715, which will be fixed by a non-clean backport (see 42e45b4) of this commit. This commit deflakes TestLeaseExpirationBasedDrainTransferWithExtension by disabling the node suspect timer in leaseTransferTest tests. These tests manually control the clock and have a habit of inducing destabilizing clock jumps. In this case, if n2 looked at liveness immediately after one of these manual clock jumps, it would mark n1 as suspect and refuse to transfer it the lease for the 30s server.time_after_store_suspect, which is longer than the 5s server.shutdown.lease_transfer_wait. This would cause the test to fail. Release note: None
…111232

110967: asim: enable random zone config event generation r=kvoli a=wenyihu6

Previously, zone config event generation used hardcoded span configurations. This limits our ability to test the allocator more thoroughly. To improve this, this patch enables random span configs to be generated and applied as part of the simulation. These configurations are generated by randomly selecting the primary region, region V.S. zone survival goal, and leaseholder preference.

```
The following command is now supported:
"rand_events" [cycle_via_random_survival_goals]
```

Part of: #106192
Release Note: none
Epic: none

111192: bulk: allow caller-configurable priority in SSTBatcher r=adityamaru a=stevendanna

This adds the ability for some callers to use a higher admission priority in SSTBatcher. This is helpful for streaming where we want to run at a priority that isn't subject to the elastic admission regime.

Epic: none
Release note: None

111206: kv: fix (store|node) not found err checking r=erikgrinaker a=kvoli

`StoreNotFoundError` and `NodeNotFoundError` errors were moved to the `kvpb` pkg in #110374. As part of the move, `crdb_internal` functions which checked if the error were `DescNotFoundError` were also updated so that node/store not found errors would be recognized e.g.

```
errors.Is(kvpb.NewNodeNotFoundError(nodeID), &kvpb.DescNotFoundError{})
```

This didn't work, because the error doesn't match the reference error variable being given. It does match the type. Update these error assertions to use `HasType` instead.

Resolves: #111084
Epic: none
Release note: None

111214: release: fix roachtest artifacts name r=srosenberg a=rail

This fixes the roachtest artifacts directory name.

Epic: none
Release note: None

111217: cloud/azure: Fix azure schemes r=benbardin a=benbardin

Part of: https://cockroachlabs.atlassian.net/browse/CRDB-31120
Release note (bug fix): Fixes azure schemes in storage, kms and external conns.

111223: storage: populate timestamp field in lock table values r=nvanbenschoten a=nvanbenschoten

Informs #109645.

This commit starts writing the `Timestamp` field in lock table `MVCCMetadata` values for shared and exclusive locks. This mirrors the behavior of intent locks. This is not strictly needed, as the timestamp is always equal to `Txn.WriteTimestamp`, but it is cheap to do and helps unify some stats logic, which uses this field to compute "lock age". Maybe we'll get rid of this for all lock strengths one day...

Release note: None

111230: authors: add xhesikam to authors r=xhesikam a=xhesikam

Release note: None
Epic: None

111231: backupccl: add missing ctx cancel check r=msbutler a=adityamaru

In #111159 we deduced from the stacks a situation in which the goroutine draining `spanCh` had exited due to a context cancelation but the writer was not listening for a ctx cancelation. This manifests as a stuck restore when using the non-default make simple import spans implementation.

Fixes: #111159
Release note: None

111232: kv: deflake TestLeaseExpirationBasedDrainTransferWithExtension r=nvanbenschoten a=nvanbenschoten

Informs #110715, which will be fixed by a non-clean backport (see 42e45b4) of this commit.

This commit deflakes `TestLeaseExpirationBasedDrainTransferWithExtension` by disabling the node suspect timer in leaseTransferTest tests. These tests manually control the clock and have a habit of inducing destabilizing clock jumps. In this case, if n2 looked at liveness immediately after one of these manual clock jumps, it would mark n1 as suspect and refuse to transfer it the lease for the 30 second `server.time_after_store_suspect`, which is longer than the 5 second `server.shutdown.lease_transfer_wait`. This would cause the test to fail.

Before this patch, the test would fail under stress race in about 8 minutes. Since the patch, it hasn't failed in over 30 minutes.

Release note: None

Co-authored-by: wenyihu6
Co-authored-by: Steven Danna
Co-authored-by: Austen McClernon
Co-authored-by: Rail Aliiev
Co-authored-by: Ben Bardin
Co-authored-by: Nathan VanBenschoten
Co-authored-by: xhesikam
Co-authored-by: adityamaru
Fixes cockroachdb#110715. This commit deflakes TestLeaseExpirationBasedDrainTransferWithExtension by disabling the node suspect timer in leaseTransferTest tests. These tests manually control the clock and have a habit of inducing destabilizing clock jumps. In this case, if n2 looked at liveness immediately after one of these manual clock jumps, it would mark n1 as suspect and refuse to transfer it the lease for the 30s server.time_after_store_suspect, which is longer than the 5s server.shutdown.lease_transfer_wait. This would cause the test to fail. Release note: None
Fixed by #111254. |
kv/kvserver.TestLeaseExpirationBasedDrainTransferWithExtension failed with artifacts on release-23.1.11-rc @ b2012193c9558ed2f6541da27469a8c5ecca52ae:
Parameters:
TAGS=bazel,gss,race
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-31574