Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestLeasePreferencesDuringOutage failed #105101

Closed
cockroach-teamcity opened this issue Jun 17, 2023 · 8 comments · Fixed by #108333
Closed

kv/kvserver: TestLeasePreferencesDuringOutage failed #105101

cockroach-teamcity opened this issue Jun 17, 2023 · 8 comments · Fixed by #108333
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. A-testing Testing tools and infrastructure branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 17, 2023

kv/kvserver.TestLeasePreferencesDuringOutage failed with artifacts on release-22.2 @ 8055184fc991424d1d0eefd7ccf703948b8de3ee:

=== RUN   TestLeasePreferencesDuringOutage
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestLeasePreferencesDuringOutage4182034325
    test_log_scope.go:79: use -show-logs to present logs inline
    client_lease_test.go:1028: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_lease_test.go:1028
        	Error:      	Received unexpected error:
        	            	replica [n1,s1,r53/1:/{Table/100-Max}] does not have the range lease: [NotLeaseHolderError] lease acquisition attempt lost to another lease, which has expired in the meantime; r53: replica (n1,s1):1 not lease holder; lease holder unknown
        	            	(1) attached stack trace
        	            	  -- stack trace:
        	            	  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Enqueue
        	            	  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3644
        	            	  | github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestLeasePreferencesDuringOutage
        	            	  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/client_lease_test.go:1026
        	            	  | testing.tRunner
        	            	  | 	GOROOT/src/testing/testing.go:1446
        	            	  | runtime.goexit
        	            	  | 	GOROOT/src/runtime/asm_amd64.s:1594
        	            	Wraps: (2) replica [n1,s1,r53/1:/{Table/100-Max}] does not have the range lease
        	            	Wraps: (3) [NotLeaseHolderError] lease acquisition attempt lost to another lease, which has expired in the meantime; r53: replica (n1,s1):1 not lease holder; lease holder unknown
        	            	Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *roachpb.NotLeaseHolderError
        	Test:       	TestLeasePreferencesDuringOutage
        	Messages:   	failed to enqueue replica for replication
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/33e1d369c27b9c01b2b6009c561815a3/logTestLeasePreferencesDuringOutage4182034325
--- FAIL: TestLeasePreferencesDuringOutage (6.38s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-28862

@cockroach-teamcity cockroach-teamcity added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team labels Jun 17, 2023
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jun 17, 2023
@kvoli
Copy link
Collaborator

kvoli commented Jun 20, 2023

This is failing here:

_, _, enqueueError := tc.GetFirstStoreFromServer(t, 0).
Enqueue(ctx, "replicate", repl, true /* skipShouldQueue */, false /* async */)
require.NoError(t, enqueueError, "failed to enqueue replica for replication")

The error is coming from lease acquisition attempt here:

// Many queues are only meant to be run on leaseholder replicas, so attempt to
// take the lease here or bail out early if a different replica has it.
if needsLease {
if _, pErr := repl.redirectOnOrAcquireLease(ctx); pErr != nil {
return nil, nil, errors.Wrapf(pErr.GoError(), "replica %v does not have the range lease", repl)
}
}

r53 looks like [(n1,s1):1 (n2,s2):2 (n4,s4):3]. n2 is stopped and clocks advanced so it appears dead. n1 and n4 should still be up.

The failure is different than previous flakes. I see we backported a deflake commit to 22.2: #103636.

@kvoli
Copy link
Collaborator

kvoli commented Jun 21, 2023

Stressed for 25mins with no failure.

dev test pkg/kv/kvserver -f TestLeasePreferencesDuringOutage -v --show-logs --vmodule=store=3,replicate_queue=3,allocator=3,replica_proposal=1 --stress
...
5755 runs so far, 0 failures, over 25m35s

_, _, enqueueError := tc.GetFirstStoreFromServer(t, 0).
Enqueue(ctx, "replicate", repl, true /* skipShouldQueue */, false /* async */)
require.NoError(t, enqueueError, "failed to enqueue replica for replication")

I wonder if we could wrap this in a succeeds soon to deflake the test.

@kvoli
Copy link
Collaborator

kvoli commented Jun 21, 2023

Reproduced after trying again. Looks like n2 converts its r53 lease to an epoch one just before being stopped (or after actually).

I230621 02:07:47.935147 1978 kv/kvserver/replica_proposal.go:398  [n2,s2,r53/2:/{Table/100-Max},raft] 322  upgrading expiration lease repl=(n2,s2):2 seq=2 start=1687313267.933937822,0 exp=1687313276.933796505,0 pro=1687313267.933796505,0 to an epoch-based one  
I230621 02:07:47.935240 1978 kv/kvserver/replica_proposal.go:454  [n2,s2,r53/2:/{Table/100-Max},raft] 323  is now leaseholder
W230621 02:07:47.936371 4337 2@rpc/nodedialer/nodedialer.go:193  [n2] 324  unable to connect to n4: failed to connect to n4 at 127.0.0.1:36751: stopped
W230621 02:07:47.936388 2157 sql/sqlliveness/slinstance/slinstance.go:242  [n2] 325  exiting heartbeat loop   
I230621 02:07:47.936672 1985 kv/kvserver/replica_proposal.go:273  [n2,s2,r53/2:/{Table/100-Max},raft] 326  new range lease repl=(n2,s2):2 seq=3 start=1687313267.933937822,0 epo=1 pro=1687313267.935120009,0 following repl=(n2,s2):2 seq=2 start=1687313267.933937822,0 exp=1687313276.933796505,0 pro=1687313267.933796505,0 
W230621 02:07:47.936841 1985 kv/kvserver/store.go:2649  [n2,s2,r53/2:/{Table/100-Max},raft] 327  unable to gossip on capacity change: node unavailable; try another peer 
W230621 02:07:47.936841 1985 kv/kvserver/store.go:2649  [n2,s2,r53/2:/{Table/100-Max},raft] 327 +(1) node unavailable; try another peer
W230621 02:07:47.936841 1985 kv/kvserver/store.go:2649  [n2,s2,r53/2:/{Table/100-Max},raft] 327 +Error types: (1) *roachpb.NodeUnavailableError
I230621 02:07:47.936942 1985 kv/kvserver/replica_proposal.go:454  [n2,s2,r53/2:/{Table/100-Max},raft] 328  is now leaseholder 
W230621 02:07:52.536079 4545 2@rpc/nodedialer/nodedialer.go:193  [n1] 348  unable to connect to n2: failed to connect to n2 at 127.0.0.1:35577: initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:35577: connect: connection refused"
replica [n1,s1,r53/1:/{Table/100-Max}] does not have the range lease: [NotLeaseHolderError] lease acquisition attempt lost to another lease, which has expired in the meantime; r53: replica (n1,s1):1 not lease holder; lease holder unknown
...
3928 runs completed, 1 failures, over 17m33s

@kvoli kvoli self-assigned this Jun 21, 2023
@kvoli kvoli added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Jun 21, 2023
@kvoli
Copy link
Collaborator

kvoli commented Jun 21, 2023

We either need to disable expiration leases or retry the enqueue. I'll open a PR.

kvoli added a commit to kvoli/cockroach that referenced this issue Jun 22, 2023
It was possible that another node would acquire the lease, causing the
replicate queue enqueue to fail. Wrap the enqueue in a succeeds soon to
retry when this happens.

Resolves: cockroachdb#105101

Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Jul 21, 2023
`TestLeasePreferencesDuringOutage` checks that a non-preferred node can
up-replicate and transfer the lease in a single replicate queue
`process()`

It was possible that another node would acquire the lease, causing the
replicate queue enqueue to fail.

Disable expiration based lease transfers and force the transfer to a
non-preferred locality. This avoids flakiness in parts of the test which
are not directly part of the behavior being asserted on.

This patch also unskips the test.

Resolves: cockroachdb#105101

Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Jul 21, 2023
`TestLeasePreferencesDuringOutage` checks that a non-preferred node can
up-replicate and transfer the lease in a single replicate queue
`process()`

It was possible that another node would acquire the lease, causing the
replicate queue enqueue to fail.

Disable expiration based lease transfers and force the transfer to a
non-preferred locality. This avoids flakiness in parts of the test which
are not directly part of the behavior being asserted on.

This patch also unskips the test.

Resolves: cockroachdb#105101

Release note: None
@github-actions
Copy link

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.

@kvoli
Copy link
Collaborator

kvoli commented Aug 2, 2023

This hasn't flaked in a while. I'm going to stress for an hour, if I can't reproduce the problem, I'll close the issue.

@kvoli
Copy link
Collaborator

kvoli commented Aug 2, 2023

nvm this is still flaky due to above.

dev test pkg/kv/kvserver -f TestLeasePreferencesDuringOutage -v --show-logs --stress
1732 runs completed, 1 failures, over 7m47s

@kvoli kvoli added the A-kv-distribution Relating to rebalancing and leasing. label Aug 3, 2023
@kvoli
Copy link
Collaborator

kvoli commented Aug 7, 2023

I'm stressing a candidate fix, where expiration based lease transfers are disabled.

kvoli added a commit to kvoli/cockroach that referenced this issue Aug 8, 2023
Previously, it was possible for a soon to be dead replica, to acquire
the range lease in the `TestLeasePreferencesDuringOutage` test. The
acquired lease would be expiration based, disallowing the intended
leaseholder from acquiring the lease.

This patch disables expiration based lease transfers, deflaking the
test.

Resolves: cockroachdb#105101
Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Aug 8, 2023
108295: roachtest: increase timeout for network_logging to 60s r=dhartunian a=abarganier

Fixes: #108088

The new network_logging roachtest sets a pgclient timeout of 10s to attempt detecting deadlocks. This timeout was hit fairly easily during the nightly runs. This is an indication that our 10s timeout is too aggressive.

This PR changes the timeout from 10s to 60s which still achieves the original aim without being so aggressive.

Release note: none

108333: kvserver: deflake lease preferences during outage r=erikgrinaker a=kvoli

*This PR is intended to be backported to `release-22.2`.
 `TestLeasePreferencesDuringOutage` is currently skipped on master.
Stressed for 30 mins without failure on release-22.2.*

Previously, it was possible for a soon-to-be dead replica, to acquire the range lease in the `TestLeasePreferencesDuringOutage` test. The acquired lease would be expiration based, disallowing the intended leaseholder from acquiring the lease.

This patch disables expiration based lease transfers, deflaking the test.

Resolves: #105101
Epic: none

Release note: None

Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
@craig craig bot closed this as completed in 80b8bf8 Aug 8, 2023
kvoli added a commit that referenced this issue Aug 8, 2023
Previously, it was possible for a soon to be dead replica, to acquire
the range lease in the `TestLeasePreferencesDuringOutage` test. The
acquired lease would be expiration based, disallowing the intended
leaseholder from acquiring the lease.

This patch disables expiration based lease transfers, deflaking the
test.

Resolves: #105101
Epic: none

Release note: None
kvoli added a commit that referenced this issue Aug 8, 2023
Previously, it was possible for a soon to be dead replica, to acquire
the range lease in the `TestLeasePreferencesDuringOutage` test. The
acquired lease would be expiration based, disallowing the intended
leaseholder from acquiring the lease.

This patch disables expiration based lease transfers, deflaking the
test.

Resolves: #105101
Epic: none

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. A-testing Testing tools and infrastructure branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
None yet
3 participants