
roachtest: transfer-leases/drain-other-node failed #91801

Closed
cockroach-teamcity opened this issue Nov 13, 2022 · 5 comments · Fixed by #92313
Assignees
Labels
A-testing Testing tools and infrastructure C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments


cockroach-teamcity commented Nov 13, 2022

roachtest.transfer-leases/drain-other-node failed with artifacts on release-22.2.0 @ 234c9295cc02150f919cfa96b09ee2fa07b68ace:

test artifacts and logs in: /artifacts/transfer-leases/drain-other-node/run_1
	quit.go:72,quit.go:324,soon.go:69,retry.go:208,soon.go:75,soon.go:48,quit.go:228,quit.go:95,quit.go:154,context.go:91,quit.go:153,quit.go:95,quit.go:54,quit.go:361,test_runner.go:930: (1) ranges with no lease outside of node 3: []string{"37"}

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage


Jira issue: CRDB-21437

@cockroach-teamcity cockroach-teamcity added branch-release-22.2.0 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Nov 13, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Nov 13, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Nov 13, 2022
@nvanbenschoten

n3 does think that it has transferred all leases away by the end of the test.

I221113 08:22:04.569289 69493 1@server/drain.go:285 ⋮ [n3] 2005  drain remaining: 1
I221113 08:22:04.569310 69493 1@server/drain.go:287 ⋮ [n3] 2006  drain details: range lease iterations: 1
I221113 08:22:04.776108 70977 1@server/drain.go:285 ⋮ [n3] 2010  drain remaining: 0

And then n2 does receive the lease in a transfer:

I221113 08:22:09.444855 207 kv/kvserver/replica_proposal.go:268 ⋮ [n2,s2,r37/2:‹/Table/3{6-7}›,raft] 987  new range lease repl=(n2,s2):2 seq=12 start=1668327724.301867742,0 exp=1668327733.301786243,0 pro=1668327724.301786243,0 following repl=(n3,s3):3 seq=11 start=1668327694.466207271,0 epo=3 pro=1668327694.510061860,0
I221113 08:22:09.445062 207 kv/kvserver/replica_proposal.go:374 ⋮ [n2,s2,r37/2:‹/Table/3{6-7}›,raft] 988  upgrading expiration lease repl=(n2,s2):2 seq=12 start=1668327724.301867742,0 exp=1668327733.301786243,0 pro=1668327724.301786243,0 to an epoch-based one
I221113 08:22:09.445125 207 kv/kvserver/replica_proposal.go:430 ⋮ [n2,s2,r37/2:‹/Table/3{6-7}›,raft] 989  is now leaseholder
I221113 08:22:09.449525 209 kv/kvserver/replica_proposal.go:268 ⋮ [n2,s2,r37/2:‹/Table/3{6-7}›,raft] 991  new range lease repl=(n2,s2):2 seq=13 start=1668327724.301867742,0 epo=4 pro=1668327729.445054129,0 following repl=(n2,s2):2 seq=12 start=1668327724.301867742,0 exp=1668327733.301786243,0 pro=1668327724.301786243,0
I221113 08:22:09.449719 209 kv/kvserver/replica_proposal.go:430 ⋮ [n2,s2,r37/2:‹/Table/3{6-7}›,raft] 992  is now leaseholder

I don't understand why the test failed, but this doesn't need to block the release while we find out.

@nvanbenschoten nvanbenschoten removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Nov 14, 2022

tbg commented Nov 22, 2022

Looks similar to #75438, as a result of which I added some more debugging. Unfortunately, it looks like I botched it: the file is written in a loop, so we only get a random node's view. In this case, we still see the lease for r37 on n3. But the test only fails if this is the case on all nodes (i.e., if the lease transfer has happened, at least one replica must reflect it), so we need the files from all nodes. I suspect they wouldn't show anything surprising; we'd probably see all leases on n3. There must be a tiny amount of asynchrony somewhere in the drain-lease-transfer path.


tbg commented Nov 22, 2022

Is this code right? `otherNodeID` is `1 + (3 % count)` (which boils down to n1), i.e. independent of which node in the for loop we're looking at. I think we want `q.c.Node(i)` here, not `otherNodeID`.

This would also explain the test failure: We've checked n1 only. The lease for r37 went from n3 to n2, so it's more likely for n1 to learn about this a split second later. But since we're only asking n1, the test doesn't see that.
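The loop-invariant target can be reconstructed with a small sketch (all names here, including `targets`, are invented for illustration, not the actual roachtest code):

```go
package main

import "fmt"

// targets models the node-selection bug described above: otherNodeID is
// computed once from the drained node and the node count, so the buggy
// loop queries the same node (n1) on every iteration instead of node i.
func targets(drained, count int, fixed bool) []int {
	otherNodeID := 1 + (drained % count) // 1 + (3 % 3) == 1, loop-invariant
	var out []int
	for i := 1; i <= count; i++ {
		if i == drained {
			continue
		}
		if fixed {
			out = append(out, i) // fixed: ask each node directly
		} else {
			out = append(out, otherNodeID) // buggy: always n1
		}
	}
	return out
}

func main() {
	fmt.Println(targets(3, 3, false)) // buggy: [1 1]
	fmt.Println(targets(3, 3, true))  // fixed: [1 2]
}
```

With n3 drained in a three-node cluster, the buggy variant asks n1 twice and never consults n2, which is exactly why a lease that n2 already holds can go unseen.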

I think even with that fixed there is a race here. The drain completes (assuming r37 is the only range) once n3 has applied the lease transfer (to n2). But that doesn't mean that n2 has already applied it; for that to happen n3 has to first send the commit index to n2.

The correct check in that method would be to verify that n3 no longer claims that it itself has the lease. I think this race is probably theoretical but I'll add a comment.

Will send a PR to address the correct node for pulling the range info.

@tbg tbg self-assigned this Nov 22, 2022
@tbg tbg added T-kv-replication and removed T-kv KV Team labels Nov 22, 2022

blathers-crl bot commented Nov 22, 2022

cc @cockroachdb/replication


tbg commented Nov 22, 2022

Unfortunately I missed something, so the above explanation doesn't check out. Maybe it really is the "no other node may have applied the transfer yet" race. See #92313 (comment)

craig bot pushed a commit that referenced this issue Dec 7, 2022
90097: kv: collect hot ranges per tenant on store r=koorosh a=koorosh

The main goal of this change is to provide the ability to request hot ranges info per tenant (as part of the changes for UA). To track hot ranges per tenant, a `ReplicaRankingMap` struct is added that works almost identically to the existing `ReplicaRankings` struct, with the only difference that the hottest replicas are stored in an underlying map structure to keep ranges separate for every tenant. We also decided not to extend the existing `ReplicaRankings` functionality because it is used for replica rebalancing, which doesn't need to know about tenants.
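As a rough illustration of the per-tenant bucketing described above (the `replicaRankingMap` type and its methods below are invented for this sketch, not the real CockroachDB types):

```go
package main

import (
	"fmt"
	"sort"
)

// replicaLoad is a toy stand-in for a replica's load statistics.
type replicaLoad struct {
	rangeID int
	qps     float64
}

// replicaRankingMap buckets the hottest replicas per tenant, in contrast
// to a single global ranking used for rebalancing.
type replicaRankingMap map[int][]replicaLoad // tenantID -> replicas by load

func (m replicaRankingMap) record(tenantID int, r replicaLoad) {
	m[tenantID] = append(m[tenantID], r)
	// Keep each tenant's bucket sorted hottest-first.
	sort.Slice(m[tenantID], func(i, j int) bool {
		return m[tenantID][i].qps > m[tenantID][j].qps
	})
}

func (m replicaRankingMap) hottest(tenantID int) replicaLoad {
	return m[tenantID][0]
}

func main() {
	m := replicaRankingMap{}
	m.record(1, replicaLoad{rangeID: 37, qps: 100})
	m.record(1, replicaLoad{rangeID: 42, qps: 250})
	m.record(2, replicaLoad{rangeID: 55, qps: 10})
	fmt.Println(m.hottest(1).rangeID) // hottest range for tenant 1
}
```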

This change might add a small overhead for self-hosted clusters, as it will accumulate the same hot ranges in both the `replRankings` and `replRankingsByTenant` fields in `Store`.

Changes in next PRs are supposed to rely on this change.

Release note: None

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-19124

92313: roachtest: fix transfer-leases r=erikgrinaker a=tbg

`checkNoLeases` verifies that once a node is drained, for each range
there is one replica of the other nodes that sees the lease on one of
the non-drained stores. The reason it asks just for *one* replica to
have the lease as opposed to all of them is because some followers may
be behind.

However, even just the assumption that there is one can be violated.
The drain succeeds once the draining node has locally seen the lease
transfer succeed. It is likely the raft leader at this point, so it
will be the first one to see this event. Other nodes will only see
it after one additional round-trip (when they learn that the log
index has committed, and then go and apply it).

So going and looking for a replica that sees the new lease immediately
after the drain succeeds may fail.

Work around this by sleeping for one second before checking, which ought
to be enough, and is also a small enough delay to make sure that if
leases are actually not getting transferred, the check will continue to
fail (non-cooperative lease failover is on the order of multiple
seconds).

This commit also improves a debugging file which was previously
clobbered over the multiple iterations of the surrounding loop.

It also makes it clearer that we're pulling the range data from
each node in turn (we were previously hitting otherNodeID but asking
it to proxy to the node. Now we're hitting each node directly without
relying on the internal redirect, which is less confusing).
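The check-plus-sleep workaround can be sketched roughly as follows (a toy model with invented names, not the actual `checkNoLeases` implementation):

```go
package main

import (
	"fmt"
	"time"
)

// checkNoLeases models the fixed check: after draining node `drained`,
// wait briefly so followers have time to apply the lease transfer, then
// require that for every range at least one non-drained node reports a
// leaseholder other than the drained node. leasesByNode maps
// nodeID -> rangeID -> leaseholder nodeID, as seen by that node.
func checkNoLeases(leasesByNode map[int]map[string]int, drained int) []string {
	time.Sleep(time.Second) // let followers catch up on the applied transfer
	ranges := map[string]bool{}
	for _, view := range leasesByNode {
		for r := range view {
			ranges[r] = true
		}
	}
	var bad []string
	for r := range ranges {
		ok := false
		for n, view := range leasesByNode {
			if n == drained {
				continue
			}
			if lh, found := view[r]; found && lh != drained {
				ok = true // some replica already sees the new lease
			}
		}
		if !ok {
			bad = append(bad, r)
		}
	}
	return bad
}

func main() {
	// n1 is behind (still sees n3's lease on r37), but n2 has applied it.
	views := map[int]map[string]int{
		1: {"37": 3},
		2: {"37": 2},
	}
	fmt.Println(checkNoLeases(views, 3)) // no ranges flagged
}
```

The one-second sleep is deliberately small: if leases genuinely were not transferred, non-cooperative lease failover takes multiple seconds, so the check would still fail.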

See: #91801 (comment)

Fixes #91801.

Epic: none
Release note: None


92834: tree: improve type-checking for placeholders with ambiguous type r=jordanlewis a=rafiss

fixes #90364

The key fix is to change the typeCheckSplitExprs function so that it marks _all_
placeholder indexes. This then causes the existing type-checking logic
in typeCheckOverloadedExprs to check all placeholder expressions, rather
than just ones that don't have type hints.

Release note (bug fix): Prepared statements that use type hints can now succeed type-checking in more cases when the placeholder type is ambiguous.

93047: base: remove old testing knobs r=andreimatei a=andreimatei

The testing knobs for startup migrations don't exist anymore.

Release note: None
Epic: None

93074: testutils: add missing comment r=andreimatei a=andreimatei

Release note: None
Epic: None

93079: insights: record more txn insights r=xinhaoz a=xinhaoz

Closes #93076

This commit adds the following fields at the txn level when recording
insights in the sql insights system:
- contention: total txn contention time
- start_time: txn start time
- end_time: txn end time
- auto_retry_reason: last reason for auto txn retry
- retry_count
- rows_written
- rows_read

In addition, the following fields have been moved from recording at
the stmt level to recording at the txn level for insights:
- user
- application_name

Release note: None

93160: bazci: insert `--config ci` if necessary r=healthy-pod a=rickystewart

All builds and tests in CI need this --config argument.

Epic: None
Release note: None

93177: dev: fix dev build error when cross building r=rickystewart a=healthy-pod

This is a temporary fix for the issue. In a future change, we should let beaver hub distinguish between normal and cross builds, and then have an actual fix for this.

Release note: None
Epic: none

93193: vendor: bump Pebble to 0d6d19018632 r=nicktrav a=itsbilal

```
0d6d1901 crossversion: don't stream TestMeta output on verbose
d05a6f0e vfs: use SequentialReadsOption in vfs.[Limited]Copy
b85fc64f merging_iter: don't relative-seek past prefix in isNextEntryDeleted
32ad55f8 *: use github.com/cockroachdb/datadriven
6b644274 sstable: don't fatal if file no longer exists during readahead
```

Fixes #93191.

Release note: None.

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Bilal Akhtar <[email protected]>
@craig craig bot closed this as completed in 4dbbafa Dec 7, 2022
@lunevalex lunevalex added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Dec 9, 2022
@exalate-issue-sync exalate-issue-sync bot removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Dec 15, 2022
@lunevalex lunevalex added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels Jan 11, 2023
blathers-crl bot pushed a commit that referenced this issue Aug 30, 2023