roachtest: replicate/wide failed #99268
Stuck here cockroach/pkg/cmd/roachtest/tests/allocator.go Lines 406 to 409 in eec5a47
when attempting to create a backup schedule:
cockroach/pkg/roachprod/install/cockroach.go Lines 831 to 840 in c5ef384
I think there is an internal endless DistSender retry loop (I also see a stuck SQL stmt in DistSender waiting to hear back from single-range RPCs, like this stack here). It's this retry loop (sketched below): cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go Lines 1611 to 1618 in b19a4a8
There are various "slow range RPC" messages:
Note that only n1-n6 are up right now; n7, n8, n9 are intentionally down. What should happen is that such a request quickly bounces to one of the live nodes (they are all in the descriptor and we shouldn't have lost quorum here). But that doesn't seem to happen. Have to run now, need to investigate more later.
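To make the shape of the problem concrete, here is a minimal, illustrative sketch of the retry behavior described above (names are invented; this is not the actual DistSender code): as long as every reachable replica keeps answering with a NotLeaseHolderError that doesn't point at a live leaseholder, the per-range send loop has no way to terminate.

```go
// Illustrative-only sketch of the retry behavior described above; names are
// invented and this is not the real DistSender code.
package sketch

import (
	"errors"
	"time"
)

type replica struct{ nodeID int }

// errNotLeaseHolder stands in for the NotLeaseHolderError (NLHE) that every
// live replica keeps returning because none of them will acquire the lease.
var errNotLeaseHolder = errors.New("not lease holder")

func sendToReplica(r replica) error {
	// In the stuck cluster every reachable replica answers NLHE without
	// pointing at a usable leaseholder.
	return errNotLeaseHolder
}

// sendToRange keeps cycling through the range's replicas. Because the NLHEs
// never identify a live leaseholder and no replica's lease status ever comes
// back as "expired" (see below), the loop never terminates, which is what the
// stuck SQL statement and the "slow range RPC" messages reflect.
func sendToRange(replicas []replica) error {
	for {
		for _, r := range replicas {
			err := sendToReplica(r)
			if err == nil {
				return nil
			}
			if errors.Is(err, errNotLeaseHolder) {
				continue // try the next replica, forever
			}
			return err
		}
		time.Sleep(100 * time.Millisecond) // brief pause between passes
	}
}
```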
I was a bit sloppy here. We're getting NLHEs back, so these are clearly coming from n1-n6 (a down node can't respond); for example, the one above came from one of the live nodes. I see these messages from all replicas (for different RPCs), so it's a fair assumption that they are all responding in that same way. Why is nobody acquiring the lease? Probably this has to do with this message
which we see off and on, for n7-n9 (i.e. the down nodes). This originates here: cockroach/pkg/kv/kvserver/replica_range_lease.go Lines 672 to 692 in ecc931b
I think the effect of that is that it prevents lease acquisitions, since it returns an "error" lease state, but we need an "expired" one if we are to take over the lease. What should have happened is that the liveness of all nodes, including the down ones, was gossiped when the (expiration-based) liveness lease was acquired.

As it happens, we just made a change in that area two weeks ago: prior to that PR, all livenesses were gossiped on each extension, i.e. quite frequently. Now this only happens once, when the lease is first acquired. It's possible that there is a bug where this trigger does not fire. Is it possible that this condition doesn't consider the lease as having changed hands when the same node re-acquires it after a restart? cockroach/pkg/kv/kvserver/replica_proposal.go Line 264 in 04f4284

Then, even though we've cleared cluster-wide gossip state due to the whole-cluster downtime, the effectively "new" (same) leaseholder would not re-gossip, and so the gossip info for the nodes that remain down would never appear in gossip. That seems like a likely explanation. cc @nvanbenschoten
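To tie the suspected chain of events together, here is a hedged Go sketch (all types and names are invented stand-ins, not the real kvserver code): with the previous holder's liveness record absent from gossip, the lease status comes back as an "error" state rather than "expired", so no replica takes over; and if gossip of all liveness records now only happens when the liveness lease changes hands (i.e. when the sequence bumps), a same-node re-acquisition after the full-cluster restart would never repopulate gossip.

```go
// Illustrative sketch only: invented types standing in for the kvserver
// lease-status check and the post-#98150 gossip trigger discussed above.
package sketch

type leaseState int

const (
	leaseValid   leaseState = iota
	leaseExpired            // only this state lets another replica take over the lease
	leaseError              // "can't tell": liveness record for the holder is missing
)

type liveness struct{ expiresAt int64 }

// statusOfEpochLease: for an epoch-based lease we need the holder's liveness
// record to decide whether the lease has expired. If that record was never
// gossiped (the suspected bug), we end up in the "error" state instead of
// "expired".
func statusOfEpochLease(holderLiveness *liveness, now int64) leaseState {
	if holderLiveness == nil {
		return leaseError // record not found in gossip
	}
	if holderLiveness.expiresAt < now {
		return leaseExpired
	}
	return leaseValid
}

type lease struct {
	ownerNodeID int
	sequence    int
}

// onLivenessLeaseApplied models the suspected gap: all liveness records are
// only (re-)gossiped when the liveness range lease changes hands, and a
// change of hands is keyed on the sequence number. If the same node
// re-acquires the lease after the whole-cluster restart without bumping the
// sequence, the trigger never fires, n7-n9's liveness never reappears in
// gossip, and statusOfEpochLease above keeps returning leaseError forever.
func onLivenessLeaseApplied(prev, next lease, gossipAllLiveness func()) {
	leaseChangedHands := prev.sequence != next.sequence
	if leaseChangedHands {
		gossipAllLiveness()
	}
}
```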
Cross-referenced commit: "…ollowing restart" — See: cockroachdb#99268 (comment). This also sneaks in the logging requested in cockroachdb#99472. Epic: none. Release note: None
Was hoping that putting in an assertion and restarting a local single-node cluster would prove it, but I've run it a few times and so far no success.
Still, reading the code, all indications are that a node re-requesting a lease it previously held would not bump the sequence number.
roachtest.replicate/wide failed with artifacts on release-23.1 @ 52e55d2ef172b7cfec14e8a0a954f8864b2be779:
Parameters:
Same failure on other branches
roachtest.replicate/wide failed with artifacts on release-23.1 @ f351747ed97862fc037717cadec23f18073fb6be:
Parameters:
Same failure on other branches
I think it's this bit that prevents the assertion I put in from firing: cockroach/pkg/kv/kvserver/batcheval/cmd_lease_request.go Lines 108 to 125 in 736a67e
When the node starts up, we set it to the value shown here: cockroach/pkg/kv/kvserver/replica_init.go Lines 215 to 225 in 51f8f8e
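For illustration, a minimal sketch of the suspected sequence handling (invented names; the real check in cmd_lease_request.go compares far more state than this): if the previous lease was held by the same store and the new request is treated as equivalent, the sequence is carried over rather than bumped, which would explain why the post-restart re-acquisition never registers as a change of hands.

```go
// Invented sketch of the suspected sequence handling; the real logic in
// cmd_lease_request.go compares much more state than shown here.
package sketch

type lease struct {
	ownerStoreID int
	sequence     int
}

// nextLeaseSequence carries the sequence over when the request looks like an
// extension of a lease the same store already held; otherwise it bumps it.
// Under the hypothesis above, a post-restart re-acquisition by the previous
// holder falls into the first branch, so the "lease changed hands" gossip
// trigger never sees a new sequence.
func nextLeaseSequence(prev, next lease, equivalent bool) int {
	if prev.ownerStoreID == next.ownerStoreID && equivalent {
		return prev.sequence
	}
	return prev.sequence + 1
}
```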
Roachstress is running for 574ded4 (presumed "bad") and 0fda198 (its left parent, presumed "good"). And we have the known bad SHA 80c4895.

presumed bad: 10/10 passed

I'll try "presumed bad" for another 20 iterations. If it's still not failing then, I'll have to assume the problem is in another PR. Unfortunately I'm having some trouble with roachtest/roachprod: cluster creation sometimes hangs for 20m+. I was suspecting a deadlock, not sure. I collected some stack traces, but for now I'll try to focus on the bisection.

Update: seeing a cluster hung with these exact symptoms in "presumed bad". Its gossip network is missing the liveness info for n7-n9, as expected.

"Presumed good" passed another 20+ times; there is a different failure mode that I also saw on "presumed bad", but I think it is unrelated and we can call "presumed good" "good".
cc @cockroachdb/replication
Once GitHub is back up, I'll send the revert and backport it to 22.2 and 22.1.
99643: kvserver: revert #98150 r=tbg a=tbg

This reverts #98150 because we think it introduced a problem that was detected via the `replicate/wide` roachtest[^1]. It seems that, for reasons still unknown, we do rely on the periodic gossip trigger on liveness lease extensions.

[^1]: #99268 (comment)

Touches #99268. Touches #97966. Closes #99268. Touches #98945. Tracked in #99652.

Epic: none
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
roachtest.replicate/wide failed with artifacts on release-23.1 @ 80c4895c566a7eaa6f16c4098980509dd3795ad7:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=1, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Same failure on other branches
Jira issue: CRDB-25795