roachtest: sysbench/oltp_write_only/nodes=3/cpu=32/conc=256 failed [tocommit out of range] #97926
Quite a few liveness errors and eventually node 1 crashes with a panic from etcd/raft:
cc @cockroachdb/replication |
@pavelkalinnikov This seems like a high-priority problem, would appreciate if you could have an initial look. |
Node 1 got behind, and fails on an empty {
  "span": {
    "start_key": "/Table/111/1/3781091",
    "end_key": "/Table/112"
  },
  "raft_state": {
    "replica_id": 2,
    "hard_state": {
      "term": 6,
      "vote": 2,
      "commit": 821
    },
    "lead": 2,
    "state": "StateLeader",
    "applied": 821,
    "progress": {
      "1": {
        "match": 427,
        "next": 428,
        "state": "StateProbe",
        "paused": true
      },
      "2": {
        "match": 821,
        "next": 822,
        "state": "StateReplicate"
      },
      "3": {
        "match": 821,
        "next": 822,
        "state": "StateReplicate"
      }
    }
  },
  "state": {
    "state": {
      "raft_applied_index": 821,
      "lease_applied_index": 508, |
Same symptoms in another failure here: #97389 (comment) |
The message causing the panic is a pb.Message{
    Type:    pb.MsgApp,
    To:      1,
    From:    2,
    Term:    6,
    Index:   489,
    LogTerm: 0,  // <--- Why is this 0?
    Entries: [], // <--- Looks like no entries were found with index >= 489 (or max-inflight is saturated)?
    Commit:  660,
}
|
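To make the suspected failure mode concrete, here is a minimal toy model of the receiver side. It is not the etcd/raft code: `toyLog`, `newToyLog`, and `handleEmptyAppend` are hypothetical names, the follower's last index (427) is borrowed from the `match` value in the progress dump purely for illustration, and the key assumption is that the follower's term lookup returns 0 rather than an error for an index beyond its log. Under that assumption, a `LogTerm` of 0 "matches" a missing entry, and the commit index is then advanced past the end of the log.

```go
package main

import "fmt"

// toyLog is a hypothetical, heavily simplified follower log. It tracks only
// the terms of entries 1..len(terms) and the commit index.
type toyLog struct {
	terms     []uint64 // terms[i-1] is the term of entry i
	committed uint64
}

// newToyLog builds a log with entries 1..last, all at the given term.
func newToyLog(last, term uint64) *toyLog {
	l := &toyLog{terms: make([]uint64, last)}
	for i := range l.terms {
		l.terms[i] = term
	}
	return l
}

func (l *toyLog) lastIndex() uint64 { return uint64(len(l.terms)) }

// term returns 0 for any index outside the log instead of an error.
// ASSUMPTION: this models a lenient lookup on the receiver.
func (l *toyLog) term(i uint64) uint64 {
	if i == 0 || i > l.lastIndex() {
		return 0
	}
	return l.terms[i-1]
}

// handleEmptyAppend models receiving a MsgApp that carries no entries.
// The real code panics in the out-of-range case; the toy returns an error.
func (l *toyLog) handleEmptyAppend(index, logTerm, commit uint64) error {
	if l.term(index) != logTerm {
		return fmt.Errorf("rejected: no entry at index %d with term %d", index, logTerm)
	}
	// With no entries, the last "new" index is just index.
	toCommit := commit
	if index < toCommit {
		toCommit = index
	}
	if toCommit > l.lastIndex() {
		return fmt.Errorf("tocommit(%d) out of range [lastIndex(%d)]", toCommit, l.lastIndex())
	}
	if toCommit > l.committed {
		l.committed = toCommit
	}
	return nil
}

func main() {
	follower := newToyLog(427, 6)
	// The message from the report: Index 489, LogTerm 0, no entries, Commit 660.
	fmt.Println(follower.handleEmptyAppend(489, 0, 660))
	// The same message with a non-zero LogTerm is rejected instead.
	fmt.Println(follower.handleEmptyAppend(489, 6, 660))
}
```

In this model the zero `LogTerm` is what turns an ordinary rejection into the out-of-range commit: with any non-zero term the term check fails first and the follower simply rejects the append.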
@nvanbenschoten Could something like this be caused by async log appends introduced in #94165? We're seeing |
Possibly index 489 got truncated on the leader by the time
|
I came to the same conclusion. If it is legitimate for a MsgApp to carry a If it's not legitimate then the receiver-side code could probably be improved, but it's not the root of the problem. Instead, we'll need to look at the leader and understand whether we're hitting the |
@nvanbenschoten Yes, likely we're hitting the first case, see the log truncation message above. |
Interesting. When we're dealing with some kind of race with log truncation, do we have any reason to expect |
I'm struggling to see how at all we would hit the
To get an
|
Culprit: etcd-io/raft@42419da I couldn't repro this panic yet, but found another one that this commit causes (bisected to verify). Trying to repro this one too, it seems to have the same underlying cause - sending a zero |
Found a repro, working in upstream to fix it: etcd-io/raft#31. |
@pavelkalinnikov nice find! Could you explain why etcd-io/raft@42419da is the culprit? We've primarily been looking at |
@nvanbenschoten In etcd-io/raft#31 there is a test which simulates the behaviour in this issue: a Raft log truncation + a bit of slowness on a follower so that truncation on the leader overtakes the appends flow to the follower. I bisected, and this test starts panicking (with the same message) right at the culprit commit that I linked. The reason why my change broke it is: previously we unconditionally called Yes, broadly speaking we need to fix or workaround the |
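The race described above can be sketched on the leader side with a toy model. Everything here is hypothetical and simplified: `toyLeaderLog`, `nextSend`, and the truncation point (600) are made up for illustration, and the sketch assumes that the call which used to be unconditional was the entries read, which reports truncation as an error, while the term lookup silently returns 0 for a truncated index. The thread excerpt does not confirm those details.

```go
package main

import (
	"errors"
	"fmt"
)

var errCompacted = errors.New("requested index is compacted")

// toyLeaderLog is a hypothetical leader log whose entries below first have
// been truncated. All remaining entries share one term to keep things short.
type toyLeaderLog struct {
	first, last, curTerm uint64
}

// term is lenient: an out-of-range index yields (0, nil).
// ASSUMPTION: this models the lookup behaviour before the fix.
func (l toyLeaderLog) term(i uint64) (uint64, error) {
	if i+1 < l.first || i > l.last {
		return 0, nil
	}
	return l.curTerm, nil
}

// entries is strict: reading from a truncated index fails.
// It returns only the number of entries available from index from.
func (l toyLeaderLog) entries(from uint64) (int, error) {
	if from < l.first {
		return 0, errCompacted
	}
	if from > l.last {
		return 0, nil
	}
	return int(l.last - from + 1), nil
}

// nextSend decides what the leader sends to a follower whose next index is
// next (must be >= 1). fetchWhenPaused toggles whether entries are read for
// a paused follower.
func nextSend(l toyLeaderLog, next uint64, paused, fetchWhenPaused bool) string {
	logTerm, errT := l.term(next - 1)
	var n int
	var errE error
	if !paused || fetchWhenPaused {
		n, errE = l.entries(next)
	}
	if errT != nil || errE != nil {
		return "snapshot" // fall back to a snapshot when the log is unavailable
	}
	return fmt.Sprintf("MsgApp{Index:%d LogTerm:%d Entries:%d}", next-1, logTerm, n)
}

func main() {
	leader := toyLeaderLog{first: 600, last: 821, curTerm: 6}
	// Entries always fetched: truncation is detected and a snapshot is sent.
	fmt.Println(nextSend(leader, 490, true, true))
	// Fetch skipped for a paused follower: a zero-LogTerm empty MsgApp goes out.
	fmt.Println(nextSend(leader, 490, true, false))
}
```

In this model, skipping the entries read removes the only check that notices the truncation, which is why either making the term lookup strict or restoring the read would close the gap.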
98574: sql: support tenant configuration templates r=stevendanna,ecwall a=knz

Fixes #98573.
Epic: CRDB-23559

First commit from #98726.

This change introduces the LIKE clause to CREATE TENANT, which makes CREATE TENANT copy the parameters (but not the storage keyspace) from the tenant selected by LIKE.

Also, if LIKE is not specified but the (new) cluster setting `sql.create_tenant.default_template` is not empty, the value of the cluster setting is used implicitly as the LIKE clause.

A proposed use of this is cluster-to-cluster replication, considering cutover as well. On the target (sink) cluster, the operator would do:

```
CREATE TENANT application LIKE app_template FROM REPLICATION OF application ON ....
```

And then cutover would look something like the following if they wanted the tenant to still be named "application":

```
ALTER TENANT application CUTOVER TO LATEST;
DROP TENANT application; -- if there's one already
ALTER TENANT application START SERVICE SHARED;
```

Release note: None

98721: go.mod: bump etcd-io/raft to 5fe1c31 r=tbg a=pavelkalinnikov

Fixes #97926
Epic: none

Release note (bug fix): fixed a rare panic in upstream etcd-io/raft when message appends race with log compaction

98747: kvserver: deflake TestReplicaProbeRequest r=pavelkalinnikov a=tbg

When we ignored an ambiguous result but the probe didn't actually happen, a later condition in the test would fail. Retry the probe on ambiguous results instead; the test already only expects the probe to happen "at least once", so we don't introduce any new issues should a successful probe end up being retried.

Fixes #97136.
Epic: none

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
roachtest.sysbench/oltp_write_only/nodes=3/cpu=32/conc=256 failed with artifacts on master @ 20e2adda3c76c7172dd986c871df0ae9a346918f:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=32, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Jira issue: CRDB-24968