
roachtest: kv95/enc=false/nodes=4/ssds=8 failed #112730

Closed
cockroach-teamcity opened this issue Oct 19, 2023 · 16 comments

@cockroach-teamcity (Member) commented Oct 19, 2023

roachtest.kv95/enc=false/nodes=4/ssds=8 failed with artifacts on master @ fbee0764a4a2953a1178ea8c3edd352545caa227:

(test_runner.go:1113).runTest: test timed out (3h0m0s)
(cluster.go:2185).Run: full command output in run_192916.229252011_n5_workload-run-kv-tole.log: COMMAND_PROBLEM: exit status 134
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/kv95/enc=false/nodes=4/ssds=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=8

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash

Jira issue: CRDB-32566

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Oct 19, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Oct 19, 2023
@renatolabs (Contributor)

Test timed out; the 1000 splits performed by the workload never finished.

We have 100MB worth of logs with the following error:

W231019 21:53:36.420215 7169 kv/kvserver/idalloc/id_alloc.go:120 ⋮ [T1,Vsystem,n1,s4,idalloc] 37072 unable to allocate 10 ids from /System‹/›‹"range-idgen›‹"›: replica unavailable: (n1,s1):1 unable to serve request to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s11):2, (n4,s27):3, next=5, gen=6]: closed timestamp: 1697744307.475945591,0 (2023-10-19 19:38:27); raft status: {"id":"1","term":6,"vote":"1","commit":562,"lead":"0","raftState":"StatePreCandidate","applied":562,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal ConditionalPut [/System/StatusNode/1,/Min)

Unfortunately, we have no logs for the first couple of hours of this test due to log rotation.

Forwarding to @cockroachdb/kv.

@renatolabs renatolabs added the T-kv KV Team label Oct 23, 2023
@renatolabs renatolabs removed the T-testeng TestEng Team label Oct 23, 2023
@blathers-crl (bot) commented Oct 23, 2023

cc @cockroachdb/replication

@shralex shralex removed the T-kv KV Team label Oct 23, 2023
@nvanbenschoten (Member) commented Oct 23, 2023

We see range 3 become unavailable on node 1 for the duration of the test. The raft status is included in the log output:

 unable to allocate 10 ids from /System‹/›‹"range-idgen›‹"›: replica unavailable: (n1,s1):1 unable to serve request to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s11):2, (n4,s27):3, next=5, gen=6]: closed timestamp: 1697744307.475945591,0 (2023-10-19 19:38:27); raft status: {"id":"1","term":6,"vote":"1","commit":562,"lead":"0","raftState":"StatePreCandidate","applied":562,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal ConditionalPut [/System/StatusNode/1,/Min)
+(1) forced error mark
+  | ‹"originated at breaker breaker"›
+  | github.com/cockroachdb/cockroach/pkg/util/circuit/*circuit.breakerErrorMark::
+Wraps: (2) forced error mark
+  | ‹"breaker open"›
+  | github.com/cockroachdb/errors/withstack/*withstack.withStack::
+Wraps: (3) replica unavailable: (n1,s1):1 unable to serve request to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s11):2, (n4,s27):3, next=5, gen=6]: closed timestamp: 1697744307.475945591,0 (2023-10-19 19:38:27); raft status: {"id":"1","term":6,"vote":"1","commit":562,"lead":"0","raftState":"StatePreCandidate","applied":562,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal ConditionalPut [/System/StatusNode/1,/Min)
+Wraps: (4)
+  | (opaque error wrapper)
+  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
+  | reportable 0:
+  |
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.replicaUnavailableError
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:271
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).replicaUnavailableError
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:280
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).tripSync
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:101
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).TripAsync.func1
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:95
+  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
+  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
+  | runtime.goexit
+  | 	src/runtime/asm_amd64.s:1598
+Wraps: (5) closed timestamp: 1697744307.475945591,0 (2023-10-19 19:38:27); raft status: {"id":"1","term":6,"vote":"1","commit":562,"lead":"0","raftState":"StatePreCandidate","applied":562,"progress":{},"leadtransferee":"0"}
+Wraps: (6)
+  | (opaque error wrapper)
+  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
+  | reportable 0:
+  |
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).refreshProposalsLocked
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1462
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).tick
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1328
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processTick
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:718
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:412
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
+  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
+  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
+  | runtime.goexit
+  | 	src/runtime/asm_amd64.s:1598
+Wraps: (7) have been waiting 62.50s for slow proposal ConditionalPut [/System/StatusNode/1,/Min)
+Error types: (1) *markers.withMark (2) *markers.withMark (3) *kvpb.ReplicaUnavailableError (4) *errbase.opaqueWrapper (5) *errutil.withPrefix (6) *errbase.opaqueWrapper (7) *errutil.leafError

On nodes 2, 3, and 4, we do see the initial raft snapshots for this range, but don't see anything else about r3.

I231019 19:29:15.706452 1714 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n2,s11,r3/2:{-}] 32  applying snapshot c6836cbe from (n1,s1):1 at applied index 30

I231019 19:34:34.830180 371151 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n3,s19,r3/4:{-}] 6310  applying snapshot 8f34dc0b from (n1,s1):1 at applied index 441

I231019 19:29:16.999979 1733 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n4,s27,r3/3:{-}] 38  applying snapshot 14ccd3d8 from (n1,s1):1 at applied index 43

@erikgrinaker erikgrinaker self-assigned this Oct 24, 2023
@erikgrinaker (Contributor)

Seems to be some kind of cascading failure. Only n1's logs rotated out; on the other nodes, we see the first liveness heartbeat failures at:

n2 19:29:27.953592
n3 19:34:38.622217
n4 19:36:16.168758

@erikgrinaker (Contributor)

Furthermore, all of the initial liveness heartbeat failures are due to disk stalls:

W231019 19:29:27.953504 478 kv/kvserver/liveness/liveness.go:760 ⋮ [T1,Vsystem,n2,liveness-hb] 1608  slow heartbeat took 3.015819446s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:1›: context deadline exceeded
W231019 19:29:27.953592 478 kv/kvserver/liveness/liveness.go:676 ⋮ [T1,Vsystem,n2,liveness-hb] 1609  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.016s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:1›: context deadline exceeded

W231019 19:34:38.622081 453 kv/kvserver/liveness/liveness.go:760 ⋮ [T1,Vsystem,n3,liveness-hb] 2067  slow heartbeat took 3.011392252s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:2›: context deadline exceeded
W231019 19:34:38.622217 453 kv/kvserver/liveness/liveness.go:676 ⋮ [T1,Vsystem,n3,liveness-hb] 2068  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.012s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:2›: context deadline exceeded

W231019 19:36:16.168655 465 kv/kvserver/liveness/liveness.go:760 ⋮ [T1,Vsystem,n4,liveness-hb] 2502  slow heartbeat took 3.003619244s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:7›: context deadline exceeded
W231019 19:36:16.168758 465 kv/kvserver/liveness/liveness.go:676 ⋮ [T1,Vsystem,n4,liveness-hb] 2503  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.004s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:7›: context deadline exceeded
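
For context, a minimal sketch (not the actual liveness code; stalledSync is a hypothetical placeholder) of why a stalled engine sync surfaces as a 3s heartbeat timeout with "context deadline exceeded": the heartbeat runs under a deadline and is abandoned while the underlying write stays stuck.

```go
// A minimal sketch, assuming a blocking engine sync and a 3s heartbeat
// deadline; stalledSync is a hypothetical placeholder, not a Pebble API.
package main

import (
	"context"
	"fmt"
	"time"
)

// stalledSync stands in for an engine sync that never completes because the
// disk has stalled.
func stalledSync() error {
	time.Sleep(time.Hour) // simulated stall
	return nil
}

// heartbeatOnce runs one liveness-heartbeat-style write under ctx and gives
// up when the deadline fires, even though the sync is still stuck underneath.
func heartbeatOnce(ctx context.Context) error {
	done := make(chan error, 1)
	go func() { done <- stalledSync() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("disk write failed while updating node liveness: %w", ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	start := time.Now()
	if err := heartbeatOnce(ctx); err != nil {
		fmt.Printf("slow heartbeat took %s; err=%v\n", time.Since(start), err)
	}
}
```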

@erikgrinaker (Contributor)

The first circuit breaker tripped on n1 shortly after n4 failed its first liveness heartbeat. At that point, all of n2-n4 were persistently failing heartbeats. This is consistent with n2-n4 actually stalling writes, with the associated Raft quorum loss, but that seems highly improbable.

[two screenshots attached]
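
As a rough illustration of the mechanism (a sketch with assumed semantics, not the kvserver implementation): the per-replica breaker trips once a proposal has been pending past a threshold, after which requests fail fast with a breaker-open error like the one in the trace above.

```go
// A minimal sketch with assumed semantics, not the kvserver implementation.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type replicaBreaker struct {
	mu        sync.Mutex
	tripped   error
	threshold time.Duration
}

// maybeTrip is called periodically (e.g. on a raft tick) with the age of the
// oldest pending proposal; it trips the breaker once that exceeds the threshold.
func (b *replicaBreaker) maybeTrip(oldestProposalAge time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tripped == nil && oldestProposalAge > b.threshold {
		b.tripped = fmt.Errorf("replica unavailable: have been waiting %.2fs for slow proposal",
			oldestProposalAge.Seconds())
	}
}

// check is consulted on each incoming request; once tripped, requests fail
// fast instead of queueing behind a stalled range.
func (b *replicaBreaker) check() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tripped != nil {
		return errors.Join(errors.New("breaker open"), b.tripped)
	}
	return nil
}

func main() {
	br := &replicaBreaker{threshold: 60 * time.Second}
	br.maybeTrip(62500 * time.Millisecond) // a proposal stuck for 62.5s
	fmt.Println(br.check())
}
```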

@erikgrinaker (Contributor)

n2-n4 are continually writing 200 MB/s, but see < 1MB/s network traffic. They also spin at 100% CPU usage. Why?

[three screenshots attached: disk write, network, and CPU graphs]

@erikgrinaker (Contributor) commented Oct 24, 2023

It looks like all of n2-n4 are continually failing to flush memtables to disk, which explains why writes are stalling on these nodes. Reassigning to @cockroachdb/storage, via @itsbilal as L2.

I231019 22:22:02.279605 11911697 3@pebble/event.go:712 ⋮ [n4,s30,pebble] 14418420  [JOB 1618476] flushing 3 memtables (1013KB) to L0
I231019 22:22:02.279704 11911697 3@pebble/event.go:732 ⋮ [n4,s30,pebble] 14418421  [JOB 1618476] flushing: sstable created 1618485
I231019 22:22:02.285838 11911697 3@pebble/event.go:716 ⋮ [n4,s30,pebble] 14418422  [JOB 1618476] flush error: table format version ‹(Pebble,v3)› is less than the minimum required version ‹(Pebble,v4)› for sized deletion tombstones
I231019 22:22:02.285897 11911697 3@pebble/event.go:696 ⋮ [n4,s30,pebble] 14418423  background error: table format version ‹(Pebble,v3)› is less than the minimum required version ‹(Pebble,v4)› for sized deletion tombstones
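
For reference, a minimal sketch (not Pebble's actual writer code; names are illustrative) of the invariant the flush keeps hitting: a sized deletion tombstone requires table format v4, so flushing one on a store still capped at v3 can only fail, over and over.

```go
// A minimal sketch, not Pebble's actual writer code; names are illustrative.
package main

import "fmt"

type TableFormat int

const (
	TableFormatPebblev3 TableFormat = 3 // predates sized deletion tombstones
	TableFormatPebblev4 TableFormat = 4 // adds sized deletion tombstones
)

type keyKind int

const (
	kindSet keyKind = iota
	kindDelete
	kindDeleteSized
)

// writeKey models adding one key to an output sstable during a flush: a
// sized deletion cannot be encoded below format v4.
func writeKey(format TableFormat, kind keyKind) error {
	if kind == kindDeleteSized && format < TableFormatPebblev4 {
		return fmt.Errorf("table format version (Pebble,v%d) is less than the minimum required version (Pebble,v%d) for sized deletion tombstones",
			format, TableFormatPebblev4)
	}
	return nil // encode the key...
}

func main() {
	// n1 was far enough ahead to hand DELSIZED keys to n2-n4, whose stores
	// were still capped at format v3, so every flush attempt fails.
	if err := writeKey(TableFormatPebblev3, kindDeleteSized); err != nil {
		fmt.Println("flush error:", err)
	}
}
```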

@erikgrinaker erikgrinaker added T-storage Storage Team and removed T-kv-replication labels Oct 24, 2023
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Oct 24, 2023
@erikgrinaker (Contributor)

Btw, it would be helpful for these messages to be logged at error level rather than info, so they'd show up in the main cockroach.log file.

@itsbilal (Contributor)

Looking into this, it appears the nodes diverged in their knowledge of cluster versions: n1 was at a higher cluster version that permitted DeleteSized, while the other nodes weren't. We also see these errors in the logs of all nodes except n1:

./cockroach.teamcity-12247663-1697693516-143-n5cpu8-0003.ubuntu.2023-10-19T22_03_52Z.013157.log:E231019 22:27:03.535562 1077 server/auto_upgrade.go:57 ⋮ [T1,Vsystem,n3] 25343  unable to retrieve cluster version: ‹show-version›: operation ‹"show cluster setting version"› timed out after 2m0s (given timeout 2m0s): value differs between local setting (23.1) and KV (23.1-38)

This indicates that a BumpClusterVersion RPC was missed or similar, resulting in the local cluster setting lagging behind. Is there a race in our cluster version / init server bootstrapping and joining code that could lead to this happening on very rare occasions? I've been trying to figure that out through code reading for a day now and haven't made much progress.

It also looks like the joining nodes just got the older cluster version from the first node, even though the first node had already ratcheted its Pebble format version forward (indicating a Cockroach cluster version bump at least 4 seconds before this log line). Here are the relevant log lines from a joining node:

I231019 19:29:14.606792 33 server/init.go:316 ⋮ [T1,Vsystem,n?] 35  joined cluster ‹462999d9-6441-44d6-a20d-e33e36e8352f› through join rpc
I231019 19:29:14.606831 33 server/init.go:317 ⋮ [T1,Vsystem,n?] 36  received node ID: 3
I231019 19:29:14.606843 33 server/init.go:318 ⋮ [T1,Vsystem,n?] 37  received cluster version: 23.1

And here's the pebble FMV ratchet from the first node, 4 seconds earlier, indicating that the cluster version should have been ratcheted to at least 23.1-16 before the joining node even started:

I231019 19:29:10.442750 65 3@pebble/event.go:720 ⋮ [n?,s?,pebble] 146  upgraded to format version: 016

Either way, this seems to be outside of storage and squarely within the server / cluster version / join code. It's also not easily reproducible.
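
To make the suspected failure mode concrete, here is a hypothetical sketch (not the actual init/join server code; all names are illustrative): if the join path hands out a cached in-memory cluster version instead of the value already ratcheted and persisted in KV, a node joining in that window starts out behind.

```go
// A hypothetical sketch of the suspected race; not the actual init/join code.
package main

import "fmt"

type clusterVersion string

type joinServer struct {
	cachedVersion    clusterVersion // updated asynchronously on version bumps
	persistedVersion clusterVersion // authoritative value stored in KV
}

// handleJoin models building a JoinNodeResponse. Returning cachedVersion is
// the failure mode: it can lag persistedVersion if a bump hasn't reached
// this code path yet.
func (s *joinServer) handleJoin() clusterVersion {
	return s.cachedVersion
}

func main() {
	s := &joinServer{
		cachedVersion:    "23.1",    // what was advertised over the Join RPC
		persistedVersion: "23.1-16", // what had already been ratcheted in KV
	}
	fmt.Printf("joining node received cluster version: %s (KV has %s)\n",
		s.handleJoin(), s.persistedVersion)
}
```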

@itsbilal (Contributor)

Given that the logs on n1 (which likely played the biggest role here, returning the old cluster version in JoinNodeResponse while also having a newer one stored in KV) have rotated out since the bootstrapping phase, it's hard to keep digging without additional information or pointers on what went wrong to cause this cluster version divergence. I'm inclined to close this out: the investigation is hitting a dead end, and I've already spent a lot of cycles reading the code here. Thoughts?

@erikgrinaker (Contributor)

I think this seems serious enough that we need to get to the bottom of it.

I haven't looked at this in depth, but we do expect nodes to bump the cluster version asynchronously. Is the problem here that all nodes should ratchet each cluster version before proceeding to the next, and we require all nodes to apply V23_2_PebbleFormatDeleteSizedAndObsolete before enabling V23_2_UseSizedPebblePointTombstones?

It's not clear to me why nodes began failing memtable flushes if they hadn't bumped the version gate either, but again, I haven't looked at this in detail.
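
For illustration, a minimal sketch of the two-gate pattern being asked about (gate names taken from the comment above, semantics simplified and assumed): the feature gate that enables writing sized tombstones should only activate after the format-migration gate has been applied everywhere.

```go
// A minimal sketch; gate names from the discussion, semantics assumed.
package main

import "fmt"

type versionGate int

const (
	PebbleFormatDeleteSizedAndObsolete versionGate = iota // migrate every store to format v4
	UseSizedPebblePointTombstones                         // start writing DELSIZED keys
)

type clusterSettings struct{ active map[versionGate]bool }

func (s clusterSettings) isActive(g versionGate) bool { return s.active[g] }

// shouldWriteSizedTombstones is only safe if gates activate strictly in
// order: by the time the second gate is active anywhere, every node must
// already have finished the format migration required by the first.
func shouldWriteSizedTombstones(s clusterSettings) bool {
	return s.isActive(UseSizedPebblePointTombstones)
}

func main() {
	// The failure mode in this issue: n1 considered both gates active while
	// n2-n4 had neither, so n1's sized tombstones reached stores still at v3.
	n1 := clusterSettings{active: map[versionGate]bool{
		PebbleFormatDeleteSizedAndObsolete: true,
		UseSizedPebblePointTombstones:      true,
	}}
	n2 := clusterSettings{active: map[versionGate]bool{}}
	fmt.Println("n1 writes sized tombstones:", shouldWriteSizedTombstones(n1))
	fmt.Println("n2 store migrated to v4:   ", n2.isActive(PebbleFormatDeleteSizedAndObsolete))
}
```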

@itsbilal (Contributor)

@erikgrinaker That's right, the nodes should all be at one cluster version before proceeding to the next one. In particular, it looks like n1 (whose logs got rotated away) was several cluster versions ahead of the one it was advertising over the Join RPC. So it wrote sized point deletions, and when those writes were replicated to the other nodes, they couldn't write a sized deletion because their maximum sstable format didn't support it, resulting in failing flushes.

Pebble could fail in a more obvious fashion for a bug like this, which would have preserved the logs: we wouldn't have retried the flushes indefinitely and only given up when the test timed out. That would be cockroachdb/pebble#270.

@itsbilal (Contributor)

Ah, I took a closer look at the failing SHA, checked out the code, found the bug, and realized it was fixed in #112684. 🤦

Also a dup of #112658.

@itsbilal itsbilal closed this as not planned Oct 26, 2023
@erikgrinaker (Contributor)

Nice, glad we figured it out -- thanks for digging!

@srosenberg (Member)

> Ah, I took a closer look at the failing SHA, checked out the code, found the bug, and realized it was fixed in #112684. 🤦

I bet we would have noticed this sooner if we had surfaced the fact that raid0=false; that will be tackled in #112707.
