
roachtest: kv95/enc=false/nodes=4/ssds=8 failed #112730

Closed
cockroach-teamcity opened this issue Oct 19, 2023 · 16 comments

@cockroach-teamcity (Member) commented Oct 19, 2023

roachtest.kv95/enc=false/nodes=4/ssds=8 failed with artifacts on master @ fbee0764a4a2953a1178ea8c3edd352545caa227:

(test_runner.go:1113).runTest: test timed out (3h0m0s)
(cluster.go:2185).Run: full command output in run_192916.229252011_n5_workload-run-kv-tole.log: COMMAND_PROBLEM: exit status 134
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/kv95/enc=false/nodes=4/ssds=8/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=8 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=8

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash

Jira issue: CRDB-32566

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team labels Oct 19, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Oct 19, 2023
@renatolabs (Contributor)

Test timed out; the 1000 splits performed by the workload never finished.

We have 100MB worth of logs with the following error:

W231019 21:53:36.420215 7169 kv/kvserver/idalloc/id_alloc.go:120 ⋮ [T1,Vsystem,n1,s4,idalloc] 37072 unable to allocate 10 ids from /System‹/›‹"range-idgen›‹"›: replica unavailable: (n1,s1):1 unable to serve request to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s11):2, (n4,s27):3, next=5, gen=6]: closed timestamp: 1697744307.475945591,0 (2023-10-19 19:38:27); raft status: {"id":"1","term":6,"vote":"1","commit":562,"lead":"0","raftState":"StatePreCandidate","applied":562,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal ConditionalPut [/System/StatusNode/1,/Min)

Unfortunately, we have no logs for the first couple of hours of this test due to log rotation.

Forwarding to @cockroachdb/kv.

@renatolabs renatolabs added the T-kv KV Team label Oct 23, 2023
@renatolabs renatolabs removed the T-testeng TestEng Team label Oct 23, 2023
@blathers-crl (bot) commented Oct 23, 2023

cc @cockroachdb/replication

@shralex shralex removed the T-kv KV Team label Oct 23, 2023
@nvanbenschoten (Member) commented Oct 23, 2023

We see range 3 become unavailable on node 1 for the duration of the test. The raft status is included in the log output:

 unable to allocate 10 ids from /System‹/›‹"range-idgen›‹"›: replica unavailable: (n1,s1):1 unable to serve request to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s11):2, (n4,s27):3, next=5, gen=6]: closed timestamp: 1697744307.475945591,0 (2023-10-19 19:38:27); raft status: {"id":"1","term":6,"vote":"1","commit":562,"lead":"0","raftState":"StatePreCandidate","applied":562,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal ConditionalPut [/System/StatusNode/1,/Min)
+(1) forced error mark
+  | ‹"originated at breaker breaker"›
+  | github.com/cockroachdb/cockroach/pkg/util/circuit/*circuit.breakerErrorMark::
+Wraps: (2) forced error mark
+  | ‹"breaker open"›
+  | github.com/cockroachdb/errors/withstack/*withstack.withStack::
+Wraps: (3) replica unavailable: (n1,s1):1 unable to serve request to r3:‹/System/{NodeLivenessMax-tsd}› [(n1,s1):1, (n2,s11):2, (n4,s27):3, next=5, gen=6]: closed timestamp: 1697744307.475945591,0 (2023-10-19 19:38:27); raft status: {"id":"1","term":6,"vote":"1","commit":562,"lead":"0","raftState":"StatePreCandidate","applied":562,"progress":{},"leadtransferee":"0"}: have been waiting 62.50s for slow proposal ConditionalPut [/System/StatusNode/1,/Min)
+Wraps: (4)
+  | (opaque error wrapper)
+  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
+  | reportable 0:
+  |
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.replicaUnavailableError
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:271
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).replicaUnavailableError
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:280
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).tripSync
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:101
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaCircuitBreaker).TripAsync.func1
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_circuit_breaker.go:95
+  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
+  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
+  | runtime.goexit
+  | 	src/runtime/asm_amd64.s:1598
+Wraps: (5) closed timestamp: 1697744307.475945591,0 (2023-10-19 19:38:27); raft status: {"id":"1","term":6,"vote":"1","commit":562,"lead":"0","raftState":"StatePreCandidate","applied":562,"progress":{},"leadtransferee":"0"}
+Wraps: (6)
+  | (opaque error wrapper)
+  | type name: github.com/cockroachdb/errors/withstack/*withstack.withStack
+  | reportable 0:
+  |
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).refreshProposalsLocked
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1462
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).tick
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1328
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processTick
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:718
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:412
+  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
+  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:321
+  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
+  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:484
+  | runtime.goexit
+  | 	src/runtime/asm_amd64.s:1598
+Wraps: (7) have been waiting 62.50s for slow proposal ConditionalPut [/System/StatusNode/1,/Min)
+Error types: (1) *markers.withMark (2) *markers.withMark (3) *kvpb.ReplicaUnavailableError (4) *errbase.opaqueWrapper (5) *errutil.withPrefix (6) *errbase.opaqueWrapper (7) *errutil.leafError

On nodes 2, 3, and 4, we do see the initial raft snapshots for this range, but don't see anything else about r3.

I231019 19:29:15.706452 1714 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n2,s11,r3/2:{-}] 32  applying snapshot c6836cbe from (n1,s1):1 at applied index 30

I231019 19:34:34.830180 371151 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n3,s19,r3/4:{-}] 6310  applying snapshot 8f34dc0b from (n1,s1):1 at applied index 441

I231019 19:29:16.999979 1733 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n4,s27,r3/3:{-}] 38  applying snapshot 14ccd3d8 from (n1,s1):1 at applied index 43

@erikgrinaker erikgrinaker self-assigned this Oct 24, 2023
@erikgrinaker (Contributor)

Seems to be some kind of cascading failure. Only n1's logs rotated out; on the other nodes, we see the first liveness heartbeat failures at:

n2 19:29:27.953592
n3 19:34:38.622217
n4 19:36:16.168758

@erikgrinaker (Contributor)

Furthermore, all of the initial liveness heartbeat failures are due to disk stalls:

W231019 19:29:27.953504 478 kv/kvserver/liveness/liveness.go:760 ⋮ [T1,Vsystem,n2,liveness-hb] 1608  slow heartbeat took 3.015819446s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:1›: context deadline exceeded
W231019 19:29:27.953592 478 kv/kvserver/liveness/liveness.go:676 ⋮ [T1,Vsystem,n2,liveness-hb] 1609  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.016s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:1›: context deadline exceeded

W231019 19:34:38.622081 453 kv/kvserver/liveness/liveness.go:760 ⋮ [T1,Vsystem,n3,liveness-hb] 2067  slow heartbeat took 3.011392252s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:2›: context deadline exceeded
W231019 19:34:38.622217 453 kv/kvserver/liveness/liveness.go:676 ⋮ [T1,Vsystem,n3,liveness-hb] 2068  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.012s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:2›: context deadline exceeded

W231019 19:36:16.168655 465 kv/kvserver/liveness/liveness.go:760 ⋮ [T1,Vsystem,n4,liveness-hb] 2502  slow heartbeat took 3.003619244s; err=disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:7›: context deadline exceeded
W231019 19:36:16.168758 465 kv/kvserver/liveness/liveness.go:676 ⋮ [T1,Vsystem,n4,liveness-hb] 2503  failed node liveness heartbeat: operation ‹"node liveness heartbeat"› timed out after 3.004s (given timeout 3s): disk write failed while updating node liveness: interrupted during singleflight ‹engine sync:7›: context deadline exceeded
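
For context, a minimal sketch (not the actual liveness code; stalledSync is a hypothetical placeholder) of why a stalled engine sync surfaces as a 3s heartbeat timeout with "context deadline exceeded": the heartbeat runs under a deadline and is abandoned while the underlying write stays stuck.

```go
// A minimal sketch, assuming a blocking engine sync and a 3s heartbeat
// deadline; stalledSync is a hypothetical placeholder, not a Pebble API.
package main

import (
	"context"
	"fmt"
	"time"
)

// stalledSync stands in for an engine sync that never completes because the
// disk has stalled.
func stalledSync() error {
	time.Sleep(time.Hour) // simulated stall
	return nil
}

// heartbeatOnce runs one liveness-heartbeat-style write under ctx and gives
// up when the deadline fires, even though the sync is still stuck underneath.
func heartbeatOnce(ctx context.Context) error {
	done := make(chan error, 1)
	go func() { done <- stalledSync() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("disk write failed while updating node liveness: %w", ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	start := time.Now()
	if err := heartbeatOnce(ctx); err != nil {
		fmt.Printf("slow heartbeat took %s; err=%v\n", time.Since(start), err)
	}
}
```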

@erikgrinaker (Contributor)

The first circuit breaker tripped on n1 shortly after n4 failed its first liveness heartbeat. At that point, all of n2-n4 were persistently failing heartbeats. This is consistent with n2-n4 actually stalling writes, with the associated Raft quorum loss, but that seems highly improbable.

[two screenshots attached]
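
As a rough illustration of the mechanism (a sketch with assumed semantics, not the kvserver implementation): the per-replica breaker trips once a proposal has been pending past a threshold, after which requests fail fast with a breaker-open error like the one in the trace above.

```go
// A minimal sketch with assumed semantics, not the kvserver implementation.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type replicaBreaker struct {
	mu        sync.Mutex
	tripped   error
	threshold time.Duration
}

// maybeTrip is called periodically (e.g. on a raft tick) with the age of the
// oldest pending proposal; it trips the breaker once that exceeds the threshold.
func (b *replicaBreaker) maybeTrip(oldestProposalAge time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tripped == nil && oldestProposalAge > b.threshold {
		b.tripped = fmt.Errorf("replica unavailable: have been waiting %.2fs for slow proposal",
			oldestProposalAge.Seconds())
	}
}

// check is consulted on each incoming request; once tripped, requests fail
// fast instead of queueing behind a stalled range.
func (b *replicaBreaker) check() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tripped != nil {
		return errors.Join(errors.New("breaker open"), b.tripped)
	}
	return nil
}

func main() {
	br := &replicaBreaker{threshold: 60 * time.Second}
	br.maybeTrip(62500 * time.Millisecond) // a proposal stuck for 62.5s
	fmt.Println(br.check())
}
```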

@erikgrinaker (Contributor)

n2-n4 are continually writing 200 MB/s, but see < 1MB/s network traffic. They also spin at 100% CPU usage. Why?

[three screenshots attached: disk write, network, and CPU graphs]

@erikgrinaker (Contributor) commented Oct 24, 2023

It looks like all of n2-n4 are continually failing to flush memtables to disk, which explains why writes are stalling on these nodes. Reassigning to @cockroachdb/storage, via @itsbilal as L2.

I231019 22:22:02.279605 11911697 3@pebble/event.go:712 ⋮ [n4,s30,pebble] 14418420  [JOB 1618476] flushing 3 memtables (1013KB) to L0
I231019 22:22:02.279704 11911697 3@pebble/event.go:732 ⋮ [n4,s30,pebble] 14418421  [JOB 1618476] flushing: sstable created 1618485
I231019 22:22:02.285838 11911697 3@pebble/event.go:716 ⋮ [n4,s30,pebble] 14418422  [JOB 1618476] flush error: table format version ‹(Pebble,v3)› is less than the minimum required version ‹(Pebble,v4)› for sized deletion tombstones
I231019 22:22:02.285897 11911697 3@pebble/event.go:696 ⋮ [n4,s30,pebble] 14418423  background error: table format version ‹(Pebble,v3)› is less than the minimum required version ‹(Pebble,v4)› for sized deletion tombstones
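
For reference, a minimal sketch (not Pebble's actual writer code; names are illustrative) of the invariant the flush keeps hitting: a sized deletion tombstone requires table format v4, so flushing one on a store still capped at v3 can only fail, over and over.

```go
// A minimal sketch, not Pebble's actual writer code; names are illustrative.
package main

import "fmt"

type TableFormat int

const (
	TableFormatPebblev3 TableFormat = 3 // predates sized deletion tombstones
	TableFormatPebblev4 TableFormat = 4 // adds sized deletion tombstones
)

type keyKind int

const (
	kindSet keyKind = iota
	kindDelete
	kindDeleteSized
)

// writeKey models adding one key to an output sstable during a flush: a
// sized deletion cannot be encoded below format v4.
func writeKey(format TableFormat, kind keyKind) error {
	if kind == kindDeleteSized && format < TableFormatPebblev4 {
		return fmt.Errorf("table format version (Pebble,v%d) is less than the minimum required version (Pebble,v%d) for sized deletion tombstones",
			format, TableFormatPebblev4)
	}
	return nil // encode the key...
}

func main() {
	// n1 was far enough ahead to hand DELSIZED keys to n2-n4, whose stores
	// were still capped at format v3, so every flush attempt fails.
	if err := writeKey(TableFormatPebblev3, kindDeleteSized); err != nil {
		fmt.Println("flush error:", err)
	}
}
```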

@erikgrinaker erikgrinaker added T-storage Storage Team and removed T-kv-replication labels Oct 24, 2023
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Oct 24, 2023
@erikgrinaker (Contributor)

Btw, it would be helpful for these messages to be logged at error level rather than info, so they'd show up in the main cockroach.log file.

@itsbilal (Contributor)

Looking into this, it appears the nodes diverged in their knowledge of cluster versions: n1 was at a higher cluster version that permitted DeleteSized, while the other nodes weren't. We also see these errors in the logs of all nodes except n1:

./cockroach.teamcity-12247663-1697693516-143-n5cpu8-0003.ubuntu.2023-10-19T22_03_52Z.013157.log:E231019 22:27:03.535562 1077 server/auto_upgrade.go:57 ⋮ [T1,Vsystem,n3] 25343  unable to retrieve cluster version: ‹show-version›: operation ‹"show cluster setting version"› timed out after 2m0s (given timeout 2m0s): value differs between local setting (23.1) and KV (23.1-38)

This indicates that a BumpClusterVersion RPC was missed or similar, resulting in the local cluster setting lagging behind. Is there a race in our cluster version / init server bootstrapping and joining code that could lead to this happening on very rare occasions? I've been trying to figure that out through code reading for a day now and haven't made much progress.

It also looks like the joining nodes just got the older cluster version from the first node, even though the first node had already ratcheted its Pebble format version forward (indicating a Cockroach cluster version bump at least 4 seconds before this log line). Here are the relevant log lines from a joining node:

I231019 19:29:14.606792 33 server/init.go:316 ⋮ [T1,Vsystem,n?] 35  joined cluster ‹462999d9-6441-44d6-a20d-e33e36e8352f› through join rpc
I231019 19:29:14.606831 33 server/init.go:317 ⋮ [T1,Vsystem,n?] 36  received node ID: 3
I231019 19:29:14.606843 33 server/init.go:318 ⋮ [T1,Vsystem,n?] 37  received cluster version: 23.1

And here's the pebble FMV ratchet from the first node, 4 seconds earlier, indicating that the cluster version should have been ratcheted to at least 23.1-16 before the joining node even started:

I231019 19:29:10.442750 65 3@pebble/event.go:720 ⋮ [n?,s?,pebble] 146  upgraded to format version: 016

Either way, this seems to be outside of storage and squarely within the server / cluster version / join code. It's also not easily reproducible.
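
To make the suspected failure mode concrete, here is a hypothetical sketch (not the actual init/join server code; all names are illustrative): if the join path hands out a cached in-memory cluster version instead of the value already ratcheted and persisted in KV, a node joining in that window starts out behind.

```go
// A hypothetical sketch of the suspected race; not the actual init/join code.
package main

import "fmt"

type clusterVersion string

type joinServer struct {
	cachedVersion    clusterVersion // updated asynchronously on version bumps
	persistedVersion clusterVersion // authoritative value stored in KV
}

// handleJoin models building a JoinNodeResponse. Returning cachedVersion is
// the failure mode: it can lag persistedVersion if a bump hasn't reached
// this code path yet.
func (s *joinServer) handleJoin() clusterVersion {
	return s.cachedVersion
}

func main() {
	s := &joinServer{
		cachedVersion:    "23.1",    // what was advertised over the Join RPC
		persistedVersion: "23.1-16", // what had already been ratcheted in KV
	}
	fmt.Printf("joining node received cluster version: %s (KV has %s)\n",
		s.handleJoin(), s.persistedVersion)
}
```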

@itsbilal (Contributor)

Given that the logs on n1 (which likely played the biggest role here, returning the old cluster version in JoinNodeResponse while also having a newer one stored in KV) have rotated out since the bootstrapping phase, it's hard to keep digging without additional information or pointers on what went wrong to cause this cluster version divergence. I'm inclined to close this out: the investigation is hitting a dead end, and I've already spent a lot of cycles reading the code here. Thoughts?

@erikgrinaker (Contributor)

I think this seems serious enough that we need to get to the bottom of it.

I haven't looked at this in depth, but we do expect nodes to bump the cluster version asynchronously. Is the problem here that all nodes should ratchet each cluster version before proceeding to the next, and we require all nodes to apply V23_2_PebbleFormatDeleteSizedAndObsolete before enabling V23_2_UseSizedPebblePointTombstones?

It's not clear to me why nodes began failing memtable flushes if they hadn't bumped the version gate either, but again, I haven't looked at this in detail.
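
For illustration, a minimal sketch of the two-gate pattern being asked about (gate names taken from the comment above, semantics simplified and assumed): the feature gate that enables writing sized tombstones should only activate after the format-migration gate has been applied everywhere.

```go
// A minimal sketch; gate names from the discussion, semantics assumed.
package main

import "fmt"

type versionGate int

const (
	PebbleFormatDeleteSizedAndObsolete versionGate = iota // migrate every store to format v4
	UseSizedPebblePointTombstones                         // start writing DELSIZED keys
)

type clusterSettings struct{ active map[versionGate]bool }

func (s clusterSettings) isActive(g versionGate) bool { return s.active[g] }

// shouldWriteSizedTombstones is only safe if gates activate strictly in
// order: by the time the second gate is active anywhere, every node must
// already have finished the format migration required by the first.
func shouldWriteSizedTombstones(s clusterSettings) bool {
	return s.isActive(UseSizedPebblePointTombstones)
}

func main() {
	// The failure mode in this issue: n1 considered both gates active while
	// n2-n4 had neither, so n1's sized tombstones reached stores still at v3.
	n1 := clusterSettings{active: map[versionGate]bool{
		PebbleFormatDeleteSizedAndObsolete: true,
		UseSizedPebblePointTombstones:      true,
	}}
	n2 := clusterSettings{active: map[versionGate]bool{}}
	fmt.Println("n1 writes sized tombstones:", shouldWriteSizedTombstones(n1))
	fmt.Println("n2 store migrated to v4:   ", n2.isActive(PebbleFormatDeleteSizedAndObsolete))
}
```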

@itsbilal (Contributor)

@erikgrinaker That's right, the nodes should all be at one cluster version before proceeding to the next one. In particular, it looks like n1 (whose logs got rotated away) was several cluster versions ahead of the one it was advertising over the Join RPC. So it wrote sized point deletions, and when those writes were replicated to the other nodes, they couldn't write a sized deletion because their maximum sstable format didn't support it, resulting in failing flushes.

Pebble could fail in a more obvious fashion for a bug like this, which would have preserved the logs: we wouldn't have retried the flushes indefinitely and only given up when the test timed out. That would be cockroachdb/pebble#270.

@itsbilal (Contributor)

Ah, I took a closer look at the failing SHA, checked out the code, found the bug, and realized it was fixed in #112684. 🤦

Also a dup of #112658.

@itsbilal itsbilal closed this as not planned Oct 26, 2023
@erikgrinaker (Contributor)

Nice, glad we figured it out -- thanks for digging!

@srosenberg (Member)

> Ah, I took a closer look at the failing SHA, checked out the code, found the bug, and realized it was fixed in #112684. 🤦

I bet we would have noticed this sooner if we had surfaced the fact that raid0=false; that will be tackled in #112707.
