roachtest: kv95/enc=false/nodes=4/ssds=8 failed #112730
Test timed out; the 1000 splits performed by the workload never finished. We have 100MB worth of logs with the following error:
Unfortunately, we have no logs for the first couple of hours of this test due to log rotation. Forwarding to @cockroachdb/kv.
cc @cockroachdb/replication
We see range 3 become unavailable on node 1 for the duration of the test. The raft status is included in the log output:
On nodes 2, 3, and 4, we do see the initial raft snapshots for this range, but don't see anything else about r3.
Seems to be some kind of cascading failure. Only n1's logs rotated out; on the other nodes we see the first liveness heartbeat failures at:
Furthermore, all of the initial liveness heartbeat failures are due to disk stalls:
It looks like all of n2-n4 are continually failing to flush memtables to disk, which explains why writes are stalling on these nodes. Reassigning to @cockroachdb/storage, via @itsbilal as L2.
Btw, it would be helpful for these messages to be logged at error level rather than info, so they'd show up in the main cockroach.log file.
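As a concrete illustration of that suggestion, here is a minimal, self-contained sketch of wiring Pebble's EventListener hooks so that background errors (such as failed memtable flushes), disk stalls, and write stalls surface at error level. The CockroachDB-side wiring and the exact severity plumbing are assumptions: the standard library logger stands in for cockroach.log, and the pointer-typed EventListener field matches recent Pebble releases (older releases took the listener by value).

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	opts := &pebble.Options{FS: vfs.NewMem()}
	// Hook Pebble's event listener so flush/compaction failures and disk
	// stalls are logged prominently instead of being buried in info logs.
	// (Stand-in logging; CockroachDB would route these through its own
	// logging package and severities.)
	opts.EventListener = &pebble.EventListener{
		// BackgroundError fires when a background operation such as a
		// memtable flush or compaction fails.
		BackgroundError: func(err error) {
			log.Printf("ERROR: pebble background error (e.g. flush failure): %v", err)
		},
		// DiskSlow fires when a filesystem operation exceeds the
		// configured slow-disk threshold.
		DiskSlow: func(info pebble.DiskSlowInfo) {
			log.Printf("ERROR: disk stall detected: %v", info)
		},
		// WriteStallBegin fires when writes are blocked, e.g. because
		// memtables cannot be flushed.
		WriteStallBegin: func(info pebble.WriteStallBeginInfo) {
			log.Printf("ERROR: write stall began: %v", info)
		},
	}
	db, err := pebble.Open("", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```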
Looking into this, it looks like the nodes diverged in their knowledge of cluster versions, and n1 was at a higher cluster version, one that permitted writing sized point deletions.
It also looks like the joining nodes just got the older cluster version from the first node, even though the first node had already ratcheted its Pebble format version forward (indicating a Cockroach cluster version bump at least 4 seconds before this log line). Here are the relevant log lines from a joining node:
And here's the Pebble format major version (FMV) ratchet from the first node, 4 seconds earlier, indicating that the cluster version should have been ratcheted to at least
Either way, this seems to be outside of storage and squarely within the server / cluster-version / join code. It's also not easily reproducible.
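For context on the suspected ordering bug, here is a hedged sketch of the coupling one would expect between the cluster version and a store's Pebble format major version: the format is ratcheted only after the corresponding version gate is active cluster-wide. clusterVersionIsActive and the gate name are hypothetical stand-ins for CockroachDB's upgrade machinery; only the Pebble calls (FormatMajorVersion, RatchetFormatMajorVersion) are real API.

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

// clusterVersionIsActive is a hypothetical stand-in for CockroachDB's
// cluster-version gate check; the real check lives in the server/upgrade
// code, not here.
func clusterVersionIsActive(gate string) bool {
	return false // pretend the gate hasn't been activated yet
}

// maybeRatchetFormat ratchets the store's Pebble format major version only
// after the corresponding cluster version gate is active. The failure mode in
// this issue was the inverse ordering: the first node's Pebble format ran
// ahead of the cluster version it advertised over the Join RPC.
func maybeRatchetFormat(db *pebble.DB) error {
	// Illustrative gate name; the real cluster version key may differ.
	if !clusterVersionIsActive("PebbleFormatDeleteSizedAndObsolete") {
		return nil // too early; keep writing the older sstable format
	}
	if db.FormatMajorVersion() >= pebble.FormatDeleteSizedAndObsolete {
		return nil // already ratcheted
	}
	// Ratcheting persists the new format; it must never happen before every
	// node in the cluster understands that format.
	return db.RatchetFormatMajorVersion(pebble.FormatDeleteSizedAndObsolete)
}

func main() {
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := maybeRatchetFormat(db); err != nil {
		log.Fatal(err)
	}
}
```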
Given that the logs on n1 (which likely played the biggest role in returning the old cluster version in the Join RPC response) have rotated away, it's hard to confirm exactly what happened there.
I think this seems serious enough that we need to get to the bottom of it. I haven't looked at this in depth, but we do expect nodes to bump the cluster version asynchronously. Is the problem here that all nodes should ratchet each cluster version before proceeding to the next, and we require all nodes to apply each version gate before the next one is activated?

It's not clear to me why nodes began failing memtable flushes if they hadn't bumped the version gate either, but again, I haven't looked at this in detail.
@erikgrinaker That's right, the nodes should all be at one cluster version before proceeding to the next one. In particular, it looks like n1 (whose logs got rotated away) was several cluster versions ahead of the one it was advertising over the Join RPC. So it wrote sized point deletions, and when those writes got replicated to other nodes, those nodes couldn't write a sized deletion because their max sstable format didn't support it, resulting in flushes failing.

Pebble could fail in a more obvious fashion for a bug like this, which would have preserved the logs, as we wouldn't have indefinitely retried the flushes and only given up when the test timed out. That would be cockroachdb/pebble#270.
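To make that failure mode concrete, here is a minimal sketch of the guard implied by the explanation above: only emit a sized point deletion (DeleteSized) when the store's format major version supports it, and fall back to a plain Delete otherwise. This is a Pebble-only illustration; deletePoint is a hypothetical helper, and the real CockroachDB write path sits above MVCC and Raft replication rather than calling Pebble directly like this.

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

// deletePoint issues a sized point deletion only when the store's format
// major version supports it; otherwise it falls back to a plain Delete. In
// the failure above, the newer node effectively skipped a check like this
// (it believed the whole cluster had advanced), and replicas at an older
// format could not flush the resulting sized-deletion keys.
func deletePoint(db *pebble.DB, key []byte, valueSize uint32) error {
	b := db.NewBatch()
	defer b.Close()
	if db.FormatMajorVersion() >= pebble.FormatDeleteSizedAndObsolete {
		// A sized deletion carries an estimate of the deleted value's size,
		// which helps compaction heuristics reclaim space sooner.
		if err := b.DeleteSized(key, valueSize, nil); err != nil {
			return err
		}
	} else {
		if err := b.Delete(key, nil); err != nil {
			return err
		}
	}
	return b.Commit(pebble.Sync)
}

func main() {
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := db.Set([]byte("k"), []byte("v"), pebble.Sync); err != nil {
		log.Fatal(err)
	}
	if err := deletePoint(db, []byte("k"), uint32(len("v"))); err != nil {
		log.Fatal(err)
	}
}
```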
Nice, glad we figured it out -- thanks for digging!
roachtest.kv95/enc=false/nodes=4/ssds=8 failed with artifacts on master @ fbee0764a4a2953a1178ea8c3edd352545caa227:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_ssd=8
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Jira issue: CRDB-32566