sentry: checksum for AddSSTable does not match #90834
Similar to #52720, I think possibly explainable by disk corruption if the SST was sideloaded? It might be worthwhile to add more logging to this case.
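For illustration only, here is a minimal Go sketch of the kind of integrity check implied by this error: re-read the sideloaded SST from disk, recompute a checksum over its bytes, and compare it against the checksum recorded when the command was proposed. The function name, the choice of CRC-32, and the hard-coded path/checksum are assumptions for the example, not the actual CockroachDB implementation.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"os"
)

// verifySideloadedSST re-reads a sideloaded SST from disk and compares a
// freshly computed checksum against the checksum recorded at proposal time.
// A mismatch means the bytes read back are not the bytes that were written,
// i.e. likely disk-level corruption. (Illustrative sketch only; the real
// AddSSTable pre-apply check in CockroachDB differs in its details.)
func verifySideloadedSST(path string, expected uint32) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if actual := crc32.ChecksumIEEE(data); actual != expected {
		return fmt.Errorf("checksum for AddSSTable does not match: expected %08x, got %08x",
			expected, actual)
	}
	return nil
}

func main() {
	// Hypothetical usage: the path and expected checksum would come from
	// the applied Raft command rather than being hard-coded.
	if err := verifySideloadedSST("/path/to/sideloaded.sst", 0xdeadbeef); err != nil {
		fmt.Println(err)
	}
}
```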
Just a bit of background. This was a brand new cluster, all brand new hardware. I was doing an IMPORT INTO using massive datasets, about 12 files for 600GB of data. It happened during the import. I deleted the node and rejoined it to the cluster to get past it. If there is another way to get past it in case it happens again, please let me know.
I wanted to follow up on this. Andrew requested that I check these two machines, and after running the checks it turned out he was correct. I do have a bit of a worry about using Cockroach, though: one machine was able to bring the entire cluster down to a near halt. What if one drive goes bad in one machine? If it's not monitored correctly, it looked like the entire cluster was going to crash.
Hey @STRATZ-Ken, thanks for following up. Today a single instance of disk corruption on a node requires replacement of the entire node. A Cockroach cluster can tolerate this loss of a single node, and there should be minimal impact to the rest of the cluster. However, replacing a node does require up-replicating all its ranges, which takes time. We have #67568 tracking work to narrow the blast radius of disk corruption, allowing a node to remove its corrupt replicas without the need to replace the node and up-replicate all of the resident ranges.
@jbowens Thanks for the update. But this was a replication factor of 3 with 6 nodes, and the issue still occurred. Ken
I might be misunderstanding you, but the replication factor of 3 is what allows you to recover from this situation by replacing the node. Currently, with RF=3, if a Cockroach node experiences disk corruption, the node with the corruption exits ungracefully. The remaining nodes in the cluster continue to operate and will up-replicate the under-replicated ranges once the node with the corruption is declared dead. It is exactly the replication factor of 3 that allows this recovery without data loss or ever returning inconsistent results.
That's why the ticket was created. I was unable to recover the node in question (actually, two nodes). I tried rebooting, but the node would constantly crash. I had to purge the data on the two nodes, decommission them from the cluster, and then rejoin them to get the nodes back online. Nothing else would work.
Reiterating: this is the expectation today, although it is something we hope to improve with #67568.
This explains why the cluster suffered unavailability. With a replication factor of 3, you can tolerate the loss of only one node. Any ranges with replicas resident on both of the failed nodes would have lost quorum.
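To make the quorum arithmetic concrete, here is a small Go sketch (an illustration of Raft majority counting under assumed inputs, not CockroachDB code): with RF=3 a range needs 2 live replicas to commit writes, so a range whose replicas sit on both failed nodes is left with only 1 of 3 and stalls.

```go
package main

import "fmt"

// quorum returns the number of replicas a Raft group of the given size
// needs in order to commit writes: a strict majority.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// rangeAvailable reports whether a range can still make progress given
// how many of its replicas live on failed nodes.
func rangeAvailable(replicationFactor, replicasOnFailedNodes int) bool {
	return replicationFactor-replicasOnFailedNodes >= quorum(replicationFactor)
}

func main() {
	// RF=3: losing one node leaves 2 of 3 replicas, still a majority.
	fmt.Println(rangeAvailable(3, 1)) // true
	// RF=3: a range with replicas on both failed nodes keeps only 1 of 3.
	fmt.Println(rangeAvailable(3, 2)) // false
}
```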
@jbowens When I checked the log, it looked like N2 crashed because it could not connect to N9 (the first crash, the one with the bad hard drive). I should have saved all the logs, my bad.
It's not likely that any communication error caused a node crash. Cockroach nodes will tolerate the loss of peers; ranges just won't be able to make progress if a majority of the replicas are absent. Given how widespread disk corruption issues appear to be on the cluster, it seems likely the other node also crashed with disk-level corruption.
@jbowens Just a bit of an update on my tasks. I just deleted the entire cluster and re-ran the whole process.
@jbowens Just to update: I re-checked all the drives, and every drive is good and passes its checks.
We have marked this issue as stale because it has been inactive for some time.
Closing as stale.
This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.
Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3703257674/?referrer=webhooks_plugin
Panic message:
Stacktrace:
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal.go#L463-L465 in pkg/kv/kvserver.addSSTablePreApply
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L675-L677 in pkg/kv/kvserver.(*replicaAppBatch).runPreApplyTriggersAfterStagingWriteBatch
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L565-L567 in pkg/kv/kvserver.(*replicaAppBatch).Stage
cockroach/pkg/kv/kvserver/apply/cmd.go, lines 183 to 185 in cc25a78
cockroach/pkg/kv/kvserver/apply/task.go, lines 278 to 280 in cc25a78
cockroach/pkg/kv/kvserver/apply/task.go, lines 245 to 247 in cc25a78
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1037-L1039 in pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L656-L658 in pkg/kv/kvserver.(*Replica).handleRaftReady
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L640-L642 in pkg/kv/kvserver.(*Store).processReady
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L307-L309 in pkg/kv/kvserver.(*raftScheduler).worker
cockroach/pkg/util/stop/stopper.go, lines 488 to 490 in cc25a78
GOROOT/src/runtime/asm_amd64.s#L1593-L1595 in runtime.goexit
v22.2.0-beta.4
Jira issue: CRDB-20964