sentry: checksum for AddSSTable does not match #90834

Closed
cockroach-teamcity opened this issue Oct 28, 2022 · 14 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-sentry: Originated from an in-the-wild panic report.

Comments


cockroach-teamcity commented Oct 28, 2022

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3703257674/?referrer=webhooks_plugin

Panic message:

replica_proposal.go:464: log.Fatal: checksum for AddSSTable at index term 6, index 12 does not match; at proposal time 7502948a (1963103370), now c5acc0c8 (3316433096)
(1) attached stack trace
-- stack trace:
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.addSSTablePreApply
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal.go:464
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).runPreApplyTriggersAfterStagingWriteBatch
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:676
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).Stage
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go:566
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCmdIter
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:184
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:279
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:246
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1038
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:657
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:641
| github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker
| github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:308
| github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
| github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:489
| runtime.goexit
| GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (2) log.Fatal: checksum for AddSSTable at index term 6, index 12 does not match; at proposal time 7502948a (1963103370), now c5acc0c8 (3316433096)
Error types: (1) *withstack.withStack (2) *errutil.leafError
-- report composition:
*errutil.leafError: log.Fatal: checksum for AddSSTable at index term 6, index 12 does not match; at proposal time 7502948a (1963103370), now c5acc0c8 (3316433096)
replica_proposal.go:464: *withstack.withStack (top exception)

Stacktrace (expand for inline code snippets):

https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal.go#L463-L465 in pkg/kv/kvserver.addSSTablePreApply
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L675-L677 in pkg/kv/kvserver.(*replicaAppBatch).runPreApplyTriggersAfterStagingWriteBatch
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L565-L567 in pkg/kv/kvserver.(*replicaAppBatch).Stage

cur := iter.Cur()
checked, err := fn(cur.Ctx(), cur)
if err != nil {
in pkg/kv/kvserver/apply.mapCmdIter
// Stage each command in the batch.
stagedIter, err := mapCmdIter(batchIter, batch.Stage)
if err != nil {
in pkg/kv/kvserver/apply.(*Task).applyOneBatch
for iter.Valid() {
if err := t.applyOneBatch(ctx, iter); err != nil {
// If the batch threw an error, reject all remaining commands in the
in pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1037-L1039 in pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L656-L658 in pkg/kv/kvserver.(*Replica).handleRaftReady
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L640-L642 in pkg/kv/kvserver.(*Store).processReady
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L307-L309 in pkg/kv/kvserver.(*raftScheduler).worker
sp.UpdateGoroutineIDToCurrent()
f(ctx)
}()
in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
GOROOT/src/runtime/asm_amd64.s#L1593-L1595 in runtime.goexit

pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal.go in pkg/kv/kvserver.addSSTablePreApply at line 464
pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go in pkg/kv/kvserver.(*replicaAppBatch).runPreApplyTriggersAfterStagingWriteBatch at line 676
pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go in pkg/kv/kvserver.(*replicaAppBatch).Stage at line 566
pkg/kv/kvserver/apply/cmd.go in pkg/kv/kvserver/apply.mapCmdIter at line 184
pkg/kv/kvserver/apply/task.go in pkg/kv/kvserver/apply.(*Task).applyOneBatch at line 279
pkg/kv/kvserver/apply/task.go in pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries at line 246
pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked at line 1038
pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.(*Replica).handleRaftReady at line 657
pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go in pkg/kv/kvserver.(*Store).processReady at line 641
pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go in pkg/kv/kvserver.(*raftScheduler).worker at line 308
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 489
GOROOT/src/runtime/asm_amd64.s in runtime.goexit at line 1594
Tag                 Value
Cockroach Release   v22.2.0-beta.4
Cockroach SHA:      cc25a78
Platform            linux amd64
Distribution        CCL
Environment         v22.2.0-beta.4
Command             server
Go Version          ``
# of CPUs
# of Goroutines

Jira issue: CRDB-20964

cockroach-teamcity added the C-bug and O-sentry labels Oct 28, 2022

jbowens commented Oct 28, 2022

Similar to #52720; I think this is possibly explainable by disk corruption if the SST was sideloaded? Might be worthwhile to add more logging to this case.
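For context on what the fatal message above is reporting: the AddSSTable proposal carries a checksum of the SST payload computed at proposal time, and just before applying a sideloaded SST the replica re-reads the file from disk and recomputes the checksum; a mismatch means the bytes on disk are no longer the bytes that were proposed. Below is a minimal sketch of that shape; the function name, signature, and CRC table are illustrative assumptions, not the actual kvserver code (the real check lives in addSSTablePreApply in replica_proposal.go per the stack trace).

```go
// Illustrative sketch only, not the actual kvserver implementation.
package sketch

import (
	"fmt"
	"hash/crc32"
	"os"
)

// verifySideloadedSST re-reads a sideloaded SST from disk and compares its
// checksum against the checksum recorded when the command was proposed.
// A mismatch indicates the on-disk bytes differ from the proposed bytes,
// e.g. because of disk corruption.
func verifySideloadedSST(path string, proposalCRC uint32) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	// Assumption: CRC32-Castagnoli; the real code may use a different table.
	now := crc32.Checksum(data, crc32.MakeTable(crc32.Castagnoli))
	if now != proposalCRC {
		return fmt.Errorf(
			"checksum for AddSSTable does not match; at proposal time %08x (%d), now %08x (%d)",
			proposalCRC, proposalCRC, now, now)
	}
	return nil
}
```

In the report above, the recomputed value differs from the proposal-time value, which is why disk corruption of a sideloaded SST is a plausible explanation.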

@STRATZ-Ken

Just a bit of background.

This was a brand new cluster.

All brand new hardware. I was doing an IMPORT INTO using massive datasets, about 12 files totaling 600 GB of data. It happened during the import.

I deleted the node and rejoined it to the cluster to get past it. If there is another way to get past it in case it happens again, please let me know.

@jbowens jbowens changed the title sentry: replica_proposal.go:464: log.Fatal: checksum for AddSSTable at index term 6, index 12 does not match; at proposal time 7502948a (1963103370), now c5acc0c8 (3316433096) (1) attached stack trace -- st... sentry: checksum for AddSSTable does not match Oct 28, 2022
@STRATZ-Ken

I wanted to follow up on this.

Andrew requested that I check these two machines, and he was correct. After running a badblocks test on the two machines, I found that 1 of the 2 hard drives in each machine reported read/write errors. I am going to RMA the drives.

I do have a bit of a worry about using CockroachDB, though: how one machine can bring the entire cluster to a near halt. If one drive goes bad in one machine and it isn't monitored correctly, it looked like the entire cluster was going to crash.


jbowens commented Oct 31, 2022

Hey @STRATZ-Ken, thanks for following up. Today, a single instance of disk corruption on a node requires replacement of the entire node. A Cockroach cluster can tolerate the loss of a single node, and there should be minimal impact to the rest of the cluster. However, replacing a node does require up-replicating all its ranges, which takes time. We have #67568 tracking work to narrow the blast radius of disk corruption, allowing a node to remove its corrupt replicas without the need to replace the node and upreplicate all of the resident ranges.


STRATZ-Ken commented Oct 31, 2022

@jbowens Thanks for the update. But this was a cluster with replication factor 3 and 6 nodes, and the issue still occurred.

Ken


jbowens commented Oct 31, 2022

I might be misunderstanding you, but the replication factor of 3 allows you to recover from this situation by replacing the node.

Currently, with RF=3, if a Cockroach node experiences disk corruption, the node with the corruption exits ungracefully. The remaining nodes in the cluster continue to operate and will upreplicate the underreplicated ranges once the corrupted node is declared dead. It is exactly the replication factor of 3 that allows this recovery without data loss or ever returning inconsistent results.

@STRATZ-Ken

That's why the ticket was created. I was unable to recover the node in question. Actually, 2 nodes. I tried rebooting, but the node would constantly crash. I had to purge the data on the two nodes, decommission them from the cluster, then rejoin them to get the nodes back online. Nothing else would work.


jbowens commented Nov 1, 2022

I tried rebooting, but the node would constantly crash.

Reiterating, this is the expectation, although it is something we hope to improve with #67568.

Actually, 2 nodes.

This explains why the cluster suffered unavailability. With a replication factor of 3, you can tolerate the loss of only 1 node. Any ranges resident on both failed nodes would’ve lost quorum.
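To make the quorum arithmetic concrete, here is a small illustrative snippet (not CockroachDB code):

```go
package sketch

// quorum returns how many replicas of a range must be available for the
// range to make progress: a strict majority of the replication factor.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// With RF=3, quorum(3) == 2, so a range survives the loss of 1 replica.
// A range with replicas on both failed nodes has only 1 of 3 replicas
// left, which is below quorum, so that range becomes unavailable until a
// node comes back or the range is repaired.
```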

@STRATZ-Ken

@jbowens When I checked the log, it looked like N2 crashed because it could not connect to N9 (the node with the first crash and the bad hard drive). I should have saved all the logs, my bad.


jbowens commented Nov 1, 2022

It's not likely that any communication error caused a node crash. Cockroach nodes will tolerate loss of peers. Ranges just won't be able to make progress if a majority of the replicas are absent. Given how widespread disk corruption issues appear to be on the cluster, it seems likely the other node also crashed with disk-level corruption.

@STRATZ-Ken

@jbowens Just a bit of an update on my tasks.

I just deleted the entire cluster. I re-ran badblocks on every drive just to reconfirm there were no issues. I re-formatted the drives, and am now loading the TPCC dataset with 10 partitions and 25,000 warehouses. I will conduct this test and report back.

@STRATZ-Ken

@jbowens just to update.

I re-checked all the drives; every drive is good and passes badblocks. I did find that the ulimit in Ubuntu was set to 1024, which was causing a bunch of issues. Once I set this to 65,000, things seemed to calm down. If I had to guess, a connection was attempted in order to repair the block, and since the ulimit blocked the connection (no free file descriptors), the repair failed. Maybe?
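As an aside, the open-file limit that `ulimit -n` reports can also be inspected programmatically; here is a small illustrative Go snippet (not CockroachDB code; CockroachDB itself checks this limit at startup and warns when it is too low):

```go
// Illustrative: read the process's RLIMIT_NOFILE on Linux, analogous to
// what `ulimit -n` shows in the shell.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// rl.Cur is the soft limit (what "ulimit -n" reports); rl.Max is the hard limit.
	fmt.Printf("open-file limit: soft %d, hard %d\n", rl.Cur, rl.Max)
}
```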


We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@yuzefovich

closing as stale
