sentry: checksum for AddSSTable does not match #90834
Similar to #52720, I think possibly explainable by disk corruption if the SST was sideloaded? It might be worthwhile to add more logging to this case.
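For illustration only, here is a minimal Go sketch of the kind of integrity check implied by this error: re-read the sideloaded SST from disk, recompute a checksum over its bytes, and compare it against the checksum recorded when the command was proposed. The function name, the choice of CRC-32, and the hard-coded path/checksum are assumptions for the example, not the actual CockroachDB implementation.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"os"
)

// verifySideloadedSST re-reads a sideloaded SST from disk and compares a
// freshly computed checksum against the checksum recorded at proposal time.
// A mismatch means the bytes read back are not the bytes that were written,
// i.e. likely disk-level corruption. (Illustrative sketch only; the real
// AddSSTable pre-apply check in CockroachDB differs in its details.)
func verifySideloadedSST(path string, expected uint32) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	if actual := crc32.ChecksumIEEE(data); actual != expected {
		return fmt.Errorf("checksum for AddSSTable does not match: expected %08x, got %08x",
			expected, actual)
	}
	return nil
}

func main() {
	// Hypothetical usage: the path and expected checksum would come from
	// the applied Raft command rather than being hard-coded.
	if err := verifySideloadedSST("/path/to/sideloaded.sst", 0xdeadbeef); err != nil {
		fmt.Println(err)
	}
}
```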
Just a bit of background. This was a brand new cluster, all brand new hardware. I was doing an IMPORT INTO using massive datasets, about 12 files for 600GB of data. It happened during the import. I deleted the node and rejoined it to the cluster to get past it. If there is another way to get past it in case it happens again, please let me know.
I wanted to follow up on this. Andrew requested that I check these two machines, and after running the checks it turned out he was correct. I do have a bit of a worry about using Cockroach, though: one machine was able to bring the entire cluster down to a near halt. What if one drive goes bad in one machine? If it's not monitored correctly, it looked like the entire cluster was going to crash.
Hey @STRATZ-Ken, thanks for following up. Today a single instance of disk corruption on a node requires replacement of the entire node. A Cockroach cluster can tolerate this loss of a single node, and there should be minimal impact to the rest of the cluster. However, replacing a node does require up-replicating all its ranges, which takes time. We have #67568 tracking work to narrow the blast radius of disk corruption, allowing a node to remove its corrupt replicas without the need to replace the node and up-replicate all of the resident ranges.
@jbowens Thanks for the update. But this was a replication factor of 3 with 6 nodes, and the issue still occurred. Ken
I might be misunderstanding you, but the replication factor of 3 is what allows you to recover from this situation by replacing the node. Currently, with RF=3, if a Cockroach node experiences disk corruption, the node with the corruption exits ungracefully. The remaining nodes in the cluster continue to operate and will up-replicate the under-replicated ranges once the node with the corruption is declared dead. It is exactly the replication factor of 3 that allows this recovery without data loss or ever returning inconsistent results.
That's why the ticket was created. I was unable to recover the node in question (actually, two nodes). I tried rebooting, but the node would constantly crash. I had to purge the data on the two nodes, decommission them from the cluster, and then rejoin them to get the nodes back online. Nothing else would work.
Reiterating: this is the expectation today, although it is something we hope to improve with #67568.
This explains why the cluster suffered unavailability. With a replication factor of 3, you can tolerate the loss of only one node. Any ranges with replicas resident on both of the failed nodes would have lost quorum.
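To make the quorum arithmetic concrete, here is a small Go sketch (an illustration of Raft majority counting under assumed inputs, not CockroachDB code): with RF=3 a range needs 2 live replicas to commit writes, so a range whose replicas sit on both failed nodes is left with only 1 of 3 and stalls.

```go
package main

import "fmt"

// quorum returns the number of replicas a Raft group of the given size
// needs in order to commit writes: a strict majority.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// rangeAvailable reports whether a range can still make progress given
// how many of its replicas live on failed nodes.
func rangeAvailable(replicationFactor, replicasOnFailedNodes int) bool {
	return replicationFactor-replicasOnFailedNodes >= quorum(replicationFactor)
}

func main() {
	// RF=3: losing one node leaves 2 of 3 replicas, still a majority.
	fmt.Println(rangeAvailable(3, 1)) // true
	// RF=3: a range with replicas on both failed nodes keeps only 1 of 3.
	fmt.Println(rangeAvailable(3, 2)) // false
}
```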
@jbowens When I checked the log, it looked like N2 crashed because it could not connect to N9 (the first crash, the one with the bad hard drive). I should have saved all the logs, my bad.
It's not likely that any communication error caused a node crash. Cockroach nodes will tolerate the loss of peers; ranges just won't be able to make progress if a majority of the replicas are absent. Given how widespread disk corruption issues appear to be on the cluster, it seems likely the other node also crashed with disk-level corruption.
@jbowens Just a bit of an update on my tasks. I just deleted the entire cluster and re-ran the whole process.
@jbowens Just to update: I re-checked all the drives, and every drive is good and passes its checks.
We have marked this issue as stale because it has been inactive for some time.
Closing as stale.
This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.
Sentry link: https://sentry.io/organizations/cockroach-labs/issues/3703257674/?referrer=webhooks_plugin
Panic message:
Stacktrace:
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal.go#L463-L465 in pkg/kv/kvserver.addSSTablePreApply
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L675-L677 in pkg/kv/kvserver.(*replicaAppBatch).runPreApplyTriggersAfterStagingWriteBatch
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_application_state_machine.go#L565-L567 in pkg/kv/kvserver.(*replicaAppBatch).Stage
cockroach/pkg/kv/kvserver/apply/cmd.go, lines 183 to 185 in cc25a78
cockroach/pkg/kv/kvserver/apply/task.go, lines 278 to 280 in cc25a78
cockroach/pkg/kv/kvserver/apply/task.go, lines 245 to 247 in cc25a78
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1037-L1039 in pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L656-L658 in pkg/kv/kvserver.(*Replica).handleRaftReady
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L640-L642 in pkg/kv/kvserver.(*Store).processReady
https://github.com/cockroachdb/cockroach/blob/cc25a7893ea924897fcc6d3e80d116c85666e8eb/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L307-L309 in pkg/kv/kvserver.(*raftScheduler).worker
cockroach/pkg/util/stop/stopper.go, lines 488 to 490 in cc25a78
GOROOT/src/runtime/asm_amd64.s#L1593-L1595 in runtime.goexit
v22.2.0-beta.4
Jira issue: CRDB-20964