Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: possible batch mutation during commit [inconsistent batch count] #113268

Closed
cockroach-sentry opened this issue Oct 29, 2023 · 6 comments
Closed
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.

Comments

@cockroach-sentry
Copy link
Collaborator

cockroach-sentry commented Oct 29, 2023

This issue was auto filed by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry Link: https://cockroach-labs.sentry.io/issues/4583696564/?referrer=webhooks_plugin

Panic Message:

db.go:811: log.Fatal: pebble: fatal commit error: pebble: inconsistent batch count: 7046526 vs 7046582
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/pebble.(*DB).applyInternal
  | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:811
  | github.com/cockroachdb/pebble.(*DB).Apply
  | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:746
  | github.com/cockroachdb/pebble.(*Batch).Commit
  | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/batch.go:1109
  | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleBatch).Commit
  | 	github.com/cockroachdb/cockroach/pkg/storage/pebble_batch.go:549
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).ApplyToStateMachine
  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_app_batch.go:562
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch
  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:290
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:251
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1005
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:718
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:646
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:395
  | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
  | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:302
  | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
  | runtime.goexit
  | 	GOROOT/src/runtime/asm_amd64.s:1594
Wraps: (2) secondary error attachment
  | pebble: inconsistent batch count: 7046526 vs 7046582
  | (1) forced error mark
  |   | ×
  |   | github.com/cockroachdb/errors/withstack/*withstack.withStack::
  | Wraps: (2) attached stack trace
  |   -- stack trace:
  |   | github.com/cockroachdb/pebble/internal/base.CorruptionErrorf
  |   | 	github.com/cockroachdb/pebble/internal/base/external/com_github_cockroachdb_pebble/internal/base/error.go:27
  |   | github.com/cockroachdb/pebble.(*memTable).apply
  |   | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/mem_table.go:224
  |   | github.com/cockroachdb/pebble.(*DB).commitApply
  |   | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:831
  |   | github.com/cockroachdb/pebble.(*commitPipeline).Commit
  |   | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/commit.go:317
  |   | github.com/cockroachdb/pebble.(*DB).applyInternal
  |   | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:808
  |   | github.com/cockroachdb/pebble.(*DB).Apply
  |   | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go:746
  |   | github.com/cockroachdb/pebble.(*Batch).Commit
  |   | 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/batch.go:1109
  |   | github.com/cockroachdb/cockroach/pkg/storage.(*pebbleBatch).Commit
  |   | 	github.com/cockroachdb/cockroach/pkg/storage/pebble_batch.go:549
  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaAppBatch).ApplyToStateMachine
  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_app_batch.go:562
  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch
  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:290
  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries
  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:251
  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked
  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1005
  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady
  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:718
  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady
  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:646
  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftSchedulerShard).worker
  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:395
  |   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
  |   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:302
  |   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  |   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470
  |   | runtime.goexit
  |   | 	GOROOT/src/runtime/asm_amd64.s:1594
  | Wraps: (3) pebble: inconsistent batch count: 7046526 vs 7046582
  | Error types: (1) *markers.withMark (2) *withstack.withStack (3) *errutil.leafError
Wraps: (3) log.Fatal: pebble: fatal commit error: pebble: inconsistent batch count: 7046526 vs 7046582
Error types: (1) *withstack.withStack (2) *secondary.withSecondaryError (3) *errutil.leafError
-- report composition:
*errutil.leafError: log.Fatal: pebble: fatal commit error: pebble: inconsistent batch count: 7046526 vs 7046582
*secondary.withSecondaryError: details for github.com/cockroachdb/errors/withstack/*withstack.withStack:::
db.go:811: *withstack.withStack (top exception)
Stacktrace (expand for inline code snippets):

GOROOT/src/runtime/asm_amd64.s#L1593-L1595

sp.UpdateGoroutineIDToCurrent()
f(ctx)
}()

https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L301-L303
https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go#L394-L396
https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go#L645-L647
https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L717-L719
https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go#L1004-L1006
for iter.Valid() {
if err := t.applyOneBatch(ctx, iter); err != nil {
// If the batch threw an error, reject all remaining commands in the

// Apply the persistent state transitions to the state machine.
if err := batch.ApplyToStateMachine(ctx); err != nil {
return err

https://github.com/cockroachdb/cockroach/blob/810d4f27a7f02b9cc2750cab654ed1c62ac3e75a/pkg/kv/kvserver/pkg/kv/kvserver/replica_app_batch.go#L561-L563
}
err := p.batch.Commit(opts)
if err != nil {

github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/batch.go#L1108-L1110
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go#L745-L747
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go#L810-L812

GOROOT/src/runtime/asm_amd64.s in runtime.goexit at line 1594
pkg/util/stop/stopper.go in pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2 at line 470
pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go in pkg/kv/kvserver.(*raftScheduler).Start.func2 at line 302
pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go in pkg/kv/kvserver.(*raftSchedulerShard).worker at line 395
pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go in pkg/kv/kvserver.(*Store).processReady at line 646
pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.(*Replica).handleRaftReady at line 718
pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go in pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked at line 1005
pkg/kv/kvserver/apply/task.go in pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries at line 251
pkg/kv/kvserver/apply/task.go in pkg/kv/kvserver/apply.(*Task).applyOneBatch at line 290
pkg/kv/kvserver/pkg/kv/kvserver/replica_app_batch.go in pkg/kv/kvserver.(*replicaAppBatch).ApplyToStateMachine at line 562
pkg/storage/pebble_batch.go in pkg/storage.(*pebbleBatch).Commit at line 549
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/batch.go in github.com/cockroachdb/pebble.(*Batch).Commit at line 1109
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go in github.com/cockroachdb/pebble.(*DB).Apply at line 746
github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/db.go in github.com/cockroachdb/pebble.(*DB).applyInternal at line 811

Tags

Tag Value
Command start-single-node
Environment v23.1.2
Go Version go1.19.4
Platform linux amd64
Distribution CCL
Cockroach Release v23.1.2
Cockroach SHA 810d4f2
# of CPUs 6
# of Goroutines 480

Jira issue: CRDB-32838

@cockroach-sentry cockroach-sentry added O-sentry Originated from an in-the-wild panic report. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Oct 29, 2023
@jbowens jbowens changed the title Sentry: db.go:811: log.Fatal: pebble: fatal commit error: pebble: inconsistent batch count: 7046526 vs 7046582 (1) attached stack trace -- stack trace: | github.com/cockroachdb/pebble.(*DB).applyI... storage: pebble: fatal commit error: inconsistent batch count Oct 30, 2023
@jbowens jbowens added A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Oct 30, 2023
@jbowens
Copy link
Collaborator

jbowens commented Oct 30, 2023

I see a few possible explanations here:

  1. The tail of the batch was corrupt due to some form of memory corruption. If we fail to decode something from the batch, we return ok=false from BatchReader.Next and it's indistinguishable from the end of the batch. I filed batch: detect corrupt batches during reading pebble#3023 for erroring appropriately if the batch is corrupt.
  2. The batch received additional write operations after it already entered the commit pipeline. I filed batch: guard against additional batch writes during Commit pebble#3024 for improving detection of write ops against committ[ing|ed] batches.

@cockroachdb/kv – want to check if code auditing reveals some possibility of mutating a batch while it's being committed in this code path? Feel free to close out otherwise, and Storage will address the two filed issues.

@blathers-crl blathers-crl bot added the T-kv KV Team label Oct 30, 2023
@jbowens jbowens added the A-kv Anything in KV that doesn't belong in a more specific category. label Oct 30, 2023
@jbowens jbowens removed A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Oct 30, 2023
@jbowens jbowens changed the title storage: pebble: fatal commit error: inconsistent batch count kv: possible batch mutation during commit [inconsistent batch count] Oct 30, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Oct 30, 2023
Add a guardrail that ensures a batch isn't in the process of being committed
when applying a mutation to the Batch. We've observed panics during batch
application that suggest concurrent modification of the batch, and these
guardrails will help surface any races more readily.

Informs cockroachdb/cockroach#113268
Resolve cockroachdb#3024.
jbowens added a commit to jbowens/pebble that referenced this issue Oct 30, 2023
Add a guardrail that ensures a batch isn't in the process of being committed
when applying a mutation to the Batch. We've observed panics during batch
application that suggest concurrent modification of the batch, and these
guardrails will help surface any races more readily.

Informs cockroachdb/cockroach#113268
Resolve cockroachdb#3024.
jbowens added a commit to jbowens/pebble that referenced this issue Nov 1, 2023
Add a guardrail that ensures a batch isn't in the process of being committed
when applying a mutation to the Batch. We've observed panics during batch
application that suggest concurrent modification of the batch, and these
guardrails will help surface any races more readily.

Informs cockroachdb/cockroach#113268
Resolve cockroachdb#3024.
jbowens added a commit to cockroachdb/pebble that referenced this issue Nov 1, 2023
Add a guardrail that ensures a batch isn't in the process of being committed
when applying a mutation to the Batch. We've observed panics during batch
application that suggest concurrent modification of the batch, and these
guardrails will help surface any races more readily.

Informs cockroachdb/cockroach#113268
Resolve #3024.
@exalate-issue-sync exalate-issue-sync bot added T-kv-replication and removed T-kv KV Team labels Nov 6, 2023
Copy link

blathers-crl bot commented Nov 6, 2023

cc @cockroachdb/replication

@erikgrinaker
Copy link
Contributor

@pavelkalinnikov Can you have a look at this when you get a chance?

@pav-kv
Copy link
Collaborator

pav-kv commented Nov 7, 2023

@jbowens The batch construction and commit in this code path is sequential and isolated (not reused by other paths), I couldn't quickly find a way in which it could be modified while committing.

Basically it's confined within this block:

batch := t.sm.NewBatch()
defer batch.Close()
// Consume a batch-worth of commands.
pol := trivialPolicy{maxCount: t.batchSize}
batchIter := takeWhileCmdIter(iter, func(cmd Command) bool {
return pol.maybeAdd(cmd.IsTrivial())
})
// Stage each command in the batch.
stagedIter, err := mapCmdIter(batchIter, batch.Stage)
if err != nil {
return err
}
// Apply the persistent state transitions to the state machine.
if err := batch.ApplyToStateMachine(ctx); err != nil {
return err
}

@pav-kv
Copy link
Collaborator

pav-kv commented Nov 7, 2023

@jbowens Does this error happen only at a single-batch scale? Can it happen that some batch somewhere else is modified while committing, but our batch happens to notice this?

@jbowens
Copy link
Collaborator

jbowens commented Nov 7, 2023

Single batch scale, although batches are pooled, so it could conceivably be the result of a batch elsewhere receiving writes after it’s been closed and reused.

RaduBerinde pushed a commit to RaduBerinde/pebble that referenced this issue Nov 30, 2023
Add a guardrail that ensures a batch isn't in the process of being committed
when applying a mutation to the Batch. We've observed panics during batch
application that suggest concurrent modification of the batch, and these
guardrails will help surface any races more readily.

Informs cockroachdb/cockroach#113268
Resolve cockroachdb#3024.
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-sentry Originated from an in-the-wild panic report.
Projects
No open projects
Archived in project
Development

No branches or pull requests

4 participants