
sentry: replica_consistency.go:152: consistency check failed with %d inconsistent replicas | int #33220

Closed
cockroach-teamcity opened this issue Dec 18, 2018 · 8 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-sentry: Originated from an in-the-wild panic report.

Comments

@cockroach-teamcity (Member)

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/cockroach-labs/cockroachdb/issues/812705860/?referrer=webhooks_plugin

Panic message:

replica_consistency.go:152: consistency check failed with %d inconsistent replicas | int

Stacktrace:

storage/replica_consistency.go in storage.(*Replica).CheckConsistency at line 161
storage/consistency_queue.go in storage.(*consistencyQueue).process at line 117
storage/queue.go in storage.(*baseQueue).processReplica at line 761
storage/queue.go in storage.(*baseQueue).processLoop.func1.2 at line 639
util/stop/stopper.go in util/stop.(*Stopper).RunAsyncTask.func1 at line 324
Cockroach Release: v2.1.2
Cockroach SHA: 8d807ce
Platform: linux amd64
Distribution: CCL
Environment: v2.1.2
Command: server
Go Version: go1.10.3
# of CPUs: 8
# of Goroutines: 197
gomaxprocs: 8
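
The literal %d and the trailing "| int" are what a telemetry-scrubbed report looks like: only the format string and the types of its arguments leave the cluster, never the values (the actual count, 1, shows up later in the cluster's own logs below). A minimal sketch of the idea, not CockroachDB's actual crash-reporting code:

package main

import (
    "fmt"
    "strings"
)

// scrub builds a telemetry-safe message: it keeps the format string
// verbatim and appends only the *types* of the arguments, so that
// potentially sensitive values never leave the cluster.
func scrub(format string, args ...interface{}) string {
    types := make([]string, len(args))
    for i, a := range args {
        types[i] = fmt.Sprintf("%T", a)
    }
    return format + " | " + strings.Join(types, ", ")
}

func main() {
    // Prints: consistency check failed with %d inconsistent replicas | int
    fmt.Println(scrub("consistency check failed with %d inconsistent replicas", 1))
}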
cockroach-teamcity added the C-bug and O-sentry labels on Dec 18, 2018
@tbg (Member) commented Dec 18, 2018

cc @bdarnell. Unfortunately there isn't much information to extract here.

@sentry-io bot commented Dec 18, 2018

Sentry issue: COCKROACHDB-JE

@bdarnell (Contributor) commented:

Yes, not much to go on here. This is a different cluster ID than the privately-reported issue that you and I were looking at yesterday, though. That one got grouped into the sentry issue I linked above, along with several other clusters. These reports date back as far as 2.1.0-beta-20180904.

The privately-reported issue shows a regular userspace MVCC metadata key that appears on only one of the followers. The cluster was undergoing heavy split activity at the time and it looks like a split occurred between the writing of the intent and its resolution, possibly at the same time.

As best I can reconstruct the sequence of events, the key in question was written to range 6716 at approximately 2018-12-14T10:41:45.498736Z (this is the client's now() timestamp, so the intent would be written shortly after that). r6716 was on n2, n3, and n5 with n2 as leader and leaseholder. r6716 split at 10:42:26:

I181214 10:42:26.298936 579537264 storage/replica_command.go:317  [n2,split,s2,r6716/3:/Table/17{6/3/2018…-7}] initiating a split of this range at key /Table/176/3/2018-12-14T10:40:50.544926Z/"5h17tae7c8n9oic7" [r6717]; may cause Raft snapshot to r6716/1: next = 1, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0

The target of that raft snapshot is n5, and it failed immediately:

E181214 10:42:27.008303 579540807 storage/queue.go:805  [n2,raftsnapshot,s2,r6716/3:/Table/17{6/3/2018…-7}] snapshot failed: (n5,s5):1: remote couldn't accept Raft snapshot 5e6eedb0 at applied index 12644 with error: [n5,s5],r6716: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n5,s5,r6715/1:/Table/17{6/3/2018…-7}]

I don't think that snapshot to n5 matters because the inconsistency is eventually reported between n2 and n3. Before that happens, though, r6717 splits again, and the snapshot also fails to apply:

I181214 10:44:31.726309 579804779 storage/replica_command.go:317  [n2,split,s2,r6717/3:/Table/17{6/3/2018…-7}] initiating a split of this range at key /Table/176/3/2018-12-14T10:42:52.454548Z/"xer9g6x6j6thtol2" [r6718]; may cause Raft snapshot to r6717/2: next = 1, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0
E181214 10:44:33.124169 579809490 storage/queue.go:805  [n2,raftsnapshot,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot c291704f at applied index 13003 with error: [n3,s3],r6717: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r6716/2:/Table/17{6/3/2018…-7}]

This retries a lot and spams the logs; the last reported snapshot failure is at:

E181214 10:44:42.724264 579818202 storage/queue.go:805  [n2,raftsnapshot,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 0a765aa6 at applied index 13004 with error: [n3,s3],r6717: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r6716/2:/Table/17{6/3/2018…-7}]

Unfortunately we don't have much logging of successes. A few minutes later, we have a consistency checker panic:

E181214 10:56:12.359787 581203638 storage/replica_consistency.go:97  [n2,consistencyChecker,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] replica (n3,s3):2 is inconsistent: expected checksum 60c53b9e183b7a64a102830dddd2a1b58d513b078b429aa24fae201da58c760df7a9b44a8b77df7952ddb6a95c9591661d6e863553e2a1af04760ae465bd989d, got 297b26f3fc903511b66571dc0235cf0fb450b488292ca5fe7771f82c469da4b153cdc3b660a054d351fb09f7ef99e9786c2099362aca77b05e260041c10ad03c
E181214 10:56:12.359814 581203638 storage/replica_consistency.go:158  [n2,consistencyChecker,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] consistency check failed with 1 inconsistent replicas; fetching details

E181214 10:56:17.454260 581203638 storage/replica_consistency.go:97  [n2,consistencyChecker,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] replica (n3,s3):2 is inconsistent: expected checksum 837c7b972bb0972eac75f1a614697580aa695d161a6d5fa128afad5c4abc7747989c1fadeddf605b34a9f29f89ffa916599c892227ae9ca2e26fe72082b2f0c0, got 77cdc5f5e2c0516425e820f4b9a7559046143851d14952b6bc448e4b9b533cb9e5f35e6b1ac89e6171557cb2063623a69f9b11ad250d25b9e952efd86dd5b320
--- leaseholder
+++ follower
+0.000000000,0 /Table/176/3/2018-12-14T10:41:45.498736Z/"6vvq92gflc0rq1ok"/0
+    ts:<zero>
+    value:"[redacted]"
+    raw mvcc_key/value: [redacted]
E181214 10:56:17.589647 581203638 util/log/crash_reporting.go:477  [n2,consistencyChecker,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] Reported as error 381179d1347f42cdb8c1964e8cec6fa6

So the one clue we have is that there might be an issue with raft snapshots that span splits.
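
For readers following along, what the checker is doing here (roughly, as the logs suggest): each replica hashes its copy of the range's data (the 128-hex-character checksums are 512-bit digests), the leaseholder compares the results, and on a mismatch a second pass also collects the raw keys so the leaseholder/follower diff above can be printed. A minimal, self-contained sketch of that comparison, using hypothetical types rather than the storage package's real ones:

package main

import (
    "crypto/sha512"
    "fmt"
    "sort"
)

// replicaSnapshot is a toy stand-in for a replica's MVCC data.
type replicaSnapshot map[string]string // key -> value

// checksum hashes all key/value pairs in sorted key order, mirroring the
// idea behind the 512-bit digests in the log above.
func checksum(snap replicaSnapshot) [sha512.Size]byte {
    keys := make([]string, 0, len(snap))
    for k := range snap {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    h := sha512.New()
    for _, k := range keys {
        h.Write([]byte(k))
        h.Write([]byte(snap[k]))
    }
    var sum [sha512.Size]byte
    copy(sum[:], h.Sum(nil))
    return sum
}

// diff reports keys present on only one side, the shape of the
// "--- leaseholder / +++ follower" output above.
func diff(leaseholder, follower replicaSnapshot) (onlyLease, onlyFollower []string) {
    for k := range leaseholder {
        if _, ok := follower[k]; !ok {
            onlyLease = append(onlyLease, k)
        }
    }
    for k := range follower {
        if _, ok := leaseholder[k]; !ok {
            onlyFollower = append(onlyFollower, k)
        }
    }
    return onlyLease, onlyFollower
}

func main() {
    lh := replicaSnapshot{"/key/a": "v1"}
    fo := replicaSnapshot{"/key/a": "v1", "/key/b": "v2"}
    if checksum(lh) != checksum(fo) {
        onlyLease, onlyFollower := diff(lh, fo)
        fmt.Println("inconsistent; only on leaseholder:", onlyLease, "only on follower:", onlyFollower)
    }
}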

@tbg (Member) commented Dec 18, 2018

We did have changes in that code due to range merging. I'll also take a look at this tomorrow.

@bdarnell (Contributor) commented:

Yeah, the timing lines up with a lot of the merge work, especially #29083, #28533, and #28683. Maybe with snapshots flying around for multiple ranges we're hitting some edge case and an old pre-split snapshot is being interpreted as a "widening" post-merge snapshot.
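
To make the suspected edge case concrete: the "snapshot intersects existing range" errors above come from a receiver-side check that refuses a snapshot whose key span overlaps a different replica still present on the store (and queues that replica for GC, per the "initiated GC" suffix), while, as this comment implies, a post-merge snapshot may legitimately widen the replica it lands on once the neighbors have been merged away. A toy sketch of the span check, with illustrative types rather than the real store code:

package main

import "fmt"

// span is a half-open key interval [Start, End).
type span struct{ Start, End string }

func (a span) overlaps(b span) bool { return a.Start < b.End && b.Start < a.End }

// checkSnapshot is a toy model of the receiver-side check implied by the
// logs: a snapshot may cover, or even widen beyond, the replica it is
// addressed to, but if its span overlaps any *other* replica still on the
// store it is refused. The real code additionally queues the intersecting
// replica for GC.
func checkSnapshot(snapSpan span, otherReplicas []span) error {
    for _, o := range otherReplicas {
        if snapSpan.overlaps(o) {
            return fmt.Errorf("cannot apply snapshot: snapshot intersects existing range %v", o)
        }
    }
    return nil
}

func main() {
    // A snapshot generated before a split still covers the wider pre-split
    // span, so it overlaps the neighboring replica on the receiver and is
    // rejected.
    fmt.Println(checkSnapshot(span{"b", "d"}, []span{{"c", "d"}}))
}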

@tbg (Member) commented Dec 19, 2018

Augmenting the timeline regarding the snapshots:

E181214 10:44:43.119556 579818305 storage/queue.go:805  [n2,raftsnapshot,s2,r6718/3:/Table/17{6/3/2018…-7}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 4abd33d2 at applied index 10 with error: [n3,s3],r6718: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r6717/2:/Table/17{6/3/2018…-7}]
I181214 10:44:43.719186 579818346 storage/store_snapshot.go:636  [n2,raftsnapshot,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] sending Raft snapshot cf16cf2b at applied index 13004
I181214 10:44:47.857067 579818346 storage/store_snapshot.go:679  [n2,raftsnapshot,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] streamed snapshot to (n3,s3):2: kv pairs: 645489, log entries: 223, rate-limit: 8.0 MiB/sec, 4138ms

So the snapshot likely applied moments after 10:44:47. It's unfortunate that we don't have the logs for n3.

Also worth pointing out: the snapshots before the one that went through remained at index 13003 for ~12s, indicating that either there was no write activity on the range, or that the quota pool had backed up (in which case there would be lots of writes after the snapshot).
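
For background, the quota pool referred to here is a byte budget on the Raft leader: proposals must acquire quota before being proposed, and quota is returned only once the healthy followers have caught up, so a follower that is far behind can stall new writes on the range. A toy sketch of the mechanism, not CockroachDB's implementation:

package main

import (
    "fmt"
    "sync"
)

// quotaPool is a toy byte budget: proposals acquire quota before being
// handed to Raft, and the quota is returned only once the followers have
// caught up. If a follower is stuck (e.g. waiting on a snapshot), the pool
// drains and new proposals block -- the "quota pool had backed up" case.
type quotaPool struct {
    mu        sync.Mutex
    cond      *sync.Cond
    available int64
}

func newQuotaPool(size int64) *quotaPool {
    qp := &quotaPool{available: size}
    qp.cond = sync.NewCond(&qp.mu)
    return qp
}

// acquire blocks the proposer until enough quota is available.
func (qp *quotaPool) acquire(bytes int64) {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    for qp.available < bytes {
        qp.cond.Wait()
    }
    qp.available -= bytes
}

// release is called once the followers have caught up to the proposal.
func (qp *quotaPool) release(bytes int64) {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    qp.available += bytes
    qp.cond.Broadcast()
}

func main() {
    qp := newQuotaPool(1 << 20) // 1 MiB of proposal quota
    qp.acquire(512 << 10)
    fmt.Println("quota left:", qp.available)
    qp.release(512 << 10)
}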

Likely not relevant, but since I don't usually see it in the logs:

W181214 10:55:13.541130 155885761 storage/intent_resolver.go:757  [n5,s5] failed to cleanup transaction intents: failed to resolve intents: context deadline exceeded
W181214 10:55:13.543508 97811126 storage/intent_resolver.go:757  [n11,s11] failed to cleanup transaction intents: failed to resolve intents: context deadline exceeded
W181214 10:55:13.922880 581024504 storage/intent_resolver.go:757  [n2,s2] failed to cleanup transaction intents: failed to resolve intents: result is ambiguous (context deadline exceeded)

If this is something about snapshots being erroneously assumed to span merges, I'd expect to see messages from removeReplicaImpl (if we compute subsumedReplicas, they get destroyed during the merge and that message gets logged). Unfortunately this would happen on the receiver. But it's hard to believe that this would affect just a single key.

@bdarnell (Contributor) commented:

Also worth pointing out: the snapshots before the one that went through remained at index 13003 for ~12s, indicating that either there was no write activity on the range, or that the quota pool had backed up (in which case there would be lots of writes after the snapshot).

Writes were occurring in roughly sequential order, so it's expected that ranges at the tail of the keyspace would grow, split, then become idle as all future writes land on the RHS. The split would often be the last thing to happen on a range (or close to it).

@tbg (Member) commented Apr 10, 2019

Closing as not actionable; the original Sentry event is also gone.

We recently fixed a bug on the 19.1 branch, #35424 (comment), that caused inconsistencies, though we think it didn't affect 2.1 (due to its different raft entry cache implementation). As fallout from this issue, though, we now have better early detection of inconsistencies in our nightly tests, and we are also working on creating snapshots close to when an inconsistency is detected, which should help future investigations.

tbg closed this as completed Apr 10, 2019