
sentry: replica_consistency.go:152: consistency check failed with %d inconsistent replicas | int #33220

Closed
cockroach-teamcity opened this issue Dec 18, 2018 · 8 comments
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-sentry: Originated from an in-the-wild panic report.

Comments

@cockroach-teamcity (Member)

This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.

Sentry link: https://sentry.io/cockroach-labs/cockroachdb/issues/812705860/?referrer=webhooks_plugin

Panic message:

replica_consistency.go:152: consistency check failed with %d inconsistent replicas | int

Stacktrace:

storage/replica_consistency.go in storage.(*Replica).CheckConsistency at line 161
storage/consistency_queue.go in storage.(*consistencyQueue).process at line 117
storage/queue.go in storage.(*baseQueue).processReplica at line 761
storage/queue.go in storage.(*baseQueue).processLoop.func1.2 at line 639
util/stop/stopper.go in util/stop.(*Stopper).RunAsyncTask.func1 at line 324
Cockroach Release: v2.1.2
Cockroach SHA: 8d807ce
Platform: linux amd64
Distribution: CCL
Environment: v2.1.2
Command: server
Go Version: go1.10.3
# of CPUs: 8
# of Goroutines: 197
gomaxprocs: 8
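
The literal %d and the trailing "| int" are what a telemetry-scrubbed report looks like: only the format string and the types of its arguments leave the cluster, never the values (the actual count, 1, shows up later in the cluster's own logs below). A minimal sketch of the idea, not CockroachDB's actual crash-reporting code:

package main

import (
    "fmt"
    "strings"
)

// scrub builds a telemetry-safe message: it keeps the format string
// verbatim and appends only the *types* of the arguments, so that
// potentially sensitive values never leave the cluster.
func scrub(format string, args ...interface{}) string {
    types := make([]string, len(args))
    for i, a := range args {
        types[i] = fmt.Sprintf("%T", a)
    }
    return format + " | " + strings.Join(types, ", ")
}

func main() {
    // Prints: consistency check failed with %d inconsistent replicas | int
    fmt.Println(scrub("consistency check failed with %d inconsistent replicas", 1))
}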
cockroach-teamcity added the C-bug and O-sentry labels on Dec 18, 2018
@tbg (Member) commented Dec 18, 2018

cc @bdarnell. Unfortunately there isn't much information to extract here.

@sentry-io bot commented Dec 18, 2018

Sentry issue: COCKROACHDB-JE

@bdarnell (Contributor) commented:

Yes, not much to go on here. This is a different cluster ID than the privately-reported issue that you and I were looking at yesterday, though. That one got grouped into the sentry issue I linked above, along with several other clusters. These reports date back as far as 2.1.0-beta-20180904.

The privately-reported issue shows a regular userspace MVCC metadata key that appears on only one of the followers. The cluster was undergoing heavy split activity at the time and it looks like a split occurred between the writing of the intent and its resolution, possibly at the same time.

As best I can reconstruct the sequence of events, the key in question was written to range 6716 at approximately 2018-12-14T10:41:45.498736Z (this is the client's now() timestamp, so the intent would be written shortly after that). r6716 was on n2, n3, and n5 with n2 as leader and leaseholder. r6716 split at 10:42:26:

I181214 10:42:26.298936 579537264 storage/replica_command.go:317  [n2,split,s2,r6716/3:/Table/17{6/3/2018…-7}] initiating a split of this range at key /Table/176/3/2018-12-14T10:40:50.544926Z/"5h17tae7c8n9oic7" [r6717]; may cause Raft snapshot to r6716/1: next = 1, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0

The target of that raft snapshot is n5, and it failed immediately:

E181214 10:42:27.008303 579540807 storage/queue.go:805  [n2,raftsnapshot,s2,r6716/3:/Table/17{6/3/2018…-7}] snapshot failed: (n5,s5):1: remote couldn't accept Raft snapshot 5e6eedb0 at applied index 12644 with error: [n5,s5],r6716: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n5,s5,r6715/1:/Table/17{6/3/2018…-7}]

I don't think that snapshot to n5 matters because the inconsistency is eventually reported between n2 and n3. Before that happens, though, r6717 splits again, and the snapshot also fails to apply:

I181214 10:44:31.726309 579804779 storage/replica_command.go:317  [n2,split,s2,r6717/3:/Table/17{6/3/2018…-7}] initiating a split of this range at key /Table/176/3/2018-12-14T10:42:52.454548Z/"xer9g6x6j6thtol2" [r6718]; may cause Raft snapshot to r6717/2: next = 1, match = 0, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0
E181214 10:44:33.124169 579809490 storage/queue.go:805  [n2,raftsnapshot,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot c291704f at applied index 13003 with error: [n3,s3],r6717: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r6716/2:/Table/17{6/3/2018…-7}]

This retries a lot and spams the logs; the last reported snapshot failure is at:

E181214 10:44:42.724264 579818202 storage/queue.go:805  [n2,raftsnapshot,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 0a765aa6 at applied index 13004 with error: [n3,s3],r6717: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r6716/2:/Table/17{6/3/2018…-7}]

Unfortunately we don't have much logging of successes. A few minutes later, we have a consistency checker panic:

E181214 10:56:12.359787 581203638 storage/replica_consistency.go:97  [n2,consistencyChecker,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] replica (n3,s3):2 is inconsistent: expected checksum 60c53b9e183b7a64a102830dddd2a1b58d513b078b429aa24fae201da58c760df7a9b44a8b77df7952ddb6a95c9591661d6e863553e2a1af04760ae465bd989d, got 297b26f3fc903511b66571dc0235cf0fb450b488292ca5fe7771f82c469da4b153cdc3b660a054d351fb09f7ef99e9786c2099362aca77b05e260041c10ad03c
E181214 10:56:12.359814 581203638 storage/replica_consistency.go:158  [n2,consistencyChecker,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] consistency check failed with 1 inconsistent replicas; fetching details

E181214 10:56:17.454260 581203638 storage/replica_consistency.go:97  [n2,consistencyChecker,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] replica (n3,s3):2 is inconsistent: expected checksum 837c7b972bb0972eac75f1a614697580aa695d161a6d5fa128afad5c4abc7747989c1fadeddf605b34a9f29f89ffa916599c892227ae9ca2e26fe72082b2f0c0, got 77cdc5f5e2c0516425e820f4b9a7559046143851d14952b6bc448e4b9b533cb9e5f35e6b1ac89e6171557cb2063623a69f9b11ad250d25b9e952efd86dd5b320
--- leaseholder
+++ follower
+0.000000000,0 /Table/176/3/2018-12-14T10:41:45.498736Z/"6vvq92gflc0rq1ok"/0
+    ts:<zero>
+    value:"[redacted]"
+    raw mvcc_key/value: [redacted]
E181214 10:56:17.589647 581203638 util/log/crash_reporting.go:477  [n2,consistencyChecker,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] Reported as error 381179d1347f42cdb8c1964e8cec6fa6

So the one clue we have is that there might be an issue with raft snapshots that span splits.
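
For readers following along, what the checker is doing here (roughly, as the logs suggest): each replica hashes its copy of the range's data (the 128-hex-character checksums are 512-bit digests), the leaseholder compares the results, and on a mismatch a second pass also collects the raw keys so the leaseholder/follower diff above can be printed. A minimal, self-contained sketch of that comparison, using hypothetical types rather than the storage package's real ones:

package main

import (
    "crypto/sha512"
    "fmt"
    "sort"
)

// replicaSnapshot is a toy stand-in for a replica's MVCC data.
type replicaSnapshot map[string]string // key -> value

// checksum hashes all key/value pairs in sorted key order, mirroring the
// idea behind the 512-bit digests in the log above.
func checksum(snap replicaSnapshot) [sha512.Size]byte {
    keys := make([]string, 0, len(snap))
    for k := range snap {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    h := sha512.New()
    for _, k := range keys {
        h.Write([]byte(k))
        h.Write([]byte(snap[k]))
    }
    var sum [sha512.Size]byte
    copy(sum[:], h.Sum(nil))
    return sum
}

// diff reports keys present on only one side, the shape of the
// "--- leaseholder / +++ follower" output above.
func diff(leaseholder, follower replicaSnapshot) (onlyLease, onlyFollower []string) {
    for k := range leaseholder {
        if _, ok := follower[k]; !ok {
            onlyLease = append(onlyLease, k)
        }
    }
    for k := range follower {
        if _, ok := leaseholder[k]; !ok {
            onlyFollower = append(onlyFollower, k)
        }
    }
    return onlyLease, onlyFollower
}

func main() {
    lh := replicaSnapshot{"/key/a": "v1"}
    fo := replicaSnapshot{"/key/a": "v1", "/key/b": "v2"}
    if checksum(lh) != checksum(fo) {
        onlyLease, onlyFollower := diff(lh, fo)
        fmt.Println("inconsistent; only on leaseholder:", onlyLease, "only on follower:", onlyFollower)
    }
}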

@tbg (Member) commented Dec 18, 2018

We did have changes in that code due to range merging. I'll also take a look at this tomorrow.

@bdarnell (Contributor) commented:

Yeah, the timing lines up with a lot of the merge work, especially #29083, #28533, and #28683. Maybe with snapshots flying around for multiple ranges we're hitting some edge case and an old pre-split snapshot is being interpreted as a "widening" post-merge snapshot.
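
To make the suspected edge case concrete: the "snapshot intersects existing range" errors above come from a receiver-side check that refuses a snapshot whose key span overlaps a different replica still present on the store (and queues that replica for GC, per the "initiated GC" suffix), while, as this comment implies, a post-merge snapshot may legitimately widen the replica it lands on once the neighbors have been merged away. A toy sketch of the span check, with illustrative types rather than the real store code:

package main

import "fmt"

// span is a half-open key interval [Start, End).
type span struct{ Start, End string }

func (a span) overlaps(b span) bool { return a.Start < b.End && b.Start < a.End }

// checkSnapshot is a toy model of the receiver-side check implied by the
// logs: a snapshot may cover, or even widen beyond, the replica it is
// addressed to, but if its span overlaps any *other* replica still on the
// store it is refused. The real code additionally queues the intersecting
// replica for GC.
func checkSnapshot(snapSpan span, otherReplicas []span) error {
    for _, o := range otherReplicas {
        if snapSpan.overlaps(o) {
            return fmt.Errorf("cannot apply snapshot: snapshot intersects existing range %v", o)
        }
    }
    return nil
}

func main() {
    // A snapshot generated before a split still covers the wider pre-split
    // span, so it overlaps the neighboring replica on the receiver and is
    // rejected.
    fmt.Println(checkSnapshot(span{"b", "d"}, []span{{"c", "d"}}))
}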

@tbg (Member) commented Dec 19, 2018

Augmenting the timeline regarding the snapshots:

E181214 10:44:43.119556 579818305 storage/queue.go:805  [n2,raftsnapshot,s2,r6718/3:/Table/17{6/3/2018…-7}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 4abd33d2 at applied index 10 with error: [n3,s3],r6718: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r6717/2:/Table/17{6/3/2018…-7}]
I181214 10:44:43.719186 579818346 storage/store_snapshot.go:636  [n2,raftsnapshot,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] sending Raft snapshot cf16cf2b at applied index 13004
I181214 10:44:47.857067 579818346 storage/store_snapshot.go:679  [n2,raftsnapshot,s2,r6717/3:/Table/176/3/2018-12-14T10:4…] streamed snapshot to (n3,s3):2: kv pairs: 645489, log entries: 223, rate-limit: 8.0 MiB/sec, 4138ms

So the snapshot likely applied moments after 10:44:47. It's unfortunate that we don't have the logs for n3.

Also worth pointing out: the snapshots before the one that went through remained at index 13003 for ~12s, indicating that either there was no write activity on the range, or that the quota pool had backed up (in which case there would be lots of writes after the snapshot).
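
For background, the quota pool referred to here is a byte budget on the Raft leader: proposals must acquire quota before being proposed, and quota is returned only once the healthy followers have caught up, so a follower that is far behind can stall new writes on the range. A toy sketch of the mechanism, not CockroachDB's implementation:

package main

import (
    "fmt"
    "sync"
)

// quotaPool is a toy byte budget: proposals acquire quota before being
// handed to Raft, and the quota is returned only once the followers have
// caught up. If a follower is stuck (e.g. waiting on a snapshot), the pool
// drains and new proposals block -- the "quota pool had backed up" case.
type quotaPool struct {
    mu        sync.Mutex
    cond      *sync.Cond
    available int64
}

func newQuotaPool(size int64) *quotaPool {
    qp := &quotaPool{available: size}
    qp.cond = sync.NewCond(&qp.mu)
    return qp
}

// acquire blocks the proposer until enough quota is available.
func (qp *quotaPool) acquire(bytes int64) {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    for qp.available < bytes {
        qp.cond.Wait()
    }
    qp.available -= bytes
}

// release is called once the followers have caught up to the proposal.
func (qp *quotaPool) release(bytes int64) {
    qp.mu.Lock()
    defer qp.mu.Unlock()
    qp.available += bytes
    qp.cond.Broadcast()
}

func main() {
    qp := newQuotaPool(1 << 20) // 1 MiB of proposal quota
    qp.acquire(512 << 10)
    fmt.Println("quota left:", qp.available)
    qp.release(512 << 10)
}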

Likely not relevant, but since I don't usually see it in the logs:

W181214 10:55:13.541130 155885761 storage/intent_resolver.go:757  [n5,s5] failed to cleanup transaction intents: failed to resolve intents: context deadline exceeded
W181214 10:55:13.543508 97811126 storage/intent_resolver.go:757  [n11,s11] failed to cleanup transaction intents: failed to resolve intents: context deadline exceeded
W181214 10:55:13.922880 581024504 storage/intent_resolver.go:757  [n2,s2] failed to cleanup transaction intents: failed to resolve intents: result is ambiguous (context deadline exceeded)

If this is something about snapshots being erroneously assumed to span merges, I'd expect to see messages from removeReplicaImpl (if we compute subsumedReplicas, they get destroyed during the merge and that message gets logged). Unfortunately this would happen on the receiver. But it's hard to believe that this would affect just a single key.

@bdarnell (Contributor) commented:

Also worth pointing out: the snapshots before the one that went through remained at index 13003 for ~12s, indicating that either there was no write activity on the range, or that the quota pool had backed up (in which case there would be lots of writes after the snapshot).

Writes were occurring in roughly sequential order, so it's expected that ranges at the tail of the keyspace would grow, split, then become idle as all future writes land on the RHS. The split would often be the last thing to happen on a range (or close to it).

@tbg (Member) commented Apr 10, 2019

Closing as not actionable; the original Sentry event is also gone.

We recently fixed a bug on the 19.1 branch, #35424 (comment), that caused inconsistencies, though we think it didn't affect 2.1 (due to its different raft entry cache implementation). As fallout from this issue, though, we now have better early detection of inconsistencies in our nightly tests, and we are also working on creating snapshots close to when an inconsistency is detected, which should help future investigations.

tbg closed this as completed Apr 10, 2019