sentry: replica_consistency.go:152: consistency check failed with %d inconsistent replicas | int #33220
Comments
cc @bdarnell. Unfortunately there isn't much information to extract here.
Sentry issue: COCKROACHDB-JE
Yes, not much to go on here. This is a different cluster ID than the privately-reported issue that you and I were looking at yesterday, though. That one got grouped into the Sentry issue I linked above, along with several other clusters. These reports date back as far as 2.1.0-beta-20180904.

The privately-reported issue shows a regular userspace MVCC metadata key that appears on only one of the followers. The cluster was undergoing heavy split activity at the time, and it looks like a split occurred between the writing of the intent and its resolution, possibly at the same time.

As best I can reconstruct the sequence of events, the key in question was written to range 6716 at approximately 2018-12-14T10:41:45.498736Z (this is the client's now() timestamp, so the intent would have been written shortly after that). r6716 was on n2, n3, and n5, with n2 as leader and leaseholder. r6716 split at 10:42:26:
The target of that raft snapshot is n5, and it failed immediately:
I don't think that snapshot to n5 matters because the inconsistency is eventually reported between n2 and n3. Before that happens, though, r6717 splits again, and the snapshot also fails to apply:
This retries a lot and spams the logs, and the last reported snapshot failure is at
Unfortunately we don't have much logging of successes. A few minutes later, we have a consistency checker panic:
So the one clue we have is that there might be an issue with raft snapshots that span splits.
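(For context on the check that fired here: roughly speaking, the consistency checker has each replica compute a checksum over its copy of the range's data and compares the results, counting any replica whose checksum disagrees. The sketch below is a toy illustration of that idea only, not CockroachDB's actual implementation; the `replicaData` map type and the hard-coded example data are invented for illustration.)

```go
// Toy sketch of a per-replica checksum comparison (not CockroachDB code).
// It illustrates the shape of the check behind "consistency check failed
// with %d inconsistent replicas": each replica hashes its copy of the
// range's data and the digests are compared against the leaseholder's.
package main

import (
	"crypto/sha512"
	"fmt"
	"sort"
)

// replicaData stands in for one replica's view of a range's KV pairs.
type replicaData map[string]string

// checksum hashes keys and values in sorted key order so that two replicas
// with identical contents always produce identical digests.
func checksum(r replicaData) [sha512.Size]byte {
	keys := make([]string, 0, len(r))
	for k := range r {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha512.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(r[k]))
	}
	var sum [sha512.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	leaseholder := replicaData{"a": "1", "b": "2"}
	followers := []replicaData{
		{"a": "1", "b": "2"},           // consistent
		{"a": "1", "b": "2", "c": "3"}, // extra key, like the stray MVCC key described above
	}

	want := checksum(leaseholder)
	inconsistent := 0
	for _, f := range followers {
		if checksum(f) != want {
			inconsistent++
		}
	}
	if inconsistent > 0 {
		// The real check treats this as fatal; the toy version just prints.
		fmt.Printf("consistency check failed with %d inconsistent replicas\n", inconsistent)
	}
}
```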
We did have changes in that code due to range merging. I'll also take a look at this tomorrow.
Augmenting the timeline regarding the snapshots:
So the snapshot likely applied moments after 10:44:47. It's unfortunate that we don't have the logs for n3. It's also worth pointing out that the snapshots before the one that went through remained at index 13003 for ~12s, indicating that either there was no write activity on the range, or that the quota pool had backed up (in which case there would be lots of writes after the snapshot). Likely not relevant, but since I don't usually see it in the logs:
If this is something about snapshots being erroneously assumed to span merges, I'd expect to see messages from
Writes were occurring in roughly sequential order, so it's expected that ranges at the tail of the keyspace would grow, split, then become idle as all future writes land on the RHS. The split would often be the last thing to happen on a range (or close to it).
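(To make that workload shape concrete, here is a small toy simulation, not CockroachDB code; the 4-key split threshold and the slice-of-ints representation are made up for illustration. With sequentially increasing keys, only the tail range ever crosses the split threshold, and each LHS goes idle the moment it splits.)

```go
// Toy illustration of the workload described above (not CockroachDB code):
// keys arrive in sequential order, so only the range at the tail of the
// keyspace grows past the split threshold; once it splits, the LHS never
// receives another write and goes idle.
package main

import "fmt"

const splitThreshold = 4 // made-up threshold; real splits are size-based

func main() {
	// Each range is just the slice of keys it holds; the last range is the tail.
	ranges := [][]int{{}}
	for key := 0; key < 20; key++ {
		tail := len(ranges) - 1
		ranges[tail] = append(ranges[tail], key) // sequential keys always land on the tail (RHS)
		if len(ranges[tail]) >= splitThreshold {
			// Split: the old tail keeps its keys and goes idle; a new empty RHS takes over.
			ranges = append(ranges, []int{})
		}
	}
	for i, r := range ranges {
		fmt.Printf("range %d: %d keys (idle: %v)\n", i, len(r), i != len(ranges)-1)
	}
}
```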
Closing as not actionable. The original Sentry event is also gone. We recently fixed a bug on the 19.1 branch, #35424 (comment), that caused inconsistencies, though we think it didn't affect 2.1 (due to its different raft entry cache implementation). As fallout from this issue, we now have better early detection of inconsistencies in our nightly tests, and we are also working on capturing snapshots close to the point where an inconsistency is detected, which should help future investigations.
This issue was autofiled by Sentry. It represents a crash or reported error on a live cluster with telemetry enabled.
Sentry link: https://sentry.io/cockroach-labs/cockroachdb/issues/812705860/?referrer=webhooks_plugin
Panic message:
Stacktrace:
Cockroach version: v2.1.2
Go version: go1.10.3