storage: *errors.errorString: replica.go:1351 on-disk and in-memory state diverged: #16004
… On Wed, May 17, 2017 at 4:06 PM, Tobias Schottdorf wrote:
This happened on a 1.0 binary, so it's neither #15819 nor the more recent #15935.
https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/
*errors.errorString: replica.go:1351 on-disk and in-memory state diverged:
%s
File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 1351, in assertStateRLocked
File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 1336, in assertState
File "github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go", line 792, in handleEvalResult
File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3755, in processRaftCommand
File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 2886, in handleRaftReadyRaftMuLocked
...
(5 additional frame(s) were not displayed)
replica.go:1351 on-disk and in-memory state diverged:
%s
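For readers unfamiliar with the assertion in the report above: the check compares the replica state cached in memory against the state persisted on disk and treats any difference as fatal. The sketch below is a minimal, self-contained illustration of that kind of check, with simplified, hypothetical types and names; it is not CockroachDB's actual implementation.

```go
package main

import (
	"fmt"
	"log"
	"reflect"
)

// replicaState is a simplified stand-in for the replica state that is both
// cached in memory and persisted to disk (applied indexes, stats, etc.).
type replicaState struct {
	RaftAppliedIndex  uint64
	LeaseAppliedIndex uint64
}

// loadStateFromDisk is a placeholder for reading the persisted state from the
// storage engine; it returns a fixed value to keep the sketch self-contained.
func loadStateFromDisk() replicaState {
	return replicaState{RaftAppliedIndex: 41, LeaseAppliedIndex: 12}
}

// assertState mirrors the idea behind the check that produced this report:
// if the on-disk and in-memory copies of the state differ, something has gone
// badly wrong and the process should not keep applying commands.
func assertState(inMem replicaState) {
	onDisk := loadStateFromDisk()
	if !reflect.DeepEqual(onDisk, inMem) {
		log.Fatalf("on-disk and in-memory state diverged: disk=%+v mem=%+v", onDisk, inMem)
	}
	fmt.Println("state is consistent:", inMem)
}

func main() {
	// The in-memory copy was advanced past what actually made it to disk,
	// e.g. because a batch commit failed after the cache was updated.
	assertState(replicaState{RaftAppliedIndex: 42, LeaseAppliedIndex: 12})
}
```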
Ah, well spotted. Any idea what the misconfiguration is?
@tamird Can you elaborate on why you think this is a sentry misconfiguration? I'm seeing the …
Where are you seeing that? I do see a couple occurrences at that commit, but also one at 43aa1ed with release=v1.0
@a-robinson but 43aa1ed is not the 1.0 SHA! It's on the release branch, but it's not actually 1.0. That said, there are some events at the real 1.0 SHA: https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/events/?query=rev%3A%2243aa1ed04cd817348921ac27e11e25725f634aa6%22
cc @dt because I don't understand where the release tag is coming from.
@tschottdorf I clicked around in your link but all the events that I opened up had the 0ba0af sha as well. Am I missing something?
@tschottdorf I think that bar is just showing you the rev with the highest number of occurrences; the actual event's rev is in the middle of the screen in the tags section.
@tamird if 43aa1ed isn't the …
@tschottdorf that's funny, because all the ones I'm seeing at that link are at 43aa1ed.
And for a given instance (i.e. https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/), I think you want to look at the …
@a-robinson eh, you're right - I was basing that on the GH UI which showed me the 1.0-release branch but not the tag, for some reason.
@tamird release = buildInfo.Tag, which comes from …
rev = buildInfo.Revision, or …
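As a rough illustration of the point above: the release and rev tags attached to a crash report are typically derived from version information injected into the binary at build time via linker flags, so wrong or missing flags lead to reports tagged with the wrong release. The sketch below uses hypothetical variable names, not CockroachDB's actual build package.

```go
package main

import "fmt"

// These would be populated at build time via linker flags, e.g.
//   go build -ldflags "-X main.tag=v1.0 -X main.revision=0ba0af..."
// If the flags are wrong or missing, the tags attached to crash reports
// (release, rev) will not match the binary that actually ran.
var (
	tag      = "unknown"
	revision = "unknown"
)

// reportTags shows how build info might be turned into the key/value tags
// attached to an error report.
func reportTags() map[string]string {
	return map[string]string{
		"release": tag,
		"rev":     revision,
	}
}

func main() {
	fmt.Println(reportTags())
}
```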
Hm, yeah. Seems real.
Ah, I looked at the distribution without realizing that's what it was. Ok, seems real.
The 1.0.1 ship may already have sailed, but if it has not, then let us consider whether we want to get this in to get more information on #16004 in the wild.
removing milestone, as this doesn't seem actionable after chatting with @tschottdorf
This is still happening on v1.0.6, with messages such as …
These look like at least two distinct classes of problems. Perhaps one of them (the reports where the on-disk state is empty) comes from accidentally linking two clusters together, but the other one looks more serious because it's always an off-by-one. I think what's happening here is that we run out of disk space (or hit some similar I/O error) while committing a Raft batch, which right now marks the replica as corrupted but then still allows the assertion to run in a situation in which the in-memory state was updated while the actual write failed. I'll send a patch to make that case fail differently, in the hope that we won't see any more of the above assertion failure (or, if we do, that it's an actual failure and not fallout from a full disk, etc.).
Previously, if `applyRaftCommand` returned an error, it would mark the replica as corrupt but then go on and execute the side effects and potentially the assertions in `assertState()`. These were then likely to fail and return a misleading error, as likely seen in cockroachdb#16004. Instead, cause a fatal error right when observing the error, and potentially capture the root cause on sentry.io. The (perhaps too optimistic) expectation is that after accounting for these disk corruption/space errors, there will be far fewer (possibly no) reports triggered by `assertState()`. Touches cockroachdb#16004.
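A minimal sketch of the pattern the commit message describes, using heavily simplified stand-ins for the functions it names (not the actual replica code): fail fast when applying the command's writes returns an error, instead of marking the replica corrupt and continuing on to the side effects and the state assertion, so the report captures the root cause (e.g. a full disk) rather than the downstream divergence.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// applyRaftCommand stands in for committing a Raft command's writes to the
// storage engine; it can fail for reasons like a full disk or an I/O error.
func applyRaftCommand() error {
	return errors.New("write batch commit failed: no space left on device")
}

// handleEvalResult stands in for running the command's side effects, which
// includes the on-disk/in-memory state assertion.
func handleEvalResult() {
	fmt.Println("running side effects and state assertions")
}

func processRaftCommand() {
	if err := applyRaftCommand(); err != nil {
		// Previously: mark the replica corrupt and fall through to
		// handleEvalResult, where the state assertion would then fail with a
		// misleading "state diverged" error. Instead, fail fast here so the
		// original error is what gets reported.
		log.Fatalf("could not apply raft command: %v", err)
	}
	handleEvalResult()
}

func main() {
	processRaftCommand()
}
```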
This happened on a 1.0 binary, so it's neither #15819 nor the more recent #15935.
https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/