
storage: *errors.errorString: replica.go:1351 on-disk and in-memory state diverged: #16004

Closed

tbg opened this issue May 17, 2017 · 18 comments
tbg (Member) commented May 17, 2017

This happened on a 1.0 binary, so it's neither #15819 nor the more recent #15935.

https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/

*errors.errorString: replica.go:1351 on-disk and in-memory state diverged:
%s
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 1351, in assertStateRLocked
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 1336, in assertState
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go", line 792, in handleEvalResult
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3755, in processRaftCommand
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 2886, in handleRaftReadyRaftMuLocked
...
(5 additional frame(s) were not displayed)

replica.go:1351 on-disk and in-memory state diverged:
%s
tamird (Contributor) commented May 17, 2017 via email

tbg (Member, Author) commented May 17, 2017

Ah, well spotted. Any idea what the misconfiguration is?

petermattis (Collaborator)
@tamird Can you elaborate on why you think this is a sentry misconfiguration? I'm seeing the rev match the v1.0 SHA. What are you looking at?

a-robinson (Contributor)
Where are you seeing that? I do see a couple occurrences at that commit, but also one at 43aa1ed with release=v1.0

tbg (Member, Author) commented May 17, 2017

[screenshot of the Sentry issue page]

on the right hand side, around the middle.

tamird (Contributor) commented May 17, 2017

@a-robinson but 43aa1ed is not the 1.0 SHA! It's on the release branch, but it's not actually 1.0.

That said, there are some events at the real 1.0 SHA: https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/events/?query=rev%3A%2243aa1ed04cd817348921ac27e11e25725f634aa6%22

tamird (Contributor) commented May 17, 2017

cc @dt because I don't understand where the release tag is coming from.

tbg (Member, Author) commented May 17, 2017

@tamird I clicked around in your link but all the events that I opened up had the 0ba0af sha as well. Am I missing something?

tamird (Contributor) commented May 17, 2017

@tschottdorf I think that bar is just showing you the rev with the highest number of occurrences; the actual event's rev is in the middle of the screen in the tags section.

a-robinson (Contributor)
@tamird if 43aa1ed isn't the REV value for the 1.0 release, then what is?

@tschottdorf that's funny, because all the ones I'm seeing at that link are at 43aa1ed.
[screenshot: Sentry events at that link, all showing rev 43aa1ed]

petermattis (Collaborator)
And for a given instance (i.e. https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/), I think you want to look at the rev info in the Tags section. Certainly seems to be 43aa1ed to me.

tamird (Contributor) commented May 17, 2017

@a-robinson eh, you're right - I was basing that on the GH UI which showed me the 1.0-release branch but not the tag, for some reason.

dt (Member) commented May 17, 2017

@tamird release = `buildInfo.Tag`, which comes from

.buildinfo/tag: | .buildinfo
	@{ git describe --tags --exact-match 2> /dev/null || git rev-parse --short HEAD; } | tr -d \\n > $@
	@git diff-index --quiet HEAD || echo -dirty >> $@

rev = `buildInfo.Revision`, i.e. `git rev-parse HEAD`
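
For illustration, here is a minimal sketch of how a tag and revision produced by those git commands could be stamped into a binary via `-ldflags` and then attached to reports at runtime. The package layout and variable names below are hypothetical, not CockroachDB's actual build package:

```go
// Hypothetical sketch: stamp git-derived build info into the binary.
// Build with something like:
//   go build -ldflags "-X main.tag=$(cat .buildinfo/tag) -X main.revision=$(git rev-parse HEAD)" .
package main

import "fmt"

var (
	// Overwritten at link time via -ldflags; the defaults only show up
	// if the binary was built without the stamping step.
	tag      = "unknown" // git describe --tags --exact-match, or short SHA
	revision = "unknown" // git rev-parse HEAD
)

func main() {
	// A crash reporter would send these as the "release" and "rev" tags,
	// which is what the Sentry UI above displays.
	fmt.Printf("release=%s rev=%s\n", tag, revision)
}
```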

tamird (Contributor) commented May 17, 2017

Hm, yeah. Seems real.

tbg (Member, Author) commented May 17, 2017

Ah, I looked at the distribution without realizing that's what it was. Ok, seems real.

tbg added a commit to tbg/cockroach that referenced this issue May 18, 2017
The 1.0.1 ship may already have sailed, but if it has not,
then let us consider whether we want to get this in to get
more information on cockroachdb#16004 in the wild.
tbg added a commit to tbg/cockroach that referenced this issue May 18, 2017
The 1.0.1 ship may already have sailed, but if it has not,
then let us consider whether we want to get this in to get
more information on cockroachdb#16004 in the wild.
mberhault pushed a commit that referenced this issue May 18, 2017
The 1.0.1 ship may already have sailed, but if it has not,
then let us consider whether we want to get this in to get
more information on #16004 in the wild.
petermattis added this to the 1.1 milestone Jun 1, 2017
petermattis changed the title from "*errors.errorString: replica.go:1351 on-disk and in-memory state diverged:" to "storage: *errors.errorString: replica.go:1351 on-disk and in-memory state diverged:" Jun 1, 2017
cuongdo removed this from the 1.1 milestone Aug 22, 2017
cuongdo (Contributor) commented Aug 22, 2017

Removing milestone, as this doesn't seem actionable after chatting with @tschottdorf.

tbg added this to the 1.2 milestone Sep 7, 2017
tbg (Member, Author) commented Oct 23, 2017

This is still happening on v1.0.6, with messages such as

on-disk and in-memory state diverged: [RaftAppliedIndex: 68 != 69 Lease.Expiration.WallTime: 1506597457245193300 != 1506597461745378100 Lease.DeprecatedStartStasis.WallTime: 1506597457245193300 != 1506597461745378100 Lease.ProposedTS.WallTime: 1506597448245204300 != 1506597452745386300] replica.go:1366

on-disk and in-memory state diverged: [RaftAppliedIndex: 234235 != 234236 LeaseAppliedIndex: 232956 != 232957 TruncatedState.Index: 229979 != 234234 TruncatedState.Term: 2643 != 4547] replica.go:1366


on-disk and in-memory state diverged: [RaftAppliedIndex: 0 != 168 LeaseAppliedIndex: 0 != 139 Lease.Start.WallTime: 0 != 1506432427777071984 Lease.Start.Logical: 0 != 1 Lease.Expiration.WallTime: 0 != 1506432530591537745 Lease.Replica.NodeID: 0 != 1 Lease.Replica.StoreID: 0 != 1 Lease.Replica.ReplicaID: 0 != 1 Lease.DeprecatedStartStasis.WallTime: 0 != 1506432530591537745 Lease.ProposedTS: nil != &hlc.Timestamp{WallTime:1506432521591542099, Logical:0} TruncatedState.Index: 0 != 116 TruncatedState.Term: 0 != 7 Stats.LastUpdateNanos: 0 != 1506432522484024584 Stats.GCBytesAge: 0 != 995935 Stats.LiveBytes: 0 != 901 Stats.LiveCount: 0 != 13 Stats.KeyBytes: 0 != 935 Stats.KeyCount: 0 != 13 Stats.ValBytes: 0 != 2785 Stats.ValCount: 0 != 74 Stats.IntentBytes: 0 != 110 Stats.IntentCount: 0 != 2 Stats.SysBytes: 0 != 1068 Stats.SysCount: 0 != 10] replica.go:1366

on-disk and in-memory state diverged: [RaftAppliedIndex: 0 != 183 LeaseAppliedIndex: 0 != 155 Lease.Start.WallTime: 0 != 1506418620962787293 Lease.Expiration.WallTime: 0 != 1506418629962787293 Lease.Replica.NodeID: 0 != 3 Lease.Replica.StoreID: 0 != 3 Lease.Replica.ReplicaID: 0 != 2 Lease.ProposedTS: nil != &hlc.Timestamp{WallTime:1506418620962790073, Logical:0} TruncatedState.Index: 0 != 115 TruncatedState.Term: 0 != 7 Stats.LastUpdateNanos: 0 != 1506418620436285731 Stats.GCBytesAge: 0 != 799523 Stats.LiveBytes: 0 != 839 Stats.LiveCount: 0 != 13 Stats.KeyBytes: 0 != 995 Stats.KeyCount: 0 != 13 Stats.ValBytes: 0 != 2926 Stats.ValCount: 0 != 79 Stats.IntentBytes: 0 != 51 Stats.IntentCount: 0 != 1 Stats.SysBytes: 0 != 1226 Stats.SysCount: 0 != 13] replica.go:1366

on-disk and in-memory state diverged: [RaftAppliedIndex: 68 != 69 Lease.Expiration.WallTime: 1506007046679528500 != 1506007055678808700 Lease.DeprecatedStartStasis.WallTime: 1506007046679528500 != 1506007055678808700 Lease.ProposedTS.WallTime: 1506007037679539500 != 1506007046678811100] replica.go:1366

These look like at least two distinct classes of problems. Perhaps one of them (the ones where the on-disk state is empty) is the result of accidentally linking two clusters together, but the other one looks more serious because it's always an off-by-one. I think what's happening here is that we run out of disk space (or hit some similar I/O error) while committing a Raft batch. That currently marks the replica as corrupted, but then still allows the assertion to run in a situation in which the in-memory state was updated while the actual write failed. I'll send a patch to make that case fail differently, in the hope that we won't be seeing any more of the above assertion failure (or, if we do, that it's an actual divergence and not fallout from a full disk, etc.).
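
To illustrate the failure mode and the proposed change, here is a toy, self-contained sketch; the types and method names are simplified stand-ins, not the actual code in replica.go. The idea: if applying the batch to disk fails, fail fatally on that root cause instead of falling through to the state assertion, which would otherwise report a misleading off-by-one divergence.

```go
package main

import (
	"errors"
	"log"
)

// Toy stand-in for a replica's state; not CockroachDB's actual types.
type replica struct {
	inMemAppliedIndex  uint64
	onDiskAppliedIndex uint64
	diskFull           bool
}

// applyBatch simulates committing a Raft write batch to disk; a full disk
// or other I/O error would surface here.
func (r *replica) applyBatch(index uint64) error {
	if r.diskFull {
		return errors.New("no space left on device")
	}
	r.onDiskAppliedIndex = index
	return nil
}

// assertState mirrors the assertion that produced the reports above: it
// compares the in-memory state against what is actually on disk.
func (r *replica) assertState() {
	if r.inMemAppliedIndex != r.onDiskAppliedIndex {
		log.Fatalf("on-disk and in-memory state diverged: %d != %d",
			r.onDiskAppliedIndex, r.inMemAppliedIndex)
	}
}

func (r *replica) processCommand(index uint64) {
	r.inMemAppliedIndex = index // in-memory state gets bumped
	if err := r.applyBatch(index); err != nil {
		// Old behavior: mark the replica corrupt and fall through to
		// assertState(), which then reports a confusing divergence.
		// Proposed behavior: fail fatally on the root cause right away.
		log.Fatalf("could not apply raft command: %v", err)
	}
	r.assertState()
}

func main() {
	r := &replica{diskFull: true} // simulate the out-of-disk scenario
	r.processCommand(69)          // dies on the apply error, not the assertion
}
```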

tbg added a commit to tbg/cockroach that referenced this issue Oct 23, 2017
Previously, if `applyRaftCommand` returned an error, it would mark the replica
as corrupt but then go on and execute the side effects and potentially the
assertions in `assertState()`. These were then likely to fail and return a
misleading error, as likely seen in cockroachdb#16004.

Instead, cause a fatal error right when observing the error, and potentially
capture the root cause on sentry.io. The (perhaps too optimistic) expectation
is that after accounting for these disk corruption/space errors, there will be
far fewer (possibly no) reports triggered by `assertState()`.

Touches cockroachdb#16004.
tbg added a commit to tbg/cockroach that referenced this issue Oct 23, 2017
Previously, if `applyRaftCommand` returned an error, it would mark the replica
as corrupt but then go on and execute the side effects and potentially the
assertions in `assertState()`. These were then likely to fail and return a
misleading error, as likely seen in cockroachdb#16004.

Instead, cause a fatal error right when observing the error, and potentially
capture the root cause on sentry.io. The (perhaps too optimistic) expectation
is that after accounting for these disk corruption/space errors, there will be
far fewer (possibly no) reports triggered by `assertState()`.

Touches cockroachdb#16004.
tbg added a commit to tbg/cockroach that referenced this issue Oct 24, 2017
Previously, if `applyRaftCommand` returned an error, it would mark the replica
as corrupt but then go on and execute the side effects and potentially the
assertions in `assertState()`. These were then likely to fail and return a
misleading error, as likely seen in cockroachdb#16004.

Instead, cause a fatal error right when observing the error, and potentially
capture the root cause on sentry.io. The (perhaps too optimistic) expectation
is that after accounting for these disk corruption/space errors, there will be
far fewer (possibly no) reports triggered by `assertState()`.

Touches cockroachdb#16004.
tbg (Member, Author) commented Nov 16, 2017

Closing as not actionable, but we've since identified a few issues that could be at fault here, most prominently #19448 and #19447.

tbg closed this as completed Nov 16, 2017