
storage: *errors.errorString: replica.go:1351 on-disk and in-memory state diverged: #16004

Closed

tbg opened this issue May 17, 2017 · 18 comments
tbg (Member) commented May 17, 2017

This happened on a 1.0 binary, so it's neither #15819 nor the more recent #15935.

https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/

*errors.errorString: replica.go:1351 on-disk and in-memory state diverged:
%s
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 1351, in assertStateRLocked
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 1336, in assertState
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go", line 792, in handleEvalResult
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3755, in processRaftCommand
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 2886, in handleRaftReadyRaftMuLocked
...
(5 additional frame(s) were not displayed)

replica.go:1351 on-disk and in-memory state diverged:
%s
tamird (Contributor) commented May 17, 2017 via email

tbg (Member, Author) commented May 17, 2017

Ah, well spotted. Any idea what the misconfiguration is?

petermattis (Collaborator)
@tamird Can you elaborate on why you think this is a sentry misconfiguration? I'm seeing the rev match the v1.0 SHA. What are you looking at?

a-robinson (Contributor)
Where are you seeing that? I do see a couple occurrences at that commit, but also one at 43aa1ed with release=v1.0

tbg (Member, Author) commented May 17, 2017

[screenshot of the Sentry issue page]

on the right hand side, around the middle.

tamird (Contributor) commented May 17, 2017

@a-robinson but 43aa1ed is not the 1.0 SHA! It's on the release branch, but it's not actually 1.0.

That said, there are some events at the real 1.0 SHA: https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/events/?query=rev%3A%2243aa1ed04cd817348921ac27e11e25725f634aa6%22

tamird (Contributor) commented May 17, 2017

cc @dt because I don't understand where the release tag is coming from.

tbg (Member, Author) commented May 17, 2017

@tamird I clicked around in your link but all the events that I opened up had the 0ba0af sha as well. Am I missing something?

tamird (Contributor) commented May 17, 2017

@tschottdorf I think that bar is just showing you the rev with the highest number of occurrences; the actual event's rev is in the middle of the screen in the tags section.

a-robinson (Contributor)
@tamird if 43aa1ed isn't the REV value for the 1.0 release, then what is?

@tschottdorf that's funny, because all the ones I'm seeing at that link are at 43aa1ed.
[screenshot: Sentry events at that link, all showing rev 43aa1ed]

petermattis (Collaborator)
And for a given instance (i.e. https://sentry.io/cockroach-labs/cockroachdb/issues/269875826/), I think you want to look at the rev info in the Tags section. Certainly seems to be 43aa1ed to me.

tamird (Contributor) commented May 17, 2017

@a-robinson eh, you're right - I was basing that on the GH UI which showed me the 1.0-release branch but not the tag, for some reason.

dt (Member) commented May 17, 2017

@tamird release = `buildInfo.Tag`, which comes from

.buildinfo/tag: | .buildinfo
	@{ git describe --tags --exact-match 2> /dev/null || git rev-parse --short HEAD; } | tr -d \\n > $@
	@git diff-index --quiet HEAD || echo -dirty >> $@

rev = `buildInfo.Revision`, i.e. `git rev-parse HEAD`
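
For illustration, here is a minimal sketch of how a tag and revision produced by those git commands could be stamped into a binary via `-ldflags` and then attached to reports at runtime. The package layout and variable names below are hypothetical, not CockroachDB's actual build package:

```go
// Hypothetical sketch: stamp git-derived build info into the binary.
// Build with something like:
//   go build -ldflags "-X main.tag=$(cat .buildinfo/tag) -X main.revision=$(git rev-parse HEAD)" .
package main

import "fmt"

var (
	// Overwritten at link time via -ldflags; the defaults only show up
	// if the binary was built without the stamping step.
	tag      = "unknown" // git describe --tags --exact-match, or short SHA
	revision = "unknown" // git rev-parse HEAD
)

func main() {
	// A crash reporter would send these as the "release" and "rev" tags,
	// which is what the Sentry UI above displays.
	fmt.Printf("release=%s rev=%s\n", tag, revision)
}
```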

tamird (Contributor) commented May 17, 2017

Hm, yeah. Seems real.

tbg (Member, Author) commented May 17, 2017

Ah, I looked at the distribution without realizing that's what it was. Ok, seems real.

tbg added a commit to tbg/cockroach that referenced this issue May 18, 2017
The 1.0.1 ship may already have sailed, but if it has not,
then let us consider whether we want to get this in to get
more information on cockroachdb#16004 in the wild.
tbg added a commit to tbg/cockroach that referenced this issue May 18, 2017
The 1.0.1 ship may already have sailed, but if it has not,
then let us consider whether we want to get this in to get
more information on cockroachdb#16004 in the wild.
mberhault pushed a commit that referenced this issue May 18, 2017
The 1.0.1 ship may already have sailed, but if it has not,
then let us consider whether we want to get this in to get
more information on #16004 in the wild.
petermattis added this to the 1.1 milestone Jun 1, 2017
petermattis changed the title from "*errors.errorString: replica.go:1351 on-disk and in-memory state diverged:" to "storage: *errors.errorString: replica.go:1351 on-disk and in-memory state diverged:" Jun 1, 2017
cuongdo removed this from the 1.1 milestone Aug 22, 2017
cuongdo (Contributor) commented Aug 22, 2017

Removing milestone, as this doesn't seem actionable after chatting with @tschottdorf.

tbg added this to the 1.2 milestone Sep 7, 2017
tbg (Member, Author) commented Oct 23, 2017

This is still happening on v1.0.6, with messages such as

on-disk and in-memory state diverged: [RaftAppliedIndex: 68 != 69 Lease.Expiration.WallTime: 1506597457245193300 != 1506597461745378100 Lease.DeprecatedStartStasis.WallTime: 1506597457245193300 != 1506597461745378100 Lease.ProposedTS.WallTime: 1506597448245204300 != 1506597452745386300] replica.go:1366

on-disk and in-memory state diverged: [RaftAppliedIndex: 234235 != 234236 LeaseAppliedIndex: 232956 != 232957 TruncatedState.Index: 229979 != 234234 TruncatedState.Term: 2643 != 4547] replica.go:1366


on-disk and in-memory state diverged: [RaftAppliedIndex: 0 != 168 LeaseAppliedIndex: 0 != 139 Lease.Start.WallTime: 0 != 1506432427777071984 Lease.Start.Logical: 0 != 1 Lease.Expiration.WallTime: 0 != 1506432530591537745 Lease.Replica.NodeID: 0 != 1 Lease.Replica.StoreID: 0 != 1 Lease.Replica.ReplicaID: 0 != 1 Lease.DeprecatedStartStasis.WallTime: 0 != 1506432530591537745 Lease.ProposedTS: nil != &hlc.Timestamp{WallTime:1506432521591542099, Logical:0} TruncatedState.Index: 0 != 116 TruncatedState.Term: 0 != 7 Stats.LastUpdateNanos: 0 != 1506432522484024584 Stats.GCBytesAge: 0 != 995935 Stats.LiveBytes: 0 != 901 Stats.LiveCount: 0 != 13 Stats.KeyBytes: 0 != 935 Stats.KeyCount: 0 != 13 Stats.ValBytes: 0 != 2785 Stats.ValCount: 0 != 74 Stats.IntentBytes: 0 != 110 Stats.IntentCount: 0 != 2 Stats.SysBytes: 0 != 1068 Stats.SysCount: 0 != 10] replica.go:1366

on-disk and in-memory state diverged: [RaftAppliedIndex: 0 != 183 LeaseAppliedIndex: 0 != 155 Lease.Start.WallTime: 0 != 1506418620962787293 Lease.Expiration.WallTime: 0 != 1506418629962787293 Lease.Replica.NodeID: 0 != 3 Lease.Replica.StoreID: 0 != 3 Lease.Replica.ReplicaID: 0 != 2 Lease.ProposedTS: nil != &hlc.Timestamp{WallTime:1506418620962790073, Logical:0} TruncatedState.Index: 0 != 115 TruncatedState.Term: 0 != 7 Stats.LastUpdateNanos: 0 != 1506418620436285731 Stats.GCBytesAge: 0 != 799523 Stats.LiveBytes: 0 != 839 Stats.LiveCount: 0 != 13 Stats.KeyBytes: 0 != 995 Stats.KeyCount: 0 != 13 Stats.ValBytes: 0 != 2926 Stats.ValCount: 0 != 79 Stats.IntentBytes: 0 != 51 Stats.IntentCount: 0 != 1 Stats.SysBytes: 0 != 1226 Stats.SysCount: 0 != 13] replica.go:1366

on-disk and in-memory state diverged: [RaftAppliedIndex: 68 != 69 Lease.Expiration.WallTime: 1506007046679528500 != 1506007055678808700 Lease.DeprecatedStartStasis.WallTime: 1506007046679528500 != 1506007055678808700 Lease.ProposedTS.WallTime: 1506007037679539500 != 1506007046678811100] replica.go:1366

These look like at least two distinct classes of problems. Perhaps one of them (the ones where the on-disk state is empty) is the result of accidentally linking two clusters together, but the other one looks more serious because it's always an off-by-one. I think what's happening here is that we run out of disk space (or hit some similar I/O error) while committing a Raft batch. That currently marks the replica as corrupted, but then still allows the assertion to run in a situation in which the in-memory state was updated while the actual write failed. I'll send a patch to make that case fail differently, in the hope that we won't be seeing any more of the above assertion failure (or, if we do, that it's an actual divergence and not fallout from a full disk, etc.).
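
To illustrate the failure mode and the proposed change, here is a toy, self-contained sketch; the types and method names are simplified stand-ins, not the actual code in replica.go. The idea: if applying the batch to disk fails, fail fatally on that root cause instead of falling through to the state assertion, which would otherwise report a misleading off-by-one divergence.

```go
package main

import (
	"errors"
	"log"
)

// Toy stand-in for a replica's state; not CockroachDB's actual types.
type replica struct {
	inMemAppliedIndex  uint64
	onDiskAppliedIndex uint64
	diskFull           bool
}

// applyBatch simulates committing a Raft write batch to disk; a full disk
// or other I/O error would surface here.
func (r *replica) applyBatch(index uint64) error {
	if r.diskFull {
		return errors.New("no space left on device")
	}
	r.onDiskAppliedIndex = index
	return nil
}

// assertState mirrors the assertion that produced the reports above: it
// compares the in-memory state against what is actually on disk.
func (r *replica) assertState() {
	if r.inMemAppliedIndex != r.onDiskAppliedIndex {
		log.Fatalf("on-disk and in-memory state diverged: %d != %d",
			r.onDiskAppliedIndex, r.inMemAppliedIndex)
	}
}

func (r *replica) processCommand(index uint64) {
	r.inMemAppliedIndex = index // in-memory state gets bumped
	if err := r.applyBatch(index); err != nil {
		// Old behavior: mark the replica corrupt and fall through to
		// assertState(), which then reports a confusing divergence.
		// Proposed behavior: fail fatally on the root cause right away.
		log.Fatalf("could not apply raft command: %v", err)
	}
	r.assertState()
}

func main() {
	r := &replica{diskFull: true} // simulate the out-of-disk scenario
	r.processCommand(69)          // dies on the apply error, not the assertion
}
```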

tbg added a commit to tbg/cockroach that referenced this issue Oct 23, 2017
Previously, if `applyRaftCommand` returned an error, it would mark the replica
as corrupt but then go on and execute the side effects and potentially the
assertions in `assertState()`. These were then likely to fail and return a
misleading error, as likely seen in cockroachdb#16004.

Instead, cause a fatal error right when observing the error, and potentially
capture the root cause on sentry.io. The (perhaps too optimistic) expectation
is that after accounting for these disk corruption/space errors, there will be
far fewer (possibly no) reports triggered by `assertState()`.

Touches cockroachdb#16004.
tbg added a commit to tbg/cockroach that referenced this issue Oct 23, 2017
Previously, if `applyRaftCommand` returned an error, it would mark the replica
as corrupt but then go on and execute the side effects and potentially the
assertions in `assertState()`. These were then likely to fail and return a
misleading error, as likely seen in cockroachdb#16004.

Instead, cause a fatal error right when observing the error, and potentially
capture the root cause on sentry.io. The (perhaps too optimistic) expectation
is that after accounting for these disk corruption/space errors, there will be
far fewer (possibly no) reports triggered by `assertState()`.

Touches cockroachdb#16004.
tbg added a commit to tbg/cockroach that referenced this issue Oct 24, 2017
Previously, if `applyRaftCommand` returned an error, it would mark the replica
as corrupt but then go on and execute the side effects and potentially the
assertions in `assertState()`. These were then likely to fail and return a
misleading error, as likely seen in cockroachdb#16004.

Instead, cause a fatal error right when observing the error, and potentially
capture the root cause on sentry.io. The (perhaps too optimistic) expectation
is that after accounting for these disk corruption/space errors, there will be
far fewer (possibly no) reports triggered by `assertState()`.

Touches cockroachdb#16004.
tbg (Member, Author) commented Nov 16, 2017

Closing as not actionable, but we've since identified a few issues that could be at fault here, most prominently #19448 and #19447.

tbg closed this as completed Nov 16, 2017