
storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost? #25173

Closed
rytaft opened this issue Apr 30, 2018 · 16 comments
Labels
A-kv-replication  Relating to Raft, consensus, and coordination.
C-bug             Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-1-stability     Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting.

Comments

@rytaft
Collaborator

rytaft commented Apr 30, 2018

https://sentry.io/cockroach-labs/cockroachdb/issues/415952086/

*log.safeError: stopper.go:175: string
  File "github.com/cockroachdb/cockroach/pkg/storage/raft.go", line 106, in Panicf
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3335, in func1
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 607, in withRaftGroupLocked
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 624, in withRaftGroup
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3330, in stepRaftGroup
...
(14 additional frame(s) were not displayed)

stopper.go:175: string
@jordanlewis added the S-1-stability, C-bug, and A-kv-replication labels May 1, 2018
@tbg tbg changed the title *log.safeError: stopper.go:175: string storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?" May 14, 2018
@tbg tbg changed the title storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?" storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost? May 14, 2018
@tbg
Member

tbg commented May 14, 2018

This was most recently seen on 5/9, on v2.0.1 (after <10s of uptime). Assigning @bdarnell since he is looking at a similar issue. Back in the day, we saw these errors more frequently when joining together incompatible clusters, but we prevent this today, or so I thought. The other root cause for this is straight-up data loss (or a Raft bug).

@tbg tbg assigned bdarnell and unassigned tbg May 14, 2018
@bdarnell
Contributor

It would be useful to know the numbers here, specifically whether the second one is zero or non-zero. The cluster-mixing issue would usually have a zero lastIndex. If lastIndex is non-zero, this is a separate issue (which I'm currently investigating from a non-public report).

I believe setting kv.raft_log.synchronize to false could cause this error after an unclean shutdown.

@tbg
Member

tbg commented Jun 6, 2018

I was running a three node (default roachprod GCE) cluster with 100k splits today. The workload was kv with concurrency 60, run from a fourth GCE box. I restarted node3 a few times, both with cockroach quit and roachprod stop (aka kill -9). I got this error after what probably was cockroach quit, a few hours into the lifetime of the cluster, when node3 came back up. I say that because the log that has the crash first is 2018-06-06T13_25_24Z.003633.log and the preceding logfile 2018-06-06T12_59_27Z.003326.log has the signs of a clean shutdown. Curiously there's also a log file before that at almost the same timestamp:

$ ls logs/cockroach.tobias-quiesce-0003.tschottdorf.2018-06-06T12_59_27Z.0033*
logs/cockroach.tobias-quiesce-0003.tschottdorf.2018-06-06T12_59_27Z.003316.log
logs/cockroach.tobias-quiesce-0003.tschottdorf.2018-06-06T12_59_27Z.003326.log

and that file contains only the log header (@knz, any ideas?).

The crashes I've seen on this cluster are

E180606 13:25:48.627621 92 vendor/github.com/coreos/etcd/raft/log.go:191  [n3,s3,r5/2:/System/ts{d-e}] tocommit(2195) is out of range [lastIndex(2126)]. Was the raft log corrupted, truncated, or lost?

and then after restarting again

E180606 19:49:04.637886 489 vendor/github.com/coreos/etcd/raft/log.go:191  [n3,s3,r56383/3:/Table/53/1/-294{9776…-8788…}] tocommit(628) is out of range [lastIndex(63)]. Was the raft log corrupted, truncated, or lost?

I was running 6c70e54 with the following (perhaps irrelevant, perhaps exacerbating) patch:

commit 397c4a6df22d3a4cc9feb387c313d8ee04a49f54
Author: Tobias Schottdorf <[email protected]>
Date:   Wed Jun 6 08:57:19 2018 -0400

    storage: hack: quiesce with leader and one follower up to date
    
    Release note: None

diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index a1455f2f35..0bddbd8cbc 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -4097,6 +4097,7 @@ func shouldReplicaQuiesce(
 		return nil, false
 	}
 	var foundSelf bool
+	var foundOther bool
 	for id, progress := range status.Progress {
 		if id == status.ID {
 			foundSelf = true
@@ -4106,7 +4107,10 @@ func shouldReplicaQuiesce(
 				log.Infof(ctx, "not quiescing: replica %d match (%d) != applied (%d)",
 					id, progress.Match, status.Applied)
 			}
-			return nil, false
+			// HACK
+			// return nil, false
+		} else if id != status.ID {
+			foundOther = true
 		}
 	}
 	if !foundSelf {
@@ -4116,6 +4120,10 @@ func shouldReplicaQuiesce(
 		}
 		return nil, false
 	}
+	if !foundOther {
+		// HACK: Can quiesce if self and someone else are up to date, doesn't have to be everybody.
+		return nil, false
+	}
 	if q.hasRaftReadyRLocked() {
 		if log.V(4) {
 			log.Infof(ctx, "not quiescing: raft ready")

@knz
Contributor

knz commented Jun 6, 2018

The logs are flushed every second. If the crash occurs before a second has elapsed after a rotation, then the log file will be empty.

@knz
Contributor

knz commented Jun 6, 2018

We do flush on go panics and log fatals though.

@tbg
Member

tbg commented Jun 6, 2018

What I think I'm seeing is that there's a log file a and a log file b, where a immediately precedes b, a contains only the log header, and b contains the marks of ./cockroach quit.

@knz
Contributor

knz commented Jun 6, 2018

please use more words -- what is the question exactly?

@tbg
Member

tbg commented Jun 6, 2018

debug check-store shows a lot of

range 60062: applied index 0 should be between first index 11 and last index 18

But I think that's a false positive (right @bdarnell?)

There are no other complaints except a few wrong MVCCStats. That's probably just a statement about the stringency of the check-store command.

@a-robinson
Contributor

> Curiously there's also a log file before that at almost the same timestamp:

That's just caused by the way the --background flag works. The first process spins up and a log file gets created for it. It then starts a second process without the --background flag to be the actual running cockroach server, which gets its own log file at a slightly later timestamp.

I'm not saying it's ideal, but that's what you're seeing. Nothing related to the issue.

@tbg
Member

tbg commented Jun 6, 2018

Thanks @a-robinson.

For posterity, here's more of the crash (newly obtained). It's in handleHeartbeat.

E180606 19:49:04.637886 489 vendor/github.com/coreos/etcd/raft/log.go:191  [n3,s3,r56383/3:/Table/53/1/-294{9776…-8788…}] tocommit(628) is out of range [lastIndex(63)]. Was the raft log corrupted, truncated, or lost?
E180606 19:49:04.637914 489 util/log/crash_reporting.go:203  [n3,s3] a panic has occurred!
E180606 19:49:04.638486 489 util/log/crash_reporting.go:477  [n3,s3] Reported as error 31d1824ed6104d29afc57b95d9a888df
panic: tocommit(628) is out of range [lastIndex(63)]. Was the raft log corrupted, truncated, or lost? [recovered]
        panic: tocommit(628) is out of range [lastIndex(63)]. Was the raft log corrupted, truncated, or lost?

goroutine 489 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc420020240, 0x2a19fe0, 0xc43bb2c690)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:178 +0x11f
panic(0x21f5fc0, 0xc430540cf0)
        /home/tschottdorf/go1.10/src/runtime/panic.go:505 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc430540be0, 0x265172f, 0x5d, 0xc453ef0800, 0x2, 0x2)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:106 +0xfe
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc4317361c0, 0x274)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc420735000, 0x8, 0x3, 0x1, 0xc, 0x0, 0x0, 0x0, 0x0, 0x0, ...)

@bdarnell
Contributor

bdarnell commented Jun 6, 2018

The applied index should be set at the same time the first index is, so that check-store message doesn't look like a false positive to me. Could be a bug in check-store, though.

@tbg
Member

tbg commented Jun 6, 2018

The range above is one of the ones that has this bug:

range 56383: applied index 0 should be between first index 11 and last index 63

but on the other hand it seems that the message is printed for all ranges.

$ cat checkstore | grep -c 'should be between'
100128

@tbg
Member

tbg commented Jun 6, 2018

Ah this is just fallout from moving the RaftAppliedKey.

@tbg
Member

tbg commented Jun 6, 2018

I figured out my problem. It's that my diff above results in illegal heartbeats being sent, namely ones that contain the leader's commit index. The follower, if behind, would trip over those. Before my diff, we'd require that all followers had that log index.
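
For anyone following along, here is a self-contained Go toy model of why such a heartbeat is fatal. This is illustrative only, not the actual etcd/raft or CockroachDB code; the names are loosely modeled on it. The follower trusts the commit index carried by a heartbeat, and the commitTo check at log.go:191 panics with exactly this issue's error when that index exceeds the follower's lastIndex:

package main

import "fmt"

// raftLog is a minimal stand-in for a follower's raft log.
type raftLog struct {
	committed uint64   // highest index known to be committed
	entries   []uint64 // placeholder entries; len(entries) == lastIndex
}

func (l *raftLog) lastIndex() uint64 { return uint64(len(l.entries)) }

// commitTo models the invariant enforced at etcd/raft's log.go:191: a
// follower may only advance its commit index to an entry it actually has.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]. "+
				"Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex()))
		}
		l.committed = tocommit
	}
}

// handleHeartbeat models the follower side of a heartbeat: it blindly
// trusts the commit index the leader put into the message.
func (l *raftLog) handleHeartbeat(leaderCommit uint64) {
	l.commitTo(leaderCommit)
}

func main() {
	follower := &raftLog{entries: make([]uint64, 63)} // lastIndex() == 63
	follower.handleHeartbeat(628)                     // leader commit index from the crash above -> panic
}

This is also why the pre-patch requirement that every follower's Match equal the leader's applied index mattered: it guaranteed that no follower could be handed a commit index past the end of its own log.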

@andy-kimball
Contributor

Sentry issue: COCKROACHDB-J6

@andy-kimball
Contributor

Sentry issue: COCKROACHDB-J5

@bdarnell bdarnell modified the milestones: 2.1, 2.2 Aug 15, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 2, 2018
When rare errors happen in the wild, they are a) often unreported or b)
reported only in anonymized form, with little context that can help
pinpoint the root cause.

Users can help us out tremendously by contacting us, and so we should
incentivize that.

Do so by concluding fatal errors (i.e. most crashes) with a call to
action.

Touches cockroachdb#28699.
Touches cockroachdb#24033.
Touches cockroachdb#25173.

Release note: None
craig bot pushed a commit that referenced this issue Oct 3, 2018
30898: log: include a call to action with fatal errors r=knz,petermattis a=tschottdorf

When rare errors happen in the wild, they are a) often unreported or b)
reported only in anonymized form, with little context that can help
pinpoint the root cause.

Users can help us out tremendously by contacting us, and so we should
incentivize that.

Do so by concluding fatal errors (i.e. most crashes) with a call to
action.

Touches #28699.
Touches #24033.
Touches #25173.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Oct 3, 2018
When rare errors happen in the wild, they are a) often unreported or b)
reported only in anonymized form, with little context that can help
pinpoint the root cause.

Users can help us out tremendously by contacting us, and so we should
incentivize that.

Do so by concluding fatal errors (i.e. most crashes) with a call to
action.

Touches cockroachdb#28699.
Touches cockroachdb#24033.
Touches cockroachdb#25173.

Release note: None
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 26, 2018
This reverts commit a321424.

We have a report with credible testing that using FlushWAL instead of
SyncWAL causes data loss in disk full situations. Presumably there is
some error that is not being propagated correctly.
Possibly related to cockroachdb#31948.

See cockroachdb#25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk
becomes temporarily full.
craig bot pushed a commit that referenced this issue Nov 27, 2018
32605: Revert "libroach: use FlushWAL instead of SyncWAL" r=bdarnell a=petermattis

This reverts commit a321424.

We have a report with credible testing that using FlushWAL instead of
SyncWAL causes data loss in disk full situations. Presumably there is
some error that is not being propagated correctly.
Possibly related to #31948.

See #25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk
becomes temporarily full.

Co-authored-by: Peter Mattis <[email protected]>
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 27, 2018
This reverts commit a321424.

We have a report with credible testing that using FlushWAL instead of
SyncWAL causes data loss in disk full situations. Presumably there is
some error that is not being propagated correctly.
Possibly related to cockroachdb#31948.

See cockroachdb#25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk
becomes temporarily full.
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 30, 2018
craig bot pushed a commit that referenced this issue Nov 30, 2018
32674: roachtest: add disk-full roachtest r=bdarnell a=petermattis

RocksDB was missing some error checks on the code paths involved in `FlushWAL`. The result was that writes that failed when the disk was full were erroneously considered successful.

Fixes #32631
Fixes #25173

Release note (performance improvement): Re-enable usage of RocksDB FlushWAL which is a minor performance improvement for synchronous RocksDB write operations.

Co-authored-by: Peter Mattis <[email protected]>
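
(Aside, for readers hitting this via search: the class of bug described in that PR, where an ignored error turns a failed flush into an apparent success, looks roughly like the following hypothetical Go sketch. The function names here are made up for illustration and are not the actual RocksDB/libroach code.)

package main

import (
	"errors"
	"fmt"
)

// errDiskFull stands in for the write error a full disk would produce.
var errDiskFull = errors.New("no space left on device")

// flushWALIgnoringError models the buggy pattern: the flush can fail,
// but the error is dropped, so the caller believes the write is durable.
func flushWALIgnoringError(flush func() error) bool {
	flush() // error discarded: a full disk looks like a successful sync
	return true
}

// flushWALChecked models the fix: propagate the error so the caller can
// fail the write instead of acknowledging data that was never persisted.
func flushWALChecked(flush func() error) error {
	if err := flush(); err != nil {
		return fmt.Errorf("WAL flush failed: %w", err)
	}
	return nil
}

func main() {
	failingFlush := func() error { return errDiskFull }
	fmt.Println("buggy path reports success:", flushWALIgnoringError(failingFlush))
	fmt.Println("fixed path reports:", flushWALChecked(failingFlush))
}
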
@craig craig bot closed this as completed in #32674 Nov 30, 2018