
storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost? #25173

Closed
rytaft opened this issue Apr 30, 2018 · 16 comments
Labels
A-kv-replication  Relating to Raft, consensus, and coordination.
C-bug             Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-1-stability     Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting.

Comments

@rytaft
Collaborator

rytaft commented Apr 30, 2018

https://sentry.io/cockroach-labs/cockroachdb/issues/415952086/

*log.safeError: stopper.go:175: string
  File "github.com/cockroachdb/cockroach/pkg/storage/raft.go", line 106, in Panicf
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3335, in func1
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 607, in withRaftGroupLocked
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 624, in withRaftGroup
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 3330, in stepRaftGroup
...
(14 additional frame(s) were not displayed)

stopper.go:175: string
@jordanlewis added the S-1-stability, C-bug, and A-kv-replication labels May 1, 2018
@tbg tbg changed the title *log.safeError: stopper.go:175: string storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?" May 14, 2018
@tbg tbg changed the title storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?" storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost? May 14, 2018
@tbg
Member

tbg commented May 14, 2018

This was most recently seen on 5/9, on v2.0.1 (after <10s of uptime). Assigning @bdarnell since he is looking at a similar issue. Back in the day, we saw these errors more frequently when joining together incompatible clusters, but we prevent this today, or so I thought. The other root cause for this is straight-up data loss (or a Raft bug).

@tbg tbg assigned bdarnell and unassigned tbg May 14, 2018
@bdarnell
Contributor

It would be useful to know the numbers here, specifically whether the second one is zero or non-zero. The cluster-mixing issue would usually have a zero lastIndex. If lastIndex is non-zero, this is a separate issue (which I'm currently investigating from a non-public report).

I believe setting kv.raft_log.synchronize to false could cause this error after an unclean shutdown.

@tbg
Member

tbg commented Jun 6, 2018

I was running a three node (default roachprod GCE) cluster with 100k splits today. The workload was kv with concurrency 60, run from a fourth GCE box. I restarted node3 a few times, both with cockroach quit and roachprod stop (aka kill -9). I got this error after what probably was cockroach quit, a few hours into the lifetime of the cluster, when node3 came back up. I say that because the log that has the crash first is 2018-06-06T13_25_24Z.003633.log and the preceding logfile 2018-06-06T12_59_27Z.003326.log has the signs of a clean shutdown. Curiously there's also a log file before that at almost the same timestamp:

$ ls logs/cockroach.tobias-quiesce-0003.tschottdorf.2018-06-06T12_59_27Z.0033*
logs/cockroach.tobias-quiesce-0003.tschottdorf.2018-06-06T12_59_27Z.003316.log
logs/cockroach.tobias-quiesce-0003.tschottdorf.2018-06-06T12_59_27Z.003326.log

and that file contains only the log header (@knz, any ideas?).

The crashes I've seen on this cluster are

E180606 13:25:48.627621 92 vendor/github.com/coreos/etcd/raft/log.go:191  [n3,s3,r5/2:/System/ts{d-e}] tocommit(2195) is out of range [lastIndex(2126)]. Was the raft log corrupted, truncated, or lost?

and then after restarting again

E180606 19:49:04.637886 489 vendor/github.com/coreos/etcd/raft/log.go:191  [n3,s3,r56383/3:/Table/53/1/-294{9776…-8788…}] tocommit(628) is out of range [lastIndex(63)]. Was the raft log corrupted, truncated, or lost?

I was running 6c70e54 with the following (perhaps irrelevant, perhaps exacerbating) patch:

commit 397c4a6df22d3a4cc9feb387c313d8ee04a49f54
Author: Tobias Schottdorf <[email protected]>
Date:   Wed Jun 6 08:57:19 2018 -0400

    storage: hack: quiesce with leader and one follower up to date
    
    Release note: None

diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index a1455f2f35..0bddbd8cbc 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -4097,6 +4097,7 @@ func shouldReplicaQuiesce(
 		return nil, false
 	}
 	var foundSelf bool
+	var foundOther bool
 	for id, progress := range status.Progress {
 		if id == status.ID {
 			foundSelf = true
@@ -4106,7 +4107,10 @@ func shouldReplicaQuiesce(
 				log.Infof(ctx, "not quiescing: replica %d match (%d) != applied (%d)",
 					id, progress.Match, status.Applied)
 			}
-			return nil, false
+			// HACK
+			// return nil, false
+		} else if id != status.ID {
+			foundOther = true
 		}
 	}
 	if !foundSelf {
@@ -4116,6 +4120,10 @@ func shouldReplicaQuiesce(
 		}
 		return nil, false
 	}
+	if !foundOther {
+		// HACK: Can quiesce if self and someone else are up to date, doesn't have to be everybody.
+		return nil, false
+	}
 	if q.hasRaftReadyRLocked() {
 		if log.V(4) {
 			log.Infof(ctx, "not quiescing: raft ready")

@knz
Contributor

knz commented Jun 6, 2018

The logs are flushed every second. If the crash occurs before a second has elapsed after a rotation, then the log file will be empty.

@knz
Contributor

knz commented Jun 6, 2018

We do flush on go panics and log fatals though.

@tbg
Member

tbg commented Jun 6, 2018

What I think I'm seeing is that there's a log file a and a log file b, where a immediately precedes b, a contains only the log header, and b contains the marks of ./cockroach quit.

@knz
Contributor

knz commented Jun 6, 2018

please use more words -- what is the question exactly?

@tbg
Member

tbg commented Jun 6, 2018

debug check-store shows a lot of

range 60062: applied index 0 should be between first index 11 and last index 18

But I think that's a false positive (right @bdarnell?)

There are no other complaints except a few wrong MVCCStats. That's probably just a statement about the stringency of the check-store command.

@a-robinson
Contributor

> Curiously there's also a log file before that at almost the same timestamp:

That's just caused by the way the --background flag works. The first process spins up and a log file gets created for it. It then starts a second process without the --background flag to be the actual running cockroach server, which gets its own log file at a slightly later timestamp.

I'm not saying it's ideal, but that's what you're seeing. Nothing related to the issue.

@tbg
Member

tbg commented Jun 6, 2018

Thanks @a-robinson.

For posterity, here's more of the crash (newly obtained). It's in handleHeartbeat.

E180606 19:49:04.637886 489 vendor/github.com/coreos/etcd/raft/log.go:191  [n3,s3,r56383/3:/Table/53/1/-294{9776…-8788…}] tocommit(628) is out of range [lastIndex(63)]. Was the raft log corrupted, truncated, or lost?
E180606 19:49:04.637914 489 util/log/crash_reporting.go:203  [n3,s3] a panic has occurred!
E180606 19:49:04.638486 489 util/log/crash_reporting.go:477  [n3,s3] Reported as error 31d1824ed6104d29afc57b95d9a888df
panic: tocommit(628) is out of range [lastIndex(63)]. Was the raft log corrupted, truncated, or lost? [recovered]
        panic: tocommit(628) is out of range [lastIndex(63)]. Was the raft log corrupted, truncated, or lost?

goroutine 489 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc420020240, 0x2a19fe0, 0xc43bb2c690)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:178 +0x11f
panic(0x21f5fc0, 0xc430540cf0)
        /home/tschottdorf/go1.10/src/runtime/panic.go:505 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftLogger).Panicf(0xc430540be0, 0x265172f, 0x5d, 0xc453ef0800, 0x2, 0x2)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.go:106 +0xfe
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc4317361c0, 0x274)
        /home/tschottdorf/go/src/github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
github.com/cockroachdb/cockroach/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc420735000, 0x8, 0x3, 0x1, 0xc, 0x0, 0x0, 0x0, 0x0, 0x0, ...)

@bdarnell
Contributor

bdarnell commented Jun 6, 2018

The applied index should be set at the same time the first index is, so that check-store message doesn't look like a false positive to me. Could be a bug in check-store, though.

@tbg
Member

tbg commented Jun 6, 2018

The range above is one of the ones that has this bug:

range 56383: applied index 0 should be between first index 11 and last index 63

but on the other hand it seems that the message is printed for all ranges.

$ cat checkstore | grep -c 'should be between'
100128

@tbg
Member

tbg commented Jun 6, 2018

Ah this is just fallout from moving the RaftAppliedKey.

@tbg
Member

tbg commented Jun 6, 2018

I figured out my problem. It's that my diff above results in illegal heartbeats being sent, namely ones that contain the leader's commit index. The follower, if behind, would trip over those. Before my diff, we'd require that all followers had that log index.
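
For anyone following along, here is a self-contained Go toy model of why such a heartbeat is fatal. This is illustrative only, not the actual etcd/raft or CockroachDB code; the names are loosely modeled on it. The follower trusts the commit index carried by a heartbeat, and the commitTo check at log.go:191 panics with exactly this issue's error when that index exceeds the follower's lastIndex:

package main

import "fmt"

// raftLog is a minimal stand-in for a follower's raft log.
type raftLog struct {
	committed uint64   // highest index known to be committed
	entries   []uint64 // placeholder entries; len(entries) == lastIndex
}

func (l *raftLog) lastIndex() uint64 { return uint64(len(l.entries)) }

// commitTo models the invariant enforced at etcd/raft's log.go:191: a
// follower may only advance its commit index to an entry it actually has.
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]. "+
				"Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex()))
		}
		l.committed = tocommit
	}
}

// handleHeartbeat models the follower side of a heartbeat: it blindly
// trusts the commit index the leader put into the message.
func (l *raftLog) handleHeartbeat(leaderCommit uint64) {
	l.commitTo(leaderCommit)
}

func main() {
	follower := &raftLog{entries: make([]uint64, 63)} // lastIndex() == 63
	follower.handleHeartbeat(628)                     // leader commit index from the crash above -> panic
}

This is also why the pre-patch requirement that every follower's Match equal the leader's applied index mattered: it guaranteed that no follower could be handed a commit index past the end of its own log.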

@andy-kimball
Contributor

Sentry issue: COCKROACHDB-J6

@andy-kimball
Contributor

Sentry issue: COCKROACHDB-J5

@bdarnell bdarnell modified the milestones: 2.1, 2.2 Aug 15, 2018
tbg added a commit to tbg/cockroach that referenced this issue Oct 2, 2018
When rare errors happen in the wild, they are a) often unreported or b)
reported only in anonymized form, with little context that can help
pinpoint the root cause.

Users can help us out tremendously by contacting us, and so we should
incentivize that.

Do so by concluding fatal errors (i.e. most crashes) with a call to
action.

Touches cockroachdb#28699.
Touches cockroachdb#24033.
Touches cockroachdb#25173.

Release note: None
craig bot pushed a commit that referenced this issue Oct 3, 2018
30898: log: include a call to action with fatal errors r=knz,petermattis a=tschottdorf

When rare errors happen in the wild, they are a) often unreported or b)
reported only in anonymized form, with little context that can help
pinpoint the root cause.

Users can help us out tremendously by contacting us, and so we should
incentivize that.

Do so by concluding fatal errors (i.e. most crashes) with a call to
action.

Touches #28699.
Touches #24033.
Touches #25173.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this issue Oct 3, 2018
When rare errors happen in the wild, they are a) often unreported or b)
reported only in anonymized form, with little context that can help
pinpoint the root cause.

Users can help us out tremendously by contacting us, and so we should
incentivize that.

Do so by concluding fatal errors (i.e. most crashes) with a call to
action.

Touches cockroachdb#28699.
Touches cockroachdb#24033.
Touches cockroachdb#25173.

Release note: None
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 26, 2018
This reverts commit a321424.

We have a report with credible testing that using FlushWAL instead of
SyncWAL causes data loss in disk full situations. Presumably there is
some error that is not being propagated correctly.
Possibly related to cockroachdb#31948.

See cockroachdb#25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk
becomes temporarily full.
craig bot pushed a commit that referenced this issue Nov 27, 2018
32605: Revert "libroach: use FlushWAL instead of SyncWAL" r=bdarnell a=petermattis

This reverts commit a321424.

We have a report with credible testing that using FlushWAL instead of
SyncWAL causes data loss in disk full situations. Presumably there is
some error that is not being propagated correctly.
Possibly related to #31948.

See #25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk
becomes temporarily full.

Co-authored-by: Peter Mattis <[email protected]>
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 27, 2018
This reverts commit a321424.

We have a report with credible testing that using FlushWAL instead of
SyncWAL causes data loss in disk full situations. Presumably there is
some error that is not being propagated correctly.
Possibly related to cockroachdb#31948.

See cockroachdb#25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk
becomes temporarily full.
petermattis added a commit to petermattis/cockroach that referenced this issue Nov 30, 2018
craig bot pushed a commit that referenced this issue Nov 30, 2018
32674: roachtest: add disk-full roachtest r=bdarnell a=petermattis

RocksDB was missing some error checks on the code paths involved in `FlushWAL`. The result was that writes that failed when the disk was full were erroneously considered successful.

Fixes #32631
Fixes #25173

Release note (performance improvement): Re-enable usage of RocksDB FlushWAL which is a minor performance improvement for synchronous RocksDB write operations.

Co-authored-by: Peter Mattis <[email protected]>
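
(Aside, for readers hitting this via search: the class of bug described in that PR, where an ignored error turns a failed flush into an apparent success, looks roughly like the following hypothetical Go sketch. The function names here are made up for illustration and are not the actual RocksDB/libroach code.)

package main

import (
	"errors"
	"fmt"
)

// errDiskFull stands in for the write error a full disk would produce.
var errDiskFull = errors.New("no space left on device")

// flushWALIgnoringError models the buggy pattern: the flush can fail,
// but the error is dropped, so the caller believes the write is durable.
func flushWALIgnoringError(flush func() error) bool {
	flush() // error discarded: a full disk looks like a successful sync
	return true
}

// flushWALChecked models the fix: propagate the error so the caller can
// fail the write instead of acknowledging data that was never persisted.
func flushWALChecked(flush func() error) error {
	if err := flush(); err != nil {
		return fmt.Errorf("WAL flush failed: %w", err)
	}
	return nil
}

func main() {
	failingFlush := func() error { return errDiskFull }
	fmt.Println("buggy path reports success:", flushWALIgnoringError(failingFlush))
	fmt.Println("fixed path reports:", flushWALChecked(failingFlush))
}
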
@craig craig bot closed this as completed in #32674 Nov 30, 2018