storage: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost? #25173
This was most recently seen on 5/9, on v2.0.1 (after <10s of uptime). Assigning @bdarnell since he is looking at a similar issue. Back in the day, we saw these errors more frequently when joining together incompatible clusters, but we prevent this today, or so I thought. The other root cause for this is straight-up data loss (or a Raft bug).
It would be useful to know the numbers here, specifically whether the second one is zero or non-zero. The cluster-mixing issue would usually have a zero, I believe.
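For context on those numbers: the fatal error comes from the invariant check in etcd/raft that a follower must never be told to commit past the last index it actually has in its log. Below is a minimal, self-contained sketch of that check (paraphrased, not the verbatim etcd/raft code); the raftLog type and the example values are illustrative only.

package main

import "log"

// raftLog is a tiny stand-in for a follower's view of its own log; the real
// type lives in etcd/raft, and this only illustrates the invariant behind the
// error message in this issue.
type raftLog struct {
	committed uint64   // highest index known to be committed
	entries   []string // stand-in for the entries actually on disk
}

func (l *raftLog) lastIndex() uint64 { return uint64(len(l.entries)) }

// commitTo enforces the invariant: the leader's advertised commit index (the
// first number in the error) must not exceed the follower's lastIndex (the
// second number).
func (l *raftLog) commitTo(tocommit uint64) {
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			log.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}

func main() {
	l := &raftLog{entries: []string{"a", "b", "c"}} // lastIndex() == 3
	l.commitTo(3)  // fine
	l.commitTo(10) // panics with the message above
}

Read that way, a zero second number means the follower's log is empty (the cluster-mixing signature), while a non-zero second number points more toward truncation or data loss.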
I was running a three-node (default roachprod GCE) cluster with 100k splits today. The workload was kv with concurrency 60, run from a fourth GCE box. I restarted node3 a few times, both with
and that file contains only the log header (@knz, any ideas?). The crashes I've seen on this cluster are and then after restarting again
I was running 6c70e54 with the following (perhaps irrelevant, perhaps exacerbating) patch:

commit 397c4a6df22d3a4cc9feb387c313d8ee04a49f54
Author: Tobias Schottdorf <[email protected]>
Date: Wed Jun 6 08:57:19 2018 -0400
storage: hack: quiesce with leader and one follower up to date
Release note: None
diff --git a/pkg/storage/replica.go b/pkg/storage/replica.go
index a1455f2f35..0bddbd8cbc 100644
--- a/pkg/storage/replica.go
+++ b/pkg/storage/replica.go
@@ -4097,6 +4097,7 @@ func shouldReplicaQuiesce(
 		return nil, false
 	}
 	var foundSelf bool
+	var foundOther bool
 	for id, progress := range status.Progress {
 		if id == status.ID {
 			foundSelf = true
@@ -4106,7 +4107,10 @@ func shouldReplicaQuiesce(
 				log.Infof(ctx, "not quiescing: replica %d match (%d) != applied (%d)",
 					id, progress.Match, status.Applied)
 			}
-			return nil, false
+			// HACK
+			// return nil, false
+		} else if id != status.ID {
+			foundOther = true
 		}
 	}
 	if !foundSelf {
@@ -4116,6 +4120,10 @@ func shouldReplicaQuiesce(
 		}
 		return nil, false
 	}
+	if !foundOther {
+		// HACK: Can quiesce if self and someone else are up to date, doesn't have to be everybody.
+		return nil, false
+	}
 	if q.hasRaftReadyRLocked() {
 		if log.V(4) {
 			log.Infof(ctx, "not quiescing: raft ready")
The logs are flushed every second. If the crash occurs before a second has elapsed after a rotation, then the log file will be empty.
We do flush on Go panics and log fatals, though.
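For illustration only (this is not CockroachDB's actual logging code), the behavior described above amounts to roughly the following: output is buffered and flushed on a one-second ticker, while a fatal error flushes synchronously before exiting, so only output written in the last second before a non-fatal crash is lost. All names in this sketch are made up.

package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
	"time"
)

// flushingLogger buffers log output and flushes it periodically, plus
// explicitly on fatal errors.
type flushingLogger struct {
	mu sync.Mutex
	w  *bufio.Writer
}

func newFlushingLogger(f *os.File, interval time.Duration) *flushingLogger {
	l := &flushingLogger{w: bufio.NewWriter(f)}
	go func() {
		for range time.Tick(interval) { // e.g. once per second
			l.mu.Lock()
			_ = l.w.Flush()
			l.mu.Unlock()
		}
	}()
	return l
}

// Infof writes to the buffer; the message only reaches disk on the next flush,
// which is why a file rotated just before a hard crash can contain only its header.
func (l *flushingLogger) Infof(format string, args ...interface{}) {
	l.mu.Lock()
	defer l.mu.Unlock()
	fmt.Fprintf(l.w, format+"\n", args...)
}

// Fatalf flushes synchronously before exiting, so fatal messages are not lost.
func (l *flushingLogger) Fatalf(format string, args ...interface{}) {
	l.mu.Lock()
	fmt.Fprintf(l.w, format+"\n", args...)
	_ = l.w.Flush()
	l.mu.Unlock()
	os.Exit(1)
}

func main() {
	l := newFlushingLogger(os.Stderr, time.Second)
	l.Infof("buffered until the next periodic flush")
	l.Fatalf("flushed immediately before the process exits")
}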
What I think I'm seeing is that there's a log file
Please use more words -- what is the question exactly?
But I think that's a false positive (right, @bdarnell?). There are no other complaints except a few wrong MVCCStats. That's probably just a statement about the stringency of the
That's just caused by the way the I'm not saying it's ideal, but that's what you're seeing. Nothing related to the issue.
Thanks @a-robinson. For posterity, here's more of the crash (newly obtained). It's in
The applied index should be set at the same time the first index is, so that
The range above is one of the ones that has this bug:
but on the other hand it seems that the message is printed for all ranges.
Ah, this is just fallout from moving the RaftAppliedKey.
I figured out my problem. It's that my diff above results in illegal heartbeats being sent, namely ones that contain the leader's commit index. The follower, if behind, would trip over those. Before my diff, we'd require that all followers had that log index.
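In other words, a quiesce heartbeat must not advertise a commit index beyond what its recipient is known to have. The hack above stopped requiring every follower to be caught up but still sent the leader's full commit index, which a lagging follower rejects with the panic in this issue. A hypothetical helper (the name and placement are illustrative, not actual CockroachDB code) shows the clamping that would be needed:

package storage // placement is illustrative only

// quiesceHeartbeatCommit returns a commit index that is safe to put into a
// final heartbeat to a given follower. Advertising the leader's commit index
// to a follower whose log ends earlier asks it to commit entries it does not
// have, which trips the "tocommit(%d) is out of range" check on the follower.
func quiesceHeartbeatCommit(leaderCommit, followerMatch uint64) uint64 {
	if followerMatch < leaderCommit {
		return followerMatch // clamp to what the follower is known to have
	}
	return leaderCommit
}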
Sentry issue: COCKROACHDB-J6 |
Sentry issue: COCKROACHDB-J5 |
When rare errors happen in the wild, they are a) often unreported or b) reported only in anonymized form, with little context that can help pinpoint the root cause. Users can help us out tremendously by contacting us, and so we should incentivize that. Do so by concluding fatal errors (i.e. most crashes) with a call to action.

Touches cockroachdb#28699.
Touches cockroachdb#24033.
Touches cockroachdb#25173.

Release note: None
30898: log: include a call to action with fatal errors r=knz,petermattis a=tschottdorf

When rare errors happen in the wild, they are a) often unreported or b) reported only in anonymized form, with little context that can help pinpoint the root cause. Users can help us out tremendously by contacting us, and so we should incentivize that. Do so by concluding fatal errors (i.e. most crashes) with a call to action.

Touches #28699.
Touches #24033.
Touches #25173.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
This reverts commit a321424. We have a report with credible testing that using FlushWAL instead of SyncWAL causes data loss in disk full situations. Presumably there is some error that is not being propagated correctly. Possibly related to cockroachdb#31948. See cockroachdb#25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk becomes temporarily full.
32605: Revert "libroach: use FlushWAL instead of SyncWAL" r=bdarnell a=petermattis

This reverts commit a321424. We have a report with credible testing that using FlushWAL instead of SyncWAL causes data loss in disk full situations. Presumably there is some error that is not being propagated correctly. Possibly related to #31948. See #25173. Possibly unrelated, but the symptom is the same.

Release note (bug fix): Fix a node data loss bug that occurs when a disk becomes temporarily full.

Co-authored-by: Peter Mattis <[email protected]>
Fixes cockroachdb#32631
Fixes cockroachdb#25173

Release note: None
32674: roachtest: add disk-full roachtest r=bdarnell a=petermattis

RocksDB was missing some error checks on the code paths involved in `FlushWAL`. The result was that writes that failed when the disk is full were erroneously being considered success.

Fixes #32631
Fixes #25173

Release note (performance improvement): Re-enable usage of RocksDB FlushWAL which is a minor performance improvement for synchronous RocksDB write operations.

Co-authored-by: Peter Mattis <[email protected]>
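The failure mode behind the revert and this fix comes down to a dropped error on the WAL flush path. A hedged sketch of the ordering that matters (illustrative function names, not the actual RocksDB/libroach code): the write may only be acknowledged after the flush/sync error has been checked, otherwise ENOSPC on a full disk turns into silent data loss.

package storage // placement is illustrative only

// commitBatch sketches the error propagation that the fix above restores:
// a write is acknowledged only if flushing the WAL succeeded. The bug class
// was the opposite: the flush error (e.g. disk full) was dropped, so writes
// that never reached stable storage were reported as durable.
func commitBatch(applyBatch func() error, flushWAL func() error) error {
	if err := applyBatch(); err != nil {
		return err
	}
	if err := flushWAL(); err != nil {
		return err // propagate, so the caller does not acknowledge the write
	}
	return nil
}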
https://sentry.io/cockroach-labs/cockroachdb/issues/415952086/