panic: tocommit(691243) is out of range [lastIndex(691183)] causes CrashLoopBackOff #30260
Comments
@tschottdorf Could this be a different manifestation of #28918? v2.1.0-beta.20180910 did not contain the fix (#29579). @cgebe The other possibility is that
Not currently. It is probably possible to do something better here, but it is also possible that the attempt at something better makes the situation a lot worse. Trying to recover from unexpected errors is fraught. If we were to automatically remove some of the bad data, we might inadvertently be removing the last copy of the data.
@petermattis I wiped the node and added it again in a fresh state. Since the first incident, I have not had the same problem. The requirements of my use case, especially time series data storage, may not align fully with CockroachDB's intended usage. I already took pressure off the insertion path by placing an in-memory store in between and periodically inserting into crdb, separated by logical database. As I said, I simply cannot turn Since my knowledge of the underlying processes is limited, I will continue reporting my experience. I certainly look forward to optimizations. I really like CockroachDB's functionality and interfaces (SQL with views) and appreciate your activity. Switching to a specialized time series database would be a step backward in usability (especially in horizontal scaling), and is therefore unlikely.
You're correct that time series data storage does not align well with CockroachDB's intended usage. You may be better served by a dedicated time series database. Running with I haven't looked at #28487, but there may be something else going on there that we should investigate.
@petermattis I don't think so since the gap here is ~60 entries, though perhaps this is the failure mode I wasn't quite able to explain in that issue (where I saw a gap of 3). Right off the bat I'd say that this is more likely when a node didn't sync properly.
The raft implementation is arguably being too strict here. If a single node has its log regress like this, it's safe to just catch it back up with new log entries. The problem is that if you had this happen to several nodes at once, data could be lost (silently, since there wouldn't be a more up-to-date node to tell them about the data they're missing). At this point all bets are off because various invariants could have been violated.
Closing since this is likely caused by not syncing to disk. |
Is this a bug report or a feature request?
Bug
BUG REPORT
v2.1.0-beta.20180910
#28487 & #28487 (comment)
kv.raft_log.synchronize=false
The node which panicked to come back up again.
Pod crashes over and over again.
Node staying dead.
I guess the panic is legitimate but a smooth recovery would be great.
Is there another way to restore the node without wiping its data dir?
Edit: While decommissioning the node there are plenty of replicas assigned to the dead node. Decommissioning takes ages: