stability: raft panic on delta #9946
Upgraded to sha 9ef38d7, still failing.
#9841 was a discrepancy between the on-disk and in-memory state. Since the node is still failing with a new SHA, we've corrupted the on-disk Raft state for the range. That's worrisome. @mberhault Is it only one node that is crashing like this, or are there multiple? @bdarnell I'll try to do some debugging on this today, but could use assistance here.
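For context, here is a minimal, purely illustrative Go sketch of the kind of internal consistency check whose violation makes a Raft node panic at startup: the persisted commit index must not point past the last entry actually present in the on-disk log. None of this is the actual CockroachDB or etcd/raft code; `persistedRaftState` and `checkRaftInvariant` are made-up names. If such an invariant is broken on disk, the node keeps crashing on every restart regardless of which SHA it runs, which is consistent with what we're seeing.

```go
// Hypothetical sketch only: models a consistency check between pieces of
// persisted Raft state. Not the actual cockroach implementation.
package main

import "fmt"

// persistedRaftState models the subset of on-disk Raft state relevant here.
type persistedRaftState struct {
	Commit    uint64 // highest log index known to be committed (from the HardState)
	LastIndex uint64 // index of the last entry physically present in the log
}

// checkRaftInvariant returns an error when the on-disk state is internally
// inconsistent, i.e. the commit index refers to entries the log does not hold.
// A real Raft implementation treats this as unrecoverable and panics.
func checkRaftInvariant(s persistedRaftState) error {
	if s.Commit > s.LastIndex {
		return fmt.Errorf("corrupt raft state: commit %d is beyond last log index %d",
			s.Commit, s.LastIndex)
	}
	return nil
}

func main() {
	// A consistent state passes; a corrupted one (commit ahead of the log)
	// fails on every restart, matching a node that keeps crashing after upgrades.
	fmt.Println(checkRaftInvariant(persistedRaftState{Commit: 10, LastIndex: 12})) // <nil>
	fmt.Println(checkRaftInvariant(persistedRaftState{Commit: 15, LastIndex: 12})) // error
}
```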
It's the only one dying that way.
I'm looking, but there's not much help in the logs because it's restarted so much that we no longer have logs for the first failure. Backtrace appears to be failing; the following line appears before each failure:
I'm not sure whether that's an actual memory problem or not. From
The failing replica is
The other store on node 7 never appears as a member of this range, which appears to rule out any problems related to the two-stores-per-node configuration. But I don't understand why this replica has not only escaped garbage collection but has also been keeping up to date with membership changes that it shouldn't have known about.
The most recent copies of range descriptor 9577 are:
So there are three replicas that agree with each other and are four days ahead of the one that's crashing. We see a replica change transaction involving store 13 here:
The replica was added to store 13 at Thu Oct 13 09:43:03 2016, but the latest copy of the range descriptor on store 13 is from Thu Oct 13 07:58:24 2016. The replica was removed four days later (I think the four-day gap is just the time when the cluster was down). I think what happened is that something went wrong while store 13 was applying its preemptive snapshot, so the replica was non-functional after being added.

I don't see anything here that explains why nodes other than the one with the Raft state corruption are running out of memory and crashing, so that's the next thing to investigate. I think we will likely be able to recover this without wiping the whole cluster by wiping store 13 and letting the cluster repair itself. But we need to figure out why the other nodes are failing first.
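As an aside, the comparison above can be thought of as the following sketch: collect the timestamp of each store's copy of the range descriptor and flag any store whose copy lags far behind the newest one. This is hypothetical code, not part of cockroach, and the store IDs other than 13 are placeholders.

```go
// Illustrative sketch only: find stores whose copy of a range descriptor
// lags the newest copy, the way store 13's does in this range.
package main

import (
	"fmt"
	"time"
)

// staleDescriptorStores returns the stores whose range-descriptor copy is
// older than the newest copy by more than the given threshold.
func staleDescriptorStores(copies map[int]time.Time, threshold time.Duration) []int {
	var newest time.Time
	for _, ts := range copies {
		if ts.After(newest) {
			newest = ts
		}
	}
	var stale []int
	for store, ts := range copies {
		if newest.Sub(ts) > threshold {
			stale = append(stale, store)
		}
	}
	return stale
}

func main() {
	// Timestamps roughly mirroring the thread: three stores agree (placeholder
	// IDs 3, 5, 9), while store 13's descriptor is days behind.
	copies := map[int]time.Time{
		3:  time.Date(2016, 10, 17, 10, 0, 0, 0, time.UTC),
		5:  time.Date(2016, 10, 17, 10, 0, 0, 0, time.UTC),
		9:  time.Date(2016, 10, 17, 10, 0, 0, 0, time.UTC),
		13: time.Date(2016, 10, 13, 7, 58, 24, 0, time.UTC),
	}
	fmt.Println(staleDescriptorStores(copies, time.Hour)) // [13]
}
```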
Filed a separate issue for the memory usage (which looks unrelated): #10050
@bdarnell Anything else you think we should investigate here? There were a number of bugs fixed recently which might have been related.
This is probably #11591, no? EDIT: no, probably not, since the raft log is not empty.
sha: ce09bd8 (from Oct 6th)
Node 104.196.31.237 died with: