stability: crash on uninitialized replica #14193
CC #14167
@a-robinson: omega is running with
Ack, thanks for the ping. Also cc @BramGruneir, since this looks potentially related to some of the problems seen since merging #13973
Also @tamird, who originally wrote the check that's assuming the replica is initialized here, but it certainly looks like the issue is with the replica being removed twice, which is likely related to @BramGruneir's recent changes.
Looks like a recurrence of an ancient unfixed bug:
#12574
Also consider that this could have been a lurking bug that was only revealed by recent changes.
backed up logs go
Removing an uninitialized replica will always result in the
@bdarnell Clearly the range wasn't merged because we don't have merging turned on. How else could we be in this code path?
This appears to be an existing race between replica creation and replica deletion that's getting made much more likely by bad behavior in the replicate queue and allocator. I haven't yet confidently determined the logic flaw allowing the race, but the bad behavior in the replicate queue and allocator is quite clear. This is what was happening on node 6, the leaseholder for r3363, that caused node 3 to hit the race:

Leaseholder n6 decides to rebalance away from itself [1]

remove candidates: [

[2] ShouldTransferLease source (lease-holder=6):

There are a few pieces of bad behavior here.
In short, #13973 has made replica thrashing more likely by enabling things to get put back in the replicate queue more often. I think we may want to hold off on the release until this bad behavior and/or the underlying race is fixed.
Agreed. Should we revert #13973 while we work on fixing this? And if we do, do we want to go ahead with the release or wait until next week?
@petermattis What's happening on node 3 looks something like:
Note that the second preemptive snapshot started streaming from n6 at 22:00:04.918343 and finished streaming at 22:00:05.334711, yet there's nothing in n3's logs about n3 ever starting to apply it. n3 was working on removing its old keys until 22:00:05.607439, and (presumably as soon as it could get the relevant mutex) noticed the request to GC its new replica at 22:00:05.609776, still before it could start applying the preemptive snapshot. Somehow the remove-replica logic got triggered at 22:00:05.658301 without the snapshot getting applied in the middle, leading to the crash.
@bdarnell Reverting #13973 and releasing without it would be safe. I think we can continue pulling on the two threads separately without needing that to still be in master. @BramGruneir, can you pull on the replicate queue / allocator behavior thread?
And when I say "safe", I mean as safe as any of our past releases.
The test failure in https://teamcity.cockroachdb.com/viewLog.html?buildId=172955&tab=buildLog#_ has the same pattern - add replica on node, remove replica on node, add replica on node, remove replica on node. All in quick succession. There has to be something about that pattern.
For the thrashing, specifically the add and then subsequent removal on node 3, twice... this shouldn't happen, even before #14106. Since this is using the older code, it has to first pass the shouldRebalance function, which should reject just this type of thrashing. Regardless, it seems very suspect. Do you happen to have the logs with the remove candidates and rebalance candidates for either of those quick add-then-remove cases? Reverting #13973 is fine for now, for the beta at least, but to be clear, it's just patching over the other, deeper issue.
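As a rough illustration of the kind of rejection shouldRebalance is expected to provide (a sketch with hypothetical names and thresholds, not the actual allocator code), a rebalance decision can require the candidate store to be meaningfully better than the current one, so that nearly-equal stores don't trade the same replica back and forth:

```go
package allocator

// StoreScore is a simplified stand-in for whatever metric the allocator
// ranks stores by (e.g. range count); in this sketch, higher is worse.
type StoreScore struct {
	StoreID int
	Score   float64
}

// rebalanceThreshold is a hypothetical hysteresis margin: a candidate must
// beat the current store by at least this much before we move a replica.
// Without some margin, two stores with nearly equal scores can trade the
// same replica back and forth (add on A, remove from B, add on B, ...).
const rebalanceThreshold = 0.05

// shouldRebalance reports whether moving a replica from current to the best
// candidate is worthwhile. It is deliberately conservative: if the best
// candidate is not clearly better, it declines, which is the kind of
// thrashing rejection referred to above.
func shouldRebalance(current StoreScore, candidates []StoreScore) (StoreScore, bool) {
	best := current
	for _, c := range candidates {
		if c.Score < best.Score {
			best = c
		}
	}
	if best.StoreID == current.StoreID {
		return StoreScore{}, false
	}
	// Require a meaningful improvement, not a marginally lower score.
	if current.Score-best.Score < rebalanceThreshold {
		return StoreScore{}, false
	}
	return best, true
}
```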
The logs with all the rebalancing decisions can be found in |
Ok, I was able to reproduce the failure by stressing
Replica 38 got put in the GC queue, but before replica 38 could be removed, a preemptive streaming snapshot was sent for the range, and that snapshot snuck through the logic in
I won't have time to fix this today, but there are a couple potential ways that we could do so. And for historical reference, @tschottdorf expressed some concern about parts of this logic in the past - #7830.
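To make the interleaving above easier to picture, here is a minimal sketch of the shape of the race (hypothetical types and locking, not the actual pkg/storage code): a GC decision taken against an old replica is carried out after a preemptive snapshot has already installed newer state for the same range ID, because nothing re-validates the decision under the lock:

```go
package storagerace

import "sync"

// replica is a hypothetical stand-in for the per-range state held by a store.
type replica struct {
	rangeID   int64
	replicaID int64
}

// store keys replicas by range ID alone, mirroring the observation later in
// this thread that the queues also operate on range IDs rather than on a
// specific Replica pointer or replica ID.
type store struct {
	mu       sync.Mutex
	replicas map[int64]*replica
}

// applyPreemptiveSnapshot installs state for a range this store is about to
// be (re-)added to. In the reproduced failure, this happens after the old
// replica has already been queued for GC.
func (s *store) applyPreemptiveSnapshot(rangeID int64) *replica {
	s.mu.Lock()
	defer s.mu.Unlock()
	r := &replica{rangeID: rangeID}
	s.replicas[rangeID] = r
	return r
}

// gcReplica carries out a GC decision that was made earlier, against an old
// replica. Because the decision is keyed only by range ID and is not
// re-checked here, it tears down whatever replica currently exists for the
// range, including one freshly created by a preemptive snapshot.
func (s *store) gcReplica(rangeID int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.replicas, rangeID)
}
```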
And if anyone wants to check out the logs from the repro failure that helped me understand this, I ran
against commit 150b52d and got the output in the file below. The real key was pumping up the verbosity so that raft's info logs got printed.
Nice debugging, @a-robinson.
Yes. Nice work!
* Add more detail to Fatal log message hit by cockroachdb#14193
* Improve nasty indentation in `processRaftRequest` by moving the error check above the normal code path
* Fix `removePlaceholderLocked` method comment
But what is the damage? All of this looks fine to me. The preemptive snapshot was applied successfully, but preemptive snapshots are handled entirely through our code and don't (shouldn't) communicate anything back to the raft leader's view of the new replica's state. The leader wouldn't believe that the snapshot had been applied successfully unless there had been another message after the snapshot, which would have informed this node that its new replica ID is 39 (and would have in turn prevented the replica GC, barring some synchronization failure).

Deleting a replica in between a preemptive snapshot and its first real use is unfortunate (it causes wasted work since we have to send a raft snapshot afterwards), but it's supposed to all work. Maybe there's an interaction with heartbeat coalescing that is making the leader think the new replica is alive when that replica doesn't know it yet?
Thanks for correcting me there. I might have other faulty assumptions as well, so let me take another close look at things and we can chat later this afternoon.
I'm not seeing
For the first point, you make it sound like it's unexpected for a leader to send a request to a follower that isn't alive yet. What's the expected ordering of events there? It seems possible that the successful sending of a preemptive snapshot is being used as a justification for assuming that the replica exists and is ready to receive traffic.
When you say "uninitialized", do you mean "doesn't know its replica ID" or "doesn't know its key range"? (
It is expected that the leader will send requests to followers that don't know their replica ID yet (and may not even exist, which could indicate that either a message was delayed until after a replica GC or an early GC destroyed a replica that had been created by a preemptive snapshot). Messages sent by raft (i.e. everything but preemptive snapshots and coalesced heartbeats) always include the replica ID.
The normal order is:
So in order to get this panic, the GC must happen sometime after step 4.
I don't see any way for this to happen because we don't even allocate a new replica ID until after the snapshot has been sent.
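As a hedged illustration of the replica-ID convention described above (hypothetical message type, not the real protocol definitions): only a preemptive snapshot may arrive without a replica ID, and the first properly addressed raft message is what teaches a replica its identity, which is also why a replica created by a preemptive snapshot alone remains eligible for GC.

```go
package raftrecv

import "fmt"

// incomingMessage is a hypothetical envelope for traffic between stores. A
// zero ToReplicaID stands in for "the recipient has not been assigned a
// replica ID yet", which in this discussion corresponds to a preemptive
// snapshot; everything sent by raft itself carries a real replica ID.
type incomingMessage struct {
	RangeID     int64
	ToReplicaID int64
	Snapshot    bool
}

// handleMessage captures the ordering consequence discussed above: a replica
// created only by a preemptive snapshot still does not know its replica ID,
// and only a later addressed message informs it.
func handleMessage(knownReplicaID *int64, m incomingMessage) error {
	if m.ToReplicaID == 0 {
		if !m.Snapshot {
			return fmt.Errorf("r%d: only preemptive snapshots may omit a replica ID", m.RangeID)
		}
		// Apply the snapshot data, but leave *knownReplicaID at zero: nothing
		// about a preemptive snapshot tells this node its eventual replica ID.
		return nil
	}
	if *knownReplicaID == 0 {
		// First addressed raft message: learn our replica ID from it.
		*knownReplicaID = m.ToReplicaID
	}
	return nil
}
```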
The only caller of |
I mean the key range, i.e. the same thing as
It only ever enqueues initialized replicas, but it only enqueues them by their range ID, not the Replica pointer or replica ID, so it is conceivable that a copy of the old replica could have made it through these checks and caused the new replica to get GC'ed.
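One way to picture the hazard in that last sentence (again a hedged sketch with hypothetical names, not a proposed patch): because the queue only remembers a range ID, any decision made at enqueue time needs to be re-validated against the replica's current identity at processing time, or a stale judgment about the old replica can be applied to its replacement.

```go
package gcqueue

// queueItem is what a range-ID-keyed queue can remember about its work. The
// replicaIDAtEnqueue field is hypothetical: recording it is what would allow
// the processor to notice that the replica has since been replaced.
type queueItem struct {
	rangeID            int64
	replicaIDAtEnqueue int64
}

// shouldStillGC re-validates a queued GC decision at processing time against
// the replica the store currently holds for that range (zero means none). If
// the replica has been replaced since the item was enqueued, the stale
// decision must not be applied to the new replica.
func shouldStillGC(currentReplicaIDs map[int64]int64, it queueItem) bool {
	cur := currentReplicaIDs[it.rangeID]
	return cur != 0 && cur == it.replicaIDAtEnqueue
}
```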
As for the raft panic due to an existing commit index of 0 (#14231), I'm missing where the protection that prevents an incoming raft request from causing a panic is meant to be. This appears to be the entire code path from an incoming raft request, with an uninitialized replica and raft group being created along the way as needed.
Are we really just relying on raft messages to be delivered in order and only when a replica exists? I have to be missing something.
Going back to the replica GC
There are still some breadcrumbs I need to track down (or have explained to me) about the raft panic, but I'm pretty sure this settles the
Motivated by experience debugging cockroachdb#14193
No, we don't rely on in-order delivery, and it is possible for a raft message to be delivered when no replica exists (in which case it will be created by that message). We rely on two properties:
cockroach-omega-0003 was just upgraded to beta candidate 10c2544. Crashed after ~9 minutes: stack trace and log entries pertaining to r3363:
The code in question is: pkg/storage/store.go:
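The snippet itself isn't quoted above; as a very rough, hypothetical illustration of the kind of assertion the stack trace points at (not the actual store.go source), a removal path that assumes it is only ever handed initialized replicas will take the process down when that assumption is violated:

```go
package storecheck

import "log"

// replica is a hypothetical stand-in; a replica counts as initialized once it
// knows its key range (for example, after applying a snapshot).
type replica struct {
	rangeID     int64
	initialized bool
}

// removeReplica sketches a removal path that insists on an initialized
// replica. If a replica that was created on demand by incoming raft traffic
// (and never initialized) reaches this point, the process fatals, which is
// the shape of the crash this issue reports.
func removeReplica(r *replica) {
	if !r.initialized {
		log.Fatalf("r%d: cannot remove uninitialized replica", r.rangeID)
	}
	// ... cleanup of on-disk and in-memory state would follow here ...
}
```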