stability: raft log panic on fresh cluster #11591
Both those tests exhibited no problems under 10min of stress (each).
I just wiped cyan (testing continuous-deployment with wipe).
Node
Crash was:
node is still down, and all continuous deployment is paused.
Can you stop the cluster?
Cluster is stopped and all alerts for
Here are some other log messages that reference
The stack trace for the crash is:
So we added a replica to a node and then very quickly decided to remove it. Shortly after the removal the replica received a heartbeat that we sent on through to Raft. This is surprising because we're not supposed to process Raft messages on destroyed replicas.
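For context, here's a minimal sketch of the check being described, with made-up names rather than the actual store code: a tombstone written when a replica is destroyed records the lowest replica ID still allowed, and incoming Raft messages addressed below that ID should be dropped rather than handed to Raft.

```go
package main

import "fmt"

// replicaTombstone is a simplified stand-in for the marker a store writes when
// it destroys a replica: nextReplicaID is the lowest replica ID that is still
// allowed to be created for that range on this store.
type replicaTombstone struct {
	nextReplicaID int
}

// shouldDropRaftMsg models the invariant above: once a replica has been
// destroyed, incoming Raft messages addressed to a replica ID below the
// tombstone's nextReplicaID must be dropped instead of handed to Raft.
func shouldDropRaftMsg(toReplicaID int, ts *replicaTombstone) bool {
	return ts != nil && toReplicaID < ts.nextReplicaID
}

func main() {
	ts := &replicaTombstone{nextReplicaID: 47} // replica 46 was just destroyed
	fmt.Println(shouldDropRaftMsg(46, ts))     // true: message for the dead replica is dropped
	fmt.Println(shouldDropRaftMsg(47, ts))     // false: a later incarnation is allowed through
}
```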
In addition to the Raft panic, we're seeing some sort of rebalancing thrashing going on:
So we've added a different replica for
I'm not sure if I'm interpreting the goroutine output correctly, but I'm suspicious that the replica GC queue is using a different
The panicking goroutine shows:
I'm pretty sure
Ok, looked at how to interpret goroutine stack output a bit more. The
All data/logs backed up on all instances to:
Debugging aid for cockroachdb#11591.
Re: rebalancer thrashing. I can see rebalancer badness using:
cyan restarted with sha 459d9c1
started stop/wipe/start on cyan, letting it run for ~20-30 minutes. no crashes so far.
New crash on cyan after 3 minutes: panic message:
full log:
Crap. My testing of
The goroutine stacks once again indicate that the
that was quick.
Full log:
This was the first reference to |
Rebalancer thrashing:
The above logs are from the leader of |
Here's the timeline I'm seeing:
An indication that this is happening is the
Cc @bdarnell
On Tue, Nov 29, 2016 at 3:36 PM, Peter Mattis wrote:

> n6: GC replica 46 for r22. Note that the tombstone will be for replica ID 46, not 47, because this node hasn't received the add replica notification yet.

Don't we do a consistent lookup of the range descriptor in the GC queue?
We do, but we don't pass that down when we destroy the replica.
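To make the race concrete, here's a toy sketch (hypothetical types, not the real replica GC queue code) of why a tombstone derived from the stale local descriptor doesn't stop the heartbeat that revives the replica, while one derived from the consistent lookup would:

```go
package main

import "fmt"

// tombstone is a toy model: nextReplicaID is the lowest replica ID the
// tombstone still permits on this store (assumed semantics).
type tombstone struct{ nextReplicaID int }

// destroyReplica models GC'ing a replica and writing a tombstone derived from
// whichever replica ID the caller passes down.
func destroyReplica(replicaID int) tombstone {
	return tombstone{nextReplicaID: replicaID + 1}
}

// blocks reports whether an incoming Raft message addressed to the given
// replica ID would be rejected by the tombstone.
func blocks(ts tombstone, toReplicaID int) bool {
	return toReplicaID < ts.nextReplicaID
}

func main() {
	const staleLocalID = 46 // what the node's own (stale) descriptor says
	const consistentID = 47 // what the GC queue's consistent lookup sees

	// Behavior described in the thread: the consistent descriptor decides
	// *whether* to GC, but only the stale local replica ID reaches the destroy path.
	ts := destroyReplica(staleLocalID)
	fmt.Println(blocks(ts, 47)) // false: a heartbeat for replica 47 can revive the replica

	// If the replica ID from the consistent lookup were passed down instead,
	// the tombstone would also reject that heartbeat.
	ts = destroyReplica(consistentID)
	fmt.Println(blocks(ts, 47)) // true
}
```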
I'll do the stop/wipe/start cycle on cyan once this is in.
Writing a test for this is proving more difficult than expected. In order for this panic to trigger, the heartbeat has to address a non-zero
On the other hand, in order for the replica to be "revivable", its range descriptor must be stale at the time of its removal. These goals seem at odds with each other. @petermattis @bdarnell?
Can the test manually construct a problematic heartbeat message and get a local store into a bad state where it is removed but with the wrong tombstone key set?
Manually constructing the heartbeat is not a problem. How do I create a non-empty raft log while keeping the descriptor stale? I'd rather not write a broken tombstone manually because then we're not testing anything (and the test will still fail after the fix).
I'm not sure.
Yes, definitely don't do that.
Tried blocking the application of the additive change replicas in hopes of producing this, but it seems that raftMu protects against removal during log replay. I'm at the end of my rope =/
This happens when something else gets committed to the range's raft log concurrently with the ChangeReplicas, so it appears in the log after the preemptive snapshot is started but before the ChangeReplicas transaction commits. The block writer puts enough data in the raft log that (with the new smaller message sizes since #10929) the post-add-replica log replay takes multiple messages, and that's what gives you a window in which the descriptor is stale but we have processed some of the log. In a test, you could tune the constants in #10929 so low that you don't need anything else happening concurrently; the first part of the ChangeReplicas(add) transaction would go into its own message. Then you could mess with the raft transport to allow that message through but drop any MsgApp containing an EndTransaction. Then you can remove the replica (while still blocking EndTransactions at the raft level), trigger the replica GC, and send the raft message that would revive it.
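Roughly what that transport filter could look like, as a self-contained sketch with simplified stand-in types (the real test would hook the store's Raft transport and inspect the decoded commands, not match strings):

```go
package main

import (
	"fmt"
	"strings"
)

type raftMsgType int

const (
	msgHeartbeat raftMsgType = iota
	msgApp
)

// raftMessage is a simplified stand-in; real entries carry encoded commands.
type raftMessage struct {
	typ     raftMsgType
	entries []string
}

// dropEndTxnApps returns a send-side filter: deliver everything except MsgApp
// messages carrying an EndTransaction, which keeps the descriptor stale while
// earlier parts of the log still get applied.
func dropEndTxnApps() func(raftMessage) bool {
	return func(m raftMessage) bool {
		if m.typ != msgApp {
			return true // heartbeats etc. still flow
		}
		for _, e := range m.entries {
			if strings.Contains(e, "EndTransaction") {
				return false // drop: the ChangeReplicas commit is withheld
			}
		}
		return true
	}
}

func main() {
	filter := dropEndTxnApps()
	fmt.Println(filter(raftMessage{typ: msgApp, entries: []string{"ChangeReplicas(add) begin"}})) // true: delivered
	fmt.Println(filter(raftMessage{typ: msgApp, entries: []string{"EndTransaction"}}))            // false: dropped
	fmt.Println(filter(raftMessage{typ: msgHeartbeat}))                                           // true: heartbeats still flow
}
```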
Thanks for the advice! I got this working, will send a PR shortly.
The replica pointer tag was used for debugging cockroachdb#11591, which has been long since fixed.
For posterity, this test was added in #11798.
cluster: cobalt (new cluster for continuous-deployment from empty clusters)
sha: b5a19ba
Cluster started from scratch at 161124-08:26:19 (well, this is the third node, but all started within 30s).
The third node (40.117.230.120) crashed after 5 minutes with:
This occurred about 100s after starting the block-writer on that node:
Full stderr from this node:
cockroach.stderr.txt
All other nodes are still up and running and processing SQL requests.
I'll leave this node down for now and silence the alert. Continuous deployment has not yet been enabled.