functional: simulate quorum disaster #9565

Merged 37 commits on Apr 12, 2018

gyuho (Contributor) commented on Apr 11, 2018

Addresses #9150.

// SIGQUIT_AND_REMOVE_QUORUM_AND_RESTORE_LEADER_SNAPSHOT_FROM_SCRATCH first
// stops a majority of the nodes and deletes the data directories on those
// quorum nodes, to make the whole cluster inoperable. Now that the quorum
// and its data are totally destroyed, the cluster cannot even remove the
// unavailable nodes (e.g. 2 out of 3 are lost, so no leader can be elected).
// Let's assume a 3-node cluster of nodes A, B, and C. One day, nodes A and B
// are destroyed and all their data is gone. The only viable solution is
// to recover from C's latest snapshot.
//
// To simulate:
//  1. Assume node C is the current leader with the most up-to-date data.
//  2. Download a snapshot from node C before destroying nodes A and B.
//  3. Destroy nodes A and B, making the whole cluster inoperable.
//  4. Now node C cannot operate either.
//  5. SIGTERM node C and remove its data directories.
//  6. Restore a new seed member from node C's latest snapshot file.
//  7. Add another member to establish a 2-node cluster.
//  8. Add another member to establish a 3-node cluster.
//  9. Add more members, if any.
//
// The expected behavior is that etcd successfully recovers from such a
// disastrous situation, where only 1 node survives out of a 3-node cluster:
// new members join the restored cluster, and the previous data from the
// snapshot is still preserved after the recovery process. As always, after
// recovery, each member must be able to process client requests.
Case_SIGQUIT_AND_REMOVE_QUORUM_AND_RESTORE_LEADER_SNAPSHOT_FROM_SCRATCH Case = 14

Can confirm SIGQUIT_AND_REMOVE_QUORUM_AND_RESTORE_LEADER_SNAPSHOT_FROM_SCRATCH passes when run individually.
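
To make the quorum reasoning in the comment above concrete, here is a minimal Go sketch of the majority arithmetic at each stage of the scenario. It is an illustration only, not code from this PR: `quorum`, `hasQuorum`, `step`, and `scenario` are hypothetical names, and the model assumes that restoring from a snapshot starts a fresh 1-member cluster and that each added member has already started before the next stage is evaluated.

```go
package main

import "fmt"

// quorum returns the smallest majority for a cluster of the given size.
// (Hypothetical helper for illustration; not part of the functional tester.)
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// hasQuorum reports whether the alive members form a majority of the cluster.
func hasQuorum(clusterSize, alive int) bool {
	return alive >= quorum(clusterSize)
}

// step is a hypothetical record of one stage of the disaster scenario.
type step struct {
	desc        string
	clusterSize int // number of configured members
	alive       int // number of members currently running
}

// scenario models the stages described in the comment, under the
// assumptions stated in the lead-in paragraph.
func scenario() []step {
	return []step{
		{"healthy 3-node cluster (A, B, C)", 3, 3},
		{"A and B destroyed, only C left", 3, 1},
		{"seed member restored from C's snapshot", 1, 1},
		{"second member added and started", 2, 2},
		{"third member added and started", 3, 3},
	}
}

func main() {
	for _, s := range scenario() {
		fmt.Printf("%-40s size=%d alive=%d quorum=%d operable=%t\n",
			s.desc, s.clusterSize, s.alive,
			quorum(s.clusterSize), hasQuorum(s.clusterSize, s.alive))
	}
}
```

The second stage is the one the test case targets: with 1 of 3 members alive, the majority of 2 is unreachable, so the surviving member cannot elect a leader or accept membership changes, and restoring from the snapshot is the only way forward.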

gyuho added 26 commits April 11, 2018 19:52
…UM_AND_RESTORE_SNAPSHOT_FROM_SCRATCH"

Signed-off-by: Gyuho Lee <[email protected]>
Later to add benchmark marks

Signed-off-by: Gyuho Lee <[email protected]>
gyuho force-pushed the quorum-disaster branch from a3b2c20 to 554dfaa on April 12, 2018 02:54
gyuho force-pushed the quorum-disaster branch from 751339e to f72449c on April 12, 2018 04:13
gyuho merged commit 70341b1 into etcd-io:master on Apr 12, 2018
gyuho deleted the quorum-disaster branch on April 12, 2018 04:47