Issue 60: Data protection / High availability #92

Merged
merged 34 commits into master from issue-60-data-protection-ha
Dec 20, 2018

Conversation

@adrianmo adrianmo (Contributor) commented Nov 20, 2018

Change log description

The operator should ensure that pods are deployed in a reasonable way from a data protection and high availability perspective.

Pod protection

Kubernetes has the ability to safely evict all of the pods from a node before an admin performs maintenance on it. The operator should prevent such an eviction from happening if it would cause disruption in a Pravega cluster. This can be achieved with Pod Disruption Budgets (PDBs).

With these changes, the operator will create a PDB for each component type (controller, segmentstore, bookie) with the following PDB configuration (see the sketch after the list).

  • Controller: allow planned disruptions as long as there is at least one Controller pod available.
  • Segment Store: allow 1 planned disruption if the number of replicas is greater than 1; otherwise, no planned disruption is allowed.
  • Bookkeeper: allow 1 planned disruption at a time.
    • The minimum number of Bookies is enforced to 3.
    • The quorum has been brought down to 2 (ref). This allows one Bookie disruption even in the minimum configuration.
    • AutoRecovery is set to true by default to automatically recover under-replicated ledgers when there is a pod disruption (planned or unplanned).
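
For illustration, the Bookkeeper PDB described above could be built roughly as in the following sketch. It uses the k8s.io/api policy/v1beta1 and apimachinery types available at the time; the function name, the PDB name, and the "component" label key are illustrative assumptions, not necessarily the operator's actual identifiers.

package pravega

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch: a PDB that allows at most one Bookie to be disrupted at a time.
// Name and labels below are illustrative, not the operator's actual values.
func makeBookiePDB(clusterName string) *policyv1beta1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name: clusterName + "-bookie",
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"component": clusterName + "-bookie"},
			},
		},
	}
}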

Pod placement

To achieve higher performance and fault tolerance, instances of the same type should be spread across different nodes. For example, assuming that we have a 3-node Kubernetes cluster and we are attempting to create 3 Bookkeeper instances, the first Bookie will be placed on any of the 3 nodes (e.g. Node 1), the second one will preferably be placed on a different node (e.g. Node 2 or Node 3), and the third Bookie will be placed on the only node that does not yet contain a Bookie. If a fourth Bookie were to be scheduled, it would be placed on any of the nodes, because all of them already have one Bookie.

This has been implemented in this PR with Pod Anti-Affinity, using the preferredDuringSchedulingIgnoredDuringExecution soft requirement, because there is no hard functional requirement to place components of the same kind on different nodes.
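
As a reference for what such a soft anti-affinity rule looks like, here is a minimal sketch using the k8s.io/api core/v1 types. The "component" label key and the weight are illustrative assumptions; the operator's actual selector may differ.

package pravega

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch: prefer to spread pods of the same component across nodes without
// making it a hard requirement, so single-node clusters (e.g. minikube) still work.
func podAntiAffinity(component string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{
				{
					Weight: 100, // illustrative weight
					PodAffinityTerm: corev1.PodAffinityTerm{
						LabelSelector: &metav1.LabelSelector{
							MatchLabels: map[string]string{"component": component}, // illustrative label key
						},
						TopologyKey: "kubernetes.io/hostname",
					},
				},
			},
		},
	}
}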

Purpose of the change

How to test the change

Assuming you already have access to a Kubernetes environment (GKE, PKS...) with at least 3 nodes, follow the instructions in the README file to deploy ZooKeeper, the NFS server provisioner, and the Pravega Tier 2 PVC. Skip the deployment of the Pravega operator.

Check out this branch.

git checkout issue-60-data-protection-ha

Update the Pravega operator image to use an image built with the PR changes. Open the deploy/operator.yaml file and replace the container image pravega/pravega-operator:latest with adrianmo/pravega-operator:issue-60.

Deploy the operator.

kubectl create -f deploy/operator.yaml

Follow the instructions in the README file to deploy a Pravega cluster with 3 Bookies, 1 or more Controllers, and 3 Segment stores.

Run the following command to verify the correct placement of pods.

$ kubectl get po -o wide
NAME                                         READY     STATUS    RESTARTS   AGE       IP           NODE
...
pravega-bookie-0                             1/1       Running   0          35s       10.56.6.43   gke-adrian-cluster-1-default-pool-6e9f73a5-7348
pravega-bookie-1                             1/1       Running   0          35s       10.56.1.29   gke-adrian-cluster-1-default-pool-0616952c-kd4z
pravega-bookie-2                             1/1       Running   0          35s       10.56.9.42   gke-adrian-cluster-1-default-pool-d0ef8647-f01m
pravega-operator-7dbd5bb757-wzp7z            1/1       Running   0          1h        10.56.9.38   gke-adrian-cluster-1-default-pool-d0ef8647-f01m
pravega-pravega-controller-58b4dcf4f-69jpv   1/1       Running   0          35s       10.56.1.28   gke-adrian-cluster-1-default-pool-0616952c-kd4z
pravega-segmentstore-0                       1/1       Running   0          35s       10.56.3.21   gke-adrian-cluster-1-default-pool-0616952c-n948
pravega-segmentstore-1                       1/1       Running   0          35s       10.56.9.41   gke-adrian-cluster-1-default-pool-d0ef8647-f01m
pravega-segmentstore-2                       1/1       Running   0          35s       10.56.6.42   gke-adrian-cluster-1-default-pool-6e9f73a5-7348
...

Verify that pods of the same kind are scheduled to different nodes.

Check that pod disruption budgets are correctly set.

$ kubectl get pdb
NAME                         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
pravega-bookie               N/A             1                 1                     6m
pravega-pravega-controller   1               N/A               0                     6m
pravega-segmentstore         N/A             1                 1                     6m

In this example, we allow one pod disruption for the Segment Store and Bookkeeper, but no disruptions for the Controller, since there is only one Controller pod.

If we try to drain the node that contains the Controller instance, we should expect the drain command to abort.

$ kubectl drain gke-adrian-cluster-1-default-pool-0616952c-kd4z --ignore-daemonsets
node "gke-adrian-cluster-1-default-pool-0616952c-kd4z" cordoned                                                                                                 
error: unable to drain node "gke-adrian-cluster-1-default-pool-0616952c-kd4z", aborting command...
$ kubectl get nodes
NAME                                              STATUS                     ROLES     AGE       VERSION
...
gke-adrian-cluster-1-default-pool-0616952c-kd4z   Ready,SchedulingDisabled   <none>    4d        v1.9.7-gke.11
...

The node will continue to be part of the cluster, but it will not host any new pods. The Controller instance will remain untouched, and the Pravega admin will need to manually increase the number of replicas and/or delete the affected pod so that it is automatically rescheduled to a different node.

@adrianmo adrianmo self-assigned this Nov 20, 2018
@fpj fpj left a comment


Is there a way to unit test this operator work?

pkg/pravega/pravega_controller.go (outdated review comments, resolved)
pkg/pravega/pravega_segmentstore.go (outdated review comments, resolved)
pkg/pravega/bookie.go (outdated review comments, resolved)

@shrids shrids left a comment


A couple of questions.

pkg/pravega/bookie.go (two outdated review comment threads, resolved)
@adrianmo adrianmo added the status/blocked Issue or PR is blocked on another item; add reference in a comment label Nov 27, 2018

@shrids shrids left a comment


Multiple bookies should not be brought down at the same time.
(and auto recovery should be ON by default for bookies)

@adrianmo adrianmo added status/work-in-progress PR work is in progress; do not merge and removed status/blocked Issue or PR is blocked on another item; add reference in a comment labels Dec 5, 2018

adrianmo commented Dec 5, 2018

Need to rebase and fix conflicts.


adrianmo commented Dec 5, 2018

Blocked on #89 that brings the ability to set default values.

@EronWright

@adrianmo we also need to be able to disable the anti-affinity features for dev deployments, e.g. minikube. I suppose that one option would be that the features would be disabled if replicas < 3 for a given component. WDYT?


adrianmo commented Dec 7, 2018

@EronWright the anti-affinity rules used in this PR are soft requirements (PreferredDuringSchedulingIgnoredDuringExecution), meaning that if there's only one node (e.g. minikube), all pods will run on that node.


adrianmo commented Dec 7, 2018

Need to bring down the BK replication factor to 2 as it was done in pravega/pravega#3158

* master:
  Issue 78: Clean up persistent volumes when deleting Pravega Cluster (#103)
  Issue 97: Update to operator SDK v0.2.0 (#105)
  Ability to scale Pravega Controller (#99)
  Issue 61: Support external connectivity (#77)
  Issue 95: Updated README.md (#95)
Signed-off-by: Adrián Moreno <[email protected]>
* master:
  Issue 88: Set default values when not specified (#89)
Signed-off-by: Adrián Moreno <[email protected]>
@adrianmo adrianmo added status/ready The issue is ready to be worked on; or the PR is ready to review and removed status/work-in-progress PR work is in progress; do not merge labels Dec 12, 2018
@adrianmo

@fpj @shrids The PR is ready to review again.

Recent changes:


@shrids shrids left a comment


The changes look good.
I have a query regarding the Controller's PDB.

pkg/controller/pravega/pravega_controller.go (outdated review comment, resolved)

@shrids shrids left a comment


The changes look good; no additional comments from my end apart from a query related to the Controller's PDB.

@adrianmo

Updated the PR and description to address @shrids' comments. Changed the Controller PDB to minAvailable=1.
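
For context, a Controller PDB with minAvailable=1 amounts to something like the following sketch (policy/v1beta1 types; the function name, PDB name, and labels are illustrative, not necessarily the operator's actual values):

package pravega

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch: keep at least one Controller pod available during planned disruptions.
func makeControllerPDB(clusterName string) *policyv1beta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name: clusterName + "-pravega-controller", // illustrative name
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"component": clusterName + "-pravega-controller"}, // illustrative label
			},
		},
	}
}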

@adrianmo adrianmo requested a review from Tristan1900 December 12, 2018 17:22

fpj commented Dec 17, 2018

I wanted to capture an offline discussion with @EronWright. BK automatically re-replicates ledger fragments (when the feature is enabled). Even if we have a disruption budget such that we have a single disruption at a time, there is no guarantee that the data will be re-replicated fast enough. As such, we could end up in a situation in which we get rid of too many copies of the data and can't recover it.

Ideally, we wait for the re-replication to finish before causing another disruption, but I don't think we have an API for that. If we are replacing bookies, then we could consider keeping the volumes of the decommissioned bookies, but that's confusing in the general case.

* master:
  Issue 96: Startup and healthcheck improvements (#102)

Signed-off-by: Adrián Moreno <[email protected]>
@adrianmo adrianmo force-pushed the issue-60-data-protection-ha branch from 9886223 to 181112b Compare December 20, 2018 11:20
@fpj fpj added status/ready The issue is ready to be worked on; or the PR is ready to review and removed status/work-in-progress PR work is in progress; do not merge labels Dec 20, 2018
var maxUnavailable intstr.IntOrString

if pravegaCluster.Spec.Pravega.SegmentStoreReplicas == int32(1) {
	maxUnavailable = intstr.FromInt(0)

I'm wondering if we should remove this if clause. If there is a single instance, then in principle we never want it down. At the same time, it is not possible to guarantee that. Since the PDB policy is a soft policy, the pod will be brought down anyway if k8s needs to do it, so we might as well remove the if and leave it at:

maxUnavailable = intstr.FromInt(1)

@adrianmo (Contributor Author)

PDB protects pods from planned disruptions, such as a node eviction. In that case, if we have 1 Segment Store replica and maxUnavailable=1, Kubernetes will first kill the pod and then start a new one on a different node, causing a temporary disruption. Whereas if maxUnavailable=0, Kubernetes will put the node eviction on hold, giving the user time to increase the replica count and create a new pod on a different node, and then delete the affected pod to resume the node eviction, resulting in zero downtime.


I don't understand this comment:

if maxUnavailable=0, Kubernetes will put the node eviction on hold, giving the user time to increase the replica count and create a new pod on a different node, and then delete the affected pod to resume the node eviction, resulting in zero downtime.

How does the user know it is being given time to increase the replica count? Also, is it saying that we can have multiple instances temporarily? That might be worse than having none temporarily and then one back, because having multiple temporarily will induce two rebalances of the segment containers.

@adrianmo (Contributor Author)

How does the user know it is being given time to increase the replica count?

Due to the PDB, Kubernetes will put the eviction on hold until the affected pod is deleted. It's up to the user to decide what to do: they can increase the replica count to avoid downtime, or they can just delete the pod altogether and let Kubernetes reschedule it.

Also, is it saying that we can have multiple instances temporarily?

Yes, but not necessarily; the user can just delete the pod and avoid having multiple instances temporarily.


I'm confused about how the user knows that a pod needs to be deleted. I'm assuming that the eviction is part of some automated process and K8s is applying a policy to decide how to bring specific pods down.
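
To summarize the behavior discussed in this thread: with a single Segment Store replica the PDB blocks planned disruptions entirely, otherwise one disruption at a time is allowed. A minimal sketch of that logic, with an illustrative function name:

package pravega

import (
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch of the logic discussed above: block planned disruptions entirely when
// there is a single Segment Store replica, otherwise allow one at a time.
func segmentStoreMaxUnavailable(replicas int32) intstr.IntOrString {
	if replicas == 1 {
		// The eviction stays on hold; the admin can scale up and/or delete the pod manually.
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}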

@@ -141,6 +143,9 @@ func makeSegmentstorePodSpec(pravegaCluster *api.PravegaCluster) corev1.PodSpec
func MakeSegmentstoreConfigMap(pravegaCluster *api.PravegaCluster) *corev1.ConfigMap {
	javaOpts := []string{
		"-Dpravegaservice.clusterName=" + pravegaCluster.Name,
		"-Dbookkeeper.bkEnsembleSize=2",

We should remove these lines, and probably give a way to the user to set these values.

@adrianmo adrianmo (Contributor Author) Dec 20, 2018

Good catch. I'll remove those lines, as the default value is already set to 2; this way we will allow users to override the defaults using the options section of the CR.

...
  pravega:
    options:
      bookkeeper.bkEnsembleSize: "2"
      bookkeeper.bkAckQuorumSize: "2"
      bookkeeper.bkWriteQuorumSize: "2"
...


adrianmo commented Dec 20, 2018

Regarding @fpj's comment in #92 (comment), I've created issue #114 to investigate BookKeeper's disruption cases.

@fpj fpj merged commit 9990154 into master Dec 20, 2018
@adrianmo adrianmo deleted the issue-60-data-protection-ha branch December 20, 2018 17:23