Issue 60: Data protection / High availability #92

Merged
merged 34 commits into master from issue-60-data-protection-ha
Dec 20, 2018

Conversation

@adrianmo adrianmo (Contributor) commented Nov 20, 2018

Change log description

The operator should ensure that pods are deployed in a reasonable way from a data protection and high availability perspective.

Pod protection

Kubernetes has the ability to safely evict all of the pods from a node before an admin performs maintenance on it. The operator should prevent such an eviction from happening if it would cause disruption in a Pravega cluster. This can be achieved with Pod Disruption Budgets (PDBs).

With these changes, the operator will create a PDB for each component type (controller, segmentstore, bookie) with the following PDB configuration (see the sketch after the list).

  • Controller: allow planned disruptions as long as there is at least one Controller pod available.
  • Segment Store: allow 1 planned disruption if the number of replicas is greater than 1; otherwise, no planned disruption is allowed.
  • Bookkeeper: allow 1 planned disruption at a time.
    • The minimum number of Bookies is enforced to 3.
    • The quorum has been brought down to 2 (ref). This allows one Bookie disruption even in the minimum configuration.
    • AutoRecovery is set to true by default to automatically recover under-replicated ledgers when there is a pod disruption (planned or unplanned).
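
For illustration, the Bookkeeper PDB described above could be built roughly as in the following sketch. It uses the k8s.io/api policy/v1beta1 and apimachinery types available at the time; the function name, the PDB name, and the "component" label key are illustrative assumptions, not necessarily the operator's actual identifiers.

package pravega

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch: a PDB that allows at most one Bookie to be disrupted at a time.
// Name and labels below are illustrative, not the operator's actual values.
func makeBookiePDB(clusterName string) *policyv1beta1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name: clusterName + "-bookie",
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"component": clusterName + "-bookie"},
			},
		},
	}
}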

Pod placement

To achieve higher performance and fault tolerance, instances of the same type should be spread across different nodes. For example, assuming that we have a 3-node Kubernetes cluster and we are attempting to create 3 Bookkeeper instances, the first Bookie will be placed on any of the 3 nodes (e.g. Node 1), the second one will preferably be placed on a different node (e.g. Node 2 or Node 3), and the third Bookie will be placed on the only node that does not yet contain a Bookie. If a fourth Bookie were to be scheduled, it would be placed on any of the nodes, because all of them already have one Bookie.

This has been implemented in this PR with Pod Anti-Affinity, using the preferredDuringSchedulingIgnoredDuringExecution soft requirement, because there is no hard functional requirement to place components of the same kind on different nodes.
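
As a reference for what such a soft anti-affinity rule looks like, here is a minimal sketch using the k8s.io/api core/v1 types. The "component" label key and the weight are illustrative assumptions; the operator's actual selector may differ.

package pravega

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch: prefer to spread pods of the same component across nodes without
// making it a hard requirement, so single-node clusters (e.g. minikube) still work.
func podAntiAffinity(component string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{
				{
					Weight: 100, // illustrative weight
					PodAffinityTerm: corev1.PodAffinityTerm{
						LabelSelector: &metav1.LabelSelector{
							MatchLabels: map[string]string{"component": component}, // illustrative label key
						},
						TopologyKey: "kubernetes.io/hostname",
					},
				},
			},
		},
	}
}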

Purpose of the change

How to test the change

Assuming you already have access to a Kubernetes environment (GKE, PKS...) with at least 3 nodes, follow the instructions in the README file to deploy ZooKeeper, the NFS server provisioner, and the Pravega Tier 2 PVC. Skip the deployment of the Pravega operator.

Check out this branch.

git checkout issue-60-data-protection-ha

Update the Pravega operator image to use an image built with the PR changes. Open the deploy/operator.yaml file and replace the container image pravega/pravega-operator:latest with adrianmo/pravega-operator:issue-60.

Deploy the operator.

kubectl create -f deploy/operator.yaml

Follow the instructions in the README file to deploy a Pravega cluster with 3 Bookies, 1 or more Controllers, and 3 Segment stores.

Run the following command to verify the correct placement of pods.

$ kubectl get po -o wide
NAME                                         READY     STATUS    RESTARTS   AGE       IP           NODE
...
pravega-bookie-0                             1/1       Running   0          35s       10.56.6.43   gke-adrian-cluster-1-default-pool-6e9f73a5-7348
pravega-bookie-1                             1/1       Running   0          35s       10.56.1.29   gke-adrian-cluster-1-default-pool-0616952c-kd4z
pravega-bookie-2                             1/1       Running   0          35s       10.56.9.42   gke-adrian-cluster-1-default-pool-d0ef8647-f01m
pravega-operator-7dbd5bb757-wzp7z            1/1       Running   0          1h        10.56.9.38   gke-adrian-cluster-1-default-pool-d0ef8647-f01m
pravega-pravega-controller-58b4dcf4f-69jpv   1/1       Running   0          35s       10.56.1.28   gke-adrian-cluster-1-default-pool-0616952c-kd4z
pravega-segmentstore-0                       1/1       Running   0          35s       10.56.3.21   gke-adrian-cluster-1-default-pool-0616952c-n948
pravega-segmentstore-1                       1/1       Running   0          35s       10.56.9.41   gke-adrian-cluster-1-default-pool-d0ef8647-f01m
pravega-segmentstore-2                       1/1       Running   0          35s       10.56.6.42   gke-adrian-cluster-1-default-pool-6e9f73a5-7348
...

Verify that pods of the same kind are scheduled to different nodes.

Check that pod disruption budgets are correctly set.

$ kubectl get pdb
NAME                         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
pravega-bookie               N/A             1                 1                     6m
pravega-pravega-controller   1               N/A               0                     6m
pravega-segmentstore         N/A             1                 1                     6m

In this example, we allow one pod disruption for the Segment Store and Bookkeeper, but no disruptions for the Controller, since there is only one Controller pod.

If we try to drain the node that contains the Controller instance, we should expect the drain command to abort.

$ kubectl drain gke-adrian-cluster-1-default-pool-0616952c-kd4z --ignore-daemonsets
node "gke-adrian-cluster-1-default-pool-0616952c-kd4z" cordoned                                                                                                 
error: unable to drain node "gke-adrian-cluster-1-default-pool-0616952c-kd4z", aborting command...
$ kubectl get nodes
NAME                                              STATUS                     ROLES     AGE       VERSION
...
gke-adrian-cluster-1-default-pool-0616952c-kd4z   Ready,SchedulingDisabled   <none>    4d        v1.9.7-gke.11
...

The node will continue to be part of the cluster, but it will not host any new pods. The Controller instance will remain untouched, and the Pravega admin will need to manually increase the number of replicas and/or delete the affected pod so that it is automatically rescheduled to a different node.

@adrianmo adrianmo self-assigned this Nov 20, 2018
@fpj fpj left a comment


Is there a way to unit test this operator work?

pkg/pravega/pravega_controller.go (outdated review comments, resolved)
pkg/pravega/pravega_segmentstore.go (outdated review comments, resolved)
pkg/pravega/bookie.go (outdated review comments, resolved)

@shrids shrids left a comment


A couple of questions.

pkg/pravega/bookie.go (two outdated review comment threads, resolved)
@adrianmo adrianmo added the status/blocked Issue or PR is blocked on another item; add reference in a comment label Nov 27, 2018

@shrids shrids left a comment


Multiple bookies should not be brought down at the same time.
(and auto recovery should be ON by default for bookies)

@adrianmo adrianmo added status/work-in-progress PR work is in progress; do not merge and removed status/blocked Issue or PR is blocked on another item; add reference in a comment labels Dec 5, 2018

adrianmo commented Dec 5, 2018

Need to rebase and fix conflicts.


adrianmo commented Dec 5, 2018

Blocked on #89 that brings the ability to set default values.

@EronWright

@adrianmo we also need to be able to disable the anti-affinity features for dev deployments, e.g. minikube. I suppose that one option would be that the features would be disabled if replicas < 3 for a given component. WDYT?


adrianmo commented Dec 7, 2018

@EronWright the anti-affinity rules used in this PR are soft requirements (PreferredDuringSchedulingIgnoredDuringExecution), meaning that if there's only one node (e.g. minikube), all pods will run on that node.


adrianmo commented Dec 7, 2018

Need to bring down the BK replication factor to 2 as it was done in pravega/pravega#3158

* master:
  Issue 78: Clean up persistent volumes when deleting Pravega Cluster (#103)
  Issue 97: Update to operator SDK v0.2.0 (#105)
  Ability to scale Pravega Controller (#99)
  Issue 61: Support external connectivity (#77)
  Issue 95: Updated README.md (#95)
Signed-off-by: Adrián Moreno <[email protected]>
* master:
  Issue 88: Set default values when not specified (#89)
Signed-off-by: Adrián Moreno <[email protected]>
@adrianmo adrianmo added status/ready The issue is ready to be worked on; or the PR is ready to review and removed status/work-in-progress PR work is in progress; do not merge labels Dec 12, 2018
@adrianmo

@fpj @shrids The PR is ready to review again.

Recent changes:


@shrids shrids left a comment


The changes look good.
I have a query regarding the Controller's PDB.

pkg/controller/pravega/pravega_controller.go (outdated review comment, resolved)

@shrids shrids left a comment


The changes look good; no additional comments from my end apart from a query related to the Controller's PDB.

@adrianmo

Updated the PR and description to address @shrids' comments. Changed the Controller PDB to minAvailable=1.
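
For context, a Controller PDB with minAvailable=1 amounts to something like the following sketch (policy/v1beta1 types; the function name, PDB name, and labels are illustrative, not necessarily the operator's actual values):

package pravega

import (
	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch: keep at least one Controller pod available during planned disruptions.
func makeControllerPDB(clusterName string) *policyv1beta1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(1)
	return &policyv1beta1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name: clusterName + "-pravega-controller", // illustrative name
		},
		Spec: policyv1beta1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"component": clusterName + "-pravega-controller"}, // illustrative label
			},
		},
	}
}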

@adrianmo adrianmo requested a review from Tristan1900 December 12, 2018 17:22

fpj commented Dec 17, 2018

I wanted to capture an offline discussion with @EronWright. BK automatically re-replicates ledger fragments (when the feature is enabled). Even if we have a disruption budget such that we have a single disruption at a time, there is no guarantee that the data will be re-replicated fast enough. As such, we could end up in a situation in which we get rid of too many copies of the data and can't recover it.

Ideally, we wait for the re-replication to finish before causing another disruption, but I don't think we have an API for that. If we are replacing bookies, then we could consider keeping the volumes of the decommissioned bookies, but that's confusing in the general case.

* master:
  Issue 96: Startup and healthcheck improvements (#102)

Signed-off-by: Adrián Moreno <[email protected]>
@adrianmo adrianmo force-pushed the issue-60-data-protection-ha branch from 9886223 to 181112b Compare December 20, 2018 11:20
@fpj fpj added status/ready The issue is ready to be worked on; or the PR is ready to review and removed status/work-in-progress PR work is in progress; do not merge labels Dec 20, 2018
var maxUnavailable intstr.IntOrString

if pravegaCluster.Spec.Pravega.SegmentStoreReplicas == int32(1) {
	maxUnavailable = intstr.FromInt(0)

I'm wondering if we should remove this if clause. If there is a single instance, then in principle we never want it down. At the same time, it is not possible to guarantee that. Since the PDB policy is a soft policy, the pod will be brought down anyway if k8s needs to do it, so we might as well remove the if and leave it at:

maxUnavailable = intstr.FromInt(1)

@adrianmo (Contributor Author)

PDB protects pods from planned disruptions, such as a node eviction. In that case, if we have 1 Segment Store replica and maxUnavailable=1, Kubernetes will first kill the pod and then start a new one on a different node, causing a temporary disruption. Whereas if maxUnavailable=0, Kubernetes will put the node eviction on hold, giving the user time to increase the replica count and create a new pod on a different node, and then delete the affected pod to resume the node eviction, resulting in zero downtime.


I don't understand this comment:

if maxUnavailable=0, Kubernetes will put the node eviction on hold, giving the user time to increase the replica count and create a new pod on a different node, and then delete the affected pod to resume the node eviction, resulting in zero downtime.

How does the user know it is being given time to increase the replica count? Also, is it saying that we can have multiple instances temporarily? That might be worse than having none temporarily and then one back, because having multiple temporarily will induce two rebalances of the segment containers.

@adrianmo (Contributor Author)

How does the user know it is being given time to increase the replica count?

Due to the PDB, Kubernetes will put the eviction on hold until the affected pod is deleted. It's up to the user to decide what to do: they can increase the replica count to avoid downtime, or they can just delete the pod altogether and let Kubernetes reschedule it.

Also, is it saying that we can have multiple instances temporarily?

Yes, but not necessarily; the user can just delete the pod and avoid having multiple instances temporarily.


I'm confused about how the user knows that a pod needs to be deleted. I'm assuming that the eviction is part of some automated process and K8s is applying a policy to decide how to bring specific pods down.
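
To summarize the behavior discussed in this thread: with a single Segment Store replica the PDB blocks planned disruptions entirely, otherwise one disruption at a time is allowed. A minimal sketch of that logic, with an illustrative function name:

package pravega

import (
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Sketch of the logic discussed above: block planned disruptions entirely when
// there is a single Segment Store replica, otherwise allow one at a time.
func segmentStoreMaxUnavailable(replicas int32) intstr.IntOrString {
	if replicas == 1 {
		// The eviction stays on hold; the admin can scale up and/or delete the pod manually.
		return intstr.FromInt(0)
	}
	return intstr.FromInt(1)
}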

@@ -141,6 +143,9 @@ func makeSegmentstorePodSpec(pravegaCluster *api.PravegaCluster) corev1.PodSpec
func MakeSegmentstoreConfigMap(pravegaCluster *api.PravegaCluster) *corev1.ConfigMap {
	javaOpts := []string{
		"-Dpravegaservice.clusterName=" + pravegaCluster.Name,
		"-Dbookkeeper.bkEnsembleSize=2",

We should remove these lines, and probably give a way to the user to set these values.

@adrianmo adrianmo (Contributor Author) Dec 20, 2018

Good catch. I'll remove those lines, as the default value is already set to 2; this way we will allow users to override the defaults using the options section of the CR.

...
  pravega:
    options:
      bookkeeper.bkEnsembleSize: "2"
      bookkeeper.bkAckQuorumSize: "2"
      bookkeeper.bkWriteQuorumSize: "2"
...


adrianmo commented Dec 20, 2018

Regarding @fpj's comment in #92 (comment), I've created issue #114 to investigate BookKeeper's disruption cases.

@fpj fpj merged commit 9990154 into master Dec 20, 2018
@adrianmo adrianmo deleted the issue-60-data-protection-ha branch December 20, 2018 17:23