Issue 60: Data protection / High availability #92
Is there a way to unit test this operator work?
A couple of questions.
Multiple bookies should not be brought down at the same time.
(and auto recovery should be ON by default for bookies)
Need to rebase and fix conflicts.
Blocked on #89, which brings the ability to set default values.
@adrianmo we also need to be able to disable the anti-affinity features for dev deployments, e.g. minikube. I suppose one option would be to disable the features if replicas < 3 for a given component. WDYT?
@EronWright the anti-affinity rules used in this PR are soft requirements (preferredDuringSchedulingIgnoredDuringExecution), so they should not block scheduling on dev deployments like minikube where there are not enough nodes.
Need to bring down the BK replication factor to 2, as was done in pravega/pravega#3158.
* master: Issue 88: Set default values when not specified (#89)
@fpj @shrids The PR is ready to review again. Recent changes:
The changes look good.
I have a query regarding the Controller's PDB.
The changes look good; no additional comments from my end apart from a query related to the Controller's PDB.
Updated PR and description to address @shrids' comments. Changed Controller PDB to
I wanted to capture an offline discussion with @EronWright. BK automatically replicates ledger fragments (when the feature is enabled). Even if we have a disruption budget such that we have a single disruption at a time, there is no guarantee that the data will be re-replicated fast enough. As such, we could end up in a situation in which we get rid of too many copies of the data and can't recover it. Ideally, we would wait for the re-replication to finish before causing another disruption, but I don't think we have an API for that. If we are replacing bookies, then we could consider keeping the volumes of the decommissioned bookies, but that's confusing in the general case.
* master: Issue 96: Startup and healthcheck improvements (#102)
var maxUnavailable intstr.IntOrString

// With a single Segment Store replica, disallow voluntary disruptions entirely;
// otherwise allow one pod to be disrupted at a time.
if pravegaCluster.Spec.Pravega.SegmentStoreReplicas == int32(1) {
	maxUnavailable = intstr.FromInt(0)
} else {
	maxUnavailable = intstr.FromInt(1)
}
I'm wondering if we should remove this if clause. If there is a single instance, then in principle we never want it down. At the same time, it is not possible to guarantee that. As the PDB policy is a soft policy, the pod will be brought down anyway in case k8s needs to do it, so we might as well remove the if and leave it at:
maxUnavailable = intstr.FromInt(1)
PDB protects pods from planned disruptions, such as a node eviction. In that case, if we have 1 Segment Store replica and maxUnavailable=1, Kubernetes will first kill the pod and then start a new one on a different node, causing a temporary disruption. Whereas if maxUnavailable=0, Kubernetes will put the node eviction on hold, giving the user time to increase the replica count and create a new pod on a different node, and then delete the affected pod to resume the node eviction, resulting in zero downtime.
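To make the single-replica case concrete, the behavior described above corresponds to a disruption budget roughly like the sketch below; this is only an illustration, and the object name and label are assumptions rather than the operator's actual values.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-segmentstore-pdb          # assumed name
spec:
  maxUnavailable: 0                       # hold voluntary evictions while there is a single replica
  selector:
    matchLabels:
      component: pravega-segmentstore     # assumed label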
I don't understand this comment:
if maxUnavailable=0, Kubernetes will put node eviction on hold, giving the user time to increase the replica number and create a new pod in a different node, and then delete the affected pod to resume the node eviction, resulting in zero downtime.
How does the user know it is being given time to increase the replica count? Also, is it saying that we can have multiple instances temporarily? That might be worse than having none temporarily and then one back, because having multiple instances temporarily will induce two rebalances of the segment containers.
How does the user know it is being given time to increase the replica count?
Due to the PDB, Kubernetes will put the eviction on hold until the affected pod is deleted. It's up to the user to decide what to do: they can increase the replica count to avoid downtime, or they can just delete the pod and let Kubernetes reschedule it.
Also, is it saying that we can have multiple instances temporarily?
Yes, but not necessarily; the user can just delete the pod and avoid having multiple instances temporarily.
I'm confused about how the user knows that a pod needs to be deleted. I'm assuming that the eviction is part of some automated process and K8s is applying a policy to decide how to bring specific pods down.
@@ -141,6 +143,9 @@ func makeSegmentstorePodSpec(pravegaCluster *api.PravegaCluster) corev1.PodSpec
func MakeSegmentstoreConfigMap(pravegaCluster *api.PravegaCluster) *corev1.ConfigMap {
	javaOpts := []string{
		"-Dpravegaservice.clusterName=" + pravegaCluster.Name,
		"-Dbookkeeper.bkEnsembleSize=2",
We should remove these lines, and probably give the user a way to set these values.
Good catch. I'll remove those lines, as the default value is already set to 2, and this way we will allow users to override the default values using the options CR section:
...
pravega:
  options:
    bookkeeper.bkEnsembleSize: "2"
    bookkeeper.bkAckQuorumSize: "2"
    bookkeeper.bkWriteQuorumSize: "2"
...
Regarding @fpj's comment in #92 (comment), I've created issue #114 to investigate BookKeeper's disruption cases.
Change log description
The operator should ensure that pods are deployed in a reasonable way from a data protection and high availability perspective.
Pod protection
Kubernetes has the ability to safely evict all of your pods from a node before an admin performs maintenance on that node. The operator should prevent this from happening if evicting the node would cause disruption in a Pravega cluster. This can be achieved with Pod Disruption Budgets (PDBs).
With these changes, the operator will create a PDB for each component type (Controller, Segment Store, Bookie) with the following PDB configuration.
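As an illustration of the kind of object created, the Bookkeeper PDB might look roughly like the sketch below; the name, label, and exact value are assumptions, not the operator's literal output (the discussion above indicates maxUnavailable of 1 for Bookkeeper and Segment Store, and 0 when a component runs a single replica).

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-bookie-pdb                 # assumed name
spec:
  maxUnavailable: 1                        # allow at most one voluntary bookie disruption at a time
  selector:
    matchLabels:
      component: bookie                    # assumed label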
AutoRecovery is set to true by default to automatically recover under-replicated ledgers when there is a pod disruption (planned or unplanned).
Pod placement
To achieve higher performance and fault tolerance, instances of the same type should be spread across different nodes. For example, assuming that we have a 3-node Kubernetes cluster and we are attempting to create 3 Bookkeeper instances, the first Bookie will be placed on any of the 3 nodes (e.g. Node 1), the second one will preferably be placed on a different node (e.g. Node 2 or Node 3), and the third Bookie will be placed on the only node that does not yet contain a Bookie. If a fourth Bookie were to be scheduled, it would be placed on any of the nodes, because all of them already have one Bookie.
This has been implemented in this PR with Pod Anti-Affinity using the preferredDuringSchedulingIgnoredDuringExecution soft requirement, because there is no hard functional requirement to place components of the same kind on different nodes; a sketch of such a rule follows.
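For reference, a preferred (soft) anti-affinity term of this kind, embedded in a pod template spec, could look like the sketch below; the weight and the label used to identify pods of the same kind are assumptions for illustration.

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                              # soft preference, not a hard constraint
      podAffinityTerm:
        labelSelector:
          matchLabels:
            component: bookie                  # assumed label for same-kind pods
        topologyKey: kubernetes.io/hostname    # prefer spreading across nodes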
Purpose of the change
How to test the change
Assuming you already have access to a Kubernetes environment (GKE, PKS...) with at least 3 nodes, follow the instructions in the README file to deploy ZooKeeper, the NFS server provisioner, and the Pravega Tier 2 PVC. Skip the deployment of the Pravega operator.
Check out this branch.
Update the Pravega operator image to use an image built with the PR changes: open the deploy/operator.yaml file and replace the container image pravega/pravega-operator:latest with adrianmo/pravega-operator:issue-60.
Deploy the operator.
Follow the instructions in the README file to deploy a Pravega cluster with 3 Bookies, 1 or more Controllers, and 3 Segment stores.
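For reference only, a cluster spec along those lines might look like the sketch below; the metadata name, ZooKeeper URI, and Tier 2 claim name are placeholders, and the field names follow my reading of the v1alpha1 PravegaCluster CRD, so double-check them against the README before applying.

apiVersion: "pravega.pravega.io/v1alpha1"
kind: "PravegaCluster"
metadata:
  name: example                              # placeholder name
spec:
  zookeeperUri: example-client:2181          # placeholder ZooKeeper endpoint
  bookkeeper:
    replicas: 3                              # 3 Bookies, spread by the soft anti-affinity rules
  pravega:
    controllerReplicas: 1
    segmentStoreReplicas: 3
    tier2:
      filesystem:
        persistentVolumeClaim:
          claimName: pravega-tier2           # placeholder Tier 2 PVC name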
Run the following command to verify the correct placement of pods.
Verify that pods of the same kind are scheduled to different nodes.
Check that pod disruption budgets are correctly set.
In this example, we allow one pod disruption for Segment Store and Bookkeeper, but no disruption for the Controller, since there is only one Controller pod.
If we try to drain the node that contains the Controller instance, we should expect the drain command to abort.
The node will continue to be part of the cluster, but it will not host any new pods. The Controller instance will remain untouched, and the Pravega admin will need to manually increase the number of replicas and/or delete the affected pod so that it is automatically rescheduled to a different node.