Enable out-of-service taint in FAR #92

Merged: 5 commits into medik8s:main on Apr 19, 2024

Conversation

@k-keiichi-rh (Contributor) commented Oct 12, 2023

This PR adds a new remediation strategy based on kubernetes/enhancements#1116.

The following is the new remediation strategy for the out-of-service taint (a sketch of the taints involved follows the list):

  1. One of the nodes fails.
  2. FAR adds a NoExecute taint to the failed node.
    => Ensures that no workloads run on the failed node after it is rebooted.
  3. FAR reboots the failed node via the fence agent.
    => After the reboot, any stateless workloads that were not evicted by the taint are gone from the failed node.
  4. FAR sets the out-of-service taint.
    => This taint expects the node to be in a shutdown or powered-off state (not in the middle of restarting).
  5. After the failed node becomes healthy again, the NoExecute taint from step 2 and the out-of-service taint from step 4 are removed, and the node becomes schedulable again.
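For illustration only, here is a minimal Go sketch of the two taints this strategy relies on. The exact key/value constants are assumptions (based on the Medik8s remediation-taint convention and the upstream out-of-service taint documentation), not necessarily the constants FAR actually uses:

```go
package example

import corev1 "k8s.io/api/core/v1"

var (
	// Step 2 (assumed key/value): a NoExecute taint that keeps workloads
	// off the node while it is being fenced.
	remediationTaint = corev1.Taint{
		Key:    "medik8s.io/remediation",
		Value:  "fence-agents-remediation",
		Effect: corev1.TaintEffectNoExecute,
	}

	// Step 4: the out-of-service taint from the non-graceful node shutdown
	// feature, which lets the control plane force-clean-up terminating pods
	// and their volume attachments on a node that is assumed to be down.
	outOfServiceTaint = corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	}
)
```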

[ToDo]

ECOPROJECT-1326

@openshift-ci bot commented Oct 12, 2023

Hi @k-keiichi-rh. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@razo7 (Member) left a comment:

Thank you for submitting the PR, and for your first contribution to FAR. This would be a very nice enhancement to FAR.
I left some small nits, and one unit test failed (it looks like the timeout was too short).

> This taint expects that the node is in shutdown or power off state (not in the middle of restarting).

Moreover, regarding your comment, I am not sure whether the node will be in a shutdown or powered-off state when FAR adds the out-of-service taint. The only fencing action FAR supports is reboot, which powers the node off and then on again. Therefore, the node won't be in the desired state after the fence agent succeeds.

One more thing to raise is whether we want a validation check of the Kubernetes version (similar to what SNR does), since the out-of-service taint is fairly new in the community and not supported in older versions.
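A version gate like the one suggested here could look roughly like the sketch below. This is an illustration only; the helper name and the assumed minimum version are mine, not FAR's or SNR's actual code.

```go
package utils

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
	"k8s.io/client-go/discovery"
)

// minOutOfServiceVersion is an assumed minimum Kubernetes version for the
// out-of-service taint; the real minimum should come from the feature's
// actual support matrix.
var minOutOfServiceVersion = version.MustParseGeneric("1.26")

// isOutOfServiceTaintSupported asks the API server for its version and
// reports whether it is at least the assumed minimum.
func isOutOfServiceTaintSupported(dc discovery.DiscoveryInterface) (bool, error) {
	info, err := dc.ServerVersion()
	if err != nil {
		return false, fmt.Errorf("failed to read server version: %w", err)
	}
	v, err := version.ParseGeneric(info.GitVersion)
	if err != nil {
		return false, fmt.Errorf("failed to parse server version %q: %w", info.GitVersion, err)
	}
	return v.AtLeast(minOutOfServiceVersion), nil
}
```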

@k-keiichi-rh (Contributor, Author) commented Oct 17, 2023

> This taint expects that the node is in shutdown or power off state (not in the middle of restarting).

> Moreover, regarding your comment, I am not sure whether the node will be in a shutdown or powered-off state when FAR adds the out-of-service taint. The only fencing action FAR supports is reboot, which powers the node off and then on again. Therefore, the node won't be in the desired state after the fence agent succeeds.

We can use the same approach as SNR, which we discussed in medik8s/self-node-remediation#17 (comment).

There are the following cases after the reboot action in FAR (see the sketch after this list):

  1. The failed node is rebooted and becomes healthy again.
    => The out-of-service taint doesn't take effect; it is simply ignored. The node can report its status to the control plane, and the control plane can delete the stateful workloads instead of the out-of-service taint doing so.
  2. The failed node is rebooted but stays unhealthy (the node cannot report its status to the control plane).
    => The out-of-service taint takes effect.
    => The taint triggers deletion of the stateful workloads.
  3. The failed node is not rebooted, because either the power off or the subsequent power on failed.
    => The out-of-service taint doesn't take effect, because it is never added to the failed node.
    => FAR checks the result of executing the fence agent; if it failed, exponential backoff is triggered.
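Case 3 leans on the controller's requeue behaviour. A minimal controller-runtime sketch of that pattern follows; the reconciler fields and the executeFenceAgent helper are illustrative stand-ins, not FAR's actual code. Returning a non-nil error from Reconcile makes controller-runtime requeue the request with its default exponential backoff.

```go
package controllers

import (
	"context"

	"github.com/go-logr/logr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// FenceAgentsRemediationReconciler is trimmed down to what the sketch needs.
type FenceAgentsRemediationReconciler struct {
	client.Client
	Log logr.Logger
}

// executeFenceAgent is a hypothetical stand-in for running the configured
// fence agent against the failed node.
func (r *FenceAgentsRemediationReconciler) executeFenceAgent(ctx context.Context, nodeName string) error {
	// ... run the fence agent and return its error, if any ...
	return nil
}

func (r *FenceAgentsRemediationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the FAR CR and add the NoExecute taint (step 2) ...

	// Case 3: if the fence agent fails, return the error so that
	// controller-runtime requeues the request with exponential backoff;
	// the out-of-service taint is then never added to a node that was
	// not actually power-cycled.
	if err := r.executeFenceAgent(ctx, req.Name); err != nil {
		r.Log.Error(err, "fence agent failed, requeueing with backoff", "node", req.Name)
		return ctrl.Result{}, err
	}

	// ... add the out-of-service taint (step 4) and wait for cleanup ...
	return ctrl.Result{}, nil
}
```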

> One more thing to raise is whether we want a validation check of the Kubernetes version (similar to what SNR does), since the out-of-service taint is fairly new in the community and not supported in older versions.

I will add this topic to my todo list. Thank you for sharing it.

@razo7 (Member) commented Oct 25, 2023

> We can use the same approach as SNR, which we discussed in medik8s/self-node-remediation#17 (comment).

SGTM

@k-keiichi-rh k-keiichi-rh changed the title [WIP] Enable out-of-service taint in FAR Enable out-of-service taint in FAR Oct 26, 2023
@k-keiichi-rh k-keiichi-rh changed the title Enable out-of-service taint in FAR [WIP] Enable out-of-service taint in FAR Oct 26, 2023
@razo7 (Member) left a comment:

Thank you for addressing my last comments!

I have added some more comments :) Mostly minor nits on phrasing, a missing log, consts, and simulating the deletion of the Pod and VolumeAttachment (VA).
Please add a new commit after each review, so it is easier to review the changes made since the last review.

@slintes (Member) commented Oct 28, 2023

> This taint expects that the node is in shutdown or power off state (not in the middle of restarting).

> Moreover, regarding your comment, I am not sure whether the node will be in a shutdown or powered-off state when FAR adds the out-of-service taint. The only fencing action FAR supports is reboot, which powers the node off and then on again. Therefore, the node won't be in the desired state after the fence agent succeeds.

> We can use the same approach as SNR, which we discussed in medik8s/self-node-remediation#17 (comment).

I'm not sure if the same arguments as stated on SNR apply for FAR. The timing is different:

SNR:

  • remediation starts, node reboot is triggered
  • some time expires
  • node reboot completed
  • some more time expires until safeTimeToAssumeNodeRebooted is reached
  • taint is added only when node is still unhealthy

FAR:

  • remediation starts, node reboot is triggered
  • the taint is added immediately
  • some time expires
  • node reboot completed
  • node might be healthy now but has the taint already. I understand this should be avoided.

@k-keiichi-rh (Contributor, Author) commented:

> I'm not sure if the same arguments as stated on SNR apply for FAR. The timing is different:

I may not understand your point correctly. So please let me confirm it just in case.

> SNR:
>
>   • remediation starts, node reboot is triggered
>   • some time expires
>   • node reboot completed
>   • some more time expires until safeTimeToAssumeNodeRebooted is reached
>   • taint is added only when node is still unhealthy

In the current OutOfService remediation in SNR, the out-of-service taint is added even to a node that has become healthy again after the reboot. However, the out-of-service taint is deleted right after checking that there is no stateful workload left on the node.

So should we avoid adding the out-of-service taint to a healthy node by checking whether the SNR CR is being deleted by NHC/MHC?

> FAR:
>
>   • remediation starts, node reboot is triggered
>   • the taint is added immediately
>   • some time expires
>   • node reboot completed
>   • node might be healthy now but has the taint already. I understand this should be avoided.

If the node becomes healthy again, the FAR CR is deleted by NHC/MHC and the recovery action (deleting the out-of-service taint) is also executed. In this case, the healthy node won't keep the out-of-service taint, so it will come back into the cluster again.
So we can avoid the situation where a healthy node keeps the out-of-service taint after rebooting.

@slintes (Member) commented Nov 2, 2023

> > I'm not sure if the same arguments as stated on SNR apply for FAR. The timing is different:
>
> I may not understand your point correctly. So please let me confirm it just in case.
>
> > SNR:
> >
> >   • remediation starts, node reboot is triggered
> >   • some time expires
> >   • node reboot completed
> >   • some more time expires until safeTimeToAssumeNodeRebooted is reached
> >   • taint is added only when node is still unhealthy
>
> In the current OutOfService remediation in SNR, the out-of-service taint is added even to a node that has become healthy again after the reboot. However, the out-of-service taint is deleted right after checking that there is no stateful workload left on the node.
>
> So should we avoid adding the out-of-service taint to a healthy node by checking whether the SNR CR is being deleted by NHC/MHC?

I thought we already do this, but just double checked the code, and we don't.
Yes, I think we should stop any further fencing action when the CR has the deletion timestamp set. We should just do cleanup.
I understood that rebooting already is a grey area for using the taint, because of "This taint expects that the node is in shutdown or power off state (not in the middle of restarting)".
Isn't putting the taint on a node which finished rebooting, and is healthy now, an issue? Even when we remove it afterwards?

/cc @mshitrit fyi

> > FAR:
> >
> >   • remediation starts, node reboot is triggered
> >   • the taint is added immediately
> >   • some time expires
> >   • node reboot completed
> >   • node might be healthy now but has the taint already. I understand this should be avoided.
>
> If the node becomes healthy again, the FAR CR is deleted by NHC/MHC and the recovery action (deleting the out-of-service taint) is also executed. In this case, the healthy node won't keep the out-of-service taint, so it will come back into the cluster again. So we can avoid the situation where a healthy node keeps the out-of-service taint after rebooting.

@openshift-ci bot commented Nov 2, 2023

@slintes: GitHub didn't allow me to request PR reviews from the following users: fyi.

Note that only medik8s members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to the /cc in the comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k-keiichi-rh (Contributor, Author) commented:

> Yes, I think we should stop any further fencing action when the CR has the deletion timestamp set. We should just do cleanup.
> I understood that rebooting already is a grey area for using the taint, because of "This taint expects that the node is in shutdown or power off state (not in the middle of restarting)".

I agree with you. I will apply the fix to the out-of-service taint remediation and check that the change has no side effects.
As for stopping any further fencing action on a healthy node, I think the basic idea here is that the control-plane should handle the fencing action if the failed node can communicate with the control-plane, so we don't need to do anything in SNR.
If yes, does the same apply to the ResourceDeletion remediation as well as the OutOfServiceTaint remediation?

> Isn't putting the taint on a node which finished rebooting, and is healthy now, an issue? Even when we remove it afterwards?

As far as I have checked the effect of the out-of-service taint, putting the taint on is not an issue and has no side effects.

In the "After rebooting" phase of SNR, the failed node has both the normal NoExecute taint and the NoSchedule taint, and we expect that there are no stateful workloads left on the node. So the out-of-service taint won't do anything.

@slintes (Member) commented Nov 3, 2023

> > Yes, I think we should stop any further fencing action when the CR has the deletion timestamp set. We should just do cleanup.
> > I understood that rebooting already is a grey area for using the taint, because of "This taint expects that the node is in shutdown or power off state (not in the middle of restarting)".
>
> I agree with you. I will apply the fix to the out-of-service taint remediation and check that the change has no side effects. As for stopping any further fencing action on a healthy node, I think the basic idea here is that the control-plane should handle the fencing action if the failed node can communicate with the control-plane, so we don't need to do anything in SNR. If yes, does the same apply to the ResourceDeletion remediation as well as the OutOfServiceTaint remediation?

  • not sure if I understand, what do you mean with "the control-plane should handle the fencing action"?
  • I think we need to do some "cleanup" in SNR, e.g. removing taints which were already set in the pre-reboot phase. Maybe we can just switch to the fencing completed phase directly, it should do everything we need for cleanup?
  • yes, I think the same applies to the ResourceDeletion strategy
  • before changing anything, we should wait for a comment from @mshitrit
  • I will create an issue for SNR to have the discussion at the right place 🙂

> > Isn't putting the taint on a node which finished rebooting, and is healthy now, an issue? Even when we remove it afterwards?
>
> As far as I have checked the effect of the out-of-service taint, putting the taint on is not an issue and has no side effects.
>
> In the "After rebooting" phase of SNR, the failed node has both the normal NoExecute taint and the NoSchedule taint, and we expect that there are no stateful workloads left on the node. So the out-of-service taint won't do anything.

Ok, then my concerns were wrong, and that makes the SNR topic much less urgent. Sorry for the noise and thanks for the discussion!

@slintes (Member) commented Nov 3, 2023

For the SNR-related discussion, let's continue here: medik8s/self-node-remediation#159

@k-keiichi-rh (Contributor, Author) commented:

> I have added some more comments :) Mostly minor nits on phrasing, a missing log, consts, and simulating the deletion of the Pod and VolumeAttachment (VA). Please add a new commit after each review, so it is easier to review the changes made since the last review.

@razo7 Thank you for taking the time to review.
I have addressed your comments. Please check them.

By the way, are my replies to your comments visible?
My replies are noted with the "Pending" tag.

@razo7 (Member) commented Nov 5, 2023

> By the way, are my replies to your comments visible?
> My replies are noted with the "Pending" tag.

No, I can't see your replies since you haven't submitted your review. Please see here on how to submit them.

@k-keiichi-rh (Contributor, Author) left a comment:

Sorry that I had not submitted my review.
I have addressed all of your comments and have no questions about them.

@slintes (Member) left a comment:

Mostly straightforward PR :) Needs rebase though.
Some comments inline.

Review thread on the new helper in pkg/utils/resources.go:

@@ -72,3 +72,31 @@ func DeleteResources(ctx context.Context, r client.Client, nodeName string) error

	return nil
}

func IsResourceDeletionCompleted(r client.Client, nodeName string) bool {
Reviewer (Member): please pass a Context to this function and use it in the API calls, similar to the function above.

Author: I will do this.

	pods := &corev1.PodList{}
	if err := r.List(context.Background(), pods); err != nil {
		log.Error(err, "failed to get pod list")
		return false
Reviewer (Member): would it make sense to return an error here, to be able to differentiate between "something went wrong" and "pods not deleted yet" where this function is called?

Author: I agree. I will change it.
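Taken together, the two suggestions (pass a Context, return an error) would make the helper look roughly like the sketch below. This is an illustration only and covers just the Pod check; per the earlier review comments, the real helper also needs to consider VolumeAttachments.

```go
package utils

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// IsResourceDeletionCompleted reports whether no terminating pods remain on
// the given node. The error return lets callers distinguish "the API call
// failed" from "pods are not deleted yet".
func IsResourceDeletionCompleted(ctx context.Context, r client.Client, nodeName string) (bool, error) {
	pods := &corev1.PodList{}
	if err := r.List(ctx, pods); err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		// A pod on this node that still carries a DeletionTimestamp is
		// still terminating, so deletion is not completed yet.
		if pod.Spec.NodeName == nodeName && pod.DeletionTimestamp != nil {
			return false, nil
		}
	}
	return true, nil
}
```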

Review thread on the reconcile code:

	// remove out-of-service taint when using OutOfServiceTaint remediation
	if far.Spec.RemediationStrategy == v1alpha1.OutOfServiceTaintRemediationStrategy {
		r.Log.Info("Removing OutOfService taint", "Fence Agent", far.Spec.Agent, "Node Name", req.Name)
		if !utils.IsResourceDeletionCompleted(r.Client, req.Name) {
Reviewer (Member): Do we really need this check? Are we sure that always all Pods get the DeletionTimestamp? What about Pods which tolerate the taint?

Author:

> Do we really need this check?

I think the ResourceDeletionRemediationStrategy explicitly force-deletes all of the pods. There we have a way to confirm that the terminating pods are deleted by checking the result of the deletion, so I am 100% sure we don't need this check there.

However, with the OutOfServiceTaintRemediationStrategy, I have 1% doubt whether the terminating pods are deleted.
If NHC identifies that the node has become healthy, the control-plane or kubelet deletes the terminating pods. So we can expect that there are no terminating pods at this stage and may not need this check.
However, we can't control the behavior of the control-plane or kubelet, and compared to the ResourceDeletionRemediationStrategy we can only expect the terminating pods to be deleted indirectly by them.

This remaining 1% was the reason why I thought we needed this check.
But I may be overthinking it, so I will drop this change.

> Are we sure that always all Pods get the DeletionTimestamp?
> What about Pods which tolerate the taint?

The current out-of-service taint focuses only on terminating pods, i.e. pods that have a DeletionTimestamp, so that workloads can fail over to another node. If we can confirm that there are no terminating pods left, it means all workloads on the failed node can move to another node. If we cannot, we need to improve the out-of-service taint code in k8s.

@slintes (Member) left a comment:

will review the e2e test tomorrow, 2 comments inline

@k-keiichi-rh force-pushed the ecoproject-1326 branch 2 times, most recently from 24fe752 to 9fe49ca on April 9, 2024 14:56
@k-keiichi-rh (Contributor, Author) commented:

@slintes Thank you for the comments again. I have addressed them.
The following contains all of the changes since my last commit: https://github.com/k-keiichi-rh/fence-agents-remediation/commits/ecoproject-1326-with-review/

@slintes (Member) commented Apr 9, 2024

@k-keiichi-rh fyi, we have a CI outage at the moment, e2e tests are expected to fail until further notice 🙁

@slintes (Member) commented Apr 11, 2024

/test all

@slintes (Member) left a comment:

CI is working again.
I left one remark in the reconcile code, and there is an issue in the e2e test.
Besides that, lgtm :)

@slintes (Member) commented Apr 12, 2024

There is duplicated code in the e2e test, but we'll clean it up in a follow-up in order to get this in for the next release...

@razo7 (Member) commented Apr 18, 2024

/retest

@mshitrit (Member) commented:
/lgtm
/hold
Since we are after Code Freeze, waiting for QE green light before merging

@mshitrit (Member) commented:
/test 4.15-openshift-e2e

@mshitrit (Member) commented:
/test 4.14-openshift-e2e

@k-keiichi-rh (Contributor, Author) commented:
/test 4.15-openshift-e2e

@frajamomo commented:

/lgtm

@openshift-ci bot commented Apr 19, 2024

@frajamomo: changing LGTM is restricted to collaborators

In response to the /lgtm above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mshitrit (Member) commented:
/unhold

@mshitrit merged commit 7f4a492 into medik8s:main on Apr 19, 2024 (22 checks passed).