use --disable-eviction option when drain node #6929

Closed
Bo0km4n opened this issue Jul 15, 2022 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@Bo0km4n
Contributor

Bo0km4n commented Jul 15, 2022

User Story
If a cluster user has created a PDB resource, machine deletion can get stuck in the node drain process.
So after a few drain attempts, I would like the cluster-api machine controller to retry the drain with the --disable-eviction option.

This problem often occurs when a user performs a RollingUpdate of a MachineDeployment.

Detailed Description
I propose the following idea to implement the behavior described above.

Check the node drain timeout using nodeDrainTimeoutExceeded.
Next, if the elapsed drain time has exceeded the timeout, the machine controller enables the --disable-eviction option in the drainNode function.


/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 15, 2022
@killianmuldoon
Contributor

On first look it seems strange to use Cluster API to implicitly overwrite user intention expressed in a pod disruption budget. It seems like a better solution to this issue would be to surface the reason for the lack of rollout of machines at the Cluster API level so users can better configure their workloads.

@Bo0km4n have you got a toy example I could test to see the impact of PDB blocking rollouts?

@sbueringer
Member

sbueringer commented Jul 15, 2022

@killianmuldoon I think you can just deploy something like this:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cluster-api
spec:
  minAvailable: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/provider: cluster-api

The selector has to match some pods that you have (e.g. just let it match the capi controller on a self-hosted cluster). If you want an always-blocking PDB, just use minAvailable: 1 and a Deployment with 1 replica.
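
For reference, a self-contained sketch of that always-blocking setup could look like the manifests below (the pdb-block name/label and the pause image are placeholders I picked for illustration, not anything referenced in this thread):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdb-block
spec:
  replicas: 1              # single replica on purpose
  selector:
    matchLabels:
      app: pdb-block
  template:
    metadata:
      labels:
        app: pdb-block
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-block
spec:
  minAvailable: 1          # with only 1 replica, every eviction of this pod is blocked
  selector:
    matchLabels:
      app: pdb-block

Draining the node this pod lands on should then keep failing, matching the behavior described below.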

Current behavior is probably:

  • without nodeDrainTimeout: drain will just fail indefinitely
  • with nodeDrainTimeout: drain will fail until the timeout, then the node is deleted

@Bo0km4n
Contributor Author

Bo0km4n commented Jul 15, 2022

@killianmuldoon
@sbueringer's example is what I meant.

with nodeDrainTimeout: drain will fail until the timeout, then the node is deleted

I didn't know that. If I set the timeout on the machine I want to drain, is it possible to forcibly delete a Node that is running a Pod that cannot be evicted?
If that's true, my problem will be solved by setting nodeDrainTimeout.

@sbueringer
Member

I didn't know that. If I set the timeout on the machine I want to drain, is it possible to forcibly delete a Node that is running a Pod that cannot be evicted?

That is my understanding, yes.

@Bo0km4n
Contributor Author

Bo0km4n commented Jul 18, 2022

Thanks @sbueringer.
But one point of concern: if the pod being evicted is using a PVC, will the volume mounted on the deleted node become an orphaned volume?

@sbueringer
Member

Not sure. This might depend on your infrastructure. CAPI will just delete the node object and then the corresponding infra.

@enxebre
Member

enxebre commented Jul 18, 2022

Not sure. This might depend on your infrastructure. CAPI will just delete the node object and then the corresponding infra.

To clarify, CAPI will always ensure the underlying infra is gone before deleting the Node, to avoid potential stateful issues, #2565.

But one point of concern: if the pod being evicted is using a PVC, will the volume mounted on the deleted node become an orphaned volume?

At the moment CAPI will wait indefinitely for volumes to be detached (#4945); your KCM cloud provider should take care of it. There's also an ongoing discussion about enabling an optional timeout while waiting for the volumes (#6285).

I didn't know that. If I set the timeout on the machine I want to drain, is it possible to forcibly delete a Node that is running a Pod that cannot be evicted?
If that's true, my problem will be solved by setting nodeDrainTimeout.

Yes. This issue is asking for the behaviour already supported via nodeDrainTimeout.
Note that changing this in an existing MachineDeployment is unlikely to help, as it would try to trigger a rolling upgrade. That's a suboptimal UX we want to improve, #5880.
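
As a concrete reference for the note above, a minimal sketch of where nodeDrainTimeout is set on a MachineDeployment, assuming the v1beta1 API; all names, the version, the timeout value, and the referenced bootstrap/infrastructure templates are placeholders:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-md
spec:
  clusterName: my-cluster
  replicas: 3
  template:
    spec:
      clusterName: my-cluster
      version: v1.24.2
      # After draining has been attempted for this long, the controller stops
      # waiting for evictions (including ones blocked by a PDB) and proceeds
      # with deleting the infrastructure and then the Node, as described above.
      nodeDrainTimeout: 10m
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: my-md-bootstrap
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: DockerMachineTemplate
        name: my-md-infra

Since the field lives under spec.template.spec, editing it on an existing MachineDeployment changes the machine template, which is why it would trigger the rolling upgrade mentioned above.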

@Bo0km4n if this makes sense we can close this, as it's supported by nodeDrainTimeout, and keep any related discussion in the issues linked above.

@Bo0km4n
Contributor Author

Bo0km4n commented Jul 18, 2022

@enxebre Thank you for the information. I will try to join the discussion in the issues above.
And I will check the volume-related behaviors above in my environment.

Thank you, guys. I'm closing this issue.

@Bo0km4n Bo0km4n closed this as completed Jul 18, 2022