-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PersistentVolumes stuck after node consolidation / termination #944
Comments
On the AWS side this has been fixed via this PR to the aws-ebs-csi driver, at least for graceful terminations: kubernetes-sigs/aws-ebs-csi-driver#1736. However, this doesn't address the issue for CSI drivers at large. Curious if any of the Azure folks have ran into similar issues (cc @tallaxes @Bryce-Soghigian @jackfrancis)? |
/remove-label needs-triage |
@jmdeal: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
We've got some updates to add here, at least for AWS users. This problem can be largely mitigated by ensuring your nodes are configured for gracefule node shutdown. Currently this isn't enabled by default on AL2, the following UserData should be sufficient to enable it: #!/bin/bash
echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
systemctl restart systemd-logind
echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json Note: This is specific for the AL2 EKS optimized AMI, enabling graceful termination will differ depending on your AMI. |
@jmdeal I tried use the userdata on AL2023 and the 6 minute stuck still happen 😢 |
Any suggestions on the correct userdata for EKS Bottlerocket AMIs? Can't use what's suggested above because Bottlerocket doesn't have the |
The AWS provider has supported configuring graceful shutdown for bottlerocket since v0.31.0 (aws/karpenter-provider-aws#4571) and as far as I can tell bottlerocket should have |
@levanlongktmt apologies for missing your question, you likely need to handle the kubelet configuration through the nodeadm config rather than bash. This should work:
|
@jmdeal apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: default
spec:
amiFamily: AL2023
(...)
userData: |
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"
--//
Content-Type: application/node.eks.aws
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
kubelet:
config:
shutdownGracePeriod: 45s
shutdownGracePeriodCriticalPods: 15s
--//
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
systemctl restart systemd-logind
--// A Karpenter node is configured as expected: ❯ kubectl debug -it node/ip-172-30-20-187.ap-northeast-1.compute.internal --image=alpine
/ # chroot /host
[root@ip-172-30-20-187 /]# cat /etc/systemd/logind.conf | grep InhibitDelayMaxSec
#InhibitDelayMaxSec=5
InhibitDelayMaxSec=45
[root@ip-172-30-20-187 /]# cat /etc/kubernetes/kubelet/config.json | grep shutdownGracePeriod
"shutdownGracePeriod": "45s",
"shutdownGracePeriodCriticalPods": "15s", |
After some discussion with the EBS CSI Driver team, I think the real solve here is for the client-side drain behavior to actually wait on the volume detachment since that happens asynchronously. Realistically, it seems like this would also be a problem for Cluster Autoscaler if they don't have logic to implement this as well, so we should think about how to solve this generally across SIG Autoscaling if we can. We're receptive to someone taking this change in and implementing it. The change should be waiting after we drain here: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/node/termination/controller.go#L86. Realistically, for this to work, we need to make sure to avoid draining the EBS CSI driver DaemonSet, which means that EBS shouldn't tolerate the Karpenter taint so that it sticks around for the entire lifetime of the node. Either that or it gets drained as part of the teramination flow and it keeps its pre-stop logic like it has today. |
@jonathan-innis That's interesting and the opposite of what I've thought. So the EBS CSI node pod should not tolerate all tainst (which is the current default) in order to get drained, which runs the preStop hook as described here |
I have been thinking about how to determine if a volume has been unmounted. In the current implementation, we wait for the evicted Pods on the node to go into the Succeeded / Failed phase, but this is not sufficient. A similar discussion can be seen in the upstream regarding the PVC deletion protection feature at kubernetes/kubernetes#123320 (comment), where it's pointed out that even if the Pod phase becomes Succeeded / Failed, it does not guarantee that the volume has been detached. Therefore, there seems to be an attempt to add a new Pod condition called Since I think this feature would be useful for Karpenter / Cluster Autoscaler as well, how about mentioning it as a use case? |
@jonathan-innis @jmdeal do you have any plan for this issue? |
Closing, solved by #1294. |
Description
Observed Behavior:
After Karpenter performs a scale-down event, PersistentVolumes may fail to be detached from the deleted nodes. As a result, pods which are dependent on these PersistentVolumes are left in a pending state until the Attach/Detach controller forcibly detaches the PV after 6 minutes.
This is related to the following upstream kubernetes issue:
While this is an upstream issue, there has been real impact on Karpenter users today and could potentially be addressed as part of Karpenter's node drain logic.
Expected Behavior:
All persistent volumes should be detached from the node as part of the termination process.
Additional References:
The text was updated successfully, but these errors were encountered: