Failed to create registration probe file error after updating to v2.4.4 #1028
Comments
As with this issue, it seems on my side that rolling back to v2.4.3 is not sufficient: some efs-csi-node pods still had their memory usage growing endlessly (even after a restart). Only draining and replacing the nodes on which these efs-csi-node pods were running seemed to solve the issue.
Hi @headyj, can you install the latest Helm chart version, 2.4.5, and see whether the issue still persists? If yes, can you share the debugging logs?
I will not be able to test the memory leak issue, as it's basically breaking our environments. What I can tell you, though, is that I still have the error on the efs-csi-node (efs-plugin) container.
Hi @headyj, the error is saying that the file system is read-only. Have you checked the security group for the EFS file system you are using and the inbound rules within the security group? The security group needs an inbound rule that accepts NFS traffic. More info on how the file system should be configured is here. If the inbound rule is configured properly, you can follow this document to change the read-only setting within your EFS file system. More info on the node-driver-registrar is here.
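For reference, opening NFS to the nodes would look roughly like this with the AWS CLI; the two security group IDs below are placeholders (the EFS mount-target security group and the worker-node security group):

```sh
# Allow NFS (TCP 2049) from the worker-node security group to the security
# group attached to the EFS mount targets.
# sg-efs-mount-target and sg-worker-nodes are placeholder IDs.
aws ec2 authorize-security-group-ingress \
  --group-id sg-efs-mount-target \
  --protocol tcp \
  --port 2049 \
  --source-group sg-worker-nodes
```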
Actually, we have been using this config for almost 3 years now, so I can assure you that the EFS is not read-only and writes are working on the EFS drives. Also, it seems that only some nodes are affected by these memory leaks, even though all of them show this error message. That's why it's a bit hard to identify the problem: all the containers from the efs-csi pods (daemonset) have the exact same logs, but only some of them are leaking. For some reason, draining the node and replacing it seems to solve the issue, but this is definitely not something we want to do each time we update the EFS plugin, IMO. Pods are never properly stopped on the node either: after we update the plugin, all of the pods running on each node are stuck in Terminating, so I have to kill them manually.
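For reference, the kind of commands involved here would be roughly the following; the node, pod, and namespace names are placeholders:

```sh
# Drain and replace the node (what currently "fixes" the leak).
kubectl drain ip-10-0-0-1.ec2.internal --ignore-daemonsets --delete-emptydir-data

# Force-delete a pod that is stuck in Terminating after the plugin update.
kubectl delete pod my-app-pod -n my-namespace --grace-period=0 --force
```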
I've noticed this new error too. The error does say the file system is read-only, and I notice the newer 2.4.4 version of the Helm chart has added a readOnlyRootFilesystem setting to the node-driver-registrar sidecar.
Looks like this commit to the 2.4.4 Helm chart may have broken things. The good news is that you can override this in the chart deployment values.
I have the same issue after updating to the 2.4.9 Helm chart.

sidecars:
  nodeDriverRegistrar:
    securityContext:
      readOnlyRootFilesystem: false
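For reference, the same override can also be passed on the command line; the release and chart names below are placeholders and depend on how the chart was installed, while the values key matches the snippet above:

```sh
helm upgrade efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  -n kube-system \
  --set sidecars.nodeDriverRegistrar.securityContext.readOnlyRootFilesystem=false
```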
The registration probe error has been around forever... kubernetes-csi/node-driver-registrar#213. It also shows up for the EKS add-on.
With the latest Helm chart I'm still getting this error. It looks like it expects this folder to be mounted from the host, but looking at the volume mounts of this container (aws-efs-csi-driver/charts/aws-efs-csi-driver/templates/node-daemonset.yaml, lines 131 to 135 at cb9d97d), it is not mounted from the host, hence it ends up on the container root filesystem, which is configured as read-only.
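A purely illustrative sketch (not the chart's actual manifest): one way the probe path could be made writable without touching readOnlyRootFilesystem is to mount that directory from the host into the registrar container. The volume name here is an assumption:

```yaml
# Hypothetical excerpt of a node DaemonSet pod spec: give the registrar a
# hostPath mount covering the registrationProbePath so the mkdir no longer
# hits the read-only container root filesystem.
containers:
  - name: csi-driver-registrar
    volumeMounts:
      - name: efs-plugin-dir            # assumed volume name
        mountPath: /var/lib/kubelet/plugins/efs.csi.aws.com
volumes:
  - name: efs-plugin-dir
    hostPath:
      path: /var/lib/kubelet/plugins/efs.csi.aws.com
      type: DirectoryOrCreate
```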
The problem is noted in the Kubernetes docs -> https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation. In that document there is a warning which is the actual problem: only privileged containers are allowed to use Bidirectional mount propagation. So the problem here is that it is trying to propagate that volume from a container that is not privileged. The quicker fix is to override that setting in the chart values, as shown above. Of course, this is still a bug and needs a proper fix in the chart.
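To illustrate the warning referenced above: Kubernetes only accepts a Bidirectional mount-propagation volume mount on a privileged container, roughly like this (container and volume names are illustrative):

```yaml
# Illustrative container spec: Bidirectional mount propagation is only
# allowed for privileged containers, per the Kubernetes docs linked above.
containers:
  - name: efs-plugin                    # illustrative name
    securityContext:
      privileged: true
    volumeMounts:
      - name: kubelet-dir
        mountPath: /var/lib/kubelet
        mountPropagation: Bidirectional
volumes:
  - name: kubelet-dir
    hostPath:
      path: /var/lib/kubelet
      type: Directory
```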
If I understand the comment on the linked issue right, this should be addressed in v2.9.0 of the node-driver-registrar.
We are using version v1.5.4 of the efs-csi-driver and it also has the memory leak issue: no matter how much memory I give it, some of the efs-csi-node pods get OOMKilled, even after I changed the settings mentioned above.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/kind bug
What happened?

After updating from v2.4.3 to v2.4.4, memory usage seems to have more than doubled on the efs-csi-node pods, and some of them also seem to have memory leaks.

I am also seeing this error on the csi-driver-registrar container:

"Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"

The main consequence seems to be that some of the pods fail to mount their volumes, with a "timeout waiting for condition" error.

How to reproduce it (as minimally and precisely as possible)?

Just update from v2.4.3 to v2.4.4.
Environment

Kubernetes version (use kubectl version): v1.26 (EKS)
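For anyone trying to confirm the two symptoms, commands along these lines can help; the kube-system namespace, the app=efs-csi-node label, and the pod name are assumptions about the deployment, while csi-driver-registrar is the sidecar container named in the error above:

```sh
# Watch memory usage of the node DaemonSet pods (requires metrics-server).
kubectl top pods -n kube-system -l app=efs-csi-node

# Pull the registrar sidecar logs from one of the node pods to look for the
# "Failed to create registration probe file" error.
kubectl logs -n kube-system efs-csi-node-xxxxx -c csi-driver-registrar
```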