
Failed to create registration probe file error after updating to v2.4.4 #1028

Closed
headyj opened this issue Jun 8, 2023 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.


headyj commented Jun 8, 2023

/kind bug

What happened?

After updating from v2.4.3 to v2.4.4, memory usage seems to have more than doubled on the efs-csi-node pods, and some of them also seem to have memory leaks.

Also seeing this error in the csi-driver-registrar container:
"Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"

The main consequence seems to be that some pods fail to mount their volumes, with a "timed out waiting for the condition" error.

How to reproduce it (as minimally and precisely as possible)?

Just update from v2.4.3 to v2.4.4

Environment

  • Kubernetes version (use kubectl version): v1.26 (EKS)
  • Driver version: v2.4.4
@k8s-ci-robot added the kind/bug label on Jun 8, 2023

headyj commented Jun 8, 2023

As with this issue, it seems on my side that rolling back to v2.4.3 is not sufficient: some efs-csi-node pods still had their memory usage growing endlessly (even after a restart). Only draining and replacing the nodes on which these efs-csi-node pods were running seems to solve the issue.

@mskanth972
Contributor

Hi @headyj, can you install the latest Helm chart version, 2.4.5, and see whether the issue still persists? If yes, can you share the debugging logs?
https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting


headyj commented Jun 12, 2023

I will not be able to test the memory leak issue, as it's basically breaking our environments. What I can see is that, despite the Failed to create registration probe file error, draining and replacing the nodes seems to fix the memory leak issue.

What I can tell you, though, is that I still have the Failed to create registration probe file error with 2.4.5. I tried to set logging_level to DEBUG as explained in the troubleshooting doc, but it doesn't seem to work without restarting the pod (and obviously losing the changes). I also tried to set v=5 in the efs-csi-node daemonset as well as the efs-csi-controller deployment (see the values sketch after these logs for a chart-level way to set verbosity), but there is still not much to see in the logs on either side:

  • efs-csi-controller (csi-provisioner)
W0612 08:58:02.113393       1 feature_gate.go:241] Setting GA feature gate Topology=true. It will be removed in a future release.
I0612 08:58:02.113459       1 feature_gate.go:249] feature gates: &{map[Topology:true]}
I0612 08:58:02.113507       1 csi-provisioner.go:154] Version: v3.5.0
I0612 08:58:02.113533       1 csi-provisioner.go:177] Building kube configs for running in cluster...
I0612 08:58:03.155569       1 common.go:111] Probing CSI driver for readiness
I0612 08:58:03.159033       1 csi-provisioner.go:230] Detected CSI driver efs.csi.aws.com
I0612 08:58:03.161060       1 csi-provisioner.go:302] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
I0612 08:58:03.161732       1 controller.go:732] Using saving PVs to API server in background
I0612 08:58:03.162378       1 leaderelection.go:245] attempting to acquire leader lease kube-system/efs-csi-aws-com...
  • efs-csi-controller (liveness-probe)
I0612 08:58:02.263863       1 main.go:149] calling CSI driver to discover driver name
I0612 08:58:02.266344       1 main.go:155] CSI driver name: "efs.csi.aws.com"
I0612 08:58:02.266388       1 main.go:183] ServeMux listening at "0.0.0.0:9909"
  • efs-csi-controller (efs-plugin)
I0612 08:58:02.158386       1 config_dir.go:63] Mounted directories do not exist, creating directory at '/etc/amazon/efs'
I0612 08:58:02.160552       1 metadata.go:63] getting MetadataService...
I0612 08:58:02.162319       1 metadata.go:68] retrieving metadata from EC2 metadata service
I0612 08:58:02.163244       1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.eu-west-1.amazonaws.com
I0612 08:58:02.163262       1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0612 08:58:02.163367       1 driver.go:140] Did not find any input tags.
I0612 08:58:02.163544       1 driver.go:113] Registering Node Server
I0612 08:58:02.163553       1 driver.go:115] Registering Controller Server
I0612 08:58:02.163562       1 driver.go:118] Starting efs-utils watchdog
I0612 08:58:02.163706       1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0612 08:58:02.163829       1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0612 08:58:02.164879       1 driver.go:124] Starting reaper
I0612 08:58:02.164894       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/var/lib/csi/sockets/pluginproxy/csi.sock", Net:"unix"}
I0612 08:58:03.159309       1 identity.go:37] GetPluginCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0612 08:58:03.160257       1 controller.go:417] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
  • efs-csi-node (liveness-probe)
I0612 09:00:09.421248       1 main.go:149] calling CSI driver to discover driver name
I0612 09:00:09.422425       1 main.go:155] CSI driver name: "efs.csi.aws.com"
I0612 09:00:09.422449       1 main.go:183] ServeMux listening at "0.0.0.0:9809"
  • efs-csi-node (csi-driver-registrar)
I0612 09:00:09.283756       1 main.go:167] Version: v2.8.0
I0612 09:00:09.283823       1 main.go:168] Running node-driver-registrar in mode=registration
I0612 09:00:09.284438       1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0612 09:00:09.289374       1 main.go:199] Calling CSI driver to discover driver name
I0612 09:00:09.293048       1 node_register.go:53] Starting Registration Server at: /registration/efs.csi.aws.com-reg.sock
I0612 09:00:09.293221       1 node_register.go:62] Registration Server started at: /registration/efs.csi.aws.com-reg.sock
I0612 09:00:09.293453       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0612 09:00:09.533955       1 main.go:102] Received GetInfo call: &InfoRequest{}
E0612 09:00:09.534095       1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"
I0612 09:00:09.560524       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

  • efs-csi-node (efs-plugin)

I0612 09:00:09.180667       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0612 09:00:09.182325       1 metadata.go:63] getting MetadataService...
I0612 09:00:09.184196       1 metadata.go:68] retrieving metadata from EC2 metadata service
I0612 09:00:09.185214       1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.eu-west-1.amazonaws.com
I0612 09:00:09.185253       1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0612 09:00:09.185345       1 driver.go:140] Did not find any input tags.
I0612 09:00:09.185607       1 driver.go:113] Registering Node Server
I0612 09:00:09.185637       1 driver.go:115] Registering Controller Server
I0612 09:00:09.185650       1 driver.go:118] Starting efs-utils watchdog
I0612 09:00:09.185743       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0612 09:00:09.185761       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0612 09:00:09.186166       1 driver.go:124] Starting reaper
I0612 09:00:09.186179       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0612 09:00:09.535518       1 node.go:306] NodeGetInfo: called with args
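
For reference, the chart-level way I mean is just a values override (a sketch only; I'm assuming the chart exposes a top-level logLevel value that is wired into the efs-plugin containers' -v flag, so check the chart's values.yaml for the exact key and whether it also covers the sidecars):

logLevel: 5   # assumed chart value; raises klog verbosity instead of hand-editing the daemonset/deployment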

@arnavgup1
Contributor

Hi @headyj, the error is saying that the file system is read-only. Have you checked the security group for the EFS file system you are using and the inbound rules within that security group? The security group needs an inbound rule that accepts NFS traffic. More info on how the file system should be configured is here.

If the inbound rule is configured properly, you can follow this document to change the read-only setting within your EFS file system. More info on the node-driver-registrar is here.


headyj commented Jul 28, 2023

Actually, we have been using this config for almost 3 years now, so I can assure you that the EFS is not read-only and that writes are working on the EFS drives.

Also, it seems that only some nodes are affected by these memory leaks, even though all of them show this error message. That's why it's a bit hard to identify the problem: all the containers from the efs-csi pods (daemonset) have the exact same logs, but only some of them are leaking. For some reason, draining the node and replacing it solves the issue, but this is definitely not something we want to do each time we update the EFS plugin.

My guess is that the pods are never actually stopped on the node. After we update the plugin, all of the pods running on each node are stuck in Terminating, so I have to kill them using --force. But they probably continue to run endlessly on the node.


whereisaaron commented Jul 29, 2023

I've noticed this new error too. The error does say err="mkdir /var/lib/kubelet: read-only file system", so I assume it is the container filesystem that is being written to?

I notice the newer 2.4.4 version of the Helm chart has added readOnlyRootFilesystem: true to most of the containers. This was not present in the earlier 2.4.3 version of the chart. @headyj, try patching the chart to set readOnlyRootFilesystem: false and see if that fixes it for you?

        - name: csi-driver-registrar
 ...
+           securityContext:
+             allowPrivilegeEscalation: false
+             readOnlyRootFilesystem: true

Looks like this commit to the 2.4.4 helm chart may have broken things:
eb6e3ea

Good news is you can override this in the chart deployment values:
eb6e3ea#diff-56338152bc066c1274cc12e455c5d0585a0ce0cb30831547f47a758d2a750862R36-R47

@evheniyt

I have the same issue after updating to the 2.4.9 Helm chart.
Fixed it by setting:

  sidecars:
    nodeDriverRegistrar:
      securityContext:
        readOnlyRootFilesystem: false


mkim37 commented Sep 8, 2023

The registration probe error has been around forever... kubernetes-csi/node-driver-registrar#213. It's also showing for the EKS add-on too.

@alfredkrohmer

With the latest Helm chart I'm getting this:

E0918 16:02:01.703074       1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"

It looks like it expects this folder to be mounted from the host, but looking at the volume mounts of this container:

volumeMounts:
- name: plugin-dir
  mountPath: /csi
- name: registration-dir
  mountPath: /registration

This is not mounted from the host, hence it ends up on the container root filesystem, which is configured as read-only.
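
One possible shape of a fix (a sketch only, not what the chart currently ships; the volume name probe-dir and the choice of emptyDir here are my own assumptions) would be to give the registrar a writable mount at the exact path it probes, so the mkdir no longer hits the read-only root filesystem:

# hypothetical addition to the node DaemonSet
containers:
  - name: csi-driver-registrar
    volumeMounts:
      - name: plugin-dir
        mountPath: /csi
      - name: registration-dir
        mountPath: /registration
      - name: probe-dir                # added: writable target for the registration probe file
        mountPath: /var/lib/kubelet/plugins/efs.csi.aws.com/
volumes:
  - name: probe-dir
    emptyDir: {}                       # or a hostPath to the same kubelet plugin directory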

@gcaracuel

The problem is noted in the Kubernetes docs -> https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation; the warning in that document describes the actual problem.

Only privileged containers will be able to use mountPropagation: "Bidirectional" https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/deploy/kubernetes/base/node-daemonset.yaml#L68 (the same applies to the Helm chart).

So the problem here is that it is trying to propagate that volume from the efs-plugin container to csi-driver-registrar, but the latter is not privileged.

The quicker fix is to make csi-driver-registrar privileged, so for example if you are using Helm you will need:

sidecars:
  nodeDriverRegistrar:
    securityContext:
      privileged: true
      allowPrivilegeEscalation: true

Of course this is still a bug and needs a fix: either csi-driver-registrar should be privileged by default, or the volumeMount should be set explicitly in that container too.

@the-technat

If I understand the comment on the linked issue right, v2.9.0 of the csi-driver-registrar should be able to deal with a read-only root filesystem?

See kubernetes-csi/node-driver-registrar#213 (comment)
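
If that is the case, one way to try it would be to pin the registrar sidecar image in the chart values. A sketch only; I'm assuming the chart exposes image overrides under sidecars.nodeDriverRegistrar (check the chart's values.yaml for the exact keys):

sidecars:
  nodeDriverRegistrar:
    image:
      repository: registry.k8s.io/sig-storage/csi-node-driver-registrar   # assumed key layout
      tag: v2.9.0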


jiangfwa commented Feb 8, 2024


We are using version v1.5.4 of the efs-csi-driver, and it also has the memory leak issue: no matter how much memory I give it, some of the efs-csi-node pods get OOMKilled, even after I changed readOnlyRootFilesystem to false.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 7, 2024