
Failed to create registration probe file error after updating to v2.4.4 #1028

Closed
headyj opened this issue Jun 8, 2023 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.


headyj commented Jun 8, 2023

/kind bug

What happened?

After updating from v2.4.3 to v2.4.4, memory usage seems to have more than doubled on the efs-csi-node pods, and some of them also seem to have memory leaks.

Also seeing this error in the csi-driver-registrar container:
"Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"

The main consequence seems to be that some pods fail to mount their volumes, with a "timed out waiting for the condition" error.

How to reproduce it (as minimally and precisely as possible)?

Just update from v2.4.3 to v2.4.4

Environment

  • Kubernetes version (use kubectl version): v1.26 (EKS)
  • Driver version: v2.4.4
@k8s-ci-robot added the kind/bug label on Jun 8, 2023

headyj commented Jun 8, 2023

As with this issue, it seems on my side that rolling back to v2.4.3 is not sufficient: some efs-csi-node pods still had their memory usage growing endlessly (even after a restart). Only draining and replacing the nodes on which these efs-csi-node pods were running seems to solve the issue.

@mskanth972
Contributor

Hi @headyj, can you install the latest Helm chart version, 2.4.5, and see whether the issue still persists? If yes, can you share the debugging logs?
https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting


headyj commented Jun 12, 2023

I will not be able to test the memory leak issue, as it's basically breaking our environments. What I can see is that, despite the Failed to create registration probe file error, draining and replacing the nodes seems to fix the memory leak issue.

What I can tell you, though, is that I still have the Failed to create registration probe file error with 2.4.5. I tried to set logging_level to DEBUG as explained in the troubleshooting doc, but it doesn't seem to work without restarting the pod (and obviously losing the changes). I also tried to set v=5 in the efs-csi-node daemonset as well as the efs-csi-controller deployment (see the values sketch after these logs for a chart-level way to set verbosity), but there is still not much to see in the logs on either side:

  • efs-csi-controller (csi-provisioner)
W0612 08:58:02.113393       1 feature_gate.go:241] Setting GA feature gate Topology=true. It will be removed in a future release.
I0612 08:58:02.113459       1 feature_gate.go:249] feature gates: &{map[Topology:true]}
I0612 08:58:02.113507       1 csi-provisioner.go:154] Version: v3.5.0
I0612 08:58:02.113533       1 csi-provisioner.go:177] Building kube configs for running in cluster...
I0612 08:58:03.155569       1 common.go:111] Probing CSI driver for readiness
I0612 08:58:03.159033       1 csi-provisioner.go:230] Detected CSI driver efs.csi.aws.com
I0612 08:58:03.161060       1 csi-provisioner.go:302] CSI driver does not support PUBLISH_UNPUBLISH_VOLUME, not watching VolumeAttachments
I0612 08:58:03.161732       1 controller.go:732] Using saving PVs to API server in background
I0612 08:58:03.162378       1 leaderelection.go:245] attempting to acquire leader lease kube-system/efs-csi-aws-com...
  • efs-csi-controller (liveness-probe)
I0612 08:58:02.263863       1 main.go:149] calling CSI driver to discover driver name
I0612 08:58:02.266344       1 main.go:155] CSI driver name: "efs.csi.aws.com"
I0612 08:58:02.266388       1 main.go:183] ServeMux listening at "0.0.0.0:9909"
  • efs-csi-controller (efs-plugin)
I0612 08:58:02.158386       1 config_dir.go:63] Mounted directories do not exist, creating directory at '/etc/amazon/efs'
I0612 08:58:02.160552       1 metadata.go:63] getting MetadataService...
I0612 08:58:02.162319       1 metadata.go:68] retrieving metadata from EC2 metadata service
I0612 08:58:02.163244       1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.eu-west-1.amazonaws.com
I0612 08:58:02.163262       1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0612 08:58:02.163367       1 driver.go:140] Did not find any input tags.
I0612 08:58:02.163544       1 driver.go:113] Registering Node Server
I0612 08:58:02.163553       1 driver.go:115] Registering Controller Server
I0612 08:58:02.163562       1 driver.go:118] Starting efs-utils watchdog
I0612 08:58:02.163706       1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0612 08:58:02.163829       1 efs_watch_dog.go:216] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0612 08:58:02.164879       1 driver.go:124] Starting reaper
I0612 08:58:02.164894       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/var/lib/csi/sockets/pluginproxy/csi.sock", Net:"unix"}
I0612 08:58:03.159309       1 identity.go:37] GetPluginCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0612 08:58:03.160257       1 controller.go:417] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
  • efs-csi-node (liveness-probe)
I0612 09:00:09.421248       1 main.go:149] calling CSI driver to discover driver name
I0612 09:00:09.422425       1 main.go:155] CSI driver name: "efs.csi.aws.com"
I0612 09:00:09.422449       1 main.go:183] ServeMux listening at "0.0.0.0:9809"
  • efs-csi-node (csi-driver-registrar)
I0612 09:00:09.283756       1 main.go:167] Version: v2.8.0
I0612 09:00:09.283823       1 main.go:168] Running node-driver-registrar in mode=registration
I0612 09:00:09.284438       1 main.go:192] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0612 09:00:09.289374       1 main.go:199] Calling CSI driver to discover driver name
I0612 09:00:09.293048       1 node_register.go:53] Starting Registration Server at: /registration/efs.csi.aws.com-reg.sock
I0612 09:00:09.293221       1 node_register.go:62] Registration Server started at: /registration/efs.csi.aws.com-reg.sock
I0612 09:00:09.293453       1 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0612 09:00:09.533955       1 main.go:102] Received GetInfo call: &InfoRequest{}
E0612 09:00:09.534095       1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"
I0612 09:00:09.560524       1 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

  • efs-csi-node (efs-plugin)

I0612 09:00:09.180667       1 config_dir.go:88] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0612 09:00:09.182325       1 metadata.go:63] getting MetadataService...
I0612 09:00:09.184196       1 metadata.go:68] retrieving metadata from EC2 metadata service
I0612 09:00:09.185214       1 cloud.go:137] EFS Client created using the following endpoint: https://elasticfilesystem.eu-west-1.amazonaws.com
I0612 09:00:09.185253       1 driver.go:84] Node Service capability for Get Volume Stats Not enabled
I0612 09:00:09.185345       1 driver.go:140] Did not find any input tags.
I0612 09:00:09.185607       1 driver.go:113] Registering Node Server
I0612 09:00:09.185637       1 driver.go:115] Registering Controller Server
I0612 09:00:09.185650       1 driver.go:118] Starting efs-utils watchdog
I0612 09:00:09.185743       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.conf since it exists already
I0612 09:00:09.185761       1 efs_watch_dog.go:221] Skip copying /etc/amazon/efs/efs-utils.crt since it exists already
I0612 09:00:09.186166       1 driver.go:124] Starting reaper
I0612 09:00:09.186179       1 driver.go:127] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0612 09:00:09.535518       1 node.go:306] NodeGetInfo: called with args
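
For reference, the chart-level way I mean is just a values override (a sketch only; I'm assuming the chart exposes a top-level logLevel value that is wired into the efs-plugin containers' -v flag, so check the chart's values.yaml for the exact key and whether it also covers the sidecars):

logLevel: 5   # assumed chart value; raises klog verbosity instead of hand-editing the daemonset/deployment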

@arnavgup1
Contributor

Hi @headyj, the error is saying that the file system is read-only. Have you checked the security group for the EFS file system you are using and the inbound rules within that security group? The security group needs an inbound rule that accepts NFS traffic. More info on how the file system should be configured is here.

If the inbound rule is configured properly, you can follow this document to change the read-only setting within your EFS file system. More info on the node-driver-registrar is here.


headyj commented Jul 28, 2023

Actually, we have been using this config for almost 3 years now, so I can assure you that the EFS is not read-only and that writes are working on the EFS drives.

Also, it seems that only some nodes are affected by these memory leaks, even though all of them show this error message. That's why it's a bit hard to identify the problem: all the containers from the efs-csi pods (daemonset) have the exact same logs, but only some of them are leaking. For some reason, draining the node and replacing it solves the issue, but this is definitely not something we want to do each time we update the EFS plugin.

My guess is that the pods are never actually stopped on the node. After we update the plugin, all of the pods running on each node are stuck in Terminating, so I have to kill them using --force. But they probably continue to run endlessly on the node.


whereisaaron commented Jul 29, 2023

I've noticed this new error too. The error does say err="mkdir /var/lib/kubelet: read-only file system", so I assume it is the container filesystem that is being written to?

I notice the newer 2.4.4 version of the Helm chart has added readOnlyRootFilesystem: true to most of the containers. This was not present in the earlier 2.4.3 version of the chart. @headyj, try patching the chart to set readOnlyRootFilesystem: false and see if that fixes it for you?

        - name: csi-driver-registrar
 ...
+           securityContext:
+             allowPrivilegeEscalation: false
+             readOnlyRootFilesystem: true

Looks like this commit to the 2.4.4 helm chart may have broken things:
eb6e3ea

Good news is you can override this in the chart deployment values:
eb6e3ea#diff-56338152bc066c1274cc12e455c5d0585a0ce0cb30831547f47a758d2a750862R36-R47

@evheniyt

I have the same issue after updating to the 2.4.9 Helm chart.
Fixed it by setting:

  sidecars:
    nodeDriverRegistrar:
      securityContext:
        readOnlyRootFilesystem: false


mkim37 commented Sep 8, 2023

The registration probe error has been around forever... kubernetes-csi/node-driver-registrar#213. It's also showing for the EKS add-on too.

@alfredkrohmer

With the latest Helm chart I'm getting this:

E0918 16:02:01.703074       1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/efs.csi.aws.com/registration"

It looks like it expects this folder to be mounted from the host, but looking at the volume mounts of this container:

volumeMounts:
- name: plugin-dir
  mountPath: /csi
- name: registration-dir
  mountPath: /registration

This is not mounted from the host, hence it ends up on the container root filesystem, which is configured as read-only.
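
One possible shape of a fix (a sketch only, not what the chart currently ships; the volume name probe-dir and the choice of emptyDir here are my own assumptions) would be to give the registrar a writable mount at the exact path it probes, so the mkdir no longer hits the read-only root filesystem:

# hypothetical addition to the node DaemonSet
containers:
  - name: csi-driver-registrar
    volumeMounts:
      - name: plugin-dir
        mountPath: /csi
      - name: registration-dir
        mountPath: /registration
      - name: probe-dir                # added: writable target for the registration probe file
        mountPath: /var/lib/kubelet/plugins/efs.csi.aws.com/
volumes:
  - name: probe-dir
    emptyDir: {}                       # or a hostPath to the same kubelet plugin directory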

@gcaracuel

The problem is noted in the Kubernetes docs -> https://kubernetes.io/docs/concepts/storage/volumes/#mount-propagation; the warning in that document describes the actual problem.

Only privileged containers will be able to use mountPropagation: "Bidirectional" https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/deploy/kubernetes/base/node-daemonset.yaml#L68 (the same applies to the Helm chart).

So the problem here is that it is trying to propagate that volume from the efs-plugin container to csi-driver-registrar, but the latter is not privileged.

The quicker fix is to make csi-driver-registrar privileged, so for example if you are using Helm you will need:

sidecars:
  nodeDriverRegistrar:
    securityContext:
      privileged: true
      allowPrivilegeEscalation: true

Of course this is still a bug and needs a fix: either csi-driver-registrar should be privileged by default, or the volumeMount should be set explicitly in that container too.

@the-technat

If I understand the comment on the linked issue right, v2.9.0 of the csi-driver-registrar should be able to deal with a read-only root filesystem?

See kubernetes-csi/node-driver-registrar#213 (comment)
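
If that is the case, one way to try it would be to pin the registrar sidecar image in the chart values. A sketch only; I'm assuming the chart exposes image overrides under sidecars.nodeDriverRegistrar (check the chart's values.yaml for the exact keys):

sidecars:
  nodeDriverRegistrar:
    image:
      repository: registry.k8s.io/sig-storage/csi-node-driver-registrar   # assumed key layout
      tag: v2.9.0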


jiangfwa commented Feb 8, 2024


We are using version v1.5.4 of the efs-csi-driver, and it also has the memory leak issue: no matter how much memory I give it, some of the efs-csi-node pods get OOMKilled, even after I changed readOnlyRootFilesystem to false.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 8, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 7, 2024