Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Failed to create registration probe file" with readOnlyRootFilesystem #213

Closed
ConnorJC3 opened this issue Aug 4, 2022 · 18 comments · Fixed by #214
Closed

"Failed to create registration probe file" with readOnlyRootFilesystem #213

ConnorJC3 opened this issue Aug 4, 2022 · 18 comments · Fixed by #214

Comments

@ConnorJC3
Copy link

When using readOnlyRootFilesystem: true the node-driver-registrar tries to recursively create the directory the probe file is in, leading to an error:

E0804 18:06:27.375117       1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/ebs.csi.aws.com/registration"

In this scenario /var/lib/kubelet/plugins/ebs.csi.aws.com/ is a hostPath mounted into the container (and registration works just fine), but node-driver-registrar tries to create /var/lib/kubelet/ despite the fact it already exists and fails, never creating the probe file.

Semi-related: see kubernetes-sigs/aws-ebs-csi-driver#1333 (comment)

@mauriciopoppe
Copy link
Member

Thanks for the report, this mode https://github.com/kubernetes-csi/node-driver-registrar#health-check-with-an-exec-probe exists as a workaround to an issue related with the registration process getting stuck. Fortunately, I believe that the issue with the registration process getting stuck was fixed in kubernetes/kubernetes#110075 and the workaround shouldn't be needed anymore.

For the issue that you raised it's kinda hard to fix it because the workaround relies on creating a tmp file whose existence is checked regularly to restart the Pod and the file is always getting created.

A fix that would be backwards compatible is:

  • assume that the file is created by default
  • introduce a new flag to disable creating the file
  • use the new flag to disable creating the file in your impl

@mauriciopoppe
Copy link
Member

I think we can also use the fact that we didn't introduce any flags to enable/disable the feature, another idea would be to deprecate the exec probe and remove the code that creates the file by default from the codebase in a future release.

@ConnorJC3
Copy link
Author

ConnorJC3 commented Aug 4, 2022

@mauriciopoppe creating the file is fine, because we are passing --kubelet-registration-path=/var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock, so it should try to create the file at /var/lib/kubelet/plugins/ebs.csi.aws.com/registration, which is writable because it is mounted from the host (and that will likely always be true?)

The problem is that on the way to create /var/lib/kubelet/plugins/ebs.csi.aws.com/registration it tries to run mkdir /var/lib/kubelet. This is both unnecessary (as that directory already exists), and fails with a "read only filesystem" error causing node-driver-registrar to give up on creating the probe file.

It's not actually affecting us, because we don't use the probe mode anyways, but I figured I'd report the bug.

That said, if you think this code is totally unnecessary, removing it would also be a "fix" for this issue.

@mauriciopoppe
Copy link
Member

@ConnorJC3 thanks for pointing that out! I think I can remove these lines safely

err := os.MkdirAll(filepath.Dir(filePath), 0755)
if err != nil {
return err
}

@mauriciopoppe
Copy link
Member

I think that readOnlyRootFilesystem will block the creation of the registration file, the error you see is with mkdir but I think that you'll see the same error later when the actual registration file is created even if I remove these lines.

@mauriciopoppe
Copy link
Member

On the CSI Driver side, is the unix:///var/lib/csi/sockets/pluginproxy/csi.sock created with readOnlyRootFilesystem just fine? It be good to know if this line where the socket is created is failing too: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/5510cd8e70e098c15bc3c2a2e075298793ce0750/pkg/driver/driver.go#L112

@ConnorJC3
Copy link
Author

ConnorJC3 commented Aug 5, 2022

/var/lib/csi/sockets/pluginproxy/ is a volume mounted in the CSI driver, so the readOnlyRootFilesystem doesn't affect it. This example manifest from my PR on the EBS CSI Driver has been tested and works: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/64bae337ed98d9e47564b163fb3fd6d892b663f7/deploy/kubernetes/base/node.yaml#L50-L88

The same should be true for node-driver-registrar, /var/lib/kubelet/plugins/ebs.csi.aws.com/ is a mounted volume so creating a file inside it should work just fine as readOnlyRootFilesystem only affects the root filesystem (as the name would suggest), thus volumes are not affected.

@mauriciopoppe
Copy link
Member

#214 removes the lines that attempted to create a directory and now we only create a file in the mounted volume, it should be available in the next release

@mauriciopoppe
Copy link
Member

Reverting in #247 (comment), I think node-driver-registrar needs to create the /var/lib/kubelet/plugins/<drivername.example.com>/ which is different from the mounted volume /var/lib/kubelet/plugins/<drivername.example.com>/ to /csi.

@mauriciopoppe mauriciopoppe reopened this Nov 21, 2022
@ConnorJC3
Copy link
Author

ConnorJC3 commented Nov 23, 2022

Rather than not attempting to create the directory at all, could node-driver-registrar walk up the tree and only create the parts which are necessary?

It seems like it already has a loop to create the directories in sequence and was just relying on the fact this would silently fail if the directory already existed - so instead of relying on that could it check if the directory exists first?

Edit: I see you're using MkdirAll which has this weird behavior, but it would be simple to implement our own version that behaves more sanely. I would be willing to PR if it would be considered.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 21, 2023
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 23, 2023
@ConnorJC3
Copy link
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 23, 2023
@yogeek
Copy link

yogeek commented Mar 29, 2023

Hi @mauriciopoppe
I just installed aws-ebs-csi-driver on our k8s 1.22 to prepare for the CSI migration needed to upgrade to 1.23 and the node-driver-registrar logs still have this error

As the #214 fix was reverted, what will be the next step to fix this please ?
As I am not very familiar with this component, is this issue is preventing the driver to work properly ?

(I used the helm chart aws-ebs-csi-driver-2.17.2 which installed the 1.17.0 app version including the public.ecr.aws/eks-distro/kubernetes-csi/node-driver-registrar:v2.7.0-eks-1-26-latest image)

UPDATE : it does not prevent the driver to work, I just tested on a 1.22 cluster and the volumes are created successfully

@mauriciopoppe
Copy link
Member

@yogeek I tried to also make node-driver-registrar compatible with readOnlyRootFilesystem: true in #214 but it created other problems so I had to roll it back.

The latest versions of node-driver-registrar should have the issue in #213 (comment) which only happens if the Deployment has readOnlyRootFilesystem: true but it shouldn't block the deployment of a CSI Driver if you don't need this feature.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2023
@joulaud
Copy link

joulaud commented Oct 25, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 25, 2023
@mauriciopoppe
Copy link
Member

@mattcary helped remove the code that was creating a temp file in #316 (see also the attached issues in the description) which was released in v2.9.0. You should be able to use readOnlyRootFilesystem: true without any errors with that version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants