Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mode kubelet-registration-probe is broken #244

Closed
tobiasgiese opened this issue Nov 17, 2022 · 20 comments
Closed

Mode kubelet-registration-probe is broken #244

tobiasgiese opened this issue Nov 17, 2022 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@tobiasgiese
Copy link

tobiasgiese commented Nov 17, 2022

It seems that --mode=kubelet-registration-probe is broken.
Pods that configured a livenessProbe according to the recommendations fail with the following error

main.go:107] "Failed to create registration probe file" err="open /var/lib/kubelet/plugins/smb.csi.k8s.io/registration: no such file or directory" registrationProbePath="/var/lib/kubelet/plugins/smb.csi.k8s.io/registration"

The registration path and file is hard coded in the main.go.

kubeletRegistrationPathDir := filepath.Dir(*kubeletRegistrationPath)
registrationProbePath = filepath.Join(kubeletRegistrationPathDir, "registration")
// with the mode kubelet-registration-probe
if modeIsKubeletRegistrationProbe() {
lockfileExists, err := util.DoesFileExist(registrationProbePath)

But the registration file is in another location and also named differently.
pluginRegistrationPath is usually /var/lib/kubelet/plugins_registry (see the kubernetes-csi driver-volume-mounts docs)

func buildSocketPath(csiDriverName string) string {
return fmt.Sprintf("%s/%s-reg.sock", *pluginRegistrationPath, csiDriverName)
}

where nodeRegister is using buildSocketPath

func nodeRegister(csiDriverName, httpEndpoint string) {
// When kubeletRegistrationPath is specified then driver-registrar ONLY acts
// as gRPC server which replies to registration requests initiated by kubelet's
// plugins watcher infrastructure. Node labeling is done by kubelet's csi code.
registrar := newRegistrationServer(csiDriverName, *kubeletRegistrationPath, supportedVersions)
socketPath := buildSocketPath(csiDriverName)

Tobias Giese [email protected], Mercedes-Benz Tech Innovation GmbH, legal info/Impressum

@tobiasgiese
Copy link
Author

If it's accepted as a bug I want to work on this 🙂

@mauriciopoppe
Copy link
Member

mauriciopoppe commented Nov 17, 2022

We added that mode as a workaround for Windows, see #143 and kubernetes/kubernetes#104584 for more info. The issue was fixed upstream in Kubernetes in 1.24 so if you're using that version you don't need the workaround. I think it's worth adding a note like this in the README about not using this flag anymore.

@tobiasgiese
Copy link
Author

We‘re already using Kubernetes v1.24.8 and the issue still persists

@mauriciopoppe
Copy link
Member

Sorry I think kubernetes/kubernetes#104584 was fixed in 1.25

@tobiasgiese
Copy link
Author

Ah okay, thanks for the update. So I think we don‘t need this fix 👍🏻

@tobiasgiese
Copy link
Author

Just to get this right: you mean that this probe will be fixed in v1.25 for Windows, right? Because we are facing this issue with Linux. According to the docs the exec livenessProbe should succeed and it does not. That‘s why I‘ve added this issue

@mauriciopoppe
Copy link
Member

The issue #143 only happened in Windows, we added the kubelet-registration-probe as a workaround that worked in both Linux and Windows (although we only saw the issue in Windows), later the root cause of #143 was found in kubernetes/kubernetes#104584 and was fixed in kubernetes upstream, with that fix we no longer need a livenessprobe

@aramase
Copy link
Contributor

aramase commented Nov 17, 2022

@mauriciopoppe Removing the livenessprobe check will ensure the pod doesn't restart but that error message is still concerning. If an operator looks at the container logs while debugging, that error message could be misleading even if it isn't actually used.

IMO, we should fix the behavior as there are folks still running the older k8s versions and this error will cause the pod to fail livenessprobe check. When the kubelet-registration-probe is deprecated, I think it's fine to completely remove it and force CSI driver authors to remove it from the container spec.

@mauriciopoppe
Copy link
Member

@aramase thanks for linking to this bug, I think that the kubelet-registration-probe was working with Azurefile and Azuredisk successfully so it might be a new bug, wondering if it's related with #214

There's a release note that I added with this change, wondering if this should be done in the manifests in the volume definitions (to create the directories if they don't exist beforehand)

The directories in `kubeletRegistrationPath` are assumed to be already created and are no longer created by node-driver-registrar

@aramase
Copy link
Contributor

aramase commented Nov 17, 2022

@mauriciopoppe So we create the host path directory if it doesn't exist: https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/main/deploy/secrets-store-csi-driver.yaml#L128-L131.

This host path is mounted in the node-driver-registrar path as /csi: https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/main/deploy/secrets-store-csi-driver.yaml#L35-L36.

The path used for the registration file is determined using --kubelet-registration-path which is set https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/main/deploy/secrets-store-csi-driver.yaml#L24 to /var/lib/kubelet/plugins/csi-secrets-store/csi.sock. That path is used by kubelet but doesn't exist inside the node-driver-registrar container. This config is similar to azuredisk and azurefile.

@tobiasgiese
Copy link
Author

The issue #143 only happened in Windows, we added the kubelet-registration-probe as a workaround that worked in both Linux and Windows (although we only saw the issue in Windows), later the root cause of #143 was found in kubernetes/kubernetes#104584 and was fixed in kubernetes upstream, with that fix we no longer need a livenessprobe

But the workaround does not work for Linux currently. As this was only added for Windows we can simply remove the docs for Linux, right? This is a bit misleading imo.
Btw, we're facing this issue with Kubernetes v1.24.8 and the node-driver-registrar v2.6.0 and v2.6.1. With v2.5.1 everything works as expected.

@zestysoft
Copy link

zestysoft commented Nov 21, 2022

+1 -- same setup as @tobiasgiese

@mauriciopoppe
Copy link
Member

Thank for the additional feedback @aramase & @tobiasgiese, I think that #214 is the bug, I'm going to revert it and analyze the issue in more detail later

@mauriciopoppe
Copy link
Member

/kind bug

@mauriciopoppe
Copy link
Member

versions v2.7.0 and v2.6.2 shouldn't have this issue after #247

@james-callahan
Copy link

versions v2.7.0 and v2.6.2 shouldn't have this issue after #247

This still appears to be broken in 2.7.0.

E0222 00:24:24.888380 1 main.go:107] "Failed to create registration probe file" err="mkdir /var/lib/kubelet: read-only file system" registrationProbePath="/var/lib/kubelet/plugins/ebs.csi.aws.com/registration"

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2023
@mauriciopoppe
Copy link
Member

/remove-lifecycle stale

@james-callahan sorry for getting back so late, I saw the same error when setting spec.securityContext.readOnlyRootFilesystem: true in the Pod spec, maybe your setup has this flag.

Anyways the failure message won't affect the functionality of node-driver-registrar, as part of the workaround in #152 I assume that node-driver-registrar can write a file to the root filesystem, this file which can't be created was used through the livenessProbe and node-driver-registrar running in --mode=kubelet-registration-probe mode, if you're not using the probe (you shouldn't be using it because it was a workaround for a race that was fixed) then you shouldn't have the issue and can safely ignore the message. It's a very misleading error message though 😔.

@aramase also wrote:

@mauriciopoppe Removing the livenessprobe check will ensure the pod doesn't restart but that error message is still concerning. If an operator looks at the container logs while debugging, that error message could be misleading even if it isn't actually used.

I totally agree but to remove the creation of the file and the entire workaround safely I think we need to deprecate the feature first, I created #309 to track this change.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2023
@james-callahan
Copy link

@james-callahan sorry for getting back so late, I saw the same error when setting spec.securityContext.readOnlyRootFilesystem: true in the Pod spec, maybe your setup has this flag.

I think the issue may have gone away in 2.8.0.

@mauriciopoppe
Copy link
Member

I looked at the diff between these two releases v2.6.3...v2.8.0 and I see #245 which is related to a success message being printed only in the success scenario, so in a readOnly FS you should still be seeing this error.

My setup is with node-driver-registrar:v2.8.0 with the Pod spec having spec.securityContext.readOnlyRootFilesystem: true. I think I have a deprecation plan for the flag which I'll add in detail in #309.

I'm going to close this issue because node-driver-registrar is no longer broken.

andyzhangx added a commit to andyzhangx/node-driver-registrar that referenced this issue Dec 16, 2023
b54c1ba4 Merge pull request kubernetes-csi#246 from xing-yang/go_1.21
5436c81e Change go version to 1.21.5
267b40e9 Merge pull request kubernetes-csi#244 from carlory/sig-storage
b42e5a2d nominate self (carlory) as kubernetes-csi reviewer
a17f536f Merge pull request kubernetes-csi#210 from sunnylovestiramisu/sidecar
011033de Use set -x instead of die
5deaf667 Add wrapper script for sidecar release

git-subtree-dir: release-tools
git-subtree-split: b54c1ba49469d4d5d1b5d75285e8868ffe3d328f
tyuchn added a commit to tyuchn/node-driver-registrar that referenced this issue Mar 20, 2024
dc4d0ae2 Merge pull request kubernetes-csi#249 from jsafrane/use-go-version
e681b170 Use .go-version to get Kubernetes go version
b54c1ba4 Merge pull request kubernetes-csi#246 from xing-yang/go_1.21
5436c81e Change go version to 1.21.5
267b40e9 Merge pull request kubernetes-csi#244 from carlory/sig-storage
b42e5a2d nominate self (carlory) as kubernetes-csi reviewer
a17f536f Merge pull request kubernetes-csi#210 from sunnylovestiramisu/sidecar
011033de Use set -x instead of die
5deaf667 Add wrapper script for sidecar release
f8c8cc4c Merge pull request kubernetes-csi#237 from msau42/prow
b36b5bfd Merge pull request kubernetes-csi#240 from dannawang0221/upgrade-go-version
adfddcc9 Merge pull request kubernetes-csi#243 from pohly/git-subtree-pull-fix
c4650889 pull-test.sh: avoid "git subtree pull" error
7b175a1e Update csi-test version to v5.2.0
987c90cc Update go version to 1.21 to match k/k
2c625d41 Add script to generate patch release notes
f9d5b9c0 Merge pull request kubernetes-csi#236 from mowangdk/feature/bump_csi-driver-host-path_version
b01fd537 Bump csi-driver-host-path version up to v1.12.0

git-subtree-dir: release-tools
git-subtree-split: dc4d0ae20a3dcce17fbfc745fb1f1e3b10cd9644
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

7 participants