-
Notifications
You must be signed in to change notification settings - Fork 40.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better messaging for missing volume binaries on host #36280
Conversation
@@ -77,6 +77,10 @@ type Attributes struct { | |||
type Mounter interface { | |||
// Uses Interface to provide the path for Docker binds. | |||
Volume | |||
|
|||
//CanMount returns if the volume can be mounted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space at the start of comments.
Also, I recommend being more descriptive on interface comments so that those trying to implement the interface quickly understand everything they need to know to implement the method:
CanMount is called immediately prior to SetUp to check if the required components (binaries, etc.) are available on the underlying node to complete the subsequent SetUp (mount) operation. If CanMount returns false, the mount operation is aborted and an event is generated indicating that the node does not have the required binaries to complete mount. If CanMount returns true, the mount operation continues normally. The CanMount check can be enabled or disabled using the experimental-check-mount-binaries binary flag.
nit: Personal preference, I like to maintain a column width of 80.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
@@ -177,6 +177,10 @@ func (b *vsphereVolumeMounter) SetUp(fsGroup *int64) error { | |||
return b.SetUpAt(b.GetPath(), fsGroup) | |||
} | |||
|
|||
func (b *vsphereVolumeMounter) CanMount() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comments to public methods. Even if the existing code isn't great, make the new code you add better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -119,14 +118,17 @@ type OperationExecutor interface { | |||
func NewOperationExecutor( | |||
kubeClient internalclientset.Interface, | |||
volumePluginMgr *volume.VolumePluginMgr, | |||
recorder record.EventRecorder) OperationExecutor { | |||
recorder record.EventRecorder, | |||
checkBinariesBeforeMount bool) OperationExecutor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming sucks but I'm going to recommend checkNodeCapabilitiesBeforeMount
instead of checkBinariesBeforeMount
in all the files. It is more verbose, but it is less likely to become inaccurate: the checks you have at the moment only look for binaries, but other plugins may want to check for kernel modules or something else in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -877,6 +882,12 @@ func (oe *operationExecutor) generateMountVolumeFunc( | |||
} | |||
} | |||
|
|||
if oe.checkBinariesBeforeMount && !volumeMounter.CanMount() { | |||
oe.recorder.Eventf(volumeToMount.Pod, api.EventTypeWarning, kevents.FailedMountVolume, "Unable to mount volume %v (spec.Name: %v) on pod %v (UID: %v). Binary required for mounting, does not exist on the node machine. Verify that your node machine has the required binaries before attempting to mount this volume type", volumeToMount.VolumeName, volumeToMount.VolumeSpec.Name(), volumeToMount.Pod.Name, volumeToMount.Pod.UID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of duplicating the string 3 times, create it once, and use the variable, so that it is easy to modify and keep consistent:
errMsg := fmt.Stringf("...", ...)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an easy one. My bad.
@@ -77,6 +77,10 @@ type Attributes struct { | |||
type Mounter interface { | |||
// Uses Interface to provide the path for Docker binds. | |||
Volume | |||
|
|||
//CanMount returns if the volume can be mounted | |||
CanMount() bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of returning a bool, how about returning an error. So that if the check fails plugins can return an error with a message for what exactly is missing. e.g. Required binary /sbin/mount.nfs4 is missing.
The caller, operation executor, can then use this message in the user visible error to provide more actionable information.
(Update the comment accordingly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -207,6 +207,11 @@ func (b *glusterfsMounter) GetAttributes() volume.Attributes { | |||
} | |||
} | |||
|
|||
func (b *glusterfsMounter) CanMount() bool { | |||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also implement the check for Gluster. Check with @jingxu97 what the required client binaries are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will get this change out in a separate PR.
// This flag if set checks for mount binaries on the host | ||
// If the binaries are not present, the mount is failed with | ||
// a specific error message | ||
ExperimentalCheckMountBinariesType bool `json: "experimentalCheckMountBinariesType,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
This flag, if set, enables a check prior to mount operations to verify that the required components (binaries, etc.) to mount the volume are available on the underlying node. If the check is enabled and fails the mount operation fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto on options.go and other types.go file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -249,6 +249,7 @@ func (s *KubeletServer) AddFlags(fs *pflag.FlagSet) { | |||
fs.BoolVar(&s.ProtectKernelDefaults, "protect-kernel-defaults", s.ProtectKernelDefaults, "Default kubelet behaviour for kernel tuning. If set, kubelet errors if any of kernel tunables is different than kubelet defaults.") | |||
|
|||
// Hidden flags for experimental features that are still under development. | |||
fs.BoolVar(&s.ExperimentalCheckMountBinariesType, "experimental-check-mount-binaries", s.ExperimentalCheckMountBinariesType, "[Experimental] if set true, the kubelet will check for binaries on the host before performing the mount") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you have to add the new flag to hack/verify-flags/known-flags.txt
or the verify-flags-underscore.py
will fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -729,7 +732,8 @@ func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *Kub | |||
klet.containerRuntime, | |||
kubeDeps.Mounter, | |||
klet.getPodsDir(), | |||
kubeDeps.Recorder) | |||
kubeDeps.Recorder, | |||
kubeCfg.ExperimentalCheckMountBinariesType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is how you can implement the "put the pod in a failed state" functionality:
- In the volumeManager, introduce a new custom error,
NonRecoverableVolumeError
.- See
AlreadyExistsError
in goroutinemap, for example.
- See
- Have operation executor return the new
NonRecoverableVolumeError
instead of a standarderror
when theCanMount()
check fails. - In this file, modify
volumeManager.WaitForAttachAndMount
to handle the newvolumemanager.NonRecoverableVolumeError
and when it is detected, in addition to the usual error handling behavior, set the pod status to failed.- See
rejectPod()
method in this file for example.
- See
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will get this change out in a separate PR
Jenkins GCE Node e2e failed for commit 0b844b251315fde32837d683449a3f619da29dfa. Full PR test history. The magic incantation to run this job again is |
@saad-ali PTAL |
5b05da3
to
f35c506
Compare
@@ -393,6 +394,72 @@ var _ = framework.KubeDescribe("Volumes [Feature:Volumes]", func() { | |||
// Must match content of test/images/volumes-tester/nfs/index.html | |||
testVolumeClient(cs, config, volume, nil, "Hello from NFS!") | |||
}) | |||
|
|||
It("should be mountable after checking mount binaries", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment about what this test does and what it expects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also how is this test different from the standard NFS test above and how did you verify that it is working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is the same as the previous one and I will remove it.
Unnecessary for the current one.
@@ -371,6 +373,9 @@ type operationExecutor struct { | |||
|
|||
// recorder is used to record events in the API server | |||
recorder record.EventRecorder | |||
|
|||
//This flag if set checks the binaries before mount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space at the start of comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkNodeCapabilitiesBeforeMount, if set, enables the CanMount check, which verifies that the components (binaries, etc.) required to mount the volume are available on the underlying node before attempting mount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -877,6 +882,15 @@ func (oe *operationExecutor) generateMountVolumeFunc( | |||
} | |||
} | |||
|
|||
canMountErr := volumeMounter.CanMount() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Let's not even call CanMount()
if checkNodeCapabilitiesBeforeMount == false
. It doesn't make much of a difference, but in the case that there is a bug in a CanMount()
implementation that causes a panic, for example, checkNodeCapabilitiesBeforeMount
can then be used to completely avoid the check (vs the way it is CanMount()
would always be called).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had that in mind previously but didn't change it because it affects the readability of code. (It's not immediately visible if the CanMount() is called in the second statement of the if block
and for now since it's unlikely to cause a panic with only the nfs implementation.)
But in the future some other implementation of CanMount() can cause a panic, so I'll move it.
canMountErr := volumeMounter.CanMount() | ||
|
||
if oe.checkNodeCapabilitiesBeforeMount && canMountErr != nil { | ||
errMsg := fmt.Sprintf("Unable to mount volume %v (spec.Name: %v) on pod %v (UID: %v). %s", volumeToMount.VolumeName, volumeToMount.VolumeSpec.Name(), volumeToMount.Pod.Name, volumeToMount.Pod.UID, canMountErr.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put the bulk of the common message here, and have plugins only return only the description of what exactly is missing:
Unable to mount volume %v (spec.Name: %v) on pod %v (UID: %v). Components required for mounting do not exist on the node machine: %v. Verify that your node machine has the required components before attempting to mount this volume type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -326,6 +326,10 @@ func (_ *FakeVolume) GetAttributes() Attributes { | |||
} | |||
} | |||
|
|||
func (fv *FakeVolume) CanMount() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be updated for new interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// If not, it returns an error | ||
func (nfsMounter *nfsMounter) CanMount() error { | ||
exe := exec.New() | ||
missingBinaryMsg := "Binary required for mounting, does not exist on the node machine. Verify that your node machine has the required binaries before attempting to mount this volume type" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the missingBinaryMsg
here, and have the plugins only return exactly what is missing. e.g. /sbin/mount_nfs is missing
(since it will otherwise be duplicated in each plugin).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent behind this message is that it would change with each plugin and I wanted to make this nfs specific (since it has the word binaries in it).
Like you mentioned in the previous comment that for different plugins, the mount capabilities might not necessarily depend on binaries, each plugin would describe what is missing and based on the error, there would a specific message as to which binary was missing.
I have changed it now.
@@ -323,7 +323,7 @@ func Test_Run_Positive_VolumeUnmountControllerAttachEnabled(t *testing.T) { | |||
asw := cache.NewActualStateOfWorld(nodeName, volumePluginMgr) | |||
kubeClient := createTestClient() | |||
fakeRecorder := &record.FakeRecorder{} | |||
oex := operationexecutor.NewOperationExecutor(kubeClient, volumePluginMgr, fakeRecorder) | |||
oex := operationexecutor.NewOperationExecutor(kubeClient, volumePluginMgr, fakeRecorder, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: when it is not obvious what a variable is that you are passing, stick the parameter name as a comment next to it. e.g.:
NewOperationExecutor(kubeClient, volumePluginMgr, fakeRecorder, false /* checkNodeCapabilitiesBeforeMount */)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -719,6 +719,9 @@ func NewMainKubelet(kubeCfg *componentconfig.KubeletConfiguration, kubeDeps *Kub | |||
return nil, err | |||
} | |||
|
|||
if len(kubeCfg.ExperimentalMounterPath) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add a comment explaining what this is doing and why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
e941460
to
a6de94f
Compare
Jenkins GCI GKE smoke e2e failed for commit e94146089fc98d21b1125ac7f59475f031e378bd. Full PR test history. The magic incantation to run this job again is |
02dce93
to
1a94fdf
Compare
1a94fdf
to
ae2598b
Compare
Jenkins GCE e2e failed for commit ae2598b7370794455a3b8a4b4ab41dd811b2e940. Full PR test history. The magic incantation to run this job again is |
Jenkins Kubemark GCE e2e failed for commit ae2598b7370794455a3b8a4b4ab41dd811b2e940. Full PR test history. The magic incantation to run this job again is |
Jenkins GCI GCE e2e failed for commit ae2598b7370794455a3b8a4b4ab41dd811b2e940. Full PR test history. The magic incantation to run this job again is |
Jenkins unit/integration failed for commit 70f250d42dfcd4a16e9cda597491e388fffe8c8e. Full PR test history. The magic incantation to run this job again is |
6390868
to
ad6bfa4
Compare
Jenkins GKE smoke e2e failed for commit ad6bfa4d01b8c2fc3b91e50b0b3dbcc39020315e. Full PR test history. The magic incantation to run this job again is |
ad6bfa4
to
d81e216
Compare
Jenkins GCE etcd3 e2e failed for commit ad6bfa4d01b8c2fc3b91e50b0b3dbcc39020315e. Full PR test history. The magic incantation to run this job again is |
Jenkins verification failed for commit d81e216. Full PR test history. The magic incantation to run this job again is |
@k8s-bot verify test this |
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue |
Automatic merge from submit-queue Implement CanMount() for gfsMounter for linux **What this PR does / why we need it**: To implement CanMount() check for glusterfs. If mount binaries are not present on the underlying node, the mount will not proceed and return an error message stating so. Related to issue : #36098 Related to similar change for NFS : #36280 **Release note**: `Check binaries for GlusterFS on the underlying node before doing mount` Sample output from testing in GCE/GCI: rkouj@rkouj0:~/go/src/k8s.io/kubernetes$ kubectl describe pods Name: glusterfs Namespace: default Node: e2e-test-rkouj-minion-group-kjq3/10.240.0.3 Start Time: Fri, 11 Nov 2016 17:22:04 -0800 Labels: <none> Status: Pending IP: Controllers: <none> Containers: glusterfs: Container ID: Image: gcr.io/google_containers/busybox Image ID: Port: QoS Tier: cpu: Burstable memory: BestEffort Requests: cpu: 100m State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Environment Variables: Conditions: Type Status Initialized True Ready False PodScheduled True Volumes: glusterfs: Type: Glusterfs (a Glusterfs mount on the host that shares a pod's lifetime) EndpointsName: glusterfs-cluster Path: kube_vol ReadOnly: true default-token-2zcao: Type: Secret (a volume populated by a Secret) SecretName: default-token-2zcao Events: FirstSeen LastSeen Count From SubobjectPath Type Reason Message --------- -------- ----- ---- ------------- -------- ------ ------- 8s 8s 1 {default-scheduler } Normal Scheduled Successfully assigned glusterfs to e2e-test-rkouj-minion-group-kjq3 7s 4s 4 {kubelet e2e-test-rkouj-minion-group-kjq3} Warning FailedMount Unable to mount volume kubernetes.io/glusterfs/6bb04587-a876-11e6-a712-42010af00002-glusterfs (spec.Name: glusterfs) on pod glusterfs (UID: 6bb04587-a876-11e6-a712-42010af00002). Verify that your node machine has the required components before attempting to mount this volume type. Required binary /sbin/mount.glusterfs is missing
@rkouj there are a lot of conflicts with this cherry-pick would you mind opening it since you are more familiar with the code? |
@jessfraz, @rkouj plans to prepare a cherry pick as soon as @jingxu97's mounter changes are in 1.4 (this PR depends on that). If @jingxu97's changes look like they may miss 1.4.7 (for any reason), then @rkouj will cherry pick this PR anyway (working around the dependency on @jingxu97's mounter PR). This PR must be cherry-picked along with #36686 |
cool looks like the mounter stuff is in so ping @rkouj for cherry-pick :) |
…upstream-release-1.4 Automated cherry pick of #36280
Commit found in the "release-1.4" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked. |
Automatic merge from submit-queue Better messaging for missing volume binaries on host **What this PR does / why we need it**: When mount binaries are not present on a host, the error returned is a generic one. This change is to check the mount binaries before the mount and return a user-friendly error message. This change is specific to GCI and the flag is experimental now. kubernetes#36098 **Release note**: Introduces a flag `check-node-capabilities-before-mount` which if set, enables a check (`CanMount()`) prior to mount operations to verify that the required components (binaries, etc.) to mount the volume are available on the underlying node. If the check is enabled and `CanMount()` returns an error, the mount operation fails. Implements the `CanMount()` check for NFS. Sample output post change : rkouj@rkouj0:~/go/src/k8s.io/kubernetes$ kubectl describe pods Name: sleepyrc-fzhyl Namespace: default Node: e2e-test-rkouj-minion-group-oxxa/10.240.0.3 Start Time: Mon, 07 Nov 2016 21:28:36 -0800 Labels: name=sleepy Status: Pending IP: Controllers: ReplicationController/sleepyrc Containers: sleepycontainer1: Container ID: Image: gcr.io/google_containers/busybox Image ID: Port: Command: sleep 6000 QoS Tier: cpu: Burstable memory: BestEffort Requests: cpu: 100m State: Waiting Reason: ContainerCreating Ready: False Restart Count: 0 Environment Variables: Conditions: Type Status Initialized True Ready False PodScheduled True Volumes: data: Type: NFS (an NFS mount that lasts the lifetime of a pod) Server: 127.0.0.1 Path: /export ReadOnly: false default-token-d13tj: Type: Secret (a volume populated by a Secret) SecretName: default-token-d13tj Events: FirstSeen LastSeen Count From SubobjectPath Type Reason Message --------- -------- ----- ---- ------------- -------- ------ ------- 7s 7s 1 {default-scheduler } Normal Scheduled Successfully assigned sleepyrc-fzhyl to e2e-test-rkouj-minion-group-oxxa 6s 3s 4 {kubelet e2e-test-rkouj-minion-group-oxxa} Warning FailedMount Unable to mount volume kubernetes.io/nfs/32c7ef16-a574-11e6-813d-42010af00002-data (spec.Name: data) on pod sleepyrc-fzhyl (UID: 32c7ef16-a574-11e6-813d-42010af00002). Verify that your node machine has the required components before attempting to mount this volume type. Required binary /sbin/mount.nfs is missing
What this PR does / why we need it:
When mount binaries are not present on a host, the error returned is a generic one.
This change is to check the mount binaries before the mount and return a user-friendly error message.
This change is specific to GCI and the flag is experimental now.
#36098
Release note:
Introduces a flag
check-node-capabilities-before-mount
which if set, enables a check (CanMount()
) prior to mount operations to verify that the required components (binaries, etc.) to mount the volume are available on the underlying node. If the check is enabled andCanMount()
returns an error, the mount operation fails. Implements theCanMount()
check for NFS.This change is
Sample output post change :
rkouj@rkouj0:~/go/src/k8s.io/kubernetes$ kubectl describe pods
Name: sleepyrc-fzhyl
Namespace: default
Node: e2e-test-rkouj-minion-group-oxxa/10.240.0.3
Start Time: Mon, 07 Nov 2016 21:28:36 -0800
Labels: name=sleepy
Status: Pending
IP:
Controllers: ReplicationController/sleepyrc
Containers:
sleepycontainer1:
Container ID:
Image: gcr.io/google_containers/busybox
Image ID:
Port:
Command:
sleep
6000
QoS Tier:
cpu: Burstable
memory: BestEffort
Requests:
cpu: 100m
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment Variables:
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
data:
Type: NFS (an NFS mount that lasts the lifetime of a pod)
Server: 127.0.0.1
Path: /export
ReadOnly: false
default-token-d13tj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-d13tj
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
7s 7s 1 {default-scheduler } Normal Scheduled Successfully assigned sleepyrc-fzhyl to e2e-test-rkouj-minion-group-oxxa
6s 3s 4 {kubelet e2e-test-rkouj-minion-group-oxxa} Warning FailedMount Unable to mount volume kubernetes.io/nfs/32c7ef16-a574-11e6-813d-42010af00002-data (spec.Name: data) on pod sleepyrc-fzhyl (UID: 32c7ef16-a574-11e6-813d-42010af00002). Verify that your node machine has the required components before attempting to mount this volume type. Required binary /sbin/mount.nfs is missing