We have a GKE cluster on which we have deployed Capsule, installed via Helm using this values.yaml file. But once the cluster is scaled down to 0 nodes and we then scale back up, all the new nodes go to NotReady state. Even a node brought up by the autoscaler goes NotReady as well. We observed the errors below when describing a node.
:~$ kubectl describe node gke-cluster-z8z9
Name: gke-cluster-7d487589--z8z9
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=n1-standard-4
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-c
kubernetes.io/arch=amd64
kubernetes.io/hostname=gke-cluster-7d487589-z8z9
kubernetes.io/os=linux
node.kubernetes.io/instance-type=n1-standard-4
topology.gke.io/zone=us-central1-c
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-c
Annotations: csi.volume.kubernetes.io/nodeid:
{"pd.csi.storage.gke.io":"projects/xxxxx/zones/us-central1-c/instances/gke-cluster-8z9"}
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 23 Jun 2022 06:35:09 +0000
Taints: node.kubernetes.io/not-ready:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: gke-cluster-z8z9
AcquireTime: <unset>
RenewTime: Thu, 23 Jun 2022 09:43:52 +0000
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
CorruptDockerOverlay2 False Thu, 23 Jun 2022 09:40:36 +0000 Thu, 23 Jun 2022 06:35:14 +0000 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
FrequentUnregisterNetDevice False Thu, 23 Jun 2022 09:40:36 +0000 Thu, 23 Jun 2022 06:35:14 +0000 NoFrequentUnregisterNetDevice node is functioning properly
KernelDeadlock False Thu, 23 Jun 2022 09:40:36 +0000 Thu, 23 Jun 2022 06:35:14 +0000 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Thu, 23 Jun 2022 09:40:36 +0000 Thu, 23 Jun 2022 06:35:14 +0000 FilesystemIsNotReadOnly Filesystem is not read-only
FrequentKubeletRestart False Thu, 23 Jun 2022 09:40:36 +0000 Thu, 23 Jun 2022 06:35:14 +0000 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Thu, 23 Jun 2022 09:40:36 +0000 Thu, 23 Jun 2022 06:35:14 +0000 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Thu, 23 Jun 2022 09:40:36 +0000 Thu, 23 Jun 2022 06:35:14 +0000 NoFrequentContainerdRestart containerd is functioning properly
NetworkUnavailable True Mon, 01 Jan 0001 00:00:00 +0000 Thu, 23 Jun 2022 06:35:09 +0000 NoRouteCreated Node created without a route
MemoryPressure False Thu, 23 Jun 2022 09:41:38 +0000 Thu, 23 Jun 2022 06:35:09 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 23 Jun 2022 09:41:38 +0000 Thu, 23 Jun 2022 06:35:09 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 23 Jun 2022 09:41:38 +0000 Thu, 23 Jun 2022 06:35:09 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Thu, 23 Jun 2022 09:41:38 +0000 Thu, 23 Jun 2022 06:35:09 +0000 KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Also, a few of the pods in the kube-system namespace were stuck in Pending state, and the kube-proxy pod was crashing continuously on each node. The Calico pods showed this error on describing:
:~$ kubectl describe pod calico-node-fgcsr -n kube-system
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 37s default-scheduler Successfully assigned kube-system/calico-node-fgcsr to gke-cluster-jvnx
Normal Pulling 37s kubelet Pulling image "gke.gcr.io/calico/cni:v3.18.6-gke.0"
Normal Pulled 35s kubelet Successfully pulled image "gke.gcr.io/calico/cni:v3.18.6-gke.0" in 1.715111913s
Normal Created 34s kubelet Created container install-cni
Normal Started 34s kubelet Started container install-cni
Normal Pulling 32s kubelet Pulling image "gke.gcr.io/calico/node:v3.18.6-gke.1"
Normal Pulled 28s kubelet Successfully pulled image "gke.gcr.io/calico/node:v3.18.6-gke.1" in 3.765461978s
Normal Created 27s kubelet Created container calico-node
Normal Started 27s kubelet Started container calico-node
Warning Unhealthy 26s kubelet Readiness probe failed: Get "http://127.0.0.1:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
Warning Unhealthy 8s (x2 over 18s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503
We could observe that, on autoscaling, the new VM instances had indeed come up in the GKE console, but they were unable to join the cluster, and existing nodes were going NotReady on cluster restart.
If we delete the ValidatingWebhookConfiguration capsule-validating-webhook-configuration, all these errors are resolved and the cluster works perfectly fine. We are even able to create tenants, and the tenant restrictions also work fine.
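For reference, the deletion we performed is equivalent to the following (the object name is the one shown in our cluster):

:~$ kubectl delete validatingwebhookconfiguration capsule-validating-webhook-configuration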
Please let us know why this webhook is causing this problem and whether it can be fixed.
This webhook controls specific actions that a tenant owner can issue against the nodes of their tenant. It's a requirement for the BYOD scenario, along with Capsule Proxy (see the Nodes section of the documentation).
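You can check which failurePolicy each handler is currently using by dumping the webhook object and looking at the failurePolicy field of the nodes entry:

:~$ kubectl get validatingwebhookconfiguration capsule-validating-webhook-configuration -o yaml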
For your use case, the /nodes handler policy is too strict: when the cluster is scaled to zero, the Capsule pods backing the webhook are gone, so with failurePolicy: Fail the API server rejects the node operations required for nodes to rejoin. You can relax this by changing the value of failurePolicy from Fail to Ignore using the Helm --set CLI flag (--set webhooks.nodes.failurePolicy=Ignore).
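A minimal sketch of that change, assuming the chart repository is registered as clastix and the release is named capsule in the capsule-system namespace (adjust all three to your installation):

:~$ helm upgrade capsule clastix/capsule \
      --namespace capsule-system \
      --reuse-values \
      --set webhooks.nodes.failurePolicy=Ignore

With Ignore, the API server proceeds even when the webhook endpoint is unreachable, so nodes can register while no Capsule pod is running; the trade-off is that the node policies are not enforced during that window.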
Let me know if I can help you further with any change you'd like to propose, and feel free to close the issue.