
Controller reports as ready even though it is not able to connect to EC2 instance metadata #548

Closed
invidian opened this issue Aug 26, 2020 · 5 comments · Fixed by #751
Labels
kind/bug Categorizes issue or PR as related to a bug.

@invidian
Member

invidian commented Aug 26, 2020

/kind bug

What happened?

When the controller pod is not able to connect to the EC2 instance metadata endpoint (for example, when it is blocked by a NetworkPolicy), the deployment still reports the container as ready for some time; the pod then crashes with the following error:

I0826 15:01:28.649833       1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.5.0
panic: EC2 instance metadata is not available

goroutine 1 [running]:
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.newControllerService(0xc000191380, 0xc00001e960, 0x0, 0x16)
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/controller.go:76 +0x103
github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver.NewDriver(0xc000115f70, 0x3, 0x3, 0xc000062900, 0xd4d780, 0xc000191260)
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/driver/driver.go:82 +0x3d9
main.main()
        /go/src/github.com/kubernetes-sigs/aws-ebs-csi-driver/cmd/main.go:31 +0x117

What you expected to happen?

The ebs-plugin container in the controller pod should wait until it can reach the EC2 instance metadata service before reporting readiness to Kubernetes.
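
One way to achieve this (a minimal sketch, not the driver's actual code) is to retry the metadata lookup with a bounded backoff at startup instead of panicking on the first failure. cloud.NewMetadata and cloud.MetadataService come from the driver's pkg/cloud package; the retry count and interval are illustrative assumptions:

package main

import (
	"fmt"
	"time"

	"github.com/kubernetes-sigs/aws-ebs-csi-driver/pkg/cloud"
)

// newMetadataWithRetry polls the EC2 instance metadata endpoint until it
// responds or the attempts are exhausted, returning an error instead of
// panicking immediately. Attempts and interval are illustrative values.
func newMetadataWithRetry() (cloud.MetadataService, error) {
	const (
		attempts = 10
		interval = 3 * time.Second
	)
	var lastErr error
	for i := 0; i < attempts; i++ {
		metadata, err := cloud.NewMetadata()
		if err == nil {
			return metadata, nil
		}
		lastErr = err
		time.Sleep(interval)
	}
	return nil, fmt.Errorf("EC2 instance metadata is not available after %d attempts: %v", attempts, lastErr)
}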

How to reproduce it (as minimally and precisely as possible)?

With Calico as the CNI, create the following GlobalNetworkPolicy to block access to the EC2 instance metadata endpoint:

apiVersion: crd.projectcalico.org/v1
kind: GlobalNetworkPolicy
metadata:
  name: block-metadata-access
spec:
  egress:
  - action: Allow
    destination:
      notNets:
      - 169.254.169.254/32
  selector: ""
  types:
  - Egress

Then, deploy the AWS EBS CSI driver as usual.
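
To confirm the policy is in effect before deploying the driver, you can check that the metadata endpoint is unreachable from a test pod (the pod name and image below are arbitrary):

kubectl run metadata-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl --max-time 5 http://169.254.169.254/latest/meta-data/instance-id

With the policy applied, the request should time out rather than return an instance ID.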

Anything else we need to know?:

It seems the deployment is currently missing readiness probes altogether, so adding them is also needed to resolve this.

Environment

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"archive", BuildDate:"2020-07-01T16:28:46Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T16:51:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: 0.5.0
  • Chart version: 0.4.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 26, 2020
invidian added a commit to kinvolk/lokomotive that referenced this issue Aug 26, 2020
After we created the aws-ebs-csi-driver component, we added a patch to
Lokomotive which deploys a GlobalNetworkPolicy blocking access to EC2
instance metadata by default for all pods, which ended up breaking the
component's functionality.

The issue was not spotted before, as the component does not have
readiness probes defined, which has been reported upstream:
kubernetes-sigs/aws-ebs-csi-driver#548

This commit fixes the component's functionality by adding a
NetworkPolicy object selecting the controller pods, which allows all
egress traffic for them and thus bypasses the GlobalNetworkPolicy.

Closes #864

Signed-off-by: Mateusz Gozdek <[email protected]>
@wongma7
Contributor

wongma7 commented Oct 16, 2020

There is a livenessprobe sidecar (https://github.com/kubernetes-csi/livenessprobe) already deployed: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/deploy/kubernetes/base/controller.yaml#L60. I am not sure whether it can double as a readiness probe, but that would probably solve this issue.
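
For reference, a readiness probe on the ebs-plugin container could point at the same healthz endpoint the liveness probe uses. A minimal sketch, assuming the healthz port name from the manifest linked above and illustrative thresholds; whether /healthz actually fails while metadata is unreachable would need to be verified:

readinessProbe:
  httpGet:
    path: /healthz
    port: healthz
  initialDelaySeconds: 10
  timeoutSeconds: 3
  periodSeconds: 10
  failureThreshold: 5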

@AndyXiangLi
Contributor

@invidian I'm not able to reproduce this issue with the latest driver version, v0.8.1. I'm using Calico as the CNI, and when the metadata service is blocked and the driver is deployed as usual, the ebs-plugin container never becomes ready.
Are you able to try with the latest version on your end?

@invidian
Member Author

@AndyXiangLi yes, I can reproduce it using v0.8.1. Do note that the container is in the Ready state briefly after creation; only then does it crash and go into the CrashLoopBackOff state.

@invidian
Member Author

invidian commented Feb 2, 2021

Running:

helm upgrade --install aws-ebs-csi-driver --namespace kube-system --wait --atomic --set enableVolumeScheduling=true --set enableVolumeResizing=true --set enableVolumeSnapshot=true aws-ebs-csi-driver/aws-ebs-csi-driver

does reproduce the issue. I would expect Helm to never converge.

$ kgpo
+ kubectl get pods
NAME                                       READY   STATUS             RESTARTS   AGE
calico-kube-controllers-855c8775f9-xd8zm   1/1     Running            0          7h12m
calico-node-hqlt7                          1/1     Running            0          7h12m
calico-node-x79mg                          1/1     Running            1          7h12m
coredns-7d799bc4c8-9pk25                   1/1     Running            0          7h12m
ebs-csi-controller-87d4b79bd-4bnh4         5/6     CrashLoopBackOff   2          69s
ebs-csi-controller-87d4b79bd-svc44         5/6     CrashLoopBackOff   2          69s
ebs-csi-node-5h8bn                         3/3     Running            0          69s
ebs-csi-node-8vss7                         3/3     Running            0          69s
ebs-snapshot-controller-0                  1/1     Running            0          69s

@vdhanan
Contributor

vdhanan commented Feb 8, 2021

/assign
