
error generating accessibility requirements: no topology key found on CSINode #1372

Closed
jlubins opened this issue Sep 2, 2022 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@jlubins

jlubins commented Sep 2, 2022

/kind bug

What happened?
Hi all, I recently upgraded my EKS cluster to 1.23, installing the drivers and doing all of the required permissions and service account tasks beforehand. My existing gp2 EBS volumes are not working out of the box; maybe that has to do with needing to migrate them to the CSI driver? Worse, the gp3 storage class that I created isn't working either. I followed this guide to a T. I am using dynamic volume provisioning.

What you expected to happen?

When I launch my service, the existing PVCs on my gp2 volumes do not show any log output. I expect the driver to try to connect to them, but instead I get no events.

Name:          claim-jlubinski
Namespace:     default
StorageClass:  gp2
Status:        Bound
Volume:        pvc-1eee3858-c24a-4d16-aa43-f78b3fa68f25
Labels:        app=jupyterhub
               chart=jupyterhub-1.2.0
               component=singleuser-storage
               heritage=jupyterhub
               hub.jupyter.org/username=jlubinski
               release=jhub
Annotations:   hub.jupyter.org/username: jlubinski
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      96Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       <none>
Events:        <none>

When I launch my service with a gp3 volume, the CSI driver recognizes that it should create a volume, but I get an error that I don't quite understand. I have read online that the topology key is often the AWS region. I uninstalled the AWS add-on version of the driver and reinstalled with Helm just so I could set the region more explicitly. I also set Helm to let the driver tolerate tainted nodes, having read somewhere that this could be the issue.
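Roughly, that reinstall looked like the following (reconstructed from memory, so the exact commands may differ; the my-values.yaml file name is just for illustration), with a small values file containing only the region and the taint toleration:

$ cat my-values.yaml
region: us-east-1
tolerateAllTaints: true
$ helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
$ helm repo update
$ helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
    --namespace kube-system -f my-values.yaml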
This is my gp3 storage class definition:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
allowVolumeExpansion: true
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
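The PVCs themselves are created by JupyterHub, but for reference, a claim against this class would look roughly like the following (the name and size here are illustrative, not one of my actual claims):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-gp3-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 10Gi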

This is the log output:

$ kubectl logs deployment/ebs-csi-controller -n kube-system -c csi-provisioner
W0902 17:59:51.202823       1 controller.go:934] Retrying syncing claim "aa1898b0-0ad1-4ef6-98ac-3140c00fc03d", failure 4
E0902 17:59:51.202862       1 controller.go:957] error syncing claim "aa1898b0-0ad1-4ef6-98ac-3140c00fc03d": failed to provision volume with StorageClass "gp3": error generating accessibility requirements: no topology key found on CSINode ip-172-31-24-24.ec2.internal
I0902 17:59:51.203029       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"claim-jlubinskijlubinski-2dgp3-2dhelm", UID:"aa1898b0-0ad1-4ef6-98ac-3140c00fc03d", APIVersion:"v1", ResourceVersion:"281540809", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/claim-jlubinskijlubinski-2dgp3-2dhelm"
I0902 17:59:51.203047       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"claim-jlubinskijlubinski-2dgp3-2dhelm", UID:"aa1898b0-0ad1-4ef6-98ac-3140c00fc03d", APIVersion:"v1", ResourceVersion:"281540809", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "gp3": error generating accessibility requirements: no topology key found on CSINode ip-172-31-24-24.ec2.internal

Here is what I tried to get around it with helm configuration:

$ helm get values aws-ebs-csi-driver -n kube-system
USER-SUPPLIED VALUES:
region: us-east-1
tolerateAllTaints: true

Here is an example gp3 pvc:

Name:          claim-jlubinskijlubinski-2dgp3-2dhelm
Namespace:     default
StorageClass:  gp3
Status:        Pending
Volume:
Labels:        app=jupyterhub
               chart=jupyterhub-1.2.0
               component=singleuser-storage
               heritage=jupyterhub
               hub.jupyter.org/username=jlubinski
               release=jhub
Annotations:   hub.jupyter.org/servername: jlubinski-gp3-helm
               hub.jupyter.org/username: jlubinski
               volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
               volume.kubernetes.io/selected-node: ip-172-31-24-24.ec2.internal
               volume.kubernetes.io/storage-provisioner: ebs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Used By:       <none>
Events:
  Type     Reason                Age                      From                                                                                      Message
  ----     ------                ----                     ----                                                                                      -------
  Normal   ExternalProvisioning  3m48s (x903 over 3h48m)  persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
  Warning  ProvisioningFailed    102s (x50 over 3h4m)     ebs.csi.aws.com_ebs-csi-controller-6f8c9fcb8c-xj9mg_20ec32e6-df44-4445-8d76-5588d0f70b47  failed to get target node: node "ip-172-31-24-24.ec2.internal" not found

How to reproduce it (as minimally and precisely as possible)?
I just installed the driver and updated my cluster and its nodes to 1.23.

Anything else we need to know?:
Sorry if this isn't enough information, or too much; any help would be greatly appreciated. I thought this update sounded simple enough, but I ended up taking down my entire system with it.

Environment

  • Kubernetes version (use kubectl version): v1.23.7-eks-4721010
  • Driver version: aws-ebs-csi-driver-2.10.1 (helm) / 1.11.2
@k8s-ci-robot added the kind/bug label on Sep 2, 2022
@torredil
Member

torredil commented Sep 3, 2022

Hey, thanks for reporting this. Are you using taints? The first thing that came to mind here is aws/containers-roadmap#1706. You are one step ahead of me, as I was going to suggest setting tolerateAllTaints: true.

  • Let's confirm the config looks as we would expect by providing the output of: kubectl -n kube-system get daemonset ebs-csi-node -o yaml | grep tolerations -C 3. It should look like:
      serviceAccount: ebs-csi-node-sa
      serviceAccountName: ebs-csi-node-sa
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
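  • A related check (a sketch; substitute the node name from the error): the external provisioner reads the topology keys from the CSINode object for the selected node, so you can inspect it directly with kubectl get csinode ip-172-31-24-24.ec2.internal -o yaml. On a healthy node the spec should contain something like:
      spec:
        drivers:
        - name: ebs.csi.aws.com
          topologyKeys:
          - topology.ebs.csi.aws.com/zone
    If drivers is empty or the CSINode object is missing, the node plugin never registered on that node, which is what produces the "no topology key found" error.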

@jlubins
Author

jlubins commented Sep 4, 2022

Thank you for getting back to me so quickly!

Interesting, the same error seems to come up in aws/containers-roadmap#1706. I'm not trying to use taints, and I don't believe anything configured on the cluster should be setting them automatically, but I was desperate enough to try anything that was giving others similar problems, so I wanted to rule it out.

It seems like this maybe isn't looking as it should, given that key: CriticalAddonsOnly is in there:

      serviceAccount: ebs-csi-node-sa
      serviceAccountName: ebs-csi-node-sa
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoExecute

Output for csi DaemonSet:

Name:           ebs-csi-node
Selector:       app=ebs-csi-node,app.kubernetes.io/name=aws-ebs-csi-driver
Node-Selector:  kubernetes.io/os=linux
Labels:         app.kubernetes.io/component=ebs-csi-node
                app.kubernetes.io/managed-by=EKS
                app.kubernetes.io/name=aws-ebs-csi-driver
                app.kubernetes.io/version=1.10.0
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 2
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status:  2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=ebs-csi-node
                    app.kubernetes.io/component=ebs-csi-node
                    app.kubernetes.io/managed-by=EKS
                    app.kubernetes.io/name=aws-ebs-csi-driver
                    app.kubernetes.io/version=1.10.0
  Service Account:  ebs-csi-node-sa
  Containers:
   ebs-plugin:
    Image:      602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-ebs-csi-driver:v1.10.0
    Port:       9808/TCP
    Host Port:  0/TCP
    Args:
      node
      --endpoint=$(CSI_ENDPOINT)
      --logtostderr
      --v=2
    Limits:
      cpu:     100m
      memory:  256Mi
    Requests:
      cpu:     10m
      memory:  40Mi
    Liveness:  http-get http://:healthz/healthz delay=10s timeout=3s period=10s #success=1 #failure=5
    Environment:
      CSI_ENDPOINT:   unix:/csi/csi.sock
      CSI_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /csi from plugin-dir (rw)
      /dev from device-dir (rw)
      /var/lib/kubelet from kubelet-dir (rw)
   node-driver-registrar:
    Image:      602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/csi-node-driver-registrar:v2.5.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --csi-address=$(ADDRESS)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
      --v=2
    Limits:
      cpu:     100m
      memory:  256Mi
    Requests:
      cpu:     10m
      memory:  40Mi
    Environment:
      ADDRESS:               /csi/csi.sock
      DRIVER_REG_SOCK_PATH:  /var/lib/kubelet/plugins/ebs.csi.aws.com/csi.sock
    Mounts:
      /csi from plugin-dir (rw)
      /registration from registration-dir (rw)
   liveness-probe:
    Image:      602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/livenessprobe:v2.6.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --csi-address=/csi/csi.sock
    Limits:
      cpu:     100m
      memory:  256Mi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /csi from plugin-dir (rw)
  Volumes:
   kubelet-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet
    HostPathType:  Directory
   plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/ebs.csi.aws.com/
    HostPathType:  DirectoryOrCreate
   registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry/
    HostPathType:  Directory
   device-dir:
    Type:               HostPath (bare host directory volume)
    Path:               /dev
    HostPathType:       Directory
  Priority Class Name:  system-node-critical
Events:                 <none>

I'm afraid I may be unable to get the kubelet logs for the node, as I believe I would need to SSH directly into the EC2 instance hosting the node to get them, right? I don't have the AWS permissions to do this, but I will attempt to get them and provide an update.

The EC2 instance associated with this node in particular (ip-172-31-21-210.ec2.internal) doesn't have a keypair associated with it, so I can't SSH into it. I know this is set at creation time. I'm not sure how this node was created. Am I out of luck there? Can I somehow force it to use a different node?

And for the last one:

$ kubectl logs deployment/ebs-csi-controller -n kube-system -c ebs-plugin
Found 2 pods, using pod/ebs-csi-controller-766965c675-68wml

@torredil
Member

torredil commented Sep 6, 2022

@jlubins It appears tolerateAllTaints: true was not set. Did you re-deploy the chart after modifying the value?
helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --values ./charts/aws-ebs-csi-driver/values.yaml.

Please confirm that the driver is running on the node and provide the output of: kubectl describe node ip-172-31-21-210.ec2.internal. The driver adds topology labels to the node and we should be able to see that in the describe node output.
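For reference (a sketch with a placeholder zone value, not output from your cluster), a node where the driver has registered should carry labels along these lines:

$ kubectl describe node ip-172-31-21-210.ec2.internal | grep topology
                    topology.ebs.csi.aws.com/zone=us-east-1a
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1a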

As far as retrieving the Kubelet logs, are you able to access the AWS console? Since you don't have a key pair, you won't be able to connect to the instance via SSH. However, I believe you can still use "EC2 Instance Connect" to spin up a terminal.

[screenshot: the EC2 Instance Connect option in the AWS console]

@jlubins
Author

jlubins commented Sep 6, 2022

You know, I think I may have deployed with Helm incorrectly. I've since redeployed with Helm as you described: pulling and untarring the chart, as well as manually changing tolerateAllTaints in the values.yaml that the chart comes with. Now everything is working, including the old gp2 volumes!
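For anyone who finds this later, the working redeploy was roughly the following (from memory, so the chart path may not be exact):

$ helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
$ helm pull aws-ebs-csi-driver/aws-ebs-csi-driver --untar
# edit aws-ebs-csi-driver/values.yaml: set tolerateAllTaints: true (and the region)
$ helm upgrade --install aws-ebs-csi-driver --namespace kube-system \
    ./aws-ebs-csi-driver --values ./aws-ebs-csi-driver/values.yaml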

Whatever I did last time, I think I downloaded the chart directly from the Helm repo, then created a values.yaml of my own that specified just tolerateAllTaints and the region, which must have overridden the default values, because I had far fewer USER-SUPPLIED VALUES coming back from $ helm get values aws-ebs-csi-driver -n kube-system.

Since everything is working properly now, I'll go ahead and close the issue. Thank you for being a second, more knowledgeable pair of eyes for me @torredil!

@jlubins closed this as completed on Sep 6, 2022