LUN Assignment Conflict on Dell ME5024 with Seagate Exos-X CSI Driver Leading to PersistentVolume Failures #113

Closed
agusk opened this issue Sep 28, 2024 · 6 comments

agusk commented Sep 28, 2024

Describe the bug
When deploying PVCs using the Seagate Exos-X CSI driver on Dell ME5024 storage, the CSI driver repeatedly assigns the same LUN (LUN 1) to each PersistentVolume (PV) on different Kubernetes nodes. This results in the failure of subsequent deployments if another PVC already uses LUN 1 on the same node. The issue seems to occur when the driver does not increment or manage LUN assignments properly, leading to LUN conflicts.

To Reproduce
Steps to reproduce the behavior:

  1. Set up Dell ME5024 storage with Seagate Exos-X CSI driver and Fibre Channel (FC) connectivity.
  2. Deploy a Kubernetes cluster using Rancher RKE2 with multiple workers.
  3. Create two PVCs that get successfully assigned LUN 1 on different worker nodes.
  4. Attempt to create a third PVC on another node.
  5. The PVC creation fails due to a "LUN overlap" error on LUN 1.

Here is my storage-class.yml:

apiVersion: storage.k8s.io/v1
kind: StorageClass
provisioner: csi-exos-x.seagate.com
allowVolumeExpansion: true
metadata:
  name: storageclass-seagate
parameters:
  csi.storage.k8s.io/provisioner-secret-name: secret-seagate
  csi.storage.k8s.io/provisioner-secret-namespace: seagate
  csi.storage.k8s.io/controller-publish-secret-name: secret-seagate
  csi.storage.k8s.io/controller-publish-secret-namespace: seagate
  csi.storage.k8s.io/controller-expand-secret-name: secret-seagate
  csi.storage.k8s.io/controller-expand-secret-namespace: seagate
  fsType: ext4 # Desired filesystem
  pool: A # Pool for volumes provisioning
  volPrefix: stx # Desired prefix for volume naming, an underscore is appended
  storageProtocol: fc # The storage interface (iscsi, fc, sas) being used for storage i/o

Here is a sample PVC file:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xxxx-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi  # Adjust the size as needed
  storageClassName: storageclass-seagate

Expected behavior
The Seagate CSI driver should assign unique LUNs for each PVC deployment to avoid conflicts, even across multiple nodes.

Screenshots
N/A (logs provided instead)

Storage System (please complete the following information):

  • Vendor: Dell
  • Model: ME5024

Environment:

  • Kubernetes version: RKE2 1.30.4
  • Environment: 3 workers and 1 control plane
  • Host OS: Ubuntu 24.04 LTS
  • seagate-exos-x-csi library: version 1.9.0

Additional context
Log output from the third deployment that fails:

rpc error: code = AlreadyExists desc = lun overlap for lun: 1 volume="stx_cf249e94fa5b2c3b1f6edd10ad1" initiator="2100f4c7aa9a3168"

The first two PVCs are successfully assigned LUN 1 on different nodes, but any further deployments fail due to LUN conflicts on the same node. The expectation is that the driver should handle LUN management dynamically to avoid these conflicts.

Here is the log output from the seagate-exos-x-csi-controller pod (seagate-exos-x-csi-controller-server-xxx):

1 system.go:147] 
I0926 04:37:09.736925       1 system.go:148] === Controller ===
I0926 04:37:09.736933       1 system.go:149] IPAddress:     10.101.99.114
I0926 04:37:09.736944       1 system.go:150] Protocol:      http://
I0926 04:37:09.736953       1 system.go:151] Controller:    B
I0926 04:37:09.736964       1 system.go:152] Platform:      Indium LX2
I0926 04:37:09.736974       1 system.go:153] SerialNumber:  xxxxx
I0926 04:37:09.736985       1 system.go:154] Status:        Operational
I0926 04:37:09.736995       1 system.go:155] MCCodeVersion: IXM200R009-02
I0926 04:37:09.737008       1 system.go:156] MCBaseVersion: IXM200R009-02
I0926 04:37:09.737019       1 system.go:158] 
I0926 04:37:09.737030       1 system.go:159] === Ports ===
I0926 04:37:09.737049       1 system.go:161] Port [0] A0, FC, 207000c0ff69647b,                ,             , 
I0926 04:37:09.737071       1 system.go:161] Port [1] A1, FC, 217000c0ff69647b,                ,             , 
I0926 04:37:09.737087       1 system.go:161] Port [2] A2, FC, 227000c0ff69647b,                ,             , 
I0926 04:37:09.737118       1 system.go:161] Port [3] A3, FC, 237000c0ff69647b,                ,             , 
I0926 04:37:09.737133       1 system.go:161] Port [4] B0, FC, 247000c0ff69647b,                ,             , 
I0926 04:37:09.737151       1 system.go:161] Port [5] B1, FC, 257000c0ff69647b,                ,             , 
I0926 04:37:09.737165       1 system.go:161] Port [6] B2, FC, 267000c0ff69647b,                ,             , 
I0926 04:37:09.737182       1 system.go:161] Port [7] B3, FC, 277000c0ff69647b,                ,             , 
I0926 04:37:09.737199       1 system.go:165] 
I0926 04:37:09.737216       1 system.go:166] === Pools ===
I0926 04:37:09.737233       1 system.go:168] Pool [0] A             Virtual   00c0fffb4dd10000a50dee6601000000
I0926 04:37:09.737254       1 system.go:168] Pool [1] B             Virtual   00c0fffb4a9a0000b90dee6601000000
I0926 04:37:09.737271       1 system.go:171] 
I0926 04:37:09.739404       1 publisher.go:39] "attach request" initiator(s)=["2100f4c7aa9a3168","2100f4c7aa9a3169"] volume="stx_5e72cdd44eeb3ed836244d8f8c2"
I0926 04:37:09.785094       1 volumes.go:420] "Get Volume Maps Host Names" hostnames=[] apistatus={"ResponseType":"Success","ResponseTypeNumeric":0,"Response":"Command completed successfully. (2024-09-26 04:37:09)","ReturnCode":0,"Time":"2024-09-26T04:37:09Z"}
I0926 04:37:09.785138       1 volumes.go:264] "listing all LUN mappings"
I0926 04:37:09.785155       1 volumes.go:230] "++ ShowHostMaps" host="2100f4c7aa9a3168"
I0926 04:37:09.831003       1 volumes.go:230] "++ ShowHostMaps" host="2100f4c7aa9a3169"
I0926 04:37:09.875897       1 volumes.go:441] "using LUN" lun=1
I0926 04:37:09.875935       1 volumes.go:334] "trying to map volume" volume="stx_5e72cdd44eeb3ed836244d8f8c2" initiator="2100f4c7aa9a3168" lun=1
I0926 04:37:09.921875       1 volumes.go:340] "status" ReturnCode=-3177
E0926 04:37:09.921930       1 volumes.go:446] "mapping error" err="rpc error: code = AlreadyExists desc = lun overlap for lun: 1" volume="stx_5e72cdd44eeb3ed836244d8f8c2" initiator="2100f4c7aa9a3168" LUN=1
I0926 04:37:09.921947       1 volumes.go:334] "trying to map volume" volume="stx_5e72cdd44eeb3ed836244d8f8c2" initiator="2100f4c7aa9a3169" lun=1
I0926 04:37:09.967950       1 volumes.go:340] "status" ReturnCode=-3177
E0926 04:37:09.968010       1 volumes.go:446] "mapping error" err="rpc error: code = AlreadyExists desc = lun overlap for lun: 1" volume="stx_5e72cdd44eeb3ed836244d8f8c2" initiator="2100f4c7aa9a3169" LUN=1
E0926 04:37:09.968037       1 driver.go:147] error mapping volume (stx_5e72cdd44eeb3ed836244d8f8c2), no initiators were mapped successfully
I0926 04:37:09.968052       1 driver.go:137] === [ROUTINE END] [0] /csi.v1.Controller/ControllerPublishVolume (71e439bc75b1) <306.883382ms> ===
I0926 04:37:29.231134       1 driver.go:126] === [ROUTINE REQUEST] [0] /csi.v1.Controller/ControllerUnpublishVolume (5dec755aebc1) <0s> ===
I0926 04:37:29.231461       1 driver.go:133] === [ROUTINE START] [1] /csi.v1.Controller/ControllerUnpublishVolume (5dec755aebc1) <2.872µs> ===
I0926 04:37:29.232005       1 controller.go:256] "using API" addresses=["http://10.101.99.114"]
I0926 04:37:29.242141       1 mc.go:103] "++ MC Login SUCCESS" ipaddress="10.101.99.114" protocol="http"
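
For anyone triaging similar controller logs, here is a minimal sketch that pulls the failed mappings out of log lines in the format shown above. The regular expression is based only on the "mapping error" lines in this report, and the names `MAPPING_ERROR` and `failed_mappings` are hypothetical; other driver versions may log a different format.

```python
import re

# Pattern derived from the "mapping error" lines in this report (an assumption,
# not a documented log format of the driver).
MAPPING_ERROR = re.compile(
    r'"mapping error" err="(?P<err>[^"]*)" '
    r'volume="(?P<volume>[^"]+)" initiator="(?P<initiator>[^"]+)" LUN=(?P<lun>\d+)'
)


def failed_mappings(lines):
    """Return (volume, initiator, lun) tuples for every failed map attempt."""
    failures = []
    for line in lines:
        match = MAPPING_ERROR.search(line)
        if match:
            failures.append(
                (match["volume"], match["initiator"], int(match["lun"]))
            )
    return failures
```

Feeding it the output of `kubectl logs` for the controller pod, split into lines, gives a quick list of which initiators were refused which LUN.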
@David-T-White David-T-White self-assigned this Sep 30, 2024
@David-T-White

Hello, thanks for the detailed bug report and logs. Sorry to hear this isn't working as expected.

The ControllerPublishVolume routine attempts to find existing LUNs for the current initiators being mapped and then selects the next highest unused LUN number.
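
A minimal sketch of that selection logic, for illustration only. This is not the driver's actual code: the function name `next_lun` and the input shape are hypothetical, and "next highest unused LUN" is read here as one more than the highest LUN already in use by any of the initiators.

```python
def next_lun(mappings_by_initiator):
    """Pick the LUN one above the highest LUN used by any listed initiator.

    mappings_by_initiator maps an initiator ID to the LUN numbers already
    mapped for it (hypothetical shape, for illustration).
    """
    used = set()
    for luns in mappings_by_initiator.values():
        used.update(luns)
    # With no known mappings, the selection falls back to LUN 1.
    return max(used, default=0) + 1
```

Under this reading, a lookup that finds no existing mappings for the initiators always yields LUN 1, which would be consistent with the `"using LUN" lun=1` lines in the logs if mappings that already exist on the array are not being detected.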

I suspect the issue you are seeing is that LUN selection isn't functioning correctly when initiators have been configured with nicknames and/or host groups. Can you confirm that your array has initiator nicknames or hosts defined? This can be viewed with 'show initiators' in the array CLI, or in the Hosts section of the web interface.

I will be looking into a bug fix. If you do see nicknames/hosts defined, a potential workaround in the interim would be to delete any initiator nicknames/hosts defined on your array and see if LUN selection works properly with them removed.

Thanks,
Dave

@lebonez

lebonez commented Oct 9, 2024

I do not use nicknames or host groups and I'm getting this error. By the way this is working great outside of this issue.

# show initiators 
Nickname       Discovered Mapped Profile  Host Type  ID                
-----------------------------------------------------------------------
initiator0001  Yes        No     Standard SAS        
initiator0002  Yes        No     Standard SAS        
initiator0003  Yes        No     Standard SAS        
initiator0004  Yes        No     Standard SAS        
-----------------------------------------------------------------------
Success: Command completed successfully. (2024-10-09 12:19:)

I'm attempting to map two PVCs, both block devices, to a single VM. One maps properly; the other does not.

VM Pod logs

  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               32s               default-scheduler        Successfully assigned default/virt-launcher-openmanage-vm-v2zkv to
  Normal   SuccessfulAttachVolume  25s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-2fbd1367-68fc-44cc-8227-cbc736b8a1d3"
  Normal   SuccessfulMountVolume   11s               kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-2fbd1367-68fc-44cc-8227-cbc736b8a1d3" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-2fbd1367-68fc-44cc-8227-cbc736b8a1d3/dev"
  Normal   SuccessfulMountVolume   11s               kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-2fbd1367-68fc-44cc-8227-cbc736b8a1d3" volumeMapPath "/var/lib/kubelet/pods/52783a99-efba-4725-a49a-69aa5b035aa3/volumeDevices/kubernetes.io~csi"
  Warning  FailedAttachVolume      5s (x6 over 24s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-a7d3dfe9-8dea-4041-b9c1-e2058a937e2f" : rpc error: code = Unknown desc = error mapping volume (blo_fe98dea4041b9c1e2058a937e2f), no initiators were mapped successfully

Controller logs

I1009 12:25:19.652164       1 volumes.go:334] "trying to map volume" volume="blo_fe98dea4041b9c1e2058a937e2f" initiator="5f4ee0802357e100" lun=1
I1009 12:25:19.697586       1 volumes.go:340] "status" ReturnCode=-3177
E1009 12:25:19.697628       1 volumes.go:446] "mapping error" err="rpc error: code = AlreadyExists desc = lun overlap for lun: 1" volume="blo_fe98dea4041b9c1e2058a937e2f" initiator="5f4ee0802357e100" LUN=1
I1009 12:25:19.697659       1 volumes.go:334] "trying to map volume" volume="blo_fe98dea4041b9c1e2058a937e2f" initiator="5f4ee0802357e101" lun=1
I1009 12:25:19.742606       1 volumes.go:340] "status" ReturnCode=-3177
E1009 12:25:19.742654       1 volumes.go:446] "mapping error" err="rpc error: code = AlreadyExists desc = lun overlap for lun: 1" volume="blo_fe98dea4041b9c1e2058a937e2f" initiator="5f4ee0802357e101" LUN=1
E1009 12:25:19.742700       1 driver.go:147] error mapping volume (blo_fe98dea4041b9c1e2058a937e2f), no initiators were mapped successfully
I1009 12:25:19.742721       1 driver.go:137] === [ROUTINE END] [0] /csi.v1.Controller/ControllerPublishVolume (3767c170d9eb) <190.801251ms> ===

@mukhinsumojo

I'm experiencing the same issue. If I manually add a host in the storage interface (Dell ME5012), the driver cannot assign the right LUN; it always tries to assign LUN 1.
Deleting the host from the storage helps.
But if I want to assign a volume to a host manually while also using seagate-exos-x-csi, the host needs to have a nickname. So it's impossible to use both methods simultaneously.
I hope this issue will be fixed in the next release.

@lfornili

Exact same issue on HPE MSA 2060 with latest firmware.

@lebonez

lebonez commented Oct 13, 2024

Just confirmed: if you remove all of the hosts and host groups and let the CSI driver handle the host initiator mapping, this works fine. The driver apparently needs to be modified to check for LUN numbers that are already in use by static host initiator mappings. I'm fine with the workaround.
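
To illustrate the kind of check meant here, a minimal sketch that counts LUNs mapped to the host an initiator belongs to, as well as LUNs mapped to the initiator directly. All names and data shapes (`luns_in_use`, `pick_lun`, `host_of`, `host_maps`) are hypothetical, and the fix that actually shipped in the driver may work differently.

```python
def luns_in_use(initiators, initiator_maps, host_of, host_maps):
    """Collect LUNs used by the initiators directly or via their host."""
    used = set()
    for initiator in initiators:
        # LUNs mapped to the initiator itself.
        used.update(initiator_maps.get(initiator, ()))
        # LUNs mapped statically to the host the initiator is grouped under.
        host = host_of.get(initiator)
        if host is not None:
            used.update(host_maps.get(host, ()))
    return used


def pick_lun(used, first=1):
    """Return the lowest LUN number, starting at `first`, not in `used`."""
    lun = first
    while lun in used:
        lun += 1
    return lun
```

For example, if an initiator has no direct mappings but its host already has a volume on LUN 1, `pick_lun(luns_in_use(...))` returns 2 instead of colliding on LUN 1.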

@David-T-White

Addressed in v1.10.0
