This repository has been archived by the owner on Oct 25, 2023. It is now read-only.

HPE MSA 2060 #98

Open
vankosa opened this issue Sep 17, 2021 · 23 comments

Comments

@vankosa

vankosa commented Sep 17, 2021

Good day!

I am using HPE MSA 2060 disk storage which is claimed to be supported.

I took the HPE MSA 2060 API documentation here - https://www.intesiscon.com/ficheros/manuales-tecnicos/255-HPE-a00105313en-us-HPE-MSA-1060-2060-2062-CLI-Reference-Guide.pdf

I ran into some problems while using this project.

1. https://github.com/enix/san-iscsi-csi/blob/main/pkg/controller/publisher.go#L43 does not match the HPE MSA 2060 API documentation: instead of host-maps it should be maps (page 347 of the documentation).
2. If you attach several PVs with one Helm deployment, all volumes on the storage array receive the same LUN number, which is unacceptable.
3. For some reason, the partition table and file system are not created in the slave PV.
4. enix/dothill-api-go#12

Applications used:

  1. Ubuntu 20.04
  2. Kubernetes 1.21.4
  3. Helm 3.3.4

PS: If you need additional information, tell me what to provide.

@abuisine
Contributor

Hi @vankosa.

Thank you for your feedback. We lack users on this project, so we are quite happy to hear from you!
The project is supposed to support the HPE MSA 2060; however, we have a limited set of equipment in our lab, the nearest model being an HPE MSA 2050.

We are willing to adjust whatever is missing for a 2060. However, we would need temporary access to one in order to make sure that everything works as it should. We are open to discussion on this matter.

You list several points. @paullaffitte, could you have a look if you have time?

I am personally curious about:

  • point 2: please be more specific. We allocate one LUN ID per (PV, host) pair, so what exactly seems unacceptable from your point of view?
  • point 3: as far as I know, we build a filesystem directly on the block device attached to the node, without any partitioning in the process. Do you have logs or an example that would help us understand what is going on?

@paullaffitte do not hesitate to correct me if I am wrong 😄

Cheers

@vankosa
Author

vankosa commented Sep 17, 2021

Thanks for the answer!

About LUN:

LOG san-iscsi-csi-controller

I0916 19:32:44.962818       1 driver.go:112] === [ROUTINE START] /csi.v1.Controller/ControllerPublishVolume ===
I0916 19:32:44.962865       1 controller.go:243] using dothill API at address https://san.unim.internal
I0916 19:32:44.962879       1 controller.go:245] dothill client is already configured for this API, skipping login
I0916 19:32:44.962892       1 publisher.go:78] attach request for initiator iqn.2021-09.internal.unim.kube151:kube151, volume id: 46739183c65641638f94a24ddd32406b
I0916 19:32:44.962907       1 dothill.go:92] -> GET /show/maps/"46739183c65641638f94a24ddd32406b"
I0916 19:32:45.069501       1 dothill.go:122] <- [0 Success] Command completed successfully. (2021-09-16 22:31:41)
I0916 19:32:45.069531       1 publisher.go:128] listing all LUN mappings
I0916 19:32:45.069544       1 dothill.go:92] -> GET /show/initiators/"iqn.2021-09.internal.unim.kube151:kube151"
I0916 19:32:45.078753       1 dothill.go:122] <- [0 Success] Command completed successfully. (2021-09-16 22:31:41)
I0916 19:32:45.078795       1 publisher.go:94] using LUN 1
I0916 19:32:45.078804       1 publisher.go:164] trying to map volume 46739183c65641638f94a24ddd32406b for initiator iqn.2021-09.internal.unim.kube151:kube151 on LUN 1
I0916 19:32:45.078816       1 dothill.go:92] -> GET /map/volume/access/rw/lun/1/initiator/iqn.2021-09.internal.unim.kube151:kube151/"46739183c65641638f94a24ddd32406b"
I0916 19:32:45.124404       1 dothill.go:122] <- [-3177 Error] The specified LUN overlaps a previously defined LUN. (2021-09-16 22:31:41)
E0916 19:32:45.124574       1 driver.go:118] rpc error: code = Internal desc = Dothill API returned non-zero code -3177 (The specified LUN overlaps a previously defined LUN. (2021-09-16 22:31:41))
I0916 19:32:45.124607       1 driver.go:121] === [ROUTINE END] /csi.v1.Controller/ControllerPublishVolume ===

About the file system:

LOG san-iscsi-csi-node

I0916 17:48:51.388779       1 driver.go:112] === [ROUTINE START] /csi.v1.Node/NodePublishVolume ===
I0916 17:48:51.388809       1 node.go:139] publishing volume f109e50f1fcc4a82a135ebcecb72d996
I0916 17:48:51.388821       1 node.go:142] ISCSI portals: [10.53.11.201 10.53.11.203]
I0916 17:48:51.388837       1 node.go:145] LUN: 1
I0916 17:48:51.388849       1 node.go:147] initiating ISCSI connection...
I0916 17:48:51.433235       1 node.go:164] attached device at /dev/mapper/3600c0ff000648a8d60ad436101000000
I0916 17:48:51.433269       1 node.go:167] device is using multipath
I0916 17:48:51.442143       1 node.go:380] Checking filesystem at /dev/mapper/3600c0ff000648a8d60ad436101000000
E0916 17:48:51.453558       1 driver.go:118] rpc error: code = DataLoss desc = filesystem seems to be corrupted: e2fsck 1.45.5 (07-Jan-2020)
ext2fs_open2: Bad magic number in super-block
e2fsck: Superblock invalid, trying backup blocks...
e2fsck: Bad magic number in super-block while trying to open /dev/mapper/3600c0ff000648a8d60ad436101000000

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

I0916 17:48:51.453602       1 driver.go:121] === [ROUTINE END] /csi.v1.Node/NodePublishVolume ===

@paullaffitte
Collaborator

Hi @vankosa,

First, thanks for your valuable feedback, it's very much appreciated. I will try to answer your points one after the other.

  1. Indeed, the documentation for your appliance seems to define the route /show/maps instead of /show/host-maps and /show/volume-maps. It should be possible to adapt the library to support different versions of the API. Speaking of versions, could you please paste here the version information for your appliance, as returned by the show versions command in the CLI? You should be able to get it as JSON by configuring your CLI first: set cli-parameters json.
  2. Normally, we allocate one LUN per initiator/volume pair, so two volumes can have the same LUN if they are mapped to two different initiators. Does your appliance support this, or does it only allow a given LUN to be allocated once overall?
  3. Is it the first NodePublishVolume call or a subsequent one? If it is not the first one, it may be actual data corruption. We implemented some checks to prevent it, but it still happens sometimes.

@vankosa
Author

vankosa commented Sep 17, 2021

  1. show version
{

"versions":[
  {
    "object-name":"controller-a-versions",
    "meta":"/meta/versions",
    "sc-cpu-type":"Broadwell 2200MHz",
    "bundle-version":"IN110R001",
    "bundle-status":"Valid",
    "bundle-status-numeric":0,
    "bundle-version-only":"IN110R001",
    "bundle-base-version":"I110",
    "build-date":"Fri Jun 11 19:19:49 UTC",
    "sc-fw":"INS110R01-01",
    "sc-baselevel":"INS110R01-01",
    "sc-memory":"N/A",
    "sc-fu-version":"1.50.33406",
    "sc-loader":"28.019",
    "capi-version":"3.21",
    "mc-fw":"IXM110R001-01",
    "mc-loader":"1.50.33406",
    "mc-base-fw":"IXM110R001-01",
    "fw-default-platform-brand":"HPE",
    "fw-default-platform-brand-numeric":15,
    "ec-fw":"5331",
    "pld-rev":"2.7",
    "pm-cpld-version":"Unknown",
    "prm-version":"N/A",
    "hw-rev":"5.0",
    "him-rev":"2",
    "him-model":"4",
    "backplane-type":7,
    "host-channel_revision":2,
    "disk-channel_revision":2,
    "mrc-version":"0.3.2.19",
    "ctk-version":"No CTK present",
    "mcos-version":"IPM110R001-01",
    "gem-version":"usm-rss_sbbsas_indium_msa_vikings-v5.3_lite-r2021.22.0_rc1_rel_"
  },
  {
    "object-name":"controller-b-versions",
    "meta":"/meta/versions",
    "sc-cpu-type":"Broadwell 2200MHz",
    "bundle-version":"IN110R001",
    "bundle-status":"Valid",
    "bundle-status-numeric":0,
    "bundle-version-only":"IN110R001",
    "bundle-base-version":"I110",
    "build-date":"Fri Jun 11 19:19:49 UTC",
    "sc-fw":"INS110R01-01",
    "sc-baselevel":"INS110R01-01",
    "sc-memory":"N/A",
    "sc-fu-version":"1.50.33406",
    "sc-loader":"28.019",
    "capi-version":"3.21",
    "mc-fw":"IXM110R001-01",
    "mc-loader":"1.50.33406",
    "mc-base-fw":"IXM110R001-01",
    "fw-default-platform-brand":"HPE",
    "fw-default-platform-brand-numeric":15,
    "ec-fw":"5331",
    "pld-rev":"2.7",
    "pm-cpld-version":"Unknown",
    "prm-version":"N/A",
    "hw-rev":"5.0",
    "him-rev":"2",
    "him-model":"4",
    "backplane-type":7,
    "host-channel_revision":2,
    "disk-channel_revision":2,
    "mrc-version":"0.3.2.19",
    "ctk-version":"No CTK present",
    "mcos-version":"IPM110R001-01",
    "gem-version":"usm-rss_sbbsas_indium_msa_vikings-v5.3_lite-r2021.22.0_rc1_rel_"
  }
],
"status":[
  {
    "object-name":"status",
    "meta":"/meta/status",
    "response-type":"Success",
    "response-type-numeric":0,
    "response":"Command completed successfully. (2021-09-17 13:17:29)",
    "return-code":0,
    "component-id":"",
    "time-stamp":"2021-09-17 13:17:29",
    "time-stamp-numeric":1631884649
  }
]
}
  2. Yes, the same LUN can be mapped to different initiators. The problem here is that the same LUN is assigned to several volumes mapped to the same initiator.
  3. It happens on the first call and on all subsequent ones.
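
One way a client could pick the mapping route from this output is sketched below. It is only an illustration, not code from dothill-api-go: the lookup table and the names mapsRouteByBundle, defaultMapsRoute and mapsRoute are hypothetical. The only entry is the one this thread supports (bundle base version I110 on the MSA 2060, which documents /show/maps), and every other firmware is assumed to keep /show/host-maps.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// controllerVersion keeps a few fields from one entry of the "versions"
// array printed by `show versions` in JSON mode.
type controllerVersion struct {
	ObjectName        string `json:"object-name"`
	BundleBaseVersion string `json:"bundle-base-version"`
	CAPIVersion       string `json:"capi-version"`
}

type versionsResponse struct {
	Versions []controllerVersion `json:"versions"`
}

// mapsRouteByBundle is a HYPOTHETICAL table. Its single entry comes from
// this thread (MSA 2060, bundle base version I110).
var mapsRouteByBundle = map[string]string{
	"I110": "/show/maps",
}

// defaultMapsRoute is the route the driver currently uses.
const defaultMapsRoute = "/show/host-maps"

// mapsRoute decodes the JSON output and returns the route to use for
// listing mappings, based on the first controller in the response.
func mapsRoute(raw []byte) (string, error) {
	var resp versionsResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return "", fmt.Errorf("decoding versions: %w", err)
	}
	if len(resp.Versions) == 0 {
		return "", fmt.Errorf("no controller versions in response")
	}
	if route, ok := mapsRouteByBundle[resp.Versions[0].BundleBaseVersion]; ok {
		return route, nil
	}
	return defaultMapsRoute, nil
}

func main() {
	sample := []byte(`{"versions":[{"object-name":"controller-a-versions","bundle-base-version":"I110","capi-version":"3.21"}]}`)
	route, err := mapsRoute(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(route)
}
```

Keying on bundle-base-version rather than capi-version is an arbitrary choice here; which field actually distinguishes the two API flavours would have to be confirmed on real appliances.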

@paullaffitte
Collaborator

paullaffitte commented Sep 17, 2021

  1. Thanks, it will be very useful when trying to adapt the API client to support different versions.
  2. I wonder whether the issue comes from the differences around the /show/maps route. To allocate a LUN, we first need to list the existing mappings. Currently we use the /show/host-maps route (https://github.com/enix/dothill-api-go/blob/7c31773f039fcc9016577584b5b0bcc6b93ed430/endpoints.go#L95). Did you edit this part before your tests to make it compatible with your appliance? If so, the problem may come from there.
  3. I would need more logs; currently I cannot really say what happened. Could you add the following lines to your values.yaml and try again with a new PVC, please?
node:
  extraArgs:
    - --v=9 # We don't really have 9 levels of verbosity, I just want to be sure we get all available logs.

PS: For the second point, it would also be useful to increase verbosity on the controller.

@vankosa
Author

vankosa commented Sep 17, 2021

All my changes to dothill-api-go are documented in enix/dothill-api-go#12.
Yes, I did change this URL.

Strange, but now the output is different, not related to the file system.

https://pastebin.com/ZbGrZSbH - node
https://pastebin.com/XifXgnAw - controller

@paullaffitte
Collaborator

From what I understand, the LUN allocation issue comes from the driver.dothillClient.ShowHostMaps function called in publisher.go#L129. If you did update this function, please check that once a first LUN has been allocated, the next call to driver.dothillClient.ShowHostMaps returns one dothill.Volume with its LUN set to 1.

For the node issue, as you said, it is not the same error message as the first time. To handle this one, please refer to the "multipath is inconsistent: devices WWIDs differ" part of our troubleshooting section. I would still be glad to know, with some more details, if you get the original error again.

@vankosa
Author

vankosa commented Sep 20, 2021

There I only changed https://github.com/enix/san-iscsi-csi/blob/main/pkg/controller/publisher.go#L43

I reverted all my changes, followed the recommendations, and got the relevant log again:
https://pastebin.com/vGVNEiSi

@paullaffitte
Collaborator

I could be wrong, but I suspect your StorageClass is missing the property parameters.fsType, which should be set to ext3 or ext4. According to the following snippet, the filesystem is created on the device only if the current filesystem differs from the target one and the device does not have a filesystem yet.

klog.V(1).Infof("Detected filesystem: %q", currentFsType)
if currentFsType != fsType {
	if currentFsType != "" {
		return fmt.Errorf("Could not create %s filesystem on device %s since it already has one (%s)", fsType, disk, currentFsType)
	}
	klog.Infof("Creating %s filesystem on device %s", fsType, disk)
	out, err := exec.Command(fmt.Sprintf("mkfs.%s", fsType), disk).CombinedOutput()
	if err != nil {
		return errors.New(string(out))
	}
}

If the property is not set in the StorageClass, the target filesystem will be "", which is the same value as "no filesystem detected". The device will therefore never be formatted, leading to the "filesystem corruption" error, since there is no valid filesystem on the device at all.
If your StorageClass does have the property parameters.fsType, I would like to be sure that its value is passed through correctly to this point. You could replace klog.V(1).Infof("Detected filesystem: %q", currentFsType) with klog.V(1).Infof("Detected filesystem: %q, target filesystem: %q", currentFsType, fsType) and check that the value of fsType is the same as parameters.fsType in your StorageClass.
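
The decision described above can be isolated into a small pure function, which makes the empty-target case easy to see. This is a sketch that mirrors the snippet's branching only; the names action, actionNone, actionFormat and decideFormat are illustrative and do not exist in the driver.

```go
package main

import "fmt"

// action is what the node would do with the attached device.
type action int

const (
	actionNone   action = iota // leave the device untouched
	actionFormat               // run mkfs.<fsType> on the device
)

// decideFormat mirrors the branching of the snippet above:
//   - detected == target: nothing to do
//   - another filesystem is present: refuse to overwrite it
//   - no filesystem detected: format with the target type
func decideFormat(currentFsType, fsType string) (action, error) {
	if currentFsType == fsType {
		return actionNone, nil
	}
	if currentFsType != "" {
		return actionNone, fmt.Errorf("device already has a %s filesystem, refusing to create %s", currentFsType, fsType)
	}
	return actionFormat, nil
}

func main() {
	// Blank device, fsType missing from the StorageClass: both values are "",
	// so they compare equal and the device is never formatted.
	a, err := decideFormat("", "")
	fmt.Println(a == actionNone, err)

	// Blank device, fsType set: the device gets formatted.
	a, err = decideFormat("", "ext4")
	fmt.Println(a == actionFormat, err)
}
```

The first call is the situation suspected in this thread: the later e2fsck check then runs against a device that was never formatted.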

@vankosa
Author

vankosa commented Sep 21, 2021

Unfortunately, yes: this option was specified incorrectly.

That leaves only the LUN issue to deal with.

@paullaffitte
Collaborator

paullaffitte commented Sep 21, 2021

Thanks for your contribution, I created an issue to address this problem. I think the plugin should return an error when a required parameter is missing.
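
A minimal sketch of such a check, assuming the StorageClass parameters reach the plugin as a map[string]string. The helper requireFsType and the supportedFsTypes set are hypothetical, and the accepted values are only the two mentioned in this thread (ext3 and ext4).

```go
package main

import "fmt"

// supportedFsTypes lists the values mentioned in this thread.
var supportedFsTypes = map[string]bool{"ext3": true, "ext4": true}

// requireFsType returns the fsType parameter, or an error when it is
// missing, empty, or not one of the supported values.
func requireFsType(parameters map[string]string) (string, error) {
	fsType, ok := parameters["fsType"]
	if !ok || fsType == "" {
		return "", fmt.Errorf("missing required parameter %q", "fsType")
	}
	if !supportedFsTypes[fsType] {
		return "", fmt.Errorf("unsupported fsType %q (expected ext3 or ext4)", fsType)
	}
	return fsType, nil
}

func main() {
	// A StorageClass without fsType is rejected up front instead of
	// failing later with a misleading corruption error.
	_, err := requireFsType(map[string]string{})
	fmt.Println(err)

	fsType, err := requireFsType(map[string]string{"fsType": "ext4"})
	fmt.Println(fsType, err)
}
```

Failing early this way would surface the misconfiguration when the volume is requested rather than at mount time.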

About the LUN: I understand that you didn't change the part of the source code that could be the origin of the problem, and since it involves the appliance API, I suspect an inconsistency between my version of the API and yours. Could you check that once a first LUN has been allocated, the next call to driver.dothillClient.ShowHostMaps returns one dothill.Volume with its LUN set to 1? You can do so by adding klog.Infof("volumes: %+v", volumes) next to line 143 in the following snippet (extracted from the chooseLUN function).

klog.V(5).Infof("checking if LUN 1 is not already in use")
if len(volumes) == 0 || volumes[0].LUN > 1 {
	return 1, nil
}
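
To show why the content of that list matters, here is a self-contained sketch of picking the lowest free LUN for one initiator. It is not the driver's chooseLUN implementation; lowestFreeLUN is a hypothetical helper that takes the LUNs already mapped to the initiator.

```go
package main

import (
	"fmt"
	"sort"
)

// lowestFreeLUN returns the smallest LUN >= 1 that does not appear in used.
// The input may be unsorted and is not modified.
func lowestFreeLUN(used []int) int {
	sorted := append([]int(nil), used...)
	sort.Ints(sorted)

	next := 1
	for _, lun := range sorted {
		if lun == next {
			next++ // this LUN is taken, try the following one
		} else if lun > next {
			break // gap found: next is free
		}
		// lun < next: duplicate or out-of-range value, ignore it
	}
	return next
}

func main() {
	fmt.Println(lowestFreeLUN(nil))            // no mappings known
	fmt.Println(lowestFreeLUN([]int{1}))       // LUN 1 already mapped
	fmt.Println(lowestFreeLUN([]int{1, 2, 4})) // gap at 3
}
```

If the mapping list comes back empty even though a volume is already mapped, any such function keeps answering 1, and the appliance rejects the second mapping with the "LUN overlaps a previously defined LUN" error seen in the controller log.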

@paullaffitte paullaffitte self-assigned this Sep 21, 2021
@paullaffitte paullaffitte added the type/bug Something isn't working label Sep 21, 2021
@vankosa
Author

vankosa commented Sep 21, 2021

I got this log:

I0921 15:50:12.393442 1 publisher.go:143] volumes: []

Could it be related to this function that I changed? https://github.com/vankosa/dothill-api-go/blob/master/endpoints.go#L91

@paullaffitte
Collaborator

Yes, it's most likely related. Sorry for the delay.

@paullaffitte
Collaborator

Some people at Seagate are working on a fork of our library dothill-api-go. They are actively working on making it cross-compatible between their different models. I tried replacing our implementation with theirs and it still works on our appliance. Could you give it a try on your appliance too? I pushed my changes to the branch feat/cross-compatibility.

@vankosa
Author

vankosa commented Oct 1, 2021

I found a solution to the problem. I will try to explain what I did.

The function in question is https://github.com/vankosa/dothill-api-go/blob/master/endpoints.go#L91.

Note that this function receives its initiatorName value from https://github.com/enix/san-iscsi-csi/blob/main/pkg/controller/publisher.go#L127

In my case initiatorName was iqn.2021-09.internal.unim.kube151:kube151, where the value after the colon is what the storage API documentation calls the nickname, so I split it off.

https://github.com/vankosa/dothill-api-go/blob/master/endpoints.go#L107

The original code specifies host-view; however, the API requires hosts-view.
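
Splitting the initiator name at the colon could look like the sketch below. It is not the code from the fork; splitInitiator is a hypothetical helper built on strings.Cut (Go 1.18+), which cuts at the first colon only. That detail matters for names containing more than one colon, such as iqn.1993-08.org.debian:01:be2a4889e22c, where taking the second element of strings.Split would keep just "01".

```go
package main

import (
	"fmt"
	"strings"
)

// splitInitiator separates an iSCSI initiator name at its first colon.
// ok is false when the name contains no colon at all.
func splitInitiator(name string) (prefix, suffix string, ok bool) {
	return strings.Cut(name, ":")
}

func main() {
	for _, name := range []string{
		"iqn.2021-09.internal.unim.kube151:kube151",
		"iqn.1993-08.org.debian:01:be2a4889e22c",
	} {
		prefix, suffix, ok := splitInitiator(name)
		fmt.Printf("prefix=%q suffix=%q ok=%v\n", prefix, suffix, ok)

		// For comparison: splitting on every colon and keeping element 1
		// drops everything after a second colon.
		fmt.Printf("naive=%q\n", strings.Split(name, ":")[1])
	}
}
```

Whether the appliance expects the full suffix or something else as the nickname is not established here; the sketch only shows the string handling.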

@paullaffitte
Collaborator

I'm not sure I understand your point. In https://github.com/vankosa/dothill-api-go/blob/master/endpoints.go#L107, there is already hosts-view instead of host-view. Or do you mean the opposite?

@vankosa
Author

vankosa commented Oct 1, 2021

@paullaffitte
Collaborator

Oh indeed, I didn't realize at first sight that your link pointed to your fork. I have to check that your changes are also compatible with our appliance before trying to merge them into this repository. In the meantime, could you take a look at the branch I mentioned above (feat/cross-compatibility)? According to the Seagate developers, their changes should make the library compatible with a wider range of appliances, and may fix your compatibility issue while keeping the code cross-compatible with other models.

@vankosa
Author

vankosa commented Oct 1, 2021

I will definitely take a look a little later and let you know the results.

@paullaffitte
Collaborator

I tried your code on my appliance and it seems to be incompatible. I hope the Seagate version will work for you, so that we can use it, since it works on my appliance too.

Just for the record, here are the logs I get when I try to publish a volume on my appliance with your modifications:

I1001 09:59:49.022215       1 driver.go:112] === [ROUTINE START] /csi.v1.Controller/ControllerPublishVolume ===
I1001 09:59:49.022473       1 controller.go:243] using dothill API at address https://10.14.3.98
I1001 09:59:49.022558       1 controller.go:245] dothill client is already configured for this API, skipping login
I1001 09:59:49.023278       1 publisher.go:78] attach request for initiator iqn.1993-08.org.debian:01:be2a4889e22c, volume id: bfe7663dde884e7384a76de541ec8557
I1001 09:59:49.023373       1 dothill.go:92] -> GET /show/volume-maps/"bfe7663dde884e7384a76de541ec8557"
I1001 09:59:49.391084       1 dothill.go:122] <- [0 Success] Command completed successfully. (2021-10-01 09:59:49)
I1001 09:59:49.391446       1 publisher.go:128] listing all LUN mappings
I1001 09:59:49.391536       1 dothill.go:92] -> GET /show/maps/"01.*"
I1001 09:59:49.420451       1 dothill.go:122] <- [-10380 Error] The group was not found on the system. (2021-10-01 09:59:49)
E1001 09:59:49.421358       1 driver.go:118] Dothill API returned non-zero code -10380 (The group was not found on the system. (2021-10-01 09:59:49))
I1001 09:59:49.421511       1 driver.go:121] === [ROUTINE END] /csi.v1.Controller/ControllerPublishVolume ===

@vankosa
Author

vankosa commented Oct 6, 2021

I checked out the branch https://github.com/enix/san-iscsi-csi/tree/feat/cross-compatibility and got exactly the same problem with LUN allocation.

@paullaffitte
Collaborator

We have to discuss this issue a bit more internally at Enix. I will get back to you next week.

@paullaffitte
Collaborator

At the moment, we have to focus on other important work, such as improving reliability for already supported systems. Once we're ready, we will try to make the required changes in order to support your appliance too.
