[BUG]: snapshot restore failed with Message = failed to get acl entries: Too many links #1514

Closed
ybrock opened this issue Oct 9, 2024 · 21 comments
Labels: area/csi-powerscale, type/bug

@ybrock

ybrock commented Oct 9, 2024

Bug Description

Hello,

We have CSM modules 1.3.1 with CSI drivers version 1.10.1 installed on Openshift 4.14.35 (K8s 1.27.16).

We have Dell PowerScale (Isilon) configured and running ok except for this issue.

When we try to restore a snapshot of a PVC that contains a symlink, the new PVC is never created (it stays Pending) and the CSI driver reports these events:

failed to provision volume with StorageClass "isilon-infra":
  rpc error: code = Internal desc = failed to copy snapshot id '381307'
error 'Error Source = /ifs/data/infra/.snapshot/snapshot-6cdfa72f-370a-4998-b406-22122315369f/csi/ocx/k8s-c3c0e0478f/backup/db/latest
Message = failed to get acl entries: Too many links
Source = /ifs/data/infra/.snapshot/snapshot-6cdfa72f-370a-4998-b406-22122315369f/csi/ocx/k8s-c3c0e0478f/backup/db/latest
Target = /ifs/data/infra/csi/ocx/k8s-6e4912c502/backup/db/latest '
93m         Warning   ProvisioningFailed       persistentvolumeclaim/vol3                          failed to provision volume with StorageClass "isilon-infra": rpc error: code = Internal desc = failed to copy snapshot id '382689', error 'Error Source = /ifs/data/infra/.snapshot/snapshot-5c50547b-2a8c-4ab6-8275-e79116d6395a/csi/ocx/k8s-25b1d626b2/2,Message = failed to get acl entries: Too many links,,Source = /ifs/data/infra/.snapshot/snapshot-5c50547b-2a8c-4ab6-8275-e79116d6395a/csi/ocx/k8s-25b1d626b2/2,Target = /ifs/data/infra/csi/ocx/k8s-9fde9d56f5/2 ...

The provisioner container is raising this message:

I1009 14:04:25.677248       1 controller.go:1075] Final error received, removing PVC 8bea8d03-1f57-438c-80c1-052b13d7ef6f from claims in progress
W1009 14:04:25.677258       1 controller.go:934] Retrying syncing claim "8bea8d03-1f57-438c-80c1-052b13d7ef6f", failure 138
E1009 14:04:25.677276       1 controller.go:957] error syncing claim "8bea8d03-1f57-438c-80c1-052b13d7ef6f": failed to provision volume with StorageClass "isilon-infra": rpc error: code = Internal desc = failed to copy snapshot id '382107', error 'Error Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Message = failed to get acl entries: Too many links,,Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Target = /ifs/data/infra/csi/ocx/k8s-8bea8d031f/backup/db/latest 

If there is no symlink in the file system, the snapshot restore works.

Logs

I1009 14:04:12.366227       1 leaderelection.go:281] successfully renewed lease dell-csm/csi-isilon-dellemc-com
I1009 14:04:17.374465       1 leaderelection.go:281] successfully renewed lease dell-csm/csi-isilon-dellemc-com
I1009 14:04:22.390114       1 leaderelection.go:281] successfully renewed lease dell-csm/csi-isilon-dellemc-com
I1009 14:04:25.677181       1 connection.go:251] GRPC response: {}
I1009 14:04:25.677198       1 connection.go:252] GRPC error: rpc error: code = Internal desc = failed to copy snapshot id '382107', error 'Error Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Message = failed to get acl entries: Too many links,,Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Target = /ifs/data/infra/csi/ocx/k8s-8bea8d031f/backup/db/latest 
'
I1009 14:04:25.677213       1 controller.go:848] CreateVolume failed, supports topology = false, node selected false => may reschedule = false => state = Finished: rpc error: code = Internal desc = failed to copy snapshot id '382107', error 'Error Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Message = failed to get acl entries: Too many links,,Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Target = /ifs/data/infra/csi/ocx/k8s-8bea8d031f/backup/db/latest 
'
I1009 14:04:25.677248       1 controller.go:1075] Final error received, removing PVC 8bea8d03-1f57-438c-80c1-052b13d7ef6f from claims in progress
W1009 14:04:25.677258       1 controller.go:934] Retrying syncing claim "8bea8d03-1f57-438c-80c1-052b13d7ef6f", failure 138
E1009 14:04:25.677276       1 controller.go:957] error syncing claim "8bea8d03-1f57-438c-80c1-052b13d7ef6f": failed to provision volume with StorageClass "isilon-infra": rpc error: code = Internal desc = failed to copy snapshot id '382107', error 'Error Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Message = failed to get acl entries: Too many links,,Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Target = /ifs/data/infra/csi/ocx/k8s-8bea8d031f/backup/db/latest 
'
I1009 14:04:25.677370       1 event.go:364] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"evs-crunchy", Name:"danthe", UID:"8bea8d03-1f57-438c-80c1-052b13d7ef6f", APIVersion:"v1", ResourceVersion:"2652820", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "isilon-infra": rpc error: code = Internal desc = failed to copy snapshot id '382107', error 'Error Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Message = failed to get acl entries: Too many links,,Source = /ifs/data/infra/.snapshot/snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,Target = /ifs/data/infra/csi/ocx/k8s-8bea8d031f/backup/db/latest 
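The failing snapshot id, source and target can be pulled out of these log lines mechanically. The following is a minimal sketch, assuming the comma-joined form of the error shown in the provisioner logs above; the function name and the regular expression are illustrative and not part of the driver.

```python
import re

# Matches only the comma-joined form of the error. Paths that contain commas
# or whitespace would need a stricter grammar (assumption of this sketch).
COPY_ERROR = re.compile(
    r"failed to copy snapshot id '(?P<snapshot_id>\d+)'"
    r".*?Message = (?P<message>[^,]+),"
    r".*?Source = (?P<source>[^,]+),"
    r"\s*Target = (?P<target>\S+)",
    re.DOTALL,
)


def parse_copy_error(line):
    """Return snapshot id, message, source and target as a dict, or None."""
    match = COPY_ERROR.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Log text copied from the GRPC error above.
    sample = (
        "rpc error: code = Internal desc = failed to copy snapshot id '382107', "
        "error 'Error Source = /ifs/data/infra/.snapshot/"
        "snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,"
        "Message = failed to get acl entries: Too many links,,"
        "Source = /ifs/data/infra/.snapshot/"
        "snapshot-8939b9e6-a7d3-4e19-b89a-2f527f5e866a/csi/ocx/k8s-8edf05408f/backup/db/latest,"
        "Target = /ifs/data/infra/csi/ocx/k8s-8bea8d031f/backup/db/latest "
    )
    for key, value in parse_copy_error(sample).items():
        print(f"{key}: {value}")
```

Grouping the results by source path makes it easy to see that every failure points at the same kind of entry (here, the "latest" symlink).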

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

create a PVC on a powerscale storageClass
mount the PVC in a pod
write a file into PVC
create a symlink in the PVC pointing to previous file
take a snapshot
create a new PVC by restoring from the previous snapshot
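The snapshot and restore steps above can be expressed as two Kubernetes manifests. The following is a minimal sketch that builds them in Python and prints them as JSON, which `kubectl apply -f -` accepts. Only the StorageClass name `isilon-infra` comes from the events above; the object names, the VolumeSnapshotClass name, the access mode and the size are placeholders to adapt.

```python
import json


def volume_snapshot(name, source_pvc, snapshot_class):
    """Build a VolumeSnapshot that captures an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": source_pvc},
        },
    }


def pvc_from_snapshot(name, snapshot_name, storage_class, size):
    """Build a PVC whose dataSource is the given VolumeSnapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            "accessModes": ["ReadWriteMany"],  # placeholder access mode
            "resources": {"requests": {"storage": size}},
            "dataSource": {
                "apiGroup": "snapshot.storage.k8s.io",
                "kind": "VolumeSnapshot",
                "name": snapshot_name,
            },
        },
    }


if __name__ == "__main__":
    # All names below except "isilon-infra" are hypothetical.
    snap = volume_snapshot("demo-snap", "demo-src", "demo-snapclass")
    restore = pvc_from_snapshot("demo-restore", "demo-snap", "isilon-infra", "1Gi")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [snap, restore]}, indent=2))
```

With the symlink present in the source volume, the restored PVC is the object that stays Pending in this report.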

Expected Behavior

new PVC is created from snapshot

CSM Driver(s)

CSI 1.10.1
CSM 1.3.1

Installation Type

helm

Container Storage Modules Enabled

isilon
karavi

Container Orchestrator

openshift 4.14 (crio)

Operating System

redhat coreos

@ybrock added the needs-triage and type/bug labels Oct 9, 2024
@csmbot
Collaborator

csmbot commented Oct 11, 2024

@ybrock: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and respond appropriately.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at [email protected].

@satyakonduri
Contributor

satyakonduri commented Oct 30, 2024

Hi @ybrock
We don’t have a release matching CSI driver 1.10.1 and CSM 1.3.1. Could you please confirm the correct CSI driver and CSM versions?
Thank you!

@ybrock
Author

ybrock commented Oct 30, 2024

Hi @satyakonduri

The Helm chart used to install Dell CSM is version 1.3.1.
The CSI driver shipped with it is version v2.10.0 (image tags: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0 and registry.access.redhat.com/dellemc/csi-isilon:v2.10.0).

Sorry for the imprecision.

@jooseppi-luna
Contributor

/sync

@jackieung-dell removed the needs-triage label Nov 6, 2024
@csmbot
Collaborator

csmbot commented Nov 6, 2024

link: 29803

@shanmydell added this to the v1.13.0 milestone Nov 11, 2024
@kumarkgosa

kumarkgosa commented Nov 14, 2024

Hello @ybrock ,
Could you give more details about how you are creating the symlink? I understand the target is the file written in the PVC, but where is the symlink located? In the description the source paths refer to snapshots: did you create the snapshot of the PVC first, or the symlink first?
Thank you!

@ybrock
Author

ybrock commented Nov 14, 2024

Hello,
It's very simple: the link and its target are in the same directory, for example a "latest" symlink pointing to a sibling directory, like this:

ln -s version-4-5-6  latest

You create the symlink as above,
then you take a snapshot using the CSI capabilities (for example from the OpenShift storage menu),
then you try to restore from this snapshot.
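The directory layout described above can be recreated with a short script, for instance inside the pod that mounts the PVC. This is a minimal sketch: the directory and link names come from the `ln -s` example, while the file name `data.txt` and its content are hypothetical.

```python
import os
import tempfile


def make_layout(root):
    """Create root/version-4-5-6/data.txt plus a relative symlink root/latest."""
    target_dir = os.path.join(root, "version-4-5-6")
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, "data.txt"), "w") as handle:
        handle.write("payload\n")
    link = os.path.join(root, "latest")
    # Same effect as running `ln -s version-4-5-6 latest` inside root.
    os.symlink("version-4-5-6", link)
    return link


if __name__ == "__main__":
    # A temporary directory stands in for the PVC mount point.
    with tempfile.TemporaryDirectory() as root:
        link = make_layout(root)
        print("link target:", os.readlink(link))
        print("resolves to a directory:", os.path.isdir(link))
```

The link is relative and resolves within the same directory, so it is neither dangling nor circular.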

@bpjain2004
Collaborator

Hello @ybrock, was the Powerscale user assigned with all the privileges suggested here, before the restore operation was requested?

@kumarkgosa

Hey @ybrock
I tried recreating the steps and was still able to restore the snapshot and get a Bound PVC. I also tried creating a circular symlink to see if I would get "Too many links", but the PVC was still bound.
My steps were: create a PVC in the PowerScale StorageClass, mount it in a pod, exec into the pod and create a file, then symlink it to "latest" in the same directory. Then I took a snapshot of the PVC and restored it. Did I miss anything here?
Based on our investigation, the error you see could be an array issue. Could you let us know which OneFS version the array is running? I tried on 9.5. Could you also check in the array UI that the ACL settings look good? They are under the Access dropdown menu.

@donatwork
Contributor

@ybrock Another test that would be nice to run is to reproduce the array operations outside of the CSI driver. Create the share and mount then create the link, create the snapshot, then try the restore. See if that works. In reality we do bind mounts so not exactly the same sequence of operations but you can follow the mount operations by looking at the node driver logs to see how the volume is mounted and try to reproduce those series of mounts.

The CSI driver does not deal with individual files on the volume so any traversing of the filesystem is not done by the driver but could be done as part of the PowerScale snapshot process. We will do some research on our end as well. Thanks.

@ybrock
Author

ybrock commented Nov 15, 2024

Hello,

We will try to reproduce the issue as suggested outside the CSI driver, I'll let you know.

In the meantime, I can tell you that we're using OneFS version 9.5.0.8.

regards

@donatwork
Contributor

@ybrock Is the target file that you are linking to a real file, or are there other intermediate links to the target? It seems the error is coming from the OS, which could be due to many levels of links or perhaps circular links (which I doubt is the case).

What happens if you do some other file-level call on the volume, e.g. traverse the files via a find command? Do you see the "Too many links" error? The error itself is a system error.
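The traversal check suggested above can also be scripted so that every failing entry is reported with its errno name. The following is a minimal sketch, not a driver tool; the function name `scan` and the demo layout are hypothetical.

```python
import errno
import os
import tempfile


def scan(root):
    """Walk root without descending into links; stat() every entry, collect failures."""
    failures = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # stat() follows symlinks, so loops and dangling links fail here.
                os.stat(path)
            except OSError as exc:
                failures.append((path, errno.errorcode.get(exc.errno, str(exc.errno))))
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "2.3.4"))
        os.symlink("2.3.4", os.path.join(root, "current"))  # healthy relative link
        os.symlink("loop", os.path.join(root, "loop"))      # deliberately circular link
        for path, code in scan(root):
            print(code, path)
```

On Linux, the text "Too many links" is the message for EMLINK, while a symlink loop such as the one in the demo is reported as ELOOP ("Too many levels of symbolic links"). Which errno OneFS returns for the failing ACL call is not stated in this thread.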

@shefali-malhotra
Collaborator

@ybrock have you tried isolating the issue ?

@ybrock
Author

ybrock commented Nov 18, 2024

Hello
We're not using complex symlinks that could point to unreachable paths.

If you just symlink any subdirectory within the current directory, the restore stops working.

For example :

ln -s   2.3.4 current

There is no issue with a find command inside the volume. On the NFS share everything looks fine.
The problem only appears when trying to restore.

I was informed by Dell that a bug related to this issue was found and fixed. Is that true?

@kumarkgosa
Copy link

Hey @ybrock
Just creating the symlinks on directories did not stop us from restoring the snapshot successfully. There has to be something else before the symlink part. We want to know more details about the ACL entries.
Were you informed that the bug was found in OneFS 9.5.0.8 ? I worked on 9.5.0.0 and it restored perfectly.

@ybrock
Author

ybrock commented Nov 20, 2024

Hello,

The problem was isolated yesterday with the help of Dell Support. We had a call with an engineer and ran some tests to reproduce the issue.

The problem has been narrowed down and is clearly related to the ACL inheritance that we need on the parent folders.

If we cut the permission inheritance completely, by removing all ACLs on the directory, the problem disappears and the snapshot can be restored.

It seems that when the CSI driver copies the data back from the ".snapshot" directory, it tries to set ACLs on the symlink and fails (which may be normal), but it does not skip the error, so the restore aborts.

So the issue is linked to inherited permissions, which our infrastructure needs to make sure a pod has the right (group) permissions to write into a PVC. The user that creates the PVC is the one used by the driver, while the user that mounts and uses the PVC is a random user generated by OpenShift, so the group permission has to allow write access for both. If we remove the permission inheritance, the applications no longer have the correct permissions to read and write on the PVC.

Kind regards
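The behaviour the reporter expects, recreating symlinks as links and never applying ACLs to them, can be illustrated with a small copy routine. This is a sketch only and not the driver's code: a maintainer states later in this thread that the driver uses the PowerScale API for the copy. The `apply_metadata` callback is a hypothetical stand-in for whatever ACL call is made per entry.

```python
import os
import shutil
import tempfile


def copy_tree(src, dst, apply_metadata):
    """Copy src into dst; symlinks are recreated and never passed to apply_metadata."""
    os.makedirs(dst, exist_ok=True)
    apply_metadata(src, dst)
    with os.scandir(src) as entries:
        for entry in entries:
            target = os.path.join(dst, entry.name)
            if entry.is_symlink():
                # Recreate the link itself and skip the metadata step for it.
                os.symlink(os.readlink(entry.path), target)
            elif entry.is_dir(follow_symlinks=False):
                copy_tree(entry.path, target, apply_metadata)
            else:
                shutil.copyfile(entry.path, target)
                apply_metadata(entry.path, target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "snapshot")
        dst = os.path.join(tmp, "restored")
        os.makedirs(os.path.join(src, "version-4-5-6"))
        os.symlink("version-4-5-6", os.path.join(src, "latest"))
        touched = []
        copy_tree(src, dst, lambda s, d: touched.append(os.path.relpath(d, dst)))
        print("metadata applied to:", sorted(touched))
        print("latest ->", os.readlink(os.path.join(dst, "latest")))
```

In the demo, the callback only records which paths it was called for, which shows that the "latest" link is restored without a metadata call.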

@shefali-malhotra
Collaborator

@ybrock Thanks for the detailed information. Let us know if you need anything else from us on this; otherwise, let us know if we can close this issue.

@ybrock
Author

ybrock commented Nov 21, 2024

Hello,
I'm sorry to insist, but there does seem to be a bug in the way the CSI driver restores a snapshot.
We noticed that the CSI driver is not using the snapshot API from PowerScale but is copying the data directly from the hidden ".snapshot" directory.

When ACLs with inheritance are configured on the parent directory, the restore of the symlink fails. You can probably reproduce that if you set some ACLs.

We have this kind of ACL (as reported by nfs4_getfacl):

 nfs4_getfacl ./k8s-ae9d2765f7
# file: k8s-ae9d2765f7
A::OWNER@:rwaDdxtTnNcCy
A::GROUP@:rwaDdxtTnNcy
A:fd:10:rwaDdxtTnNcCoy
A:fdg:[email protected]:rwaDdxtTnNcCoy
A:fdg:[email protected]:rwaDdxtTnNcCoy
A:fdg:[email protected]:rwaDdxtTnNcCoy
A::EVERYONE@:tcy
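Each line of that output is an NFSv4 ACE in the form type:flags:principal:permissions, where the flags `f` and `d` mark an entry as inherited by new files and directories and `g` marks a group principal. The following is a minimal sketch that picks out the inheritable entries; the helper names are hypothetical, and only ACEs that appear intact in the listing above are used.

```python
from typing import NamedTuple

# Flag letters as documented for nfs4_getfacl / nfs4_setfacl.
FLAG_NAMES = {
    "f": "file-inherit",
    "d": "directory-inherit",
    "n": "no-propagate-inherit",
    "i": "inherit-only",
    "g": "group",
}


class Ace(NamedTuple):
    type: str
    flags: str
    principal: str
    permissions: str


def parse_ace(line):
    """Split one ACE line into its four fields."""
    # Split from both ends so a principal containing ':' would stay intact.
    ace_type, flags, rest = line.strip().split(":", 2)
    principal, permissions = rest.rsplit(":", 1)
    return Ace(ace_type, flags, principal, permissions)


def is_inheritable(ace):
    """True if new files or directories under this entry inherit the ACE."""
    return "f" in ace.flags or "d" in ace.flags


if __name__ == "__main__":
    lines = [
        "A::OWNER@:rwaDdxtTnNcCy",
        "A:fd:10:rwaDdxtTnNcCoy",
        "A::EVERYONE@:tcy",
    ]
    for ace in map(parse_ace, lines):
        names = [FLAG_NAMES[flag] for flag in ace.flags if flag in FLAG_NAMES]
        print(ace.principal, "inheritable" if is_inheritable(ace) else "local", names)
```

In the listing above, the entries flagged `fd` and `fdg` are the inherited ones that the reporter ties to the failing restore.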

@kumarkgosa

Hi @ybrock, I was able to reproduce the issue on my OneFS 9.5.0.0 with a non-privileged PowerScale account. I tested the same process on OneFS 9.10 and did not see the issue. The driver does use the API for copying files and directories from a snapshot to the target, so this may not be a driver bug. Would you consider trying the same process on a PowerScale upgraded to OneFS 9.10?

@ybrock
Author

ybrock commented Nov 22, 2024

Hello @kumarkgosa
That's very good news.
I'll talk to our storage team to discuss when they can plan the upgrade to OneFS 9.10.
At least we know that upgrading is a solution.

Thank you!

@shefali-malhotra
Collaborator

@ybrock As the issue is not present in 9.10 and can be fixed by upgrading OneFS, I think we can close this issue. Please feel free to open a new issue if you run into any problem after upgrading.

@coulof added the area/csi-powerscale label Dec 18, 2024