Driver gets stuck at double volume removals #1466

Closed
FooBarWidget opened this issue Oct 4, 2024 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@FooBarWidget

FooBarWidget commented Oct 4, 2024

/kind bug

What happened?

When we delete a PVC, external-provisioner sends two duplicate "DeleteVolume" commands in quick succession to the EFS driver. I don't know why, but it does, consistently.

At the same time, for security reasons, we have an IAM policy on the EFS driver role that restricts DeleteAccessPoint calls to access points that carry the "efs.csi.aws.com/cluster" tag. We don't want the driver to be able to delete any other volumes.

{
    effect: iam.Effect.ALLOW,
    resources: ['*'],
    actions: ['elasticfilesystem:DeleteAccessPoint'],
    conditions: {
        Null: {
            'aws:ResourceTag/efs.csi.aws.com/cluster': 'false',
        }
    }
}

Depending on timing, the second DeleteVolume may fail with an Access Denied, like this:

I0930 14:10:24.594800       1 controller.go:396] DeleteVolume: called with args {VolumeId:fs-xxx::fsap-yyy Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
...
I0930 14:10:26.210464       1 controller.go:396] DeleteVolume: called with args {VolumeId:fs-xxx::fsap-yyy Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
E0930 14:10:26.221581       1 driver.go:107] GRPC error: rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied
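
To confirm the duplicate calls from the driver logs, one option is to pair up DeleteVolume lines by volume ID and report the gap between them. This is a minimal sketch that assumes klog-style lines like the ones above. `parseDeleteCalls` and `duplicateGaps` are hypothetical helper names, not part of the driver, and the sketch ignores logs that span midnight.

```typescript
interface DeleteCall {
  volumeId: string;
  micros: number; // time of day in microseconds
}

// Matches klog-style lines such as:
// I0930 14:10:24.594800       1 controller.go:396] DeleteVolume: called with args {VolumeId:fs-xxx::fsap-yyy ...
const deleteCallPattern =
  /^[IWEF]\d{4} (\d{2}):(\d{2}):(\d{2})\.(\d{6}) .*DeleteVolume: called with args \{VolumeId:(\S+)/;

// Extracts every DeleteVolume call with its volume ID and timestamp.
function parseDeleteCalls(logLines: string[]): DeleteCall[] {
  const calls: DeleteCall[] = [];
  for (const line of logLines) {
    const m = deleteCallPattern.exec(line);
    if (m === null) {
      continue; // not a DeleteVolume call line
    }
    const seconds = Number(m[1]) * 3600 + Number(m[2]) * 60 + Number(m[3]);
    calls.push({ volumeId: m[5], micros: seconds * 1000000 + Number(m[4]) });
  }
  return calls;
}

// Reports the gap between consecutive DeleteVolume calls for the same volume.
function duplicateGaps(calls: DeleteCall[]): { volumeId: string; gapMicros: number }[] {
  const lastSeen: Record<string, number> = {};
  const gaps: { volumeId: string; gapMicros: number }[] = [];
  for (const call of calls) {
    const previous = lastSeen[call.volumeId];
    if (previous !== undefined) {
      gaps.push({ volumeId: call.volumeId, gapMicros: call.micros - previous });
    }
    lastSeen[call.volumeId] = call.micros;
  }
  return gaps;
}

// Sample input: the log excerpt from this issue.
const sampleLog = [
  "I0930 14:10:24.594800       1 controller.go:396] DeleteVolume: called with args {VolumeId:fs-xxx::fsap-yyy Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}",
  "...",
  "I0930 14:10:26.210464       1 controller.go:396] DeleteVolume: called with args {VolumeId:fs-xxx::fsap-yyy Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}",
  "E0930 14:10:26.221581       1 driver.go:107] GRPC error: rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied",
];

const sampleGaps = duplicateGaps(parseDeleteCalls(sampleLog));
console.log(sampleGaps); // one entry for fs-xxx::fsap-yyy with gapMicros 1615664
```

For the excerpt above, the two calls for the same volume are about 1.6 seconds apart.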

The PersistentVolumeClaim then gets stuck in a deleting state with "VolumeFailedDelete" warning events. The driver keeps retrying and keeps failing. We have to manually remove the finalizer to unstick the PVC.

I think it's because nonexistent access points count as "not having the tag" and so the delete call fails.
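
To illustrate this hypothesis: IAM's `Null` condition operator with the value 'false' matches only when the condition key is present in the request context. The sketch below is a simplified model of that one operator, not the real IAM evaluation engine. The function name `nullConditionMatches`, the tag value `my-cluster`, and the assumption that an already-deleted access point contributes no resource tags are all illustrative.

```typescript
// Simplified request context: condition key -> value (undefined when absent).
type RequestContext = Record<string, string | undefined>;

// Models `Null: { key: 'false' }` (key must be present) and
// `Null: { key: 'true' }` (key must be absent).
function nullConditionMatches(
  context: RequestContext,
  key: string,
  expected: "true" | "false",
): boolean {
  const keyIsAbsent = context[key] === undefined;
  return expected === "true" ? keyIsAbsent : !keyIsAbsent;
}

const tagKey = "aws:ResourceTag/efs.csi.aws.com/cluster";

// First DeleteVolume: the access point exists and carries the tag
// ("my-cluster" is a made-up value).
const firstCall: RequestContext = { [tagKey]: "my-cluster" };

// Second DeleteVolume: the access point is already gone, so (by assumption)
// no resource tag is available for the condition to inspect.
const secondCall: RequestContext = {};

console.log(nullConditionMatches(firstCall, tagKey, "false")); // true: Allow statement applies
console.log(nullConditionMatches(secondCall, tagKey, "false")); // false: no Allow matches
```

If the hypothesis holds, the second request matches no Allow statement and is implicitly denied, which would surface as the Access Denied error above.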

What you expected to happen?
Not getting a Permission Denied. Not getting stuck.

Maybe you can first perform a DescribeAccessPoint to check whether it exists, before deleting.
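
The suggestion above could look roughly like the sketch below: check for existence first, and treat "already gone" as success at both the check and the delete step. This is not the driver's actual code (the driver is written in Go). `AccessPointClient`, `FakeClient`, and `deleteAccessPointIdempotent` are hypothetical names, and the calls are synchronous only to keep the example short.

```typescript
type RemoveResult = "deleted" | "not-found";

// Hypothetical minimal client; a real one would call the EFS API.
interface AccessPointClient {
  exists(accessPointId: string): boolean;
  remove(accessPointId: string): RemoveResult;
}

// Idempotent delete: a missing access point counts as success, both when the
// existence check reports it and when the delete itself finds nothing.
function deleteAccessPointIdempotent(
  client: AccessPointClient,
  accessPointId: string,
): "deleted" | "already-gone" {
  if (!client.exists(accessPointId)) {
    return "already-gone";
  }
  return client.remove(accessPointId) === "deleted" ? "deleted" : "already-gone";
}

// In-memory fake used only to exercise the sketch.
class FakeClient implements AccessPointClient {
  private ids: string[];

  constructor(ids: string[]) {
    this.ids = ids.slice();
  }

  exists(accessPointId: string): boolean {
    return this.ids.indexOf(accessPointId) !== -1;
  }

  remove(accessPointId: string): RemoveResult {
    const index = this.ids.indexOf(accessPointId);
    if (index === -1) {
      return "not-found";
    }
    this.ids.splice(index, 1);
    return "deleted";
  }
}

const fakeClient = new FakeClient(["fsap-yyy"]);
const firstResult = deleteAccessPointIdempotent(fakeClient, "fsap-yyy");
const secondResult = deleteAccessPointIdempotent(fakeClient, "fsap-yyy");
console.log(firstResult, secondResult); // deleted already-gone
```

A check-then-delete sequence is still racy when two calls overlap: both can pass the existence check before either delete finishes. That may be why the failure depends on timing, though this is a guess.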

How to reproduce it (as minimally and precisely as possible)?

Modify the driver role to add a tag condition, as described above.

Create a PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: efs
  volumeMode: Filesystem

Then delete it. In the driver logs you will see two DeleteVolume calls in quick succession.

The second call may or may not fail with an Access Denied error, depending on timing. You may have to repeat creation and deletion a few times to reproduce the error.

Anything else we need to know?:
We are an AWS Premium Support customer.

Environment

  • Kubernetes version (use kubectl version): v1.29.7-eks-a18cd3a
  • Driver version: 2.0.7 (chart 3.0.8)
@k8s-ci-robot added the kind/bug label Oct 4, 2024
@FooBarWidget changed the title from "Better handling of double volume removals" to "Driver gets stuck at double volume removals" Oct 4, 2024
@avanish23
Contributor

Hi @FooBarWidget
We already have that in place: we first describe the access point and delete it only if it exists. My guess is that the error occurs because the DescribeAccessPoints permission is granted with conditions similar to those set for DeleteAccessPoint. If so, please grant DescribeAccessPoints with the wildcard (*) for resources, something like:

{
	"Effect": "Allow",
	"Action": [
		"elasticfilesystem:DescribeAccessPoints"
	],
	"Resource": "*"
}

If that's not the case, please share more details.

@k8s-ci-robot added and removed the kind/support label Oct 5, 2024
@noudAndi

Hi,

We have the same problem in our cluster. The policy already allows describing the access points, as shown below.

"Statement": [
    {
      "Sid": "AllowDescribe",
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:DescribeAccessPoints",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:DescribeMountTargets",
        "ec2:DescribeAvailabilityZones"
      ],
      "Resource": "*"
    },
[...]

@avanish23
Contributor

Hi @noudAndi,
Can you please share more details on the configs and some additional relevant logs to help?

@mskanth972
Contributor

@FooBarWidget, if I understand correctly, the second deletion request from external-provisioner looks for the tag efs.csi.aws.com/cluster, which doesn't exist, and gets an Access Denied error on volume deletion?

@mskanth972
Contributor

The external-provisioner sidecar seems to be the issue here. The EFS CSI Driver is using v5.0.1-eks-1-30-8, which has a bug or regression that is mitigated in the latest version, v5.1.0-eks-1-31-5.
We are working on releasing the latest version with this fix.

@noudAndi

Quoting @avanish23: "Hi @noudAndi, Can you please share more details on the configs and some additional relevant logs to help?"

Hi,
After investigating further, our problem may have been a bit different. Here the driver tried to delete non-existent access points.
There was no hint that they were missing :/

@noudAndi

@mskanth972 It would be huge if you could push out a new version with the updated dependencies. This bug is causing havoc on our CI pipeline. 💥

@mskanth972
Contributor

Hi @noudAndi, it's already released on GitHub. Addons ECD is 11/15.

@noudAndi

@mskanth972 Sorry, but what do you mean by "Addons ECD is 11/15"?

@mskanth972
Contributor

mskanth972 commented Oct 29, 2024

We provide the EFS CSI Driver as an add-on for EKS clusters. An add-on release takes longer than a normal GitHub release, so users of the add-on need to wait until 11/15 for the latest release.

@noudAndi

OK, thanks for the explanation. I'm deploying via Helm, so I just bump the image versions. Let's see!

@mskanth972
Contributor

Closing the issue for now, feel free to reopen if the issue still persists.
