Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CreateVolume API continues to be requested even after related objects are deleted on v1.2.0 #1022

Closed
imbohyun1 opened this issue Aug 13, 2021 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@imbohyun1
Copy link

imbohyun1 commented Aug 13, 2021

/kind bug

What happened?

Tried to create PV encrypted by KMS key with PVC and it got failed because of the permission issue. After the first attempt at CreateVolume failed, I could see the controller retrying to create the volume and failing over and over again.

From the second try, CreateVolume failed with InternalError, I have checked the internal log that the error occurred because the client token is associated with a resource that is already deleted.

"eventSource": "ec2.amazonaws.com",
"eventName": "CreateVolume",
"awsRegion": "ap-northeast-2",
"userAgent": "aws-sdk-go/1.40.4 (go1.16.6; linux; amd64) exec-env/aws-ebs-csi-driver-v1.2.0",
"errorCode": "Server.InternalError",
"errorMessage": "An internal error has occurred",
....

So I deleted the pod and pvc via kubectl (checked with kube-apiserver-audit log that the pvc deleted successfully), however, 'Createvolume' API continues to be requested even after related objects are deleted.

. controller log - first/second attempt

W0813 06:50:26.040984       1 controller.go:958] Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db", failure 0
E0813 06:50:26.041005       1 controller.go:981] error syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = Internal desc = Could not create volume "pvc-1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-0a3bca4957543370d' does not exist.
	status code: 400, request id: b6d97405-784b-4b7c-b4b1-72b75a4833f5
I0813 06:50:26.041033       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979326", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = Internal desc = Could not create volume "pvc-1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-0a3bca4957543370d' does not exist.

I0813 06:50:37.041621       1 controller.go:1106] Temporary error received, adding PVC 1bef4ab2-2ee9-4a25-89af-c8d85675d6db to claims in progress
W0813 06:50:37.041656       1 controller.go:958] Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db", failure 1
E0813 06:50:37.041677       1 controller.go:981] error syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 06:50:37.041836       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979326", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 06:50:39.041937       1 controller.go:1332] provision "default/ebs-claim-enc" class "ebs-sc-enc": started
I0813 06:50:39.042344       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979326", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/ebs-claim-enc"

. Delete the pod and pvc - the request of pvc deletion has made audit log

{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "RequestResponse",
    "auditID": "1f4cca8d-6014-4700-ae3f-875fc6dcd75a",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/default/persistentvolumeclaims/ebs-claim-enc",
    ...
    "objectRef": {
        "resource": "persistentvolumeclaims",
        "namespace": "default",
        "name": "ebs-claim-enc",
        "apiVersion": "v1"
    },
    "responseStatus": {
        "metadata": {},
        "code": 200
    },
    "requestObject": {
        "kind": "DeleteOptions",
        "apiVersion": "v1",
        "propagationPolicy": "Background"
    },
    ...
    "responseObject": {
        "kind": "PersistentVolumeClaim",
        "apiVersion": "v1",
        "metadata": {
            "name": "ebs-claim-enc",
            "namespace": "default",
            "uid": "1bef4ab2-2ee9-4a25-89af-c8d85675d6db",
            "resourceVersion": "10979451",
            "creationTimestamp": "2021-08-13T06:50:13Z",
            "deletionTimestamp": "2021-08-13T06:50:51Z",
...

Still seeing Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db"

I0813 07:04:35.170445       1 controller.go:1332] provision "default/ebs-claim-enc" class "ebs-sc-enc": started
I0813 07:04:35.170900       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979451", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/ebs-claim-enc"
I0813 07:04:45.170878       1 controller.go:1106] Temporary error received, adding PVC 1bef4ab2-2ee9-4a25-89af-c8d85675d6db to claims in progress
W0813 07:04:45.170906       1 controller.go:958] Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db", failure 10
E0813 07:04:45.170935       1 controller.go:981] error syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 07:04:45.171717       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979451", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 07:09:45.171173       1 controller.go:1332] provision "default/ebs-claim-enc" class "ebs-sc-enc": started
I0813 07:09:45.171513       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979451", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/ebs-claim-enc"
I0813 07:09:55.171555       1 controller.go:1106] Temporary error received, adding PVC 1bef4ab2-2ee9-4a25-89af-c8d85675d6db to claims in progress
W0813 07:09:55.171588       1 controller.go:958] Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db", failure 11
E0813 07:09:55.171609       1 controller.go:981] error syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 07:09:55.172164       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979451", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded

I was able to end this by restarting ebs-csi-controller.

Also I have tested with v1.1.3/v1.1.2, the issue not happened. Looks like this issue has appeared after adding the feature 'utilize latest go sdk to ensure createVolume idempotency' #982.

What you expected to happen?

'Createvolume' should not be requested after pvc deleted.

How to reproduce it (as minimally and precisely as possible)?

  1. Create SC with kmsKeyId, which is not provided enough permission to noderole

    encrypted: 'true'
    kmsKeyId: arn:aws:kms:ap-northeast-2:xxxxxxxxx:key/xxxxxxxxxx
    volumeBindingMode: WaitForFirstConsumer

  2. Create PVC/Pod -> failed to create PV, PVC is still pending

  3. Delete PVC/Pod

  4. You can see ebs-csi-controller keep retrying.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): 1.17, 1.20
  • Driver version: 1.2.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 13, 2021
@vdhanan vdhanan self-assigned this Aug 16, 2021
@wongma7
Copy link
Contributor

wongma7 commented Aug 19, 2021

Did you mean to link #982 ?

The logs are coming from the external provisioner controller library. However the version of external-provisioner/csi-provisioner sidecar is the same (v2.1.1) for driver versions 1.1-1.2.
The library intentionally ignores PVC deletion events https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L680. Instead, it expects the PVC cache to be up-to-date such that if a PVC has been deleted then it will be removed from the cache, then removed from the queue, and so stopped being retried https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L1015-L1018. But somehow the PVC is still in its cache so the controller keeps retrying anyway.

For the time the driver is printing these errors, is it provisioning volumes for any other PVCs? (trying to understand if the cache is stale somehow)

Also, based on my understanding of the error the driver is not actually sending CreateVolume API requests since the context deadline exceeded, but it is sending some pointless kubernetes API requests, plus of course being noisy

@imbohyun1
Copy link
Author

@wongma7

For the time the driver is printing these errors, is it provisioning volumes for any other PVCs? (trying to understand if the cache is stale somehow)

yes, I could see the new pvc was provisioning while the driver was printing this error, so as the test progressed, the number of CreateVolumes requests to create pvcs increased and the deleted pvcs continued to call CreateVolume as well.
It's not only sending kubernetes API requests, but also it was actually sending CreateVolume API. I was able to confirm with CloudTrail.

By the way, I tried to reproduce this issue again to check verbose log today, but the issue was not reproduced. Looking at the CloudTrail events, when CreateVolume was called and it failed with the client token is associated with a resource that is already deleted, the error code matched to Client.IdempotentParameterMismatch not Server.InternalError.

When the issue was reproduced, it seems like the cache data related to pvc was handled differently because of the error code from CreateVolume API was not expected error. I'll share the detailed information via Slack.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2021
@vdhanan vdhanan removed their assignment Dec 16, 2021
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 15, 2022
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

No branches or pull requests

5 participants