Unpaginated calls to DescribeNetworkInterfaces getting blocked by EC2 #188

Open
ejholmes opened this issue Mar 15, 2023 · 3 comments · Fixed by #375
Labels
bug Something isn't working

Comments

@ejholmes

Describe the Bug:

We've been in the process of rolling out security groups for pods across various EKS clusters in multiple AWS accounts that we own. Recently we attempted to do the same in one of our "larger" AWS accounts and noticed that pods would hang indefinitely, waiting for vpc-resource-controller to attach an ENI.

We recently raised the ENI limit on this account to a very large value and were told that, as a result, EC2 would block unpaginated calls to DescribeNetworkInterfaces. This appears to be affecting vpc-resource-controller, as this CloudTrail event shows:

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "<REDACTED>:amazon-vpc-resource-controller-k8s",
        "arn": "arn:aws:sts::<REDACTED>:assumed-role/<REDACTED>/amazon-vpc-resource-controller-k8s",
        "accountId": "<REDACTED>",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "<REDACTED>",
                "arn": "arn:aws:iam::<REDACTED>:role/<REDACTED>-control-20210716225641340500000009",
                "accountId": "<REDACTED>",
                "userName": "<REDACTED>-control-20210716225641340500000009"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-03-14T21:23:53Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "eks.amazonaws.com"
    },
    "eventTime": "2023-03-14T21:35:11Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "DescribeNetworkInterfaces",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "eks.amazonaws.com",
    "userAgent": "eks.amazonaws.com",
    "errorCode": "Client.OperationNotPermitted",
    "errorMessage": "This operation is not permitted.",
    "requestParameters": {
        "networkInterfaceIdSet": {},
        "filterSet": {
            "items": [
                {
                    "name": "tag:vpcresources.k8s.aws/trunk-eni-id",
                    "valueSet": {
                        "items": [
                            {
                                "value": "eni-019dd4ee9f4aa06c8"
                            }
                        ]
                    }
                }
            ]
        }
    },
    "responseElements": null,
    "requestID": "8fd9723c-c4fa-4dd2-82be-47d8f328bf8e",
    "eventID": "fd8c44a0-b04c-430d-9e77-8c478b0f2de3",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "<REDACTED>",
    "eventCategory": "Management"
}
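For comparison, a paginated form of the request in this event would keep the same `tag:vpcresources.k8s.aws/trunk-eni-id` filter but add `MaxResults` and follow `NextToken` until EC2 stops returning one. The Python sketch below assumes a boto3-style client whose `describe_network_interfaces(Filters=..., MaxResults=..., NextToken=...)` returns a dict with `NetworkInterfaces` and an optional `NextToken`; `list_branch_enis` and the in-memory `FakeEC2` stub are hypothetical names for illustration, not vpc-resource-controller code.

```python
from typing import Any, Dict, List, Optional

# Filter name taken from the CloudTrail event above.
TRUNK_TAG_FILTER = "tag:vpcresources.k8s.aws/trunk-eni-id"


def list_branch_enis(client: Any, trunk_eni_id: str, page_size: int = 100) -> List[Dict[str, Any]]:
    """Collect every ENI tagged with trunk_eni_id, one bounded page per call."""
    enis: List[Dict[str, Any]] = []
    token: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {
            "Filters": [{"Name": TRUNK_TAG_FILTER, "Values": [trunk_eni_id]}],
            "MaxResults": page_size,  # bounded page size makes this a paginated call
        }
        if token:
            kwargs["NextToken"] = token
        resp = client.describe_network_interfaces(**kwargs)
        enis.extend(resp.get("NetworkInterfaces", []))
        token = resp.get("NextToken")
        if not token:  # no NextToken: this was the last page
            return enis


class FakeEC2:
    """Hypothetical in-memory stand-in for an EC2 client (handles one tag filter only)."""

    def __init__(self, enis: List[Dict[str, Any]]) -> None:
        self.enis = enis
        self.calls = 0

    def describe_network_interfaces(self, Filters, MaxResults, NextToken=None):
        self.calls += 1
        key = Filters[0]["Name"][len("tag:"):]
        values = set(Filters[0]["Values"])
        matches = [e for e in self.enis
                   if any(t["Key"] == key and t["Value"] in values for t in e.get("TagSet", []))]
        start = int(NextToken or 0)
        resp = {"NetworkInterfaces": matches[start:start + MaxResults]}
        if start + MaxResults < len(matches):
            resp["NextToken"] = str(start + MaxResults)  # opaque to callers
        return resp
```

The EC2 API reference bounds `MaxResults` for this operation (5 to 1000), so `page_size` is a value to tune rather than a fixed choice.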

For AWS engineers, our case # is 12229078111

Observed Behavior:

EC2 rejects calls to DescribeNetworkInterfaces with the Client.OperationNotPermitted error code.

Expected Behavior:

Security groups for pods (SGP) should keep working even when EC2 blocks unpaginated calls.

How to reproduce it (as minimally and precisely as possible):

Raise the ENI limit in an AWS account high enough that EC2 starts blocking unpaginated calls to DescribeNetworkInterfaces.

Additional Context:

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.16-eks-ffeb93d", GitCommit:"52e500d139bdef42fbc4540c357f0565c7867a81", GitTreeState:"clean", BuildDate:"2022-11-29T18:41:42Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
  • CNI Version 1.11.4
  • OS (Linux/Windows): Amazon Linux (EKS AMI)
@ejholmes ejholmes added the bug Something isn't working label Mar 15, 2023
@haouc
Contributor

haouc commented Mar 15, 2023

@ejholmes thanks for reporting the issue. As you mentioned, EC2 shouldn't block calls just because they are unpaginated. I have found the request ID in our system; we need to check with the EC2 team for the root cause. Thanks.

@haouc haouc self-assigned this Mar 30, 2023
@ejholmes
Author

An update for anyone else who runs into this: EC2 was blocking our account from calling ec2:DescribeNetworkInterfaces. We discovered that Glue was leaking ENIs and leaving them orphaned. With some effort we cleaned those up and lowered our ENI limit (we were told it needed to be under 20k for us to be unblocked).

Once that was done, and EC2 unblocked us from ec2:DescribeNetworkInterfaces, security groups for pods was functional in the account again.
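The cleanup arithmetic described above can be sketched roughly as: tally ENIs by status, treat detached ones as cleanup candidates, and check whether the ENIs that remain would fit under the lower limit. This is a hypothetical Python helper, not an AWS tool; it assumes ENI records shaped like the `NetworkInterfaces` entries EC2 returns (with a `Status` field), assumes leaked ENIs show up as `available`, and uses the 20k figure quoted in this thread as the default `target_limit`.

```python
from collections import Counter
from typing import Any, Dict, Iterable

# Figure quoted in this thread for one account; not a published EC2 limit.
TARGET_ENI_LIMIT = 20_000


def cleanup_plan(enis: Iterable[Dict[str, Any]], target_limit: int = TARGET_ENI_LIMIT) -> Dict[str, Any]:
    """Summarize ENIs by status and check whether the survivors fit under target_limit."""
    by_status = Counter(e.get("Status", "unknown") for e in enis)
    total = sum(by_status.values())
    # Assumption: leaked ENIs are detached, so EC2 reports them with status "available".
    orphaned = by_status.get("available", 0)
    remaining = total - orphaned
    return {
        "total": total,
        "orphaned": orphaned,
        "remaining": remaining,
        # The lowered limit still has to cover every ENI that stays.
        "fits_under_target": remaining < target_limit,
    }
```

Not every `available` ENI is safe to delete, so output like this is a starting point for review rather than a delete list.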

@haouc
Contributor

haouc commented Aug 23, 2023

@ejholmes sorry for the late response. Glad to hear it works now. We are exploring whether we can re-architect the workflow that manages interfaces so that it uses paginated API calls. We can't enable pagination today because a paginated call doesn't guarantee the number of response pages, so the number of API calls we make depends on how many pages EC2 returns. If the pagination loop runs long enough, it risks consuming the user's API request limit and getting the account throttled. I will keep this issue open for tracking until we have an approach to share. Thanks.
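One way to bound the cost described here is to cap the number of pages fetched per pass, space the calls out, and hand back the `NextToken` when the cap is hit so a later pass can resume. The Python sketch below illustrates that idea under stated assumptions: `paginate_with_budget`, `max_pages` and `min_interval_s` are hypothetical names, not part of the controller or any AWS SDK, and `fetch_page` stands in for a DescribeNetworkInterfaces call that accepts a `NextToken`.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


def paginate_with_budget(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
    max_pages: int,
    start_token: Optional[str] = None,
    min_interval_s: float = 0.0,
    items_key: str = "NetworkInterfaces",
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[Any], Optional[str]]:
    """Follow NextToken for at most max_pages calls.

    Returns (items, next_token). next_token is None when the listing is
    complete; otherwise the caller can resume from it on a later pass.
    """
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    items: List[Any] = []
    token = start_token
    for page_no in range(max_pages):
        if page_no > 0 and min_interval_s > 0:
            # Space out calls so one listing does not burn through the account's request budget.
            sleep(min_interval_s)
        resp = fetch_page(token)
        items.extend(resp.get(items_key, []))
        token = resp.get("NextToken")
        if not token:
            return items, None  # EC2 returned the last page
    return items, token  # budget exhausted; more pages remain
```

Resuming from a saved token assumes EC2 still accepts it on the next pass; if it does not, the caller would have to restart the listing from the first page.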


3 participants