
Use local store for Pods using Security group #1313

Closed
SaranBalaji90 opened this issue Dec 7, 2020 · 6 comments

Comments

@SaranBalaji90
Contributor

What happened:
When ENABLE_POD_ENI is set to true, on the deletion path we query the API Server to retrieve ENI information for pods using security groups. This adds an API Server query to the deletion path, which could be avoided by using the local store to find the vlanID.

Code:
https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/rpc_handler.go#L198
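A minimal sketch of the proposal, assuming a simple in-memory store (all names here are hypothetical, not the actual ipamd datastore API): record the vlanID at ADD time keyed by the container (sandbox) ID, and consult that store on DEL before falling back to the API Server.

```go
package main

import "fmt"

// vlanStore is a hypothetical local store mapping container (sandbox) IDs
// to the vlanID assigned when the pod's branch ENI was set up.
type vlanStore struct {
	byContainer map[string]int
}

func newVlanStore() *vlanStore {
	return &vlanStore{byContainer: make(map[string]int)}
}

// RecordAdd is called on the CNI ADD path after the vlan is provisioned.
func (s *vlanStore) RecordAdd(containerID string, vlanID int) {
	s.byContainer[containerID] = vlanID
}

// LookupForDelete returns the vlanID for a container on the DEL path.
// ok == false means the caller would have to fall back to the API Server
// query that this issue proposes to avoid in the common case.
func (s *vlanStore) LookupForDelete(containerID string) (int, bool) {
	v, ok := s.byContainer[containerID]
	return v, ok
}

func main() {
	s := newVlanStore()
	s.RecordAdd("84afbf29", 1)
	if v, ok := s.LookupForDelete("84afbf29"); ok {
		fmt.Println("vlanID from local store:", v) // no API Server call needed
	}
}
```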

@SaranBalaji90
Contributor Author

SaranBalaji90 commented Feb 9, 2021

For v1: we should read the data from the API Server and populate the local file.

There are two options when a new pod gets the same vlanID (for whatever reason):

  1. When a new pod uses a vlan that is already recorded in the local file, don't provision the pod until the existing pod is cleaned up. This might delay the new pod's launch.
  2. Overwrite the file with the new containerID and skip deletion of the old pod's vlan. This requires cleaning up the hostVeth rules as is done today, otherwise there will be multiple hostVeth rules for a pod.
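Option 2 above can be sketched as follows (a hypothetical file-backed map, shown in memory here; not the actual implementation): claiming a vlan overwrites the entry with the new containerID and reports the stale owner whose hostVeth rules still need cleanup.

```go
package main

import "fmt"

// vlanOwners is a hypothetical map from vlanID to the containerID that
// currently owns it (in practice this would be persisted to a local file).
type vlanOwners map[int]string

// Claim implements option 2: overwrite the entry with the new containerID
// and return the stale owner (if any) whose hostVeth rules must still be
// cleaned up, to avoid multiple hostVeth rules for a pod.
func (o vlanOwners) Claim(vlanID int, containerID string) (stale string, collided bool) {
	stale, collided = o[vlanID]
	o[vlanID] = containerID
	return stale, collided && stale != containerID
}

func main() {
	owners := vlanOwners{}
	owners.Claim(1, "old-container")
	if stale, collided := owners.Claim(1, "new-container"); collided {
		fmt.Println("vlan 1 reused; clean up hostVeth rules left by", stale)
	}
}
```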

@Shreya027
Contributor

/assign Shreya027

@abhipth
Contributor

abhipth commented Aug 19, 2021

Since the detailed impact of the issue is not covered anywhere else, I will put all the info from my investigation in this thread.

IPAMD relies on the Pod Object from the API Server to get the Pod Annotation, which carries the IPv4 address and other details required to clean up the host networking. With terminationGracePeriodSeconds set to 0, the Pod is deleted immediately and IPAMD cannot get the Pod Object to clean up the networking. This can have a twofold impact:

  1. The Pod's networking will not be cleaned up. [Already covered in the documentation]
  2. The networking of a different, still-running Pod can be removed.

I believe the first issue is not as harmful as the second one. The second issue can happen to pods that use the same namespace/name and happen to land on the same Node.

Here's an example of the second issue:

Running Pod

sgp-job        vpc-resource-controller-integration-pod    1/1     Running     0          124m   192.168.7.189    ip-192-168-28-218.us-west-2.compute.internal   <none>           <none>

Pod's ENI details from the Pod's Annotation

{"eniId":"eni-002f9615b9351f1cf","ifAddress":"02:52:b0:d6:75:31","privateIp":"192.168.7.189","vlanId":1,"subnetCidr":"192.168.0.0/19"}

ip link for the Pod is UP

191: vlan49ec9fd1549@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP mode DEFAULT group default 
    link/ether 26:a5:f5:db:ab:7d brd ff:ff:ff:ff:ff:ff link-netnsid 2

ip rule for this pod is missing

sh-4.2# ip rule
20:	from all lookup local 
512:	from all to 192.168.20.249 lookup main 
512:	from all to 192.168.29.194 lookup main 
1024:	from all fwmark 0x80/0x80 lookup main 
32766:	from all lookup main 
32767:	from all lookup default 

route table 100 + vlan id (i.e. table 101 for vlan 1) is present though

ip route show table 101
192.168.7.189 dev vlan49ec9fd1549 scope link 
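The table number convention above (a base of 100 plus the vlan id) can be expressed as:

```go
package main

import "fmt"

// routeTableForVlan returns the per-vlan route table number used in the
// example above: a base of 100 plus the vlan id, so vlan 1 -> table 101.
func routeTableForVlan(vlanID int) int {
	return 100 + vlanID
}

func main() {
	fmt.Println(routeTableForVlan(1)) // 101, matching `ip route show table 101`
}
```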

Sequence of events from the plugin.logs

  1. Delete request for the older Pod with the same namespace/name. The request fails because the Pod Object has already been deleted from etcd by the time IPAMD tries to remove its networking (Error while trying to retrieve Pod Info: pods "vpc-resource-controller-integration-pod" not found):
{"level":"info","ts":"2021-08-19T07:41:21.381Z","caller":"routed-eni-cni-plugin/cni.go:240","msg":"Received CNI del request: ContainerID(84afbf29138a8c4cb7dbf8f361b05aced36bea3583deecc2a74366176903ae68) Netns(/proc/30192/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=sgp-job;K8S_POD_NAME=vpc-resource-controller-integration-pod;K8S_POD_INFRA_CONTAINER_ID=84afbf29138a8c4cb7dbf8f361b05aced36bea3583deecc2a74366176903ae68) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}

{"level":"error","ts":"2021-08-19T07:41:21.389Z","caller":"routed-eni-cni-plugin/cni.go:240","msg":"Error received from DelNetwork gRPC call for container 84afbf29138a8c4cb7dbf8f361b05aced36bea3583deecc2a74366176903ae68: rpc error: code = Unknown desc = Error while trying to retrieve Pod Info: pods \"vpc-resource-controller-integration-pod\" not found"}
  2. Add request for the new pod with the same namespace/name:
{"level":"info","ts":"2021-08-19T07:43:53.888Z","caller":"routed-eni-cni-plugin/cni.go:111","msg":"Received CNI add request: ContainerID(63b99a817524a9e88762c1ec45365e3d5a13606a71434806c36aa0b8f1ec3459) Netns(/proc/9389/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=sgp-job;K8S_POD_NAME=vpc-resource-controller-integration-pod;K8S_POD_INFRA_CONTAINER_ID=63b99a817524a9e88762c1ec45365e3d5a13606a71434806c36aa0b8f1ec3459) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
  3. The failed delete request for the container in step 1 is received again, and this time it deletes the networking of the running pod:
{"level":"info","ts":"2021-08-19T07:44:51.459Z","caller":"routed-eni-cni-plugin/cni.go:240","msg":"Received CNI del request: ContainerID(84afbf29138a8c4cb7dbf8f361b05aced36bea3583deecc2a74366176903ae68) Netns() IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=sgp-job;K8S_POD_NAME=vpc-resource-controller-integration-pod;K8S_POD_INFRA_CONTAINER_ID=84afbf29138a8c4cb7dbf8f361b05aced36bea3583deecc2a74366176903ae68) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"info","ts":"2021-08-19T07:44:51.465Z","caller":"routed-eni-cni-plugin/cni.go:240","msg":"Received del network response for pod vpc-resource-controller-integration-pod namespace sgp-job sandbox 84afbf29138a8c4cb7dbf8f361b05aced36bea3583deecc2a74366176903ae68: Success:true IPv4Addr:\"192.168.7.189\" PodVlanId:1 "}
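One way to prevent the stale retry in step 3 from tearing down the new pod (a sketch only, not the fix that actually shipped) is to track which sandbox currently owns the pod's networking and ignore DEL requests whose containerID no longer matches. All names below are hypothetical.

```go
package main

import "fmt"

// owner tracks, per namespace/name, the sandbox (container) ID that
// currently owns the pod's host networking.
type owner map[string]string

func key(ns, name string) string { return ns + "/" + name }

// ShouldTearDown returns true only if the DEL request's containerID
// matches the current owner; a stale retry for an old sandbox is ignored.
func (o owner) ShouldTearDown(ns, name, containerID string) bool {
	return o[key(ns, name)] == containerID
}

func main() {
	o := owner{}
	// The new sandbox (from the ADD in step 2) now owns the networking.
	o[key("sgp-job", "vpc-resource-controller-integration-pod")] = "63b99a81"
	// The stale retry of the old sandbox's DEL must be ignored:
	fmt.Println(o.ShouldTearDown("sgp-job", "vpc-resource-controller-integration-pod", "84afbf29")) // false
	fmt.Println(o.ShouldTearDown("sgp-job", "vpc-resource-controller-integration-pod", "63b99a81")) // true
}
```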

I am able to exec into this Pod. However, networking to and from the Pod is lost due to the issue.

The current recommendation is to set terminationGracePeriodSeconds (to a non-zero value) so that you don't run into this issue, until it is fixed in IPAMD.

@jayanthvn
Contributor

With the 1.10.2 release, we use the previous result instead of querying the API Server. Closing this issue.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

@joebowbeer

@abhipth I referred to your excellent explanation in awsdocs/amazon-eks-user-guide#557
