
Add shutdown listener #645

Merged · 1 commit · Oct 9, 2019

Conversation

@mogren (Contributor) commented Oct 7, 2019

Issue #608, #401

Description of changes:

Some changes to reduce the risk of leaking ENIs during CNI upgrades.

  • Added shutdown listener for SIGTERM and SIGINT.
  • Use exec /app/aws-k8s-agent to start the agent in order to propagate signals.
  • Stop detaching ENIs if ipamd is about to shut down.
  • Stop attaching ENIs and IPs if the node is about to shut down.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
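The listener described in the bullets above can be sketched roughly like this (a minimal, hypothetical version with made-up names; the actual wiring in ipamd differs):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// terminating is 0 until a shutdown signal arrives, then 1.
var terminating int32

// registerShutdownListener marks the process as terminating when
// SIGTERM or SIGINT is received, so long-running loops can stop
// attaching/detaching ENIs instead of racing with shutdown.
func registerShutdownListener() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sig
		fmt.Println("Received shutdown signal, setting 'terminating' to true")
		atomic.StoreInt32(&terminating, 1)
	}()
}

// isTerminating is checked before starting any new ENI/IP operation.
func isTerminating() bool {
	return atomic.LoadInt32(&terminating) > 0
}

func main() {
	registerShutdownListener()
	fmt.Println(isTerminating()) // false until a signal arrives
}
```

Note that this only works if the agent actually receives the signal, which is why the entrypoint switches to `exec /app/aws-k8s-agent`: without `exec`, the wrapper shell stays PID 1 and does not forward SIGTERM to the agent.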

@mogren mogren added this to the v1.6 milestone Oct 7, 2019
@jaypipes (Contributor) left a comment


Code lgtm. Couple minor suggestions, nothing major.

ipamd/datastore/data_store.go · ipamd/ipamd.go (resolved review threads)

ipamd/ipamd.go (outdated):
// reconcileCooldownCache keeps timestamps of the last time an IP address was unassigned from an ENI,
// so that we don't reconcile and add it back too quickly if IMDS lags behind reality.
reconcileCooldownCache ReconcileCooldownCache
terminating *int32 // Flag to warn that the pod is about to shut down.
Contributor:

you could just make this an int32 and then when doing the atomic.StoreInt32() call below, just refer to the address of this field, like so:

atomic.StoreInt32(&c.terminating, 1)

@mogren (Author):

Good idea, makes it cleaner. (Was testing the concept first with a boolean, but that's not thread safe.)

ipamd/ipamd.go (outdated):
c.awsClient = client

t := int32(0) // Initializing to 0, meaning 'false'
Contributor:

that way you wouldn't need the above line or temporary variable.
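The suggestion boils down to making `terminating` a plain int32 field and taking its address at the call sites, which removes both the pointer and the `t := int32(0)` temporary. A sketch (with a hypothetical, trimmed-down struct standing in for the real one in ipamd.go):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// IPAMContext is a stand-in for the real struct in ipamd/ipamd.go.
type IPAMContext struct {
	terminating int32 // Flag to warn that the pod is about to shut down.
}

// setTerminating flips the flag via the field's address — no pointer
// field or temporary variable is needed, and the zero value of the
// struct already means "not terminating".
func (c *IPAMContext) setTerminating() {
	atomic.StoreInt32(&c.terminating, 1)
}

func (c *IPAMContext) isTerminating() bool {
	return atomic.LoadInt32(&c.terminating) > 0
}

func main() {
	c := &IPAMContext{}
	fmt.Println(c.isTerminating()) // false
	c.setTerminating()
	fmt.Println(c.isTerminating()) // true
}
```

As the reply below notes, a plain bool would not be safe here: the flag is written from the signal-handler goroutine and read from others, so it needs atomic (or mutex-guarded) access.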

@mogren mogren force-pushed the add-shutdown-hook branch 2 times, most recently from 624eb22 to c8c7ff8 Compare October 9, 2019 20:45
@mattmb commented Apr 22, 2022

@mogren @jaypipes do you think this maybe needs to be extended to prevent ipamd making any other updates in addition to the ENI updates? I'm currently debugging some weirdness and spotted some updates being made after a SIGTERM which led me here. Are you sure that if we SIGKILL at any point things will always reconcile cleanly on the next start?

@jayanthvn (Contributor)

> @mogren @jaypipes do you think this maybe needs to be extended to prevent ipamd making any other updates in addition to the ENI updates? I'm currently debugging some weirdness and spotted some updates being made after a SIGTERM which led me here. Are you sure that if we SIGKILL at any point things will always reconcile cleanly on the next start?

Hi @mattmb - Can you please expand more on what other updates you are seeing?

@mattmb commented Apr 26, 2022

I don't have a full conclusion yet @jayanthvn but will keep you posted if I find something conclusive. This is more of a hunch right now, so I'm just interested in whether the design of the shutdown process has been thought about before. In our infra we issue certificates to authenticate IPAMd to k8s, and when we renew them we have to restart IPAMd. In addition, we had a config management bug where we were doing an unnecessary restart quite soon after boot. One symptom I've seen is stale IP rules that seem to be left behind when the Pod itself was terminated some time ago. Here's an example log from one host I was debugging that shows quite a bit of activity after the SIGTERM, before the SIGKILL:

{"level":"info","ts":"2022-04-19T09:07:34.994-0700","caller":"runtime/asm_amd64.s:1371","msg":"Received shutdown signal, setting 'terminating' to true"}
{"level":"info","ts":"2022-04-19T09:08:01.859-0700","caller":"ipamd/ipamd.go:705","msg":"Deleted ENI(eni-09d937e6f7f9eddc1)'s IP/Prefix 172.20.89.61/32 from datastore"}
{"level":"info","ts":"2022-04-19T09:08:01.859-0700","caller":"ipamd/ipamd.go:705","msg":"Deleted ENI(eni-09d937e6f7f9eddc1)'s IP/Prefix 172.20.71.58/32 from datastore"}
{"level":"info","ts":"2022-04-19T09:08:01.859-0700","caller":"ipamd/ipamd.go:705","msg":"Deleted ENI(eni-09d937e6f7f9eddc1)'s IP/Prefix 172.20.101.159/32 from datastore"}
{"level":"info","ts":"2022-04-19T09:08:01.859-0700","caller":"ipamd/ipamd.go:1961","msg":"Trying to unassign the following IPs [172.20.89.61 172.20.71.58 172.20.101.159] from ENI eni-09d937e6f7f9eddc1"}
{"level":"error","ts":"2022-04-19T09:08:04.055-0700","caller":"ipamd/ipamd.go:640","msg":"Error finding unassigned IPs for ENI eni-0f88ba0d81e64b3c6"}
{"level":"info","ts":"2022-04-19T09:08:09.753-0700","caller":"rpc/rpc.pb.go:519","msg":"Received DelNetwork for Sandbox 735dd25330d30a0b2f936bdfd52395c6ce6309af89b0fe0514d5cbba9b7f40e3"}
{"level":"info","ts":"2022-04-19T09:08:09.753-0700","caller":"datastore/data_store.go:1008","msg":"UnAssignPodIPv4Address: Unassign IP 172.20.99.254 from sandbox aws-cni/735dd25330d30a0b2f936bdfd52395c6ce6309af89b0fe0514d5cbba9b7f40e3/eth0"}
{"level":"info","ts":"2022-04-19T09:08:09.754-0700","caller":"ipamd/rpc_handler.go:204","msg":"UnassignPodIPv4Address: sandbox aws-cni/735dd25330d30a0b2f936bdfd52395c6ce6309af89b0fe0514d5cbba9b7f40e3/eth0's ipAddr 172.20.99.254, DeviceNumber 1"}
{"level":"info","ts":"2022-04-19T09:08:09.754-0700","caller":"rpc/rpc.pb.go:519","msg":"Send DelNetworkReply: IPv4Addr 172.20.99.254, DeviceNumber: 1, err: <nil>"}
{"level":"info","ts":"2022-04-19T09:08:13.707-0700","caller":"rpc/rpc.pb.go:519","msg":"Received DelNetwork for Sandbox a48196fc6d713fd542aa0305d9d47001dc962879c74a546f8034e00d687d6c8b"}
{"level":"info","ts":"2022-04-19T09:08:13.707-0700","caller":"datastore/data_store.go:1008","msg":"UnAssignPodIPv4Address: Unassign IP 172.20.77.92 from sandbox aws-cni/a48196fc6d713fd542aa0305d9d47001dc962879c74a546f8034e00d687d6c8b/eth0"}
{"level":"info","ts":"2022-04-19T09:08:13.707-0700","caller":"ipamd/rpc_handler.go:204","msg":"UnassignPodIPv4Address: sandbox aws-cni/a48196fc6d713fd542aa0305d9d47001dc962879c74a546f8034e00d687d6c8b/eth0's ipAddr 172.20.77.92, DeviceNumber 1"}
{"level":"info","ts":"2022-04-19T09:08:13.707-0700","caller":"rpc/rpc.pb.go:519","msg":"Send DelNetworkReply: IPv4Addr 172.20.77.92, DeviceNumber: 1, err: <nil>"}
{"level":"info","ts":"2022-04-19T09:08:33.712-0700","caller":"retry/retry.go:70","msg":"Successfully deleted ENI: eni-0dcb02a83ec89ab00"}
{"level":"error","ts":"2022-04-19T09:08:34.060-0700","caller":"ipamd/ipamd.go:640","msg":"Error finding unassigned IPs for ENI eni-0f88ba0d81e64b3c6"}
{"level":"info","ts":"2022-04-19T09:09:04.066-0700","caller":"ipamd/ipamd.go:705","msg":"Deleted ENI(eni-09d937e6f7f9eddc1)'s IP/Prefix 172.20.107.40/32 from datastore"}
{"level":"info","ts":"2022-04-19T09:09:04.066-0700","caller":"ipamd/ipamd.go:705","msg":"Deleted ENI(eni-09d937e6f7f9eddc1)'s IP/Prefix 172.20.72.208/32 from datastore"}
{"level":"info","ts":"2022-04-19T09:09:04.066-0700","caller":"ipamd/ipamd.go:1961","msg":"Trying to unassign the following IPs [172.20.107.40 172.20.72.208] from ENI eni-09d937e6f7f9eddc1"}
{"level":"error","ts":"2022-04-19T09:09:04.774-0700","caller":"ipamd/ipamd.go:640","msg":"Error finding unassigned IPs for ENI eni-0f88ba0d81e64b3c6"}
{"level":"info","ts":"2022-04-19T09:09:05.054-0700","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.9.1-2-g8ae2e186  ..."}
