fix queue-processor rbn draining by default #478

Merged · 4 commits · Aug 17, 2021
Changes from 2 commits
6 changes: 4 additions & 2 deletions README.md
@@ -34,7 +34,7 @@ The aws-node-termination-handler (NTH) can operate in two different modes: Insta…

The aws-node-termination-handler **[Instance Metadata Service](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html) Monitor** will run a small pod on each host to perform monitoring of IMDS paths like `/spot` or `/events` and react accordingly to drain and/or cordon the corresponding node.

-The aws-node-termination-handler **Queue Processor** will monitor an SQS queue of events from Amazon EventBridge for ASG lifecycle events, EC2 status change events, and Spot Interruption Termination Notice events. When NTH detects an instance is going down, we use the Kubernetes API to cordon the node to ensure no new work is scheduled there, then drain it, removing any existing work. The termination handler **Queue Processor** requires AWS IAM permissions to monitor and manage the SQS queue and to query the EC2 API.
+The aws-node-termination-handler **Queue Processor** will monitor an SQS queue of events from Amazon EventBridge for ASG lifecycle events, EC2 status change events, Spot Interruption Termination Notice events, and Spot Rebalance Recommendation events. When NTH detects an instance is going down, we use the Kubernetes API to cordon the node to ensure no new work is scheduled there, then drain it, removing any existing work. The termination handler **Queue Processor** requires AWS IAM permissions to monitor and manage the SQS queue and to query the EC2 API.

You can run the termination handler on any Kubernetes cluster running on AWS, including self-managed clusters and those created with Amazon [Elastic Kubernetes Service](https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html).
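For concreteness, here is a minimal sketch of that cordon-then-drain sequence using client-go and the `k8s.io/kubectl/pkg/drain` helper package. The `Helper` settings below are illustrative assumptions for a recent client-go/kubectl version, not NTH's actual configuration:

```go
package nthsketch

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// cordonThenDrain marks the node unschedulable first, so no new pods land on
// it, then evicts the pods that are already there.
func cordonThenDrain(client kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 context.TODO(),
		Client:              client,
		Force:               true, // also evict pods without a controller
		IgnoreAllDaemonSets: true, // daemonset pods would restart on the node anyway
		DeleteEmptyDirData:  true, // accept loss of emptyDir contents
		GracePeriodSeconds:  -1,   // respect each pod's own grace period
		Timeout:             120 * time.Second,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("get node %q: %w", nodeName, err)
	}
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return fmt.Errorf("cordon %q: %w", nodeName, err)
	}
	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
		return fmt.Errorf("drain %q: %w", nodeName, err)
	}
	return nil
}
```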

@@ -80,9 +80,11 @@ IMDS Processor Mode allows for a fine-grained configuration of IMDS paths that a…
- `enableRebalanceMonitoring`
- `enableScheduledEventDraining`

+By default, IMDS mode will only Cordon in response to a Rebalance Recommendation event (all other events are Cordoned and Drained). Cordon is the default for a rebalance event because it is not known whether an ASG is being utilized and whether that ASG is configured to replace the instance on a rebalance event. If you are using an ASG with rebalance recommendations enabled, then you can set the `enableRebalanceDraining` flag to true to perform a Cordon and Drain when a rebalance event is received.
Contributor:
very clear 👍

I think adding another row for this "configuration granularity" would also be helpful in the feature table. Not required for this PR so sending 🚀

Contributor (Author):

I went back and forth on adding it to the feature table. I think it actually makes sense to leave it off. The configuration granularity is available in queue-processor mode as well; you just have to do it via the EventBridge rules rather than command-line flags to NTH. For example, to ignore rebalance recommendations you could skip creating an EventBridge rule for them and turn rebalance recs off on your ASG. The cordon-only action isn't possible on its own, but if cordon-only logic is desired for a specific purpose, it would be fairly easy to set up with lifecycle hook timeouts and an increased pod termination grace period. It's probably never optimal to cordon-only without another system taking further action at some point. We only do it for safety so that you don't drain workloads that are not going to be backfilled.

Contributor:
ah so just change the row name to helm configuration granularity, but actually these are fair points 👍 I'm good leaving it off


The `enableSqsTerminationDraining` flag must be set to `false` for these configuration values to be considered.
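To make the flag interaction concrete, here is a hedged sketch of the IMDS-mode decision as a standalone function. The parameter names mirror the helm keys above; the function itself is an illustration, not NTH's code:

```go
// imdsAction returns what NTH does for an event in IMDS mode, given the
// relevant flags. Only the rebalance event defaults to cordon-only.
func imdsAction(isRebalance, enableRebalanceDraining, cordonOnly bool) string {
	if cordonOnly || (isRebalance && !enableRebalanceDraining) {
		return "cordon" // safety: replacement capacity may never arrive
	}
	return "cordon-and-drain"
}
```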

-The Queue Processor Mode does not allow for fine-grained configuration of which events are handled through helm configuration keys. Instead, you can modify your Amazon EventBridge rules to not send certain types of events to the SQS Queue so that NTH does not process those events.
+The Queue Processor Mode does not allow for fine-grained configuration of which events are handled through helm configuration keys. Instead, you can modify your Amazon EventBridge rules to not send certain types of events to the SQS Queue so that NTH does not process those events. All events when operating in Queue Processor mode are Cordoned and Drained unless the `cordon-only` flag is set to true.
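Queue Processor deployments typically create one EventBridge rule per event type, so skipping a rule is how you opt out. A hedged sketch of that routing, using the standard EventBridge detail-type strings (confirm the exact set against your NTH version's setup docs):

```go
// handled maps EventBridge detail-types to whether this cluster's rules
// forward them to NTH's SQS queue. Never creating the rule for a type
// means NTH simply never sees that event.
var handled = map[string]bool{
	"EC2 Spot Instance Interruption Warning":  true,
	"EC2 Instance Rebalance Recommendation":   false, // rule not created: ignored
	"EC2 Instance State-change Notification":  true,
	"EC2 Instance-terminate Lifecycle Action": true, // ASG lifecycle hook
}
```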


The `enableSqsTerminationDraining` flag turns on Queue Processor Mode. When Queue Processor Mode is enabled, IMDS mode cannot be active. NTH cannot respond to queue events AND monitor IMDS paths. Queue Processor Mode still queries for node information on startup, but this information is not required for normal operation, so it is safe to disable IMDS for the NTH pod.
42 changes: 21 additions & 21 deletions cmd/node-termination-handler.go
@@ -318,35 +318,35 @@ func drainOrCordonIfNecessary(interruptionEventStore *interruptioneventstore.Sto…

        runPreDrainTask(node, nodeName, drainEvent, metrics, recorder)
    }

-   podNameList, err := node.FetchPodNameList(nodeName)
-   if err != nil {
-       log.Err(err).Msgf("Unable to fetch running pods for node '%s' ", nodeName)
-   }
-   drainEvent.Pods = podNameList
-   err = node.LogPods(podNameList, nodeName)
-   if err != nil {
-       log.Err(err).Msg("There was a problem while trying to log all pod names on the node")
-   }
-
-   if nthConfig.CordonOnly || (drainEvent.IsRebalanceRecommendation() && !nthConfig.EnableRebalanceDraining) {
+   podNameList, err := node.FetchPodNameList(nodeName)
+   if err != nil {
+       log.Err(err).Msgf("Unable to fetch running pods for node '%s' ", nodeName)
+   }
+   drainEvent.Pods = podNameList
+   err = node.LogPods(podNameList, nodeName)
+   if err != nil {
+       log.Err(err).Msg("There was a problem while trying to log all pod names on the node")
+   }
+
+   if nthConfig.CordonOnly || (!nthConfig.EnableSQSTerminationDraining && drainEvent.IsRebalanceRecommendation() && !nthConfig.EnableRebalanceDraining) {
        err = cordonNode(node, nodeName, drainEvent, metrics, recorder)
    } else {
        err = cordonAndDrainNode(node, nodeName, metrics, recorder, nthConfig.EnableSQSTerminationDraining)
    }

    if nthConfig.WebhookURL != "" {
        webhook.Post(nodeMetadata, drainEvent, nthConfig)
    }

-   if err != nil {
-       <-interruptionEventStore.Workers
-   } else {
-       interruptionEventStore.MarkAllAsProcessed(nodeName)
-       if drainEvent.PostDrainTask != nil {
-           runPostDrainTask(node, nodeName, drainEvent, metrics, recorder)
-       }
-       <-interruptionEventStore.Workers
-   }
+   if err != nil {
+       <-interruptionEventStore.Workers
+   } else {
+       interruptionEventStore.MarkAllAsProcessed(nodeName)
+       if drainEvent.PostDrainTask != nil {
+           runPostDrainTask(node, nodeName, drainEvent, metrics, recorder)
+       }
+       <-interruptionEventStore.Workers
+   }

}
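The behavioral effect of the added `!nthConfig.EnableSQSTerminationDraining` guard, restated as a small truth-table sketch. Only the boolean expression is taken from the diff; the wrapper function and example calls are illustrative:

```go
// shouldCordonOnly mirrors the new condition above.
func shouldCordonOnly(cordonOnly, sqsMode, isRebalance, rebalanceDraining bool) bool {
	return cordonOnly || (!sqsMode && isRebalance && !rebalanceDraining)
}

// shouldCordonOnly(false, true, true, false)  == false // queue mode now drains rebalance events (the fix)
// shouldCordonOnly(false, false, true, false) == true  // IMDS mode still cordons only by default
// shouldCordonOnly(true, true, true, false)   == true  // cordon-only wins in every mode
```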

4 changes: 2 additions & 2 deletions config/helm/aws-node-termination-handler/README.md
@@ -92,7 +92,7 @@ Parameter | Description | Default

Parameter | Description | Default
--- | --- | ---
-`enableSqsTerminationDraining` | If true, this turns on queue-processor mode which drains nodes when an SQS termination event is received| `false`
+`enableSqsTerminationDraining` | If true, this turns on queue-processor mode which drains nodes when an SQS termination event is received. | `false`
`queueURL` | Listens for messages on the specified SQS queue URL | None
`awsRegion` | If specified, use the AWS region for AWS API calls, else NTH will try to find the region through AWS_REGION env var, IMDS, or the specified queue URL | ``
`checkASGTagBeforeDraining` | If true, check that the instance is tagged with "aws-node-termination-handler/managed" as the key before draining the node | `true`
@@ -107,8 +107,8 @@ Parameter | Description | Default
--- | --- | ---
`enableScheduledEventDraining` | [EXPERIMENTAL] If true, drain nodes before the maintenance window starts for an EC2 instance scheduled event | `false`
`enableSpotInterruptionDraining` | If true, drain nodes when the spot interruption termination notice is received | `true`
-`enableRebalanceMonitoring` | If true, cordon nodes when the rebalance recommendation notice is received | `false`
`enableRebalanceDraining` | If true, drain nodes when the rebalance recommendation notice is received | `false`
+`enableRebalanceMonitoring` | If true, cordon nodes when the rebalance recommendation notice is received. If you'd like to drain the node in addition to cordoning, then also set `enableRebalanceDraining`. | `false`
`useHostNetwork` | If `true`, enables `hostNetwork` for the Linux DaemonSet. NOTE: setting this to `false` may cause issues accessing IMDSv2 if your account is not configured with an IP hop count of 2 | `true`

### Kubernetes Configuration
2 changes: 1 addition & 1 deletion pkg/config/config.go
@@ -165,7 +165,7 @@ func ParseCliArgs() (config Config, err error) {
flag.BoolVar(&config.EnableScheduledEventDraining, "enable-scheduled-event-draining", getBoolEnv(enableScheduledEventDrainingConfigKey, enableScheduledEventDrainingDefault), "[EXPERIMENTAL] If true, drain nodes before the maintenance window starts for an EC2 instance scheduled event")
flag.BoolVar(&config.EnableSpotInterruptionDraining, "enable-spot-interruption-draining", getBoolEnv(enableSpotInterruptionDrainingConfigKey, enableSpotInterruptionDrainingDefault), "If true, drain nodes when the spot interruption termination notice is received")
flag.BoolVar(&config.EnableSQSTerminationDraining, "enable-sqs-termination-draining", getBoolEnv(enableSQSTerminationDrainingConfigKey, enableSQSTerminationDrainingDefault), "If true, drain nodes when an SQS termination event is received")
-   flag.BoolVar(&config.EnableRebalanceMonitoring, "enable-rebalance-monitoring", getBoolEnv(enableRebalanceMonitoringConfigKey, enableRebalanceMonitoringDefault), "If true, cordon nodes when the rebalance recommendation notice is received")
+   flag.BoolVar(&config.EnableRebalanceMonitoring, "enable-rebalance-monitoring", getBoolEnv(enableRebalanceMonitoringConfigKey, enableRebalanceMonitoringDefault), "If true, cordon nodes when the rebalance recommendation notice is received. If you'd like to drain the node in addition to cordoning, then also set \"enableRebalanceDraining\".")
flag.BoolVar(&config.EnableRebalanceDraining, "enable-rebalance-draining", getBoolEnv(enableRebalanceDrainingConfigKey, enableRebalanceDrainingDefault), "If true, drain nodes when the rebalance recommendation notice is received")
flag.BoolVar(&config.CheckASGTagBeforeDraining, "check-asg-tag-before-draining", getBoolEnv(checkASGTagBeforeDrainingConfigKey, checkASGTagBeforeDrainingDefault), "If true, check that the instance is tagged with \"aws-node-termination-handler/managed\" as the key before draining the node")
flag.StringVar(&config.ManagedAsgTag, "managed-asg-tag", getEnv(managedAsgTagConfigKey, managedAsgTagDefault), "Sets the tag to check for on instances that is propagated from the ASG before taking action, default to aws-node-termination-handler/managed")
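`getBoolEnv` is NTH's helper for reading a boolean default from the environment before flag parsing. A plausible sketch under that assumption; the real implementation may handle parse errors differently:

```go
import (
	"os"
	"strconv"
)

// getBoolEnv returns the boolean value of the environment variable named by
// key, falling back to def when the variable is unset or not a valid bool.
func getBoolEnv(key string, def bool) bool {
	raw, ok := os.LookupEnv(key)
	if !ok {
		return def
	}
	parsed, err := strconv.ParseBool(raw)
	if err != nil {
		return def
	}
	return parsed
}
```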