Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't scale down ASGs when placeholder nodes are discovered #5846

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 5 additions & 11 deletions cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go
Original file line number Diff line number Diff line change
Expand Up @@ -277,10 +277,6 @@ func (m *asgCache) setAsgSizeNoLock(asg *asg, size int) error {
return nil
}

func (m *asgCache) decreaseAsgSizeByOneNoLock(asg *asg) error {
return m.setAsgSizeNoLock(asg, asg.curSize-1)
}

// DeleteInstances deletes the given instances. All instances must be controlled by the same ASG.
func (m *asgCache) DeleteInstances(instances []*AwsInstanceRef) error {
m.mutex.Lock()
Expand Down Expand Up @@ -308,11 +304,9 @@ func (m *asgCache) DeleteInstances(instances []*AwsInstanceRef) error {

for _, instance := range instances {
// check if the instance is a placeholder - a requested instance that was never created by the node group
// if it is, just decrease the size of the node group, as there's no specific instance we can remove
// if it is, simply remove it from the cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By removing the node(s) in the local cache, but not actually scaling down the corresponding ASG, the ongoing/pending request for additional capacity on the provider may impact what capacity is accessible subsequent requests. Basically the ongoing request is at best a capacity spoof and capacity dead-lock at worst.

Put another way, if you request a 100 instances in ASG_1, and get 25 before running into a InsufficientInstanceCapacity, it's probably not the right answer to leave that request on-going. It's not unreasonable to think that the ongoing request for an additional 75 nodes in ASG_1 could affect subsequent requests to provision instances of that type in the same or other ASGs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for taking a look at this. I'm not very familiar with what the placeholder instances are intended to do. There is probably a good reason why they are being created, but when I wrote this patch, they were only ever checked in this one location. Maybe this has changed since then?
If you take a look at the linked bug report, I've mapped how under certain circumstances the flow of the code leads to unsafe decommisioning of instances. I see your point as well, but from our experience, once a request runs into the insufficient capacity issue, it's considered completed (or better failed) and no longer ongoing. So by sending requests to scale the ASG back down, we're not canceling the failed request, but rather removing already existing capacity.

if m.isPlaceholderInstance(instance) {
klog.V(4).Infof("instance %s is detected as a placeholder, decreasing ASG requested size instead "+
"of deleting instance", instance.Name)
m.decreaseAsgSizeByOneNoLock(commonAsg)
klog.V(4).Infof("instance %s is detected as a placeholder", instance.Name)
} else {
// check if the instance is already terminating - if it is, don't bother terminating again
// as doing so causes unnecessary API calls and can cause the curSize cached value to decrement
Expand Down Expand Up @@ -341,10 +335,10 @@ func (m *asgCache) DeleteInstances(instances []*AwsInstanceRef) error {
return err
}
klog.V(4).Infof(*resp.Activity.Description)

// Proactively decrement the size so autoscaler makes better decisions
commonAsg.curSize--
}

// Proactively decrement the size so autoscaler makes better decisions
commonAsg.curSize--
}
return nil
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -543,7 +543,7 @@ func TestDeleteNodesWithPlaceholder(t *testing.T) {
}
err = asgs[0].DeleteNodes([]*apiv1.Node{node})
assert.NoError(t, err)
a.AssertNumberOfCalls(t, "SetDesiredCapacity", 1)
a.AssertNumberOfCalls(t, "SetDesiredCapacity", 0)
a.AssertNumberOfCalls(t, "DescribeAutoScalingGroupsPages", 1)

newSize, err := asgs[0].TargetSize()
Expand Down