fix: correctly handle lack of capacity of AWS spot ASGs #2008
Conversation
Force-pushed from 6fa2503 to 31d13f0 (Compare)
cluster-autoscaler/core/utils.go (Outdated)
// a failed scale up
if wasPlaceholder {
	klog.Warningf("Timeout trying to scale node group %s, enabling backoff for the group", nodeGroup.Id())
	clusterStateRegistry.RegisterFailedScaleUp(nodeGroup, metrics.Timeout, time.Now())
The failed scale-up logic should happen regardless, triggered from https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/clusterstate/clusterstate.go#L237. At minimum this will mess up metrics (you'll double-count failed scale-ups); not sure if it breaks anything else.

I think adding the placeholders alone is enough; all the changes outside the AWS cloudprovider are redundant (and mess up metrics). The timeout/backoff mechanism has worked for years in GCP, so all the core parts should be there already (it may not be obvious from the code because in GCP placeholder instances are created by the MIG, so they're already returned by API calls to GCP).
Hmm, so if tracking scaleUpRequest is enough, why didn't it work in the first place? After scaleUpRequest times out, […]

OK, that might be it. Let me remove the code in […]
I may be missing some context here, but I think I recall there was a problem with the ASG immediately changing its target size back to the actual size when the instance fails to start (see #1263). So scaleUpRequest never times out.
@@ -323,6 +350,11 @@ func (m *asgCache) regenerate() error {
	return err
}

// If currently any ASG has more Desired than running Instances, introduce placeholders
// for the instances to come up. This is required to track Desired instances that
// will never come up, like with Spot Request that can't be fulfilled
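For illustration, here is a self-contained sketch of the placeholder idea described in the comment above. The `asgInfo` struct and the `fillInPlaceholders` function are simplified stand-ins invented for this example, not the provider's real types; only `placeholderInstanceNamePrefix` mirrors the constant used elsewhere in this PR.

```go
package main

import "fmt"

// placeholderInstanceNamePrefix mirrors the constant used by the PR's AWS provider code.
const placeholderInstanceNamePrefix = "i-placeholder-"

// asgInfo is a simplified stand-in for the provider's cached ASG state.
type asgInfo struct {
	Name      string
	Desired   int
	Instances []string // IDs of instances actually reported by the ASG
}

// fillInPlaceholders appends synthetic instance IDs so the cache tracks Desired
// entries even when the ASG could not start all of them (e.g. no spot capacity).
func fillInPlaceholders(g *asgInfo) {
	real := len(g.Instances)
	for i := real; i < g.Desired; i++ {
		g.Instances = append(g.Instances, fmt.Sprintf("%s%d", placeholderInstanceNamePrefix, i))
	}
}

func main() {
	g := &asgInfo{Name: "spot-asg", Desired: 5, Instances: []string{"i-0abc", "i-0def"}}
	fillInPlaceholders(g)
	fmt.Println(g.Instances) // [i-0abc i-0def i-placeholder-2 i-placeholder-3 i-placeholder-4]
}
```

The synthetic IDs give the core logic an identity to track, so its normal node-startup timeout can fire for capacity that never materializes.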
We do see several issues that lead to desired > current. Sometimes on-demand nodes can also run out of capacity, e.g. when hitting account limits, or nodes with strict constraints such as membership in a single placement group.
OK, I think I know better now what is going on. Indeed, after scaling up a node group that won't increase its size, the […]. Still, my first proposal is (almost) good. The problem is that at some point the node group needs to be scaled down by the […]. The above approach is needed, as we can't rely on […]. Am I right? What do you think? ==edit== […]
Force-pushed from 31d13f0 to 4000692 (Compare)
Thanks a lot @piontec! I managed to test it and it seems to be working like a charm 👌
cluster-autoscaler/core/utils.go (Outdated)
logRecorder.Eventf(apiv1.EventTypeWarning, "DeleteUnregisteredFailed",
	"Failed to remove node %s: %v", unregisteredNode.Node.Name, err)
return removedAny, err
_, wasPlaceholder := err.(*cloudprovider.PlaceholderDeleteError)
I do not believe this is the right approach: determining the nature of the deleted instance based on the error returned from `DeleteNodes()` (especially given the fact that `DeleteNodes` takes multiple nodes as an argument). Instead we should base the behavior on what we get from `NodeGroup.Nodes()`.

Actually, it seems you already have all the needed fields in the interface. `NodeGroup.Nodes()` returns an `InstanceStatus` for each node, and there is an `InstanceState` field which can currently be one of `Instance{Running,Creating,Deleting}`. It seems to me that you want to back off the node group if a deleted node is in the `InstanceCreating` state. If you think you need to extend `InstanceStatus` in some way (e.g. introduce more error classes or something), we can discuss it.
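A rough sketch of the check being suggested, using simplified local stand-ins for the cloudprovider types named above (`Instance`, `InstanceStatus`, `InstanceState`); the real definitions live in the cluster-autoscaler `cloudprovider` package and may differ in detail.

```go
package main

import "fmt"

// Simplified stand-ins for the cloudprovider types mentioned in the review.
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceCreating
	InstanceDeleting
)

type InstanceStatus struct {
	State InstanceState
}

type Instance struct {
	Id     string
	Status *InstanceStatus
}

// shouldBackoffNodeGroup reports whether any node being removed never left the
// creation phase, i.e. the node group asked for capacity it could not deliver
// (the spot case discussed in this PR).
func shouldBackoffNodeGroup(deleted []Instance) bool {
	for _, inst := range deleted {
		if inst.Status != nil && inst.Status.State == InstanceCreating {
			return true
		}
	}
	return false
}

func main() {
	deleted := []Instance{
		{Id: "i-0abc", Status: &InstanceStatus{State: InstanceRunning}},
		{Id: "i-placeholder-4", Status: &InstanceStatus{State: InstanceCreating}},
	}
	fmt.Println(shouldBackoffNodeGroup(deleted)) // true
}
```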
I don't think the above is right. We want to have feedback about the whole `DeleteNodes()` call; we don't need this to be instance-by-instance. The thing is: if you requested 4 new instances, 4 came up successfully, but there's no spot capacity for the 5th, the ASG should be "backed off" exactly the same as if no instance was started.

Of course, we can set `InstanceState` for each instance and then, after the call, iterate the instances again to check whether any is not coming up, but I'd like to introduce a new state for that, like `CreationTimeout`.
> I don't think the above is right. We want to have feedback about the whole DeleteNodes(), we don't need this to be instance-by-instance. The thing is: if you requested 4 new instances, 4 came up successfully, but there's no spot capacity for the 5th, the ASG should be "backed-off" exactly the same as if no instance was started.

I agree that we want to back off in such a case.

> Of course, we can set InstanceState for each instance, then after the call iterate instances again to check if any is not coming up, but I'd like to introduce a new state for that, like CreationTimeout

`InstanceState` should still be `InstanceCreating`. Maybe the wording is not the best, but `InstanceCreating` means both:

- an instance which is being created, where we do not yet know the outcome of that process
- an instance whose creation resulted in a failure

For the latter case, the `ErrorInfo` in the `InstanceStatus` is set to a non-nil value, and IMO you should use this field to differentiate between successful and failed creation. It seems that you can still use `OtherErrorClass` when you set `ErrorInfo`.
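A minimal sketch of the status shape being described: the state stays `InstanceCreating`, and a non-nil `ErrorInfo` marks the failed creation. The type and field names follow the discussion above but are defined locally for illustration; treat them as approximations of the upstream definitions, and the error code and message values as made up.

```go
package main

import "fmt"

// Illustrative stand-ins for the names used in the discussion.
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceCreating
)

type InstanceErrorClass int

const (
	OutOfResourcesErrorClass InstanceErrorClass = iota
	OtherErrorClass
)

type InstanceErrorInfo struct {
	ErrorClass   InstanceErrorClass
	ErrorCode    string
	ErrorMessage string
}

type InstanceStatus struct {
	State     InstanceState
	ErrorInfo *InstanceErrorInfo
}

// creationFailed distinguishes "still creating" from "creation failed":
// both keep State == InstanceCreating, but only the failure carries ErrorInfo.
func creationFailed(s InstanceStatus) bool {
	return s.State == InstanceCreating && s.ErrorInfo != nil
}

func main() {
	pending := InstanceStatus{State: InstanceCreating}
	failed := InstanceStatus{
		State: InstanceCreating,
		ErrorInfo: &InstanceErrorInfo{
			ErrorClass:   OtherErrorClass,
			ErrorCode:    "placeholder-timeout",
			ErrorMessage: "requested spot instance never came up",
		},
	}
	fmt.Println(creationFailed(pending), creationFailed(failed)) // false true
}
```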
OK, I think I can have a look at that today. BTW, are any of you going to KubeCon Barcelona next week?
Thanks. I will be in BCN, and a bunch of other people from the autoscaling team will be there too.
There's a problem with this approach. It's not easy to update `Node.InstanceState` within the AWS provider, as it only keeps track of instance refs (a single string); this is common throughout the provider. Only when `Nodes()` is called is the array of strings (instance IDs) converted to `[]Nodes`, so there's no easy way to keep additional state for nodes without heavily changing the interfaces and return types of the AWS provider.
I understand. Yet I do not believe that changing the signature of `DeleteNodes` is the way to go. I do not like the idea of making point changes to the general interface just so it is somewhat less of an effort to support it in one specific implementation.

If exposing `InstanceState` for all the instances in AWS is problematic, maybe you can fairly easily set it for the placeholder nodes only?
Force-pushed from 4000692 to 1cfffab (Compare)
I did just that; please check now, although it's still not pretty.

ping @losipiuk @MaciekPytel, what do you think now?
scaleToZeroSupported          = true
placeholderInstanceNamePrefix = "i-placeholder-"
// TimeoutedPlaceholderName is used to mark placeholder instances that did not come up with timeout
TimeoutedPlaceholderName = "i-timeouted-placeholder"
s/timeouted/timedout/
return err
// check if the instance is a placeholder - a requested instance that was never created by the node group
// if it is, just decrease the size of the node group, as there's no specific instance we can remove
matched, err := regexp.MatchString(fmt.Sprintf("^%s\\d+$", placeholderInstanceNamePrefix), instance.Name)
We could use `regexp.Compile` to compile this regular expression just once, given that the prefix string is fixed. The resulting `regexp.Regexp` value is safe for repeated and concurrent use.
Also, it would be nicer to extract an `isPlaceHolderInstance(instance)` method instead of inlining `regexp.MatchString`.
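A sketch combining both suggestions: compile the pattern once at package level and wrap the check in a small helper. The helper name follows the reviewer's suggestion; the `Instance` type here is a simplified stand-in for the provider's instance reference.

```go
package main

import (
	"fmt"
	"regexp"
)

const placeholderInstanceNamePrefix = "i-placeholder-"

// Compiled once; a *regexp.Regexp is safe for repeated and concurrent use.
var placeholderInstanceRegexp = regexp.MustCompile(`^` + placeholderInstanceNamePrefix + `\d+$`)

// Instance is a simplified stand-in for the provider's instance reference.
type Instance struct {
	Name string
}

// isPlaceHolderInstance reports whether the instance is one of the synthetic
// entries created when the ASG's desired size exceeds its running instances.
func isPlaceHolderInstance(instance *Instance) bool {
	return placeholderInstanceRegexp.MatchString(instance.Name)
}

func main() {
	fmt.Println(isPlaceHolderInstance(&Instance{Name: "i-placeholder-3"}))     // true
	fmt.Println(isPlaceHolderInstance(&Instance{Name: "i-0123456789abcdef0"})) // false
}
```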
klog.V(4).Infof("instance %s is detected as a placeholder, decreasing ASG requested size instead "+
	"of deleting instance", instance.Name)
m.decreaseAsgSizeByOneNoLock(commonAsg)
// mark this instance using its name as a timeouted placeholder
s/timeouted/timed out/
for i := real; i < desired; i++ {
	id := fmt.Sprintf("%s%d", placeholderInstanceNamePrefix, i)
	klog.V(4).Infof("Instance group %s has only %d instances created while requested count is %d."+
		"Creating placeholder instance with ID %s", *g.AutoScalingGroupName, real, desired, id)
Either add a leading space before "Creating" or a trailing space after the period on the preceding line. As written, the two concatenated sentences will wind up with no intervening space character.
Also, punctuate the second sentence like the first one.
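To make the point concrete, a tiny standalone snippet showing the message before and after the fix (plain `fmt` instead of `klog` so it runs on its own; the variable values are made up):

```go
package main

import "fmt"

func main() {
	name, real, desired, id := "spot-asg", 2, 5, "i-placeholder-2"

	// As written in the diff: the two sentences run together with no intervening space.
	fmt.Printf("Instance group %s has only %d instances created while requested count is %d."+
		"Creating placeholder instance with ID %s\n", name, real, desired, id)

	// With a space after the period and a terminal period on the second sentence:
	fmt.Printf("Instance group %s has only %d instances created while requested count is %d. "+
		"Creating placeholder instance with ID %s.\n", name, real, desired, id)
}
```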
cluster-autoscaler/core/utils.go (Outdated)
}
if failedPlaceholder {
	// this means only a placeholder instance was deleted - it is an instance, that was requested,
	// but was not create before StartUpTimeout. It means something's wrong with this specific
s/create/created/
@@ -41,7 +41,7 @@ import (

const (
	// MaxNodeStartupTime is the maximum time from the moment the node is registered to the time the node is ready.
-	MaxNodeStartupTime = 15 * time.Minute
+	MaxNodeStartupTime = 1 * time.Minute
Please revert all changes outside of the cloudprovider. If you correctly set the `InstanceStatus`, backing off should be handled automatically by `ClusterStateRegistry.handleOutOfResourcesErrorsForNodeGroup`.
// mark this instance using its name as a timeouted placeholder
asg := m.instanceToAsg[*instance]
delete(m.instanceToAsg, *instance)
instance.Name = TimeoutedPlaceholderName
This seems very much not right.

IMO the instance should be marked as "timed out" in some routine which refreshes the node group state. I guess in the same place where you are creating placeholders you should also monitor whether some of the placeholders are "old enough" to be treated as timed out. Then you can mark those as such and, when instances are listed, set an OutOfResources error for them.

DeleteInstances will then be called by the core logic for the timed-out instances, and here you will only remove the placeholder and decrease the ASG size.
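A rough, self-contained sketch of the refresh-time flow being proposed. The `placeholder` bookkeeping, the `markTimedOutPlaceholders` name, and the `maxNodeProvisionTime` value are assumptions for illustration; the real provider tracks only instance IDs, as discussed in the next comment.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the error types named in the discussion.
type InstanceErrorClass int

const OutOfResourcesErrorClass InstanceErrorClass = 0

type InstanceErrorInfo struct {
	ErrorClass   InstanceErrorClass
	ErrorMessage string
}

// placeholder is a hypothetical record of a synthetic instance and when it was created.
type placeholder struct {
	ID        string
	CreatedAt time.Time
	ErrorInfo *InstanceErrorInfo // nil while we still expect the real instance to appear
}

// markTimedOutPlaceholders would run in the routine that refreshes node group state:
// placeholders older than the provision timeout get an OutOfResources error, so the
// core logic can back off the group and later delete them.
func markTimedOutPlaceholders(placeholders []*placeholder, now time.Time, maxNodeProvisionTime time.Duration) {
	for _, p := range placeholders {
		if p.ErrorInfo == nil && now.Sub(p.CreatedAt) > maxNodeProvisionTime {
			p.ErrorInfo = &InstanceErrorInfo{
				ErrorClass:   OutOfResourcesErrorClass,
				ErrorMessage: "placeholder did not become a real instance in time",
			}
		}
	}
}

func main() {
	now := time.Now()
	ps := []*placeholder{
		{ID: "i-placeholder-0", CreatedAt: now.Add(-20 * time.Minute)},
		{ID: "i-placeholder-1", CreatedAt: now.Add(-1 * time.Minute)},
	}
	markTimedOutPlaceholders(ps, now, 15*time.Minute)
	for _, p := range ps {
		fmt.Println(p.ID, "timed out:", p.ErrorInfo != nil)
	}
}
```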
I get your point, but the problem is again in how the AWS cloud provider works. Internally, it keeps instance IDs only, and these IDs need to stay constant. It also doesn't track any timeouts; that's on the core's side, specifically in clusterstate.go, where the logic that tracks instances and handles registration timeouts lives.

We either have to reuse/link this logic over here in auto_scaling_groups.go, or assume that if a `DeleteInstances` call comes for a placeholder, it's always because of a timeout related to a non-existing ScaleUp request. Basically, this whole logic is to guard against just a single use case: when CAS is started while already required > running for an ASG and the instances never come up.

I think there are the following possible cases for placeholder instances:

- A ScaleUp request times out. In that case, we don't have to modify the Nodes state at all, as the timeout handler will scale down the related ASG, mark it as "backed off" and decrease its size, which will make the placeholders go away on the next refresh (which starts every main loop).
- As I wrote above, on CAS start "required > running" already; in that case, `DeleteNodes` will be invoked by the core logic (after detecting that the node is listed by the cloud provider but not registered in k8s) and we assume that the node is deleted because of a timeout. One possible modification is to explicitly pass an "isTimeout" flag here.

What do you think?
> I get your point, but the problem is again in how AWS cloud provider works. Inside, it keeps instance IDs only and these IDs need to stay constant. It also doesn't track any timeouts - that's on the core's side, specifically in clusterstate.go. The logic that tracks instances and handles registration timeouts is there.
>
> We either have to reuse/link this logic over here: auto_scaling_groups.go or assume that if a DeleteInstances call comes for a placeholder, it's always because of timeout related to non-existing ScaleUp request. Basically, this whole logic is to guard against just a single use case: when CAS is started when already required > running for an ASG and the instances never come up.

My understanding was that you wanted to speed up the reaction time of CA when node instances coming from a spot-backed ASG are not showing up. If this is just about fixing the bug that the built-in 15-minute timeout mechanism in CA does not trigger, the situation is simpler (but the benefits are also smaller).

> ScaleUp request times out. In that case, we don't have to modify Nodes state at all, as the time out handler will scale down the related ASG, mark it as "backed off" and decrease its size, which will make placeholders to be gone on the next refresh (that starts every main loop).

Yes. It should work that way.

> As I wrote above, on CAS start "required > running" already; in that case, DeleteNodes will be invoked by the Core logic (after detecting, that the node is listed by the cloud provider, but not registered in k8s) and we assume, that node is deleted because of time out. One possible modification is to explicitly pass "isTimeout" flag here.

If we just added placeholders and did not modify the code in any other way, we would get to a stable state after DeleteNodes is called; the ASG just would not be backed off this time. Is that a problem? Do we care about the (possibly rare) case where CA is restarted while a scale-up is in progress? The state will become stable eventually. I do not believe we want to optimize for the restart scenario. CA was never designed to be restarted very often; it survives restarts, but it does not necessarily behave optimally in such situations.
Is that 15-minute timeout the one that's configured by the `--max-node-provision-time` command-line flag? There are several flags with default values of 15 minutes, and there's a const `MaxNodeStartupTime` with the same value. It doesn't look like that one can be changed without rebuilding the program.

15 minutes is too long for us to wait to react to an ASG's failure to acquire a machine, so we'd like to be able to dial this down.
@losipiuk OK, I'll try to split this patch in two: first provide a general fix that has a worst case of recovering in 2 * NodeStartupTime, then create another one that can work in just 1 * NodeStartupTime.

@seh It's `--max-node-provision-time`. And yeah, the general recommendation is "do not change that". Still, in my tests I dialed it down to 2 minutes and it seemed to work. YMMV.
I've been testing this patch in a cluster running Kubernetes version 1.13.4. I just saw the container crash like this: […]

I don't know yet whether that's related to these changes.

That failure above seems to happen within about two seconds after deleting a bunch of placeholder instances and reducing the desired count on the corresponding ASG. I can reproduce it easily.
We have indeed been running this version in our clusters for a couple of weeks, and have found that it mostly alleviates the problem with “stuck” ASGs. The chattiness of the log messages about placeholder instances still makes me anxious, wondering if everything is working as it should, but the net observable behavior is what we asked for. The backoff delay parameters could be exposed as configuration, but I understand that doing so would then require validating that they all work sensibly together, and we’d have to document several more command-line flags. We don’t need to take that on here. This patch is about enabling behavior that was already supposed to work this way, as opposed to introducing new features. I hope @piontec agrees with my summary.
Sorry for the huge delay on review.

Looks good. Just please clean up the unneeded changes remaining from the previous approaches.
@@ -484,7 +484,7 @@ func sanitizeTemplateNode(node *apiv1.Node, nodeGroup string, ignoredTaints tain

// Removes unregistered nodes if needed. Returns true if anything was removed and error if such occurred.
func removeOldUnregisteredNodes(unregisteredNodes []clusterstate.UnregisteredNode, context *context.AutoscalingContext,
-	currentTime time.Time, logRecorder *utils.LogEventRecorder) (bool, error) {
+	clusterStateRegistry *clusterstate.ClusterStateRegistry, currentTime time.Time, logRecorder *utils.LogEventRecorder) (bool, error) {
Please revert.
@@ -514,6 +514,7 @@ func removeOldUnregisteredNodes(unregisteredNodes []clusterstate.UnregisteredNod
		"Failed to remove node %s: %v", unregisteredNode.Node.Name, err)
	return removedAny, err
}
+
please revert
@@ -451,12 +451,12 @@ func TestRemoveOldUnregisteredNodes(t *testing.T) {
	assert.Equal(t, 1, len(unregisteredNodes))

	// Nothing should be removed. The unregistered node is not old enough.
-	removed, err := removeOldUnregisteredNodes(unregisteredNodes, context, now.Add(-50*time.Minute), fakeLogRecorder)
+	removed, err := removeOldUnregisteredNodes(unregisteredNodes, context, clusterState, now.Add(-50*time.Minute), fakeLogRecorder)
please revert
	assert.NoError(t, err)
	assert.False(t, removed)

	// ng1_2 should be removed.
-	removed, err = removeOldUnregisteredNodes(unregisteredNodes, context, now, fakeLogRecorder)
+	removed, err = removeOldUnregisteredNodes(unregisteredNodes, context, clusterState, now, fakeLogRecorder)
please revert
@@ -274,6 +274,14 @@ func (csr *ClusterStateRegistry) updateScaleRequests(currentTime time.Time) {
	csr.scaleDownRequests = newScaleDownRequests
}

+// BackoffNodeGroup is used to force the specified nodeGroup to go into backoff mode, which
+// means it won't be used for scaling out temporarily
+func (csr *ClusterStateRegistry) BackoffNodeGroup(nodeGroup cloudprovider.NodeGroup, currentTime time.Time) {
please revert
if err := m.asgCache.DeleteInstances(instances); err != nil {
	return err
}
return m.forceRefresh()
It feels like we should either have a log message here stating that a refresh is being forced, or update the message in `forceRefresh`, as it could currently be confusing for users given that the log message in `forceRefresh` states: `"Refreshed ASG list, next refresh after %v", m.lastRefresh.Add(refreshInterval)`.
Awesome work, really glad to see someone finally getting around to resolving this. One brief comment in addition to @losipiuk's.

@piontec, do you anticipate being able to address @losipiuk's and @gjtempleton's suggestions soon?
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Besides the code that needs to be reverted per @losipiuk's requests, this code looks reasonable to me. My only question is around e2e testing. I don't see any functional test that would stress the code paths introduced by this patch. Are those functional tests missing because they would be difficult to reproduce? Or was the test absence simply an oversight?
Best,
-jay
I'm willing to take over this patch, and can handle the requests to revert a few lines. I'm not sure what's involved with writing end-to-end tests for this change in behavior. What would the procedure be? Submit a fresh PR that starts from where this branch leaves off?
Regarding e2e tests: I'm not exactly sure what you mean by that. Currently CA uses unit tests and e2e tests in a real cluster; the latter only exist for GCP. Obviously there is no way to cover this without porting them to AWS. I have no idea how much work that would involve (I expect a lot), but I don't think this PR should be gated on it.
Agreed. It's more like an effort we can put into sig-testing separately.

Hi @seh, is there a way we can get in touch with @piontec? Not sure what happened to […]
@Jeffwan, I see that Lukasz does publish an email address in his GitHub profile, but I expect that he's receiving updates for each message posted here. I'm willing to write to him once, just in case I'm wrong about those notifications. I don't want to bother him if he's had to move on to other projects. Stay tuned.
@@ -254,7 +254,9 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) errors.AutoscalerError
	unregisteredNodes := a.clusterStateRegistry.GetUnregisteredNodes()
	if len(unregisteredNodes) > 0 {
		klog.V(1).Infof("%d unregistered nodes present", len(unregisteredNodes))
-		removedAny, err := removeOldUnregisteredNodes(unregisteredNodes, autoscalingContext, currentTime, autoscalingContext.LogRecorder)
+		removedAny, err := removeOldUnregisteredNodes(unregisteredNodes, autoscalingContext, a.clusterStateRegistry,
Just twigged that this will need reverting as well whilst playing around with backporting this into a 1.3 version.
Hey! I no longer work at Aurea/DevFactory and I lost write access there. I need to switch this work to my private GitHub account. Please continue the discussion here: #2235. I did the requested cleanup there already.

Good to know. Thanks!
To those who might be trying to track down issues with speedy transitions from spot to OnDemand, see #3241.

@piontec: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This fixes a situation where the AWS cloud provider is used and at least one of the managed ASGs is using spot instances. Instances requested from such an ASG may never come up due to a too-low bid or lack of capacity in the spot pool. Right now, this triggers a situation where CAS considers such new nodes as "coming up", because coming_up_nodes = requested - real. As there's no instance identity to track, the timeout mechanism for startup time won't kick in.

This patch introduces artificial placeholder IDs for instances requested from any ASG. If such an instance doesn't become available before the timeout kicks in, the size of the ASG is decreased and the ASG is marked as "backoff".

Fixes #1996 and most probably also #1133 and #1795