Node pool scale up timeout #1133
CA should time out the scale-up (IIRC after 15 minutes) and put the node group that failed to scale up in 'backoff' state. At that point it should try to scale up again, ignoring this node group. You can see this by looking for one of the following:
If you can reproduce it and don't see any of the above, it's a bug. In that case can you provide some more details (especially the CA version you're using)?
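To make the mechanism concrete, here is a minimal Go sketch of the timeout-plus-backoff idea, using hypothetical type and field names rather than the actual clusterstate code (the timeout is governed by CA's --max-node-provision-time flag, 15 minutes by default, if I recall correctly):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical illustration of the timeout + backoff idea: if a requested
// scale-up does not produce ready nodes within the provisioning timeout,
// the node group is put into backoff and skipped by subsequent scale-ups.
type scaleUpRequest struct {
	nodeGroup string
	start     time.Time
}

type clusterState struct {
	provisionTimeout time.Duration        // e.g. 15m (--max-node-provision-time)
	backoffUntil     map[string]time.Time // node groups currently backed off
}

func (cs *clusterState) checkTimeout(req scaleUpRequest, now time.Time, backoff time.Duration) {
	if now.Sub(req.start) > cs.provisionTimeout {
		// Scale-up timed out: back off this node group for a while.
		cs.backoffUntil[req.nodeGroup] = now.Add(backoff)
	}
}

func (cs *clusterState) isBackedOff(nodeGroup string, now time.Time) bool {
	return now.Before(cs.backoffUntil[nodeGroup])
}

func main() {
	cs := &clusterState{provisionTimeout: 15 * time.Minute, backoffUntil: map[string]time.Time{}}
	req := scaleUpRequest{nodeGroup: "spot-asg-1", start: time.Now().Add(-20 * time.Minute)}
	cs.checkTimeout(req, time.Now(), 5*time.Minute)
	// A later scale-up loop should skip "spot-asg-1" while it is backed off.
	fmt.Println("spot-asg-1 backed off:", cs.isBackedOff("spot-asg-1", time.Now()))
}
```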
We're running version 1.2.2, but I can try with 1.3.1.
It shouldn't make any difference, this was added earlier than 1.2 (I don't remember exactly, but probably the 1.1 timeframe?). So you're saying it's stuck on …
As @aleksandra-malinowska pointed out to me, the timeout is effectively reset if there is another scale-up on the same node group (i.e. CA only notices it if the last of multiple overlapping scale-ups times out). So that may be another thing to look for in the logs.
The logs are in this gist. Please ignore all ASGs and pods in … It looks like the scale-up timeout is working just fine and the pool is marked as unhealthy. However, in the next loop iteration CA doesn't consider another group that would've fit the same pod (…).
I think the problem is that building …
It seems that an easy fix would be to completely ignore upcoming nodes from unhealthy groups in …
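A rough sketch of the fix being suggested here, with hypothetical names (the real upcoming-node accounting lives in clusterstate): when counting upcoming nodes, skip node groups that are unhealthy or backed off, so a group whose nodes will never arrive doesn't hide the pending pods from other groups.

```go
package main

import "fmt"

// Hypothetical sketch: "upcoming" nodes are a group's target size minus its
// registered nodes. The proposed fix is to skip unhealthy groups so that
// nodes which will never arrive don't block scale-up of other groups.
type nodeGroupState struct {
	name       string
	targetSize int
	registered int
	healthy    bool
}

func upcomingNodes(groups []nodeGroupState) map[string]int {
	upcoming := map[string]int{}
	for _, g := range groups {
		if !g.healthy {
			// Proposed fix: don't treat nodes from unhealthy groups as upcoming.
			continue
		}
		if n := g.targetSize - g.registered; n > 0 {
			upcoming[g.name] = n
		}
	}
	return upcoming
}

func main() {
	groups := []nodeGroupState{
		{name: "spot-asg", targetSize: 1, registered: 0, healthy: false}, // timed-out scale-up
		{name: "on-demand-asg", targetSize: 2, registered: 2, healthy: true},
	}
	// The unhealthy spot group contributes no upcoming nodes, so a pending pod
	// is no longer considered "helped" by a node that will never appear.
	fmt.Println(upcomingNodes(groups))
}
```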
CA handles this by resizing the node group back to its original size after a timed-out scale-up. This is done in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L180. However, going through the code it looks like this may not work for a timed-out scale-from-0. It would be signified by lines in the log looking like … I'll try to reproduce later to confirm this theory (not enough time this week, sorry), but I'm fairly confident that's what's happening. If I'm right it's a bug in clusterstate. I have an idea how to fix it, but clusterstate is not the easiest thing to reason about and I need to have some time to dig into it to make sure I'm not breaking anything.
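As a sketch of that "resize back" behaviour, here is a stand-in for the relevant slice of the cloudprovider.NodeGroup interface (TargetSize and DecreaseTargetSize exist on the real interface; everything else below, including the method names of the fake group, is illustrative only):

```go
package main

import "fmt"

// Minimal stand-in for the relevant parts of cluster-autoscaler's
// cloudprovider.NodeGroup interface (TargetSize / DecreaseTargetSize).
type nodeGroup interface {
	Name() string
	TargetSize() (int, error)
	DecreaseTargetSize(delta int) error
}

// cleanUpTimedOutScaleUp sketches the idea from the comment above: once a
// scale-up times out, shrink the group's target size back by the number of
// nodes that never registered. For a scale-from-0 group, registeredNodes is 0
// and the whole increase has to be rolled back.
func cleanUpTimedOutScaleUp(ng nodeGroup, registeredNodes int) error {
	target, err := ng.TargetSize()
	if err != nil {
		return err
	}
	if missing := target - registeredNodes; missing > 0 {
		fmt.Printf("rolling back %d unprovisioned node(s) in %s\n", missing, ng.Name())
		return ng.DecreaseTargetSize(-missing)
	}
	return nil
}

// fakeGroup is a toy implementation for the example.
type fakeGroup struct {
	name   string
	target int
}

func (f *fakeGroup) Name() string             { return f.name }
func (f *fakeGroup) TargetSize() (int, error) { return f.target, nil }
func (f *fakeGroup) DecreaseTargetSize(delta int) error {
	f.target += delta // delta is negative
	return nil
}

func main() {
	g := &fakeGroup{name: "spot-asg", target: 1} // scale-from-0 that timed out
	_ = cleanUpTimedOutScaleUp(g, 0)
	fmt.Println("target size after cleanup:", g.target)
}
```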
@MaciekPytel did you have a chance to look into it yet?
@MaciekPytel could you reproduce this?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with … Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
cc: @losipiuk
👍 using …
With recent code changes, … If you want to implement it for AWS, you may take a look at the implementation for GCE, which is already in. Note: for the CA logic to react on the error, the …
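For illustration, here is a sketch of how a cloud provider can surface a failed creation through the (then recently added) instance-status API, assuming I'm remembering the cloudprovider types correctly; the instance ID and error code below are made up:

```go
package main

import (
	"fmt"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// buildFailedSpotInstance shows (as I understand it) how a cloud provider's
// NodeGroup.Nodes() can surface a node that was requested but never came up:
// it returns an Instance stuck in the Creating state with ErrorInfo attached,
// which lets the core CA logic treat the scale-up as failed and back off
// instead of waiting for the full provisioning timeout.
func buildFailedSpotInstance(placeholderID string) cloudprovider.Instance {
	return cloudprovider.Instance{
		Id: placeholderID,
		Status: &cloudprovider.InstanceStatus{
			State: cloudprovider.InstanceCreating,
			ErrorInfo: &cloudprovider.InstanceErrorInfo{
				ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
				ErrorCode:    "placeholder-cannot-be-fulfilled", // hypothetical code
				ErrorMessage: "spot request could not be fulfilled",
			},
		},
	}
}

func main() {
	inst := buildFailedSpotInstance("i-placeholder-spot-asg-0")
	fmt.Printf("%s: state=%v errorClass=%v\n",
		inst.Id, inst.Status.State, inst.Status.ErrorInfo.ErrorClass)
}
```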
Circling back to @aermakov-zalando's point. Does it not make sense to skip upcoming nodes from unhealthy groups? We should be able to assume that the node is no longer upcoming and not increment …
@viggeh Are you still planning on tackling this in a follow-up PR? We're hitting this and could take a stab at the recommended implementation, if that's helpful.
@davidquarles Sorry for the late reply. I've not had the time to take a proper look at this and don't expect to do so in the next couple of weeks. If you can take the problem on, that would be awesome!
@MaciekPytel @mwielgus @viggeh I'm no AWS expert, but AFAICT instances aren't added to the ASG when spot requests are made. With an open, unfulfillable spot request triggered by an ASG, I see that the ASG's desired number of instances has been incremented, and the spot request shows up as an associated scaling activity, but there are no new instances under the ASG. Is the aforementioned strategy for solving this still valid? If so, would I just use …
I'll add that this is also valid for any failing reason other than spot requests not being fulfilled, e.g. a corrupted LC or LT config (AMI not found, missing subnet or SG, etc.).
I don't see this behaviour. In my test the ASG is just left at the new value even though the new instance didn't get provisioned.
Just to add my test results to this issue, if it helps any... I have 3x ASGs (2 spot, 1 normal). I have unschedulable pods, and CA triggers a scale-up:
But the spot price is not fulfilled, so the instance is not created. Then …
Now at this point I would expect CA to immediately choose one of the other two available ASGs, but it does not:
And pods are left unschedulable.
We're running into the same issues here. I know it's only been like 13 days, but did you happen to find a workaround, @max-rocket-internet?
@choseh nope 😐
Please address this issue.
EDIT: This was not a fault on the Kubernetes side, I simply ran into an IP address quota on the Google side. Stumbled across this too. I tried to use a scale-to-zero pool beside my existing cluster for new CI/CD GitLab workers. Running on GKE with … Sadly there is no scaling above 1 node, though 5 are allowed. The main cluster is currently running on 4 nodes. Tried with and without preemptible nodes (I guess the AWS term for this is Spot). Pod description:
AFAICT, #2235 properly handles the bug described here, and that PR has been merged. Can we close this bug out now?
While the fix in #2235 is indeed for the AWS cloud provider, the solution is the one recommended by @MaciekPytel and @losipiuk for all cloud providers that suffer from the "I don't have an actual Instance (yet)" problem with their autoscaling API. See here for @MaciekPytel's recommendation: … Note that he mentions that the managed instance groups API in GCE does the whole placeholder thing behind the scenes, which is essentially what was implemented for the AWS cloud provider in #2235.
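For anyone landing here later, the placeholder approach boils down to something like the following sketch (hypothetical helper and ID naming, not the literal #2235 code): when the ASG's desired capacity is larger than the set of instances it actually has, the provider reports fabricated placeholder entries, so CA has something to apply its provisioning timeout to and can eventually back the group off.

```go
package main

import "fmt"

// asgState is a simplified view of an AWS ASG as seen by the cloud provider.
type asgState struct {
	name            string
	desiredCapacity int
	instanceIDs     []string // instances that actually exist
}

// withPlaceholders pads the instance list with fabricated IDs for capacity the
// ASG has requested but not delivered (e.g. an unfulfillable spot request).
// Hypothetical naming; the point is that CA then "sees" these nodes, applies
// its provisioning timeout to them, and can back off / roll back the group.
func withPlaceholders(asg asgState) []string {
	ids := append([]string{}, asg.instanceIDs...)
	for i := len(asg.instanceIDs); i < asg.desiredCapacity; i++ {
		ids = append(ids, fmt.Sprintf("i-placeholder-%s-%d", asg.name, i))
	}
	return ids
}

func main() {
	asg := asgState{name: "spot-asg", desiredCapacity: 2, instanceIDs: []string{"i-0abc123"}}
	// One real instance plus one placeholder for the spot capacity that never arrived.
	fmt.Println(withPlaceholders(asg))
}
```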
I see. We only really care about AWS anyway, so I'll just close the issue. Thanks for the update!
The autoscaler has a timeout for non-ready nodes which forces it to kill those nodes and potentially select a different node pool in the next iteration. However, in the situation where the node pool cannot scale up at all, it'll happily wait forever, keeping pods in the Pending state without trying to compensate.
For example, setting up multiple AWS Spot node pools with different instance types, or setting up a Spot pool and an On-Demand pool, doesn't really work. We'd expect CA to scale up one of the ASGs, detect a few minutes later that there are still no nodes coming up (because the corresponding Spot pool doesn't have capacity) and fall back to another pool. What actually happens is that CA will scale up the node pool by increasing its desired capacity and then not do anything at all other than printing Upcoming 1 nodes / Failed to find readiness information for ...