Node pool scale up timeout #1133

Closed
aermakov-zalando opened this issue Aug 9, 2018 · 30 comments
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@aermakov-zalando
Contributor

The autoscaler has a timeout for non-ready nodes which forces it to kill those nodes and potentially select a different node pool in the next iteration. However, in the situation where the node pool cannot scale up at all it'll happily wait forever, keeping pods in Pending state without trying to compensate.

For example, setting up multiple AWS Spot node pools with different instance types, or a Spot pool plus an On-Demand pool, doesn't really work. We'd expect CA to scale up one of the ASGs, detect a few minutes later that there are still no nodes coming up (because the corresponding Spot pool doesn't have capacity) and fall back to another pool. What actually happens is that CA scales up the node pool by increasing its desired capacity and then does nothing at all other than printing Upcoming 1 nodes/Failed to find readiness information for ....

@MaciekPytel
Contributor

CA should time out the scale-up (IIRC after 15 minutes) and put the node group that failed to scale up in a 'backoff' state. At that point it should try to scale up again, ignoring this node group. You can see this by looking for one of the following:

  • Scale-up timed out for node group <name> after <time> in logs
  • ScaleUpTimedOut event on cluster-autoscaler-status configmap (in kube-system ns)
  • Node group showing scale-up status as "Backoff" in status configmap
  • failed_scale_ups_total metric increasing

If you can reproduce it and don't see any of the above, it's a bug. In that case, can you provide some more details (especially the CA version you're using)?

@aermakov-zalando
Contributor Author

We're running version 1.2.2, but I can try with 1.3.1.

@MaciekPytel
Contributor

It shouldn't make any difference; this was added earlier than 1.2 (I don't remember exactly, but probably in the 1.1 timeframe?).

So you're saying it's stuck on Upcoming 1 node for more than 15 minutes? Can you provide a log of initial scale-up, a loop immediately after and another loop after 15+ minutes?

@MaciekPytel
Contributor

As @aleksandra-malinowska pointed out to me, the timeout is effectively reset if there is another scale-up on the same node group (i.e. CA only notices it if the last of multiple overlapping scale-ups times out). So that may be another thing to look for in the logs.

@aermakov-zalando
Contributor Author

aermakov-zalando commented Aug 9, 2018

The logs are in this gist. Please ignore all ASGs and pods in 1b; the nodes there were constantly being created and destroyed by Spot termination.

It looks like the scale-up timeout is working just fine and the pool is marked as unhealthy. However, in the next loop iteration CA doesn't consider another group that would've fit the same pod (scale_up.go:178] No need for any nodes in nodepool-default-worker-m4-splitaz-aws-ACCT-eu-central-1-kube-aws-test-aermakov64-AutoScalingGroup1a-LJECHLYTYAKY in 03-next-iteration.txt). It only tries to scale it up after I manually disable the unhealthy node group by setting its max size to 0.

@aermakov-zalando
Contributor Author

aermakov-zalando commented Aug 9, 2018

I think the problem is that building upcomingNodes in

upcomingNodes := make([]*schedulercache.NodeInfo, 0)

ignores whether the node group is healthy or not, so the loop in

for _, nodeGroup := range nodeGroups {

thinks that the pod could be scheduled on the perma-upcoming node and doesn't consider other groups, but I could also be completely wrong.

@aermakov-zalando
Contributor Author

It seems that an easy fix would be to completely ignore upcoming nodes from unhealthy groups in ScaleUp(). It doesn't look like the list is used for anything other than scheduling estimation, so it shouldn't affect anything else. WDYT?
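
A minimal sketch of that idea, with simplified stand-in types (nodeInfo, clusterState and buildUpcomingNodes below are illustrative names, not the real cluster-autoscaler internals): when building the list of upcoming nodes used for scheduling estimation, skip any group the cluster state considers unhealthy or backed off, so nodes that will never arrive stop masking pending pods.

package main

import "fmt"

// nodeInfo stands in for schedulercache.NodeInfo.
type nodeInfo struct{ group string }

// clusterState stands in for the cluster state registry.
type clusterState struct {
	upcoming  map[string]int  // node group -> number of upcoming nodes
	unhealthy map[string]bool // node groups currently backed off / unhealthy
}

func (cs *clusterState) GetUpcomingNodes() map[string]int  { return cs.upcoming }
func (cs *clusterState) IsNodeGroupHealthy(id string) bool { return !cs.unhealthy[id] }

// buildUpcomingNodes mirrors the loop in ScaleUp(), but drops upcoming nodes
// that belong to unhealthy groups.
func buildUpcomingNodes(cs *clusterState, templates map[string]*nodeInfo) []*nodeInfo {
	upcomingNodes := make([]*nodeInfo, 0)
	for group, count := range cs.GetUpcomingNodes() {
		if !cs.IsNodeGroupHealthy(group) {
			// The group timed out scaling up; its "upcoming" nodes will never
			// arrive, so ignore them when estimating where pods could fit.
			continue
		}
		template := templates[group]
		for i := 0; i < count; i++ {
			upcomingNodes = append(upcomingNodes, template)
		}
	}
	return upcomingNodes
}

func main() {
	cs := &clusterState{
		upcoming:  map[string]int{"spot-pool": 1, "ondemand-pool": 0},
		unhealthy: map[string]bool{"spot-pool": true},
	}
	templates := map[string]*nodeInfo{
		"spot-pool":     {group: "spot-pool"},
		"ondemand-pool": {group: "ondemand-pool"},
	}
	fmt.Println(len(buildUpcomingNodes(cs, templates))) // prints 0: the stuck spot node is ignored
}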

@MaciekPytel
Contributor

MaciekPytel commented Aug 9, 2018

CA handles this by resizing the node group back to its original size after a timed-out scale-up. This is done in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L180. However, going through the code it looks like this may not work for a timed-out scale-from-0. It would be signified by lines in the log looking like Readiness for node group <group> not found, which I do see in your log.

I'll try to reproduce later to confirm this theory (not enough time this week, sorry), but I'm fairly confident that's what's happening. If I'm right it's a bug in clusterstate. I have an idea how to fix it, but clusterstate is not the easiest thing to reason about and I need to have some time to dig into it to make sure I'm not breaking anything.
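
To illustrate the suspected gap, here is a purely hypothetical sketch (groupState and fixGroupSize are made-up names, not the real clusterstate code): the resize-back step needs per-group readiness information to know what size to return to, and for a group scaled up from 0 that information may simply not exist, so the correction never fires.

package main

import "fmt"

// groupState is a made-up summary of what the cluster state tracks per node group.
type groupState struct {
	targetSize      int
	registeredNodes int
	readinessKnown  bool // false when "Readiness for node group <group> not found"
	scaleUpTimedOut bool
}

// fixGroupSize returns the size the group should be shrunk back to after a
// timed-out scale-up, or -1 if no correction is applied.
func fixGroupSize(g groupState) int {
	if !g.scaleUpTimedOut {
		return -1
	}
	if !g.readinessKnown {
		// No readiness information to compare against, so the resize-back is
		// skipped and the group keeps its inflated target size -- the
		// scale-from-0 case discussed in this issue.
		return -1
	}
	return g.registeredNodes
}

func main() {
	fromZero := groupState{targetSize: 1, registeredNodes: 0, readinessKnown: false, scaleUpTimedOut: true}
	fmt.Println(fixGroupSize(fromZero)) // prints -1: nothing gets corrected
}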

@aleksandra-malinowska aleksandra-malinowska added area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug. labels Aug 13, 2018
@jrake-revelant

@MaciekPytel did you have a chance to look into it yet?

@szuecs
Member

szuecs commented Sep 27, 2018

@MaciekPytel could you reproduce this?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2018
@MaciekPytel
Contributor

/remove-lifecycle stale
/sigh I keep meaning to go back and fix this..

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 27, 2018
@MaciekPytel
Contributor

cc: @losipiuk

@phspagiari

👍 Using 1.2.4 (K8s 1.10.12) we had problems using spot and on-demand node pools. CA tries (ad infinitum) to request a spot instance (even with the price-too-low state) and never falls back to the next node pool (on-demand).

@losipiuk
Contributor

With recent code changes, cloudprovider.NodeGroup.Nodes() returns Instance objects. The intention of this method is to return an Instance object not only for running nodes but also for those which are being created or deleted. Additionally, for nodes which are being created (with Status.State==InstanceCreating), ErrorInfo can be provided. The core logic now reacts to errors specified via ErrorInfo and will retract from the scale-up of a given node group (and back off the node group) immediately, instead of waiting 15 minutes for the timeout.

If you want to implement it for AWS, you may take a look at the implementation for GCE, which is already in.

Note: for the CA logic to react to the error, the ErrorClass must be set to OutOfResourcesErrorClass. It will probably be extended to OtherErrorClass too.
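
For illustration, a minimal sketch (not the actual AWS implementation) of what a provider's NodeGroup.Nodes() could return for a spot request that cannot be fulfilled, using the Instance/InstanceStatus/InstanceErrorInfo types mentioned above; the placeholder id, error code and message below are invented for the example.

package example

import "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"

// nodesWithFailedSpotRequest returns one node that is actually running plus a
// placeholder for an instance that never materialised because the spot request
// could not be fulfilled. Surfacing the failure via ErrorInfo lets the core
// logic back off the node group immediately instead of waiting the full
// max-node-provision-time.
func nodesWithFailedSpotRequest() []cloudprovider.Instance {
	return []cloudprovider.Instance{
		{
			Id: "aws:///eu-central-1a/i-0123456789abcdef0",
			Status: &cloudprovider.InstanceStatus{
				State: cloudprovider.InstanceRunning,
			},
		},
		{
			// Invented placeholder id for the unfulfilled spot request.
			Id: "aws:///eu-central-1a/placeholder-spot-pool-0",
			Status: &cloudprovider.InstanceStatus{
				State: cloudprovider.InstanceCreating,
				ErrorInfo: &cloudprovider.InstanceErrorInfo{
					ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
					ErrorCode:    "placeholder-cannot-be-fulfilled", // illustrative code only
					ErrorMessage: "spot request could not be fulfilled (price-too-low)",
				},
			},
		},
	}
}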

@viggeh
Contributor

viggeh commented Mar 8, 2019

Circling back to @aermakov-zalando's point: does it not make sense to skip upcoming nodes from unhealthy groups? We should be able to assume that the node is no longer upcoming and not increment upcomingNodes. That would make cluster-autoscaler fall back on a different ASG for the pod that cannot be scheduled.

@davidquarles

@viggeh Are you still planning on tackling this in a follow-up PR? We're hitting this and could take a stab at the recommended implementation, if that's helpful.

@viggeh
Contributor

viggeh commented Apr 2, 2019

@davidquarles Sorry for the late reply. I've not had the time to take a proper look at this and don't expect to do so in the next couple of weeks. If you can take the problem on that would be awesome!

@davidquarles

@MaciekPytel @mwielgus @viggeh I'm no AWS expert, but AFAICT instances aren't added to the ASG when spot requests are made. With an open, unfulfillable spot request triggered by an ASG, I see that the ASG's desired number of instances has been incremented, and the spot request shows up as an associated scaling activity, but there are no new instances under the ASG. Is the aforementioned strategy for solving this still valid? If so, would I just use AwsNodeGroup.TemplateNodeInfo to create fake nodes (with Status.State==InstanceCreating and corresponding ErrorInfo)?

@mvisonneau

I'll add that this is also valid for any failure reason other than spot requests not being fulfilled, e.g. a corrupted LC or LT config (AMI not found, missing subnet or SG, etc.).

@max-rocket-internet

CA handles this by resizing the node group back to original size after timed-out scale-up.

I don't see this behaviour. In my test the ASG is just left at the new value even though the new instance didn't get provisioned.

@max-rocket-internet

Just to add my test results to this issue, if it helps any...

I have 3 ASGs (2 spot, 1 normal). I have unschedulable pods, and CA triggers a scale-up:

I0417 13:42:04.765399       1 scale_up.go:427] Best option to resize: eu01-stg-spot-2
I0417 13:42:04.765417       1 scale_up.go:431] Estimated 1 nodes needed in eu01-stg-spot-2
I0417 13:42:04.765439       1 scale_up.go:533] Final scale-up plan: [{eu01-stg-spot-2 2->3 (max: 20)}]

But the spot request is not fulfilled, so the instance is not created. Then max-node-provision-time passes:

W0417 14:03:32.422046       1 clusterstate.go:198] Scale-up timed out for node group eu01-stg-spot-1 after 15m8.316405684s
W0417 14:03:32.422109       1 clusterstate.go:221] Disabling scale-up for node group eu01-stg-spot-1 until 2019-04-17 14:08:32.247733959 +0000 UTC m=+4217.943746338
W0417 14:03:32.532256       1 scale_up.go:329] Node group eu01-stg-spot-1 is not ready for scaleup - backoff

Now at this point I would expect CA to immediately choose one of the other two available ASGs, but it does not:

I0417 14:09:23.451114       1 scale_up.go:412] No need for any nodes in eu01-stg
I0417 14:09:23.451516       1 scale_up.go:412] No need for any nodes in eu01-stg-spot-1
I0417 14:09:23.452883       1 scale_up.go:412] No need for any nodes in eu01-stg-spot-2

And pods are left unschedulable.

@choseh

choseh commented Apr 30, 2019

We're running into the same issues here. I know it's only been like 13 days, but did you happen to find a workaround @max-rocket-internet?

@max-rocket-internet

@choseh nope 😐

@baxor

baxor commented May 1, 2019

Please address this issue.

@MaxWinterstein

MaxWinterstein commented Jun 17, 2019

EDIT: This was not a fault on the Kubernetes side, I simply ran into an IP address quota on the Google side...


Stumbled across this too. I tried to use a scale-to-zero pool alongside my existing cluster for new GitLab CI/CD workers. Running on GKE with 1.13.6-gke.0.

Sadly there is no scaling above 1 node, even though 5 are allowed. The main cluster is currently running on 4 nodes.

Tried with and without preemptible nodes (I guess the AWS term for this is Spot).

pod description:

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   16s (x2 over 16s)  default-scheduler   0/5 nodes are available: 4 node(s) didn't match node selector, 5 Insufficient cpu.
  Normal   NotTriggerScaleUp  4s (x2 over 15s)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 in backoff after failed scale-up

@jaypipes
Contributor

AFAICT, #2235 properly handles the bug described here, and that PR has been merged. Can we close this bug out now?

@aermakov-zalando
Contributor Author

@jaypipes The fix in #2235 is AWS specific, so other cloud providers would still be affected by this. Or is the idea here that every cloud provider will have to implement the same logic to work around an issue in the clusterstate?

@jaypipes
Contributor

> @jaypipes The fix in #2235 is AWS specific, so other cloud providers would still be affected by this. Or is the idea here that every cloud provider will have to implement the same logic to work around an issue in the clusterstate?

While the fix in #2235 is indeed for the AWS cloud provider, the solution is the one recommended by @MaciekPytel and @losipiuk for all cloud providers that suffer from the "I don't have an actual Instance (yet)" problem with their autoscaling API. See here for @MaciekPytel's recommendation:

#2008 (comment)

Note that he mentions that the managed instance groups API in GCE does the whole placeholder thing behind the scenes, which is essentially what was implemented for the AWS cloud provider in #2235.

@aermakov-zalando
Contributor Author

I see. We only really care about AWS anyway, so I'll just close the issue. Thanks for the update!
