Scale from 0, unwanted nodes #2165

Closed
okgolove opened this issue Jul 5, 2019 · 12 comments
Labels
area/provider/aws: Issues or PRs related to aws provider
lifecycle/stale: Denotes an issue or PR that has remained open with no activity and has become stale.

Comments

@okgolove

okgolove commented Jul 5, 2019

Hello. I have three ASGs:
main [min: 1, max: 1]
spots [min: 1, max: 10]
test-asg [min: 0, max: 0, tainted]

The taint is specified via the ASG and instance tags (the tag format is sketched after the log output below).

CA keeps creating a new node in the test-asg group, even though no pods are scheduled on the test ASG nodes. Then it deletes the node (after the unneeded period) and creates it again, in a loop.

How can I fix this?

I0705 09:32:44.591122       1 auto_scaling_groups.go:245] Regenerating instance to ASG map for ASGs: [spots test-asg]
W0705 09:32:44.802636       1 clusterstate.go:539] Readiness for node group test-asg not found
W0705 09:32:44.804225       1 clusterstate.go:321] Failed to find readiness information for test-asg
W0705 09:32:44.804240       1 clusterstate.go:377] Failed to find readiness information for test-asg
W0705 09:32:44.804245       1 clusterstate.go:321] Failed to find readiness information for test-asg
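
For context: on the AWS provider, CA learns about taints for a group scaled from 0 through node-template tags on the ASG itself. A minimal sketch of such a tag, assuming a hypothetical taint key "dedicated" with value "test" and effect NoSchedule:

    Key:   k8s.io/cluster-autoscaler/node-template/taint/dedicated
    Value: test:NoSchedule
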
@okgolove
Author

relates #2008

@Jeffwan
Contributor

Jeffwan commented Jul 16, 2019

What expander strategy are you using?

@Jeffwan
Contributor

Jeffwan commented Jul 16, 2019

Sorry for the late response, just came back from vacation :D

@okgolove
Author

I'm using the default expander (i.e. random).

@exdx
Contributor

exdx commented Aug 10, 2019

What settings (flags) are you using when running the autoscaler? And which version? It could be something in the configuration causing this.

@okgolove
Author

cluster-autoscaler:v1.3.3
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/prod.test.com
      --balance-similar-node-groups=true
      --logtostderr=true
      --skip-nodes-with-local-storage=false
      --skip-nodes-with-system-pods=false
      --stderrthreshold=info
      --v=4

@exdx
Contributor

exdx commented Aug 12, 2019

Maybe something to do with balance-similar-node-groups=true? By default it's false. This flag attempts to balance similar node groups, which is somewhat like the behavior you're seeing. I would try with it set to false, just to see.
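
For reference, testing that would just mean flipping the flag in the args posted above (everything else unchanged), e.g.:

      --balance-similar-node-groups=false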

@MaciekPytel
Contributor

My guess would be the scale-from-0 logic incorrectly guessing what the node would look like. CA sees a template node that would help the pending pods, so it scales up. Once the node is created, it turns out to look different than expected and it doesn't really fit the pods, so CA deletes it. Once there are 0 nodes, CA goes back to using the scale-from-0 template and the situation repeats.
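
One way to reduce that mismatch on AWS (a sketch, not something confirmed in this thread): in addition to the taint tag shown earlier, the AWS provider also reads node-template tags for labels and extra resources when it builds the scale-from-0 template, so tagging the ASG with the values the real nodes will actually have keeps the template accurate. The label name and size below are hypothetical:

    k8s.io/cluster-autoscaler/node-template/label/node-role = test
    k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage = 100Gi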

@okgolove
Author

okgolove commented Aug 12, 2019

I don't like the message:
Failed to find readiness information for test-asg

It seems something is going wrong.
As I wrote, it looks like #2008

@Jeffwan
Contributor

Jeffwan commented Oct 11, 2019

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws label Oct 11, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jan 9, 2020
@okgolove
Author

okgolove commented Jan 9, 2020

It seems it got fixed.

@okgolove okgolove closed this as completed Jan 9, 2020