Scale up delayed by Cluster Autoscaler for large request #3835
Comments
Hi! I'd like to add some more logs here. So this is what we've noticed:
[screenshot: CA logs]
Interesting. First of all, CA exposes a health check endpoint based on its last activity - I recommend configuring a livenessProbe so that CA is automatically restarted if it gets stuck. That being said, this looks like the scale-up logic getting stuck calculating the different options for scale-up, which is a fairly typical performance problem for CA. Can you share some more details about your config?
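For reference, a minimal livenessProbe sketch for the cluster-autoscaler container, assuming the default --address port of 8085 and its /health-check endpoint; the image tag and probe timings below are illustrative, not taken from this cluster's manifest:

```yaml
# Excerpt of a cluster-autoscaler Deployment spec (illustrative values).
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.20.0
    livenessProbe:
      httpGet:
        path: /health-check     # served on the CA --address port, driven by last activity
        port: 8085
      initialDelaySeconds: 600  # give CA time to build its initial state
      periodSeconds: 60
      failureThreshold: 3       # restart after ~3 minutes of failed checks
```

With something like this in place, a CA that stops reporting activity gets restarted by the kubelet instead of sitting stuck.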
Thank you @MaciekPytel!
About #3429, thank you for sharing it! Do you have an ETA for when it will be released? Later edit: I am reproducing this again, and what I notice is a spike in the number of unregistered nodes: only 54 nodes out of 146 are considered ready by CA.
Also, I don't think the liveness probe will help here as restarting the CA does not recover it from the "freeze" state. Here are the logs after restart:
[screenshot: CA logs after restart]
This looks very much like CA stuck in binpacking simulation. Binpacking itself doesn't write any logs, and the ~15m delay between evaluating 2 ASGs visible in the logs fits well with this theory. I'm now almost sure #3429 should help a lot. Regarding the timeline - it was included in the 1.20 release. Would you mind retrying your test using CA 1.20.0? If it performs better, I think it will be a good justification to cherry-pick #3429 to older releases (note: there were also other scalability improvements in 1.20, so technically I'm just making an educated guess as to which one is relevant for your case, but based on where the time is spent I'm fairly confident).
Thank you @MaciekPytel! This is very helpful! I will test with CA 1.20.0 and will get back with some results.
CA 1.20.0 seems to have helped. I scaled out a deployment from 1 replica to 1000 with CA image v1.20.0.
Observed CA scale-up logs: [screenshot]
No pending pods: [screenshot]
Also, with respect to v1.20.0, I came across one of the comments, #3721 (comment). When I try to scale an app (with a zone-based DoNotSchedule topology spread constraint defined) from 1 to 1000 replicas, the cluster scale-up does not happen one node at a time as mentioned in #3721 (comment), although the scale-up did happen in several stages, as can be seen in the logs above. Is there any possible impact on scaling time, etc., that we will observe for applications with a zone-based topology constraint defined with maxSkew: 1?
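For context, this is the kind of constraint being discussed - a minimal sketch assuming the standard topology.kubernetes.io/zone label; the pod name, app label, and image are illustrative placeholders rather than the actual manifests:

```yaml
# Illustrative pod spec with a zone-based topology spread constraint.
apiVersion: v1
kind: Pod
metadata:
  name: scale-test
  labels:
    app: scale-test
spec:
  topologySpreadConstraints:
    - maxSkew: 1                         # zones may differ by at most one matching pod
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule   # hard constraint: violating pods stay Pending
      labelSelector:
        matchLabels:
          app: scale-test
  containers:
    - name: app
      image: k8s.gcr.io/pause:3.2        # placeholder workload
```

Because DoNotSchedule is a hard constraint, pods that would violate maxSkew: 1 stay Pending until capacity appears in the lagging zone, which is why scale-up ordering across zones matters here.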
Thank you @nshekhar221 for validating it!
@MaciekPytel Gentle enquiry on the possibility of getting the fix cherry-picked for v1.18 and v1.19?
@MaciekPytel Can you please confirm if there is a possibility to cherry-pick the fix for v1.18 and v1.19?
Also got hit by this on 1.18; would love to see this cherry-picked to at least 1.19!
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: 1.18.2
What k8s version are you using (kubectl version)?:
1.18.6
What environment is this in?:
AWS
What did you expect to happen?:
Cluster Autoscaler to scale up the nodes when unschedulable pods are found on the cluster.
What happened instead?:
Scaling did happen, by increasing the number of nodes in the AWS ASG (Auto Scaling Group), but only almost 4 hours after unschedulable pods were detected on the cluster.
Anything else we need to know?:
The scaling request that gets delayed is somewhat on the larger side (from 76 nodes to around 660+ nodes).
Cluster Autoscaler logged "Start refreshing cloud provider node instances cache" and "Refresh cloud provider node instances cache finished" messages between the time unschedulable pods were detected on the cluster and the time when scaling actually happened (i.e. 4 hours later).
No error logs are reported.
Scaling happened as expected before and after this 4-hour interval.
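For readers trying to reproduce this, a minimal sketch of the kind of cluster-autoscaler container configuration involved on AWS; the flags shown are standard CA options, but the auto-discovery tags, expander choice, and verbosity are illustrative assumptions, not the reporter's actual manifest:

```yaml
# Illustrative cluster-autoscaler container spec for an AWS / ASG setup.
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.18.2
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      # Discover node groups (ASGs) by tag; <cluster-name> is a placeholder.
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
      - --expander=least-waste
      - --v=4   # verbose logs help spot where scale-up time is spent
```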