CA does not scale up from zero nodes in group #903
Comments
@mumoshu I'm not sure how GPUs are handled by scale from 0 in aws, can you help?
@bskiba Hi. Unfortunately I'm not familiar with the specific feature, but my understanding is that node templates are needed to support it for aws, and according to the documentation they are implemented. So, @wskinner, did you add the required tags to the ASGs that back your node groups? You need those tags for the scale-from-zero feature to function. This part of the cluster-autoscaler doc should help!
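For illustration (the group name, label key, and value below are hypothetical, not taken from this thread), node-template tags on an ASG backing an empty GPU group look roughly like this:

[
  {
    "ResourceId": "gpu-nodes.mycluster.example.com",
    "ResourceType": "auto-scaling-group",
    "Key": "k8s.io/cluster-autoscaler/node-template/label/accelerator",
    "Value": "nvidia-k80",
    "PropagateAtLaunch": true
  }
]

CPU, memory, and GPU counts for a known instance type such as p2.xlarge come from the static instance type list, so label tags like this are mainly needed when pending pods select nodes via labels or tolerate taints.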
@mumoshu I thought I had added those tags but I hadn't. Now that I have them, I am seeing a different issue. The logs say scale up failed due to insufficient GPUs.
The spec for gpu-nodes looks like this:
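For reference (the reporter's actual spec is not shown, so the cluster name, sizes, and labels below are assumptions rather than their real config), a kops InstanceGroup for a GPU group that scales from zero might look roughly like this:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  name: gpu-nodes
  labels:
    kops.k8s.io/cluster: mycluster.example.com   # hypothetical cluster name
spec:
  role: Node
  machineType: p2.xlarge
  minSize: 0
  maxSize: 2
  # cloudLabels become AWS tags on the ASG; this one supplies the node-template
  # label that cluster-autoscaler can use while the group has zero nodes.
  cloudLabels:
    k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup: gpu-nodes
  nodeLabels:
    kops.k8s.io/instancegroup: gpu-nodes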
@mumoshu Any idea what's going on here?
@wskinner Hi. Thanks for the report! I took some time to read the relevant code, and it turns out the aws provider does seem to support the scale-from-zero feature for gpu nodes. So my guess is that we're still missing something. Could it be that your cluster-autoscaler is outdated and therefore missing a fix relevant to this feature? Which version of CA are you using? Basically, cluster-autoscaler needs to build a "node template" that describes how many cores and GPUs, and how much memory, a node would provide. For the aws provider we build it by fetching the launch configuration from the relevant ASG: autoscaler/cluster-autoscaler/cloudprovider/aws/aws_manager.go Lines 201 to 224 in bcb4f9e
Also:
And we do have the correct number of GPUs set for p2.xlarge: autoscaler/cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go Lines 546 to 551 in 8225983
So it should just work if we set things up correctly. But please feel free to ask me anything.
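To make that concrete, here is a much-simplified sketch in Go of the idea described above; it is not the actual aws_manager.go code, and the type, map, and function names are invented for illustration:

// Simplified illustration only; not the real cluster-autoscaler code.
package main

import "fmt"

// nodeTemplate describes the resources a not-yet-existing node would provide.
type nodeTemplate struct {
	CPU      int64 // cores
	MemoryMB int64
	GPU      int64
}

// A tiny excerpt of the kind of static data kept in ec2_instance_types.go;
// p2.xlarge does provide 1 GPU.
var instanceTypes = map[string]nodeTemplate{
	"p2.xlarge": {CPU: 4, MemoryMB: 62464, GPU: 1},
	"m4.large":  {CPU: 2, MemoryMB: 8192, GPU: 0},
}

// buildNodeTemplate stands in for the provider logic: the ASG's launch
// configuration names an instance type, and capacity for an empty group is
// looked up from static data instead of being read from a running node.
func buildNodeTemplate(launchConfigInstanceType string) (nodeTemplate, error) {
	t, ok := instanceTypes[launchConfigInstanceType]
	if !ok {
		return nodeTemplate{}, fmt.Errorf("unknown instance type %q", launchConfigInstanceType)
	}
	return t, nil
}

func main() {
	t, err := buildNodeTemplate("p2.xlarge")
	if err != nil {
		panic(err)
	}
	fmt.Printf("template capacity: cpu=%d memoryMB=%d nvidia.com/gpu=%d\n", t.CPU, t.MemoryMB, t.GPU)
}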
@wskinner And your CA seems outdated to me.
Can you upgrade it to 1.2.0?
@mumoshu I'm on k8s 1.8, and this language scared me:
The compatibility matrix suggests only the 1.0.x branch is compatible with my Kubernetes version.
I reported the same thing in #929: I set up a GPU pool, and autoscaler works fine scaling up from 1 to n nodes, but not from 0 to n nodes. The error message is:
This is on Kubernetes 1.9.6 with autoscaler 1.1.2. The nodes carry the label, set via this tag on the ASG:

{
  "ResourceType": "auto-scaling-group",
  "ResourceId": "gpus.ci.k8s.local",
  "PropagateAtLaunch": true,
  "Value": "gpus",
  "Key": "k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup"
}

If I start a node, I see it has the required capacity:
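(The capacity output itself is not reproduced here; for reference, on a node that advertises a GPU through the device plugin, the relevant fragment of the node status looks something like the following, with illustrative numbers.)

status:
  capacity:
    cpu: "4"
    memory: 62884356Ki       # illustrative value
    nvidia.com/gpu: "1"      # the extended resource the pending pods request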
This is the simple deployment I use to test it:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: simple-gpu-test
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: "simplegputest"
    spec:
      containers:
      - name: "nvidia-smi-gpu"
        image: "nvidia/cuda:8.0-cudnn5-runtime"
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do nvidia-smi; sleep 5; done;" ]
      volumes:
      - hostPath:
          path: /usr/local/nvidia
        name: nvidia
Hmm, I'm not a Go developer, so I'm not sure how I could backport some of the improvements from 1.2 to 1.1, like the ones from #648.
Fwiw, after upgrading to
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm using autoscaler 1.14.6 (
@icy When scaling from 0 nodes, CA guesses what a new node would look like and checks whether the pending pods would be able to run on such a node. In your case the node predicted by CA doesn't have the label requested by the pod's nodeSelector or nodeAffinity.
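As an illustration (the label key and value here are hypothetical, not taken from this issue), if a pending pod selects nodes like this:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-selector-test
spec:
  nodeSelector:
    accelerator: nvidia-k80          # hypothetical label
  containers:
  - name: cuda
    image: nvidia/cuda:8.0-cudnn5-runtime
    resources:
      limits:
        nvidia.com/gpu: 1

then the ASG backing the empty group needs a matching tag with key k8s.io/cluster-autoscaler/node-template/label/accelerator and value nvidia-k80, in the same style as the kops.k8s.io/instancegroup tag shown earlier, so that the node CA simulates already carries the label.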
What happened:
Cluster Autoscaler will not scale up from zero nodes. However, it will scale up from one node.
I have a node group whose template includes p2.xlarge GPU instances. With zero running instances in my gpu-nodes node group, I create a new Job that requests 2 pods, each with 1 GPU. The pods are unschedulable, and CA logs show:
I0524 15:30:32.066956 1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"engine", Name:"distributed-job-2xp8n", UID:"34fa9255-5f67-11e8-bede-068abf0075c0", APIVersion:"v1", ResourceVersion:"98300", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)
The pods never get created, and no GPU instances get spun up.
What you expected to happen:
CA should scale up the cluster by adding two p2.xlarge instances to the gpu-nodes group.
How to reproduce it (as minimally and precisely as possible):
In a kops cluster on AWS:
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.7", GitCommit:"b30876a5539f09684ff9fde266fda10b37738c9c", GitTreeState:"clean", BuildDate:"2018-01-16T21:52:38Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration: AWS
OS (e.g. from /etc/os-release): k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08
Kernel (e.g. uname -a): Linux ip-172-20-40-189 4.4.115-k8s #1 SMP Thu Feb 8 15:37:40 UTC 2018 x86_64 GNU/Linux
Install tools: kops 1.8.1
Others:
CA image: gcr.io/google_containers/cluster-autoscaler:v1.0.5