
CA does not scale up from zero nodes in group #903

Closed
wskinner opened this issue May 29, 2018 · 17 comments
Labels
area/cluster-autoscaler, area/provider/aws, kind/bug, lifecycle/rotten

Comments

@wskinner

What happened:
Cluster Autoscaler will not scale up from zero nodes. However, it will scale up from one node.
I have a node group whose template includes p2.xlarge GPU instances. With zero running instances in my gpu-nodes node group, I create a new Job that requests 2 pods, each with 1 GPU. The pods are unschedulable, and CA logs show:
I0524 15:30:32.066956 1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"engine", Name:"distributed-job-2xp8n", UID:"34fa9255-5f67-11e8-bede-068abf0075c0", APIVersion:"v1", ResourceVersion:"98300", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)
The pods never get scheduled, and no GPU instances get spun up.

What you expected to happen:
CA should scale up the cluster by adding two p2.xlarge instances to the gpu-nodes group.

How to reproduce it (as minimally and precisely as possible):
In a kops cluster on AWS:

  1. Create a node group which includes p2.xlarge as the instance type.
  2. Create a Job that creates 2 pods, each requesting a single nvidia.com/gpu.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.7", GitCommit:"b30876a5539f09684ff9fde266fda10b37738c9c", GitTreeState:"clean", BuildDate:"2018-01-16T21:52:38Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: AWS

  • OS (e.g. from /etc/os-release): k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08

  • Kernel (e.g. uname -a): Linux ip-172-20-40-189 4.4.115-k8s #1 SMP Thu Feb 8 15:37:40 UTC 2018 x86_64 GNU/Linux

  • Install tools: kops 1.8.1

  • Others:
    CA image: gcr.io/google_containers/cluster-autoscaler:v1.0.5

@bskiba added the bug, area/cluster-autoscaler, and area/provider/aws labels on May 29, 2018
@bskiba
Member

bskiba commented May 29, 2018

@mumoshu I'm not sure how GPUs are handled by scale-from-zero on AWS. Can you help?

@mumoshu
Contributor

mumoshu commented May 29, 2018

@bskiba Hi. Unfortunately I'm not familiar with this specific feature, but my understanding is that the aws provider needs node templates to support it, and according to the documentation that is implemented.

So,

@wskinner Did you add the required tags to the ASGs that back your node groups? You need those tags for the scale-from-zero feature to work. This part of the cluster-autoscaler doc should help!
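
For illustration only, here is a minimal sketch of adding those tags programmatically with aws-sdk-go. The ASG name is a placeholder, and the nvidia.com/gpu resources tag follows the node-template tag convention described in the AWS cloudprovider README; whether your CA version honors that tag is something to verify, so treat it as an assumption rather than a confirmed fix. The same tags can of course also be added via the AWS console, CLI, or kops cloudLabels.

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
    svc := autoscaling.New(session.Must(session.NewSession()))

    // Placeholder ASG name: replace with the ASG that backs your gpu-nodes group.
    asgName := "gpu-nodes.example-cluster.k8s.local"

    _, err := svc.CreateOrUpdateTags(&autoscaling.CreateOrUpdateTagsInput{
        Tags: []*autoscaling.Tag{
            {
                // Label tag: lets CA give the template node the kops instancegroup label.
                ResourceId:        aws.String(asgName),
                ResourceType:      aws.String("auto-scaling-group"),
                Key:               aws.String("k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup"),
                Value:             aws.String("gpu-nodes"),
                PropagateAtLaunch: aws.Bool(true),
            },
            {
                // Resource tag: advertises one nvidia.com/gpu per node. Assumed tag format
                // from the AWS cloudprovider README; check it against your CA version.
                ResourceId:        aws.String(asgName),
                ResourceType:      aws.String("auto-scaling-group"),
                Key:               aws.String("k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu"),
                Value:             aws.String("1"),
                PropagateAtLaunch: aws.Bool(true),
            },
        },
    })
    if err != nil {
        log.Fatalf("failed to tag ASG: %v", err)
    }
}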

@wskinner
Author

wskinner commented May 29, 2018

@mumoshu I thought I had added those tags but I hadn't. Now that I have them, I am seeing a different issue. The logs say scale up failed due to insufficient GPUs.

I0529 18:18:13.297024 1 utils.go:432] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0529 18:18:13.297494 1 scale_up.go:54] Pod engine/distributed-job-9xd62 is unschedulable
I0529 18:18:13.297544 1 scale_up.go:54] Pod engine/distributed-job-4fkjq is unschedulable
I0529 18:18:13.553111 1 scale_up.go:146] Scale-up predicate failed: PodFitsResources predicate mismatch, cannot put engine/distributed-job-9xd62 on template-node-for-gpu-nodes.<my-cluster>-5817535016063941893, reason: Insufficient nvidia.com/gpu
I0529 18:18:13.553234 1 scale_up.go:146] Scale-up predicate failed: PodFitsResources predicate mismatch, cannot put engine/distributed-job-4fkjq on template-node-for-gpu-nodes.<my-cluster>-5817535016063941893, reason: Insufficient nvidia.com/gpu

The spec for gpu-nodes looks like this:

  image: <my image>
  kubelet:
    featureGates:
      DevicePlugins: "true"
  machineType: p2.xlarge
  maxSize: 4
  minSize: 0
  nodeLabels:
    can-scale-to-zero: "true"
    kops.k8s.io/instancegroup: gpu-nodes
  role: Node
  subnets:
  - us-west-2a

@wskinner
Author

wskinner commented Jun 1, 2018

@mumoshu Any idea what's going on here?

@mumoshu
Contributor

mumoshu commented Jun 5, 2018

@wskinner Hi. Thanks for the report!

I took some time to read the relevant code, and the aws provider does seem to support the scale-from-zero feature for GPU nodes. So my guess is that we're still missing something. Could it be that your cluster-autoscaler is outdated and missing a fix relevant to this feature? Which version of CA are you using?


Basically, cluster-autoscaler needs to build a "node template" that describes how many cores and GPUs, and how much memory, a node in the group would provide. For the aws provider we build it from the launch configuration of the relevant ASG:

func (m *AwsManager) getAsgTemplate(asg *asg) (*asgTemplate, error) {
    // Resolve the instance type from the ASG's launch configuration.
    instanceTypeName, err := m.service.getInstanceTypeByLCName(asg.LaunchConfigurationName)
    if err != nil {
        return nil, err
    }
    if len(asg.AvailabilityZones) < 1 {
        return nil, fmt.Errorf("Unable to get first AvailabilityZone for %s", asg.Name)
    }
    az := asg.AvailabilityZones[0]
    region := az[0 : len(az)-1]
    if len(asg.AvailabilityZones) > 1 {
        glog.Warningf("Found multiple availability zones, using %s\n", az)
    }
    // The template pairs the static per-instance-type data (CPU, memory, GPU)
    // with the ASG's tags; this is all CA has to go on when the group is at zero nodes.
    return &asgTemplate{
        InstanceType: InstanceTypes[instanceTypeName],
        Region:       region,
        Zone:         az,
        Tags:         asg.Tags,
    }, nil
}

Also:

template, err := ng.awsManager.getAsgTemplate(ng.asg)

node.Status.Capacity[gpu.ResourceNvidiaGPU] = *resource.NewQuantity(template.InstanceType.GPU, resource.DecimalSI)

And we do have the correct number of GPUs set for p2.xlarge:

"p2.xlarge": {
InstanceType: "p2.xlarge",
VCPU: 4,
MemoryMb: 62464,
GPU: 1,
},

So it should just work once things are set up correctly. But please feel free to ask me anything.
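
Putting the snippets above together, here is a rough sketch (simplified, not the exact CA source) of how the template node's capacity ends up carrying the GPU count. It assumes the asgTemplate and InstanceTypes definitions shown above (with int64 fields), plus the usual apiv1 (k8s.io/api/core/v1) and resource (k8s.io/apimachinery/pkg/api/resource) packages:

import (
    apiv1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// buildTemplateNodeSketch assembles a fake "template node" whose capacity is
// derived from the asgTemplate, so scheduler predicates such as PodFitsResources
// can be evaluated against it even when the group currently has zero nodes.
func buildTemplateNodeSketch(template *asgTemplate) *apiv1.Node {
    node := &apiv1.Node{}
    node.Status.Capacity = apiv1.ResourceList{
        apiv1.ResourceCPU:    *resource.NewQuantity(template.InstanceType.VCPU, resource.DecimalSI),
        apiv1.ResourceMemory: *resource.NewQuantity(template.InstanceType.MemoryMb*1024*1024, resource.DecimalSI),
        // For p2.xlarge this advertises nvidia.com/gpu: 1, which is what the
        // "Insufficient nvidia.com/gpu" predicate failure above compares against.
        "nvidia.com/gpu": *resource.NewQuantity(template.InstanceType.GPU, resource.DecimalSI),
    }
    return node
}

If nvidia.com/gpu never shows up in the template on your version, that would point at an outdated image rather than a tagging problem, which matches the upgrade recommendation below.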

@mumoshu
Contributor

mumoshu commented Jun 5, 2018

@wskinner And your CA looks outdated to me:

gcr.io/google_containers/cluster-autoscaler:v1.0.5

Can you upgrade it to 1.2.0?

@k8s-ci-robot added the kind/bug label and removed the bug label on Jun 5, 2018
@wskinner
Author

wskinner commented Jun 5, 2018

@mumoshu I'm on k8s 1.8, and this language scared me:

We strongly recommend using Cluster Autoscaler with version for which it was meant. We don't do ANY cross version testing so if you put the newest Cluster Autoscaler on an old cluster there is a big chance that it won't work as expected.

The compatibility matrix suggests only the 1.0.x branch is compatible with my Kubernetes version.
I did try 1.2.0 as you recommended, and encountered this:
I0605 18:08:10.826924 1 scale_up.go:59] Pod <podname1> is unschedulable
I0605 18:08:10.826977 1 scale_up.go:59] Pod <podname2> is unschedulable
I0605 18:08:11.533804 1 scale_up.go:186] No expansion options

@mumoshu
Copy link
Contributor

mumoshu commented Jun 7, 2018

@wskinner I understand your situation. Then my best recommendation is to cherry-pick 4eb8391 into CA v1.0.x and build your own Docker image from that.

@alexnederlof

I reported the same thing in #929:

I set up a GPU pool, and the autoscaler works fine scaling up from 1 to n nodes, but not from 0 to n nodes. The error message is:

I0605 11:27:29.865576       1 scale_up.go:54] Pod default/simple-gpu-test-6f48d9555d-l9822 is unschedulable
I0605 11:27:29.961051       1 scale_up.go:86] Upcoming 0 nodes
I0605 11:27:30.005163       1 scale_up.go:146] Scale-up predicate failed: PodFitsResources predicate mismatch, cannot put default/simple-gpu-test-6f48d9555d-l9822 on template-node-for-gpus.ci.k8s.local-5829202798403814789, reason: Insufficient nvidia.com/gpu
I0605 11:27:30.005262       1 scale_up.go:175] No pod can fit to gpus.ci.k8s.local
I0605 11:27:30.005324       1 scale_up.go:180] No expansion options
I0605 11:27:30.005393       1 static_autoscaler.go:299] Calculating unneeded nodes
I0605 11:27:30.008919       1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"simple-gpu-test-6f48d9555d-l9822", UID:"3416d787-68b3-11e8-8e8f-0639a6e973b0", APIVersion:"v1", ResourceVersion:"12429157", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)
I0605 11:27:30.031707       1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler

This is on Kubernetes 1.9.6 with autoscaler 1.1.2.

The nodes carry the label kops.k8s.io/instancegroup=gpus, which is also present as a tag on the Auto Scaling group in AWS:

{
    "ResourceType": "auto-scaling-group",
    "ResourceId": "gpus.ci.k8s.local",
    "PropagateAtLaunch": true,
    "Value": "gpus",
    "Key": "k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup"
},

If I start a node, I see it has the required capacity:

Capacity:
 cpu:             4
 memory:          62884036Ki
 nvidia.com/gpu:  1
 pods:            110

This is the simple deployment I use to test it:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: simple-gpu-test
spec: 
  replicas: 1
  template:
    metadata:
      labels:
        app: "simplegputest"
    spec:
      containers: 
      - name: "nvidia-smi-gpu"
        image: "nvidia/cuda:8.0-cudnn5-runtime"
        resources: 
          limits: 
             nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do nvidia-smi; sleep 5; done;" ]
      volumes:
      - hostPath:
          path: /usr/local/nvidia
        name: nvidia

@alexnederlof

Hmm, I'm not a Go developer, so I'm not sure how I can backport some of the improvements from 1.2 to 1.1, like the ones from #648.

@osterman

Fwiw, after upgrading to 1.2.0 it works for me. Not sure if/what other regressions were introduced, but autoscaling from zero works and I get pods scheduled on the instances.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Nov 27, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 27, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@icy
Contributor

icy commented Dec 4, 2019

I'm using autoscaler 1.14.6 (k8s.gcr.io/cluster-autoscaler:v1.14.6), and I'm hitting this issue: the autoscaler doesn't scale out when there isn't any node in the group. The errors I found:

I1204 11:12:56.381397       1 utils.go:237] Pod circleci-2c9db67ab-elasticsearch-5cc75895f9-kmhvs can't be scheduled on eks-euw1-local20191203121554568900000005, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
I1204 11:12:56.381493       1 utils.go:237] Pod circleci-2c9db67ab-postgres-5d79d447c7-6vtmh can't be scheduled on eks-euw1-local20191203121554568900000005, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
I1204 11:12:56.381509       1 utils.go:237] Pod circleci-2c9db67ab-redis-848667559b-5z4p8 can't be scheduled on eks-euw1-local20191203121554568900000005, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector

@MaciekPytel
Contributor

@icy When scaling from 0 nodes, CA guesses what a new node would look like and checks whether the pending pods would be able to run on such a node. In your case the node predicted by CA doesn't have the label requested by the pods via nodeSelector or nodeAffinity.
The logic for guessing what labels the first node in a given node group would have is specific to each cloudprovider and (unless you're using hosted k8s such as GKE) requires some manual tagging of the underlying cloudprovider autoscaling group. The details are described in the README of each cloudprovider.
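
To make that concrete, the GeneralPredicates failure in the log above boils down to a check along these lines (a simplified sketch, not CA's actual code): every key/value pair in the pod's nodeSelector must be present on the guessed node's labels, which for the AWS provider come from the k8s.io/cluster-autoscaler/node-template/label/... tags on the ASG.

// nodeSelectorMatches is a simplified stand-in for the nodeSelector part of the
// GeneralPredicates check: a pending pod only "fits" the guessed first node of a
// group if that node's labels satisfy every entry in the pod's nodeSelector.
func nodeSelectorMatches(templateNodeLabels, podNodeSelector map[string]string) bool {
    for key, want := range podNodeSelector {
        if got, ok := templateNodeLabels[key]; !ok || got != want {
            return false
        }
    }
    return true
}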
