
Cluster Autoscaling Quickstart #881

Closed
qwinkler opened this issue Apr 26, 2020 · 7 comments
Labels
triage/support Indicates an issue that is a support question.

Comments

@qwinkler

Hello guys. Thanks for such a great project!

As I understand it, it is possible to integrate the cluster-autoscaler with machine-controller. Is there any guide on how to do it?

First of all, I created the cluster using this quickstart.
Then I installed the cluster-autoscaler with the clusterapi provider:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: autoscaler-cluster-api
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: autoscaler-cluster-api
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: autoscaler-cluster-api
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: "autoscaler"
  name: autoscaler-cluster-api
spec:
  ports:
    - port: 8085
      protocol: TCP
      targetPort: 8085
      name: http
  selector:
    app: "autoscaler"
  type: "ClusterIP"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: "autoscaler"
  name: autoscaler-cluster-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: "autoscaler"
  template:
    metadata:
      labels:
        app: "autoscaler"
    spec:
      containers:
        - name: autoscaler
          image: "k8s.gcr.io/cluster-autoscaler:v1.18.0"
          imagePullPolicy: "IfNotPresent"
          command:
            - ./cluster-autoscaler
            - --cloud-provider=clusterapi
            - --namespace=default
            - --logtostderr=true
            - --stderrthreshold=info
            - --v=4
          livenessProbe:
            httpGet:
              path: /health-check
              port: 8085
          ports:
            - containerPort: 8085
      serviceAccountName: autoscaler-cluster-api

To test the autoscaler, I also created some pods; they are stuck in the Pending state because there are no available worker nodes to schedule them on (a minimal sketch of such a test workload is shown after the MachineDeployment below). Since there are no available nodes, the autoscaler should create them. However, the MachineDeployment that I created with these annotations ignores my annotations: it just created 1 worker node and that's it. Here is my MachineDeployment:

apiVersion: "cluster.k8s.io/v1alpha1"
kind: MachineDeployment
annotations:
  cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "0"
  cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "3"
metadata:
  name: test-worker
  namespace: kube-system
spec:
  paused: false
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  minReadySeconds: 0
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      providerSpec:
        value:
          sshPublicKeys:
            - "my_ssh_key.pub here"
          cloudProvider: "hetzner"
          cloudProviderSpec:
            token:
              secretKeyRef:
                namespace: kube-system
                name: cloud-provider-credentials
                key: HZ_TOKEN
            serverType: "cx11"
            networks:
              - "network_created_in_tutorial"
          operatingSystem: "ubuntu"
          operatingSystemSpec:
            distUpgradeOnBoot: false
      versions:
        kubelet: "v1.16.1"
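
The Pending test pods themselves are nothing special; a minimal pod along these lines is what I mean (just a sketch; the pod name, image, and resource values are only placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: testpod-15
  namespace: default
spec:
  containers:
    - name: pause
      # any image works; the resource requests are what the autoscaler uses
      # to decide whether the pod would fit on a new node
      image: k8s.gcr.io/pause:3.2
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 200m
          memory: 256Mi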

Here are the cluster autoscaler logs:

I0426 07:55:24.738536       1 reflector.go:211] Listing and watching *v1.CSINode from k8s.io/client-go/informers/factory.go:135
E0426 07:55:24.741292       1 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.CSINode: the server could not find the requested resource
I0426 07:55:24.908025       1 reflector.go:211] Listing and watching *unstructured.Unstructured from k8s.io/client-go/dynamic/dynamicinformer/informer.go:91
E0426 07:55:24.910140       1 reflector.go:178] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to list *unstructured.Unstructured: the server could not find the requested resource
I0426 07:55:28.712184       1 reflector.go:211] Listing and watching *unstructured.Unstructured from k8s.io/client-go/dynamic/dynamicinformer/informer.go:91
E0426 07:55:28.714695       1 reflector.go:178] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to list *unstructured.Unstructured: the server could not find the requested resource
I0426 07:55:28.892196       1 reflector.go:211] Listing and watching *v1.CSINode from k8s.io/client-go/informers/factory.go:135
E0426 07:55:28.894531       1 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.CSINode: the server could not find the requested resource
I0426 07:55:30.304634       1 reflector.go:211] Listing and watching *unstructured.Unstructured from k8s.io/client-go/dynamic/dynamicinformer/informer.go:91
E0426 07:55:30.309674       1 reflector.go:178] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to list *unstructured.Unstructured: the server could not find the requested resource
I0426 07:55:30.518605       1 reflector.go:211] Listing and watching *unstructured.Unstructured from k8s.io/client-go/dynamic/dynamicinformer/informer.go:91
E0426 07:55:30.522108       1 reflector.go:178] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to list *unstructured.Unstructured: the server could not find the requested resource

What am I doing wrong? Maybe I need to reconfigure something?

@qwinkler qwinkler added the triage/support Indicates an issue that is a support question. label Apr 26, 2020
@kron4eg
Member

kron4eg commented Apr 26, 2020

Hi,

IIRC the cluster.k8s.io/cluster-api-autoscaler-node-group-min-size annotation cannot be < 1: https://github.com/kubernetes/autoscaler/blob/972e30a5d9eece175a54fa5dfc0ed902b34f02b1/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_utils.go#L92-L94
For additional debugging, please increase --v in the cluster-autoscaler deployment, and probably move it from the default namespace to kube-system.

@qwinkler
Author

Thank you for your help. I increased the log verbosity in the cluster-autoscaler (--v=7) and moved it to kube-system. I also increased the minimum node group size to 1. The behaviour is still the same.

I found out that the autoscaler is looking for the cluster.x-k8s.io/v1alpha2 API, while the machine-controller is using cluster.k8s.io/v1alpha1:

I0426 20:11:52.246854       1 reflector.go:211] Listing and watching *unstructured.Unstructured from k8s.io/client-go/dynamic/dynamicinformer/informer.go:91
I0426 20:11:52.247234       1 round_trippers.go:420] GET https://10.96.0.1:443/apis/cluster.x-k8s.io/v1alpha2/machinedeployments?limit=500&resourceVersion=0
I0426 20:11:52.247276       1 round_trippers.go:427] Request Headers:
I0426 20:11:52.247295       1 round_trippers.go:431]     Accept: application/json
I0426 20:11:52.247308       1 round_trippers.go:431]     User-Agent: cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format
I0426 20:11:52.247323       1 round_trippers.go:431]     Authorization: Bearer <masked>
I0426 20:11:52.251500       1 round_trippers.go:446] Response Status: 404 Not Found in 4 milliseconds
E0426 20:11:52.251699       1 reflector.go:178] k8s.io/client-go/dynamic/dynamicinformer/informer.go:91: Failed to list *unstructured.Unstructured: the server could not find the requested resource
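
For what it's worth, you can double-check which Cluster API group the apiserver actually serves with:

kubectl api-versions | grep cluster

This should list cluster.k8s.io/v1alpha1 (from machine-controller) but not cluster.x-k8s.io, which matches the 404 above.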

The rest of the logs look okay:

I0426 20:11:53.410020       1 round_trippers.go:420] GET https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler
I0426 20:11:53.410092       1 round_trippers.go:427] Request Headers:
I0426 20:11:53.410110       1 round_trippers.go:431]     User-Agent: cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format
I0426 20:11:53.410128       1 round_trippers.go:431]     Authorization: Bearer <masked>
I0426 20:11:53.410139       1 round_trippers.go:431]     Accept: application/json, */*
I0426 20:11:53.414364       1 round_trippers.go:446] Response Status: 200 OK in 4 milliseconds
I0426 20:11:53.414765       1 round_trippers.go:420] PUT https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler
I0426 20:11:53.414786       1 round_trippers.go:427] Request Headers:
I0426 20:11:53.414795       1 round_trippers.go:431]     User-Agent: cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format
I0426 20:11:53.414804       1 round_trippers.go:431]     Authorization: Bearer <masked>
I0426 20:11:53.414810       1 round_trippers.go:431]     Content-Type: application/json
I0426 20:11:53.414817       1 round_trippers.go:431]     Accept: application/json, */*
I0426 20:11:53.418191       1 round_trippers.go:446] Response Status: 200 OK in 3 milliseconds
I0426 20:11:53.418434       1 leaderelection.go:272] successfully renewed lease kube-system/cluster-autoscaler
I0426 20:11:54.806481       1 pathrecorder.go:240] cluster-autoscaler: "/health-check" satisfied by exact match
I0426 20:11:55.418854       1 round_trippers.go:420] GET https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler
I0426 20:11:55.418913       1 round_trippers.go:427] Request Headers:
I0426 20:11:55.418929       1 round_trippers.go:431]     Accept: application/json, */*
I0426 20:11:55.418941       1 round_trippers.go:431]     User-Agent: cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format
I0426 20:11:55.418956       1 round_trippers.go:431]     Authorization: Bearer <masked>
I0426 20:11:55.425058       1 round_trippers.go:446] Response Status: 200 OK in 6 milliseconds
I0426 20:11:55.425542       1 round_trippers.go:420] PUT https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler
I0426 20:11:55.425562       1 round_trippers.go:427] Request Headers:
I0426 20:11:55.425575       1 round_trippers.go:431]     Accept: application/json, */*
I0426 20:11:55.425589       1 round_trippers.go:431]     Authorization: Bearer <masked>
I0426 20:11:55.425601       1 round_trippers.go:431]     User-Agent: cluster-autoscaler/v0.0.0 (linux/amd64) kubernetes/$Format
I0426 20:11:55.425612       1 round_trippers.go:431]     Content-Type: application/json
I0426 20:11:55.429767       1 round_trippers.go:446] Response Status: 200 OK in 4 milliseconds
I0426 20:11:55.429995       1 leaderelection.go:272] successfully renewed lease kube-system/cluster-autoscaler

@kron4eg
Member

kron4eg commented Apr 26, 2020

Looking at this:
https://github.com/kubernetes/autoscaler/blob/972e30a5d9eece175a54fa5dfc0ed902b34f02b1/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go#L44-L46

I think you need to set the CAPI_GROUP=cluster.k8s.io environment variable in the cluster-autoscaler deployment.
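
Something along these lines in the container spec should do it (just a sketch, adapt it to your deployment manifest):

          env:
            - name: CAPI_GROUP
              value: "cluster.k8s.io"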

@qwinkler
Author

qwinkler commented Apr 27, 2020

@kron4eg Thanks! I didn't find it because this was added after the v1.18.0 release. I'll have to build the image myself from the master branch.

I ran into another problem:

I0427 06:14:18.043715       1 scale_up.go:326] Pod default/testpod-15 is unschedulable
I0427 06:14:18.044563       1 scale_up.go:364] Upcoming 0 nodes
I0427 06:14:18.044623       1 scale_up.go:441] No expansion options

Also, I found this strange log line:

I0427 06:14:18.052827       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"testpod-15", UID:"32a02e16-6396-4bd3-978f-06f781fdd94a", APIVersion:"v1", ResourceVersion:"17940", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added):

It is strange, because here are the pod's requests and limits:

    Limits:
      cpu:     200m
      memory:  256Mi
    Requests:
      cpu:        200m
      memory:     256Mi

And I chose the cx31 server type (2 vCPU, 8 GB RAM).

And this is the only Pending pod in the whole cluster.

UPD: I tried to manually scale the nodes. Scaling down does not work either:

I0427 08:31:14.565562       1 pre_filtering_processor.go:57] Skipping sm-control-plane-2 - no node group config
I0427 08:31:14.565662       1 pre_filtering_processor.go:57] Skipping sm-control-plane-3 - no node group config
I0427 08:31:14.565889       1 pre_filtering_processor.go:57] Skipping sm-pool1-86c4c676b7-mxh5p - no node group config
I0427 08:31:14.566061       1 pre_filtering_processor.go:57] Skipping sm-test-5546dff48b-vb88h - no node group config
I0427 08:31:14.566236       1 pre_filtering_processor.go:57] Skipping sm-test-5546dff48b-vnlgb - no node group config
I0427 08:31:14.566375       1 pre_filtering_processor.go:57] Skipping sm-test-5546dff48b-lmpjx - no node group config
I0427 08:31:14.566523       1 pre_filtering_processor.go:57] Skipping sm-test-5546dff48b-7ql6c - no node group config
I0427 08:31:14.566557       1 pre_filtering_processor.go:57] Skipping sm-control-plane-1 - no node group config
I0427 08:31:14.566614       1 static_autoscaler.go:500] Scale down status: unneededOnly=false lastScaleUpTime=2020-04-27 06:14:08.038348682 +0000 UTC m=+18.287526670 lastScaleDownDeleteTime=2020-04-27 06:
14:08.038348923 +0000 UTC m=+18.287526907 lastScaleDownFailTime=2020-04-27 06:14:08.038349147 +0000 UTC m=+18.287527130 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0427 08:31:14.566701       1 static_autoscaler.go:513] Starting scale down
I0427 08:31:14.566948       1 scale_down.go:867] No candidates for scale down

@kron4eg
Member

kron4eg commented Apr 27, 2020

To tell you the truth, I'm not really sure why it doesn't work; we have yet to test-drive the integration ourselves, see #391.

@qwinkler
Author

@kron4eg After some debugging, I found that I had put the annotations in the wrong place in my MachineDeployment (they were not under metadata) 🤦
I fixed it and now it works like a charm!

To make the autoscaler work you will need to:
Build your own Docker image, because these changes haven't been released yet (https://github.com/kubernetes/autoscaler/blob/972e30a5d9eece175a54fa5dfc0ed902b34f02b1/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_controller.go#L44-L46):

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/cluster-autoscaler
make build-in-docker && make make-image
docker tag staging-k8s.gcr.io/cluster-autoscaler:dev username/cluster-autoscaler:tag
docker push username/cluster-autoscaler:tag

Set the --cloud-provider flag and the CAPI_GROUP environment variable in the cluster-autoscaler deployment. Example:

    spec:
      containers:
        - image: "username/cluster-autoscaler:tag"
          command:
            - ./cluster-autoscaler
            - --cloud-provider=clusterapi
            - --namespace=kube-system
            - --logtostderr=true
            - --stderrthreshold=info
            - --v=4
          env:
            - name: CAPI_GROUP
              value: "cluster.k8s.io"

Create the new MachineDeployment with the correct annotations (under metadata):

apiVersion: "cluster.k8s.io/v1alpha1"
kind: MachineDeployment
metadata:
  name: autoscaling-pool
  namespace: kube-system
  annotations:
    cluster.k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.k8s.io/cluster-api-autoscaler-node-group-max-size: "3"
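
To verify that the autoscaler picks the node group up, checking the MachineDeployment and the autoscaler logs is enough (assuming the autoscaler deployment keeps the autoscaler-cluster-api name from the beginning of this issue):

kubectl -n kube-system get machinedeployment autoscaling-pool
kubectl -n kube-system logs deployment/autoscaler-cluster-api | grep -i "node group"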

@kron4eg
Member

kron4eg commented Apr 28, 2020

Oh... hehe. Thanks for the update @qwinkler!
