Scaling up does not work correctly on Virtual Kubelet #906

Closed
3d0it opened this issue Jul 1, 2020 · 26 comments
Labels: bug (Something isn't working), stale (All issues that are marked as stale due to inactivity)

3d0it commented Jul 1, 2020

A deployment with replicas = 0 scales up to 1 not only when the threshold is reached, but as soon as the metric moves from 0 to 1
(ONLY on Virtual Kubelet; on physical nodes it works as expected).

Expected Behavior

The deployment should scale up when the threshold is reached, not when the metric changes from 0 to 1.

Actual Behavior

The deployment is scaled up as soon as the metrics query returns 1.

Steps to Reproduce the Problem

  1. Create a deployment with min replica count = 0
  2. Write a simple query with a threshold greater than 1 (for example 2)
  3. When the query returns 1 instead of 0, a new instance of the deployment is created

I attached my yaml file. In my scenario, I have a deployment with only one replica (deployment name pushengineweb) on physical nodes.
When a custom Prometheus query (a simple count of connected clients) reaches the threshold, I want to create a new pod on the Virtual Kubelet (deployment name pushengineweb-scale).

The same configuration, scaling on physical nodes instead of the Virtual Kubelet, works fine. A rough sketch of the ScaledObject is shown below.
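
For reference, a minimal sketch of what such a KEDA 1.x ScaledObject could look like. The real configuration is in the attached example.txt; the Prometheus serverAddress and query below are placeholders, while the metric name, threshold, and replica bounds match the HPA output later in this thread:

apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: pushengineweb-scaledobject
  namespace: default
  labels:
    deploymentName: pushengineweb-scale
spec:
  scaleTargetRef:
    deploymentName: pushengineweb-scale   # the deployment running on the Virtual Kubelet
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 4
  triggers:
  - type: prometheus
    metadata:
      # serverAddress and query are placeholders, not taken from example.txt
      serverAddress: http://prometheus-operated.default.svc:9090
      metricName: pushengine_connected_client
      threshold: "2"
      query: sum(pushengine_connected_client)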

Specifications

  • KEDA Version: 1.4.1
  • Platform & Version: Azure AKS
  • Kubernetes Version: 1.16.9
  • Scaler(s): Prometheus - chart prometheus-operator-8.10.0 app version 0.36.0
  • Attachment: example.txt
3d0it added the bug label on Jul 1, 2020
tomkerkhove (Member) commented:

Hm, we don't control what Virtual Kubelet does; we merely handle 0 -> 1 scaling and the rest is up to the Kubernetes HPA. So if VK decides to kick in, that's because it made that decision.

Can you share the logs from KEDA please?
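
(A quick sketch of how those logs can be pulled, assuming a default KEDA 1.x install with the operator deployed as keda-operator in the keda namespace; adjust the names to your setup:)

# Tail the KEDA operator logs; deployment and namespace names assume a default install
kubectl logs -n keda deployment/keda-operator --tail=200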


3d0it commented Jul 1, 2020

Hi, this is the last part of the log. I recreated the case described in the issue a few minutes ago:

W0630 13:33:41.429075       1 reflector.go:289] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: watch of *unstructured.Unstructured ended with: too old resource version: 23416111 (23417551)
W0630 14:04:11.452836       1 reflector.go:289] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: watch of *unstructured.Unstructured ended with: too old resource version: 23421646 (23422593)
W0630 14:33:40.468099       1 reflector.go:289] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: watch of *unstructured.Unstructured ended with: too old resource version: 23426687 (23427430)
W0630 14:59:45.483728       1 reflector.go:289] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: watch of *unstructured.Unstructured ended with: too old resource version: 23431563 (23431769)
W0630 15:26:03.497028       1 reflector.go:289] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: watch of *unstructured.Unstructured ended with: too old resource version: 23435863 (23436111)
{"level":"info","ts":1593532066.1006677,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593532066.1007018,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593532355.8509173,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593532355.8509517,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593558317.7918286,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593558317.7918668,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593594317.7921321,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593594317.7921786,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593607721.0510402,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593607721.051074,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593607721.3310325,"logger":"controller_scaledobject","msg":"Updated HPA according to ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject","HPA.Namespace":"default","HPA.Name":"keda-hpa-pushengineweb-scale"}
{"level":"info","ts":1593607721.3315866,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593607721.3316038,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593607730.4269576,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593607730.4270058,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593607826.3706076,"logger":"scalehandler","msg":"Successfully updated deployment","ScaledObject.Namespace":"default","ScaledObject.Name":"pushengineweb-scaledobject","ScaledObject.ScaleType":"deployment","Deployment.Namespace":"default","Deployment.Name":"pushengineweb-scale","Original Replicas Count":0,"New Replicas Count":1}
{"level":"info","ts":1593607837.8436048,"logger":"controller_scaledobject","msg":"Reconciling ScaledObject","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}
{"level":"info","ts":1593607837.8436441,"logger":"controller_scaledobject","msg":"Detected ScaleType = Deployment","Request.Namespace":"default","Request.Name":"pushengineweb-scaledobject"}

Thanks


3d0it commented Jul 3, 2020

Hello, I am also adding the HPA description.

If I scale on physical nodes, this "Conditions" section is not logged.

kubectl describe horizontalpodautoscaler.autoscaling/keda-hpa-pushengineweb-scale
Name:                                                    keda-hpa-pushengineweb-scale
Namespace:                                               default
Labels:                                                  app.kubernetes.io/managed-by=keda-operator
                                                         app.kubernetes.io/name=keda-hpa-pushengineweb-scale
                                                         app.kubernetes.io/part-of=pushengineweb-scaledobject
                                                         app.kubernetes.io/version=1.4.1
Annotations:                                             <none>
CreationTimestamp:                                       Fri, 03 Jul 2020 08:54:10 +0200
Reference:                                               Deployment/pushengineweb-scale
Metrics:                                                 ( current / target )
  "pushengine_connected_client" (target average value):  <unknown> / 2
Min replicas:                                            1
Max replicas:                                            4
Deployment pods:                                         0 current / 0 desired
Conditions:
  Type           Status  Reason             Message
  ----           ------  ------             -------
  AbleToScale    True    SucceededGetScale  the HPA controller was able to get the target's current scale
  ScalingActive  False   ScalingDisabled    scaling is disabled since the replica count of the target is zero
Events:          <none>

Thanks


3d0it commented Jul 3, 2020

Hi, one more piece of information, in case it is useful: scaling down to zero as described in issue #524 works fine; only scaling up from zero does not work as expected.

mchmarny (Contributor) commented:

I can reliably reproduce this issue using this demo. Example log showing significant queue lag:

{"level":"debug","ts":1598799261.2208235,"logger":"kafka_scaler","msg":"Group autoscaling has a lag of 147410 for topic messages and partition 0\n"}
{"level":"debug","ts":1598799261.2208657,"logger":"scalehandler","msg":"Scaler for scaledObject is active","ScaledObject.Namespace":"default","ScaledObject.Name":"queue-outoscaling-scaler","ScaledObject.ScaleType":"deployment","Scaler":{}}
{"level":"debug","ts":1598799261.2310224,"logger":"scalehandler","msg":"ScaledObject's Status was properly updated","ScaledObject.Namespace":"default","ScaledObject.Name":"queue-outoscaling-scaler","ScaledObject.ScaleType":"deployment"}

And the HPA error:

W0830 14:54:22.057920       1 reflector.go:289] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: watch of *unstructured.Unstructured ended with: too old resource version: 1894736 (1894804)

tomkerkhove (Member) commented:

Can we look into this for v2 @zroubalik @ahmelsayed?

zroubalik (Member) commented:

To me this looks more like a problem in the internal Go client libraries and not in KEDA. I don't have access to a cluster with Virtual Kubelet, so I can't reproduce :(
But it might be solved in v2, since we have significantly updated the client-go version.


mchmarny commented Aug 31, 2020

Just to clarify, I'm able to reproduce this on a vanilla AKS cluster (v1.16.13 right now, although I've seen this on v1.17 as well). I've put together a script to reproduce this issue here.


yaron2 commented Aug 31, 2020

@zroubalik @tomkerkhove this seems to reproduce on non-Virtual-Kubelet clusters as @mchmarny mentioned. I think this needs a 1.x minor release with the updated Go client.

tomkerkhove (Member) commented:

In that case we should be good to go with v2 as @zroubalik mentioned!

Stay tuned, beta coming soon. Would you mind testing this out then?


mchmarny commented Aug 31, 2020

After updating the ScaledObject for v2 compatibility, the error is no longer logged and the HPA is successfully updated. Tested for the 0-1, 1-n, n-1, and 1-0 scenarios.

2020-08-31T19:29:35.146Z	INFO	controllers.ScaledObject	Reconciling ScaledObject	{"Request.Namespace": "default", "Request.Name": "queue-outoscaling-scaler"}
2020-08-31T19:29:35.205Z	INFO	controllers.ScaledObject	Updated HPA according to ScaledObject	{"Request.Namespace": "default", "Request.Name": "queue-outoscaling-scaler", "HPA.Namespace": "default", "HPA.Name": "keda-hpa-queue-outoscaling-scaler"}
2020-08-31T19:29:35.205Z	INFO	controllers.ScaledObject	Initializing Scaling logic according to ScaledObject Specification	{"Request.Namespace": "default", "Request.Name": "queue-outoscaling-scaler"}
2020-08-31T19:29:35.249Z	INFO	scaleexecutor	Successfully set ScaleTarget replicas count to ScaledObject minReplicaCount	{"scaledobject.Name": "queue-outoscaling-scaler", "scaledObject.Namespace": "default", "scaleTarget.Name": "queue-outoscaling-subscriber", "ScaleTarget.Replicas": 1}

BTW, can you add topic lag back in on ScaledObject reconciling? That log entry was super helpful in debugging in v1.5.

WRT signaling this to users: given this impacts everyone using Helm to deploy onto the default k8s version in most CSPs right now, I agree with @yaron2 that a 1.x update is needed ASAP. Additionally, it may be worth indicating which version of k8s the current KEDA release works with reliably, and pointing users to the deployment from head as a stopgap for anything after.


zroubalik commented Sep 1, 2020

@mchmarny thanks for the confirmation! I am glad it is fixed.

BTW, can you add topic lag back in on ScaledObject reconciling? That log entry was super helpful in debugging in v1.5.

The log is still there; have you tried enabling the debug log level? (FYI, this was changed as well in v2: https://github.com/kedacore/keda#setting-log-levels)

Updating the Go client libraries in v1 would be non-trivial; we encourage users to migrate to v2 as soon as it is out. As for the supported k8s version, users should be able to run KEDA on k8s 1.13+; this problem is probably not in Kubernetes itself but in the client libraries, IMHO :)
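
For anyone following along, a rough sketch of where that debug level is typically set in v2, assuming the default operator Deployment and container names (the linked README is the authoritative reference):

# Sketch: add the zap flags to the keda-operator container args, e.g. via
# kubectl edit deployment keda-operator -n keda
    spec:
      template:
        spec:
          containers:
          - name: keda-operator
            args:
            - --zap-log-level=debug   # error / info / debug
            - --zap-encoder=console   # console or json output format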

mchmarny (Contributor) commented:

I was excited to see KEDA v2.0.0-beta come out, so I've retested this issue with the new version. I'm still seeing the same issue: KEDA sees the correct lag but does not scale the target.

I've updated the demo so it's easier to reproduce. I'm seeing this on AKS Kubernetes version 1.16.13.

2020-09-14T02:05:52.300Z	INFO	controllers.ScaledObject	Initializing Scaling logic according to ScaledObject Specification	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler"}
2020-09-14T02:05:52.300Z	DEBUG	controllers.ScaledObject	ScaledObject is defined correctly and is ready for scaling	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler"}
2020-09-14T02:05:52.306Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "queue-depth-scaler", "namespace": "default"}
2020-09-14T02:05:52.306Z	INFO	controllers.ScaledObject	Reconciling ScaledObject	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler"}
2020-09-14T02:05:52.306Z	DEBUG	controllers.ScaledObject	Parsed Group, Version, Kind, Resource	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2020-09-14T02:05:52.345Z	DEBUG	controllers.ScaledObject	ScaledObject is defined correctly and is ready for scaling	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler"}
2020-09-14T02:05:52.347Z	DEBUG	kafka_scaler	Group autoscaling-subscriber has a lag of 508 for topic metrics and partition 0

2020-09-14T02:05:52.347Z	DEBUG	scalehandler	Scaler for scaledObject is active	{"Metrics Name": "kafka-metrics-autoscaling-subscriber"}
2020-09-14T02:05:52.351Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "queue-depth-scaler", "namespace": "default"}
2020-09-14T02:05:52.364Z	DEBUG	scalehandler	Watching with pollingInterval	{"type": "ScaledObject", "namespace": "default", "name": "queue-depth-scaler", "PollingInterval": "10s"}
2020-09-14T02:06:02.413Z	DEBUG	kafka_scaler	Group autoscaling-subscriber has a lag of 507 for topic metrics and partition 0

tomkerkhove (Member) commented:

Can you have a look please @ahmelsayed ?

@mchmarny can you share some info on what you are seeing? Does it stay at 0 instances, or what happens?

zroubalik (Member) commented:

@mchmarny it would be great if you could take a look at the KEDA Metrics Server logs, to check whether there is some problem.

mchmarny (Contributor) commented:

The configured HPA doesn't seem to take effect. I'm seeing KEDA doing its 0-1 and 1-0 scaling on the target just fine. As the lag increases, however, the correct queue lag is logged by the operator, but the configured HPA does not scale the target beyond 1, even when its target is 3. This is all on AKS with Kubernetes version v1.16.13.

Here is the scaled object I'm using to replicate this issue:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: subscriber-scaler
spec:
  scaleTargetRef:
    name: autoscaling-subscriber
  pollingInterval: 15
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 30
  triggers:
  - type: kafka
    metadata:
      topic: metrics
      bootstrapServers: kafka-cp-kafka.kafka.svc.cluster.local:9092
      consumerGroup: autoscaling-subscriber
      lagThreshold: "3"

Controller logs after I apply it:

2020-09-14T12:38:06.812Z	INFO	controllers.ScaledObject	Reconciling ScaledObject	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.812Z	INFO	controllers.ScaledObject	Adding Finalizer for the ScaledObject	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.837Z	DEBUG	controllers.ScaledObject	Adding scaledObjectName label on ScaledObject	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "value": "subscriber-scaler"}
2020-09-14T12:38:06.843Z	DEBUG	controllers.ScaledObject	Parsed Group, Version, Kind, Resource	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2020-09-14T12:38:06.855Z	INFO	controllers.ScaledObject	Detected resource targeted for scaling	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "resource": "apps/v1.Deployment", "name": "autoscaling-subscriber"}
2020-09-14T12:38:06.855Z	INFO	controllers.ScaledObject	Creating a new HPA	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "HPA.Namespace": "default", "HPA.Name": "keda-hpa-subscriber-scaler"}
2020-09-14T12:38:06.888Z	DEBUG	controllers.ScaledObject	Notify scaleHandler of an update in scaledObject	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.888Z	INFO	controllers.ScaledObject	Initializing Scaling logic according to ScaledObject Specification	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.888Z	DEBUG	controllers.ScaledObject	ScaledObject is defined correctly and is ready for scaling	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.895Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "subscriber-scaler", "namespace": "default"}
2020-09-14T12:38:06.895Z	INFO	controllers.ScaledObject	Reconciling ScaledObject	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.895Z	DEBUG	controllers.ScaledObject	Parsed Group, Version, Kind, Resource	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2020-09-14T12:38:06.938Z	DEBUG	controllers.ScaledObject	ScaledObject is defined correctly and is ready for scaling	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.939Z	DEBUG	kafka_scaler	Group autoscaling-subscriber has a lag of 0 for topic metrics and partition 0
2020-09-14T12:38:06.944Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "subscriber-scaler", "namespace": "default"}
2020-09-14T12:38:06.948Z	INFO	scaleexecutor	Successfully scaled ScaleTarget to 0 replicas	{"scaledobject.Name": "subscriber-scaler", "scaledObject.Namespace": "default", "scaleTarget.Name": "autoscaling-subscriber"}
2020-09-14T12:38:06.955Z	DEBUG	scalehandler	Watching with pollingInterval	{"type": "ScaledObject", "namespace": "default", "name": "subscriber-scaler", "PollingInterval": "15s"}

And the created HPA:

Name:                                                             keda-hpa-subscriber-scaler
Namespace:                                                        default
Labels:                                                           app.kubernetes.io/managed-by=keda-operator
                                                                  app.kubernetes.io/name=keda-hpa-subscriber-scaler
                                                                  app.kubernetes.io/part-of=subscriber-scaler
                                                                  app.kubernetes.io/version=v2
                                                                  scaledObjectName=subscriber-scaler
Annotations:                                                      <none>
CreationTimestamp:                                                Mon, 14 Sep 2020 05:38:06 -0700
Reference:                                                        Deployment/autoscaling-subscriber
Metrics:                                                          ( current / target )
  "kafka-metrics-autoscaling-subscriber" (target average value):  <unknown> / 3
Min replicas:                                                     1
Max replicas:                                                     10
Deployment pods:                                                  0 current / 0 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    SucceededGetScale   the HPA controller was able to get the target's current scale
  ScalingActive   False   ScalingDisabled     scaling is disabled since the replica count of the target is zero
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:           <none>

And when the lag increases:

2020-09-14T12:41:22.586Z	DEBUG	kafka_scaler	Group autoscaling-subscriber has a lag of 252 for topic metrics and partition 0
2020-09-14T12:41:22.586Z	DEBUG	scalehandler	Scaler for scaledObject is active	{"Metrics Name": "kafka-metrics-autoscaling-subscriber"}
2020-09-14T12:41:22.597Z	INFO	scaleexecutor	Successfully updated ScaleTarget	{"scaledobject.Name": "subscriber-scaler", "scaledObject.Namespace": "default", "scaleTarget.Name": "autoscaling-subscriber", "Original Replicas Count": 0, "New Replicas Count": 1}
2020-09-14T12:41:25.993Z	INFO	controllers.ScaledObject	Reconciling ScaledObject	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:41:25.993Z	DEBUG	controllers.ScaledObject	Parsed Group, Version, Kind, Resource	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2020-09-14T12:41:26.017Z	DEBUG	controllers.ScaledObject	ScaledObject is defined correctly and is ready for scaling	{"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:41:26.023Z	DEBUG	controller	Successfully Reconciled	{"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "subscriber-scaler", "namespace": "default"}

The HPA is getting updated:

NAME                         REFERENCE                           TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-subscriber-scaler   Deployment/autoscaling-subscriber   3/3 (avg)   1         10        1          17m

But the replica count stays at 1, even when I let the lag build up for a while:

NAME                                      READY   STATUS    RESTARTS   AGE
autoscaling-subscriber-65c484f9c5-jrlzl   2/2     Running   0          6m

Note, I'm not seeing the k8s client errors like I did before.

Here are the Metrics API server logs:

I0914 12:08:17.300550       1 main.go:84] keda_metrics_adapter "msg"="KEDA Version: v2"
I0914 12:08:17.300630       1 main.go:85] keda_metrics_adapter "msg"="KEDA Commit: 717aab81817f0d9b6d02b0e6d1499367c751faf2"
I0914 12:08:17.300826       1 main.go:86] keda_metrics_adapter "msg"="Go Version: go1.15.1"
I0914 12:08:17.300844       1 main.go:87] keda_metrics_adapter "msg"="Go OS/Arch: linux/amd64"
I0914 12:08:18.951620       1 request.go:621] Throttling request took 1.048362821s, request: GET:https://10.0.0.1:443/apis/keda.sh/v1alpha1?timeout=32s
I0914 12:08:19.804392       1 provider.go:49] keda_metrics_adapter/provider "msg"="starting"
I0914 12:08:19.804438       1 main.go:115] keda_metrics_adapter "msg"="starting adapter..."
2020-09-14 12:08:19.804873 I | Starting metrics server at :9022
I0914 12:08:20.376400       1 serving.go:307] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0914 12:08:20.858983       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0914 12:08:20.859030       1 shared_informer.go:223] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0914 12:08:20.859032       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0914 12:08:20.859048       1 shared_informer.go:223] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0914 12:08:20.859253       1 dynamic_serving_content.go:130] Starting serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key
I0914 12:08:20.859743       1 secure_serving.go:178] Serving securely on [::]:6443
I0914 12:08:20.859787       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0914 12:08:20.959293       1 shared_informer.go:230] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0914 12:08:20.959381       1 shared_informer.go:230] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file


zroubalik commented Sep 14, 2020

@mchmarny thanks! Seems like the problem is definitely somewhere between HPA and Metrics Server. Would you mind posting Metrics Server Debug logs? https://github.com/kedacore/keda/blob/v2/BUILD.md#metrics-server-logging

UPDATE: we should be able to see this message, to confirm that the request from HPA was correctly delivered to Metrics Server: https://github.com/kedacore/keda/blob/v2/pkg/provider/provider.go#L61


mchmarny commented Sep 14, 2020

Yep, that line is in there:

I0914 13:30:48.230028       1 provider.go:61] keda_metrics_adapter/provider "msg"="Keda provider received request for external metrics"  "metric name"="kafka-metrics-autoscaling-subscriber" "metricSelector"="scaledObjectName=subscriber-scaler" "namespace"="default"


zroubalik commented Sep 14, 2020

@mchmarny Could you please try to get the metrics directly, to see what the returned value is? (We should improve logging in the Metrics Server anyway.)

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/kafka-metrics-autoscaling-subscriber?labelSelector=scaledObjectName%3Dsubscriber-scaler"

zroubalik (Member) commented:

That's ^ basically the request that the HPA is sending.

mchmarny (Contributor) commented:

{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/kafka-metrics-autoscaling-subscriber"
  },
  "items": [
    {
      "metricName": "kafka-metrics-autoscaling-subscriber",
      "metricLabels": null,
      "timestamp": "2020-09-14T14:17:51Z",
      "value": "3"
    }
  ]
}

mchmarny (Contributor) commented:

This is with the lag at around 400, so 3 replicas is what I would expect:

2020-09-14T14:18:42.541Z	DEBUG	scalehandler	Scaler for scaledObject is active	{"Metrics Name": "kafka-metrics-autoscaling-subscriber"}
2020-09-14T14:18:57.602Z	DEBUG	kafka_scaler	Group autoscaling-subscriber has a lag of 358 for topic metrics and partition 0

zroubalik (Member) commented:

@mchmarny I can see the problem: you have set up just 1 Kafka partition. KEDA will not scale out beyond the number of partitions, because having more consumers than partitions doesn't make sense. I bet that if you increase the number of partitions, the scaling will work as expected (see the sketch below).
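
A quick sketch of how the topic could be repartitioned, assuming the standard Kafka CLI and the topic and bootstrap address from the ScaledObject above (partition counts can only be increased, not decreased):

# Raise the partition count of the "metrics" topic so the consumer group can scale out
kafka-topics.sh --bootstrap-server kafka-cp-kafka.kafka.svc.cluster.local:9092 \
  --alter --topic metrics --partitions 10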

mchmarny (Contributor) commented:

Oh yes, the one-consumer-per-partition limit within a group... I was swapping things around to make this more reproducible and forgot to change it back after last time. Thanks.


stale bot commented Oct 13, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Oct 13, 2021.

stale bot commented Oct 20, 2021

This issue has been automatically closed due to inactivity.

The stale bot closed this as completed on Oct 20, 2021.
SpiritZhou pushed a commit to SpiritZhou/keda that referenced this issue Jul 18, 2023