Scaling up does not work correctly on Virtual Kubelet #906
Comments
Hm, we don't control what Virtual Kubelet does; we merely handle 0->1 scaling and the rest is up to the Kubernetes HPA. So if VK decides to kick in, that's because it made that decision. Can you share the logs from KEDA please? |
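For context on that split: KEDA creates and owns one HPA per ScaledObject, handles activation/deactivation between 0 and 1 replicas itself, and leaves everything above 1 to that HPA. A quick way to inspect the HPA that KEDA generated (the namespace below is an assumption; the keda-hpa-<scaledobject-name> naming matches what shows up later in this thread):
kubectl get hpa -n default
kubectl describe hpa keda-hpa-<scaledobject-name> -n default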
Hi, this is the last part of the log. I recreated the case described in the issue a few minutes ago:
Thanks |
Hello, I also added the HPA logs. If I scale on a physical node, the "Conditions" section isn't logged.
Thanks |
Hi, one more piece of info in case it's useful: scaling down to zero (see issue #524) works fine; only scaling up from zero doesn't work as expected |
I can reliably reproduce this issue using this demo. Example log showing significant queue lag:
{"level":"debug","ts":1598799261.2208235,"logger":"kafka_scaler","msg":"Group autoscaling has a lag of 147410 for topic messages and partition 0\n"}
{"level":"debug","ts":1598799261.2208657,"logger":"scalehandler","msg":"Scaler for scaledObject is active","ScaledObject.Namespace":"default","ScaledObject.Name":"queue-outoscaling-scaler","ScaledObject.ScaleType":"deployment","Scaler":{}}
{"level":"debug","ts":1598799261.2310224,"logger":"scalehandler","msg":"ScaledObject's Status was properly updated","ScaledObject.Namespace":"default","ScaledObject.Name":"queue-outoscaling-scaler","ScaledObject.ScaleType":"deployment"} And the HPA error: W0830 14:54:22.057920 1 reflector.go:289] pkg/mod/k8s.io/[email protected]+incompatible/tools/cache/reflector.go:94: watch of *unstructured.Unstructured ended with: too old resource version: 1894736 (1894804) |
Can we look into this for v2 @zroubalik @ahmelsayed? |
To me it looks more like a problem in the internal Go client libraries and not in KEDA. I don't have access to a cluster with Virtual Kubelet, so I can't reproduce :( |
Just to clarify, I'm able to reproduce this on a vanilla AKS cluster (v1.16.13 right now although I've seen this on v1.17 as well). I've put together a script to reproduce this issue here |
@zroubalik @tomkerkhove this seems to reproduce on non-Virtual-Kubelet clusters too, as @mchmarny mentioned. I think this needs a 1.x minor release with the updated Go client. |
In that case we should be good to go with v2 as @zroubalik mentioned! Stay tuned, beta coming soon. Would you mind testing this out then? |
After the update:
2020-08-31T19:29:35.146Z INFO controllers.ScaledObject Reconciling ScaledObject {"Request.Namespace": "default", "Request.Name": "queue-outoscaling-scaler"}
2020-08-31T19:29:35.205Z INFO controllers.ScaledObject Updated HPA according to ScaledObject {"Request.Namespace": "default", "Request.Name": "queue-outoscaling-scaler", "HPA.Namespace": "default", "HPA.Name": "keda-hpa-queue-outoscaling-scaler"}
2020-08-31T19:29:35.205Z INFO controllers.ScaledObject Initializing Scaling logic according to ScaledObject Specification {"Request.Namespace": "default", "Request.Name": "queue-outoscaling-scaler"}
2020-08-31T19:29:35.249Z INFO scaleexecutor Successfully set ScaleTarget replicas count to ScaledObject minReplicaCount {"scaledobject.Name": "queue-outoscaling-scaler", "scaledObject.Namespace": "default", "scaleTarget.Name": "queue-outoscaling-subscriber", "ScaleTarget.Replicas": 1}
BTW, can you add the topic lag back into the ScaledObject reconciling log? That entry was super helpful for debugging in v1.5. WRT signaling this to users: given this impacts everyone using Helm to deploy onto the default k8s version in most CSPs right now, I agree with @yaron2 that a 1.x update is needed ASAP. Additionally, it may be worth indicating which version of k8s the current KEDA release works with reliably, and pointing users to the deployment from head as a stopgap for anything after. |
@mchmarny thanks for the confirmation! I am glad it is fixed.
The log is still there; have you tried enabling the debug log level? (FYI, this was changed as well in v2: https://github.com/kedacore/keda#setting-log-levels) Updating the Go client libraries in v1 would be non-trivial; we encourage users to migrate to v2 as soon as it is out. As for the supported k8s version, users should be able to run KEDA on k8s 1.13+; this problem is probably not in Kubernetes itself but in the client libraries, IMHO :) |
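If it helps, enabling debug logging on the operator is usually just a matter of adding the controller-runtime zap flag to the operator container; a minimal sketch, assuming a default KEDA v2 install in the keda namespace with a deployment named keda-operator:
kubectl edit deployment keda-operator -n keda
# then, under the operator container's args, add:
#   - --zap-log-level=debug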
I was excited to see the KEDA v2.0.0-beta release, so I've retested this issue with the new version. I'm still seeing the same issue: KEDA sees the correct lag but does not scale the target. I've updated the demo so it's easier to reproduce. I'm seeing this on AKS Kubernetes version 1.16.13.
2020-09-14T02:05:52.300Z INFO controllers.ScaledObject Initializing Scaling logic according to ScaledObject Specification {"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler"}
2020-09-14T02:05:52.300Z DEBUG controllers.ScaledObject ScaledObject is defined correctly and is ready for scaling {"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler"}
2020-09-14T02:05:52.306Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "queue-depth-scaler", "namespace": "default"}
2020-09-14T02:05:52.306Z INFO controllers.ScaledObject Reconciling ScaledObject {"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler"}
2020-09-14T02:05:52.306Z DEBUG controllers.ScaledObject Parsed Group, Version, Kind, Resource {"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2020-09-14T02:05:52.345Z DEBUG controllers.ScaledObject ScaledObject is defined correctly and is ready for scaling {"ScaledObject.Namespace": "default", "ScaledObject.Name": "queue-depth-scaler"}
2020-09-14T02:05:52.347Z DEBUG kafka_scaler Group autoscaling-subscriber has a lag of 508 for topic metrics and partition 0
2020-09-14T02:05:52.347Z DEBUG scalehandler Scaler for scaledObject is active {"Metrics Name": "kafka-metrics-autoscaling-subscriber"}
2020-09-14T02:05:52.351Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "queue-depth-scaler", "namespace": "default"}
2020-09-14T02:05:52.364Z DEBUG scalehandler Watching with pollingInterval {"type": "ScaledObject", "namespace": "default", "name": "queue-depth-scaler", "PollingInterval": "10s"}
2020-09-14T02:06:02.413Z DEBUG kafka_scaler Group autoscaling-subscriber has a lag of 507 for topic metrics and partition 0 |
Can you have a look please, @ahmelsayed? @mchmarny, can you share some info on what you are seeing? Does it stay at 0 instances, or what happens? |
@mchmarny it would be great if you could take a look at the KEDA Metrics Server logs, to check whether there is a problem there |
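A quick way to grab those logs; the deployment and namespace names below assume a default KEDA v2 install, so adjust them if your setup differs:
kubectl logs -n keda deployment/keda-metrics-apiserver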
The configured HPA doesn't seem to take effect. I'm seeing KEDA doing its 0->1 and 1->0 scaling on the target just fine. As the lag increases, however, the operator logs the correct queue lag, but the configured HPA does not scale the target beyond 1, even when its target is 3. This is all on AKS with Kubernetes version v1.16.13. Here is the scaled object I'm using to replicate this issue:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: subscriber-scaler
spec:
  scaleTargetRef:
    name: autoscaling-subscriber
  pollingInterval: 15
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 30
  triggers:
  - type: kafka
    metadata:
      topic: metrics
      bootstrapServers: kafka-cp-kafka.kafka.svc.cluster.local:9092
      consumerGroup: autoscaling-subscriber
      lagThreshold: "3"
Controller logs after I apply it:
2020-09-14T12:38:06.812Z INFO controllers.ScaledObject Reconciling ScaledObject {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.812Z INFO controllers.ScaledObject Adding Finalizer for the ScaledObject {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.837Z DEBUG controllers.ScaledObject Adding scaledObjectName label on ScaledObject {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "value": "subscriber-scaler"}
2020-09-14T12:38:06.843Z DEBUG controllers.ScaledObject Parsed Group, Version, Kind, Resource {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2020-09-14T12:38:06.855Z INFO controllers.ScaledObject Detected resource targeted for scaling {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "resource": "apps/v1.Deployment", "name": "autoscaling-subscriber"}
2020-09-14T12:38:06.855Z INFO controllers.ScaledObject Creating a new HPA {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "HPA.Namespace": "default", "HPA.Name": "keda-hpa-subscriber-scaler"}
2020-09-14T12:38:06.888Z DEBUG controllers.ScaledObject Notify scaleHandler of an update in scaledObject {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.888Z INFO controllers.ScaledObject Initializing Scaling logic according to ScaledObject Specification {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.888Z DEBUG controllers.ScaledObject ScaledObject is defined correctly and is ready for scaling {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.895Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "subscriber-scaler", "namespace": "default"}
2020-09-14T12:38:06.895Z INFO controllers.ScaledObject Reconciling ScaledObject {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.895Z DEBUG controllers.ScaledObject Parsed Group, Version, Kind, Resource {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2020-09-14T12:38:06.938Z DEBUG controllers.ScaledObject ScaledObject is defined correctly and is ready for scaling {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:38:06.939Z DEBUG kafka_scaler Group autoscaling-subscriber has a lag of 0 for topic metrics and partition 0
2020-09-14T12:38:06.944Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "subscriber-scaler", "namespace": "default"}
2020-09-14T12:38:06.948Z INFO scaleexecutor Successfully scaled ScaleTarget to 0 replicas {"scaledobject.Name": "subscriber-scaler", "scaledObject.Namespace": "default", "scaleTarget.Name": "autoscaling-subscriber"}
2020-09-14T12:38:06.955Z DEBUG scalehandler Watching with pollingInterval {"type": "ScaledObject", "namespace": "default", "name": "subscriber-scaler", "PollingInterval": "15s"}
And the created HPA:
Name: keda-hpa-subscriber-scaler
Namespace: default
Labels: app.kubernetes.io/managed-by=keda-operator
app.kubernetes.io/name=keda-hpa-subscriber-scaler
app.kubernetes.io/part-of=subscriber-scaler
app.kubernetes.io/version=v2
scaledObjectName=subscriber-scaler
Annotations: <none>
CreationTimestamp: Mon, 14 Sep 2020 05:38:06 -0700
Reference: Deployment/autoscaling-subscriber
Metrics: ( current / target )
"kafka-metrics-autoscaling-subscriber" (target average value): <unknown> / 3
Min replicas: 1
Max replicas: 10
Deployment pods: 0 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False ScalingDisabled scaling is disabled since the replica count of the target is zero
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events: <none>
And when the lag increases:
2020-09-14T12:41:22.586Z DEBUG kafka_scaler Group autoscaling-subscriber has a lag of 252 for topic metrics and partition 0
2020-09-14T12:41:22.586Z DEBUG scalehandler Scaler for scaledObject is active {"Metrics Name": "kafka-metrics-autoscaling-subscriber"}
2020-09-14T12:41:22.597Z INFO scaleexecutor Successfully updated ScaleTarget {"scaledobject.Name": "subscriber-scaler", "scaledObject.Namespace": "default", "scaleTarget.Name": "autoscaling-subscriber", "Original Replicas Count": 0, "New Replicas Count": 1}
2020-09-14T12:41:25.993Z INFO controllers.ScaledObject Reconciling ScaledObject {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:41:25.993Z DEBUG controllers.ScaledObject Parsed Group, Version, Kind, Resource {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler", "GVK": "apps/v1.Deployment", "Resource": "deployments"}
2020-09-14T12:41:26.017Z DEBUG controllers.ScaledObject ScaledObject is defined correctly and is ready for scaling {"ScaledObject.Namespace": "default", "ScaledObject.Name": "subscriber-scaler"}
2020-09-14T12:41:26.023Z DEBUG controller Successfully Reconciled {"reconcilerGroup": "keda.sh", "reconcilerKind": "ScaledObject", "controller": "scaledobject", "name": "subscriber-scaler", "namespace": "default"}
The HPA is getting updated:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
keda-hpa-subscriber-scaler Deployment/autoscaling-subscriber 3/3 (avg) 1 10 1 17m
But the replica count stays at 1 even when I let the lag build up for a while:
NAME READY STATUS RESTARTS AGE
autoscaling-subscriber-65c484f9c5-jrlzl 2/2 Running 0 6m
Note, I'm not seeing the k8s client errors like I did before. Here are the metrics API server logs:
I0914 12:08:17.300550 1 main.go:84] keda_metrics_adapter "msg"="KEDA Version: v2"
I0914 12:08:17.300630 1 main.go:85] keda_metrics_adapter "msg"="KEDA Commit: 717aab81817f0d9b6d02b0e6d1499367c751faf2"
I0914 12:08:17.300826 1 main.go:86] keda_metrics_adapter "msg"="Go Version: go1.15.1"
I0914 12:08:17.300844 1 main.go:87] keda_metrics_adapter "msg"="Go OS/Arch: linux/amd64"
I0914 12:08:18.951620 1 request.go:621] Throttling request took 1.048362821s, request: GET:https://10.0.0.1:443/apis/keda.sh/v1alpha1?timeout=32s
I0914 12:08:19.804392 1 provider.go:49] keda_metrics_adapter/provider "msg"="starting"
I0914 12:08:19.804438 1 main.go:115] keda_metrics_adapter "msg"="starting adapter..."
2020-09-14 12:08:19.804873 I | Starting metrics server at :9022
I0914 12:08:20.376400 1 serving.go:307] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0914 12:08:20.858983 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0914 12:08:20.859030 1 shared_informer.go:223] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0914 12:08:20.859032 1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0914 12:08:20.859048 1 shared_informer.go:223] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0914 12:08:20.859253 1 dynamic_serving_content.go:130] Starting serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key
I0914 12:08:20.859743 1 secure_serving.go:178] Serving securely on [::]:6443
I0914 12:08:20.859787 1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0914 12:08:20.959293 1 shared_informer.go:230] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0914 12:08:20.959381 1 shared_informer.go:230] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file |
@mchmarny thanks! It seems like the problem is definitely somewhere between the HPA and the Metrics Server. Would you mind posting Metrics Server debug logs? https://github.com/kedacore/keda/blob/v2/BUILD.md#metrics-server-logging
UPDATE: we should be able to see this message, to confirm that the request from the HPA was correctly delivered to the Metrics Server: https://github.com/kedacore/keda/blob/v2/pkg/provider/provider.go#L61 |
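For reference, raising the Metrics Server verbosity generally means bumping the klog --v argument on the metrics apiserver container; a hedged sketch, assuming the default deployment name and namespace (the linked BUILD.md is the authoritative reference):
kubectl edit deployment keda-metrics-apiserver -n keda
# in the container args, raise the verbosity, e.g.:
#   - --v=5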
Yep, that line is in there:
I0914 13:30:48.230028 1 provider.go:61] keda_metrics_adapter/provider "msg"="Keda provider received request for external metrics" "metric name"="kafka-metrics-autoscaling-subscriber" "metricSelector"="scaledObjectName=subscriber-scaler" "namespace"="default" |
@mchmarny Could you please try to get the metrics, to see what the returned value is? (We should improve logging in the Metrics Server anyway.)
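One way to issue that query by hand is a raw call against the External Metrics API, roughly along these lines (the path matches the selfLink in the response below, and the label selector mirrors the metricSelector logged by the provider):
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/kafka-metrics-autoscaling-subscriber?labelSelector=scaledObjectName%3Dsubscriber-scaler"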
|
That's ^ basically the request that HPA is sending. |
{
"kind": "ExternalMetricValueList",
"apiVersion": "external.metrics.k8s.io/v1beta1",
"metadata": {
"selfLink": "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/kafka-metrics-autoscaling-subscriber"
},
"items": [
{
"metricName": "kafka-metrics-autoscaling-subscriber",
"metricLabels": null,
"timestamp": "2020-09-14T14:17:51Z",
"value": "3"
}
]
} |
This is with the lag at around 358:
2020-09-14T14:18:42.541Z DEBUG scalehandler Scaler for scaledObject is active {"Metrics Name": "kafka-metrics-autoscaling-subscriber"}
2020-09-14T14:18:57.602Z DEBUG kafka_scaler Group autoscaling-subscriber has a lag of 358 for topic metrics and partition 0 |
@mchmarny I can see the problem: you have set up just 1 Kafka partition. KEDA will not scale out beyond the number of partitions, because it doesn't make sense (a consumer group allows at most one consumer per partition). I bet that if you increase the number of partitions, the scaling will work as expected. |
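For anyone hitting the same limit, a sketch of checking and raising the partition count with the stock Kafka CLI (the topic name and bootstrap address are taken from the ScaledObject above, so adjust them to your setup):
# check the current partition count
kafka-topics.sh --bootstrap-server kafka-cp-kafka.kafka.svc.cluster.local:9092 --describe --topic metrics
# raise it toward the desired maximum replica count
kafka-topics.sh --bootstrap-server kafka-cp-kafka.kafka.svc.cluster.local:9092 --alter --topic metrics --partitions 10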
Oh yeah, the one-consumer-per-partition limit within a group... I was swapping things around to make this more reproducible and forgot to change it back after the last time. Thanks |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed due to inactivity. |
A deployment with replicas = 0 scales up to 1 not only when the threshold is reached, but as soon as the metric moves from 0 to 1
(ONLY ON Virtual Kubelet; on physical nodes it works).
Expected Behavior
Scale up the deployment when the threshold is reached, not when the metric changes from 0 to 1
Actual Behavior
The deployment is scaled up as soon as the metric query returns 1
Steps to Reproduce the Problem
I attached my yaml file. In my scenario, I have a deployment with only one replica (deployment name pushengineweb) running on physical nodes.
When a custom Prometheus query (a simple count of connected clients) reaches the threshold, I want to create a new pod on the Virtual Kubelet (deployment name pushengineweb-scale).
The same configuration works fine when scaling onto physical nodes instead of the Virtual Kubelet.
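When reproducing, it helps to watch where the scaled pod actually lands and what the KEDA-generated HPA reports as the metric crosses the threshold; a minimal check, assuming the deployments named in the description:
kubectl get pods -o wide -w    # the NODE column shows whether the pod landed on the virtual kubelet node
kubectl describe hpa           # the Metrics and Conditions sections show when the HPA actually triggers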
Specifications
example.txt