Keda operator fails with "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)" in v.2.15.1 using Azure event Hub trigger #6084

Open
chamindac opened this issue Aug 16, 2024 · 18 comments · May be fixed by kedacore/charts#714


@chamindac

chamindac commented Aug 16, 2024

I have KEDA v2.15.1 deployed on AKS using workload identity. The AKS Kubernetes version is 1.29.7.
My ScaledJob triggers based on Azure Event Hub. The KEDA operator reports the error "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)"

The setup was working fine with KEDA v2.14.2 on AKS using workload identity. The AKS Kubernetes version is 1.29.7.

The ScaledJob shows the issues below

Status:
  Conditions:
    Message:  Some triggers defined in ScaledJob are not working correctly
    Reason:   PartialTriggerError
    Status:   Unknown
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Status:   Unknown
    Type:     Fallback
    Status:   Unknown
    Type:     Paused
Events:
  Type     Reason              Age                     From           Message
  ----     ------              ----                    ----           -------
  Normal   KEDAScalersStarted  7m16s (x4 over 7m16s)   scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  7m16s                   scale-handler  Started scalers watch
  Normal   ScaledJobReady      7m16s                   keda-operator  ScaledJob is ready for scaling
  Warning  KEDAScalerFailed    7m16s (x2 over 7m16s)   scale-handler  unable to get runtimeInfo for metrics: context canceled
  Warning  KEDAScalerFailed    2m16s (x61 over 7m14s)  scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)

The KEDA operator pod log shows the following

2024-08-16T12:57:17Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "mydemo-scaledjob", "scaledJob.Namespace": "avalanche", "Number of running Jobs": 0}
2024-08-16T12:57:17Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "mydemo-scaledjob", "scaledJob.Namespace": "avalanche", "Number of pending Jobs": 0}
2024-08-16T12:57:22Z    ERROR   scale_handler   Error getting scaler metrics and activity, but continue {"scaledJob.Name": "mydemo-scaledjob", "Scaler": "*scalers.azureEventHubScaler:", "error": "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:853
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182

If I deploy KEDA v2.14.2 or v2.14.3 on top of v2.15.1 without changing anything else in my setup, everything starts to work fine, and the status of my ScaledJob returns to normal, as the output below shows.

Status:
  Conditions:
    Message:  ScaledJob is defined correctly and is ready to scaling
    Reason:   ScaledJobReady
    Status:   True
    Type:     Ready
    Message:  Scaling is not performed because triggers are not active
    Reason:   ScalerNotActive
    Status:   False
    Type:     Active
    Status:   Unknown
    Type:     Fallback
    Status:   Unknown
    Type:     Paused
Events:
  Type     Reason              Age                 From           Message
  ----     ------              ----                ----           -------
  Normal   KEDAScalersStarted  20m (x4 over 20m)   scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  20m                 scale-handler  Started scalers watch
  Normal   ScaledJobReady      20m                 keda-operator  ScaledJob is ready for scaling
  Warning  KEDAScalerFailed    20m (x2 over 20m)   scale-handler  unable to get runtimeInfo for metrics: context canceled
  Warning  KEDAScalerFailed    19m (x18 over 20m)  scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)
  Normal   KEDAScalersStarted  16m (x2 over 16m)   scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  16m                 scale-handler  Started scalers watch
  Normal   ScaledJobReady      16m                 keda-operator  ScaledJob is ready for scaling
  Normal   KEDAJobsCreated     16m                 scale-handler  Created 1 jobs
  Normal   KEDAScalersStarted  14m (x2 over 14m)   scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  14m                 scale-handler  Started scalers watch
  Normal   KEDAJobsCreated     12m (x22 over 14m)  scale-handler  Created 0 jobs

Below is more information on my setup.

I deployed KEDA using the commands below

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm upgrade keda kedacore/keda --install `
  --namespace keda `
  --version 2.15.1 `
  --set serviceAccount.operator.create=true `
  --set serviceAccount.operator.name=keda-operator `
  --set podIdentity.azureWorkload.enabled=true `
  --set podIdentity.azureWorkload.clientId=$(sys_aks_uai_client_id) `
  --set podIdentity.azureWorkload.tenantId=$(tenantid)

The KEDA trigger authentication is set up as

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: av-keda-trigger-auth
  namespace: mynamespace
spec:
  podIdentity:
    provider: azure-workload

My scaled job triggers

triggers:
    - type: azure-eventhub
      metadata:
        consumerGroup: largevideogenerator
        unprocessedEventThreshold: "1"
        activationUnprocessedEventThreshold: "0"
        blobContainer: largevideogenerator-largevideogenerationrequired
        eventHubNamespace: myeventhubnamespace
        eventHubName: largevideogenerationrequired
        storageAccountName: mystoragename
        checkpointStrategy: blobMetadata
      authenticationRef:
        name: av-keda-trigger-auth
    - type: azure-eventhub
      metadata:
        consumerGroup: largevideogenerator
        unprocessedEventThreshold: "1"
        activationUnprocessedEventThreshold: "0"
        blobContainer: largevideogenerator-regeneratelargevideo
        eventHubNamespace: myeventhubnamespace
        eventHubName: regeneratelargevideo
        storageAccountName: mystoragename
        checkpointStrategy: blobMetadata
      authenticationRef:
        name: av-keda-trigger-auth

I can provide more information and logs if required.

In summary, this is what happens

  • In a fresh setup of KEDA v2.15.1 with everything else identical, scaling does not work and the KEDA operator pod logs "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: %!w(<nil>)".
  • In the failing KEDA v2.15.1 setup, if I deploy KEDA v2.14.2 or v2.14.3, the triggers run and jobs are created from the ScaledJob as expected.
  • In a fresh setup of KEDA v2.14.2 with everything else identical, everything works fine.
@chamindac
Author

chamindac commented Aug 19, 2024

Is this due to AKS Kubernetes version compatibility with the KEDA version? From the documentation here, it seems the KEDA add-on uses KEDA 2.14 on AKS Kubernetes 1.30, and KEDA 2.15 is to be used with AKS Kubernetes 1.31.

So, when we deploy KEDA to AKS without using the AKS add-on for KEDA, should we use the same KEDA version that the add-on uses for the given AKS Kubernetes version?

For now, as a workaround for my problem, I am going to stay with KEDA 2.14 until I upgrade my AKS cluster to Kubernetes 1.31, and then retry KEDA 2.15.

@JorTurFer
Member

JorTurFer commented Aug 19, 2024

Hello
I can't reproduce the issue. I've included a specific e2e test case to cover it, but it passes. This is the trigger configuration:

metadata:
  activationUnprocessedEventThreshold: '10'
  blobContainer: {{.CheckpointContainerName}}
  checkpointStrategy: blobMetadata
  consumerGroup: {{.ConsumerGroup}}
  unprocessedEventThreshold: '64'
  eventHubName: {{.EventHubName}}
  eventHubNamespace: {{.EventHubNamespaceName}}
  storageAccountName: {{.AccountName}}

Could you share the blob metadata?
image

@chamindac
Author

Hi, below is the checkpoint blob metadata
image

@JorTurFer
Member

I've found that the error is handled incorrectly, and that's why you see the message without any extra info.
I've created a PR fixing the error handling. Are you willing to try the fixed tag? It's ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152.
It's built from main, so it's almost v2.15.1. This is the change to improve the info:
image

I think that with this change we will see extra info about the error

@chamindac
Author

@JorTurFer thank you for the response. I have moved on to the managed KEDA add-on for AKS, so I am currently on AKS with Kubernetes 1.30.3 and KEDA 2.14.

However, I will try to create a test environment, test the fixed version of KEDA, and get back to you.

@chamindac
Author

chamindac commented Sep 2, 2024

@JorTurFer I tried deploying ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152 using keda-2.15.1.yaml (changing the KEDA operator image as shown below)

image: ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152 # ghcr.io/kedacore/keda:2.15.1 # chaminda
imagePullPolicy: Always

The KEDA operator goes into CrashLoopBackOff with the following in the keda-operator pod logs

2024/09/02 09:07:25 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
2024-09-02T09:07:25Z    INFO    setup   Starting manager
2024-09-02T09:07:25Z    INFO    setup   KEDA Version: pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152
2024-09-02T09:07:25Z    INFO    setup   Git Commit: 4776d09c8fd761814c1eb9ba7e964ceace651152
2024-09-02T09:07:25Z    INFO    setup   Go Version: go1.22.5
2024-09-02T09:07:25Z    INFO    setup   Go OS/Arch: linux/amd64
2024-09-02T09:07:25Z    INFO    setup   Running on Kubernetes 1.30      {"version": "v1.30.3"}
2024-09-02T09:07:26Z    INFO    controller-runtime.metrics      Starting metrics server
2024-09-02T09:07:26Z    INFO    controller-runtime.metrics      Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-09-02T09:07:26Z    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
I0902 09:07:26.032037       1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
I0902 09:07:41.268027       1 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource", "source": "kind source: *v1alpha1.ClusterCloudEventSource"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-09-02T09:07:41Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-09-02T09:07:41Z    INFO    Starting Controller     {"controller": "cert-rotator"}
2024-09-02T09:07:41Z    INFO    cert-rotation   starting cert rotator controller
2024-09-02T09:07:41Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:53
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:54
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:07:41Z    INFO    cert-rotation   no cert refresh needed
2024-09-02T09:07:41Z    INFO    cert-rotation   certs are ready in /certs
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "cert-rotator", "worker count": 1}
2024-09-02T09:07:41Z    INFO    cert-rotation   no cert refresh needed
2024-09-02T09:07:41Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-02T09:07:41Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-02T09:07:41Z    INFO    cert-rotation   no cert refresh needed
2024-09-02T09:07:41Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-02T09:07:41Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
2024-09-02T09:07:41Z    INFO    Starting workers        {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2024-09-02T09:07:42Z    INFO    cert-rotation   CA certs are injected to webhooks
2024-09-02T09:07:42Z    INFO    grpc_server     Starting Metrics Service gRPC Server    {"address": ":9666"}
2024-09-02T09:07:51Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:01Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:11Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:21Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:31Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:41Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:08:51Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:01Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:11Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:21Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:31Z    ERROR   controller-runtime.source.EventHandler  if kind is a CRD, it should be installed before calling Start   {"kind": "ClusterCloudEventSource.eventing.keda.sh", "error": "no matches for kind \"ClusterCloudEventSource\" in version \"eventing.keda.sh/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func2
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:87
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:88
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
        /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/kind.go:56
2024-09-02T09:09:41Z    ERROR   Could not wait for Cache to sync        {"controller": "clustercloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "ClusterCloudEventSource", "error": "failed to wait for clustercloudeventsource caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.ClusterCloudEventSource"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:203
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:208
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223
2024-09-02T09:09:41Z    INFO    Stopping and waiting for non leader election runnables
2024-09-02T09:09:41Z    INFO    Stopping and waiting for leader election runnables
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-09-02T09:09:41Z    INFO    Shutdown signal received, waiting for all workers to finish     {"controller": "cert-rotator"}
2024-09-02T09:09:41Z    INFO    cert-rotation   stopping cert rotator controller
W0902 09:09:41.269909       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.269969       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of apiregistration.k8s.io/v1, Kind=APIService ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270024       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "cert-rotator"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-09-02T09:09:41Z    INFO    All workers finished    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-09-02T09:09:41Z    INFO    Stopping and waiting for caches
W0902 09:09:41.270185       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ClusterTriggerAuthentication ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270224       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.CloudEventSource ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270262       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ScaledJob ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270297       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ScaledObject ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270339       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.TriggerAuthentication ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0902 09:09:41.270400       1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v2.HorizontalPodAutoscaler ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2024-09-02T09:09:41Z    INFO    Stopping and waiting for webhooks
2024-09-02T09:09:41Z    INFO    Stopping and waiting for HTTP servers
2024-09-02T09:09:41Z    INFO    shutting down server    {"kind": "health probe", "addr": "[::]:8081"}
2024-09-02T09:09:41Z    INFO    controller-runtime.metrics      Shutting down metrics server with timeout of 1 minute
2024-09-02T09:09:41Z    INFO    Wait completed, proceeding to shutdown the manager
2024-09-02T09:09:41Z    ERROR   setup   problem running manager {"error": "failed to wait for clustercloudeventsource caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.ClusterCloudEventSource"}
main.main
        /workspace/cmd/operator/main.go:329
runtime.main
        /usr/local/go/src/runtime/proc.go:271

@JorTurFer
Member

Oh, sorry, we introduced a new CRD (it'll ship with v2.16). This is the CRD that you also need to deploy into the cluster -> https://github.com/kedacore/keda/blob/main/config/crd/bases/eventing.keda.sh_clustercloudeventsources.yaml
It's for the CloudEvent integration, so it probably doesn't matter in your case xD

@chamindac
Author

@JorTurFer with the CRD deployed, the KEDA operator now seems to need some additional permissions.

"system:serviceaccount:keda:keda-operator" is the service account I am using to enable workload identity. With this CRD, does the workload identity require any additional permissions on Azure resources or on the AKS cluster?


2024-09-03T09:51:19Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-03T09:51:19Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-03T09:51:19Z    INFO    cert-rotation   no cert refresh needed
2024-09-03T09:51:19Z    INFO    cert-rotation   Ensuring CA cert        {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-09-03T09:51:19Z    INFO    cert-rotation   Ensuring CA cert        {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-09-03T09:51:20Z    INFO    cert-rotation   CA certs are injected to webhooks
2024-09-03T09:51:20Z    INFO    grpc_server     Starting Metrics Service gRPC Server    {"address": ":9666"}
W0903 09:51:20.604804       1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
E0903 09:51:20.604845       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch *v1alpha1.ClusterCloudEventSource: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
W0903 09:51:22.747786       1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
E0903 09:51:22.747830       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch *v1alpha1.ClusterCloudEventSource: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
W0903 09:51:26.531065       1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope
E0903 09:51:26.531111       1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch *v1alpha1.ClusterCloudEventSource: failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope

@JorTurFer
Member

Yes, I forgot it, sorry.
These permissions have to be added to KEDA's ClusterRole, as it needs to read the CRD:
image

@chamindac
Author

chamindac commented Sep 5, 2024

@JorTurFer With the changes you mentioned above, I managed to run the KEDA operator with your tag ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152 using keda-2.15.1.yaml.

The issue seems to be that, with 2.15.1, the keda-operator and the Event Hub trigger look for a non-existent checkpoint blob.

For example, here are the current checkpoint blobs for my two scaled jobs:

largepreview-scaledjob
There is no checkpoint/7 blob, but the ScaledJob trigger and the KEDA operator are looking for one.

image

As the ScaledJob events show, it is looking for checkpoint blob 7:

Events:
  Type     Reason              Age                From           Message
  ----     ------              ----               ----           -------
  Normal   KEDAScalersStarted  25m (x4 over 25m)  scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  25m                scale-handler  Started scalers watch
  Normal   ScaledJobReady      25m                keda-operator  ScaledJob is ready for scaling
  Warning  KEDAScalerFailed    25m (x2 over 25m)  scale-handler  unable to get runtimeInfo for metrics: context canceled
  Warning  KEDAScalerFailed    19m                scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largepreviewgenerator-largepreviewrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largepreviewrequired/largepreviewgenerator/checkpoint/7
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:56c49771-901e-0026-5c78-ff2f40000000
Time:2024-09-05T09:44:10.8104812Z</Message></Error>
--------------------------------------------------------------------------------
  Warning  KEDAScalerFailed  19m  scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largepreviewgenerator-largepreviewrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largepreviewrequired/largepreviewgenerator/checkpoint/7

largevideo-scaledjob
There is no checkpoint/0 blob, but the ScaledJob trigger and the KEDA operator are looking for one.

image

As the ScaledJob events show, it is looking for checkpoint blob 0:

Events:
  Type     Reason              Age                From           Message
  ----     ------              ----               ----           -------
  Normal   KEDAScalersStarted  34m (x4 over 34m)  scale-handler  Scaler azure-eventhub is built.
  Normal   KEDAScalersStarted  34m                scale-handler  Started scalers watch
  Normal   ScaledJobReady      34m                keda-operator  ScaledJob is ready for scaling
  Warning  KEDAScalerFailed    34m (x2 over 34m)  scale-handler  unable to get runtimeInfo for metrics: context canceled
  Warning  KEDAScalerFailed    27m                scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largevideogenerator-largevideogenerationrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largevideogenerationrequired/largevideogenerator/checkpoint/0
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:36a080bb-701e-0063-5178-fffaa3000000
Time:2024-09-05T09:44:47.6370308Z</Message></Error>
--------------------------------------------------------------------------------
  Warning  KEDAScalerFailed  26m  scale-handler  unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largevideogenerator-largevideogenerationrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largevideogenerationrequired/largevideogenerator/checkpoint/0
--------------------------------------------------------------------------------
RESPONSE 404: 404 The specified blob does not exist.
ERROR CODE: BlobNotFound

Both scaled jobs show the same symptoms only with 2.15.1: they fail while looking for a non-existent checkpoint blob. The KEDA operator (with tag ghcr.io/kedacore/keda-test:pr-6096-4776d09c8fd761814c1eb9ba7e964ceace651152) shows the logs below for the scaled jobs, again looking for non-existent blobs:

2024-09-05T09:50:22Z    ERROR   scale_handler   Error getting scaler metrics and activity, but continue {"scaledJob.Name": "largevideo-scaledjob", "Scaler": "*scalers.azureEventHubScaler:", "error": "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largevideogenerator-largevideogenerationrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largevideogenerationrequired/largevideogenerator/checkpoint/0\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 The specified blob does not exist.\nERROR CODE: BlobNotFound\n--------------------------------------------------------------------------------\n<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.\nRequestId:dfd7075d-101e-0007-1779-ff0b3b000000\nTime:2024-09-05T09:50:22.6388386Z</Message></Error>\n--------------------------------------------------------------------------------\n"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:853
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
2024-09-05T09:50:22Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "largevideo-scaledjob", "scaledJob.Namespace": "mynamespace", "Number of running Jobs": 0}
2024-09-05T09:50:22Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "largevideo-scaledjob", "scaledJob.Namespace": "mynamespace", "Number of pending Jobs": 0}
2024-09-05T09:50:25Z    ERROR   scale_handler   Error getting scaler metrics and activity, but continue {"scaledJob.Name": "largepreview-scaledjob", "Scaler": "*scalers.azureEventHubScaler:", "error": "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largepreviewgenerator-largepreviewrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largepreviewrequired/largepreviewgenerator/checkpoint/7\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 The specified blob does not exist.\nERROR CODE: BlobNotFound\n--------------------------------------------------------------------------------\n<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.\nRequestId:7f71945f-f01e-0052-6279-ff1bb0000000\nTime:2024-09-05T09:50:25.6750071Z</Message></Error>\n--------------------------------------------------------------------------------\n"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:853
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182
2024-09-05T09:50:25Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "largepreview-scaledjob", "scaledJob.Namespace": "mynamespace", "Number of running Jobs": 0}
2024-09-05T09:50:25Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "largepreview-scaledjob", "scaledJob.Namespace": "mynamespace", "Number of pending Jobs": 0}
2024-09-05T09:50:27Z    ERROR   scale_handler   Error getting scaler metrics and activity, but continue {"scaledJob.Name": "largevideo-scaledjob", "Scaler": "*scalers.azureEventHubScaler:", "error": "unable to get unprocessedEventCount for metrics: unable to get checkpoint from storage: GET https://myehnstoragename.blob.core.windows.net/largevideogenerator-largevideogenerationrequired/ch-eh-dev-euw-001-2-green.servicebus.windows.net/largevideogenerationrequired/largevideogenerator/checkpoint/0\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 The specified blob does not exist.\nERROR CODE: BlobNotFound\n--------------------------------------------------------------------------------\n<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.\nRequestId:ba2246b8-801e-0005-6179-ffb583000000\nTime:2024-09-05T09:50:27.6339979Z</Message></Error>\n--------------------------------------------------------------------------------\n"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledJobMetrics
        /workspace/pkg/scaling/scale_handler.go:853
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).isScaledJobActive
        /workspace/pkg/scaling/scale_handler.go:897
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).checkScalers
        /workspace/pkg/scaling/scale_handler.go:262
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).startScaleLoop
        /workspace/pkg/scaling/scale_handler.go:182

When I deploy KEDA with keda-2.14.1.yaml (or the 2.14.2 or 2.14.3 Helm chart) and configure everything else (triggers, scaled job settings) in exactly the same way, there are no such logs about looking for unavailable checkpoint blobs. My scaled jobs work as expected, without any issues, with the exact same setup on KEDA 2.14.x.

The issue above only happens with 2.15.1.

I suspect KEDA 2.15.1 is not refreshing the storage checkpoint blob list correctly before checking the checkpoint blob metadata, whereas KEDA 2.14.x does not have this problem.
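
One way to narrow this down is to compare the blobs that actually exist in the container with the blob names that appear in the 404 errors. The following is a minimal offline sketch, not KEDA code: the path layout is copied from the URLs in the logs above, and the meaning of each segment (namespace, event hub, consumer group), the function names, the partition count of 8 and the sample listing are all assumptions made for illustration.

```python
def expected_checkpoint_blobs(namespace, event_hub, consumer_group, partition_count):
    """Build one checkpoint blob name per partition.

    The layout mirrors the URLs in the 404 errors above. Which segment is
    the event hub and which is the consumer group is an assumption.
    """
    prefix = f"{namespace}/{event_hub}/{consumer_group}/checkpoint"
    return [f"{prefix}/{pid}" for pid in range(partition_count)]


def missing_checkpoints(existing_blobs, expected_blobs):
    """Return the expected blob names that are absent from the listing."""
    existing = set(existing_blobs)
    return [name for name in expected_blobs if name not in existing]


# Hypothetical data: 8 partitions, checkpoints exist for partitions 0-6 only.
NAMESPACE = "ch-eh-dev-euw-001-2-green.servicebus.windows.net"
expected = expected_checkpoint_blobs(
    NAMESPACE, "largepreviewrequired", "largepreviewgenerator", 8
)
existing = expected[:7]

for name in missing_checkpoints(existing, expected):
    print("no checkpoint blob:", name)
```

In practice the `existing` list would come from a real listing of the storage container. A partition that has never been checkpointed by a consumer would likely have no blob at all, so a missing blob on its own does not prove a misconfiguration.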

@JorTurFer
Member

We introduced a bug when we upgraded the SDK, but I think this PR will solve the issue -> #6096

Are you willing to test the fix? This is the tag with the fix -> ghcr.io/kedacore/keda-test:pr-6096-9b8be4868a27c304646cf8cb0735357eb272bd38

@chamindac
Author

@JorTurFer PR #6096 seems to have fixed the issue. I deployed ghcr.io/kedacore/keda-test:pr-6096-9b8be4868a27c304646cf8cb0735357eb272bd38 to my environment with keda-2.15.1.yaml, and for the last 24 hours the Event Hub scaler has worked as expected.

Will you be releasing a fixed version for 2.15.x, or is this issue going to be fixed only in 2.16.x? I just want to know whether we will have to skip 2.15.x (it is impossible to use with this issue) and wait for 2.16.x.

@JorTurFer
Member

I missed this message, sorry :/
Most probably we will just ship v2.16, WDYT @zroubalik @wozniakjan ?

@chamindac
Author

Thanks @JorTurFer, I will close this issue once 2.16 is available and tested.

@Nhattd97

Nhattd97 commented Nov 6, 2024

Hi team, how long will it take for this fix to be released? I've been using Azure ACA and have been experiencing this issue for a while.

@JorTurFer
Member

> how long will it take for this fix to be released?

#6260

> I've been using Azure ACA and have been experiencing this issue for a while.

We are not affiliated with the ACA team, so our release doesn't fix the issue in ACA; they have their own release lifecycle.

@JorTurFer
Member

@chamindac, version 2.16 was released 12 weeks ago. Can I close this issue?

@sishays

sishays commented Dec 4, 2024

Hi @JorTurFer, I get this same error when deploying KEDA via the Helm chart, version 2.16.0, with watchNamespace set:
failed to list *v1alpha1.ClusterCloudEventSource: clustercloudeventsources.eventing.keda.sh is forbidden: User "system:serviceaccount:keda:keda-operator" cannot list resource "clustercloudeventsources" in API group "eventing.keda.sh" at the cluster scope

Without watchNamespace, KEDA is deployed successfully.
I suspect that there's an RBAC issue.

I see that clustercloudeventsources is allowed via the keda-operator cluster role.

When watchNamespace is not in use, the chart binds the keda-operator service account to the above cluster role at the cluster level using a ClusterRoleBinding:

kubectl get clusterrolebinding | grep keda-operator
keda-operator  ClusterRole/keda-operator

In the above scenario keda works.

However, when using watchNamespace, the KEDA ClusterRoleBinding is replaced by a RoleBinding in each watched namespace (plus the keda namespace):

kubectl get rolebinding -A | grep -i clusterrole/keda-operator
keda          keda-operator                                       ClusterRole/keda-operator                             9m35s
test1         keda-operator                                       ClusterRole/keda-operator                             9m35s
test2         keda-operator                                       ClusterRole/keda-operator                             9m35s
test3         keda-operator                                       ClusterRole/keda-operator                             9m35s

As a result, the keda-operator service account is bound to the keda-operator cluster role only at the namespace level, through the above RoleBindings. That limits its access to clustercloudeventsources to the namespace scope, while the CRD itself is cluster-scoped.
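
The scoping rule described above can be illustrated with a toy model. This is not the Kubernetes authorizer, only a sketch of the documented behaviour: a RoleBinding grants the rules of the referenced ClusterRole inside its own namespace, while a request for a cluster-scoped resource carries no namespace and can only be matched by a ClusterRoleBinding. The `Binding` class and the `is_allowed` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Binding:
    """Simplified binding; namespace=None stands for a ClusterRoleBinding."""
    subject: str
    resources: frozenset
    verbs: frozenset
    namespace: Optional[str] = None


def is_allowed(bindings, subject, verb, resource, namespace=None):
    """Toy check; namespace=None means the request is at cluster scope."""
    for b in bindings:
        if b.subject != subject or verb not in b.verbs or resource not in b.resources:
            continue
        # A ClusterRoleBinding applies everywhere; a RoleBinding only
        # applies to requests made inside its own namespace.
        if b.namespace is None or b.namespace == namespace:
            return True
    return False


SA = "system:serviceaccount:keda:keda-operator"
RULES = dict(
    resources=frozenset({"clustercloudeventsources"}),
    verbs=frozenset({"get", "list", "watch"}),
)

# watchNamespace mode: only RoleBindings in the watched namespaces.
role_bindings = [Binding(SA, namespace=ns, **RULES) for ns in ("keda", "test1")]
# Default mode: a single ClusterRoleBinding.
cluster_binding = [Binding(SA, **RULES)]

print(is_allowed(role_bindings, SA, "list", "clustercloudeventsources"))    # False
print(is_allowed(cluster_binding, SA, "list", "clustercloudeventsources"))  # True
```

On a real cluster, `kubectl auth can-i list clustercloudeventsources.eventing.keda.sh --as=system:serviceaccount:keda:keda-operator` should report the effective answer for the service account.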

I have tried adding the relevant block of permissions to the minimal cluster role:

kubectl edit clusterrole keda-operator-minimal-cluster-role
- apiGroups:
  - eventing.keda.sh
  resources:
  - cloudeventsources
  - cloudeventsources/status
  - clustercloudeventsources
  - clustercloudeventsources/status
  verbs:
  - get
  - list
  - patch
  - update
  - watch

With this change, the keda-operator pod works and no longer fails to list the resource.

If my understanding is correct and this makes sense to you design-wise, I suggest adding the clustercloudeventsources resources that exist in the keda-operator cluster role template to the minimal cluster role template as well.
