
Keda operator facing issue in finding offset block #4215

Closed
sinanub18 opened this issue Feb 8, 2023 · 31 comments
Labels
bug Something isn't working

Comments

@sinanub18

Getting an error from the KEDA operator: "error finding offset block for topic XXX-XXX-XXX and partition 1"

{"level":"error","ts":"2023-02-08T16:47:13Z","logger":"kafka_scaler","msg":"","type":"ScaledObject","namespace":"XXX","name":"XXX-XXX-XXX","error":"error finding offset block for topic XXX.XXX-XXX and partition 1","stacktrace":"github.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).getLagForPartition\n\t/workspace/pkg/scalers/kafka_scaler.go:448\ngithub.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).getTotalLag\n\t/workspace/pkg/scalers/kafka_scaler.go:597\ngithub.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).GetMetricsAndActivity\n\t/workspace/pkg/scalers/kafka_scaler.go:568\ngithub.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsForScaler\n\t/workspace/pkg/scaling/cache/scalers_cache.go:77\ngithub.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).GetScaledObjectMetrics\n\t/workspace/pkg/scaling/scale_handler.go:439\ngithub.com/kedacore/keda/v2/pkg/metricsservice.(*GrpcServer).GetMetrics\n\t/workspace/pkg/metricsservice/server.go:45\ngithub.com/kedacore/keda/v2/pkg/metricsservice/api._MetricsService_GetMetrics_Handler\n\t/workspace/pkg/metricsservice/api/metrics_grpc.pb.go:79\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/workspace/vendor/google.golang.org/grpc/server.go:1340\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/workspace/vendor/google.golang.org/grpc/server.go:1713\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/workspace/vendor/google.golang.org/grpc/server.go:965"}

Expected Behavior

The KEDA operator should find the offset block without errors.

Actual Behavior

The operator throws an "error finding offset block for topic" error for the affected topics and partitions.

Steps to Reproduce the Problem

  1. Deploy KEDA 2.9.0
  2. Check logs of keda operator

Logs from KEDA operator

{"level":"error","ts":"2023-02-08T16:47:13Z","logger":"kafka_scaler","msg":"","type":"ScaledObject","namespace":"YYY","name":"XXX-XXX,"error":"error finding offset block for topic XX.XXX-XXX and partition 2","stacktrace":"github.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).getLagForPartition\n\t/workspace/pkg/scalers/kafka_scaler.go:448\ngithub.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).getTotalLag\n\t/workspace/pkg/scalers/kafka_scaler.go:597\ngithub.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).GetMetricsAndActivity\n\t/workspace/pkg/scalers/kafka_scaler.go:568\ngithub.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsForScaler\n\t/workspace/pkg/scaling/cache/scalers_cache.go:77\ngithub.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).GetScaledObjectMetrics\n\t/workspace/pkg/scaling/scale_handler.go:439\ngithub.com/kedacore/keda/v2/pkg/metricsservice.(*GrpcServer).GetMetrics\n\t/workspace/pkg/metricsservice/server.go:45\ngithub.com/kedacore/keda/v2/pkg/metricsservice/api._MetricsService_GetMetrics_Handler\n\t/workspace/pkg/metricsservice/api/metrics_grpc.pb.go:79\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/workspace/vendor/google.golang.org/grpc/server.go:1340\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/workspace/vendor/google.golang.org/grpc/server.go:1713\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\t/workspace/vendor/google.golang.org/grpc/server.go:965"}

KEDA Version

2.9.0

Kubernetes Version

1.25

Platform

Google Cloud

Scaler Details

Kafka Scaler

Anything else?

No response

@JorTurFer
Member

Any idea about this @zroubalik ?

@zroubalik
Member

Try to check this setting: https://keda.sh/docs/2.9/scalers/apache-kafka/#new-consumers-and-offset-reset-policy

@sinanub18
Author

sinanub18 commented Feb 9, 2023

@zroubalik The ScaledObjects already have offsetResetPolicy: latest set for the Kafka scalers.
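For context, a minimal sketch in Go of what that doc section describes (an interpretation for illustration, not KEDA source code): when the consumer group has no committed offset for a partition, offsetResetPolicy determines what lag the scaler assumes.

package main

import "fmt"

// assumedLag illustrates the assumed semantics: with "earliest" the whole
// backlog counts as lag, with "latest" the lag is treated as zero.
func assumedLag(policy string, newestOffset, oldestOffset int64) int64 {
	if policy == "earliest" {
		return newestOffset - oldestOffset
	}
	return 0 // "latest", the policy already set on these ScaledObjects
}

func main() {
	fmt.Println(assumedLag("latest", 5000, 1000))   // 0
	fmt.Println(assumedLag("earliest", 5000, 1000)) // 4000
}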

@jorgenfries

jorgenfries commented Mar 9, 2023

Experiencing the same issue and log errors.
The workaround is restarting the keda-operator pod :-/
The status of the specific ScaledObjects is fine - no errors are logged there.

Chart version: helm.sh/chart=keda-2.9.4

@dttung2905
Contributor

@jorgenfries could you try with KEDA 2.10 if possible? That version has better error logging from PR #4233. It won't directly solve your problem, but it might at least provide more information for troubleshooting 🙏

@jorgenfries

@dttung2905
Updated this weekend. helm.sh/chart=keda-2.10.0
Will post the output when/if it happens again.

@johnnytardin

@jorgenfries did you notice the error still occurring after the update to 2.10.0?

@jorgenfries

@johnnytardin - Actually, I haven't seen it since. However, I only noticed it once with the earlier version.
I will be performing some intensive testing on our Kafka workloads in the coming month, so if it's still present I should encounter it again soon.

@johnnytardin

johnnytardin commented Mar 27, 2023

I performed the update to 2.10.0 last week and continue to get the errors.
Here are the log lines, but without further information:

"error finding offset block for topic [HIDDEN] and partition 13 from offset block: map[]"

"error finding offset block for topic [HIDDEN] and partition 0 from offset block: map[]"

"error finding offset block for topic [HIDDEN] 4 from offset block: map[]"

"error finding offset block for topic [HIDDEN] 1 from offset block: map[]"

@jorgenfries

I experienced it again with 2.10.0 as well - same log entries as @johnnytardin.
Sadly, I recreated the pod without getting the logs out first -.-
However, a restart of the keda-operator pod fixed it again, and scaling started working correctly again.

@dttung2905
Contributor

"error finding offset block for topic [HIDDEN] and partition 13 from offset block: map[]"

It's quite weird, actually. From your log (the extra detail was added in 2.10), the offset block map is empty, which can eventually be traced back to this method call:

func (s *kafkaScaler) getConsumerOffsets(topicPartitions map[string][]int32) (*sarama.OffsetFetchResponse, error) {
	offsets, err := s.admin.ListConsumerGroupOffsets(s.metadata.group, topicPartitions)
	if err != nil {
		return nil, fmt.Errorf("error listing consumer group offsets: %w", err)
	}
	if offsets.Err > 0 {
		errMsg := fmt.Errorf("error listing consumer group offsets: %w", offsets.Err)
		s.logger.Error(errMsg, "")
	}
	return offsets, nil
}

which uses ListConsumerGroupOffsets() from the sarama library. It might have something to do with sarama itself, I don't know. I'm not too sure, as @jorgenfries said the error went away after pod recreation 🤔 (which makes it even harder to reproduce consistently). Do you have any other ideas @JorTurFer @zroubalik ?
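To make the failure mode concrete, here is a small self-contained sketch (not KEDA's actual code; the types are simplified stand-ins for sarama's OffsetFetchResponse.Blocks) of how a lookup into an empty offset-block map produces exactly the error quoted above:

package main

import "fmt"

type offsetBlock struct {
	Offset int64
}

// findOffsetBlock mimics the nested lookup the scaler performs on the
// OffsetFetch response: Blocks[topic][partition]. Indexing a nil map is legal
// in Go and returns the zero value, so an empty response yields a nil block.
func findOffsetBlock(blocks map[string]map[int32]*offsetBlock, topic string, partition int32) (*offsetBlock, error) {
	block := blocks[topic][partition]
	if block == nil {
		return nil, fmt.Errorf("error finding offset block for topic %s and partition %d from offset block: %v", topic, partition, blocks)
	}
	return block, nil
}

func main() {
	// Simulate the symptom from the logs: the broker answered the OffsetFetch
	// request, but with no blocks for the requested topic.
	empty := map[string]map[int32]*offsetBlock{}
	if _, err := findOffsetBlock(empty, "HIDDEN", 13); err != nil {
		fmt.Println(err) // ... from offset block: map[]
	}
}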

@sansmoraxz
Contributor

Not sure if it's entirely related, but I was facing something similar: #4466
Except that it wasn't logging any errors.

Restarting the keda-operator at least seemed to fix it (temporarily).

@oshmoun

oshmoun commented Apr 28, 2023

Just had the issue here as well; here is the stacktrace from the logs:

2023-04-28T09:31:42Z	ERROR	kafka_scaler		{"type": "ScaledObject", "namespace": "HIDDEN", "name": "HIDDEN", "error": "error finding offset block for topic HIDDEN and partition 4"}
github.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).getLagForPartition
	/workspace/pkg/scalers/kafka_scaler.go:448
github.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).getTotalLag
	/workspace/pkg/scalers/kafka_scaler.go:597
github.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/kafka_scaler.go:568

I'm on version 2.9.3

I did not need to restart KEDA to fix the issue, but simply disabled idle scaling by removing idleReplicaCount from the configured ScaledObjects.

@JorTurFer
Member

I did not need to restart KEDA to fix the issue, but simply disabled idle scaling by removing idleReplicaCount from the configured ScaledObjects.

idleReplicaCount doesn't have an impact at the scaler level; it's handled by KEDA itself, not by the Kafka trigger. I think your problem went away because modifying the ScaledObject triggered a rebuild of the scaler.
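To illustrate the point (an illustrative sketch of the caching idea only, not KEDA's actual implementation): if the scaler client is cached and only rebuilt when the ScaledObject spec changes, a client stuck in a bad state keeps failing until an edit, or an operator restart, forces a rebuild.

package main

import "fmt"

type kafkaClient struct {
	healthy bool
}

// scalerCache keeps one client per ScaledObject generation and only rebuilds
// it when the generation (i.e. the spec) changes.
type scalerCache struct {
	generation int64
	client     *kafkaClient
}

func (c *scalerCache) get(generation int64) *kafkaClient {
	if c.client == nil || generation != c.generation {
		c.client = &kafkaClient{healthy: true} // rebuild on spec change
		c.generation = generation
	}
	return c.client
}

func main() {
	cache := &scalerCache{}
	cache.get(1).healthy = false      // the cached client ends up in a bad state
	fmt.Println(cache.get(1).healthy) // false: same spec, same broken client
	fmt.Println(cache.get(2).healthy) // true: editing the ScaledObject forces a rebuild
}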

@JorTurFer
Member

Could it be related to some internal change in the sarama client? AFAIR, there isn't any change in the kafka scaler related to that code.
@zroubalik ?

@oshmoun

oshmoun commented May 3, 2023

I did not need to restart KEDA to fix the issue, but simply disabled idle scaling by removing idleReplicaCount from the configured ScaledObjects.

idleReplicaCount doesn't have an impact at the scaler level; it's handled by KEDA itself, not by the Kafka trigger. I think your problem went away because modifying the ScaledObject triggered a rebuild of the scaler.

Of course, you are correct. The issue just reoccurred today with idle scaling deactivated.

@sinanub18
Author

Could it be related with some internal change in sarama client? AFAIR, there isn't any change in kafka scaler related with that code? @zroubalik ?

Hi @JorTurFer, it's not about recent changes, as I have been facing this issue since 2.6.0.

@oshmoun

oshmoun commented Jun 12, 2023

I would also like to mention that I have tested fallback scaling, hoping that KEDA would at least respect it in the case of this scaling failure, but it does not. Fallback scaling is rendered broken when this issue occurs.

@JorTurFer
Member

Do you have a way to reproduce this error?

@oshmoun

oshmoun commented Jun 13, 2023

Do you have a way to reproduce this error?

Unfortunately not, as it happens randomly.
Though if you are referring to the fallback scaling failure, I guess one can simply introduce an error in the code where the current error is reported and check from there (see the sketch after the stacktrace below):

github.com/kedacore/keda/v2/pkg/scalers.(*kafkaScaler).getLagForPartition
	/workspace/pkg/scalers/kafka_scaler.go:448
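A rough sketch of that idea (the lag function and the fallback helper below are simplified stand-ins, not KEDA's actual implementation): force the lag calculation to return the same error and check whether the configured fallback replica count is honoured.

package main

import (
	"errors"
	"fmt"
)

// replicasWithFallback mirrors the intent of spec.fallback: when the metric
// source fails, return a fixed replica count instead of propagating the error.
func replicasWithFallback(getLag func() (int64, error), fallbackReplicas, lagThreshold int64) int64 {
	lag, err := getLag()
	if err != nil {
		return fallbackReplicas
	}
	return lag / lagThreshold
}

func main() {
	failing := func() (int64, error) {
		return 0, errors.New("error finding offset block for topic HIDDEN and partition 4")
	}
	// With the injected failure, the fallback value (here 3) should win.
	fmt.Println(replicasWithFallback(failing, 3, 100))
}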

@JorTurFer
Member

though if you are referring to the fallback scaling failure

No, no, I meant the root cause.

@oshmoun

oshmoun commented Jul 31, 2023

Sorry for the lack of updates on this matter. I have updated KEDA to 2.11.1, and since then the issue has not occurred. Hopefully it's not just a fluke, since the issue is random in nature.
But so far so good 🤞

@stale

stale bot commented Sep 29, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale (All issues that are marked as stale due to inactivity) label Sep 29, 2023
@leoeareis

Any update related to that in the new version, @oshmoun?

@oshmoun

oshmoun commented Oct 3, 2023

Any update related to that in the new version, @oshmoun?

Sorry for the lack of updates. Ever since updating, the issue has not occurred, and KEDA scaling has been working smoothly.
I guess in this scenario the saying "no news is good news" applies perfectly 🙂

@JorTurFer
Member

As it looks to be solved, I'm going to close the issue. If the problem happens again, comment here and I can reopen it.
Just to announce it, we have added another "flavour" of Kafka scaler as an experimental scaler: https://keda.sh/docs/2.12/scalers/apache-kafka-go/
It's experimental because it doesn't have all the functionality yet, but it uses kafka-go instead of the sarama client. It's covered by e2e test cases, and even though there could be a missing test, the majority of them are the same (I mean, it's not just random code placed there). You can give it a try too if the issue persists.

stale bot removed the stale (All issues that are marked as stale due to inactivity) label Oct 3, 2023
@rxg8255
Contributor

rxg8255 commented Oct 17, 2023

Facing the same issue with K8s version 1.25 and KEDA 2.9.3.
Tried restarting the KEDA operator, but no luck.
Could anyone please let me know the fix?

@sansmoraxz
Contributor

@rxg8255 can you update your KEDA operator to a newer version, viz. 2.12.0, and check if it still gives the same issue?

Alternatively, have a look at Apache Kafka - (Experimental) and, if your use case is supported, check whether the issue persists.

@rxg8255
Contributor

rxg8255 commented Oct 17, 2023

Thanks for the details, @sansmoraxz. Is there any workaround for the issue apart from upgrading to 2.12.0? The upgrade will take quite some time, as we cannot upgrade immediately.

@sansmoraxz
Contributor

Maybe try resetting your consumer group and/or topic. Not sure if it would help.

I think there was a bug in the underlying dependency that was solved in 2.11, as @oshmoun stated above. #4215 (comment)

@rxg8255
Contributor

rxg8255 commented Oct 17, 2023

@tomkerkhove @zroubalik
@sansmoraxz
Currently we are on AKS version 1.25, and the compatibility matrix says K8s versions v1.26 - v1.28 are compatible with the latest KEDA version, i.e. 2.12.

Sure, we will upgrade KEDA to 2.11 to test the scenario and will keep you posted.
