This repository has been archived by the owner on Oct 18, 2024. It is now read-only.

Canary reports bouts of "The provided member is not known in the current generation" when consuming messages leading to incorrect latency numbers #161

Open
k-wall opened this issue Jan 20, 2022 · 8 comments


@k-wall
Contributor

k-wall commented Jan 20, 2022

We are using strimzi-canary 0.2.0. Occasionally we see extended periods during which the canary reports The provided member is not known in the current generation. During these periods the end-to-end message latency observed by the canary becomes extended. We also see spikes in the strimzi_canary_consumer_error_total metric that correlate with the appearance of the messages.

E0120 15:32:57.682808       1 consumer.go:92] Error received whilst consuming from topic: kafka: error while consuming __redhat_strimzi_canary/1: kafka server: The provided member is not known in the current generation.

Log from a canary instance that experienced several lengthy bouts of the problem.

canary.log

[Screenshot 2022-01-20 at 18 23 21]

[Screenshot 2022-01-20 at 18 32 15]

@ppatierno
Member

Can you enable Sarama logging and provide an updated canary log? Looking at Sarama, I see this error is raised when joining a consumer group, which seems to be happening frequently (a rebalance?). Sarama logging could give us more hints about the underlying problem.

https://github.com/Shopify/sarama/blob/main/consumer_group.go#L253
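For reference, Sarama's internal logging is enabled by assigning to the package-level `sarama.Logger` hook before the client is created. A minimal sketch (the `[sarama]` prefix is illustrative, not something the canary currently uses):

```go
package main

import (
	"log"
	"os"

	"github.com/Shopify/sarama"
)

func init() {
	// Route Sarama's internal logging (including JoinGroup/rebalance
	// activity) to stdout with a prefix, so it can be distinguished
	// from the canary's own glog output.
	sarama.Logger = log.New(os.Stdout, "[sarama] ", log.LstdFlags)
}
```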

@k-wall
Contributor Author

k-wall commented Jan 21, 2022

@ppatierno I don't have a reproducer for this issue at the moment; any idea how we might induce a state like this? I imagine the server-side logs might be informative too.

@k-wall
Contributor Author

k-wall commented Jan 21, 2022

I don't know if this helps (still looking), but I found this comment, IBM/sarama#1866, which suggests that tuning timeouts might help.
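The timeouts in question are the consumer-group settings on Sarama's `Config`. A sketch of where they live, with purely illustrative values (not recommendations):

```go
package main

import (
	"time"

	"github.com/Shopify/sarama"
)

// newCanaryConfig is a hypothetical helper showing the group-related
// timeouts that IBM/sarama#1866 suggests tuning.
func newCanaryConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	// Broker evicts a member that sends no heartbeat within this window.
	cfg.Consumer.Group.Session.Timeout = 30 * time.Second
	// How often the client heartbeats; should be well below the session timeout.
	cfg.Consumer.Group.Heartbeat.Interval = 3 * time.Second
	// How long each member has to rejoin during a rebalance.
	cfg.Consumer.Group.Rebalance.Timeout = 60 * time.Second
	return cfg
}
```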

The thread looks potentially interesting too: IBM/sarama#2118

@k-wall
Contributor Author

k-wall commented Jan 21, 2022

I don't claim to have a strong understanding of the problem yet, but from the logs we can see that consumption continues after encountering this condition. I am speculating about making the goroutine in NewConsumerService() that reads consumer errors actually cancel the consumer. My reasoning is that if we get into this state, it would be better to force the consumer to be recreated:

	for err := range consumerGroup.Errors() {
		glog.Errorf("Error received whilst consuming from topic: %v", err)
		recordsConsumerFailed.With(labels).Inc()
		// Cancel the consumer's context so it is recreated rather than
		// continuing in a broken state.
		if cs.cancel != nil {
			cs.cancel()
		}
	}

I want to withdraw this comment.

@k-wall
Contributor Author

k-wall commented Jan 24, 2022

To move forward on this, I wonder about adding the ability to control logging (including Sarama logging) dynamically, so that logging can be enabled easily when the issue is seen.

@k-wall
Contributor Author

k-wall commented Feb 3, 2022

@ppatierno @tombentley asked elsewhere why the canary uses a consumer group at all. The canary's role is just to measure message latency, and it should use the simplest way to achieve that goal. Are there good reasons to use a consumer group for the canary?

@tombentley
Member

I can't think of any off the top of my head.

@ppatierno
Member

I used a consumer group because that's what I was used to with Java clients, but I don't see any specific reason why we couldn't switch to not using one. The only reason would be to scale the canary application to have more consumers, but that's really not our case for the purpose of the canary itself.
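For illustration, a group-less alternative would be Sarama's plain partition consumer, which never performs JoinGroup/SyncGroup and therefore cannot hit this error. A minimal sketch, assuming a single-broker address and partition 0 of the canary topic (both illustrative):

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	// Plain consumer: no group membership, no generation, no rebalances.
	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, sarama.NewConfig())
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	// Consume one partition from the newest offset; the canary would do
	// this for each partition of its canary topic.
	pc, err := consumer.ConsumePartition("__redhat_strimzi_canary", 0, sarama.OffsetNewest)
	if err != nil {
		log.Fatal(err)
	}
	defer pc.Close()

	for msg := range pc.Messages() {
		log.Printf("partition %d offset %d", msg.Partition, msg.Offset)
	}
}
```

The trade-off is that the canary would then need to track partitions itself (e.g. re-reading topic metadata on partition count changes) instead of relying on group rebalancing.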
