Kafka-Minion as alternative to Burrow for consumer lag monitoring #259
Kafka Minion tries to launch a partition consumer for each partition of the consumer offsets topic, so it first has to get the topic's partition count. I can imagine two reasons why it has failed:
Is one of these two conditions true? If not, I'll try to investigate further.
Yes, that could have been the case. It was a new cluster. I'll check more closely next time.
Did you have a chance to give it a spin? I just released v0.1.2 with some more features :-)
in comparison with ./prometheus
I added some comments, but I'll be working on the Kafka Minion helm chart during the next few days and I'll submit more comments about the K8s manifests.
failureThreshold: 1
httpGet:
  port: http
  path: /metrics
Kafka Minion 1.1.2 introduces a dedicated readiness check that returns 200 once Kafka Minion has initially consumed the __consumer_offsets topic, which is the point at which it starts exposing metrics. This feature is required to run Kafka Minion in high availability mode with multiple replicas, which is recommended if you intend to set up alerting on these metrics.
Since the initial consumption can take some time, the probe needs loose timeouts:
readinessProbe:
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 60 # 60 * 10s equals 10min, should be adapted depending on the given resources and size of consumer offsets topic
  httpGet:
    path: /readycheck
    port: http
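The readiness behaviour described in this comment can be sketched roughly as follows. This is a minimal Python illustration, not Kafka Minion's actual implementation: the `initial_consume_done` event, the `readiness_status` helper, and the 503 status for the not-yet-ready case are assumptions made for the example. Only the `/readycheck` path and the "200 once the initial consumption is done" behaviour come from the comment.

```python
import threading
from http.server import BaseHTTPRequestHandler

# Hypothetical flag: set once the initial pass over __consumer_offsets is done.
initial_consume_done = threading.Event()


def readiness_status(ready: bool) -> int:
    """Return the HTTP status a readiness endpoint would answer with.

    200 once ready; answering 503 before that is an assumption of this sketch.
    """
    return 200 if ready else 503


class ReadyCheckHandler(BaseHTTPRequestHandler):
    """Serves /readycheck; any other path gets 404."""

    def do_GET(self):
        if self.path == "/readycheck":
            status = readiness_status(initial_consume_done.is_set())
        else:
            status = 404
        self.send_response(status)
        self.end_headers()
```

The consuming code would call `initial_consume_done.set()` after catching up, at which point the probe starts succeeding and the replica can receive scrape traffic.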
5fc33f4 switches to this endpoint but keeps everything else default
failureThreshold: 3
httpGet:
  port: http
  path: /metrics
We have a separate endpoint which checks whether Kafka Minion is still connected to at least one Kafka broker:
livenessProbe:
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  httpGet:
    path: /healthcheck
    port: http
5fc33f4 switches to this endpoint but keeps everything else default
## Consumer lag monitoring

See [Burrow](../linkedin-burrow)
or [Kafka Minion](../consumers-prometheus/)
Maybe add some comments on what one may prefer depending on the use case / environment?
- Many Kafka clusters to monitor with just one exporter? => Burrow
- Only interested in a consumer health check? => Burrow
- Want metrics in Prometheus? => Kafka Minion
- Looking for HA support? => Kafka Minion
- Using versioning in group ids (e.g. consumer group name "email-sender-5" where 5 indicates the version)? => Kafka Minion
In fact they can supplement each other and it may be a valid desire to operate both of them.
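To make the versioned group id point concrete, here is a small Python sketch of how a name like "email-sender-5" could be split into a base name and a version. It is only an illustration of the naming convention from the list above, not how Kafka Minion handles it internally; `split_group_version` and `latest_versions` are hypothetical helpers.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Assumed convention: "<base name>-<integer version>", e.g. "email-sender-5".
_VERSIONED_GROUP = re.compile(r"^(?P<base>.+)-(?P<version>\d+)$")


def split_group_version(group_id: str) -> Tuple[str, Optional[int]]:
    """Split a group id into (base name, version); version is None if absent."""
    match = _VERSIONED_GROUP.match(group_id)
    if match is None:
        return group_id, None
    return match.group("base"), int(match.group("version"))


def latest_versions(group_ids: Iterable[str]) -> Dict[str, int]:
    """Map each versioned base name to the highest version seen."""
    latest: Dict[str, int] = {}
    for group_id in group_ids:
        base, version = split_group_version(group_id)
        if version is None:
            continue  # unversioned groups are ignored in this sketch
        if version > latest.get(base, -1):
            latest[base] = version
    return latest
```

With such a split, lag alerts could be restricted to the newest version of each group, so that an abandoned "email-sender-4" does not keep firing after "email-sender-5" has taken over.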
There's lots and lots of research to be done for anyone who wants to set up a Kafka stack and I see this repository as a collection of examples rather than a way to discuss the choices.
I've validated with a dev Prometheus stack now. Adding the Grafana dashboard will be a separate PR because I didn't want to deal with the jsonnet stuff in https://github.com/coreos/kube-prometheus when Kustomize can produce config maps. @weeco The current
because the v0.1.2 build had an env saying 0.1.1, later fixed in 5a9b9f3
I am aware, that's my fault. It'll be fixed with v0.1.3.
See #255 (comment)
@weeco FYI
Only tested in Minikube so far, with three replicas. With two Kafka replicas I got
{"error":"kafka server: Replication-factor is invalid.","level":"panic","msg":"failed to get partition count","time":"2019-04-04T03:23:38Z","topic":"__consumer_offsets"}
which might have been a config error, e.g. the default replication factor not being updated.
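Since the panic is logged as structured JSON, the relevant fields can be pulled out programmatically when triaging. A minimal Python sketch using the log line quoted above; the `is_fatal` helper and the choice to treat a "fatal" level like "panic" are assumptions of this example.

```python
import json

# The log line quoted above, copied verbatim.
LOG_LINE = (
    '{"error":"kafka server: Replication-factor is invalid.",'
    '"level":"panic","msg":"failed to get partition count",'
    '"time":"2019-04-04T03:23:38Z","topic":"__consumer_offsets"}'
)


def is_fatal(entry: dict) -> bool:
    """True for levels that stop the process ("fatal" is an assumed level)."""
    return entry.get("level") in ("panic", "fatal")


entry = json.loads(LOG_LINE)
if is_fatal(entry):
    # Print a one-line summary: topic, message, and underlying error.
    print(f'{entry["topic"]}: {entry["msg"]} ({entry["error"]})')
```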