
Metricbeat - Kafka consumer lag per partition stats #3608

Closed
sherry-ger opened this issue Feb 16, 2017 · 18 comments
Assignees
Labels
enhancement Metricbeat Metricbeat module Team:Integrations Label for the Integrations team

Comments

@sherry-ger

Feature:

For the Metricbeat Kafka module, it would be nice to:

  • Be able to collect consumer lag per partition or broker, along with consumer offsets.
  • Collect CPU and memory usage per Kafka broker and consumer instance.
  • Provide a sample dashboard containing these stats.

The issue should probably be a part of #3005

@Mrclark05

Regarding Metricbeat and the Kafka module: Kafka lag per topic/partition would be the most useful metric for monitoring processing.

@anthonyKiggundu

I was equally interested in having the consumer lag per partition_id/topic, and I worked to make it function in my repository --> https://github.com/antonking/GOLANG

@exekias
Contributor

exekias commented Oct 23, 2017

that's awesome @antonking, could you open a PR with your changes applied to metricbeat?

@anthonyKiggundu

@exekias how do you do that? I've no idea what PR means to be honest.

@exekias
Contributor

exekias commented Oct 26, 2017

Sorry @antonking PR == Pull Request, I can assist in the process if needed :)

@anthonyKiggundu

@exekias please do assist and if you need anything from me in the process, kindly let me know. Thx lots

@Mrclark05

@antonking @exekias I would like this to be a feature of the Kafka Metricbeat module; for us it's the most useful metric. Did you have any joy with the PR?

@anthonyKiggundu

@Mrclark05 please feel free to make the feature a PR, because I've never done that before. Thanks

@ruflin ruflin added the module label Feb 26, 2018
@jsoriano jsoriano self-assigned this Jul 17, 2018
@jsoriano
Member

A Kafka dashboard has been merged to master (#8457); it includes a visualization for consumer lag. This value is not yet available as a single metric, but it is calculated using offset values from the consumergroup and partition metricsets.

It'd be great if you could give it a try; we plan to release this first dashboard in 6.5.0.

@ruflin ruflin added the Team:Integrations Label for the Integrations team label Nov 21, 2018
@alvarolobato

Let us know if the dashboards are not enough.

@tkinkead

tkinkead commented Feb 1, 2019

I've been working with the new Kafka dashboard, and if I'm understanding it correctly, it's not quite right.

The current lag visualization takes the highest offset value from the partition stats and the highest offset value from the consumer group stats. The lag value is generated by taking the difference between the two (or returning zero if the consumergroup offset is higher). We can then group the results by another field, for example, the Kafka topic name. There are still a few issues with this approach, though.

  • If we have multiple clients in a single consumer group, the lag value doesn't represent the true lag of the consumer group. The Max aggregations mean that we may be comparing the consumergroup.offset value from one client against the partition.offset.newest value of a partition which that client is not connecting to.

  • If we have more than one Kafka consumer group accessing the same topic, then there's no meaningful way that I see to associate a lag value to one consumer group or the other without hard-coding in each consumer group. In a large deployment, that can be problematic.

The root of the issue is that we have to compare across data types to get the lag value (the consumergroup metricset vs. the partition metricset). The easiest solution looks to be including the lag value in the consumergroup metricset. Lag can then be associated with any combination of client ID, consumergroup ID, topic, and partition.
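As a side note, the per-partition comparison described above can be sketched in a few lines. This is a hypothetical illustration with invented field names (not the actual Metricbeat metricset schema): it joins consumergroup and partition documents on (topic, partition) and keys the lag by consumer group, instead of comparing global maxima.

```python
# Hypothetical sketch: join consumergroup and partition documents on
# (topic, partition) and compute lag per consumer group.
# Field names are illustrative, not the real Metricbeat schema.

def compute_lag(partition_docs, consumergroup_docs):
    """Return {(group, topic, partition): lag}, clamped at zero."""
    # Newest offset known for each (topic, partition)
    newest = {(d["topic"], d["partition"]): d["newest_offset"]
              for d in partition_docs}
    lag = {}
    for d in consumergroup_docs:
        key = (d["topic"], d["partition"])
        if key in newest:
            lag[(d["group"],) + key] = max(0, newest[key] - d["offset"])
    return lag

partitions = [
    {"topic": "t1", "partition": 0, "newest_offset": 120},
    {"topic": "t1", "partition": 1, "newest_offset": 200},
]
groups = [
    {"group": "g1", "topic": "t1", "partition": 0, "offset": 100},
    {"group": "g1", "topic": "t1", "partition": 1, "offset": 200},
    {"group": "g2", "topic": "t1", "partition": 0, "offset": 50},
]
print(compute_lag(partitions, groups))
```

Because lag is keyed by consumer group as well as topic and partition, two groups reading the same topic no longer collapse into a single max value, which is exactly the failure mode of the max-based dashboard aggregation.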

@jsoriano
Member

jsoriano commented Feb 4, 2019

@tkinkead thanks a lot for your feedback, I am going to reopen the issue.

If we have multiple clients in a single consumer group, the lag value doesn't represent the true lag of the consumer group. The Max aggregations mean that we may be comparing the consumergroup.offset value from one client against the partition.offset.newest value of a partition which that client is not connecting to.

Yes, you are right about that. This can still serve as a rough overview value if all partitions grow at the same rate, but I am not sure that is always the case.

If you select topic and partition, is the value shown correct?

If we have more than one Kafka consumer group accessing the same topic, then there's no meaningful way that I see to associate a lag value to one consumer group or the other without hard-coding in each consumer group. In a large deployment, that can be problematic.

Yes, you are right, I am afraid this case won't work. We should probably aggregate per consumer group, not only per topic.

The root of the issue is that we have to compare across data types to get the lag value (the consumergroup metricset vs. the partition metricset). The easiest solution looks to be including the lag value in the consumergroup metricset. Lag can then be associated with any combination of client ID, consumergroup ID, topic, and partition.

Metricbeat aims to collect metrics mainly from local services. It can also be configured to monitor services on other hosts or in cloud providers, but the intended premise is that Metricbeat doesn't have to connect to other hosts unless strictly needed.
For this case, what we do is collect all the offsets we can from the local brokers, and then calculate the lag at query/visualization time. This is more complicated, but it should work and it respects that premise.

That said, it may be worth it in this case to (at least optionally) connect to the partition leader from the consumergroup metricset and calculate the lag directly, to gain simplicity for this important value.

@jsoriano jsoriano reopened this Feb 4, 2019
@tkinkead

tkinkead commented Feb 4, 2019

If you select topic and partition, is the value shown correct?

No, because we have consumers in multiple consumer groups connected to the same topic and partition. If I select topic and partition in the dashboard, I will still only see the consumer group with the highest lag value for that topic and partition, not the lag values for all consumer groups.

Metricbeat aims to collect metrics mainly from local services. It can also be configured to monitor services on other hosts or in cloud providers, but the intended premise is that Metricbeat doesn't have to connect to other hosts unless strictly needed.

That's understandable. The best way to get the lag value is to pull it from the Zookeeper API directly, rather than connecting to partition leaders, collecting offsets, and then doing the calculations. It's possible that this simply isn't the correct way to do this. Perhaps lag and consumer group data should be built into a kafka metricset on the zookeeper module instead?

@ChrsMark
Member

ChrsMark commented Dec 5, 2019

Hey @tkinkead !

We have introduced a consumer_lag metric in #14822 (the difference between the partition offset and the consumer offset), and we updated the dashboard to make use of this new metric in #14863.

Let us know what you think about this, and if you agree we can close this issue.
Thanks!
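For readers following along: conceptually, the consumer_lag value described above is just a clamped per-partition difference. A minimal sketch of the idea (not the actual Metricbeat implementation):

```python
def consumer_lag(partition_offset_newest, consumergroup_offset):
    """Sketch of the concept: partition end offset minus the committed
    consumer offset, never negative. Not the actual Metricbeat code."""
    return max(0, partition_offset_newest - consumergroup_offset)

print(consumer_lag(150, 120))  # 30: consumer is 30 messages behind
print(consumer_lag(100, 100))  # 0: fully caught up
```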

@Lumotheninja

Hi, can I just ask how Metricbeat gets its offsets, since from Kafka 2.1.x onwards offsets are stored in the Kafka brokers? Is it Zookeeper or the individual brokers?

@ChrsMark
Member

Hi, can I just ask how Metricbeat gets its offsets, since from Kafka 2.1.x onwards offsets are stored in the Kafka brokers? Is it Zookeeper or the individual brokers?

Hi @Lumotheninja , consumer offsets are retrieved from the brokers.

@Lumotheninja

Right, just wanted to raise that question since @tkinkead suggested Zookeeper. +1 on PR

@andresrc
Contributor

Closing for now. Please reopen if the issue is not solved.
