Telegraf 1.12 poor performance (kafka_consumer -> influxdb) #6354

Closed
robcowart opened this issue Sep 5, 2019 · 27 comments · Fixed by #6363
Labels: area/kafka, performance
Milestone: 1.12.1

Comments

@robcowart
Contributor

Just a heads up: someone needs to check the newly released Telegraf 1.12. I had Telegraf pulling data from Kafka and sending it to InfluxDB. After the upgrade Telegraf started falling behind, and I could see a lag of over 2.5M events in Kafka (that is about an hour of data in this environment).

I downgraded to 1.11.5, and this previous version was able to catch up and clear the Kafka backlog in less than 2 minutes.

I don't know whether the Kafka input or the InfluxDB output was the problem, as there were no error logs of any kind.

@danielnelson
Contributor

I can look into the kafka_consumer performance; we updated the Kafka library to a newer version, which was required to support the latest Kafka versions. There were no major changes to the InfluxDB output or agent code.

@danielnelson danielnelson added the area/kafka and performance labels on Sep 5, 2019
@robcowart
Contributor Author

robcowart commented Sep 5, 2019 via email

@danielnelson
Contributor

Here is the issue that initiated the change: #6136. In 1.11.5 we use sarama-cluster, which has been deprecated for some time.

@danielnelson
Contributor

A quick check of the kafka_consumer plugin didn't turn up any difference on my modest test setup. I can do 10k metrics/second with a single Telegraf instance, a single line of line protocol per message, and gzip compression, and it appears to be limited by how fast I can write to the Kafka broker.
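
To generate a comparable load, a producer-side Telegraf config along these lines can fill the topic (a sketch only; the broker address, topic name, and cpu input are assumptions, not the exact setup used here):

[agent]
  interval = "1s"
  metric_batch_size = 1000

[[inputs.cpu]]
  percpu = true

[[outputs.kafka]]
  brokers = ["localhost:9092"]
  topic = "telegraf"
  data_format = "influx"
  ## 1 = gzip, matching the compressed messages described above
  compression_codec = 1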

Can you check the following:

  • What message rate are you getting with 1.11.5 compared to 1.12.0? The internal input plugin may be useful for collecting this (see the sketch after this list).
  • Can you add the relevant Telegraf configuration?
  • How much CPU is Telegraf using with 1.11.5 and 1.12.0?
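
For reference, a minimal sketch of the internal plugin; I believe it emits an internal_write measurement with a metrics_written field that can be compared between 1.11.5 and 1.12.0:

[[inputs.internal]]
  ## collect per-plugin gather/write statistics; memory stats are optional
  collect_memstats = false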

@robcowart
Contributor Author

The Kafka environment is:
3 brokers (Confluent Platform 5.3.0, i.e. Kafka 2.3)
One topic with 6 partitions and a replication factor of 2.

Here is the Telegraf config (only the uncommented lines and without credentials)...

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  collection_jitter = "0s"

  metric_batch_size = 1000
  metric_buffer_limit = 10000

  flush_interval = "9s"
  flush_jitter = "2s"

  precision = "50ms"

  debug = false
  quiet = false
  logfile = ""

  hostname = ""
  omit_hostname = false

[[inputs.kafka_consumer]]
  brokers = ["192.0.2.1:9092","192.0.2.2:9092","192.0.2.3:9092"]
  topics = ["telegraf"]
  version = "2.0.0"

  consumer_group = "telegraf-influxdb1"
  offset = "oldest"
  max_message_len = 1000000

  data_format = "influx"

[[outputs.influxdb]]
  urls = ["http://192.0.2.11:8086"]
  database = "telegraf"

Let's walk through some charts...

last 24-hours

The reason for the upgrade throughout the environment was to add the new logstash and docker_log inputs (which, BTW, both work well), as well as a few other inputs. With the additional data the overall volume obviously increases, which is why the right side of the chart is a bit higher than the left side.

Screenshot 2019-09-06 at 09 17 19

I will break this chart down step-by-step...

steady-state prior to upgrade

This is the volume of writes before Telegraf was upgraded from 1.11.3 to 1.12.0.

Screenshot 2019-09-06 at 08 58 30

after the upgrade to 1.12.0

Here you can see that the overall volume of traffic has decreased. However, since data is still coming into Kafka at the rates above (actually a little higher because of the new data), InfluxDB begins to fall behind. This is how I first recognized there was an issue: my charts were missing recent data, and it was getting worse.

Screenshot 2019-09-06 at 09 01 05

troubleshooting

As I was trying to figure out what was wrong, I restarted Telegraf a few times. What I noticed was that immediately after a restart Telegraf would consume data at a faster rate; those are the three little spikes seen on this graph. However, it would then quickly fall back to a slower, insufficient rate.

You mention that your quick tests showed good throughput. Based on my observations I would expect that for a short period after startup, but it doesn't last. The test should be run over a longer period, reading a topic with a large amount of data (at least a few million records).

Screenshot 2019-09-06 at 09 05 22

You will also notice in this chart the really big spike to the right. This was after the downgrade to Telegraf 1.11.5. It is only a brief spike because the previous version was able to consume the Kafka backlog so quickly that the problem cleared itself almost immediately, and the overall system was back to normal.

steady-state after the downgrade to 1.11.5

Back at 1.11.5 everything is back to normal. Even with the higher rate from the new data being collected in this environment, there is no lag consuming data from Kafka.

Screenshot 2019-09-06 at 09 07 48

CPU

I don't have CPU data specific to Telegraf, but the following charts still reveal a lot.

Screenshot 2019-09-06 at 09 51 27

On these charts you can still see the three smaller spikes from the Telegraf restarts and the larger spike after the downgrade. There is also activity between the second and third small spikes, and after the large downgrade spike.

This system also has an InfluxDB 2.0-alpha instance with its own separate Telegraf instance that was consuming data from the same Kafka topic (different consumer group). That CPU activity corresponds to when I shut down the 2.0-alpha container to eliminate it while troubleshooting (the activity between small spikes 2 and 3), and when I later restarted it after the downgrade. Since this instance was shut down for a while, it was further behind and thus took longer to catch up. I actually felt that it processed the lagged data more slowly than the 1.x instance, but 2.0 is alpha and that isn't really related to this issue.

With that cleared up, it is most telling to focus on the steady-state utilization between small spikes 1 and 2, and the steady state after the downgrade. You can see that CPU utilization was actually lower while less data was being consumed. This leads me to suspect that the issue isn't a case of 1.12.0 consuming excessive resources and leaving too few resources to consume more data.

It seems that there is either something wrong with how Telegraf is communicating with Kafka, or some other issue internal to Telegraf. I wonder about the latter because of the brief period after a restart during which data was consumed quickly. It is as if data is read fine until some internal buffers are full, but Telegraf can't clear those buffers fast enough and everything slows down. In that case the issue isn't the Kafka input but something else internal to Telegraf. This is of course speculation, but it does fit the observed behavior.

@robcowart robcowart changed the title from "Telegraf 1.12 performance too slow" to "Telegraf 1.12 poor performance (kafka_consumer -> influxdb)" on Sep 6, 2019
@danielnelson
Contributor

Can you try setting the following in the kafka_consumer plugin? You can set it for both 1.11 and 1.12; it is double the default. It may be that the batch size is not being filled, so the immediate flush is never triggered.

max_undelivered_messages = 2000

@robcowart
Contributor Author

I have read the description of that setting...

  ## Maximum messages to read from the broker that have not been written by an
  ## output.  For best throughput set based on the number of metrics within
  ## each message and the size of the output's metric_batch_size.
  ##
  ## For example, if each message from the queue contains 10 metrics and the
  ## output metric_batch_size is 1000, setting this to 100 will ensure that a
  ## full batch is collected and the write is triggered immediately without
  ## waiting until the next flush_interval.

... and it sounds like this will only help improve nearness to real time when the overall volume of data is low. When data is backing up in Kafka, there is no shortage of data to fill a batch.

Can you explain why you believe that this could be an issue?

@oacosta40

Hi there, I had the same issue with the latest Telegraf release (1.12), so I had to downgrade to 1.11.5. These are some of the log messages after installing 1.12:

2019-09-04T17:55:35Z I! Starting Telegraf 1.12.0
2019-09-04T17:55:35Z I! Loaded inputs: tomcat tomcat tomcat tomcat tomcat nginx http burrow kafka_consumer
2019-09-04T17:55:35Z I! Loaded aggregators:
2019-09-04T17:55:35Z I! Loaded processors:
2019-09-04T17:55:35Z I! Loaded outputs: influxdb influxdb influxdb influxdb
2019-09-04T17:55:35Z I! Tags enabled: host=srvcomunicacionesplvfl09
2019-09-04T17:55:35Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"srvcomunicacionesplvfl09", Flush Interval:10s
2019-09-04T17:55:50Z E! [inputs.kafka_consumer] Could not mark message delivered: 83221
2019-09-04T17:56:40Z E! [inputs.kafka_consumer] Could not mark message delivered: 88466
2019-09-04T18:08:20Z E! [inputs.kafka_consumer] Could not mark message delivered: 157892
2019-09-04T18:28:10Z E! [inputs.kafka_consumer] Could not mark message delivered: 277058
2019-09-04T18:33:50Z E! [inputs.kafka_consumer] Could not mark message delivered: 311296
2019-09-04T18:37:50Z E! [inputs.kafka_consumer] Error in plugin: kafka: error while consuming siscom-metricas-analisis-bitacora/1: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
2019-09-04T18:38:21Z E! [inputs.kafka_consumer] Could not mark message delivered: 346191
2019-09-04T18:49:10Z E! [inputs.kafka_consumer] Could not mark message delivered: 410441
2019-09-04T19:15:20Z E! [inputs.kafka_consumer] Could not mark message delivered: 567517
2019-09-04T19:16:10Z E! [inputs.kafka_consumer] Error in plugin: kafka: error while consuming siscom-metricas-analisis-bitacora/19: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
2019-09-04T19:16:32Z E! [inputs.kafka_consumer] Could not mark message delivered: 614128
2019-09-04T19:21:50Z E! [inputs.kafka_consumer] Error in plugin: kafka: error while consuming siscom-metricas-analisis-bitacora/23: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
2019-09-04T19:22:27Z E! [inputs.kafka_consumer] Could not mark message delivered: 717521
2019-09-04T19:31:20Z E! [inputs.kafka_consumer] Could not mark message delivered: 770952
2019-09-04T19:38:40Z E! [inputs.kafka_consumer] Could not mark message delivered: 815242

@danielnelson
Contributor

Telegraf triggers a write when the flush_interval elapses or when an output receives metric_batch_size worth of new metrics, whichever comes first. Another way to describe this option is that it is the maximum number of unacknowledged messages the input will take; if this isn't enough to fill a complete metric_batch_size, the input has to wait until the next flush_interval elapses and the write succeeds before it can continue taking data from Kafka.
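
As a concrete illustration, using the batch size from the configuration posted above and the value suggested earlier (and assuming roughly one metric per message), keeping more than a full batch of messages in flight lets the output assemble a complete batch and write immediately instead of waiting for flush_interval:

[agent]
  ## 1000 new metrics trigger an immediate write
  metric_batch_size = 1000

[[inputs.kafka_consumer]]
  ## allow two full batches of unacknowledged messages in flight,
  ## so a complete batch is always available to the output
  max_undelivered_messages = 2000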

I'm considering this because of how uniform the writes per second are; it looks like Telegraf is holding back, otherwise I would expect more irregularity.

Could not mark message delivered: 815242

I ran into this yesterday during my testing too, I have a fix for 1.12.1. It didn't occur to me until now, but you are right this would eventually eat into the undelivered messages and slow/stop the plugin.

Error in plugin: kafka: error while consuming siscom-metricas-analisis-bitacora/23: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.

I have not seen this one; maybe it's a follow-on issue from the prior bug?


Let me put up a pull request with the fix for the "Could not mark message delivered" issue. Do you think you could test a development build?

@robcowart
Contributor Author

@danielnelson writes are fairly uniform because this is all IT monitoring data, collected at either 30 or 60 second intervals.

If you can point me to a dev build, I can test it, but it won't be until late Sunday or early Monday.

@robcowart
Contributor Author

How about we test first and then decide whether to close this?

@danielnelson
Contributor

We can reopen it if the fix doesn't handle the problem. This is our workflow on issues: merging the PR closes the issue.

Here are some builds with the fix:

@oacosta40

I've been testing the new build and everything seems to be working fine. Thanks for the fix @danielnelson.
This is a snapshot that shows the different behaviors of the consumer lag on the same topic with the different Telegraf versions: Snapshot

@danielnelson
Contributor

@oacosta40 Great news, thanks for the help!

@robcowart
Contributor Author

I was able to deploy it quickly this morning. Initial indications are positive; I will check back on it in a few hours to confirm. The one thing I noticed is that this 1.13 build is missing some of the new features that were in 1.12, such as the docker_log input.

@robcowart
Contributor Author

Things look good regarding ingestion from Kafka.

@adithyamk

I was a bit late moving to 1.12 and am now facing the same performance issue, so I searched for a solution and found this issue. I will update to the nightly build of 1.13 and try it out.
@robcowart Thanks for the performance metrics.
@danielnelson Thanks for fixing it.

@danielnelson danielnelson added this to the 1.12.1 milestone Sep 20, 2019
@danielnelson
Contributor

@adithyamk This should be fixed in the 1.12.1 release, so it shouldn't be required to go to a nightly build.

@mohsin106

@danielnelson I'm running Telegraf 1.13.3 (Telegraf unknown (git: master 88a8963)), and when I try to read from Kafka I see the following messages in my debug output:

2020-03-01T12:31:24Z D! [sarama] consumer/broker/1025 abandoned subscription to my-test-topic/4 because consuming was taking too long
2020-03-01T12:31:26Z D! [sarama] consumer/broker/1024 abandoned subscription to my-test-topic/3 because consuming was taking too long
2020-03-01T12:31:26Z D! [sarama] consumer/broker/1026 abandoned subscription to my-test-topic/0 because consuming was taking too long
2020-03-01T12:31:29Z D! [sarama] consumer/broker/1023 added subscription to my-test-topic/2
2020-03-01T12:31:30Z D! [sarama] consumer/broker/1023 abandoned subscription to my-test-topic/2 because consuming was taking too long
2020-03-01T12:31:33Z D! [sarama] consumer/broker/1027 added subscription to my-test-topic/1
2020-03-01T12:31:33Z D! [sarama] consumer/broker/1027 abandoned subscription to my-test-topic/1 because consuming was taking too long
2020-03-01T12:31:35Z D! [sarama] consumer/broker/1025 added subscription to my-test-topic/4
2020-03-01T12:31:36Z D! [sarama] consumer/broker/1025 abandoned subscription to my-test-topic/4 because consuming was taking too long
2020-03-01T12:31:37Z D! [sarama] consumer/broker/1026 added subscription to my-test-topic/0
2020-03-01T12:31:37Z D! [sarama] consumer/broker/1024 added subscription to my-test-topic/3
2020-03-01T12:31:37Z D! [sarama] consumer/broker/1026 abandoned subscription to my-test-topic/0 because consuming was taking too long
2020-03-01T12:31:37Z D! [sarama] consumer/broker/1024 abandoned subscription to my-test-topic/3 because consuming was taking too long
2020-03-01T12:31:42Z D! [sarama] consumer/broker/1023 added subscription to my-test-topic/2

It seems like broker subscriptions are being torn down and rebuilt while processing the ingestion. Is this normal? Could this message indicate that data is possibly being lost?

My Telegraf config looks like this:

[global_tags]

[agent]
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "5s"
  flush_jitter = "0s"
  debug = true
  quiet = false
  hostname = "telegraf-agent-1"
  omit_hostname = false

[[inputs.kafka_consumer]]
  brokers = ["kafka-server01.mgt.com:9093", "kafka-server02.mgt.com:9093", "kafka-server03.mgt.com:9093", "kafka-server04.mgt.com:9093", "kafka-server05.mgt.com:9093"]
  topics = ["my-test-topic"]
  client_id = "telegraf-client-id-1"
  consumer_group = "telegraf-agent-v1"
  offset = "oldest"
  data_format = "influx"
  max_undelivered_messages = 2500
  enable_tls = true
  tls_ca = "/etc/telegraf/ca.pem"
  tls_cert = "/etc/telegraf/telegraf.cer"
  tls_key = "/etc/telegraf/telegraf-priv.key"

[[outputs.influxdb]]
  urls = ["http://10.10.10.10:8086"]
  database = "telegraf"
  write_consistency = "any"
  timeout = "5s"

@robcowart
Contributor Author

robcowart commented Mar 1, 2020

@mohsin106 this is really a different issue. It looks like Telegraf is failing to connect to any broker and is trying each broker in the brokers list one after the other. Those logs indicate an errTimedOut condition.

My first question is whether your Kafka brokers are intentionally set to listen on port 9093? The default is 9092.

@mohsin106

@robcowart I can open a different issue if this isn't related. I was just trying to understand what condition would generate those log messages. Is it a limitation of Telegraf or something on the Kafka side? I am able to get a ton of data into my InfluxDB, but I was concerned that data might be dropped. The Kafka cluster is not managed by me, but yes, it is intentionally set to listen on port 9093.

@danielnelson
Contributor

@mohsin106 Yes, could you open a new issue? I think maybe it is taking too long from when the message is read from Kafka until the offsets are updated.
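
As a side note, the "abandoned subscription ... because consuming was taking too long" debug message is sarama hitting its MaxProcessingTime while messages wait to be processed and acknowledged. Newer Telegraf releases expose this as a kafka_consumer option; the option name and availability below are an assumption for your version, so check the plugin README:

[[inputs.kafka_consumer]]
  ## time sarama allows for fetched messages to be handled before it abandons
  ## the partition subscription; increase from the 100ms default if downstream
  ## writes legitimately take longer
  max_processing_time = "500ms"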

@sgreszcz

I'm experiencing similar performance issues. I'm using 6 Telegraf instances to produce collectd metrics to a Kafka topic. I'm then reading this Kafka topic with another Telegraf instance and sending the data to InfluxDB on the same server. The format is InfluxDB line protocol with no compression.

The consumer Telegraf and InfluxDB are running on an 8-core machine with 32 GB of RAM and are hardly using any resources. However, there is consumer lag from the Telegraf consumer.

I'm running InfluxDB 2.0-beta and Telegraf 1.14.4 (git: HEAD c6fff6d) in Docker.

@danielnelson
Contributor

The fix for this issue is to make sure your max_undelivered_messages is larger than your metric_batch_size. This ensures that the output is triggered to write immediately without waiting for the flush_interval to elapse:

[agent]
  metric_batch_size = 1000

[[inputs.kafka_consumer]]
  max_undelivered_messages = 2500

@sgreszcz


Thank you. How can I have it poll more often?

@sgreszcz

sgreszcz commented Jun 25, 2020

OK, I made these changes and I think it is working. It caught up on the 25M-metric lag and now bounces between 0 and around 2500 lag depending on the polling. I'm still not sure if this is "right", but it seems to be working:

[agent]
  ## Default data collection interval for all inputs
  interval = "500ms"
  metric_batch_size = 1000

[[inputs.kafka_consumer]]
  ## Kafka brokers.
  ## Maximum length of a message to consume, in bytes (default 0/unlimited);
  ## larger messages are dropped
  max_message_len = 1000000

Screenshot 2020-06-25 at 11 08 55

Thank you @danielnelson

@AlberTadrous

FYI, I have been using Telegraf 1.11.x with Confluent Platform 5.3 (Kafka 2.3) for a few weeks now and experienced no issues.


Hi,
I am new to Telegraf and I want to use the kafka_consumer plugin with Confluent Cloud Platform, but I always get:
client/metadata fetching metadata for all topics from broker xxxxxxxxxxxxxx:9092
2020-08-14T01:56:24Z D! [sarama] Failed to read SASL handshake header : unexpected EOF
2020-08-14T01:56:24Z D! [sarama] Error while performing SASL handshake xxxxxxxxxxxxxx:9092
2020-08-14T01:56:24Z D! [sarama] Closed connection to broker xxxxxxxxxxxxxx:9092
2020-08-14T01:56:24Z D! [sarama] client/metadata got error from broker -1 while fetching metadata: unexpected EOF

Do you know what I should do?
Thanks,
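
For anyone hitting the same SASL handshake error: it usually means the client is talking SASL over a plaintext connection while the broker expects TLS. A minimal sketch, assuming a Telegraf version whose kafka_consumer supports SASL and using placeholder credentials:

[[inputs.kafka_consumer]]
  brokers = ["xxxxxxxxxxxxxx:9092"]
  topics = ["telegraf"]
  ## Confluent-hosted brokers require TLS plus SASL/PLAIN credentials
  enable_tls = true
  sasl_username = "API_KEY"      # placeholder
  sasl_password = "API_SECRET"   # placeholder
  data_format = "influx"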
