Feature to "pause" input message queue consumers while output(s) are down #2240
Comments
Telegraf does buffer metrics when an output is down, up to the configured metric_buffer_limit. It's true that Kafka in particular could be handled differently. Currently there is no notification system informing an input plugin of what has happened on the output end of telegraf. Thus far, we have designed telegraf inputs to be independent of the outputs, and implementing this feature would have to fundamentally change that.
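(For context, the buffering mentioned above is controlled by the agent settings in telegraf.conf; a minimal sketch with illustrative values, not a recommendation:)
[agent]
# metrics are written to outputs in batches of at most this many points
metric_batch_size = 5000
# while an output is unreachable, up to this many metrics are held in memory
# per output; once the limit is reached the oldest metrics are dropped
metric_buffer_limit = 20000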
Thanks for the quick response. I was unsure whether to log this as a bug or a feature; I suspect now it is more of a feature. I understand that the buffer helps, but it is really for handling momentary network glitches, latency changes, data bursts and so on, which covers a few seconds or minutes. It is not good for an outage of hours, or a weekend, due to maintenance or an unexpected issue. Maybe this can be changed to a feature request or a longer-term integration in some way? I am sure many others would benefit from this too.
As a workaround I guess I can manually re-load data from a specific offset in Kafka for now - I'll need to see if telegraf logs the offsets so I know where to load from.
Thanks, Jason
see also #802
I think #802 can be resolved using Kafka :) All we then need is for telegraf to auto-recover from where it left off in the event of a platform failure somewhere :)
I'm changing the title of this issue because I think it's a good general feature to have for all message queue input plugins. Basically the idea would be that telegraf could signal to some input plugins (namely message queues, which have their own persistent storage) to stop accepting new messages until all output plugins are operational again. My thoughts on this are that it would only apply to message queues, and not apply to plugins that don't have a clear datastore behind them, like mem, cpu, statsd, tcp_listener, etc. See also #2265
This would also work very effectively (and would be preferred, I think). I raised #2265 as it could be an easier solution in terms of re-work / coding and would leave the effort on the implementor to manage. However, it would be for Kafka only, whereas a pause mechanism would be universal.
@biker73 I have added this functionality to 1.9 (currently in rc). The queue consumers, including kafka_consumer, will now pause reading new messages when they cannot be delivered to the outputs.
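For readers landing here later: a minimal sketch of what this looks like on the input side in 1.9, assuming the kafka_consumer plugin (broker, topic, and group names are placeholders):
[[inputs.kafka_consumer]]
brokers = ["localhost:9092"]
topics = ["telegraf"]
consumer_group = "telegraf_metrics_consumers"
# maximum number of messages read from Kafka but not yet delivered to an
# output; once this many are outstanding the consumer pauses reading
max_undelivered_messages = 1000
data_format = "influx"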
Bug report
Relevant telegraf.conf:
[global_tags]
env="UAT"
[agent]
interval = "5s"
round_interval = true
metric_batch_size = 5000
metric_buffer_limit = 20000
collection_jitter = "0s"
flush_jitter = "0s"
precision = "ms"
debug = true
quiet = false
omit_hostname = false
[[outputs.influxdb]]
urls = ["host:port"]
database = "historical"
retention_policy = "events.10d"
write_consistency = "any"
timeout = "5s"
username = "userid"
password = "password"
System info:
[Include Telegraf version, operating system name, and other relevant details]
Steps to reproduce:
Expected behaviour:
Telegraf should retain the Kafka offset of the last successful write. If the influxdb (or other) output is not available, it should stop reading data and pause polling until the output becomes available again. Once the output is available it should resume reading from Kafka from the stored offset of the last successful write.
Actual behaviour:
Messages are dropped / ignored. As a time series platform, telegraf / influx need to cope with outages, otherwise there are huge gaps in the data for the period when influx was unavailable.
Use case: [Why is this important (helps with prioritizing requests)]
To ensure no gaps / loss of data in a time series platform. Kafka stores the data, so telegraf should detect that influx is not available and stop trying to send metrics (and subsequently dropping them). It should resume from the last good Kafka offset when influx becomes available.