Feature to "pause" input message queue consumers while output(s) are down #2240
Comments
Telegraf does buffer metrics when an output is down, up to the configured metric_buffer_limit. It's true that Kafka in particular could be handled differently. Currently there is no notification system informing an input plugin of what has happened on the output end of telegraf. Thus far, we have designed telegraf inputs to be independent of the outputs, and implementing this feature would have to fundamentally change that.
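(For context, the buffering mentioned above is controlled by the agent settings in telegraf.conf; a minimal sketch with illustrative values, not a recommendation:)
[agent]
# metrics are written to outputs in batches of at most this many points
metric_batch_size = 5000
# while an output is unreachable, up to this many metrics are held in memory
# per output; once the limit is reached the oldest metrics are dropped
metric_buffer_limit = 20000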
Thanks for the quick response. I was unsure whether to log this as a bug or a feature; I suspect now it is more of a feature. I understand that the buffer helps, but it is really for handling momentary network glitches, latency changes, data bursts and so on, which covers a few seconds or minutes. It is not good for an outage of hours, or a weekend, due to maintenance or an unexpected issue. Maybe this can be changed to a feature request or a longer-term integration in some way? I am sure many others would benefit from this too.
As a workaround I guess I can manually re-load data from a specific offset in Kafka for now - I'll need to see if telegraf logs the offsets so I know where to load from.
Thanks, Jason
see also #802
I think #802 can be resolved using Kafka :) All we then need is for telegraf to auto-recover from where it left off in the event of a platform failure somewhere :)
I'm changing the title of this issue because I think it's a good general feature to have for all message queue input plugins. Basically the idea would be that telegraf could signal to some input plugins (namely message queues, which have their own persistent storage) to stop accepting new messages until all output plugins are operational again. My thoughts on this are that it would only apply to message queues, and not apply to plugins that don't have a clear datastore behind them, like mem, cpu, statsd, tcp_listener, etc. See also #2265
This would also work very effectively (and would be preferred, I think). I raised #2265 as it could be an easier solution in terms of re-work / coding and would leave the effort on the implementor to manage. However, it would be for Kafka only, whereas a pause mechanism would be universal.
@biker73 I have added this functionality to 1.9 (currently in rc). The queue consumers, including kafka_consumer, will now pause reading new messages when they cannot be delivered to the outputs.
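For readers landing here later: a minimal sketch of what this looks like on the input side in 1.9, assuming the kafka_consumer plugin (broker, topic, and group names are placeholders):
[[inputs.kafka_consumer]]
brokers = ["localhost:9092"]
topics = ["telegraf"]
consumer_group = "telegraf_metrics_consumers"
# maximum number of messages read from Kafka but not yet delivered to an
# output; once this many are outstanding the consumer pauses reading
max_undelivered_messages = 1000
data_format = "influx"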
Bug report
Relevant telegraf.conf:
[global_tags]
env="UAT"
[agent]
interval = "5s"
round_interval = true
metric_batch_size = 5000
metric_buffer_limit = 20000
collection_jitter = "0s"
flush_jitter = "0s"
precision = "ms"
debug = true
quiet = false
omit_hostname = false
[[outputs.influxdb]]
urls = ["host:port"]
database = "historical"
retention_policy = "events.10d"
write_consistency = "any"
timeout = "5s"
username = "userid"
password = "password"
System info:
[Include Telegraf version, operating system name, and other relevant details]
Steps to reproduce:
Expected behaviour:
Telegraf should retain the Kafka offset of the last successful write. If the influxdb (or other) output is not available, it should stop reading data and pause polling until the output becomes available again. Once the output is available it should resume reading from Kafka from the stored offset of the last successful write.
Actual behaviour:
Messages are dropped / ignored. As a time series platform, telegraf / influx need to cope with outages, otherwise there are huge gaps in the data for the period when influx was unavailable.
Use case: [Why is this important (helps with prioritizing requests)]
To ensure no gaps / loss of data in a time series platform. Kafka stores the data, so telegraf should detect that influx is not available and stop trying to send metrics (and subsequently dropping them). It should resume from the last good Kafka offset when influx becomes available.