inputs.mqtt/mqtt_consumer: allow connection errors on start #10694
Comments
Hi, unfortunately this is the intended behavior of Telegraf, but I do see room for improvement on a per-plugin basis. First, let's consider your error message:
Think about what happens when a user mistypes their username/password or sets the wrong hostname/IP address for a service to collect from in their config. It is a lot less clear to users that something is wrong if Telegraf keeps on going. In your case, (if I ignore the timestamp) is that network error due to config error or an actual network issue? Failing prevents a false sense that everything is working. In terms of improvements, I do think we should add some retries around some error conditions like #10078 tries to call out. Given you are working with mqtt, it looks like you lost connection, we tried to connect and bailed? I am of the opinion that we should have some sort of exponential backoff retry logic in cases like these, but we should ultimately fail if after t time things do not clear up. Thoughts? |
Thanks @powersj it turns out the problem was caused by an incorrect mqtt server URL (needed ssl:// instead of tcp:// for MQTTS), but my main concern was that the whole telegraf service (in my case the Kubernetes pod's container) crashed and had to restart. Would it have crashed if I'd had other inputs configured that were successfully connected? I think each input should fail after a connection timeout and possibly some retries, but that should not cause the whole service to fail. Does that sound reasonable? Sorry I'm a telegraf and influxdb newbie and just learning as I go, I hope I'm not making incorrect assumptions about how it does or should work. |
Experiencing the same issue with This problem seems to occur from v1.19.0 onward. As a workaround I have downgraded to v1.18.3 for the time being. Just to add my two cents, some kind of flag to indicate that an input has to be healthy would be ideal, allowing the user to pick and choose (with a default in place) which inputs matter. |
As in issue #11289, which I submitted, I would like a configuration option that lets the user decide whether to ignore an input that fails to initialize. For details, please refer to that issue. I hope to get a clear answer from you, @srebhan: do you want to do this, and if so, how? If it's too late, we'll try to fix it ourselves. Thank you. |
I have the same issue here: if Telegraf cannot connect to mongodb or dockerd, the whole Telegraf process crashes. As we are using one Telegraf instance to monitor a number of things, one input error stops the other inputs from working even if they are correctly initialized. I think this behavior does not make sense. If this is a double-edged sword, I hope we can have an option that lets users decide how to configure it. |
* add ignore_init_fail_input option for ignore initialization failed Input influxdata#11289 influxdata#10694 * rename option ignore_init_fail_input to ignore_error_inputs
@haoel please create an issue for MongoDB and a separate one for Docker with a description of the failure. We should fix the two plugins. |
I guess this is not resolved yet? Seems crazy that one bad input would prevent all other collectors from functioning. In my case, I have multiple MQTT brokers, and if one drops off the network, none work because Telegraf can't handle the failed connection on startup. Should note that if Telegraf has connection on startup, everything is fine. If the connection to the mqtt broker then drops out, Telegraf doesn't care and keeps on chugging. As it should. Telegraf is rarely used in isolation, and anyone who is ingesting data is likely doing something with that data, and has other methods of noticing if there's a problem. One failed input shouldn't take down the whole system. |
I would like to add another use case that hopefully supports a request to change the error handling behavior within an input. I have a device on TTN which intermittently changes the payload of a specific topic. In the example below, the payload sometimes includes status detail within the path "uplink_message.decoded_payload", and sometimes it does not, however the location data is always included in the payload. (which I need).
When the status detail is not included the entire input fails with an error code indicated on --debug. Although the location data is valid, the measurement is not parsed to the influxdb output. A note here. I am very new to this, so could well be approaching this in the wrong way. Any advice is very welcome. |
+1 from me for some initial retry with exponential backoff. The failure mode I've observed was that the configuration in telegraf was 100% valid; it just took a little longer for the MQTT service to start up after boot, and the telegraf service on the same machine errored out in the meantime. |
Is there any solution or workaround for this? My telegraf tries to connect to an MQTT server, and it also runs ping tests to check connectivity to a number of devices, and then writes everything to InfluxDB. I had my MQTT server go down. Unfortunately this caused telegraf to continually restart and never run or complete any of the ping tests. I would be happy if I could just make telegraf try to connect to the MQTT server every 30 seconds or so, and in the meantime continue to run the ping tests as normal. |
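@CubicEarth and all others, please test the binary in PR #15486, available as soon as CI has finished the tests, and set startup_error_behavior = "retry" in your plugin configuration! Let me know if this fixes your issue! |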
Hi Sven,
Sadly I don't have a setup to tinker and test with, as my system is live.
But addressing this adds critical functionality. Thanks!!!
Corey
On Tue, Jun 11, 2024 at 12:06 PM Sven Rebhan wrote:
> @CubicEarth and all others, please test the binary in PR #15486, available as soon as CI finished the tests, and set startup_error_behavior = "retry" in your plugin configuration! Let me know if this fixes your issue!
|
@mprasil or @simonsmart99 or anyone else reading this, can you please test?!? |
@srebhan testing it with a live system is a bit more involved, but I put together a little test configuration:
# cat telegraf.toml
[agent]
hostname = "test"
[[outputs.file]]
files = ["stdout"]
[[inputs.mqtt_consumer]]
data_format = "value"
servers = ["tcp://localhost:1883"]
topics = ["test/topic/#"]
startup_error_behavior = "retry"
and then ran the downloaded binary (while having mqtt off) with
So it looks like it works exactly as expected, it disables the plugin while MQTT is unreachable with that |
The only issue I've observed is that once it connects it never tries to reconnect again should connection be dropped again. So if I start with MQTT up, then stop MQTT and start it again, telegraf will forever print |
Can you enable debug logging in your agent config and set |
I could not enable client_trace:
I ran the test with debug enabled:
On the MQTT side I can see the first connection:
But when I start it the next time, it does not see any clients connecting:
If you want to test it yourself, the config I used is in my previous comment and the MQTT broker is just a locally running docker run -it --net=host --rm eclipse-mosquitto |
@mprasil I guess the issue is that the connection loss detection has some insane defaults... It depends on the "keep-alive" interval and the "ping timeout" which are set to 60 seconds and 10 seconds respectively. So the time until we reconnect will sum up the two plus (in the worst case) your I added two parameters to the config |
I have run the test again with the same config as before:
It's been a couple of minutes and telegraf still has not reconnected. |
@mprasil did you really download and run the latest version from the PR? |
Just downloaded the latest version from #15486 and it does indeed seem to reconnect. However, I managed to crash it after a couple of rounds of reconnections:
It is kind of random, sometimes it happens on first try, sometimes it takes multiple tries. |
Thanks for all your testing @mprasil! |
Yeah, all good with the latest version. I've tortured telegraf with disconnections every couple of seconds and it kept reconnecting as it should, with no crashes observed. Thank you @srebhan, I'm looking forward to running this in prod at some stage. |
This commit was merged in June and included in the 1.32 milestone. I'm on 1.31.3. Is there an estimated timeline for 1.32 push? |
Well the release was on Monday... :-D |
Haha just saw it hit when I updated my raspberry this morning. Thanks! Implemented the line in my conf file and it's looking good so far. |
Feature Request
Proposal:
Telegraf should not crash when a single input fails to connect to its source. Ideally it would continue to retry the connection for that input, or permanently fail but continue running so that other inputs and outputs continue to work normally. There seem to be several bug reports related to this, including #3167 and #10078.
Current behavior:
A single telegraf mqtt_consumer input that fails to connect to an mqtt broker causes the entire telegraf service to shut down.
Example log (after the final log entry the telegraf service exits and logging ceases):
Desired behavior:
Each input would be responsible for its own data source connection and not affect other inputs/outputs when the connection fails.
Use case:
The software is not usable in production without this functionality.