[outputs.mqtt] (Windows Service) Fails to reconnect after network disconnect #15418
Comments
Hi, I have put up #15429 with an option to enable the MQTT client trace logging. This should show us whether the client attempts to reconnect or not. Could you please enable it? The artifacts will be added as a comment to that issue above in 20-30 minutes from this message.
When I did this, I did not see this specific error message. I did see that the client is always set to retry a connection and it appears to try in an exponential backoff, meaning after 1, 2, 4, 8, 16 seconds, etc. However, it was ultimately able to reconnect after I brought things back up. |
Thanks for looking into this. I grabbed the artifact and tested with it. With debug = true and quiet = true, no logs are generated when the WiFi adapter is disabled; setting quiet = false generates the same logs as above. I'm not seeing any additional logging from the MQTT client.
Yeah don't do that ;) quiet turns off logging. You want to omit quiet or set it to false. Debug and client_trace both need to be true. |
Here are logs from an additional test: 30 sec online, 30 sec offline, then network restored for 2 min. This is with debug=true, client_trace=true, quiet=false. The output did not recover, nor do I see additional logging from the client_trace option.
I know you have it above, but to confirm, you have this in your actual config under the mqtt output? Are you sure you are running that file. This is what I see, note the
We need the client logs, as connection retry handling is done by the client, not by Telegraf itself. Are you running this debug version as a service as well? If so, can I suggest running it by hand via the CLI?
Yes, I can confirm that I have client_trace = true. Just to double check, I tried the release version of Telegraf 1.30 with that config file and got a "config option does not exist" error. I also tried running from the terminal instead of as a service: no change to the logs, and nothing shows from the client. I then removed the log file setting so it would log to stdout, and still saw no change. It also didn't recover when run from the terminal, so it's likely not a service-only issue. I also created an Ubuntu VM and grabbed the same artifact and a minimal config file. There it reconnects when the network is restored, but it doesn't generate any client logs; in fact, it seems not to recognize that it was disconnected and keeps reporting successful writes. Any further ideas on how to troubleshoot? I'm at a loss as to why I'm not seeing any client logs like the ones in your output.
Update on this: I tried changing the MQTT version to 3.1.1 instead of 5, and the client logs appear. This appears to be an issue with the MQTT 5 client. With the 3.1.1 client the connection recovers as expected, but with the MQTT 5 client it fails to recover.
Thanks for digging in further! I'll take a look at getting the debug logging working for v5. I had incorrectly assumed the debug logging we were setting up worked for both v3.1.1 and v5 clients. |
I've pushed an update to the PR, could you please download the artifacts and try now with the v5 client? You should see debug logs prefaced with
Was able to test with the latest artifact. Similar test as before: 30 sec connected, NIC off for 30 sec, then NIC on for 1 min. Logs: This is the interesting part to me: it looks like the username is malformed when we attempt to reconnect.
Thank you for digging into that! It does look like we were storing this as a temporary string as a part of the secret store. This behavior lines up with what you were seeing. I've pushed an update just now to store the full username. You should have new artifacts in 20-30mins. |
Closer, now we get a credential error.
Full logs: |
Seems I forgot that bytes is not safe to use after destroy either. Pushed one more, this hopefully does it :D |
Can confirm that 59805e4 resolves the issue. Thanks for your help! |
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.30.3, Windows 10 Pro 10.0.19045 Build 19045, Running as Windows Service
Docker
No response
Steps to reproduce
Expected behavior
Output resumes on network recovery. I would expect this to happen at the next interval (1 sec or 1000 data points).
Actual behavior
No further messages until service is restarted.
Additional info
This isn't limited to the network adapter; restarting the broker causes the same result.