Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[outputs.mqtt] (Windows Service) Fails to reconnect after network disconnect #15418

Closed
prairiesnpr opened this issue May 29, 2024 · 14 comments · Fixed by #15429
Closed

[outputs.mqtt] (Windows Service) Fails to reconnect after network disconnect #15418

prairiesnpr opened this issue May 29, 2024 · 14 comments · Fixed by #15429
Labels
bug unexpected problem or unintended behavior

Comments

@prairiesnpr
Copy link

Relevant telegraf.conf

[agent]
    interval = "1s"
    round_interval = true
    metric_batch_size = 1000
    metric_buffer_limit = 1000000
    collection_jitter = "0s"
    flush_interval = "1s"
    flush_jitter = "0s"
    debug = true
    quiet = false
    omit_hostname = true
    logfile = "C:\\telegraf\\telegraf.log"
    logfile_rotation_interval = "24h"
    logfile_rotation_max_archives = 5 

[[outputs.mqtt]]
  servers = ["tls://mqtt.site.com:8883"]
  protocol = "5"
  topic = 'company/location/group/machine/{{ .Tag "name" }}/{{ .Tag "data_type" }}'
  qos = 0
  keep_alive = 30
  username = "$MQTT_USERNAME"
  password = "$MQTT_PASSWORD"
  client_id = "$MQTT_CLIENT_ID"
  timeout = "5s"
  insecure_skip_verify = true
  data_format = "json"
  json_transformation = '$merge([{"value": fields.value, "timestamp": timestamp}])'
  
  [outputs.mqtt.v5]
     content_type = ""
     response_topic = ""
     message_expiry = "0s"
   [outputs.mqtt.v5.user_properties]
     "machine" = "$MQTT_CLIENT_ID"

Logs from Telegraf

2024-05-29T11:40:04Z I! Starting Telegraf 1.30.3 brought to you by InfluxData the makers of InfluxDB
2024-05-29T11:40:04Z I! Available plugins: 233 inputs, 9 aggregators, 31 processors, 24 parsers, 60 outputs, 5 secret-stores
2024-05-29T11:40:04Z I! Loaded inputs: execd (2x) modbus (3x)
2024-05-29T11:40:04Z I! Loaded aggregators: 
2024-05-29T11:40:04Z I! Loaded processors: override (3x) regex unpivot
2024-05-29T11:40:04Z I! Loaded secretstores: 
2024-05-29T11:40:04Z I! Loaded outputs: mqtt
2024-05-29T11:40:04Z I! Tags enabled: 
2024-05-29T11:40:04Z I! [agent] Config: Interval:1s, Quiet:false, Hostname:"", Flush Interval:1s
2024-05-29T11:40:04Z D! [agent] Initializing plugins
2024-05-29T11:40:04Z I! [inputs.modbus] Got 9 request(s) touching 24 holding registers for 12 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 inputs registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 discrete registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 coil registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 5 request(s) touching 12 holding registers for 6 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 inputs registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 discrete registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 coil registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 5 request(s) touching 10 holding registers for 5 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 inputs registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 discrete registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [inputs.modbus] Got 0 request(s) touching 0 coil registers for 0 fields (slave 1)
2024-05-29T11:40:04Z I! [processors.regex] tags: Using explicit mode...
2024-05-29T11:40:04Z I! [processors.regex] tags: Using explicit mode...
2024-05-29T11:40:04Z D! [agent] Connecting outputs
2024-05-29T11:40:04Z D! [agent] Attempting connection to [outputs.mqtt]
2024-05-29T11:40:04Z D! [agent] Successfully connected to outputs.mqtt
2024-05-29T11:40:04Z I! [inputs.execd] Starting process: C:\\Program Files\\Python312\\python [C:\\telegraf\\app1.py -c COM4]
2024-05-29T11:40:04Z I! [inputs.execd] Starting process: C:\\Program Files\\Python312\\python [C:\\telegraf\\app2.py -c COM3]
2024-05-29T11:40:08Z D! [inputs.modbus]  [Logs Truncated] 
2024-05-29T11:40:08Z D! [outputs.mqtt] Wrote batch of 32 metrics in 4.6409ms
2024-05-29T11:40:08Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:40:08Z D! [inputs.modbus]  [Logs Trunctuated] 
2024-05-29T11:40:12Z D! [outputs.mqtt] Wrote batch of 32 metrics in 4.114ms
2024-05-29T11:40:12Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:40:08Z D! [inputs.modbus]  [Logs Trunctuated] 
--- WiFi Adapter Disabled ---
2024-05-29T11:40:38Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 29 times]
2024-05-29T11:40:38Z D! [outputs.mqtt] Wrote batch of 30 metrics in 517.1µs
2024-05-29T11:40:38Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:40:08Z D! [inputs.modbus]  [Logs Truncated] 
2024-05-29T11:40:39Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 31 times]
2024-05-29T11:40:39Z D! [outputs.mqtt] Wrote batch of 32 metrics in 0s
--- Repeats While Adapter is Disabled ---
--- Adapter Enabled ---
2024-05-29T11:40:08Z D! [inputs.modbus]  [Logs Truncated] 
2024-05-29T11:41:49Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 31 times]
2024-05-29T11:41:49Z D! [outputs.mqtt] Wrote batch of 32 metrics in 0s
2024-05-29T11:41:49Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
--- Repeats ---
--- Service Stopped ---
2024-05-29T11:41:49Z D! [agent] Stopping service inputs
2024-05-29T11:41:50Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 8 times] 
2024-05-29T11:41:50Z D! [outputs.mqtt] Wrote batch of 9 metrics in 0s
2024-05-29T11:41:50Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:51Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 8 times] 
2024-05-29T11:41:51Z D! [outputs.mqtt] Wrote batch of 9 metrics in 0s
2024-05-29T11:41:51Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:52Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 8 times] 
2024-05-29T11:41:52Z D! [outputs.mqtt] Wrote batch of 9 metrics in 0s
2024-05-29T11:41:52Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:53Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:54Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 8 times] 
2024-05-29T11:41:54Z D! [outputs.mqtt] Wrote batch of 9 metrics in 115µs
2024-05-29T11:41:54Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:54Z I! [inputs.execd] Process C:\\Program Files\\Python312\\python.exe shut down
2024-05-29T11:41:55Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 8 times] 
2024-05-29T11:41:55Z D! [outputs.mqtt] Wrote batch of 9 metrics in 157.3µs
2024-05-29T11:41:55Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:56Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 8 times] 
2024-05-29T11:41:56Z D! [outputs.mqtt] Wrote batch of 9 metrics in 149.6µs
2024-05-29T11:41:56Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:57Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 8 times] 
2024-05-29T11:41:57Z D! [outputs.mqtt] Wrote batch of 9 metrics in 0s
2024-05-29T11:41:57Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:58Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 3 times] 
2024-05-29T11:41:58Z D! [outputs.mqtt] Wrote batch of 4 metrics in 0s
2024-05-29T11:41:58Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:59Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 4 times] 
2024-05-29T11:41:59Z D! [outputs.mqtt] Wrote batch of 5 metrics in 0s
2024-05-29T11:41:59Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:59Z D! [agent] Input channel closed
2024-05-29T11:41:59Z D! [agent] Processor channel closed
2024-05-29T11:41:59Z D! [agent] Processor channel closed
2024-05-29T11:41:59Z D! [agent] Processor channel closed
2024-05-29T11:41:59Z D! [agent] Processor channel closed
2024-05-29T11:41:59Z D! [agent] Processor channel closed
2024-05-29T11:41:59Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-05-29T11:41:59Z W! [outputs.mqtt] Could not publish message to MQTT server: connection with the MQTT server is currently down [Line repeats 8 times] 
2024-05-29T11:41:59Z D! [outputs.mqtt] Wrote batch of 9 metrics in 0s
2024-05-29T11:41:59Z D! [outputs.mqtt] Buffer fullness: 0 / 1000000 metrics
2024-05-29T11:41:59Z I! [agent] Stopping running outputs
2024-05-29T11:41:59Z D! [agent] Stopped Successfully

System info

Telegraf 1.30.3, Windows 10 Pro 10.0.19045 Build 19045, Running as Windows Service

Docker

No response

Steps to reproduce

  1. Start Telegraf
  2. Turn WiFi adapter off
  3. Enable WiFi adapter
  4. MQTT Output will stop sending until service is restarted

Expected behavior

Output resumes on network recovery, would expect this to be on the next interval 1 sec or 1000 data points.

Actual behavior

No further messages until service is restarted.

Additional info

This isn't limited to the network adapter, restarting the broker will cause the same result.

@prairiesnpr prairiesnpr added the bug unexpected problem or unintended behavior label May 29, 2024
powersj added a commit to powersj/telegraf that referenced this issue May 30, 2024
@powersj
Copy link
Contributor

powersj commented May 30, 2024

Hi,

I have put up #15429 with an option to enable the mqtt client trace logging. This should show us if the client attempts to reconnect or not. Could you please, enable client_trace = true in your mqtt output and try that artifact and provide the full and complete logs?

The artifacts will be added as a comment to that issue above in 20-30mins from this message.

connection with the MQTT server is currently down

When I did this, I did not see this specific error message. I did see that the client is always set to retry a connection and it appears to try in an exponential backoff, meaning after 1, 2, 4, 8, 16 seconds, etc. However, it was ultimately able to reconnect after I brought things back up.

@powersj powersj added the waiting for response waiting for response from contributor label May 30, 2024
@prairiesnpr
Copy link
Author

Thanks for looking into this. I grabbed the artifact and tested with it.

With debug = true and quiet = true, no logs are generated when the WiFi adapter is disabled, adding quiet = false generates the same logs as above. I'm not seeing any additional logging from the MQTT client.

[agent]
    interval = "1s"
    round_interval = true
    metric_batch_size = 1000
    metric_buffer_limit = 1000000
    collection_jitter = "0s"
    flush_interval = "1s"
    flush_jitter = "0s"
    debug = true
    quiet = true
    omit_hostname = true
    logfile = "C:\\telegraf\\telegraf.log"
    logfile_rotation_interval = "24h"
    logfile_rotation_max_archives = 5 

[[outputs.mqtt]]
  servers = ["tls://mqtt.site.com:8883"]
  protocol = "5"
  topic = 'company/location/group/machine/{{ .Tag "name" }}/{{ .Tag "data_type" }}'
  qos = 0
  client_trace = true
  keep_alive = 30
  username = "$MQTT_USERNAME"
  password = "$MQTT_PASSWORD"
  client_id = "$MQTT_CLIENT_ID"
  timeout = "5s"
  insecure_skip_verify = true
  data_format = "json"
  json_transformation = '$merge([{"value": fields.value, "timestamp": timestamp}])'

  [outputs.mqtt.v5]
     content_type = "json"
     response_topic = ""
     message_expiry = "0s"
   [outputs.mqtt.v5.user_properties]
     "machine" = "$MQTT_CLIENT_ID"

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label May 30, 2024
@powersj
Copy link
Contributor

powersj commented May 30, 2024

With debug = true and quiet = true

Yeah don't do that ;) quiet turns off logging. You want to omit quiet or set it to false. Debug and client_trace both need to be true.

@powersj powersj added the waiting for response waiting for response from contributor label May 30, 2024
@prairiesnpr
Copy link
Author

telegraf.log

Here is logs from an additional test, 30 sec online, 30 sec offline, then restore network for 2 min.

This is with debug=true, client_trace=true, quiet=false.

The output did not recover, nor do I see additional logging from the client_trace option.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jun 3, 2024
@powersj
Copy link
Contributor

powersj commented Jun 3, 2024

client_trace=true

I know you have it above, but to confirm, you have this in your actual config under the mqtt output? Are you sure you are running that file.

This is what I see, note the [client] and [store] debug messages:

2024-06-03T14:14:58Z I! Loading config: config.toml
2024-06-03T14:14:58Z I! Starting Telegraf 1.31.0-a2b47a1a brought to you by InfluxData the makers of InfluxDB
2024-06-03T14:14:58Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-06-03T14:14:58Z I! Loaded inputs: exec
2024-06-03T14:14:58Z I! Loaded aggregators: 
2024-06-03T14:14:58Z I! Loaded processors: 
2024-06-03T14:14:58Z I! Loaded secretstores: 
2024-06-03T14:14:58Z I! Loaded outputs: file mqtt
2024-06-03T14:14:58Z I! Tags enabled: 
2024-06-03T14:14:58Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2024-06-03T14:14:58Z D! [agent] Initializing plugins
2024-06-03T14:14:58Z D! [agent] Connecting outputs
2024-06-03T14:14:58Z D! [agent] Attempting connection to [outputs.file]
2024-06-03T14:14:58Z D! [agent] Successfully connected to outputs.file
2024-06-03T14:14:58Z D! [agent] Attempting connection to [outputs.mqtt]
2024-06-03T14:14:58Z D! [outputs.mqtt] [client]  Connect()
2024-06-03T14:14:58Z D! [outputs.mqtt] [store]   memorystore initialized
2024-06-03T14:14:58Z D! [outputs.mqtt] [client]  about to write new connect msg
2024-06-03T14:14:58Z D! [outputs.mqtt] [client]  EOF
2024-06-03T14:14:58Z D! [outputs.mqtt] [client]  failed to connect to broker, trying next
2024-06-03T14:14:58Z D! [outputs.mqtt] [client]  Failed to connect to a broker
2024-06-03T14:14:58Z D! [outputs.mqtt] [store]   memorystore closed
2024-06-03T14:14:58Z E! [agent] Failed to connect to [outputs.mqtt], retrying in 15s, error was "network Error : EOF"

We need the client logs as connection retry handling is dealt with the client, not actually telegraf.

Are you running this debug version as a service as well? If so, can I suggest running it by hand via the CLI?

@powersj powersj added the waiting for response waiting for response from contributor label Jun 3, 2024
@prairiesnpr
Copy link
Author

Yes, I can confirm that I have client_trace = true.

I tried the release version of Telegraf 1.30 and get an config option does not exist error with that config file, just to double check.

I also tried running from the terminal and not as a service, no change to logs, nothing shows from the client.

I then tried removing the logging so it would log to stdout and still no change to logs.

It also didn't recover when ran from the terminal, so it's likely not a service only issue.

I also created an Ubuntu VM, grabbed the same artifact and a minimal config file, it will reconnect when the network is restored, but doesn't generate any logs for the client, in fact it seems to not recognize it was disconnected and keeps reporting a successful write.

Any further ideas on how to troubleshoot, I'm at a loss on why I'm not seeing any client logs as you show in your output.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jun 3, 2024
@prairiesnpr
Copy link
Author

prairiesnpr commented Jun 3, 2024

Update on this, tried changing the MQTT version to 3.1.1 instead of 5 and the client logs appear.

This appears to be an issue with the MQTT 5 Client, using the 3.1.1 client the connection recovers as expected, but fails to recover with the MQTT 5 client.

@powersj
Copy link
Contributor

powersj commented Jun 3, 2024

Thanks for digging in further!

I'll take a look at getting the debug logging working for v5. I had incorrectly assumed the debug logging we were setting up worked for both v3.1.1 and v5 clients.

@powersj
Copy link
Contributor

powersj commented Jun 3, 2024

@prairiesnpr,

I've pushed an update to the PR, could you please download the artifacts and try now with the v5 client? You should see debug logs prefaced with [paho] like the following:

2024-06-03T18:59:41Z D! [agent] Attempting connection to [outputs.mqtt]
2024-06-03T18:59:41Z D! [paho]  failed to connect to tls://localhost:1883: dial tcp 127.0.0.1:1883: connect: connection refused
2024-06-03T18:59:51Z D! [paho]  failed to connect to tls://localhost:1883: dial tcp 127.0.0.1:1883: connect: connection refused
2024-06-03T19:00:01Z D! [paho]  failed to connect to tls://localhost:1883: dial tcp 127.0.0.1:1883: connect: connection refused

@powersj powersj added the waiting for response waiting for response from contributor label Jun 3, 2024
@prairiesnpr
Copy link
Author

Was able to test with the latest artifact. Similar test as before, 30sec connected, turn off NIC for 30sec, then turn NIC on for 1min.

Logs:
telegraf_connection_loss_v5.txt

This is the interesting thing to me, it looks like when we attempt to reconnect, the username is malformed.

2024-06-05T13:10:10Z D! [paho] failed to connect to tls://<mqtt-server>:8883: failed to connect to server: Sent CONNECT with malformed username.

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Jun 5, 2024
@powersj
Copy link
Contributor

powersj commented Jun 5, 2024

@prairiesnpr,

Thank you for digging into that! It does look like we were storing this as a temporary string as a part of the secret store. This behavior lines up with what you were seeing. I've pushed an update just now to store the full username.

You should have new artifacts in 20-30mins.

@prairiesnpr
Copy link
Author

Closer, now we get a credential error.

2024-06-05T14:29:18Z D! [paho] failed to connect to tls://<mqtt-server>:8883: failed to connect to server: Authentication failed because of invalid credentials

Full logs:
telegraf_connection_loss_v5_2.log

@powersj
Copy link
Contributor

powersj commented Jun 5, 2024

Seems I forgot that bytes is not safe to use after destroy either. Pushed one more, this hopefully does it :D

@prairiesnpr
Copy link
Author

Can confirm that 59805e4 resolves the issue. Thanks for your help!

powersj added a commit to powersj/telegraf that referenced this issue Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants