Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent MQTTS Connection Failures (IDFGH-13979) #288

Open
3 tasks done
vicatcu opened this issue Oct 30, 2024 · 18 comments
Open
3 tasks done

Intermittent MQTTS Connection Failures (IDFGH-13979) #288

vicatcu opened this issue Oct 30, 2024 · 18 comments

Comments

@vicatcu
Copy link

vicatcu commented Oct 30, 2024

Checklist

  • Checked the issue tracker for similar issues to ensure this is not a duplicate
  • Read the documentation to confirm the issue is not addressed there and your configuration is set correctly
  • Tested with the latest version to ensure the issue hasn't been fixed

How often does this bug occurs?

rarely

Expected behavior

MQTTS connection should consistently work

Actual behavior (suspected bug)

When connecting to my MQTT Server (self-hosted, Mosquitto), I intermittently experience the failures to connect with MQTTS.

Error logs or terminal output

When it fails to connect, I get the following output from my code:


E (15049) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): EOF
E (15050) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=119
[MQTT EVENT] reported from esp-tls (0x8008)
[MQTT EVENT] Last errno string (Success)
E (15065) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned -2
E (15068) mqtt_client: MQTT connect failed

Steps to reproduce the behavior

My code effectively looks like this:

  static esp_mqtt_client_config_t mqtt_cfg;
  mqtt_cfg.broker.address.hostname = settings->server;
  mqtt_cfg.broker.address.transport = MQTT_TRANSPORT_OVER_SSL;
  mqtt_cfg.broker.address.port = 8883;
  mqtt_cfg.credentials.username = settings->username;
  mqtt_cfg.credentials.client_id = settings->identity;
  mqtt_cfg.credentials.authentication.password = settings->password;
  mqtt_cfg.broker.verification.certificate = root_ca;
  esp_mqtt_client_handle_t _client = esp_mqtt_client_init(&mqtt_cfg);
  esp_mqtt_client_register_event(_client, (esp_mqtt_event_id_t) ESP_EVENT_ANY_ID, mqtt_event_handler, NULL);
  const esp_err_t ret = esp_mqtt_client_start(_client);
  boolean connected = false;
  if (ESP_OK == ret) {
      connected = waitForMqttConnected();
      // polls for global _mqttIsConnected global variable set in mqtt event handler to become true
      // case MQTT_EVENT_CONNECTED sets it to true
      // case MQTT_EVENT_DISCONNECTED sets it to false
      // case MQTT_EVENT_ERROR sets it to false 
      // or times out after 15 seconds
  }

  ... event handler code that is printing out the [MQTT EVENT] log messages looks like this:
    case MQTT_EVENT_ERROR:
      lastMqttEventId = MQTT_EVENT_ERROR;
      _mqttIsError = true;
      _mqttIsConnected = false;
      esp_tls_stack_err = 0;
      if (event->error_handle->error_type == MQTT_ERROR_TYPE_TCP_TRANSPORT) {
        if (event->error_handle->esp_tls_last_esp_err  != 0) {
          Serial.printf("[MQTT EVENT] reported from esp-tls (0x%x)\r\n", event->error_handle->esp_tls_last_esp_err);
        }
        if (event->error_handle->esp_tls_stack_err != 0) {
          esp_tls_stack_err = event->error_handle->esp_tls_stack_err;
          Serial.printf("[MQTT EVENT] reported from tls stack (0x%x)\r\n", event->error_handle->esp_tls_stack_err);
        }
        if (event->error_handle->esp_transport_sock_errno != 0) {
          Serial.printf("[MQTT EVENT] captured as transport's socket errno (0x%x)\r\n", event->error_handle->esp_transport_sock_errno);
        }
        Serial.printf("[MQTT EVENT] Last errno string (%s)\r\n", strerror(event->error_handle->esp_transport_sock_errno));
      }
      break;
   ...

Project release version

idf-release_v5.1-632e0c2a

System architecture

Intel/AMD 64-bit (modern PC, older Mac)

Operating system

Linux

Operating system version

Ubuntu 22.04

Shell

Bash

Additional context

No response

@vicatcu
Copy link
Author

vicatcu commented Oct 30, 2024

Cross-posted from espressif/arduino-esp32#10541

@github-actions github-actions bot changed the title Intermittent MQTTS Connection Failures Intermittent MQTTS Connection Failures (IDFGH-13979) Oct 30, 2024
@VojtechBartoska
Copy link

@euripedesrocha Can you please help with triage? Thanks

@vicatcu
Copy link
Author

vicatcu commented Nov 7, 2024

@euripedesrocha just checking in, what do you think? how can I help us solve this problem?

@euripedesrocha
Copy link
Collaborator

@vicatcu first I need more information. Could you provide logs?
If the disconnection is happening rarely and mqtt client is able to reconnect, the behavior is as expected. Is it possible that the disconnection is a local network failure?
The error presented is typical from disconnection from the broker, I suggest you also look into the broker side logs also.

@vicatcu
Copy link
Author

vicatcu commented Nov 8, 2024

@euripedesrocha the logs were included in the OP:

E (15049) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): EOF
E (15050) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=119
[MQTT EVENT] reported from esp-tls (0x8008)
[MQTT EVENT] Last errno string (Success)
E (15065) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned -2
E (15068) mqtt_client: MQTT connect failed

This is certainly not typical of the output in the usual connect-report-disconnect case...

@euripedesrocha
Copy link
Collaborator

@vicatcu I saw this, I meant more logs to identify if we have an issue.

Why would you say it's not usual disconnect? The logs state that the socket was closed by the peer ( I know that they are not great in clarifying that).

As I suggest you could find your issue by looking into your broker's log.

@vicatcu
Copy link
Author

vicatcu commented Nov 11, 2024

I would say it's not a normal disconnect on the basis that my code doesn't act like an MQTT_EVENT_CONNECTED event ever happened. The trace I do get suggests its something related to TLS that's the problem. But to be fair, I don't have an explicit printf output in that switch-case, so I should probably add printf outputs to every switch-case in the event handler to get a better trace of what exactly is happening on the device side.

It's difficult for me to get broker-logs as this is a production server with lots of clients connecting and disconnecting all the time - a bit of a fire hose. If you have advice on how I might instrument that, I'm open to exploring / examining it.

My solution space is:

  • identifying if there is a bug in the library and helping squash it if possible
  • alternatively, identifying a strategy for coping with / recovering from whatever this intermittent behavior actually is

The behavioral pattern in my code is actually this if it helps:

  • power up
  • connect to Wi-Fi
  • connect to MQTT broker (via MQTTS on TCP port 8883)
  • publish some messages
  • disconnect from MQTT broker
  • shut down

And to be clear, when I say "power up" and "shutdown" I don't mean sleep modes, I mean power-off / power-on.

@euripedesrocha
Copy link
Collaborator

@vicatcu,
as you mention, the connected event is not happening because it doesn't complete. In such scenarios, the broker is rejecting the connect control message and closing.
Few things that might trigger such cases:

  • Authentication is incorrect
  • Repeated client id

If you are using a self-hosted mosquitto in production, you maybe enable the logs to the sys channel and subscribe to it to monitor. It will make the identification of the root cause easy to look to those logs.

@vicatcu
Copy link
Author

vicatcu commented Nov 12, 2024

@euripedesrocha I captured a bit more output with additional logging enabled in my application:

// the following two lines must be in response to esp_mqtt_client_start(_client)
E (14350) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): EOF
E (14350) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=119

// then the MQTT_EVENT_ERROR fires and these messages print out
[MQTT EVENT] MQTT_EVENT_ERROR
[MQTT EVENT] reported from esp-tls (0x8008)
[MQTT EVENT] Last errno string (Success)
E (14366) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned -2
E (14372) mqtt_client: MQTT connect failed

// then the MQTT_EVENT_DISCONNECTED fires
[MQTT EVENT] MQTT_EVENT_DISCONNECTED

I do have access to the sys channel and can subscribe to it to monitor it... but what exactly should I be looking for?

@vicatcu
Copy link
Author

vicatcu commented Nov 12, 2024

I'm also seeing sometimes that I get the MQTT_EVENT_CONNECTED messages after I have given up and timed out (after 15 seconds)... that seems like a long time to connect

@vicatcu
Copy link
Author

vicatcu commented Nov 12, 2024

For what it's worth, I tried implementing a "retry strategy" (up to three attempts) before giving up and I got this behavior:

// first attempt (same as before)
E (19477) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): EOF
E (19477) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=119

[MQTT EVENT] MQTT_EVENT_ERROR
[MQTT EVENT] reported from esp-tls (0x8008)
[MQTT EVENT] Last errno string (Success)
E (19492) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned -2
E (19498) mqtt_client: MQTT connect failed

[MQTT EVENT] MQTT_EVENT_DISCONNECTED
 Failed, no connect

// second attempt (first retry)
[MQTT EVENT] MQTT_EVENT_CONNECTED

My application is battery powered, though, so it's a trade-off whether to stay on and keep trying or to give up quickly, save state, and try again later. When it works, it works quickly, when it doesn't it takes a bit of time to fail.

@vicatcu
Copy link
Author

vicatcu commented Nov 12, 2024

If I subscribe to $SYS/broker/log/# on the broker, and filter messages down to those containing the MQTT ID of my development unit, are pairs of messages like this:

1731447729: New client connected from 74.69.54.130 as XXXXXXXXXXXXXXXX (c1, k120, u'XXXXXXXXXXXXXXXX').
1731447729: Client XXXXXXXXXXXXXXXX disconnected.

Do I need to widen the net on what I'm looking at somehow?

@vicatcu
Copy link
Author

vicatcu commented Nov 15, 2024

I've captured about 19 hours of logs from my firmware captured in PuTTY and run some post-hoc analysis on it. The summary of my preliminary analysis is that about 53 of the 232 cycles (23%), at power up cycles of about every 5 minutes - hard shutdown between attempts, required one or more retries. Among those, 46 cycles (~20%) successfully connected with 1-2 retries (37 required one retry, 9 required 2 retries), and 7 (~3%) failed to connect within 2 retries. These statistics don't seem that good to me... is this what one would expect?

I'm running another data capture over the weekend without using MQTTS and will report back on the results. I have a hypothesis that this behavior is only evident when using TLS. I'm not sure where that leaves me if the hypotheses proves to be true...

@euripedesrocha
Copy link
Collaborator

Hi @vicatcu, as I mentioned before from the logs you have, it looks like some issue in getting a response from the broker during connection process.
The error the MQTT client is showing happens when the peer in the connection closes the socket. Few questions that might help to find the cause:
Since you are adding the information that you are battery powered: Are you going into deep sleep?
Do you disconnect the client or just put the device to sleep?
How is your keep alive with the broker compared to your device sleep patterns? Is it possible that the broker is seeing your reconnection attempt and dropping the incorrect connection?
Is it possible that you have more devices with the same client ID?

@vicatcu
Copy link
Author

vicatcu commented Nov 20, 2024

@euripedesrocha as promised, I ran the test without MQTTS and got these results:

295 cycles total
277 cycles (93.9%) no connection errors (connected first try) 
18 cycles (6.1%)  connection errors (did not connect on first try)
16 cycles successfully connected on the second attempt
1 cycle successfully connected on the third attempt
1 cycle did not successfully connect after three attempts

So my observations here are

  1. The results are markedly different with and without MQTTS
    • ~3 times more likely to not connect on the first try with MQTTS
  2. Even without TLS retries are required to connect in a significant number of cases
  3. Even without TLS three retries is not always enough retries

What else can I do to help resolve this. Suffice it to say, not using MQTTS is not a great solution for me.

Digging into the non-TLS logs I see the connection failures manifesting in a variety of ways:

Worst Case (the one that failed to connect after three attempts)

Attempting MQTT connection... using Authentication...[MQTT ERROR] Connection Timed out
 Failed, no connect
Attempting MQTT connection... using Authentication...E (44007) esp-tls: [sock=48] select() timeout
E (44008) transport_base: Failed to open a new connection: 32774
E (44008) mqtt_client: Error transport connect
[MQTT EVENT] reported from esp-tls (0x8006)
[MQTT EVENT] Last errno string (Success)
[MQTT ERROR] Connection Timed out
 Failed, no connect
Attempting MQTT connection... using Authentication...E (59028) esp-tls: [sock=48] select() timeout
E (59029) transport_base: Failed to open a new connection: 32774
E (59029) mqtt_client: Error transport connect
[MQTT EVENT] reported from esp-tls (0x8006)
[MQTT EVENT] Last errno string (Success)
[MQTT ERROR] Connection Timed out
 Failed, no connect

Here's a case where the second attempt worked (some errors came out presumably related to the first attempt?)

MQTT INIT...OK
Attempting MQTT connection... using Authentication...[MQTT ERROR] Connection Timed out
 Failed, no connect
Attempting MQTT connection... using Authentication...E (40677) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned 0
E (40677) mqtt_client: MQTT connect failed
 OK

Another similar case, but different Error number (39615) in output:

Attempting MQTT connection... using Authentication...[MQTT ERROR] Connection Timed out
 Failed, no connect
Attempting MQTT connection... using Authentication...E (39615) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned 0
E (39615) mqtt_client: MQTT connect failed
 OK

Sometimes it looks like this succeeding on the second attempt:

Attempting MQTT connection... using Authentication...[MQTT ERROR] Connection Timed out
 Failed, no connect
Attempting MQTT connection... using Authentication... OK

Here's one that succeeded on the third attempt:

MQTT INIT...OK
Attempting MQTT connection... using Authentication...[MQTT ERROR] Connection Timed out
 Failed, no connect
Attempting MQTT connection... using Authentication...E (39377) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned 0
E (39377) mqtt_client: MQTT connect failed
[MQTT ERROR] Connection Timed out
 Failed, no connect
Attempting MQTT connection... using Authentication... OK

@vicatcu
Copy link
Author

vicatcu commented Nov 20, 2024

Hi @vicatcu, as I mentioned before from the logs you have, it looks like some issue in getting a response from the broker during connection process. The error the MQTT client is showing happens when the peer in the connection closes the socket. Few questions that might help to find the cause: Since you are adding the information that you are battery powered: Are you going into deep sleep? Do you disconnect the client or just put the device to sleep? How is your keep alive with the broker compared to your device sleep patterns? Is it possible that the broker is seeing your reconnection attempt and dropping the incorrect connection? Is it possible that you have more devices with the same client ID?

  • I'm not using any kind of sleep. The processor literally shuts down between cycles (hardware power off).
  • I implement explicit MQTT disconnects after connecting and before shutting down.
  • I don't think I'm configuring the keepalive timeout in the esp-mqtt library configuration structure, so I assume whatever the default setting is what is being applied.
  • I guarantee the Client IDs are all unique - there's no possibility that I have more than one device with a given Client ID.
  • Again I'm using mosquitto as my broker, nothing exotic going on here in this respect.

@euripedesrocha
Copy link
Collaborator

@vicatcu,
thanks for all the information added. Unfortunately is hard for me to identify possible causes and suggest improvements on your system or to point that we have an issue on esp-mqtt, or other idf components from the information I got from you. I'm not able to replicate the behavior you have within our current test scenario, I'll put some time to add more monitoring to our current test and try to identify if we have some faulty behavior on our side.

Could you share more logs with the level set to VERBOSE? I suggest put on verbose mode:

  • mqtt_client
  • transport_base
  • transport
  • esp_tls

I would also like you to share how is the network scenario on your side. Direct WIFI connection? Are you going through some 4G modem? Is mesh involved?

The latest logs you shared are indicating timeout, while the first ones indicated that the connection was closed by the peer. Did you had any changes in your system?

@vicatcu
Copy link
Author

vicatcu commented Nov 21, 2024

@euripedesrocha Network setup is very basic - direct connection to Wi-Fi router (Ubiquiti UDM) plugged into a terrestrial cable modem (Spectrum ISP). I need some instruction on how to set the level to VERBOSE on those module in the Arduino environment. I think I probably should also try and offer up a minimal reproducible example since my complete code base is private and big, in which I try and closely mirror the code in my application without as much complexity.

The latest logs don't represent any changes in my system, other than as I indicated, turning off TLS in the firmware. Nothing changed on my network or on the back-end / MQTT broker side of things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants