Intermittent MQTTS Connection Failures (IDFGH-13979) #288
Comments
Cross-posted from espressif/arduino-esp32#10541 |
@euripedesrocha Can you please help with triage? Thanks |
@euripedesrocha just checking in, what do you think? how can I help us solve this problem? |
@vicatcu first I need more information. Could you provide logs? |
@euripedesrocha the logs were included in the OP:
This is certainly not typical of the output in the usual connect-report-disconnect case... |
@vicatcu I saw this; I meant more logs to identify whether we have an issue. Why would you say it's not a usual disconnect? The logs state that the socket was closed by the peer (I know they are not great at clarifying that). As I suggested, you could find your issue by looking into your broker's log. |
I would say it's not a normal disconnect on the basis that my code doesn't act like an It's difficult for me to get broker-logs as this is a production server with lots of clients connecting and disconnecting all the time - a bit of a fire hose. If you have advice on how I might instrument that, I'm open to exploring / examining it. My solution space is:
The behavioral pattern in my code is actually this if it helps:
And to be clear, when I say "power up" and "shutdown" I don't mean sleep modes, I mean power-off / power-on. |
@vicatcu,
If you are using a self-hosted Mosquitto in production, you could enable logging to the sys channel and subscribe to it for monitoring. Those logs should make it easier to identify the root cause. |
@euripedesrocha I captured a bit more output with additional logging enabled in my application:
I do have access to the sys channel and can subscribe to it to monitor it... but what exactly should I be looking for? |
I'm also seeing sometimes that I get the MQTT_EVENT_CONNECTED messages after I have given up and timed out (after 15 seconds)... that seems like a long time to connect |
For what it's worth, I tried implementing a "retry strategy" (up to three attempts) before giving up and I got this behavior:
My application is battery powered, though, so it's a trade-off whether to stay on and keep trying or to give up quickly, save state, and try again later. When it works, it works quickly, when it doesn't it takes a bit of time to fail. |
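To make the trade-off concrete, here is a minimal sketch of a bounded retry loop that stops at whichever comes first: a successful connection, the attempt limit, or an overall time budget. It is not the code from my application. `AttemptFn`, `ConnectOutcome`, and `connect_with_budget` are hypothetical names, and the attempt callback stands in for whatever starts the MQTT client and waits for `MQTT_EVENT_CONNECTED`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Result of one power-up cycle's connection effort.
struct ConnectOutcome {
    bool connected;             // true if any attempt succeeded
    int attempts_used;          // number of attempts actually made
    std::uint32_t elapsed_ms;   // total time reported by the attempts
};

// Hypothetical callback: tries to connect once, waits at most timeout_ms,
// returns true on success, and writes the time it took to *spent_ms.
using AttemptFn =
    std::function<bool(std::uint32_t timeout_ms, std::uint32_t* spent_ms)>;

// Retry until success, until max_attempts is reached, or until the overall
// time budget (a proxy for the energy budget) is used up.
ConnectOutcome connect_with_budget(const AttemptFn& attempt, int max_attempts,
                                   std::uint32_t per_attempt_ms,
                                   std::uint32_t budget_ms) {
    assert(max_attempts > 0);
    ConnectOutcome out{false, 0, 0};
    while (out.attempts_used < max_attempts && out.elapsed_ms < budget_ms) {
        // Never wait longer than what is left of the overall budget.
        std::uint32_t remaining = budget_ms - out.elapsed_ms;
        std::uint32_t timeout =
            remaining < per_attempt_ms ? remaining : per_attempt_ms;
        std::uint32_t spent = 0;
        bool ok = attempt(timeout, &spent);
        out.attempts_used++;
        out.elapsed_ms += spent;
        if (ok) {
            out.connected = true;
            break;
        }
    }
    return out;
}
```

With a 15 second per-attempt timeout and three attempts, the budget caps the worst case: the device gives up, saves state, and powers down once the budget is spent, even if attempts remain. The 40 second budget used when exercising this sketch is an arbitrary illustration, not a measured value.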
If I subscribe to
Do I need to widen the net on what I'm looking at somehow? |
I've captured about 19 hours of logs from my firmware in PuTTY and run some post-hoc analysis on them. The device powers up about every 5 minutes, with a hard shutdown between attempts. The summary of my preliminary analysis is that 53 of the 232 cycles (~23%) required one or more retries. Among those, 46 cycles (~20% of all cycles) connected successfully with 1-2 retries (37 required one retry, 9 required two), and 7 (~3%) failed to connect within 2 retries. These statistics don't seem that good to me... is this what one would expect? I'm running another data capture over the weekend without using MQTTS and will report back on the results. I have a hypothesis that this behavior is only evident when using TLS. I'm not sure where that leaves me if the hypothesis proves to be true... |
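For anyone who wants to check the arithmetic, the percentages above can be reproduced from the raw counts with a few lines. This is only a sketch: `CycleTally`, `needed_retry`, `recovered`, and `percent_of` are hypothetical names for illustration, not part of my analysis scripts.

```cpp
#include <cassert>
#include <cmath>

// Counts taken from the log analysis of one capture.
struct CycleTally {
    int total;         // power-up cycles observed
    int one_retry;     // connected after exactly one retry
    int two_retries;   // connected after exactly two retries
    int gave_up;       // still not connected after two retries
};

// Cycles that needed at least one retry, whether or not they recovered.
int needed_retry(const CycleTally& t) {
    return t.one_retry + t.two_retries + t.gave_up;
}

// Cycles that needed a retry and eventually connected.
int recovered(const CycleTally& t) {
    return t.one_retry + t.two_retries;
}

// Share of `part` in `total`, rounded to the nearest whole percent.
int percent_of(int part, int total) {
    assert(total > 0);
    return static_cast<int>(std::lround(100.0 * part / total));
}
```

Feeding in 232 total cycles with 37, 9, and 7 in the three buckets gives 53 cycles needing a retry (23%), 46 recovering (20%), and 7 giving up (3%), which matches the figures quoted above.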
Hi @vicatcu, as I mentioned before, from the logs you have it looks like there is some issue in getting a response from the broker during the connection process. |
@euripedesrocha as promised, I ran the test without MQTTS and got these results:
So my observations here are
What else can I do to help resolve this? Suffice it to say, not using MQTTS is not a great solution for me. Digging into the non-TLS logs, I see the connection failures manifesting in a variety of ways: Worst Case (the one that failed to connect after three attempts)
Here's a case where the second attempt worked (some errors came out presumably related to the first attempt?)
Another similar case, but different Error number (39615) in output:
Sometimes it looks like this succeeding on the second attempt:
Here's one that succeeded on the third attempt:
@vicatcu, could you share more logs with the level set to VERBOSE? I suggest putting the following in verbose mode:
I would also like you to describe the network scenario on your side. Is it a direct Wi-Fi connection? Are you going through a 4G modem? Is a mesh involved? The latest logs you shared indicate a timeout, while the first ones indicated that the connection was closed by the peer. Did you have any changes in your system? |
@euripedesrocha Network setup is very basic: a direct connection to a Wi-Fi router (Ubiquiti UDM) plugged into a terrestrial cable modem (Spectrum ISP). I need some instruction on how to set the level to VERBOSE for those modules in the Arduino environment. I should probably also offer a minimal reproducible example that closely mirrors the code in my application without as much complexity, since my complete code base is private and big. The latest logs don't represent any changes in my system other than, as I indicated, turning off TLS in the firmware. Nothing changed on my network or on the back-end / MQTT broker side of things. |
Checklist
How often does this bug occur?
rarely
Expected behavior
MQTTS connection should consistently work
Actual behavior (suspected bug)
When connecting to my MQTT server (self-hosted, Mosquitto), I intermittently experience failures to connect with MQTTS.
Error logs or terminal output
Steps to reproduce the behavior
My code effectively looks like this:
Project release version
idf-release_v5.1-632e0c2a
System architecture
Intel/AMD 64-bit (modern PC, older Mac)
Operating system
Linux
Operating system version
Ubuntu 22.04
Shell
Bash
Additional context
No response