Unresponsive device #217
Comments
Oh dear. @FrankRenp has seen similar issues in #204. If the blue LED is on then I assume it's still operational and only the WiFi is not working, so no MQTT messages either, although it would be good if you could confirm that. I've made a small change to 1.9.3b to show the number of TCP disconnects in the I'll look into the Retain number in Telnet and also the vertical bars in Firefox - thanks for reporting. |
Unplugged the device and updated it to b2 via OTA from my PC. I have one additional question about the VS Code compilation on Windows. I followed your instructions and all prerequisites are OK. But when I compile using 'Platformio build' it always fails at the point where a file is downloaded to .pio\libdeps\release_tmp_installing-lnxbzn-package (example) with a Windows permission error; the folder isn't created. If I try 'platformio upload' instead, it runs (I do not see any activity in that folder). After the upload, the build also runs fine. What do you think? |
Is it possible for you to change the MQTT heartbeat format to JSON? |
|
Meanwhile I have built a configuration for extracting strings, here for the WiFi strength:
|
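For readers following the JSON heartbeat request: a minimal sketch of how a consumer could pull the WiFi strength out of a JSON payload. The field names (`wifi_strength`, `uptime`, `mqtt_disconnects`) and the sample payload are assumptions made up for illustration; the actual heartbeat format is not shown in this thread.

```python
import json


def wifi_strength(payload: str):
    """Return the WiFi strength (percent) from a JSON heartbeat, or None.

    The key "wifi_strength" is a hypothetical field name, not taken
    from the EMS-ESP firmware.
    """
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None  # payload was not valid JSON (e.g. a plain string)
    if not isinstance(data, dict):
        return None
    value = data.get("wifi_strength")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return value


# Hypothetical heartbeat payload, for illustration only
example = '{"wifi_strength": 52, "uptime": 97200, "mqtt_disconnects": 0}'
print(wifi_strength(example))
```

With a JSON payload, each value can be read by key instead of by position in a string, which is the main argument for the format change.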
Today I saw several reboots. I have created a direct web link to the ESP in HA (TCP/IP address). If I try that link on my Samsung mobile phone, I see that the browser downloads something (it takes indefinitely long) and nothing happens (on a PC it works). A few seconds later the ESP becomes unreachable and it often restarts. Now the event log says: Log was erased due to probable file corruption |
Any ideas on how to reproduce? For example, does it happen after using telnet and the web interface frequently? I've been running my test system for almost a day with no outages or MQTT disconnects yet. I've only done the occasional Telnet to check system health.
It may be a problem with SPIFFS since the event log was corrupted. Set I'll continue to monitor and find a sequence to reproduce a crash or WiFi disconnect. I suspect it's something in the web. |
Dear @proddy The device is currently available via telnet and web (through the client WiFi): Could I perform some tests to get more details? I am using a 10.x.x.x network instead of 192.168.x.x. In the telnet console I see that the 10.x.x.y address is set on my WiFi, while at the same time it states "Device is in AP mode with SSID ems-esp". If I go near the device, I can switch my mobile device over to the ems-esp WiFi. Update: (I have not tested whether I can also connect to the device using this IP address.) - I can't connect via telnet to 192.168.4.1 using the ems-esp WiFi. I also noticed in my heartbeat history (HA) that the device showed a significant decrease in WiFi signal strength 5 minutes before (from around 50% to 36%). |
Now, that is interesting. The fact that it switched to AP mode is strange. In the code I check for WiFi disconnects and if there are more than 10 dropouts it will go to AP mode. What does the System Stability say in My test system has been running for more than 24hrs with no disconnects. When I purposely kill the Mosquitto MQTT broker and restart it, the EMS-ESP successfully reconnects (and increments the # TCP disconnect counter). The same happens when I turn off the WiFi. So it seems to be able to reconnect each time. When you see the MQTT connection broken, does it reconnect or just stay down? Do you have the HA availability_topic set to the topic "home/ems-esp/status"? Things to test
|
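The AP-mode fallback described above ("more than 10 dropouts") can be sketched roughly as follows. This is a simplified model in Python, not the firmware's code: the class and method names are hypothetical, and it assumes the dropout counter is never reset after a successful reconnect, which the thread does not confirm.

```python
class WifiWatchdog:
    """Toy model of a dropout counter that falls back to AP mode."""

    MAX_DROPOUTS = 10  # threshold mentioned in the thread

    def __init__(self):
        self.dropouts = 0
        self.mode = "client"  # "client" = joined to the home WiFi

    def on_disconnect(self):
        """Record one WiFi dropout and return the resulting mode."""
        self.dropouts += 1
        # "more than 10 dropouts" means the 11th dropout triggers AP mode
        if self.dropouts > self.MAX_DROPOUTS:
            self.mode = "ap"
        return self.mode


watchdog = WifiWatchdog()
modes = [watchdog.on_disconnect() for _ in range(11)]
print(modes[9], modes[10])
```

Under this assumption, a device with marginal signal that drops and reconnects repeatedly would eventually land in AP mode even if every single reconnect succeeded, which would match an uptime that was never reset.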
I checked the MQTT disconnects: 0 this morning. I also saw no restart, because the uptime, which I track using the heartbeat in HA, was not reset to zero. The bad thing is that I restarted the ESP by accident just 5 minutes before (as a test, I had set up an automation to restart the ESP when the status topic shows offline). Could you please try to decrease the signal strength of your WiFi using metal or by placing the device far away from the access point? I think the problem arises when the WiFi strength is critical, around 30%. I will deactivate the restart automation and wait for the next outage. Then I'll try to get more details out of it. The second test will be to deactivate MQTT for two days... |
type I will start logging the heartbeat to my HA and use Grafana to see if there is any loss in WiFi strength |
When I first updated 1.9.0 => 1.9.2 I had similar-looking symptoms and a completely unstable system (see #218). I then downgraded to 1.9.1 and, after 24h of stable operation, updated to 1.9.2 again. Since then the system has been stable for more than 24h: no MQTT connection loss, and parallel web & telnet access is fine.
|
Thanks @stbuerger. The WiFi websocket will time out and lose the connection if left open in the browser, and I have seen some cases of a browser refresh (F5) causing the ESP to reboot. This is something I need to look into. The left-side scrollbar I fixed yesterday (in the dev branch). |
Dear @proddy Today I started my PC, opened a browser and logged into the ESP homepage, then the ESP restarted spontaneously. My log file is disabled by default.
To me it seems to be an issue with the web interface (and telnet) on the one hand (restarts), and an issue when the ESP switches over to AP mode (lost MQTT connection) on the other. I'll try some tests in the next days... I thought about the AP issue: |
Update: I just downloaded the 1.9.3 bin release and will try with your compilation. Perhaps it's related to my compilation... |
For sharing, here is my HA configuration (heartbeat):
And the customization:
|
Added this to the wiki here https://github.com/proddy/EMS-ESP/wiki/Home-Assistant |
Dear @proddy |
could be. I'm working on using TravisCI to create nightly builds so we're on the same binary. Hopefully, I'll wrap this up in the next days. |
did it. The latest dev builds will be available in https://github.com/proddy/EMS-ESP/releases |
Update: running for 3 days without any problem using your binary. |
maybe its getting too hot? can you touch the cap on the ESP to see if your finger melts? |
I created a test application which is essentially the EMS-ESP without the EMS, so only WiFi, MQTT etc. This way I can stress test for days and watch for disconnects or any other unusual behavior. The available heap memory is worrying; it means there is a memory leak somewhere or the automatic garbage collector is not kicking in early enough. Hopefully with this test bed I can find the culprit. Thanks for reporting this. |
After 9 days, the device seems to have trouble connecting on the first attempt: The WiFi connection is more or less stable around 52% with some short peaks: worst 42%; best 58% today (like all the days before, without any issue!). At boot time the graph showed 48% |
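One way to judge whether logged free-heap values point to a leak is to fit a straight line through them and look at the slope. The sketch below is a generic least-squares fit in Python, not part of the test application mentioned above; the function name and the sample readings are made up for illustration.

```python
def heap_slope(samples):
    """Least-squares slope of free heap (bytes) over time (seconds).

    `samples` is a list of (timestamp_seconds, free_heap_bytes) pairs.
    A clearly negative slope over many hours hints at a leak.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    num = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


# Hypothetical hourly readings, for illustration only
readings = [(0, 30000), (3600, 29000), (7200, 28000)]
slope = heap_slope(readings)
print(f"{slope * 3600:.0f} bytes per hour")
```

A slope near zero with occasional dips suggests normal allocation churn, while a steady decline of the kind in the made-up readings would be worth chasing.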
strange. It's a shame it takes so long to reproduce! |
I've made a similar observation. This morning (after approx. 2 days, 9 hours) the number of disconnects increased, and again in the afternoon (after approx. 2 days, 22 hours). The WiFi connection is between 38% (connection to the fritz.box) and 100% (connection to the repeater). I do not see anything that tells me whether a connection change is related to the disconnects, but I cannot exclude that. |
I have a test system that's running and tracking MQTT disconnects, WiFi strength and memory via Grafana. It was running for 2 days (with no interruptions) until my wife unplugged the USB charger that was powering the Wemos D1 to vacuum the floor! Doh. It's restarted now, so let's see what it reports back. The test code I'm using is in https://github.com/proddy/MyESP |
Currently 27h uptime with 1.9.4b10. |
Current uptime approx 51 hours. Some Wifi-Reconnects but no mqtt disconnects - looks good. [NTP] Internet time: 12:06:30 UTC on 11/11. Local time: 13:06:30 CET |
I haven't seen any dropout for days. Closing. Please re-open if you still see signs of instability |
Dear @proddy
Yesterday I was really happy with the current release (1.93b1).
Late in the evening I ran into trouble: suddenly the ESP8266 could no longer be reached. No MQTT, no SSH, no HTTP. The device restarted often. This happened while I was using the telnet console in parallel with the web interface. I reset the device several times and tried to connect, but it didn't help.
Then I restarted my Windows PC and the dedicated access point, and stopped a raspi I had just been working on with a test environment for HA, esphome, MQTT, ...
Then it worked again until this morning at 9 am, when it stopped working without any hint (the light is still on). I don't have any idea what happened...
I also saw that the ESP was unresponsive several times. The web interface couldn't be reached. The browser showed only text pages (caching?). I also noticed that the mDNS name could not be resolved reliably.
I stressed the dedicated WLAN segment with my laptop (DSL test) and saw that the ESP was hard to reach from my wired PC.
The web log showed only the restarts. Nothing else.
This morning I then tried to connect to the web interface and saw that the dashboard suddenly reported a problem with tx. Then the device was unreachable again and restarted.
13:00: The web interface isn't really reachable but telnet is. I can see MQTT connection errors and a fatal exception 4 flag 3 (Soft_WDT) ...
The event log does not show any detail about that, neither the lost tx nor any other issue.
I am not quite sure what happened. Perhaps the signal strength is not good enough. My next action will be to place the access point directly in the same room... and check the external power supply (a common installation for several devices).
Signal strength: ~40%
Router: AVM 7590
Access Point: AVM Repeater 1750E
Could you please add additional events, like lost communication with MQTT, bad status of mDNS (if applicable), higher CPU load, lost tx?
Could you please publish the most recent firmware version 1.9.3bx for me, so I can be sure that my compilation isn't the reason for that?
Do you have any other ideas for tests?
I also noticed that the telnet interface shows a huge integer number for the MQTT retain flag, while the web interface shows the correct one. Perhaps this is a bug.
Last but not least, my Firefox 69.0.3 (Windows) always shows a vertical slider on the navigation bar between the icons and the text items.