Skip to content
This repository has been archived by the owner on Oct 4, 2021. It is now read-only.

Unresponsive device #217

Closed
S-Przybylski opened this issue Oct 21, 2019 · 33 comments
Closed

Unresponsive device #217

S-Przybylski opened this issue Oct 21, 2019 · 33 comments
Labels
bug Something isn't working

Comments

@S-Przybylski
Copy link

Dear @proddy
yesterday i was totally luky about the current release (1.93b1).
Late in the evening i got into trouble: Suddenly the esp8266 could not be reached any longer. No MQTT, no SSH no http. The device restarted often. This happend while i used the telnet console in parallel with the webinterface. I reset the device several times, try to connect. But doesn't help.
Then i restarted my Windows-PC, the dedicated Access Point and stopped a raspi i just worked on with a test environment for HA,esphome,mqtt,...
Then it worked again until this morning at 9 am: Then it stopped working without any hint (light is still on). I don't have any idea whats happened...

I also saw that the esp was unresponsive serveral times. The web interface can't be reached. The browser showed only text pages (caching?). I also noticed, that the mdns name could not be resolved reliable.
I stressed the dedicated wlan segment with my laptop (dsl test) and saw, that the esp had problems to be reached using my wired PC.
The web log showed only the restarts. Nothing else.
This morning then i tried to connect to the web interface and saw that suddenly the dashboard noticed a problem with tx. Then the device was not reachable again and restarts.

13:00: The web interface isn't realy reachable but the telnet does. I can see mqtt connection errors and a fatal exception 4 flag 3 (Soft_WDT) ...

The event do not show any detail about that. Either the lost tx nor any issue.

I am not quite sure what happened. Perhaps the signal strength is not good enough. My next action will be that i place the access point directly in the same room... And check the external power supply (common installation for several devices).

Signal strength: ~40%
Router: AVM 7590
Access Point: AVM Repeater 1750E

Could you please add additional events, like lost communication to mqtt, bad status of mdns (if applicable), higher load of the cpu, lost tx?

Could you please publish a most rescent firmware version 1.9.3bx to me to be sure that my compilation isn't the reason for that?

Do you have another idea to test?

I also noticed that the telnet interface show a huge integer number for the mqtt retain flag, while the web interface shows the correct one. Perhaps this is a bug.

Last but not least my firefox 69.0.3 (Windows) always show a vertical slider on the navigation bar between the icons and the text items

@S-Przybylski S-Przybylski added the bug Something isn't working label Oct 21, 2019
@proddy
Copy link
Collaborator

proddy commented Oct 21, 2019

Oh dear. @FrankRenp has seen similar issues in #204

If the Blue LED is on then I assume it's still operational and only the WiFi is not working, so no MQTT messages either. Although it would be good if you could also confirm that.

I've made a small change to 1.9.3b to show the number of TCP disconnects in the system command, which gets reset at each boot time. I'll also run some stress tests to see if I can recreate the problem.

I'll look into the Retain number in Telnet and also the vertical bars in Firefox - thanks for reporting.

@S-Przybylski
Copy link
Author

Unplugged the device and updated at my PC via OTA to b2
Telnet: Retain is fixed!
Remounted the device again. Runs without any issue! Webinterface is responding fastly
Wifi around 50% (i moved the device a little bit on the wall).
The light is always on. A strange thing is that i often could connect using telnet while web doens't function.
Regarding the bar: I mean a scroll bar within the navigation plane, it is always present, also in full screen
I will retest now!

I have one addtional question about the vscode compilation on windows. I used your instruction. All prerequisites are ok. But when i compile using 'Platformio build' it always fails at the point when a file will be downloaded to .pio\libdeps\release_tmp_installing-lnxbzn-package (example): Windows permission error. The folder isn't created. But if i try 'platformio upload' instead, its runs (i do not see any activity in that area). After upload, then also build runs well. What do you mean?

@S-Przybylski
Copy link
Author

Is it possible for you to change the mqtt heartbeat format towards json?
Then this values could be reported by HA for testing too (graph,..)

@proddy
Copy link
Collaborator

proddy commented Oct 21, 2019

  • I tested with Firefox and see the long vertical white bar at the left hand of the screen. I can reproduce this with 1.9.0 too so it's apparently been there from the beginning of time. I'll look into it. Probably the menu JavaScript
  • The Build and Upload should be the same. Some times its just good to delete the whole .pio folder and then build again using F11 (on Windows in VSC)
  • I have a script that does nightly builds of the dev branch using TravisCI. Its not 100% perfect yet but you should be able to pull working .bin files. It's under here. I need a few more days to tidy it up.
  • MQTT heartbeat in JSON is a great idea. I'll do that too.

proddy added a commit that referenced this issue Oct 21, 2019
@S-Przybylski
Copy link
Author

S-Przybylski commented Oct 21, 2019

meanwhile i have a configuration build up for extracting strings - here wifi strength:

value_template: '{{ value.split(" ")[2].split("=")[1].split("%")[0] }}'

@S-Przybylski
Copy link
Author

Today i saw several reboots.
At least one time the esp8266 had a memory leak (value = 18%) (i track the heartbeat with HA).
Often MQTT sends information while the webinterface can't be reached any longer. Also the telnet interface is better available then the web interface.

I have created a direct web links to the esp in HA (TCP/IP address). If i try that link on my samsung mobile phone, i see that the browser download something (takes indefinite time) and nothing happens (on a PC its working). Then the esp will be unreachable some seconds later and often it restarts.

Now the event log says: Log was erased due to probable file corruption

@proddy
Copy link
Collaborator

proddy commented Oct 22, 2019

Any ideas on how to reproduce? Like does it happen after using the telnet and web frequently?

I've been running my test system for almost a day with no outages or MQTT disconnects yet. I've only done the occasional Telnet to check system health.

 [APP] EMS-ESP version: 1.9.3b3
 [APP] MyESP version: 1.2.12
 [APP] Build timestamp: 2019-10-21 18:46:59
 [APP] Uptime: 0 days 21 hours 18 minutes 40 seconds
 [APP] System Load: 2%
 [WIFI] WiFi Hostname: ems-esp
 [WIFI] WiFi IP: 10.10.10.198
 [WIFI] WiFi signal strength: 72%
 [MQTT] is connected (heartbeat enabled)
 [NTP] Time in UTC is 19:22:14
 [SYSTEM] System is Stable
 [SYSTEM] Board: PLATFORMIO_D1_MINI
 [SYSTEM] Restart count: 0
 [SYSTEM] # TCP disconnects: 0

it may be a problem with the SPIFFs since the event log was corrupted. Set log_events to off and see if that helps with the stability.

I'll continue to monitor and find a sequence to reproduce a crash or WiFi disconnect. I suspect its something in the web.

@S-Przybylski
Copy link
Author

S-Przybylski commented Oct 23, 2019

Dear @proddy
yesterday i put b3 on the device. It ran nearly 12h without any problem. Then i saw in HA that the mqtt connection was broken (i use availability_topic). I checked mqtt server - thats fine - other devices like sonoff are working. The telnet interface and web is also reachable. I see that the heap is 13136 byte (web) and also an old bad behaviour: The device says that its in AP mode again while i communicate using client wifi with my individual ssid!
Could it be that the outgoing connection leads to errors because the device try to use the AP mode ip address to get in contact with the mqtt server, while all incoming connections are handled thru the firstly established client wifi address?

The device is currently available via telnet and web (thru the client wifi): Could i perform some tests to get more details?

I am using a 10.x.x.x network instead of 192.168.x.x. I see at the telnet console that the 10.x.x.y address is set using my wifi and at the same time its stated that the "Device is in AP mode with SSID ems-esp". If i go near to the device, i can switch over the wifi to ems-esp with my mobile device.

Update: (I have not tested if i can connect the device also using this IP address. ) - I can't connect telnet to 192.168.4.1 using the ems-esp wifi.

I also noticed in my hearbeat history (HA) that the device shows up a significant decrease of wifi signal strength 5 minutes before (from around 50% to 36%).

@proddy
Copy link
Collaborator

proddy commented Oct 23, 2019

now, that is interesting. The fact that it switched to AP mode is strange. In the code I check for Wifi disconnects and if there are more than 10 dropouts it will go to AP mode. What does the System Stability say in system? Also the # restarts ?

My test system has been running for than 24hrs with no disconnects. When I purposely kill the Mosquitto MQTT broker and restart it the EMS-ESP successfully reconnects (and increments the # TCP disconnect counter). The same when I turn off the WiFi. So it seems to be able to reconnect each time.

When you see the MQTT connection broken, does it re-connect or just stay down?

Do you have the HA availability_topic on the topic "home/ems-esp/status" ?

Things to test

  • see system and check for stability, #restarts and #disconnects
  • turn off all MQTT, using mqtt_enabled and turn off event logging (see previous comment) and see if its still stable

@S-Przybylski
Copy link
Author

S-Przybylski commented Oct 23, 2019

I checked the mqtt disconnects: 0 this morming. Also i saw no restart, because of the uptime which i track using heartbeat in HA - it was not zeroed.
Yes is always use the availability_topic for mqtt topics!
I already followed your advise to disable the event log since yesterday.
Reconnect of MQTT: The first time i access the telnet i saw the mqtt error messages. In the status (web) is saw the last send out mqtt messages, which are definitly old.
In HA i do not saw any reconnect. HA displays offline.

The bad thing is, i restarted the esp by accident (just 5 min before: i set up an automation to restart the esp when the status topic shows offline (test)).

Could you please try to decrease the signal strength of your wifi using metal or placing the device far away from the access point? I thing the problem arises when the wifi strength is critical around 30%.
I do not have any idea why the strength decreases spontanious. The access point (avm), which is mounted in the basement in a nother room, do not show any wifi changes or disconnects.

I de-activate the restart automation and wait until the next outage come. Then i try to get more details out of it.
One question: Where i get details like you posted before like [system] ? I only saw these when i opened the telnet first time.

The second test will be deactivate mqtt for two days...
The third test will then be move the access point.

@proddy
Copy link
Collaborator

proddy commented Oct 23, 2019

type system in Telnet

I will start logging the heartbeat to my HA and use Gafana to see if there is any loss in wifi strength

@stbuerger
Copy link

When I first updated 1.9.0 => 1.9.2 I had similar looking symptoms and a completely unstable system (see #218). Since the I downgraded to 1.9.1 and after 24h of stable work updated to 1.9.2 again. Since now +24h system is stable - no MQTT connection loss and parallel Web & telnet access is fine.
What I noticed:

  • After all FW updates (and maybe all power off/on) the Web interface seems to be a little shaky at first connection. This clears after a first full LogOut -> LogIn cycle.
  • The interface also seems to encounter some issues after beeing "left alone" for a while (+30min or so) - but recovers after LogOut/LogIn - whereas a browser close/re-open doesn't help.
  • The "left-side ScrollBar" ist present in as well FireFox and also M$ Edge - Chrome on all platforms inkl. Android is OK.

@proddy
Copy link
Collaborator

proddy commented Oct 24, 2019

Thanks @stbuerger. The WiFi websocket will timeout and lose the connection if left on in the browser and I have seen some cases of a browser refresh (f5) causing the ESP to reboot. This is something I need to look into.

the left-side scrollbar I fixed yesterday (in the dev branch).

@S-Przybylski
Copy link
Author

Dear @proddy
my system was stable for 96h using 1.9.3b3! No issues, no reconnects of MQTT - everything was fine.

Today i started my PC, openend a browser and logged into the esp-homepage, then the ESP restarts spontaniously. My log file is disabled by default.
I used telnet and logged in. I got no information about the restart while i logged in. I typed "system" and then the esp restarts again. Then i logged in again and tried "system" and i saw a fatal exception:

 [SYSTEM] Last reset reason: Exception
 [SYSTEM] Last reset info: Fatal exception:0 flag:2 (EXCEPTION) epc1:0x40213640 epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000000 depc                                                                                   :0x00000000

For me it seams to be an issue with the Web interface (and telnet) on the one hand (restarts) and an issue if the esp switches over to AP mode (lost mqtt connection) on the other.

I'll try some test the next days...

I thought about the AP issue:
Does it make sense to add a value to the mqtt heatbeat that "hopefully" indicates if the ESP is switched to AP mode? Or use a predefined string value to identify possible problems (e.g. AP mode, number of lost connections, high load, low memory, etc...) like an event text ?

@S-Przybylski
Copy link
Author

Update: I just used the download of the 1.9.3 bin release and try with your compilation. Perhaps its related to my compilation...

@S-Przybylski
Copy link
Author

For sharing, here is my HA configuration (heartbeat):
@proddy: Should we put that into the wiki?

sensors.yaml:
###################################################
# EMS-ESP Statistics - heartbeat
###################################################
- platform: mqtt
  state_topic: 'ems-esp/heartbeat'
  name: 'ems_esp_wifi'
  unit_of_measurement: '%'
  value_template: '{{ value_json.rssid }}'
  qos: 1
  availability_topic: 'ems-esp/status'
  payload_available: "online"
  payload_not_available: "offline"

- platform: mqtt
  state_topic: 'ems-esp/heartbeat'
  name: 'ems_esp_load'
  unit_of_measurement: '%'
  value_template: '{{ value_json.load }}'
  qos: 1
  availability_topic: 'ems-esp/status'
  payload_available: "online"
  payload_not_available: "offline"

- platform: mqtt
  state_topic: 'ems-esp/heartbeat'
  name: 'ems_esp_freemem'
  unit_of_measurement: '%'
  value_template: '{{ value_json.freemem }}'
  qos: 1
  availability_topic: 'ems-esp/status'
  payload_available: "online"
  payload_not_available: "offline"

- platform: mqtt
  state_topic: 'ems-esp/heartbeat'
  name: 'ems_esp_uptime'
  unit_of_measurement: 's'
  value_template: '{{ value_json.uptime }}'
  qos: 1
  availability_topic: 'ems-esp/status'
  payload_available: "online"
  payload_not_available: "offline"

- platform: mqtt
  state_topic: 'ems-esp/heartbeat'
  name: 'ems_esp_version'
#  unit_of_measurement: ''
  value_template: '{{ value_json.version }}'
  qos: 1
  availability_topic: 'ems-esp/status'
  payload_available: "online"
  payload_not_available: "offline"

- platform: mqtt
  state_topic: 'ems-esp/heartbeat'
  name: 'ems_esp_mqtt_disconnects'
  value_template: '{{ value_json.MQTTdisconnects }}'
  qos: 1
  availability_topic: 'ems-esp/status'
  payload_available: "online"
  payload_not_available: "offline"

And the customization:

customize.yaml:
sensor.ems_esp_wifi:
  icon: mdi:wifi
  friendly_name: 'Wifi strength'

sensor.ems_esp_load:
  icon: mdi:chart-line
  friendly_name: 'System load'

sensor.ems_esp_freemem:
  icon: mdi:memory
  friendly_name: 'Free Memory'

sensor.ems_esp_uptime:
  icon: mdi:update
  friendly_name: 'Uptime'

sensor.ems_esp_version:
  icon: mdi:mini-sd
  friendly_name: 'Firmare Version'

sensor.ems_esp_mqtt_disconnects:
  icon: mdi:chart-line
  friendly_name: 'MQTT disconnects'

@proddy
Copy link
Collaborator

proddy commented Oct 29, 2019

Added this to the wiki here https://github.com/proddy/EMS-ESP/wiki/Home-Assistant

@S-Przybylski
Copy link
Author

S-Przybylski commented Oct 29, 2019

Dear @proddy
update: After i installed your binary release two days before i do not have any issue up to now. I will check that in the next days.
Perhaps it also have something to do with the compilation process...

@proddy
Copy link
Collaborator

proddy commented Oct 29, 2019

could be. I'm working on using TravisCI to create nightly builds so we're on the same binary. Hopefully, I'll wrap this up in the next days.

@proddy
Copy link
Collaborator

proddy commented Oct 29, 2019

did it. The latest dev builds will be available in https://github.com/proddy/EMS-ESP/releases

@S-Przybylski
Copy link
Author

update: running since 3 days without any problem using your binary.
Today i noticed that the wifi signal suddenly decreased about 20%. I will investigate on that... (I expect that the device will go to AP mode later on ...)

@proddy
Copy link
Collaborator

proddy commented Oct 31, 2019

maybe its getting too hot? can you touch the cap on the ESP to see if your finger melts?

@S-Przybylski
Copy link
Author

Today i put my finger onto the chip: it's warm but not hot.
Status: Currently 5 days online with 1.9.3 (your binary) without any issue!!!

I only see a degration of the memory over time. I assume that happenend sometimes but not always if i try telnet and web interface to check...
freemem

@proddy
Copy link
Collaborator

proddy commented Nov 2, 2019

I created a test application which is essentially the EMS-ESP without the EMS, so only WiFI, MQTT etc. This way I can stress test for days and watch for disconnects or any other unusual behavior. The available heap memory is worrying, it means there is a memory leak somewhere or the automatic garbage collector is not kicking in earlier enough. Hopefully with this test bed I can find the culprit. Thanks for reporting this.

@S-Przybylski
Copy link
Author

Update:
My ems_esp runs about 8 days without any problem (using our compilation)!
The free memory converge to 46 %. Sometimes the free memory is lower, often if the telnet console is started, later on it's freeing the memory again.
freemem

@S-Przybylski
Copy link
Author

After 9 days, the device seems to have issues to connect the first time:
Today at 10:18 the MQTT counter switched directly from 0 to 26! Interesting: The esp runs without any restart up to the time, i tried to connect to the system at 17:32 (web and telnet)! Also i saw that the Wifi was correctly set, mqtt was ok and the free memory was about 43%. Shortly before it restarts, it decreses about 10% to 33% while i opened a telnet session... but recovered to 43% just before the restart.
I am not quite sure if a try to open the web interface or the telnet command 'system' was the reason for the reboot. I assume that it was the web access.

The wifi connection is more or less stable around 52% with some short peaks: worst 42; best 58% today (like all days before without any issue!). At boot time the graph showed 48%

@proddy
Copy link
Collaborator

proddy commented Nov 5, 2019

strange. It's a shame it takes so long to reproduce!

@FrankRenp
Copy link

I've made a similar observervation. Today in the morning (after approx. 2 days, 9 hours), the number of disconnects increases, and in the afternoon (after approx. 2day, 22 hours) as well.
However, MQTT is still connected.

Wifi connection is between 38% (connection to the fritz.box) and 100% (connection to the repeater). I do not see any change, to identify, if a connection change is related to the to disconnects, but I cannot exclude that.

grafik

@proddy
Copy link
Collaborator

proddy commented Nov 7, 2019

I have a test system thats running and tracking mqtt disconnects, wifi strength and memory via Gafana. It was running for 2 days (with no interruptions) until my wife unplugged the USB charger that was powering the Wemos D1 to vacuum clean the floor! doh. It's restarted now and lets see what it reports back. The test code I'm using is in https://github.com/proddy/MyESP

@proddy
Copy link
Collaborator

proddy commented Nov 10, 2019

the changes I added to the latest 1.9.4 seem to have improved a few things. Still no disconnects on my side for the last few days.

Screen Shot 2019-11-10 at 18 36 56

There was also an upgrade of my Ubiquiti unifi network in between which is probably that glitch you see.

How are you getting on?

@FrankRenp
Copy link

Currently 27h uptime with 1.9.4b10.

@FrankRenp
Copy link

Current uptime approx 51 hours. Some Wifi-Reconnects but no mqtt disconnects - looks good.

[NTP] Internet time: 12:06:30 UTC on 11/11. Local time: 13:06:30 CET
[WIFI] Connecting to SSID: XXXXXXXXXX
[WIFI] Connected to SSID XXXXXXXXX (hostname: ems-esp, IP: 192.168.0.31)
[OTA] Listening to firmware updates on ems-esp.local:8266
[SYSTEM] Disabling serial port communication
[NTP] NTP internet time enabled via server pool.ntp.org with timezone 2
[NTP] Internet time: 12:59:25 UTC on 11/11. Local time: 13:59:25 CET
[NTP] Internet time: 13:59:25 UTC on 11/11. Local time: 14:59:25 CET
[NTP] Internet time: 14:59:25 UTC on 11/11. Local time: 15:59:25 CET
[NTP] Internet time: 15:59:25 UTC on 11/11. Local time: 16:59:25 CET
[NTP] Internet time: 16:59:25 UTC on 11/11. Local time: 17:59:25 CET
[NTP] Internet time: 17:59:25 UTC on 11/11. Local time: 18:59:25 CET
[NTP] Internet time: 18:59:25 UTC on 11/11. Local time: 19:59:25 CET
[WIFI] Connecting to SSID: XXXXXXXXX...
[WIFI] Connected to SSID XXXXXXXXXX (hostname: ems-esp, IP: 192.168.0.31)
[OTA] Listening to firmware updates on ems-esp.local:8266
[SYSTEM] Disabling serial port communication
[NTP] NTP internet time enabled via server pool.ntp.org with timezone 2

@proddy
Copy link
Collaborator

proddy commented Nov 13, 2019

I haven't seen any dropout for days. Closing. Please re-open if you still see signs of instability

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants