Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WiFi] Reduce wifi reset calls and call full wifi init at reset (#3147) #3148

Merged
merged 23 commits into from
Jul 20, 2020

Conversation

TD-er
Copy link
Member

@TD-er TD-er commented Jul 4, 2020

Fixes: #2931
Fixes: #1258
Fixes: #3155
Fixes: #3141
Fixes: #3111

Connected state is now based on the state reported by the internal WiFi object.
Frequent calls to reset() of WiFi are now rate limited to make sure we don't end up in a loop.

@TD-er
Copy link
Member Author

TD-er commented Jul 4, 2020

@TD-er TD-er mentioned this pull request Jul 4, 2020
@v-a-d-e-r
Copy link

v-a-d-e-r commented Jul 5, 2020

I get often this now with this changes (merged code, not the bin files above) and OTA update:
Update successful: 691952 bytes
Rebooting...
WIFI : WiFiConnected() out of sync
WIFI : Resetting...
Exception (28):
epc1=0x402773a1 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000039 depc=0x00000000

stack>>>

@TD-er
Copy link
Member Author

TD-er commented Jul 5, 2020

I'm also testing this and there is some logic error which I'm hunting down as we speak.
Thanks for letting me know it isn't just my node.

P.S. to get the node to enter some endless loop, try to disconnect the ESP from the access point side (Mikrotik allows you to kick of a connected client)

@v-a-d-e-r
Copy link

@TD-er: Something OT: Have you checked the pubSubClient v2.8? Sending works, receiving not... :-/

@TD-er
Copy link
Member Author

TD-er commented Jul 5, 2020

Nope, have not looked at the PubSubClient main repo for a while.

@v-a-d-e-r
Copy link

v-a-d-e-r commented Jul 5, 2020

I downloaded it here (right version?):
https://github.com/knolleary/pubsubclient/releases/tag/v2.8
Did you made some changes in the prev versions to work with ESPeasy?

@TD-er
Copy link
Member Author

TD-er commented Jul 5, 2020

Did you made some changes in the prev versions to work with ESPeasy?

Yep.
Made quite a few changes, that's why Tasmota is also using the "ESPeasy version" of PubSubClient :)

@v-a-d-e-r
Copy link

Ah, ok. So I can wait for a long time to get it to work in the orig condition... LOL

@uzi18
Copy link
Contributor

uzi18 commented Jul 6, 2020

Have a discussion with you some time ago in different issue, to prevent fast switching of wifi state - wifi logic should work more lazy (2nd step) and wifistate/wificonnected should be not a function but int/bool to check (first step), this will reduce wifi overhead.
But maybe it is only my point of view.

@TD-er
Copy link
Member Author

TD-er commented Jul 6, 2020

Not sure if I understand you correctly @uzi18

What I'm now trying to do (not yet committed) is to consider the wifiStatus a proper bitfield.

What I'm seeing though is that this wifiStatus can get out of sync with the true state.
Also it really does happen (quite often) the WiFi reports to have an IP address set (via DHCP for example) and claims not to be connected.
It looks like I'm now narrowing in on the true cause of the crashes we sometimes see and which is very well reproducible when enabling syslog.
So far, it seems sending UDP packets while the WiFi is in this unknown state is causing these crashes.
As if it is queuing the packets and still processing this queue after the queue itself has been destructed.

@TD-er
Copy link
Member Author

TD-er commented Jul 6, 2020

wifiStatus &= ~ESPEASY_WIFI_GOT_IP;
wifiStatus &= ~ESPEASY_WIFI_SERVICES_INITIALIZED;
Copy link
Contributor

@uzi18 uzi18 Jul 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TD-er use bitRead/bitWrite macros, do you use here inversed logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's perhaps indeed a bit more readable :)

@TD-er
Copy link
Member Author

TD-er commented Jul 7, 2020

OK, there is really something wrong in the core library.
The work-around I'm working on is in short something like this:

station_status_t status = wifi_station_get_connect_status();
if (status == STATION_CONNECTING && timeOutReached(last_wifi_connect_attempt_moment + 15000)) {
      resetWiFi();
}

The strange part is, this STATION_CONNECTING is not being checked for in the wl_status_t ESP8266WiFiSTAClass::status() function.

But it is actually happening on only some modules when trying to reconnect, or on the first connection.
Exactly as has been reported for ages but I was unable to reproduce it until a module I just yesterday opened from its seal.

@v-a-d-e-r
Copy link

After merging this:
"[WiFi] Add timeout to get out of incorrect WiFi.connected() state"
the shown Vcc value is wrong (normal 3V, now 1.58V)? What has the WiFi stuff to do with Vcc? LC droped also a lot...
Can you check this please?

@TD-er
Copy link
Member Author

TD-er commented Jul 7, 2020

Is Vcc only dropped on the (first?) readings while it was still trying to connect?

If so, then it can be explained by the fact the Vcc is not read during the WiFi connect, as reading it may affect WiFi performance and vice verse.

I told you before the loop count is not a really useful measure for detecting the CPU load.
The loop count is determined by the scheduled jobs in the scheduler.
If nothing has to be done, the loop count can become quite low.
If you have the "eco" mode enabled, its count can even become lower.

@TD-er TD-er force-pushed the bugfix/wifi_reset branch from 2ae7cfb to b32c185 Compare July 7, 2020 20:46
@v-a-d-e-r
Copy link

v-a-d-e-r commented Jul 7, 2020

Now I merged this:
[WiFi] Convert bit-wise operations to use bitSet/bitClear/bitRead
and it looks a bit better. And the Vcc is also correct! Strange thing....
Next test is with core 2.7.2, that was not good at this morning. Many MQTT disconnects and slow WiFi.

@TD-er
Copy link
Member Author

TD-er commented Jul 7, 2020

Usually I wait a few days before trying the new core versions, especially if they include a new toolchain..

But if you build the 'stage' build, you would actually have the same code as now on the master branch of esp8266/Arduino.

@v-a-d-e-r
Copy link

I have flashed a new FW this minute with your changes and core 2.7.2 and let it run over night now. Will see what the log says tomorrow.... ;-)

@TD-er
Copy link
Member Author

TD-er commented Jul 7, 2020

Well I do have some test code pending with more logging.
But the more I look into it, it looks like the only way to know for sure what's happening is not to look at WiFi.status() or WiFi.isConnected() as they really sometimes make no sense.
We really should only look at our own internal state which is being handled via events.

@v-a-d-e-r
Copy link

Wow, after 9 hours no crash (reboot), no lost WiFi! My config: ESPeasy custom, core 2.7.2, controllers 2 + 13 + 16 and 1 plugin (P001). But this is with my test node. Will now see what my "real" nodes do after flashing this FW...

@TD-er
Copy link
Member Author

TD-er commented Jul 8, 2020

Yep a lot of stability related fixes lately
But this WiFi stuff keeps bogging me

@v-a-d-e-r
Copy link

Ok, this was the last of my 18 nodes I flashed (all are different). This is a high risk to do this on daily needed nodes, but the only real test with usable results. But for now it looks not so bad.... :-)

@TD-er
Copy link
Member Author

TD-er commented Jul 8, 2020

OK, I have to make a decision now, how to tackle the WiFi issues.

Option 1:
Try to detect when WiFi status is out of sync (using events, timeouts and failed connect attempts)
Reset WiFi + try to reconnect until WiFi status is correct.

Option 2:
Ignore WiFi status and solely base WiFi true state on events.

What I observe here in my tests, is that I do not seem to miss any events.
So option 2 does seem to be the preferred way as it will result in much quicker response to WiFi state changes and thus less likely to crash.
For example sending UDP messages with incorrect WiFi state is likely to cause crashes.
Also the WiFi status may return WL_DISCONNECTED, even though it is actually connected and thus ready to be used. => improve of response time.

But some internals of libraries or the core may still rely on the WiFi.status() result, so we may run into issues which may be hard to reproduce or locate.

So what's the better option to do here?

@v-a-d-e-r
Copy link

Is option 2 not a bit risky to ignore the WiFi status? BTW: At this time all my nodes seem to run stable with your last changes.

@TD-er
Copy link
Member Author

TD-er commented Jul 8, 2020

Well if the wifi status is incorrect, what should you do?
And the status can become incorrect during WiFi connect or when the node gets disconnected.

The tricky part here is that some nodes appear to have a higher chance of seeing this out of sync behavior compared to others.
I now have a node here on my desk which is very good at reproducing this behavior while my standard test nodes do connect and disconnect without issues.

@v-a-d-e-r
Copy link

Hard to say. I have here 16 Sonoff nodes and 2 ESP-12F in use. All running without issues now. But I had not really strange WiFi issues in the past. Maybe it depends on the good router (FB7490)!?

@TD-er
Copy link
Member Author

TD-er commented Jul 8, 2020

I used to think that too, but since I now have one node that has significant more issues connecting to WiFi, compared to the other nodes, I don't think it is solely the router that makes the difference.

After all, it has been reported by users for ages, but I simply was unable to reproduce it.

@v-a-d-e-r
Copy link

Or it depends on the stuff that is compiled within the firmware!?

@v-a-d-e-r
Copy link

Ok, the 6 nodes I flashed yesterday survived the night without any issues... :-)

@TD-er
Copy link
Member Author

TD-er commented Jul 18, 2020

Maybe you can also let the node force to reconnect to WiFi and/or kick a node from the access point (if possible, or restart the AP)

@v-a-d-e-r
Copy link

Ooops, CPU load is rising from 10 to 80% after I disabled WLAN at the router. After connection is done it goes back to 10

@TD-er
Copy link
Member Author

TD-er commented Jul 18, 2020

CPU load (as it is computed) will indeed seem to go up.
But there's a simple explanation for it.
During WiFi connecting phase, there are some calls to delay(10) to give the WiFi part of the code ample amount of time to process all rather time critical steps.
And since this limits the number of loops that can be run, it will appear the load spikes.

The only way to make this look less (remember the load computation is based on the time spent in loop calls that don't have a scheduled job to run) is to schedule the wifi connect calls too.
Right now, it is still a fixed delay time, but it could be made a bit more dynamic by actually schedule the wifi maintenance.

@v-a-d-e-r
Copy link

v-a-d-e-r commented Jul 18, 2020

When no sensor data are lost with that high load it doesn't care to me. ;-)

@TD-er
Copy link
Member Author

TD-er commented Jul 18, 2020

When no sensor data are lost with that high load it doesn't care to me. ;-)

It is only lost if the controller buffer is full and it cannot offload it to the other end of the controller because there is no network connection.
If the load gets to 100% (or very close), then there may be issues where the node cannot perform its scheduled jobs or react in due time.

@v-a-d-e-r
Copy link

Under normal circumstances WLAN should be up and the nodes are connected. Most important is, that they do not crash and reboot with a lost WiFi...

@TD-er
Copy link
Member Author

TD-er commented Jul 18, 2020

Just added some patches mainly for ESP32 and if you're logging to syslog.

But I also looked into the 'lockup' you mentioned @v-a-d-e-r
Maybe it is something related to the serial port chip on the unit.

On boards with the CP210x chips when I had VS code to read the serial log and close VS code, the node hangs.
There should be no interaction with this serial to USB chip, but apparently there is.
These boards always have had the strange behavior that when you connect to them to read the logs they would trigger a reset.
Maybe there's something related here which triggers the reset too short so it will hang?

@v-a-d-e-r
Copy link

v-a-d-e-r commented Jul 18, 2020

I have no ESP32 in use. It is related to all nodes. Only 2 of 18 use serial port.

@TD-er
Copy link
Member Author

TD-er commented Jul 18, 2020

I have no ESP32 in use. It is related to all nodes. Only 2 of 18 use serial port.

The USB to serial thingy I've only seen on the ESP8266 boards.
ESP32 doesn't seem to suffer from this
The latest fixes are mainly ESP32 related, so if you don't have one of those, I doubt you will see a lot of changes with the last commits.

@TD-er
Copy link
Member Author

TD-er commented Jul 19, 2020

The USB chip is present on almost all boards.
What I meant was the "device hangs when closing IDE while serial terminal present" issue.
That behavior I've only seen on ESP8266, not on ESP32.

By the way, what an exorbitant high price for such a module.

@v-a-d-e-r
Copy link

That's right. I took the first hit I found, because I needed only the link ;-)

TD-er added 9 commits July 19, 2020 14:35
No need to reboot after changing UDP port setting.
The max. UDP packet size could allow for a large memory allocation when a large packet was received.
No matter if this was a valid packet to process anyway.
This could lead to extra memory usage until the next (valid) NTP packet was processed.
Also the UDP port could be held.
Fixes: letscontrolit#3155

For the ESP8266, we keep things like the last boot cause in RTC memory.
Currently that's not supported on ESP32 (in our code) so the last boot cause is never read and stored.
@TD-er TD-er merged commit 6652399 into letscontrolit:mega Jul 20, 2020
@TD-er TD-er deleted the bugfix/wifi_reset branch July 20, 2020 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment