Wifi issues -never ending story- go back to non event based wifi? #1302

Closed
TD-er opened this issue Apr 22, 2018 · 388 comments
Labels
Category: Core related Related to the (external) core libraries Category: Stabiliy Things that work, but not as long as desired Category: Wifi Related to the network connectivity Status: Fixed Commit has been made, ready for testing Type: Discussion Open ended discussion (compared to specific question)

Comments

@TD-er
Member

TD-er commented Apr 22, 2018

As a lot of you have noticed the last few weeks, there have been lots of issues with the wifi.
This all started when I changed the way wifi operates to be event based.

  • Static IP not working
  • Boot loops (ESP32)
  • Connected but data transfer not possible (NTP cannot connect to 0.0.0.0)
  • No AP found errors
  • Loading setup page from AP mode not working (changed to core 2.4.0 for this)
  • Beacon timeout errors => no proper reconnect
  • Various other wifi related issues.

Some of these errors are core version related, and the update to core 2.4.0 introduces lots of other issues.
And then there was the problem with corrupted settings, which also occurred in this period. That wasn't related to the event-based wifi connect, but it made me chase a lot of other issues that were not really issues at all, just corrupted settings.

So at the moment the wifi state machine I wrote is overly complex due to the many fixes that were no fixes, because things were not broken.
And still there are other real issues, either caused by core 2.4.0 or still open wifi issues.

So now we have to choose:

  1. Go back to sloooowwww but stable wifi (still some issues then with MQTT when connection is lost)
  2. Invest some more time to get event based wifi just right + try to get core 2.4.1 working.
  3. Invest some more time to get event based wifi just right, but still go back to core 2.3.0
  4. Some intermediate solution to do async wifi with core 2.3.0

Core 2.3.0 does seem to give a lot less issues and leaves more free memory.
So I guess that's my preferred base.
This means that for event based wifi, there is still some issue with respect to loading the setup page when initial config is needed.

Anyway, this has to stop now and get stable again.
There are currently way too many issues at hand that are quite hard to see as separate issues.

Any other suggestion?

@TD-er TD-er added Category: Core related Related to the (external) core libraries Category: Stabiliy Things that work, but not as long as desired Category: Wifi Related to the network connectivity and removed Category: Wifi Related to the network connectivity labels Apr 22, 2018
@Budman1758

I really can't speak to this from a programming level, but from what I have been seeing, with the exception of the static IP address thing, the wifi seems to work fine when setting up a "brand new" unit. I have not seen any connection issues with "fresh" installs with the latest firmwares. Web pages load fast and the entire thing seems fast and responsive. It's when you try to upgrade that most of these issues seem to be happening. Seems like there is a corruption issue when upgrading to a newer firmware.

I also notice that a lot of user-compiled firmwares seem to be having wifi issues. Just from reading through all these issue posts I get that impression. I could be completely wrong about that though. I am not trying to say that as a fact, but just a possibility.

I can't speak to MQTT because I don't use it.

Just my 2 cents worth.....

@Grovkillen
Member

If you are leaning towards option 3 I support you fully. I'd hate to see us drop the improvements your event based WiFi has given us. Core 2_4_x might be easier to revert/go to up stream?

@DittelHome

From the perspective of a user:
I would go ahead and use the new Core 2.4.1 as soon as possible.
The users can always use older versions.

@DittelHome

Don't forget, core 2.4.x fixes some problems:
PWM flicker is history (#1156 is fixed with core 2.4.0)
Serial with large packets is also fixed...
At some point we have to make the transition to the new core. Returning to 2.3.0 only postpones the problem. In the end we have to do the work anyway. My ESPs are definitely better with 2.4.0

@Grovkillen
Member

As I see it, core 2_4_x will happen, but maybe not necessarily right now. We made a bad decision when we went ahead with the core update and the event-based wifi approach at the same time. We should have done them one after another. When we then, at the same time, had an update in the global settings, the problem got exceptionally hard to pinpoint. I strongly support the idea of going back to 2_3_0 while fixing wifi stability and the settings corruption.

After that we can hopefully release the v2.1.0 and then focus on getting core 2_4_x stable for v2.2.0

@melwinek

After clearing the settings and uploading the version from 22.04, everything is working so far. At least for now :) Only free memory is not enough, even in NORMAL. We'll see how it goes.

@giig1967g
Contributor

I have to agree with @Budman1758 and @melwinek : I also found that starting from a clean unit there are no problems at all with Wifi, static IP and settings.
The main issue is the fact that to upgrade I now need to manually clean all the units, reflash them and rebuild their configuration.

@Grovkillen
Member

Grovkillen commented Apr 23, 2018

I guess we should not forget that officially we're still in the process of going from stable R120 to stable 2.1.0, and settings will not be converted between these two releases, so you need to start from scratch anyway. What we did with the update to core 2_4_x was to create a "break point" yet again. If we can live with that then it's not a problem. I agree that a clean install is really stable (at least on NORMAL, which I test most frequently). And NORMAL is the only build which will actually be in the release; test and dev are only in the development nightly releases anyway.

@giig1967g
Contributor

What I mean is: if the current developed firmware works and is stable on a clean setup, then it means that there is nothing wrong with it. I wouldn't go back to 2.3 or to old wifi.

@Grovkillen
Member

Yes I hear you and I kinda agree. The only thing is that we create another break point which I guess is okay since it's still beta.

@ghtester

ghtester commented Apr 23, 2018

Although it is a step back that I do not like, I'm afraid that it would be really better to go back to core 2_3_0 for now as I think some strange issues may happen due to lack of free memory on 2_4_0.

@Budman1758

@giig1967g I agree with you there. I do believe there are some corruption issues going on though. That might be what's screwing up the wifi, rather than the wifi itself having a lot of inherent problems.

@TD-er
Member Author

TD-er commented Apr 23, 2018

There are still options to get memory usage to an acceptable point.
I think I can get about 3 - 4 kB more memory in like 1 evening of programming. (need to change all plugin files though)
And MQTT import is also something which is really a pain that should be resolved soon.
And the Switch plugin has too much functionality in itself, which should be split.

I will think about it today, what we should do, so please add more suggestions/arguments :)

@melwinek

@TD-er You're right with SWITCH.
Most people use only ON / OFF for switch / relay.
And in this plugin there is a servo, a dimmer and probably something else.
That could be separate.

@TD-er
Member Author

TD-er commented Apr 23, 2018

It is also handling stuff very specific to MQTT and/or Domoticz. That should not be part of the plugin.

@melwinek

@TD-er In many cases it would help me to compile myself after removing unnecessary plugins; often I need only SWITCH, FHEM Controller and DHT.
But after these adventures with settings I'm afraid to compile myself. Especially after your post: #1292

@M0ebiu5
Contributor

M0ebiu5 commented Apr 23, 2018

Did you take a look at how wifi is implemented in other projects (e.g. tasmota)?

About memory: I told you 😄
I think there are way too many rarely used features in the core. The decision whether a core feature request gets implemented should be way more strict. Right now it's a little like Christmas for everyone...
Maybe a vote with a certain threshold would help.

If possible, the core should first be cleaned of those rarely used features (or have them transformed into plugins) and then optimized. Also, one could think about additional interfaces for plugins, to allow moving more functionality out of the core.

@TD-er
Member Author

TD-er commented Apr 23, 2018

@M0ebiu5 Agree.
What should happen is that new features will be developed on a separate branch, then collect a few of them and merge those to a release candidate branch and test those.
Then release and merge the used features to the master branch (or dev branch, or whatever you name it).

And one thing I learned is to ask twice about what is observed, what should be observed and what version is used. That will make things a lot clearer and lead to fewer mistakes.
Part of that must be done in the code itself to make some kind of footprint to be able to see (and log) what software is used.

Also plugins should be just plugins to interface a sensor to some output values.
Maybe plugins that generate output (like displays) should not be used the same as the input ones.
So we get something like:

  • Sensor to read a device and generate measurements
  • Output (display?) to present values. This can also be something other than a display, for example JSON or an image.
  • Controller to interface to the outside world (input and output)
  • Rules to process data and events.
  • Notifications to convert events to something else. Actually it sounds like a more elaborate controller.
  • Commands to do some basic setup or temporary (non-persistent) changes/updates and perform some actions (e.g. reboot)
  • Web page to configure them all.

But such a redesign will take quite some effort.
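
The split of roles in the list above could be sketched roughly as follows. This is only an illustration in Python, not ESPEasy code: the class names `Sensor`, `Output` and `Controller` and the `run_cycle` helper are hypothetical and merely mirror the first three items of the list.

```python
from abc import ABC, abstractmethod


class Sensor(ABC):
    """Reads a device and generates measurements."""

    @abstractmethod
    def read(self) -> dict:
        ...


class Output(ABC):
    """Presents values, e.g. on a display or as JSON."""

    @abstractmethod
    def present(self, values: dict) -> None:
        ...


class Controller(ABC):
    """Interfaces to the outside world."""

    @abstractmethod
    def send(self, values: dict) -> None:
        ...


def run_cycle(sensors, outputs, controllers):
    """Collect one round of measurements and hand them to every consumer."""
    values = {}
    for sensor in sensors:
        values.update(sensor.read())
    for output in outputs:
        output.present(values)
    for controller in controllers:
        controller.send(values)
    return values
```

The point of such a split would be that a sensor never needs to know about MQTT, Domoticz or a display; it only returns values, and the other roles decide what to do with them.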

@M0ebiu5
Contributor

M0ebiu5 commented Apr 23, 2018

@TD-er you are right, but I would make the changes in small steps, because most parts are working stable and big changes could put this stability at risk.

New interfaces to the core are one possible way. They will not influence the current behavior and only new or heavily changed plugins will use them. It will take more time to transform to a clean architecture, but with a lower risk and the effort will also be spread over time.

@TD-er
Member Author

TD-er commented Apr 23, 2018

I agree that these changes should be done at ease.
It is more a view on the redesign for the future.

@melwinek

However, the node running the 22.04 version has lost its connection.
Resetting the router does not help.
An ESP reset would help, but I'm far away.
So, the best version on my nodes is mega-20180410.
Maybe because it's on core 2.3?
Maybe, however, a good solution would be to go back to 2.3 for some time?

@TD-er
Member Author

TD-er commented Apr 24, 2018

Nope, last night I saw the problem (in the code and happening at my own units).
My nodes did not reconnect when they got a 'beacon timeout' error, which is quite a common reason to disconnect. It is a logic error in the code, but it was already past 1:30 am and I didn't want to fix it at that moment. It would certainly have been past nightly build time to fix it, so that didn't matter anymore ;)
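
The comment above does not show the faulty code, so the following is only a guess at the general shape of a fix: a disconnect handler that schedules a reconnect for every disconnect reason instead of for selected ones. It is a simplified Python model, not the ESPEasy C++ code, and `WifiStateMachine` with its methods is hypothetical. Reason codes 201 and 202 appear later in this thread; 200 is the ESP8266 SDK value for a beacon timeout.

```python
from enum import Enum, auto

# Disconnect reason codes as reported by the ESP8266 SDK.
REASON_BEACON_TIMEOUT = 200
REASON_NO_AP_FOUND = 201
REASON_AUTH_FAIL = 202


class WifiState(Enum):
    DISCONNECTED = auto()
    CONNECTING = auto()
    CONNECTED = auto()


class WifiStateMachine:
    """Hypothetical, simplified model of an event-based wifi handler."""

    def __init__(self):
        self.state = WifiState.DISCONNECTED
        self.reconnects = 0
        self.last_disconnect_reason = None

    def on_connected(self):
        self.state = WifiState.CONNECTED

    def on_disconnected(self, reason):
        # Keep the reason for logging, but never let it decide whether a
        # reconnect is scheduled: every disconnect leads back to CONNECTING.
        self.last_disconnect_reason = reason
        self.state = WifiState.CONNECTING
        self.reconnects += 1
```

With this shape, a beacon timeout is treated exactly like "No AP found" or "Auth fail": the reason only ends up in the log.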

@s0170071
Contributor

related: #1064

@TD-er TD-er added the Type: Discussion Open ended discussion (compared to specific question) label Apr 25, 2018
@micropet

I just flashed 6 devices with the current version ESP_Easy_mega-20180425_test_ESP8266_4096.bin.

I think with this version we have reached an absolute low point.
None of the devices could be reached on the network after a few hours.

@TD-er
Member Author

TD-er commented May 31, 2018

Very strange indeed, since I wonder how you can see the web page at all with such IP config.
Or do you connect to the ESP via its accesspoint function?

@clumsy-stefan
Contributor

no, connected via network directly. ping and http work without issues, speedy (see the client IP is 10.0.0.10 which is my laptop, internal net is 10.0.0.0/16)... yes, quite strange though..
could be DHCP related or so.. after a reboot everything is fine again..

could it be related to the fact, that I only enabled one controller (FHEM) and no MQTT enabled controller? I saw a lot of mqtt code in the sources, outside of the controller plugin... just a guess... but I think it's not really related...

@uzi18
Contributor

uzi18 commented Jun 3, 2018

Maybe the DHCP lease expired and was not renewed?

@clumsy-stefan
Contributor

could very well be... probably the network stack still has an active IP but if the renew fails the corresponding config gets zeroed... not sure how the DHCP code works though, but it could explain the state I'm seeing.

also I found that when the server (in my case fhem) does not respond fast enough a number of times, the units start to reboot after some time. could be a problem in the plugin code or the underlying tcp stack... I did some performance tweaking on the server, since then the units run much more stable (some have uptimes over 48h now)..

@TD-er
Member Author

TD-er commented Jun 4, 2018

I've already seen 10+ days of connection-uptimes (uptime without any reconnect) with the latest versions.
It is possible to get the units to fill up their ram with lots of requests.
And I've seen the LWIP doing strange things when doing lots of requests. (reading from memory not containing data related to that request.)

@melwinek

melwinek commented Jun 9, 2018

Today one node stopped responding again. It disconnected from wifi, and I could not connect to the "esp" network. It stopped sending data to the controller, so I had to reboot it. Maybe a watchdog would be a good solution: if, for example, the node is disconnected from wifi for an hour, it reboots. Or maybe it can be done with rules, but I do not know how :)
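
The "reboot after an hour without wifi" idea can be modelled as a small timer that is reset on every successful connection check. This is a sketch of the logic in Python only, not firmware code and not an ESPEasy rule; `DisconnectWatchdog` and its `update` method are hypothetical names, and the one hour timeout is the example value from the comment above. It is a software timer and is unrelated to the hardware watchdog of the ESP.

```python
class DisconnectWatchdog:
    """Signals a reboot once wifi has been down for timeout_s seconds."""

    def __init__(self, timeout_s=3600):
        self.timeout_s = timeout_s
        self.disconnected_since = None  # None while connected

    def update(self, now_s, connected):
        """Feed the current time and link state; return True to reboot."""
        if connected:
            # Any successful check clears the running timeout.
            self.disconnected_since = None
            return False
        if self.disconnected_since is None:
            # First check that sees the link down: start counting here.
            self.disconnected_since = now_s
        return now_s - self.disconnected_since >= self.timeout_s
```

Calling `update()` periodically, for example once a minute, is enough; a short dropout that recovers before the timeout never triggers a reboot.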

@TD-er
Member Author

TD-er commented Jun 9, 2018

Today I experienced a lot of Watchdog actions while debugging a plugin.
And I know now that sometimes when the watchdog intervenes, a node can remain halted.
So a watchdog is not the perfect solution.

Is it possible that the hanging node of yours was never rebooted after flashing? (press reset or power cycle)

@melwinek

It is possible that there was no reboot after flashing. But it was flashed via the web interface, not via serial.

@TD-er
Member Author

TD-er commented Jun 10, 2018

OK, then it shouldn't matter, if you flashed OTA.
As long as there has been a proper reset/reboot after the serial flash.

@ghtester

ghtester commented Jul 4, 2018

Well, after struggling with many stability issues and strange wifi troubles with the latest firmware releases, I had to go back to earlier versions in the end. For instance, until a power outage happened recently, one old ESP12E node with mega-20180311dev had been working for 70 days, sending temperature data to ThingSpeak.
On another node, after upgrading to mega-20180522dev, I was experiencing a reboot due to an exception about every 24 hours despite a reset to defaults, just running without any device configured, no NTP configured, no controller... It never survived 48 hours. After downgrading to mega-20180324 two and a half days ago, I kept the config, just enabled NTP again, and so far it's running. Although there are some bugs and missing features in these older versions, for me it's currently the best choice.

@s0170071
Contributor

s0170071 commented Jul 4, 2018

There is not much anyone can do if the issues are not reliably reproducible.
What helps a bit is scheduling a reboot every night. You can use the rules for that.

@ghtester

ghtester commented Jul 4, 2018

I know, but I prefer a stable node without scheduled reboots. I don't know if the stability decreased significantly due to switching to core 2.4.1 (which is maybe not mature enough yet) or if it's related to the ESP Easy redesign, but it happened despite the maximal effort of all ESP Easy contributors. I really appreciate the hard work of all of you, but currently I can't use the latest ESP Easy releases anymore.

@TD-er
Member Author

TD-er commented Jul 4, 2018

I think it is also related to the used plugin or maybe combination of plugins.

Last week I worked on looking into the effects of timings and I am sure it will have significant effect on time-critical tasks.

I just looked at some of my nodes, all running official builds:

Binary filename ESP_Easy_mega-20180513_normal_ESP8266_4096.bin

Unit 3

  • Uptime 16 days 18 hours 26 minutes
  • Connected 5d23h07m
  • Last Disconnect Reason (201) No AP found
  • Number reconnects 2

Unit 5

  • Uptime 11 days 5 hours 22 minutes
  • Connected 5d22h57m
  • Last Disconnect Reason (201) No AP found
  • Number reconnects 4

Unit 6

  • Uptime 11 days 5 hours 23 minutes
  • Connected 45 m 1 s
  • Last Disconnect Reason (201) No AP found
  • Number reconnects 58

Binary filename ESP_Easy_mega-20180619_test_ESP8266_4096.bin

Unit 7

  • Uptime 13 days 20 hours 51 minutes
  • Connected 5d23h05m
  • Last Disconnect Reason (202) Auth fail
  • Number reconnects 2

About 6 days ago, I had some issues with one of my WiFi accesspoints, which I had to restart.

Unit 6 is connected to the same access point as units 3 & 7, but it has a lot more reconnects.
Those 3 units are right next to each other, within a meter of each other, to compare different CO2 sensors (MH-Z19 A, B and SenseAir S8), and all are powered by the same power supply (IKEA 3-port USB charger).

The only difference between them is that the one with more reconnects has the Senseair sensor.
So it could be the implementation of that sensor does put more strain on the WiFi routine (less delay calls), which could lead to WiFi instability.

Could you give a list of plugins used?
Also, I made a pull request yesterday which logs a lot of timing statistics. Maybe you could make a build based on that one and run it for a few minutes to get some idea of which plugins use way too much time.
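
What such timing statistics might look like can be sketched as a per-plugin collector of call durations. The pull request itself is not shown in this thread, so this Python model is an assumption about the general idea rather than a description of it; `TimingStats` and its methods are hypothetical names.

```python
import time


class TimingStats:
    """Collects call durations per plugin name (hypothetical helper)."""

    def __init__(self):
        self._durations = {}

    def record(self, name, seconds):
        self._durations.setdefault(name, []).append(seconds)

    def timed(self, name, func, *args):
        """Run func(*args), record how long it took, return its result."""
        start = time.perf_counter()
        try:
            return func(*args)
        finally:
            self.record(name, time.perf_counter() - start)

    def summary(self, name):
        """Return (count, min, max, average) for one plugin."""
        d = self._durations[name]
        return len(d), min(d), max(d), sum(d) / len(d)

    def slowest(self):
        """Name of the plugin with the highest average duration."""
        return max(
            self._durations,
            key=lambda n: sum(self._durations[n]) / len(self._durations[n]),
        )
```

The maximum is often more telling than the average here: a single long blocking call can be enough to starve the wifi handling, even if the plugin is fast most of the time.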

@ghtester

ghtester commented Jul 4, 2018

I am always using the official builds as I am not able to prepare and maintain the development environment for these devices.
The mentioned release mega-20180522dev, as I described above, had a completely empty configuration, so absolutely no plugins used, no rules, and I even deleted the default Controller Nr1 in the end. Nothing could stop the node from rebooting due to an exception at intervals of about 24 - 40 hours.

@uzi18
Contributor

uzi18 commented Jul 4, 2018

Don't know if it is a wifi issue, it doesn't look like it. I have managed to set static IP addresses for wifi, but ESPEasy still fetches an address by DHCP and sets a different one.

1104 : WD   : Uptime 0 ConnectFailures 0 FreeMem 21800
1105 : SW   : State 1.00
1106 : EVENT: x#w=1.00

scandone

state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 2
cnt

connected with BJ3, channel 12
dhcp client start...
4350 : WIFI : Connected! AP: BJ3 (E8:DE:27:4F:66:86) Ch: 12 Duration: 3760 ms
4351 : EVENT: WiFi#ChangedAccesspoint
4355 : IP   : Static IP : 192.168.2.184 GW: 192.168.2.1 SN: 192.168.2.0 DNS: 8.8.8.8
4360 : WIFI : Static IP: 0.0.0.0 (ESP5-5) GW: 0.0.0.0 SN: 0.0.0.0   duration: 11 ms
4367 : EVENT: WiFi#Connected
4374 : Webserver: start
4374 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_SERVICES_INITIALIZED
ip:192.168.2.123,mask:255.255.255.0,gw:192.168.2.1
4400 : WIFI : Static IP: 192.168.2.123 (ESP5-5) GW: 192.168.2.1 SN: 255.255.255.0   duration: 50 ms
4401 : EVENT: WiFi#Connected
4406 : WIFI  : Arduino wifi status: WL_CONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_SERVICES_INITIALIZED
4500 : MQTT : Intentional reconnect
4501 : LoadFromFile: config.dat index: 28672 datasize: 724


@TD-er
Member Author

TD-er commented Jul 4, 2018

@uzi18 Have you set all fields for static IP config?

If so, then I am afraid it is a known issue (to me), where there is some previous session stored in a region where we don't (yet) erase data at a factory reset.
This means at this moment there is no other way than to wipe all of the flash and start again with a recent version of ESPeasy.
The later versions do set a value to not make the wifi settings persistent.

@uzi18
Contributor

uzi18 commented Jul 4, 2018

@TD-er Yes, all fields are filled in, as you can see in the log.
I have flashed a NEW module with
INFO : Plugins: 71 [Normal] [Testing] (ESP82xx Core 2_4_1, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3)
and it behaves like that.
The module was taken straight from its original bag and flashed; ESPEasy was never on it before.

@s0170071
Contributor

s0170071 commented Jul 5, 2018

@TD-er: Two thoughts on that:

  1. My non-ESPEasy heating-to-MQTT broadcaster is working flawlessly using the latest pubsub client. It reconnects and so on. Maybe the ESPEasy connectivity surveillance is interfering with what's already being done by the core library. Should we have a way to disable all that additional ping-reconnect-wifistate stuff? Just for testing?
  2. Are you aware of the wifi auto power down ? https://blog.creations.de/?p=149

TD-er added a commit to TD-er/ESPEasy that referenced this issue Jul 11, 2018
As being described a few times and a screenshot shown here: letscontrolit#1302 (comment)
It looks like a DHCP request may fail resulting in a cleared IP setup. The web server then still replies to requests, but no new connections can be made then.

This patch should detect such a situation and then reset the wifi and make a new connection.
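
The situation described in this commit message (link still up, IP setup cleared) can be expressed as a simple check. The sketch below is a Python model under stated assumptions, not the code of the patch: `ip_config_lost`, `check_wifi` and the `reset_wifi` callback are hypothetical names, and treating a cleared gateway the same as a cleared IP is an assumption as well.

```python
import ipaddress


def ip_config_lost(link_connected, ip, gateway):
    """Return True when the link is up but the IP setup has been cleared."""
    if not link_connected:
        # A regular disconnect is handled by the normal reconnect logic.
        return False
    # 0.0.0.0 is the "unspecified" address, as seen in the logs above.
    return (
        ipaddress.IPv4Address(ip).is_unspecified
        or ipaddress.IPv4Address(gateway).is_unspecified
    )


def check_wifi(link_connected, ip, gateway, reset_wifi):
    """Call reset_wifi() when a cleared IP setup is detected."""
    if ip_config_lost(link_connected, ip, gateway):
        reset_wifi()
        return True
    return False
```

Run periodically, a check like this would turn the silent "connected but 0.0.0.0" state into an ordinary reconnect.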
@TD-er
Member Author

TD-er commented Jul 11, 2018

Would be nice if someone with rather unstable wifi could test this PR: #1562

@s0170071
Contributor

@TD-er just stumbled upon esp8266/Arduino#4718
It deals with lwip reconnect issues and has been fixed in the meantime. Maybe you want to skim through it...

@clumsy-stefan
Contributor

I always use the latest GIT Version from the esp8266.. that's why I probably don't see the 0.0.0.0 issue anymore...

@melwinek

melwinek commented Jul 18, 2018

Yesterday one node again lost contact with the network after a restart of the router.
Rules worked correctly.
This is my own wall switch, and it is difficult to disassemble it for a reset.

To facilitate this in the future, I modified the rules:

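on S1#Switch do // S1 changed: start the 5 s reboot timer and toggle relay 1
  timerSet,1,5
  if [R1#Relay]=1
    gpio,12,0
  else
    gpio,12,1
  endif
endon

on S2#Switch do // S2 changed: toggle relay 2
  if [R2#Relay]=1
    gpio,13,0
  else
    gpio,13,1
  endif
endon

on Rules#Timer=1 do // 5 s after the last S1 change: reboot if S1 still reads 1
  if [S1#Switch]=1.00
    reboot
  endif
endon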

Now life will be simpler :))

@s0170071
Contributor

s0170071 commented Jul 18, 2018

I reboot every 24h. About twice a week this revives one node that runs a firmware from 4 weeks ago.
Maybe this should be a permanent feature....

@Grovkillen
Member

Is this still an issue? If so please reopen.

@TD-er
Member Author

TD-er commented Oct 25, 2018

Our longest thread on the issue list....
