Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WiFi] read all lines from slow servers #1720

Closed
TD-er opened this issue Sep 8, 2018 · 18 comments
Closed

[WiFi] read all lines from slow servers #1720

TD-er opened this issue Sep 8, 2018 · 18 comments
Labels
Category: Controller Related to interaction with other platforms Category: Stabiliy Things that work, but not as long as desired Category: Wifi Related to the network connectivity Type: Bug Considered a bug

Comments

@TD-er
Copy link
Member

TD-er commented Sep 8, 2018

As suggested by @clumsy-stefan here: #1669 (comment)

The suggested fix is based on this PR in the ESP8266 Arduino examples. (PR 5113)

I would like to know if someone who is experiencing issues with slow servers, if they see this line in the logs:

"Timeout while reading input data!"

If so, please report it, so I can make the timeout setting configurable.
Later I will make the function safeReadStringUntil use a callback construction and let it operate async.
For now the function is blocking, so it may cause all kind of issues.

@TD-er TD-er added Type: Bug Considered a bug Category: Controller Related to interaction with other platforms Category: Stabiliy Things that work, but not as long as desired Category: Wifi Related to the network connectivity labels Sep 8, 2018
TD-er added a commit to TD-er/ESPEasy that referenced this issue Sep 8, 2018
@sassod
Copy link

sassod commented Sep 8, 2018 via email

@TD-er
Copy link
Member Author

TD-er commented Sep 8, 2018

@sassod Was this with the controller queue included? (PR merged yesterday, but I made a lot of test builds.)

@sassod
Copy link

sassod commented Sep 8, 2018 via email

@TD-er
Copy link
Member Author

TD-er commented Sep 8, 2018

The change introduced with yesterdays merge is meant to be used for controllers like Thingspeak.
The controllers themselves manage their own queue and only send new values when the minimal interval has passed since last send.
So you don't have to use the Message Delay value anymore. (that one blocks the whole ESP)

Only thing I haven't checked is if the result of millis() increases when in deep sleep. (this just occurred to me)

@clumsy-stefan
Copy link
Contributor

I only have "HTTP : C009 connection failed" in the logs (on firsth sight) and no TImeouts as described... but probably the same root cause?!

@TD-er
Copy link
Member Author

TD-er commented Sep 8, 2018

@clumsy-stefan Looking at the code, that error will occur when

  • client.connect returns something other than 1
  • WiFi is not connected
  • Hostname cannot be resolved (when using DNS)

As far as I can see, the timeout cannot be set, but is 5000 msec for WiFiClient.

TD-er added a commit that referenced this issue Sep 8, 2018
@clumsy-stefan
Copy link
Contributor

I've upgraded my server to a newer and faster HW/SW.... interestingly since then the units run without issues and reboots.... it really seems, that when the server is really slow or has tcp/network problems the client can get into some strange state... eg. having to many parallel open sessions or waiting too long for a result...

@clumsy-stefan
Copy link
Contributor

quick update, all units runing with latest commit from yesterday (self compiled):

nb: now I also do get the timeouts in the logs ... I only see reboots, when the server is not running/answering... other than this they seem to work fine..

Name / date / UPTIME (over 30 hours, most of them)....
sonoff_basic_06_sysinfo 2018-09-12 08:20:08 2125
sonoff_basic_08_sysinfo 2018-09-12 08:20:21 2125
sonoff_dual_01_sysinfo 2018-09-09 16:37:51 1594
sonoff_pow_01_sysinfo 2018-09-12 08:20:17 1364
sonoff_pow_02_sysinfo 2018-09-12 08:20:23 2126
wemos_mini_01_sysinfo 2018-09-12 04:24:45 1896
wemos_mini_02_sysinfo 2018-09-12 08:19:59 2131
wemos_mini_03_sysinfo 2018-09-12 08:20:14 2131
wemos_mini_04_sysinfo 2018-09-12 08:20:07 1991
wemos_mini_05_sysinfo 2018-09-11 19:40:42 1371
wemos_mini_06_sysinfo 2018-09-12 08:20:07 2130
wemos_mini_07_sysinfo 2018-09-12 08:20:19 1585
wemos_mini_08_sysinfo 2018-09-12 08:20:16 2129
wemos_mini_09_sysinfo 2018-09-12 08:20:30 2129
wemos_mini_10_sysinfo 2018-09-12 08:19:55 2128
wemos_mini_11_sysinfo 2018-09-12 08:20:14 2128
wemos_mini_12_sysinfo 2018-09-12 08:20:31 2128
wemos_mini_13_sysinfo 2018-09-12 08:19:49 2127
wemos_mini_14_sysinfo 2018-09-12 08:20:03 2127
wemos_mini_15_sysinfo 2018-09-12 08:20:15 2127
wemos_mini_16_sysinfo 2018-09-12 08:20:26 2127
wemos_pro_01_sysinfo 2018-09-12 08:20:14 2126

@clumsy-stefan
Copy link
Contributor

Another quick update... As per list below lots of nodes have >5000min uptime!! great... As I had to shutdown and work on the server it seems some of the nodes (quite randomly) rebooted while others survived without issues...

However, for my part the current version(s) seem to be really stable!! appreciate it!! Good work!!

Name / date / UPTIME
sonoff_basic_02_sysinfo 2018-09-14 05:21:41 27
sonoff_basic_03_sysinfo 2018-09-14 10:27:28 1665
sonoff_basic_06_sysinfo 2018-09-14 10:27:06 5131
sonoff_basic_08_sysinfo 2018-09-14 10:27:19 5132
sonoff_pow_01_sysinfo 2018-09-14 10:27:15 4371
sonoff_pow_02_sysinfo 2018-09-14 10:27:20 5133
wemos_mini_01_sysinfo 2018-09-14 10:26:49 4528
wemos_mini_02_sysinfo 2018-09-14 10:27:01 5137
wemos_mini_03_sysinfo 2018-09-14 10:27:29 2745
wemos_mini_04_sysinfo 2018-09-14 10:27:08 4997
wemos_mini_05_sysinfo 2018-09-14 10:26:47 3933
wemos_mini_06_sysinfo 2018-09-14 10:27:05 5136
wemos_mini_07_sysinfo 2018-09-14 10:27:11 760
wemos_mini_08_sysinfo 2018-09-14 10:27:12 5136
wemos_mini_09_sysinfo 2018-09-14 10:27:27 5135
wemos_mini_10_sysinfo 2018-09-14 10:27:01 793
wemos_mini_11_sysinfo 2018-09-14 10:27:20 5135
wemos_mini_12_sysinfo 2018-09-14 10:26:37 5133
wemos_mini_13_sysinfo 2018-09-14 10:26:57 5134
wemos_mini_14_sysinfo 2018-09-14 10:27:11 5079
wemos_mini_15_sysinfo 2018-09-14 10:27:19 5134
wemos_mini_16_sysinfo 2018-09-14 10:27:23 5134
wemos_pro_01_sysinfo 2018-09-14 10:26:57 1994

@TD-er
Copy link
Member Author

TD-er commented Sep 14, 2018

OK, then for now, I will close it

@TD-er TD-er closed this as completed Sep 14, 2018
@giig1967g
Copy link
Contributor

ok to close it, but still the fact that some units rebooted just because the server was shutdown needs further investigation, I think.

@clumsy-stefan
Copy link
Contributor

problem is, since updating my server it seems to be fast enough so the "slow response" issue does not occur anymore ;) so I'm having troubles to test it..

@giig1967g
Copy link
Contributor

but if you shutdown the server some units do reboot?

@TD-er
Copy link
Member Author

TD-er commented Sep 14, 2018

And what is the reboot reason then?
Watchdog?

@clumsy-stefan
Copy link
Contributor

@giig1967g: yes, true, but can't be sure it's because of that
@TD-er: as an example units 7 & 10: both of them:
Boot Manual reboot (2)
Reset Reason Hardware Watchdog
Free Mem 16504 (13504 - addToLog) (Unit 7)
Free Mem 13368 (5800 - LoadTaskSettings) (Unit 10)

I think it's a difference if the server doesn't answer at all or if it answer's opens the socket and is too slow or even doesn't respond at all... I guess these are different situations from the unit's pov...

it's easier to deal with if the server is not reachable at all, than having open sockets to deal with..

@giig1967g
Copy link
Contributor

I am asking this because I also was having serious issues that stopped when I swapped my router (from Mikrotik to Lynksys).
With Mikrotik I was also getting some reboots (HW watchdog) if the router rebooted or even if just the wifi on the router was set to off. With Linksys this does not happen.
So I just wanted to highlight that the fact that for me and for @clumsy-stefan the problem disappeared, is not because the root cause has been identified, but because we changed hardware.
Many other could face this problem until the root cause is found.

What do you think?

@TD-er TD-er reopened this Sep 14, 2018
@clumsy-stefan
Copy link
Contributor

had a really strange issue today: after nearly all units ran for over 6'000Min. I updated all of them with the newst commit this morning. After that they all ran fine for about 4 hours until in the afternoon I had to fix some things on the server and shut it down for about an hour... after it restartet all units were "absent". Looking at the log (via UDP syslog or webpage) they all said "Timeout while reading input data!"... even after a (software) reboot, the problem persisted.
I finally solved it by going to the controller tab editing the controller and just pressed "save"... then the units rebooted once again and all ran fine after that....
Anyone any clue what happened here?

@TD-er
Copy link
Member Author

TD-er commented Oct 27, 2019

These timeout patches are already present in the code for a while now.
Core 2.6.0 also has a lot of fixes related to proper handling of timeouts.

So I will close this one for now.
Please create a new issue and refer to this one if it is still an issue.

@TD-er TD-er closed this as completed Oct 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Category: Controller Related to interaction with other platforms Category: Stabiliy Things that work, but not as long as desired Category: Wifi Related to the network connectivity Type: Bug Considered a bug
Projects
None yet
Development

No branches or pull requests

4 participants