Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serial] Redirect all Serial.print/println to a buffer #2089

Merged
merged 5 commits into from
Nov 27, 2018

Conversation

TD-er
Copy link
Member

@TD-er TD-er commented Nov 25, 2018

Writing data to the serial port while it is not being read may cause all kinds of issues and maybe even Watchdog Resets.
All Serial.print and Serial.println calls in our own code is now sent to a dynamic buffer.
This buffer will try to only send data to the serial port when its buffer permits it.
After a while, unread data will be dropped.

@TD-er
Copy link
Member Author

TD-er commented Nov 25, 2018

This is meant to be used later on to make the rules be executed directly again, not delayed execution by the scheduler.

Would be nice if some already can test this part, to make sure I didn't break any plugins actually using the serial port.

@clumsy-stefan
Copy link
Contributor

is it possible, that with this PR some logs on the serial port get lost?

for instance I don't get any events andymore (eg. Clock#Time) only the first one after successfull NTP sync.
Afrer this PR:

13883 : WIFI : Connected! AP: clumsy_ap2 (E4:8D:8C:53:1D:23) Ch: 7 Duration: 3759 ms
13884 : EVENT: WiFi#ChangedAccesspoint
13934 : WIFI : DHCP IP: 10.0.10.117 (wemos-mini-17-17) GW: 10.0.0.2 SN: 255.255.0.0   duration: 24 ms
13948 : EVENT: WiFi#Connected
13992 : Webserver: start
14367 : NTP  : NTP replied: delay 10 mSec Accuracy increased by 0.887 seconds
14369 : Current Time Zone:  DST time start: 2018-03-25 02:00:00 offset: 120 minSTD time start: 2018-10-28 03:00:00 offset: 60 min
14372 : EVENT: Time#Initialized
14405 : EVENT: Clock#Time=Sun,19:14
41912 : WD   : Uptime 1 ConnectFailures 0 FreeMem 18632
71357 : EVENT: Rules#Timer=2
71912 : WD   : Uptime 1 ConnectFailures 0 FreeMem 18632
131912 : WD   : Uptime 2 ConnectFailures 0 FreeMem 18632
191912 : WD   : Uptime 3 ConnectFailures 0 FreeMem 18632
251912 : WD   : Uptime 4 ConnectFailures 0 FreeMem 18632
311912 : WD   : Uptime 5 ConnectFailures 0 FreeMem 18632
371913 : WD   : Uptime 6 ConnectFailures 0 FreeMem 18632

before this PR:

12113 : WD   : Uptime 0 ConnectFailures 0 FreeMem 22680
14005 : WIFI : Connected! AP: clumsy_ap2 (E4:8D:8C:49:4F:BA) Ch: 1 Duration: 3761 ms
14005 : EVENT: WiFi#ChangedAccesspoint
14831 : WIFI : DHCP IP: 10.0.10.117 (wemos-mini-17-17) GW: 10.0.0.2 SN: 255.255.0.0   duration: 762 ms
14844 : EVENT: WiFi#Connected
14891 : Webserver: start
15764 : NTP  : NTP replied: delay 261 mSec Accuracy increased by 0.478 seconds
15766 : Current Time Zone:  DST time start: 2018-03-25 02:00:00 offset: 120 minSTD time start: 2018-10-28 03:00:00 offset: 60 min
15768 : EVENT: Time#Initialized
15806 : EVENT: Clock#Time=Sun,08:43



35503 : EVENT: Clock#Time=Sun,08:44





70226 : SYS  : -79.00,18424.00,14.00,1.00
70229 : EVENT: sysinfo#rssi=-79.00
70262 : EVENT: sysinfo#mem=18424.00
70291 : EVENT: sysinfo#load=14.00
70322 : EVENT: sysinfo#uptime=1.00
71503 : EVENT: Rules#Timer=2
72058 : WD   : Uptime 1 ConnectFailures 0 FreeMem 18592









130226 : SYS  : -79.00,18424.00,6.00,2.00
130229 : EVENT: sysinfo#rssi=-79.00
130261 : EVENT: sysinfo#mem=18424.00
130291 : EVENT: sysinfo#load=6.00
130321 : EVENT: sysinfo#uptime=2.00
132058 : WD   : Uptime 2 ConnectFailures 0 FreeMem 18592









190226 : SYS  : -79.00,18424.00,6.00,3.00
190229 : EVENT: sysinfo#rssi=-79.00
190261 : EVENT: sysinfo#mem=18424.00
190291 : EVENT: sysinfo#load=6.00
190320 : EVENT: sysinfo#uptime=3.00
192058 : WD   : Uptime 3 ConnectFailures 0 FreeMem 18592

@clumsy-stefan
Copy link
Contributor

just saw, that when I open the web-log viewer I see the events in the web log and also then in the serial log, as long as hte web-log window is open...

@TD-er
Copy link
Member Author

TD-er commented Nov 25, 2018

It is possible some logs will not appear, but that will only happen if you're already almost out of memory.
If the free heap is < 5k (must check), it will not buffer the text to the log.
If you are reading the buffered log from the web log, it may be some memory is freed.
Later I want to merge both buffers, since the web log and serial log now do store data in 2-fold.
So if you disable the web log, you may see the serial log at boot I guess

@clumsy-stefan
Copy link
Contributor

I do see a serial log, but only the uptime (see above) and nothing else (except of course the boot up process unitl NTP is synced).

@TD-er
Copy link
Member Author

TD-er commented Nov 25, 2018

What serial log level do you have active? In the sysinfo page you should also see what level is actually active and not just what is set. (web log level is made more dynamic, if no log is being read)

@clumsy-stefan
Copy link
Contributor

clumsy-stefan commented Nov 25, 2018

serial and weblog are both on "info" and in he weblog windoow it tells me also it's on info. also I did not change before and after the update/recompile, so I would assume the serial log should look the same before and after...

@TD-er
Copy link
Member Author

TD-er commented Nov 25, 2018

Can you have a look at this piece of the code:

ESPEasy/src/Serial.ino

Lines 49 to 59 in 85be5b8

void addToSerialBuffer(const char *line) {
const size_t line_length = strlen(line);
int roomLeft = ESP.getFreeHeap() - 5000;
if (roomLeft > 0) {
size_t pos = 0;
while (pos < line_length && pos < static_cast<size_t>(roomLeft)) {
serialWriteBuffer.push_back(line[pos]);
++pos;
}
}
}

Maybe change this:

int roomLeft = ESP.getFreeHeap() - 5000; 
   if (roomLeft > 0) {

Into this:

int freeHeap = ESP.getFreeHeap();
int roomLeft = freeHeap - 5000; 
   if (roomLeft > 0) {

And then add some else to this if where you write the free heap to the buffer. (or just a '1' for starters)
Just to get some idea why it is not writing these to the log.

@clumsy-stefan
Copy link
Contributor

changed the function to:

void addToSerialBuffer(const char *line) {
  const size_t line_length = strlen(line);
//  int roomLeft = ESP.getFreeHeap() - 5000;
  int freeHeap = ESP.getFreeHeap();
  int roomLeft = freeHeap - 5000;
  if (roomLeft > 0) {
    size_t pos = 0;
    while (pos < line_length && pos < static_cast<size_t>(roomLeft)) {
      serialWriteBuffer.push_back(line[pos]);
      ++pos;
    }
  } else {
      serialWriteBuffer.push_back(freeHeap);
      serialWriteBuffer.push_back('/');
      serialWriteBuffer.push_back(roomLeft);
      addNewlineToSerialBuffer();
  }
}

now it seems to work again?! I get all log entries and it actually never runs into the else condition as it seems...

@TD-er
Copy link
Member Author

TD-er commented Nov 25, 2018

You cannot add the free heap like this.
The buffer has char as template argument, so you must add the freeheap values a character at a time.

@clumsy-stefan
Copy link
Contributor

oh, ok, didn't realize.... however it works again and if it would run into the else condition at least it would add a "/" and a newline... which it doesn't..

but I'll change the free heap addition...... tomorrw...... ;)

@clumsy-stefan
Copy link
Contributor

@TD-er I'm not sure what's going on here, I reverted the changes (at least the first line) and added (for simplicity) just a '1' as you suggested in the else condition. after recompiling everything still worked on my 2 test nodes.

then I removed the else condition again completetly, flashed the two nodes now on one node I see all events on the second one I only see uptime and clock events.

so I added the else condition again and now still one node show all events and the other one only uptime and clock (never running into the else condition thoug)...

one more recompile and flash (same code) now I get all events again on both nodes!!

I'm really confused now (compiler issues?)... sorry for this...

@clumsy-stefan
Copy link
Contributor

one more thing: it seems that sending commands on the serial does also not work anymore. sending a command shows something like (no matter which command):

265154 : Command: >>12085
265155 : Command unknown: ">>12085"

else if (error == OTA_RECEIVE_ERROR) Serial.println(F("Receive Failed"));
else if (error == OTA_END_ERROR) Serial.println(F("End Failed"));
serialPrint(F("\nOTA : Error (will reboot): "));
if (error == OTA_AUTH_ERROR) serialPrintln(F("Auth Failed"));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use braces here, it can prevent braces mismatch and therefore strange bugs.

Throughout the code I have seen this and it is not used consistent .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, there is not a consistent coding standard used.
For "oneliners" I tend to use the notation you see in your linked lines.
Whenever I am changing code which either has a longer statement in the if-scope, I replace it by adding {}.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
I would suggest to always use it. It makes it also more readable in my opinion.

@TD-er
Copy link
Member Author

TD-er commented Nov 26, 2018

@clumsy-stefan
I have noticed similar issues with compiling lately.
It seems like it is getting worse. Something is wrong with the linker when using PlatformIO.
When strange things happen, I do a clean and recompile and it seems to work again. But that's not how it is meant to be.

And I tested running commands yesterday via the serial and those I tested just work.

How do you add the 1 to the buffer? as 1 or '1' ?
The 1 will be considered a special marker and not printed. The '1' is a char representing the character "1".
This special marker is needed to check if the serial port is being read by some device.

@clumsy-stefan
Copy link
Contributor

¡ user Arduino SDK on a Mac.... so It can't be plattformIO related I guess... but it could mean that the cross-compiler has issues...

I did it as a '1' so it should print the number "1"...

      serialWriteBuffer.push_back('1');
      addNewlineToSerialBuffer();

@TD-er
Copy link
Member Author

TD-er commented Nov 26, 2018

Hmm, so if I add a "clean" as step in the platformIO config, it would maybe help to regain stability?
I am really fed up with these random build stability issues, since they take a lot of time and energy and we have these since February or March, or at least since then we know about them.

@Barracuda09
Copy link

Hi @TD-er
Well an "clean" step is probably hiding the real problem somewhere?

And is it always correct after an clean rebuild?

Marc

@TD-er
Copy link
Member Author

TD-er commented Nov 26, 2018

@Barracuda09 I agree it may be hiding the real issue, but reality is we cannot do much about it ourselves and the issues have been reported a long time ago.
Also I am not really sure it will always work fine then.

As I heard from the Tasmota people, they recently moved from PlatformIO to Arduino builds for the "automated builds" and now have (more?) reproducible builds.
So I am already looking into that option for ESPeasy, since the building process now is simply unreliable.

@uzi18
Copy link
Contributor

uzi18 commented Nov 26, 2018

@TD-er consider to fill bug in platformio-core/esp8266 maybe this will reveal something what is wrong in build time

@Barracuda09
Copy link

@TD-er
What seems to be the problem then, as there seems to be a lot of uncertainties in this case?

@TD-er
Copy link
Member Author

TD-er commented Nov 26, 2018

@uzi18 I am sure I have reported several times on a topic about this, but apparently not on platformio-core issues (or I cannot find it now)
I don't know where those issues were reported, but I know @ivankravets or @igrr (both named Ivan, so I mix them always in my head, I am terrible with names) replied on the issue with some reply like "need help on way to reproduce it" or something like that.

@Barracuda09 The build problems are probably related to several issues, but I can describe the symptoms:

  • WiFi instability or completely unable to connect.
  • Strange behavior in working code as @clumsy-stefan just experienced on this PR.
  • At build sometimes (actually quite often) unable to find libraries
  • WiFi credentials stored on the ESP appear to be missing or incorrect
  • Strange behavior on stored settings

All those issues may be gone by simply starting another build on the same project.
The WiFi credentials (or other stored settings) may be loaded just fine on a new build + flash, just as long as you did not save anything with the "wrong" build on the ESP.

This last issue may be related to the missing -mtarget_align flag I added a few days ago, but still a struct should always have the same layout in memory. If that cannot be guaranteed, we should make it the highest priority to change the way how we store our settings.

The WiFi issues tend to get worse by the day. It used to be that building the project twice would help the WiFi stability issues. By just building it once, it was at random if the built image would be able to connect to WiFi. That's the reason our build script is building it twice.
Not sure if it is related to the core version, or the way how platformIO handles pre-compiled libs, or perhaps a linker issue. It started to show around March. We were then also in the transition from 2.3.0 to 2.4.0 and I was also working on event based WiFi. So combine that with unreliable builds and totally not reproducible issues reported by numerous users and you see why that has been a nightmare back then.

Recently we have had a lot of watchdog issues and the introduction of the scheduler I made. I still don't know why running commands from outside the scheduler (what used to work just fine before I added the scheduler) is causing resets. On my own nodes it seems to be working just fine when I tried it, but on builds made by others it is crashing. So either we hit some kind of limit and are the resets just a matter of some stack overflow or something else which makes perfect sense.
But it could also be it is a linker issue. By now it is just unclear where to look for these issues, since there are too many unknowns and the PlatformIO unreliable builds are one big unknown I want to get rid of.
You simply can't fix bugs if they are not reproducible and even worse if your fixes are unreliable.

Like I said, Tasmota has recently stopped building the automated builds using PlatformIO and I am looking into how to make Arduino-based builds and still use the PlatformIO.ini configs to make sure if someone is building it in PIO, it will at least use the same settings.
But this has taken way too much time and energy now and tends to get worse by the day as I see these issues now on my own test rig on about every other build I make.

@Barracuda09
Copy link

@TD-er

I am not familiar with how the settings are stored, but is it just storing the struct 'as is' and casting it back when it is red?

I expect you did not update PIO, as to say recently, that you have these issues your self? If not that should rule out an problem in PIO at the moment. So an "problem" in the code could maybe cause PIO to generate wrong binaries?

@clumsy-stefan
Copy link
Contributor

@TD-er good compilation of the issues currently around... .unfortunately I think that switching to Arduino SDK does not necessarily improve things. I use Arduiono SDK from the start when I did my own builds and, as seen today with this PR, it seems to have similar issues (eg. code running different when compiling multiple times)....

however I'm happy to provide any help, support, testing, whatever, just let me know...

@TD-er
Copy link
Member Author

TD-er commented Nov 26, 2018

@Barracuda09 Indeed the settings are stored "as is", meaning we simply dump the data starting at the address of the struct and continue for sizeof number of bytes.
But if for some reason the location of members of the struct is moving between builds, then we have a serious problem.

And the instability between builds is without update of platformIO inbetween. There simply aren't that many updates on PlatformIO ;)

@clumsy-stefan Good to know Arduino environment (not just the SDK, but the way of building like Arduino IDE does) may not be the holy grail either.

@Barracuda09
Copy link

@TD-er

Ok so the storing of settings is indeed an tricky think. And indeed an potential problem when something or someone is changing the struct at any point.

So the build problem is something that is feed into PIO that now makes it produce wrong builds. Maybe increase the warnings to see if it finds anything?

@TD-er
Copy link
Member Author

TD-er commented Nov 26, 2018

We have had some checksum in the settings a while ago, but that caused also issues, since its implementation demanded the checksum to be located at the end of the file.
And extending the struct should always be done at the end, without changing the order of the existing structs to be stored.
But I guess we should at least start enabling the checksum again. But not on the settings file itself, but on the struct. Not sure how, since the problem seems to be related to what member is where and not the consistency of the byte array contents.
Maybe a simple check can be to compare the sizeof of the struct.
If it is increased we should also set a new value of the struct itself, but to check the build at runtime we should have the intended size at least.
the sizeof value is now stored in the settings file, so we can already check, but then it may give false positives when switching between builds which use different versions of this struct.

@Barracuda09
Copy link

Well at least an 'plain' struct should not get reorder, only padding is allowed. But indeed an more robust system would be an good idea.

But that should be consistent between build as you did not change PIO, so that makes me believe that some code "complexity" (large files?) makes PIO now produce wrong builds.

@TD-er
Copy link
Member Author

TD-er commented Nov 26, 2018

But that should be consistent between build as you did not change PIO, so that makes me believe that some code "complexity" (large files?) makes PIO now produce wrong builds.

Indeed. And that's what is worrying me a lot.
If this differs per build based on the same source then we really have a problem.

I've been programming long enough to know that any time I was thinking "it must be some compiler issue", in the end I was (almost) always wrong. But this time I am really out of options on what may cause these issues so I tend to believe the tools simply contain a bug.

@Barracuda09
Copy link

Yes, that is an bug in PIO no question about that. If it seems to be able to produce correct working builds.

Now what triggers this bug at this point in PIO. Something in the code or libs or file sizes.

@TD-er
Copy link
Member Author

TD-er commented Nov 28, 2018

That's a lot of green lines with timings of over 2 seconds.
No wonder you're getting Watchdog resets.
What settings do you use for controller 1?
Also making the connection takes a long time.

@Domosapiens
Copy link

image

@Domosapiens
Copy link

image

@Domosapiens
Copy link

image

Tried so many settings ... that I lost overview..

@TD-er
Copy link
Member Author

TD-er commented Nov 28, 2018

It is local network (192.168.1.x), so please try to set the Client Timeout to something like 200 or 300 msec.
Just to try and test to see what is working best for your setup.
1000 is a bit too long, since it may result in a multiple of that value as can be seen in the timing stats page.

Edit: Also I got the feeling MQTT is a bit more lightweight to the ESP, compared to a HTTP connection.

@Domosapiens
Copy link

Will do Gijs.
See what tomorrow will bring !
Sleep well. Thanks

@TD-er
Copy link
Member Author

TD-er commented Nov 28, 2018

Welterusten :)

@Domosapiens
Copy link

Domosapiens commented Nov 29, 2018

I'm renting, so no need for early wake-up ;)
Strong connection:
image

@Domosapiens
Copy link

Domosapiens commented Nov 29, 2018

Weak connection, first access point on longer distance (but ~same uptime)
image

@Domosapiens
Copy link

Extreme close to access point (shorter uptime!)
image

@TD-er
Copy link
Member Author

TD-er commented Nov 29, 2018

What was the reboot reason for the last one? Exception or software/hardware watchdog?

@Domosapiens
Copy link

Domosapiens commented Nov 29, 2018

HW watchdog (snapshot from now)
image

@TD-er
Copy link
Member Author

TD-er commented Nov 29, 2018

And that one is still using 1000 msec timeout in the controller?

@Domosapiens
Copy link

Nope, changed all 3 last night to 300msec
image

@TD-er
Copy link
Member Author

TD-er commented Nov 29, 2018

And the timings on that one?
Problem is though, the timings are from before a crash, so it may only give some indication where timings may take more time than usual and indicate a trend. It can still be caused by something else we don't see yet, since it is crashing and not leaving any trace on where it crashed.

@Domosapiens
Copy link

image

@clumsy-stefan
Copy link
Contributor

what happens if sendData takes too long? I have this on a number of nodes with a PCF8574 connected. sometimes the TEN_PER_SECONDcall takes >500ms....

@TD-er
Copy link
Member Author

TD-er commented Nov 29, 2018

@clumsy-stefan Then no other call can be made.
Meaning if TEN_PER_SECOND takes 500 msec, it will block 5 calls to TEN_PER_SECOND, 25 calls to FIFTY_PER_SECOND, etc.

So GPIO events will be missed also, unless they have an interrupt attached to them.

@clumsy-stefan
Copy link
Contributor

so, also background/core tasks (like handling WiFi) will be blocked then?

@TD-er
Copy link
Member Author

TD-er commented Nov 29, 2018

so, also background/core tasks (like handling WiFi) will be blocked then?

The call to our own function could still be done, depending on what is causing the delay.
If there are several sub tasks "we" do (in our code), then there is an option to call for delay() or our own function to handle background tasks.
If it is a single call to some function (e.g. connect() ) then it may very well be there is no call to delay() or something similar and thus WiFi tasks are not attended.

@thomastech
Copy link
Contributor

Feedback:
I'm running this PR on two NodeMCU's. Both use the OpenHab MQTT controller, serial log is disabled. One has default 1000mS Client timeout, the other is 600mS (value randomly chosen after reading TD-er's comments above).

Both experience resets, reported as Manual Reboot, Hardware Watchdog. The resets do not occur at the same time.

From what I can see the reset occurrences seem about the same with this PR. But I've only been running it for about a day, so it's too soon to know for sure.

  • Thomas

@xztraz
Copy link

xztraz commented Nov 30, 2018

seem like serial server is borken in later builds. tried ESPEasy_mega-20181129. no working serial data via netcat. (no blinking on port). (serial logging turned off)
went back to version ESPEasy_mega-20181004 (no settings changed) and it works good again. and led blinks

blinking leds for tx and rx on rs232 shield

@TD-er
Copy link
Member Author

TD-er commented Nov 30, 2018

Is 20181004 the last one to work?
Or did my serial buffer of 2 days ago break it?
That one was in 20181128

@Domosapiens
Copy link

Running ESP_Easy_mega-20181128_normal_ESP8266_4096.bin and checked 2# Wemos D1 units:
Serial works good!
(but does not solve the HW WatchDog problem)

@TD-er
Copy link
Member Author

TD-er commented Nov 30, 2018

There are so many causes for HW watchdog, and the serial port writing is now one of them we can check of the list (except for plugins actually writing to that port like Nextion plugin)

Also @xztraz reported about the Serial2Net plugin. I really want to know the last version it was working.

@xztraz
Copy link

xztraz commented Nov 30, 2018

@TD-er Ser2net bug doesn't seem to have to do with this merge. its from earlier. weird @Domosapiens got it to work thou..

20181129-normal - no lights
20181128-normal - no lights
20181101-normal - no lights
20181031-normal - no lights

20181030-normal <- Here stuff seems to get broken

20181029-normal - ok
20181026-normal - ok
20181021-normal - ok
20181004-normal - ok


Changes in mega-20181030 (since mega-20181029):

ESPEasy release bot (1):
automatically updated release notes for mega-20181030

James Tutton (2):
Added Special Case to allow Servo to be disabled
Fixed missing ;

TD-er (1):
[Core 2.4.2] Set core 2.4.2 as default platform

svmac (1):
P022_PCA9685 Enhancements

@TD-er
Copy link
Member Author

TD-er commented Nov 30, 2018

These were the differences: mega-20181029...mega-20181030

@TD-er
Copy link
Member Author

TD-er commented Nov 30, 2018

Could you also try the build image with core 2.4.1 in its name? ("normal_core_241_ESP8266_4096")

@xztraz
Copy link

xztraz commented Nov 30, 2018

ESP_Easy_mega-20181129_normal_core_241_ESP8266_4096 works. so the new core seem to introduce something.

@TD-er
Copy link
Member Author

TD-er commented Nov 30, 2018

So at least you now have a work-around for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants