-
Notifications
You must be signed in to change notification settings - Fork 13.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SDK appears to randomly allocate >3K stack [solved, high TX pwr + cheap/slow flash chip = random crash] #6366
Comments
I see no issue on my side
Can you please try and update master,
|
Ran the commands requested:
It's plugged into a USB hub with a good 12v power supply (several amps available). I added disable_extra4k_at_link_time(); as the first line of setup() (and included coredecls.h)... it has definitely changed the output. Debug and stack decode follow.... Legacy Stack Debug output:
Legacy Stack Trace (yes it's complete - nothing decoded after the ethernet_input line) :
As the decode didn't look helpful with legacy stack, I also reverted to my original code above... Normal Stack debug output
Normal Stack Decoded:
Also any given build will crash consistently (ie at the same address with the same trace) although changing the code, building then changing it back building and flashing can give a different result. I'm beginning to wonder if my build environment has an issue... although that would not explain why I have never had an issue with the Arduino IDE developing for Arduinos, but I have this issue with all releases of the esp8266 tools since 2.4.0 (may go back further but that was as far as I tested) If I hadn't already tested with several WEMOS D1 boards which all crashed identically for a given build I would be sure it was dodgy hardware. |
oops, can't read the display on my bench supply :-) |
Please enable all debug options, so we may get more info. |
I can't see an "ALL" option, but the above were done with what appears to be the "everything enabled" option - the second-last in the list, with all the other items in it. |
Yes, this entry. |
Just to be clear, I had that option selected when I captured the output I posted above - so there's no much point in posting the same again.... If only we had access to the SDK source I could probably have a fix sorted in a couple of hours :-/ |
Can you post an archive with sketch source, selected options and the elf file ? |
Ok, that should be what you need - complete preferences.txt from ~/.arduino15, the sketch and the binary file that was generated, plus a matching set of debug and stack decodes. Interestingly this particular build is crashing very early in the run - much earlier than before, even though the sketch has not been modified since my last set of logs. Hopefully you can work something out from that.. |
oops, just realised the bin file in that archive isn't the elf file you wanted... correct file attached... Apologies for calling the sketch "blank" it kind of grew from the example :( |
@ChocolateFrogsNuts wrote: If only we had access to the SDK source I could probably have a fix sorted in a couple of hours :-/ : Read through this issue looking for sources of sdk 2.2.1; additional googling for sources led to: In Arduino ide version of "board esp8266 package" seems to be equal to esp8266 sdk version |
@spblinux at a quick glance it appears to offer no more than the tools/sdk directory in the Arduino library... but it is late here, so I will have a closer look in the morning. |
@ChocolateFrogsNuts sdk example code and sources of lib driver.a are stripped out in arduino ide version of sdk. But source code of functions referenced in ld/eagle.rom.addr.v6.ld is missing in both variants of sdk. (In my case I needed the source code of a uart controlled esp8285 "wifi shield" which is in examples/at.) |
@spblinux we don't use libdriver.a. It is a leftover that shouldn't be in this repository. |
I'm sorry I still can't reproduce with your binary
That could be a worse version of the #2330 issue (per your comment)
Can you try with your mobile phone as AP ? (that's what I did) |
Additional Info: When it crashes on my normal wifi network, I don't see any traffic from the ESP chip other than the usual DHCP after the stack dump has finished - my AP is not the gateway so I can get that with a 'tcpdump -s 1500 -X ether host xx:xx:xx:xx:xx:xx' on the gateway. For some reason it still crashes when I drop the internet connection from my gateway though - I can only assume the ESP code still thinks it has an internet connection, so still does whatever it is that is the problem. Yay! this is progress - I can now control when it crashes, which hopefully narrows down why a little. |
Hmm, looks like it could in fact be related to #2330 - Using an app called Network Utilities, I can ping the ESP without any issues, however if I use the "IP Discovery" tool in that app, I get an instant crash every time - not only that it seems to cause a whole series of crashes (probably from multiple ARP requests). I will go looking in the lwip arp code later today... I have something else I have to tend to this morning :-) |
And now this issue became even more interesting :) |
Ok, further testing and I am increasingly convinced the stack is being used up by the binary blob...
So I tried several of the LWIP options - all crashed in the same way, including the "v1.4 Compile from source" option. With that option selected (easiest way to re-compile lwip), I then added one line of code to tools/sdk/lwip/src/netif/etharp.c
That one line, which has no effect unless the stack is already full, seems to make it almost immune to crashes no matter how many ARP requests I send it! Unfortunately it is also almost completely deaf to IP traffic, but at least we know why ARP is behaving badly - it looks like every time network traffic is received, a "random" amount of stack is allocated! The app I was using on the tablet is |
If stack is a concern, please do your tests with this additional call:
(this will allocate the stack in user heap space - which is the SDK default, not within the system stack space, which is a hack not compatible for example when WPS is used - for the record, this case is automatically handled) |
I've had disable_extra4k_at_link_time() back in while I try to work out the details of the stack. I still don't like the look of the stacks I am seeing.... for example:
There is no way umm_malloc should call malloc_loc then back to pbuf_free. This could be anywhere... next job is to get more debugging output from the arduino libraries - there still seems to be stuff that is not enabled |
Note that I am leaning towards a block of heap being free'd but the pointer being left in play somewhere as the root cause... something has somehow caused check_poison_block to be called on a block that has already been free'd at some point... |
One more thing you can try is enabling gdb. A crash like this goes into agdb break and you can try the stack unwinder or look at system memory. I've got a pr that does a core dump that gdb plus a custom app can load, like loading a core file on Unix, but the stack unwinding is more foolproof on a live system. |
Hmm, today I completed my own stack decoder that does a better job than the ESP Exception Decoder tool, which simply prints out the line of code associated with every user code address found in the stack, if gdb can provide it. So what have I worked out about the stack on these crashes? At this point, trying to get gdb to do anything with a stack or core dump is going to be fairly useless I think :( I need a way to step through the live code and see what is happening, or I need to get more debug output from all the code in the form of text out the serial port. It's not going to be easy to tell exactly where the thing is when it initially goes wrong. |
Hmm, that's bug hunting for pro's :) I would love to see good stack decoding utils for this platform. |
@earlephilhower one of the things I found today was espressif's gdbstub which can be compiled in with your code and apparently allows gdb to step through the live code on chip. I haven't ruled out trying it out yet..... I have at least managed to get lwip to start dumping massive amounts of debug info to Serial, but wouldn't you know it, now that I might be able to see what happens, hitting it with ARP requests from my tablet no longer crashes it. Fortunately it still crashes on my normal network... and so far there has been a gap of around 1 second between the last network traffic and the exception, and one exception happened before any network traffic was logged..... Time to find more debugging options to enable :-/ Oh, and @TD-er don't get your hopes up - my stack decoder is a fairly rough perl script for the command line - nothing like the exception decoder tool for the IDE, although I may do a conversion/upgrade/shove the two together some day. |
@ChocolateFrogsNuts don't use the Espressif version, it doesn't support sharing the UART and has a custom format that's incompatible with the plain-GNU toolchain we've been using. How to use the included one is in here: https://github.com/esp8266/Arduino/blob/master/doc/gdb.rst It has full GDB support, including single stepping/breakpoint/Ctrl-C interrupt/etc. on the live system. We have no ELF or source for the blob or ROMs, though, so don't expect to be able to get anything other than assembly inside non-open source code. |
Well that's the nice thing with open source. |
@ChocolateFrogsNuts please continue pursuit, you may be on to a stability issue. Also, in case you haven't realized it already, @earlephilhower has done a lot of work related to what you're looking at, so please discuss with him. |
thanks @earlephilhower I'll keep it in mind if the current line of attack fails... Currently I'm doing my own build of lwip2. I've got debugging output from the glue layer, it tells me plenty, but it isn't doing anything for about a second before the crash, although there seems to be a packet repeated several times over the previous 10 seconds, so I will find out what that is. I'm working on getting more output by enabling more debugging from lwip2 itself to see if anything is happening in there. EDIT: got the lwip2 debugging on... I turned everything on... This will take some time to analyse :) |
I wonder why d1 (mini, mini pro) defaults to DIO and not QIO, maybe @wemos can tell ? |
Not sure it's wemos that selected that default.. it's set in boards.txt (from tools/boards.txt.py) |
@wemos is indeed the one user who defined these boards |
https://wiki.wemos.cc/_media/products:d1:sch_d1_mini_v3.0.0.pdf There's only DO and DI on the schematic, so maybe legit boards' flash chips don't have the add'l D2/D3 pins? |
It seems sdio_data_2/3 are gpio9/10 and are connected |
Yes, that's the actual IC pinout. But the flash chip needs them as well, and to support QIO. I imagine the Wemos selected flash chips did not have QIO mode, or they couldn't route it properly on the 2-layer PCB they had. @ChocolateFrogsNuts obviously has a really weird fake with different chips and evidently a different schematic. Seems they just stole the silkscreen layer from WeMos. :( |
I don't think any so called "Wemos" board out there that you can now buy is made by Wemos/Lolin. I'm no longer using the "Wemos D1 mini" form factor in my projects. They have become unreliable since you are not sure what you'll get when ordering them. |
That reads to me like WP and HP on the flash double as D2 and D3 for QIO. |
QIO works for both types of D1 board that I have - the other thing is if the early versions didn't have D2/D3 connected - that schematic is marked as rev 3.0.0. |
two a little off topic comments :-) what is faster? 40 MHz DIO or 26 MHz QIO? with all this variations of only one of all the esp8266 boards it is really hard to troubleshoot on own bench, here on GitHub or help on forums or Stack Exchange |
theoretically (without benchmarking) DIO is 2-bit, QIO is 4-bit, so 26/40*2=1.3 - 26MHz QIO is 1.3 times the speed of 40MHz DIO... adjust for overheads. In practice, I just ran the following tests on the genuine (winbond chip) D1 mini Pro
Conclusion: surprisingly little difference, with only a 30% improvement between 26MHz QIO vs 80MHz QIO despite 3x the clock. Test code pushed to DebugTools/SPIFFS.ino at https://github.com/ChocolateFrogsNuts/ESP-DebugTools |
for completeness...
and with a 160MHz CPU clock
so unless you're running 160MHz CPU with 80MHz QIO flash, don't stress about the flash mode and clock too much :-) |
Aaaannd after realising the main limiting factor was the write speed of the flash which has nothing to do with read speed, here are the read-only results....
Interesting that 26MHz QIO actually is faster than 40MHz DIO for read, but again 3% isn't anything to be excited/worried about. |
For the people about to say "you just benchmarked SPIFFS with all its overheads"....
Looking more like we might expect, but only increases the difference between 40MHz DIO and 26MHz QIO up to 6% on 80MHz CPU, or 11% on 160MHz CPU. Test code in DebugTools/flash.ino in the above repository. |
Hmm, XMC chips default to 75% drive on their outputs, but can be switched up to 100% by setting SR3:5,6 to 1,1. It's likely this will get them operating at full speed... I just need to find the correct way to send the SPI commands needed to the chip. |
Well that was easier than I thought.... To identify an XMC chip, run 'esptool.py flash_id' To enable 100% drive until power is removed: To enable 100% drive permanently: And there you have it, your XMC flash chip should be just as good as the Winbond ones. EDIT: the XMC chips are rated for 104MHz :-) |
Wow thanks ! |
hehe, I was kinda thinking of making it automatically set 100% output if an XMC chip at >26MHz is found at boot. By only setting it in the volatile register, it will automatically save power or not as required. Setting it permanently would still work too. |
Guys, this is super interesting, but how about opening a new issue and discussing there? This one was already addressed, is closed, and is super long. @ChocolateFrogsNuts a thought before you jump in: there may be other flash chips that could benefit. In @igrr's words, what you found could be described as a flash chip "quirk", similar to the puya quirk, which requires special code to get correct operation. It may make sense to structure this type of code in a way that minimizes invasiveness and allows easier additions in the future, whether to add other manufacturers to a quirk or add other quirks. |
erk! 2 problems :
It's not like the stub esptool is using is initialising much either. It does call rom functions SelectSpiFunction() and spi_flash_attach() but I would think that has already been done by the time sketch code is running. Calling them again results in a crash further down the code, which I expected... So close and yet so far :-( |
See also: - esphome/feature-requests#471 (comment) - esp8266/Arduino#6366 - esp8266/Arduino#6471 - https://github.com/xoseperez/espurna/blob/849f8cf920096fa4b804e70913dab0917ee18ad9/code/espurna/config/general.h#L593-L599 - https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/network/esp_wifi.html#_CPPv425esp_wifi_set_max_tx_power6int8_t
* Add wifi output_power setting See also: - esphome/feature-requests#471 (comment) - esp8266/Arduino#6366 - esp8266/Arduino#6471 - https://github.com/xoseperez/espurna/blob/849f8cf920096fa4b804e70913dab0917ee18ad9/code/espurna/config/general.h#L593-L599 - https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/network/esp_wifi.html#_CPPv425esp_wifi_set_max_tx_power6int8_t * Lint
* Add wifi output_power setting See also: - esphome/feature-requests#471 (comment) - esp8266/Arduino#6366 - esp8266/Arduino#6471 - https://github.com/xoseperez/espurna/blob/849f8cf920096fa4b804e70913dab0917ee18ad9/code/espurna/config/general.h#L593-L599 - https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/network/esp_wifi.html#_CPPv425esp_wifi_set_max_tx_power6int8_t * Lint
* Add wifi output_power setting See also: - esphome/feature-requests#471 (comment) - esp8266/Arduino#6366 - esp8266/Arduino#6471 - https://github.com/xoseperez/espurna/blob/849f8cf920096fa4b804e70913dab0917ee18ad9/code/espurna/config/general.h#L593-L599 - https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/network/esp_wifi.html#_CPPv425esp_wifi_set_max_tx_power6int8_t * Lint
I'm pretty sure this is an SDK problem, but posting here in case I'm wrong, or it's caused by something in the Arduino library - and I discovered/tested it using the Arduino IDE. I have also been looking on esp8266.com for similar issues.
Basic Infos
Platform
Settings in IDE
Problem Description
Exception 0 with a very long stack dump, sometimes more than 5500 bytes of stack used.
Dump contains a large area of stack (>3Kbytes) that has been allocated but never used (still set to feefeffe)
I tested with version 2.4.0 right up to git (1-Aug-2019) with the same results.
I also tried several D1 mini Pro boards with the same result.
It can be reliably reproduced by flashing the BareMinimum example once the wifi has been configured to connect to an available network with the default sleep mode (MODEM_SLEEP_T).
With debugging on it will crash within seconds of "pm open,type:2"
I decoded many stack dumps, and as far as I can tell the "randomly" allocated stack is happening in the SDK code, probably in a network interrupt handler as generally the trace leaves the Arduino library in esp_yield_within_cont() which calls run_scheduled_recurrent_functions(); - before the big chunk of stack is allocated - and re-enters the Arduino library at ethernet_input() - after the big chunk is allocated.
There are several stack frames that don't get decoded, presumably because they happen in the SDK.
The sketch below contains some additional code I have been using to try and diagnose the issue, but as mentioned above, the BareMinimum example will trigger the problem reliably once the wifi is configured. #includes are as per Sketch->Include->ESP8266WIFI.
It may be worth noting that only including ESP8266WiFi.h and no other headers in this sketch results in a crash much sooner for me.
MCVE Sketch
Debug Messages
The text was updated successfully, but these errors were encountered: