Crash Dumps
When the ESP8266 CPU encounters a fatal exception, esp-open-rtos outputs a crash dump like this:
Fatal exception (29):
epc1=0x40208560
epc2=0x00000000
epc3=0x40103117
excvaddr=0x00000000
depc=0x00000000
excsave1=0x40208558
Registers:
a0 40208558 a1 3fff5040 a2 00000000 a3 00000081
a4 00000001 a5 00000081 a6 00000000 a7 00000000
a8 00000000 a9 00000000 a10 00000000 a11 00000000
a12 00000000 a13 60000300 SAR 0000001e
Stack: SP=0x3fff5040
0x3fff5040: 00000000 00000000 00000000 00000000
0x3fff5050: 00000000 a5a5a5a5 a5a5a5a5 00000290
0x3fff5060: 3fff5408 00000000 00000000 00000000
0x3fff5070: 00000000 00000000 00000000 00000000
0x3fff5080: 00000000 00000000 00000000 00000000
0x3fff5090: 00000000 00000000 00000000 00000000
0x3fff50a0: 00000000 00000000 00000000 00000000
0x3fff50b0: 00000000 00000000 00000000 00000000
Free Heap: 42592
_heap_start 0x3fff1380 brk 0x3fff5480 supervisor sp 0x3ffffae0 sp-brk 42592 bytes
arena (total_size) 16640 fordblks (free_size) 0 uordblocks (used_size) 16640
The code which generates crash dumps can be found in core/debug_dumps.c.
The number in parentheses (ie (29) in the example above) is an Xtensa exception cause number:
Number | Cause | Reason |
---|---|---|
0 | IllegalInstruction | Illegal instruction was executed. |
1 | Syscall | SYSCALL instruction encountered (esp-open-rtos does not use syscalls or have a syscall handler, so this is an error) |
2 | InstructionFetchError | Attempt to execute instructions from an illegal address, or read error when fetching instruction |
3 | LoadStoreError | Attempt to read/write memory in a manner not supported by the hardware (for example, trying to read/write a byte to a memory area that only supports word accesses) |
4 | Level1Interrupt | (This is normal and should never generate a crash dump) |
5 | (reserved/unused) | (Not used by ESP8266 hardware. Should never occur.) |
6 | IntegerDivideByZero | CPU tried to divide by zero. |
7..8 | (reserved/unused) | (Not used by ESP8266 hardware. Should never occur.) |
9 | LoadStoreAlignment | Attempt to read/write memory with an unaligned address (for example, trying to read/write a 32-bit word at an address that is not a multiple of 4) |
10..19 | (reserved/unused) | (Not used by ESP8266 hardware. Should never occur.) |
20 | InstFetchProhibited | CPU tried to execute code from a memory region which is not valid for code |
21..27 | (reserved/unused) | (Not used by ESP8266 hardware. Should never occur.) |
28 | LoadProhibited | CPU tried to load memory from a region which is protected against reads |
29 | StoreProhibited | CPU tried to store to memory in a region which is protected against writes
Immediately following the exception number is a dump of the Xtensa registers at the time the exception occurred.
The program counter (PC) is the register in a CPU that points to the currently executing instruction.
The EPCx registers give the saved PC address at each exception level, i.e. the program addresses that were in use when the CPU encountered the exception:
epc1=0x40221970
epc2=0x00000000
epc3=0x00000000
EPC1 is the saved PC address that was in use when the exception occurred, assuming it caused an exception at the normal level (level 1). This is almost always the PC address to use when debugging a crash dump.
EPC2 is used to save the current PC address when a debugger halted execution. If you're not using a debugger then you can expect it to be 0.
EPC3 is the saved PC from the last time the CPU jumped to the NMI (Non Maskable Interrupt) vector. This address is unrelated to the crash 99.999% of the time, unless the crash occurred inside the NMI exception handler (very rare, unless you're writing code that runs in the NMI). This address can still be informative as it shows you a place where the CPU was running the last time an NMI occurred (ie recently).
EXCVADDR is the memory address that caused the exception, if there was one. If you get a LoadStoreError, LoadProhibited or StoreProhibited exception cause then look at EXCVADDR to see which load/store address caused the exception:
excvaddr=0x00000000
For example, reading from a NULL pointer causes a LoadProhibited exception (28) with EXCVADDR=0. Writing to a NULL pointer causes a StoreProhibited exception (29) with EXCVADDR=0 (writing to a NULL pointer is what caused the example fatal exception at the top of the page).
(Note that the entire region from 0x00000000 to 0x20000000 in the ESP8266 is protected against both reads and writes. This means that reading or writing anywhere in that address range will generate an exception. A handy side-effect of this is that, as mentioned, NULL pointer accesses fall in this range and thus generate exceptions.)
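For illustration, here is a minimal sketch (hypothetical example code, not taken from esp-open-rtos) of the two classic NULL pointer cases described above. Either one produces a crash dump like the example at the top of this page:

```c
#include <stdint.h>

/* Reading or writing through a NULL pointer lands in the protected
 * 0x00000000-0x20000000 region, so both functions crash immediately. */
static volatile uint32_t *null_ptr = (volatile uint32_t *)0;

void trigger_load_prohibited(void)
{
    uint32_t value = *null_ptr;   /* LoadProhibited (28), EXCVADDR=0x00000000 */
    (void)value;
}

void trigger_store_prohibited(void)
{
    *null_ptr = 0x12345678;       /* StoreProhibited (29), EXCVADDR=0x00000000 */
}
```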
excsave1=0x40208558
EXCSAVE1 is the PC address of normal (non-interrupt-context) execution from the last time an interrupt occurred. Like EPC3, this often tells you a place where the CPU was running recently (but it may not be running there now). As interrupts are used for context switching between tasks, this program address is often related to the last context switch.
Registers:
a0 40208558 a1 3fff5040 a2 00000000 a3 00000081
a4 00000001 a5 00000081 a6 00000000 a7 00000000
a8 00000000 a9 00000000 a10 00000000 a11 00000000
a12 00000000 a13 60000300 SAR 0000001e
The general purpose registers are manipulated as your code runs and aren't guaranteed to mean anything in particular, but they include control flow values and program variable values. Some of these may be useful! Once you find the line of source where the crash happened (see below), you can optionally look at the function disassembly to find out which registers were being used for what.
A register that often holds a predictable value is a0. When using the Xtensa "call0" ABI (as used by ESP8266), this register holds the return address from the current function call. (For more explanation of this, see "stack dump" section below.)
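As an aside, you can read the same value from C code. This is a minimal, hypothetical sketch (not an esp-open-rtos API) using GCC's __builtin_return_address(), which under the call0 ABI should return the same return address that arrives in a0; the printed value can be fed to addr2line like any other PC address:

```c
#include <stdio.h>

void log_caller(void)
{
    /* The return address of the current call, i.e. the value passed in a0. */
    void *return_address = __builtin_return_address(0);
    printf("log_caller() was called from %p\n", return_address);
}
```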
Once you have some PC addresses (like 0x40208560), you want to convert them to line numbers in your source code. There are a few ways to do this:
The filteroutput.py tool in the utils/ directory will automatically filter serial output, pick out program addresses, and inject the line numbers. If you use make test to flash & run your programs then filteroutput.py is already being used to monitor the serial port.
The crash dump from above, processed with filteroutput.py:
epc1=0x40208560
0x40208560: blinkenTask at blink.c:51 (discriminator 1)
epc2=0x00000000
epc3=0x40106530
0x40106530: memcmp at /home/gus/dev/esp/rtos/newlib/build/xtensa-lx106-elf/newlib/libc/string/../../../../../newlib/libc/string/memcmp.c:64
excvaddr=0x00000000
depc=0x00000000
excsave1=0x40208558
0x40208558: blinkenTask at blink.c:51 (discriminator 1)
The line printed under epc1 shows the line number where the program crashed.
filteroutput.py can be used to live monitor a serial port, like this:
filteroutput.py -p /dev/ttyACM0 -b 115200
Or you can pipe an existing log file through it, like this:
filteroutput.py < my_debug_log.txt
filteroutput.py will automatically look for an ELF file (.out file in the build directory) under the current directory. This is the file which is used to look up line numbers based on PC addresses. If run from a different directory, use the -e option to pass in the path to the ELF, ie:
filteroutput.py -e ../build/my_program.out < my_debug_log.txt
Run filteroutput.py -h for more usage details.
The addrsource tool in the utils/ directory of esp-open-rtos will give you more details about a crashing PC address.
Similar to filteroutput.py, addrsource automatically looks for an ELF file in the build directory. So if you are running addrsource from your app directory (i.e. the directory that has build and firmware subdirectories), you can just give it the EPC1 address as the only argument:
addrsource 0x40221970
If the current directory is somewhere else, you will need to also tell it where to find the ELF binary (.out file in the build directory) to use:
addrsource 0x40221970 myprogram/build/myprogram.out
The addrsource tool prints the line number, source function body, and disassembly for the exception address.
If all you need is the filename and line number, the xtensa-lx106-elf-addr2line tool (provided as part of the cross-compiler suite from esp-open-sdk) is a simpler tool that just prints the source code line numbers based on provided addresses. filteroutput.py is running addr2line behind the scenes.
Launch addr2line with the path to the ELF binary:
xtensa-lx106-elf-addr2line -e build/program.out
Then type/paste in the addresses:
$ xtensa-lx106-elf-addr2line -e build/program.out
0x40221970
/path/to/program.c:76
Stack: SP=0x3fff5040
0x3fff5040: 00000000 00000000 00000000 00000000
0x3fff5050: 00000000 a5a5a5a5 a5a5a5a5 00000290
0x3fff5060: 3fff5408 00000000 00000000 00000000
0x3fff5070: 00000000 00000000 00000000 00000000
0x3fff5080: 00000000 00000000 00000000 00000000
0x3fff5090: 00000000 00000000 00000000 00000000
0x3fff50a0: 00000000 00000000 00000000 00000000
0x3fff50b0: 00000000 00000000 00000000 00000000
The crash dump includes a "stack dump" - a chunk of RAM read from the current task's stack. The current address of the stack pointer (SP) is printed on the first line.
FreeRTOS allocates each task's stack from the heap at runtime, so the stack dump may go past the end of the task's stack and include other memory (the stacks of other tasks, or other heap memory). FreeRTOS uses the magic value 0xa5a5a5a5 as an end-of-stack marker, so you can usually assume that anything after 0xa5a5a5a5 is no longer the stack of your program.
You can use the stack dump to figure out the call stack at the time of the crash. It's not quite as simple as viewing a backtrace in gdb, though (patches to make it simpler are welcome!)
Here's some contrived sample code that makes a bunch of nested function calls and then crashes:
```c
void crash_now(void)
{
    *((volatile uint32_t *)0) = 0; // null pointer deref, instant crash
}

void call_crash_now(void)
{
    for(int i = 0; i < 3; i++) {
        crash_now();
    }
}

void should_we_crash(bool crash_now)
{
    if(crash_now) {
        call_crash_now();
    }
}

void init_task(void *pvParameters)
{
    should_we_crash(true);
    while(1) { }
}
```
(NB: If you actually compile this, add __attribute__((noinline)) after each void or gcc just inlines all these functions into one!)
Here's the crash dump:
Fatal exception (29):
epc1=0x40211a61
0x40211a61: crash_now at crash.c:16
epc2=0x00000000
epc3=0x00000000
excvaddr=0x00000000
depc=0x00000000
excsave1=0x40208551
0x40208551: call_crash_now at crash.c:22
Registers:
a0 40208551 a1 3fff5020 a2 00000000 a3 00000000
0x40208551: call_crash_now at crash.c:22
a4 00000000 a5 00000000 a6 00000000 a7 00000000
a8 00000000 a9 00000000 a10 00000000 a11 00000000
a12 00000000 a13 00000000 SAR 00000000
Stack: SP=0x3fff5020
0x3fff5020: 00000000 00000000 00000000 4020856d
0x4020856d: should_we_crash at crash.c:31
0x3fff5030: 00000000 00000000 00000000 4020857e
0x4020857e: init_task at crash.c:35
0x3fff5040: 00000000 00000000 00000000 00000000
0x3fff5050: 00000000 a5a5a5a5 a5a5a5a5 00000290
Reading from the top of the crash dump down:
EPC1 contains the actual crash address - caused by the null pointer dereference in the crash_now function.
Reading down the register dump, a0 has value 0x40208551. This is the return address for the crash_now function, and points to the call_crash_now calling function.
In the stack memory dump, the address 0x4020856d appears. This is the return address from call_crash_now, as it was pushed onto the stack. It points to the should_we_crash function that's one higher on the call stack.
Four words further along in the stack dump is 0x4020857e, which points to the init_task function which calls into should_we_crash. You'll usually find a task init function at the "end" of the stack, unless the stack is corrupt.
Further on in the stack dump, you can see the 0xa5a5a5a5 stack end marker - this confirms for us that anything after this is not part of our stack.
So we can quickly assemble an idea of the full call trace of the task at the time of the crash: init_task -> should_we_crash -> call_crash_now -> crash_now.
Note that program addresses can appear in the stack dump for other reasons (function pointers, left over data in RAM that hasn't been overwritten yet, etc.) If an address in the stack dump doesn't seem like it could possibly have been part of the call trace, you can probably ignore it.
When you look at a gdb backtrace, you see the line numbers of each function call itself. When looking at a raw memory/register dump like this, you're looking at the return addresses - which is the instruction in the calling function that's to be executed after the called function returns.
For example, in the dump above one of the stack addresses is 0x4020856d, which is crash.c line 31:
```c
void __attribute__((noinline)) should_we_crash(bool crash_now)
{
    if(crash_now) {
        call_crash_now(); // <--- line 29 is here
    }
} // <--- line 31 is here
```
Line 31 is the address after call_crash_now() returns - it points to the next instruction executed. You have to look a few lines up in the C code to see where the call was made. For most source code, this is pretty obvious.
To get a precise line number, you can run addr2line on a value 3 less than the return address in the dump. This works because the call instruction that produced the return address is 3 bytes long, so subtracting 3 gives the address of the call itself.
$ xtensa-lx106-elf-addr2line -e build/blink.out
0x4020856d
crash.c:31
0x4020856a
crash.c:29
... bingo!
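If you do this kind of triage often, a small host-side filter can pre-screen a stack dump for words that fall in the ESP8266 code address ranges, i.e. candidate return addresses to feed to addr2line. This is a hypothetical helper, not part of esp-open-rtos; the address ranges are assumptions based on the usual ESP8266 memory map, so adjust them to match your linker script:

```c
/* Build on the host, then paste or pipe the "Stack:" lines of a crash dump
 * into stdin. Any word that looks like a code address is printed. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int looks_like_code(uint32_t word)
{
    return (word >= 0x40100000 && word < 0x40108000) ||  /* instruction RAM */
           (word >= 0x40200000 && word < 0x40300000);    /* mapped SPI flash */
}

int main(void)
{
    char token[64];
    while (scanf("%63s", token) == 1) {
        char *end;
        uint32_t word = (uint32_t)strtoul(token, &end, 16);
        if (end != token && *end == '\0' && looks_like_code(word))
            printf("0x%08x\n", (unsigned)word);
    }
    return 0;
}
```

Pipe its output into xtensa-lx106-elf-addr2line as shown above to get candidate call-stack line numbers.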
Free Heap: 42592
_heap_start 0x3fff1380 brk 0x3fff5480 supervisor sp 0x3ffffae0 sp-brk 42592 bytes
arena (total_size) 16640 fordblks (free_size) 0 uordblocks (used_size) 16640
The last part of the crash dump contains information about the heap. Usually the only value you need to worry about is "Free Heap" - if it's very low then this may explain why your program crashed. If it's a nonsensical value then the heap may be corrupt.
You can query the free heap value at runtime by calling xPortGetFreeHeapSize().
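If you suspect a slow leak, a sketch like the following (a hypothetical task, assuming the usual FreeRTOS headers used by esp-open-rtos) logs the free heap periodically so you can watch it shrink before the crash happens:

```c
#include <stdio.h>
#include <FreeRTOS.h>
#include <task.h>

/* Print the free heap size every five seconds. */
void heap_monitor_task(void *pvParameters)
{
    while (1) {
        printf("Free heap: %u bytes\n", (unsigned)xPortGetFreeHeapSize());
        vTaskDelay(5000 / portTICK_PERIOD_MS);
    }
}
```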
The other lines printed are information about brk(), and internal heap information from mallinfo. Look in debug_dumps.c to see how these values are all derived (a runtime sketch for querying them yourself follows the list):
- _heap_start - Address in RAM where the heap starts (the heap grows up from here). This address is determined at link time, by the linker script.
- brk - The "brk" system call takes never-used RAM and adds it to the data segment that is used by the heap. Before any heap is allocated, brk is equal to _heap_start. This address shows how high in RAM the segment used by the heap has ever gotten.
- supervisor sp - Address in RAM of the "supervisor" (pre-FreeRTOS) stack pointer. This is how far the stack grew down from the very end of RAM before the RTOS scheduler started. If the end of the heap (as determined by brk) grows close to this point, you may run out of heap and allocations will fail.
- sp-brk - Literally, supervisor sp minus brk. This shows how many bytes of memory were never allocated to the heap, and lets you see the "high water mark" of heap usage. If this value is reasonably high then it's very unlikely you ran out of heap (although malloc() can still fail if the size it's trying to allocate is larger than what's available here). The overall "free heap" size is calculated from this value plus the free memory inside the heap's segment. sp-brk is guaranteed to be unfragmented, whereas free space inside the heap may be fragmented.
- arena (aka total_size) - The total number of bytes inside the heap. It should be equal to brk - _heap_start. If that's not the case then the heap is probably corrupt.
- fordblks (aka free_size) - The total number of bytes free in the heap. Usually this is memory that has been allocated and then freed, and not yet allocated again.
- uordblocks (aka used_size) - The total number of bytes currently allocated in the heap.
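As mentioned above, the same statistics the crash dump prints can be queried at runtime through newlib's mallinfo(). A minimal sketch (hypothetical function name):

```c
#include <malloc.h>
#include <stdio.h>

/* Print the allocator statistics that the crash dump labels
 * arena / fordblks / uordblocks. */
void print_heap_stats(void)
{
    struct mallinfo info = mallinfo();
    printf("arena=%u fordblks=%u uordblks=%u\n",
           (unsigned)info.arena, (unsigned)info.fordblks, (unsigned)info.uordblks);
}
```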
The heap info dump doesn't tell you much about heap fragmentation - it's possible to have a lot of free heap, but allocating a chunk of contiguous memory will still fail because all the free memory is fragmented. A clue that this might happen is if the fordblks value is high (ie a lot of the free heap was previously malloc()ed as something else, so the free space might be in fragments). You will always be able to allocate at least sp-brk bytes in one chunk, so if this value is high then fragmentation-related problems are unlikely.
Some regions of the ESP8266 address space (instruction RAM & mapped SPI flash) require 4-byte aligned word reads and writes. Unaligned or sub-word reads/writes will raise LoadStoreError exceptions.
esp-open-rtos includes a LoadStoreError handler (in core/exception_vectors.S) which will automatically "fix" unaligned or non-word reads by replacing them with the necessary aligned word reads. This lets you store string constants and other data in flash memory ("irom") and then read them character by character as if they were in RAM.
esp-open-rtos does not fix LoadStoreErrors caused by unaligned stores to these address spaces.
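For illustration, here is a hypothetical sketch of what the handler makes possible. The base address 0x40200000 is an assumption taken from the standard ESP8266 memory map (the start of the mapped SPI flash region); each byte load below would be a fatal LoadStoreError without the handler:

```c
#include <stdint.h>
#include <stdio.h>

/* Dump the first 16 bytes of mapped SPI flash, one byte at a time.
 * Each read is a non-word load from flash, emulated by the handler. */
void dump_flash_header(void)
{
    const uint8_t *flash = (const uint8_t *)0x40200000;
    for (int i = 0; i < 16; i++) {
        printf("%02x ", flash[i]);
    }
    printf("\n");
}
```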