STM32H7: Bus fault when reading corrupt flash sectors #33140

ghost · 2021-03-08T14:09:14Z

Describe the bug
The stm32h7 flash driver crashes the system with a bus fault if the data it reads has a double ECC error (DBECCERR).

The stm32h7 flash has an integrated ECC capable of correcting single errors and detecting double errors. The bus fault is expected behavior from the hardware, but very unfortunate behavior for the flash API. It would be much better to return an error code from flash_read().

To Reproduce
You will need a flash word with a double ECC error.

Edit flash_stm32h7x.c and comment out the loop where it says "Check if 256 bits location is erased".
Build Zephyr for e.g. a Nucleo-144 board with the flash shell enabled
Use the flash shell to write the same flash word multiple times using different data. (Normally not allowed, but see step 1).
Use the flash shell to read that flash word. With high probability you should see a bus fault due to a double ECC error.

Expected behavior
The flash_read() API should return an error code when it finds a double ECC error, such as -EIO. It should probably also log the ECC error.

Impact
The bus fault stops mcuboot from working reliably when a firmware upgrade is interrupted in the middle, or when a flash word has developed an ECC error for some other reason.

Attempting to work around it on a different level than the flash driver is complicated, see: mcu-tools/mcuboot#713 (comment)

Logs and console output

uart:~$ flash erase 0x1e0000
Erase success.
uart:~$ flash write 0x1e0000 5555aaaa
Write OK.
Verified.
uart:~$ flash write 0x1e0000 aaaa5555
Write OK.
Verified.
uart:~$ flash read 0x1e0000 4
[00:00:25.082,000] <err> os: ***** BUS FAULT *****
[00:00:25.082,000] <err> os:   Precise data bus error
[00:00:25.082,000] <err> os:   BFAR Address: 0x81e0000
[00:00:25.082,000] <err> os: r0/a1:  0x24007af0  r1/a2:  0x081e0000  r2/a3:  0x00000004
[00:00:25.082,000] <err> os: r3/a4:  0x24007af0 r12/ip:  0x00000008 r14/lr:  0x08026501
[00:00:25.082,000] <err> os:  xpsr:  0x81000200
[00:00:25.082,000] <err> os: Faulting instruction address (r15/pc): 0x080357cc
[00:00:25.082,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:25.082,000] <err> os: Current thread: 0x24004e00 (shell_uart)
[00:00:25.105,000] <err> os: Halting system

Environment (please complete the following information):

OS: Debian GNU/Linux
Toolchain: cross-compile
Commit: dccfe76

Suggested fix
It is possible to stop the bus fault from happening by using BFHFNMIGN and FAULTMASK during the read from the flash. Without the bus fault it is then possible to read the status from the flash controller and translate that to a status code. I have a PoC available here: endiantechnologies@d541116

I'm a bit out of my depth when it comes to fiddling with such low-level ARM Cortex registers, so I think someone more knowledgeable should have a look at the code. Other places in the driver that read from the flash also need to be protected. But at least with the patch an error is returned instead of the system crashing:

uart:~$ flash read 0x1e0000 4
Read ERROR!
[00:00:05.709,000] <wrn> flash_stm32h7: Bank2 ECC error at 0x000e0000
[00:00:05.709,000] <err> flash_stm32h7: Status Bank2: 0x04000000

And I believe that mcuboot is better able to handle this kind of error code than it can handle a bus fault.

The text was updated successfully, but these errors were encountered:

jameswalmsley · 2021-03-19T23:43:35Z

This looks like a good way to handle this.
I'll will pull your patch and test it out.

FRASTM · 2021-04-14T07:58:58Z

I made a test on nucleo_h723zg : but I cannot reproduce the issue (when commenting the loop at " Check if 256 bits location is erased" )

uart:~$ flash write 0x1ff00 55 111 67 11 41 65                                                    
Reading back written bytes:                                                                       
37 6f 43 0b | 29 41                                                                               
uart:~$ flash read 0x1ff00 8                                                                      
37 6f 43 0b | 29 41 ff ff

erwango · 2021-04-16T07:56:24Z

@weinholtendian I'm mivong this issue to enhancement as I don't see the reported behavior as a bug, though I agree that behavior could (/should) be improved.
Can you push your proposal as a PR and we'll perform a review so it can be integrated.

ghost · 2021-04-16T13:56:02Z

@FRASTM This was tested with an STM32H745 Nucleo-144 (NUCLEO-H745ZI-Q). I believe that the H723 also will exhibit the same behavior as otherwise described in this issue. Both reference manuals (RM0399 and RM0468) contain this type of text: When DBECCERR1/2 flag is raised, a bus error is generated. In case of successive double error detections, a bus error is generated each time a new data is sent back to the requester through the AXI interface. However, the method for deliberately creating a double ECC error is just something I improvised and verified on the H745 and it might be that it doesn't work on the H723 (or something is off with my instructions). It will still be possible to encounter double ECC errors in the field due to normal causes for flash failures, or when interrupting an mcuboot upgrade (as mentioned in mcu-tools/mcuboot#713).

@erwango Ack, will do this next week.

ghost · 2021-04-19T08:14:13Z

@FRASTM I might have figured out why you couldn't reproduce the bus fault from my instructions. The flash read command probably never went out to the flash device in your case because the data written with flash write was still in the cache. (For various reasons I happened to test this with the MQTT sample and it must have been doing enough things with the system that the cache line got flushed). Could you try again and reset the system after the flash write command?

Updated instructions to get a double ECC error, which can be used to verify that the suggested patch works as intended:

Get an STM32H745 Nucleo-144 (or equivalent).
Modify samples/hello_world/prj.conf:

CONFIG_SHELL=y
CONFIG_LOG=y
CONFIG_FLASH=y
CONFIG_FLASH_SHELL=y
CONFIG_FLASH_PAGE_LAYOUT=y
CONFIG_FLASH_MAP=y
CONFIG_FLASH_MAP_SHELL=y
CONFIG_FLASH_LOG_LEVEL_DBG=y
CONFIG_REBOOT=y

Find "Check if 256 bits location is erased" in drivers/flash/flash_stm32h7x.c and comment out the loop.
Build with west build -b nucleo_h745zi_q_m7 samples/hello_world/.
Flash with west flash.
Run the commands as in the transcript below.

*** Booting Zephyr OS build zephyr-v2.5.0-2336-g8c4ae05131be  ***
Hello World! nucleo_h745zi_q_m7


[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Flash initialized. BS: 32
[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Block 0: bs: 131072 count: 16
[00:00:00.000,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
uart:~$ flash erase 0x1e0000
Erase success.
[00:00:04.937,000] <dbg> flash_stm32h7.flash_stm32h7_erase: Erase offset: 1966080, len: 131072
[00:00:04.937,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
[00:00:05.730,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Enable write protection
uart:~$ flash write 0x1e0000 5555aaaa
Write OK.
Verified.
[00:00:09.178,000] <dbg> flash_stm32h7.flash_stm32h7_write: Write offset: 1966080, len: 4
[00:00:09.178,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
[00:00:09.178,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Enable write protection
[00:00:09.179,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
uart:~$ flash write 0x1e0000 aaaa5555
Write OK.
Verified.
[00:00:13.061,000] <dbg> flash_stm32h7.flash_stm32h7_write: Write offset: 1966080, len: 4
[00:00:13.061,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
[00:00:13.061,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Enable write protection
[00:00:13.061,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
uart:~$ flash read 0x1e0000 4
001E0000: 55 55 aa aa                                      |UU..             |

[00:00:18.578,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
uart:~$ kernel reboot cold�*** Booting Zephyr OS build zephyr-v2.5.0-2336-g8c4ae05131be  ***
Hello World! nucleo_h745zi_q_m7


[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Flash initialized. BS: 32
[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Block 0: bs: 131072 count: 16
[00:00:00.000,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
uart:~$ flash read 0x1e0000 4
[00:00:01.880,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
[00:00:01.880,000] <err> os: ***** BUS FAULT *****
[00:00:01.880,000] <err> os:   Precise data bus error
[00:00:01.880,000] <err> os:   BFAR Address: 0x81e0000
[00:00:01.880,000] <err> os: r0/a1:  0x24002550  r1/a2:  0x081e0000  r2/a3:  0x00000004
[00:00:01.880,000] <err> os: r3/a4:  0x24002550 r12/ip:  0x00000000 r14/lr:  0x08004329
[00:00:01.880,000] <err> os:  xpsr:  0x81000200
[00:00:01.880,000] <err> os: Faulting instruction address (r15/pc): 0x08009e08
[00:00:01.880,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:01.880,000] <err> os: Current thread: 0x240003b8 (shell_uart)
[00:00:01.924,000] <err> os: Halting system

The flash now has a fault that you can use to verify that the patch works. After applying the patch and flashing again (west flash does not erase the fault we created) the double ECC error is reported as -EIO and there is no bus fault:

uart:~$ *** Booting Zephyr OS build zephyr-v2.5.0-2336-g8c4ae05131be  ***
Hello World! nucleo_h745zi_q_m7


[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Flash initialized. BS: 32
[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Block 0: bs: 131072 count: 16
[00:00:00.000,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
uart:~$ flash read 0x1e0000 4
Read ERROR!
[00:00:03.582,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
[00:00:03.582,000] <wrn> flash_stm32h7: Bank2 ECC error at 0x000e0000
[00:00:03.582,000] <err> flash_stm32h7: Status Bank2: 0x04000000

The STM32H7x flash has an integrated ECC that can correct single errors and detect double errors. When a double ECC error is detected, the DBECCERR1/2 flag is raised and there is a bus fault. We now mask this bus fault and check the error flags. ECC errors are logged with the offset of the data. Single ECC errors cause a warning to be logged and double ECC errors return -EIO. Fixes zephyrproject-rtos#33140. Signed-off-by: Göran Weinholt <[email protected]>

The STM32H7x flash has an integrated ECC that can correct single errors and detect double errors. When a double ECC error is detected, the DBECCERR1/2 flag is raised and there is a bus fault. We now mask this bus fault and check the error flags. ECC errors are logged with the offset of the data. Single ECC errors cause a warning to be logged and double ECC errors return -EIO. Fixes #33140. Signed-off-by: Göran Weinholt <[email protected]>

ghost added the bug The issue is a bug, or the PR is fixing a bug label Mar 8, 2021

nashif assigned erwango Mar 8, 2021

nashif added platform: STM32 ST Micro STM32 priority: low Low impact/importance bug labels Mar 8, 2021

ghost mentioned this issue Mar 18, 2021

Multiple SoCs have flash configurations unsupported by MCUboot mcu-tools/mcuboot#713

Open

erwango assigned FRASTM and unassigned erwango Mar 31, 2021

erwango added Enhancement Changes/Updates/Additions to existing features and removed bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug labels Apr 16, 2021

ghost mentioned this issue Apr 19, 2021

drivers: flash: stm32h7x: handle ECC errors on read #34374

Merged

galak closed this as completed in #34374 May 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

STM32H7: Bus fault when reading corrupt flash sectors #33140

STM32H7: Bus fault when reading corrupt flash sectors #33140

ghost commented Mar 8, 2021

jameswalmsley commented Mar 19, 2021

FRASTM commented Apr 14, 2021 •

edited

Loading

erwango commented Apr 16, 2021 •

edited

Loading

ghost commented Apr 16, 2021

ghost commented Apr 19, 2021

STM32H7: Bus fault when reading corrupt flash sectors #33140

STM32H7: Bus fault when reading corrupt flash sectors #33140

Comments

ghost commented Mar 8, 2021

jameswalmsley commented Mar 19, 2021

FRASTM commented Apr 14, 2021 • edited Loading

erwango commented Apr 16, 2021 • edited Loading

ghost commented Apr 16, 2021

ghost commented Apr 19, 2021

FRASTM commented Apr 14, 2021 •

edited

Loading

erwango commented Apr 16, 2021 •

edited

Loading