Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

STM32H7: Bus fault when reading corrupt flash sectors #33140

Closed
ghost opened this issue Mar 8, 2021 · 5 comments · Fixed by #34374
Closed

STM32H7: Bus fault when reading corrupt flash sectors #33140

ghost opened this issue Mar 8, 2021 · 5 comments · Fixed by #34374
Assignees
Labels
Enhancement Changes/Updates/Additions to existing features platform: STM32 ST Micro STM32

Comments

@ghost
Copy link

ghost commented Mar 8, 2021

Describe the bug
The stm32h7 flash driver crashes the system with a bus fault if the data it reads has a double ECC error (DBECCERR).

The stm32h7 flash has an integrated ECC capable of correcting single errors and detecting double errors. The bus fault is expected behavior from the hardware, but very unfortunate behavior for the flash API. It would be much better to return an error code from flash_read().

To Reproduce
You will need a flash word with a double ECC error.

  1. Edit flash_stm32h7x.c and comment out the loop where it says "Check if 256 bits location is erased".
  2. Build Zephyr for e.g. a Nucleo-144 board with the flash shell enabled
  3. Use the flash shell to write the same flash word multiple times using different data. (Normally not allowed, but see step 1).
  4. Use the flash shell to read that flash word. With high probability you should see a bus fault due to a double ECC error.

Expected behavior
The flash_read() API should return an error code when it finds a double ECC error, such as -EIO. It should probably also log the ECC error.

Impact
The bus fault stops mcuboot from working reliably when a firmware upgrade is interrupted in the middle, or when a flash word has developed an ECC error for some other reason.

Attempting to work around it on a different level than the flash driver is complicated, see: mcu-tools/mcuboot#713 (comment)

Logs and console output

uart:~$ flash erase 0x1e0000
Erase success.
uart:~$ flash write 0x1e0000 5555aaaa
Write OK.
Verified.
uart:~$ flash write 0x1e0000 aaaa5555
Write OK.
Verified.
uart:~$ flash read 0x1e0000 4
[00:00:25.082,000] <err> os: ***** BUS FAULT *****
[00:00:25.082,000] <err> os:   Precise data bus error
[00:00:25.082,000] <err> os:   BFAR Address: 0x81e0000
[00:00:25.082,000] <err> os: r0/a1:  0x24007af0  r1/a2:  0x081e0000  r2/a3:  0x00000004
[00:00:25.082,000] <err> os: r3/a4:  0x24007af0 r12/ip:  0x00000008 r14/lr:  0x08026501
[00:00:25.082,000] <err> os:  xpsr:  0x81000200
[00:00:25.082,000] <err> os: Faulting instruction address (r15/pc): 0x080357cc
[00:00:25.082,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:25.082,000] <err> os: Current thread: 0x24004e00 (shell_uart)
[00:00:25.105,000] <err> os: Halting system

Environment (please complete the following information):

  • OS: Debian GNU/Linux
  • Toolchain: cross-compile
  • Commit: dccfe76

Suggested fix
It is possible to stop the bus fault from happening by using BFHFNMIGN and FAULTMASK during the read from the flash. Without the bus fault it is then possible to read the status from the flash controller and translate that to a status code. I have a PoC available here: endiantechnologies@d541116

I'm a bit out of my depth when it comes to fiddling with such low-level ARM Cortex registers, so I think someone more knowledgeable should have a look at the code. Other places in the driver that read from the flash also need to be protected. But at least with the patch an error is returned instead of the system crashing:

uart:~$ flash read 0x1e0000 4
Read ERROR!
[00:00:05.709,000] <wrn> flash_stm32h7: Bank2 ECC error at 0x000e0000
[00:00:05.709,000] <err> flash_stm32h7: Status Bank2: 0x04000000

And I believe that mcuboot is better able to handle this kind of error code than it can handle a bus fault.

@ghost ghost added the bug The issue is a bug, or the PR is fixing a bug label Mar 8, 2021
@nashif nashif added platform: STM32 ST Micro STM32 priority: low Low impact/importance bug labels Mar 8, 2021
@jameswalmsley
Copy link

This looks like a good way to handle this.
I'll will pull your patch and test it out.

@erwango erwango assigned FRASTM and unassigned erwango Mar 31, 2021
@FRASTM
Copy link
Collaborator

FRASTM commented Apr 14, 2021

I made a test on nucleo_h723zg : but I cannot reproduce the issue (when commenting the loop at " Check if 256 bits location is erased" )

uart:~$ flash write 0x1ff00 55 111 67 11 41 65                                                    
Reading back written bytes:                                                                       
37 6f 43 0b | 29 41                                                                               
uart:~$ flash read 0x1ff00 8                                                                      
37 6f 43 0b | 29 41 ff ff                                                                         

@erwango erwango added Enhancement Changes/Updates/Additions to existing features and removed bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug labels Apr 16, 2021
@erwango
Copy link
Member

erwango commented Apr 16, 2021

@weinholtendian I'm mivong this issue to enhancement as I don't see the reported behavior as a bug, though I agree that behavior could (/should) be improved.
Can you push your proposal as a PR and we'll perform a review so it can be integrated.

@ghost
Copy link
Author

ghost commented Apr 16, 2021

@FRASTM This was tested with an STM32H745 Nucleo-144 (NUCLEO-H745ZI-Q). I believe that the H723 also will exhibit the same behavior as otherwise described in this issue. Both reference manuals (RM0399 and RM0468) contain this type of text: When DBECCERR1/2 flag is raised, a bus error is generated. In case of successive double error detections, a bus error is generated each time a new data is sent back to the requester through the AXI interface. However, the method for deliberately creating a double ECC error is just something I improvised and verified on the H745 and it might be that it doesn't work on the H723 (or something is off with my instructions). It will still be possible to encounter double ECC errors in the field due to normal causes for flash failures, or when interrupting an mcuboot upgrade (as mentioned in mcu-tools/mcuboot#713).

@erwango Ack, will do this next week.

@ghost
Copy link
Author

ghost commented Apr 19, 2021

@FRASTM I might have figured out why you couldn't reproduce the bus fault from my instructions. The flash read command probably never went out to the flash device in your case because the data written with flash write was still in the cache. (For various reasons I happened to test this with the MQTT sample and it must have been doing enough things with the system that the cache line got flushed). Could you try again and reset the system after the flash write command?

Updated instructions to get a double ECC error, which can be used to verify that the suggested patch works as intended:

  1. Get an STM32H745 Nucleo-144 (or equivalent).
  2. Modify samples/hello_world/prj.conf:
CONFIG_SHELL=y
CONFIG_LOG=y
CONFIG_FLASH=y
CONFIG_FLASH_SHELL=y
CONFIG_FLASH_PAGE_LAYOUT=y
CONFIG_FLASH_MAP=y
CONFIG_FLASH_MAP_SHELL=y
CONFIG_FLASH_LOG_LEVEL_DBG=y
CONFIG_REBOOT=y
  1. Find "Check if 256 bits location is erased" in drivers/flash/flash_stm32h7x.c and comment out the loop.
  2. Build with west build -b nucleo_h745zi_q_m7 samples/hello_world/.
  3. Flash with west flash.
  4. Run the commands as in the transcript below.
*** Booting Zephyr OS build zephyr-v2.5.0-2336-g8c4ae05131be  ***
Hello World! nucleo_h745zi_q_m7


[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Flash initialized. BS: 32
[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Block 0: bs: 131072 count: 16
[00:00:00.000,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
uart:~$ flash erase 0x1e0000
Erase success.
[00:00:04.937,000] <dbg> flash_stm32h7.flash_stm32h7_erase: Erase offset: 1966080, len: 131072
[00:00:04.937,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
[00:00:05.730,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Enable write protection
uart:~$ flash write 0x1e0000 5555aaaa
Write OK.
Verified.
[00:00:09.178,000] <dbg> flash_stm32h7.flash_stm32h7_write: Write offset: 1966080, len: 4
[00:00:09.178,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
[00:00:09.178,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Enable write protection
[00:00:09.179,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
uart:~$ flash write 0x1e0000 aaaa5555
Write OK.
Verified.
[00:00:13.061,000] <dbg> flash_stm32h7.flash_stm32h7_write: Write offset: 1966080, len: 4
[00:00:13.061,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
[00:00:13.061,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Enable write protection
[00:00:13.061,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
uart:~$ flash read 0x1e0000 4
001E0000: 55 55 aa aa                                      |UU..             |

[00:00:18.578,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
uart:~$ kernel reboot cold�*** Booting Zephyr OS build zephyr-v2.5.0-2336-g8c4ae05131be  ***
Hello World! nucleo_h745zi_q_m7


[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Flash initialized. BS: 32
[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Block 0: bs: 131072 count: 16
[00:00:00.000,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
uart:~$ flash read 0x1e0000 4
[00:00:01.880,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
[00:00:01.880,000] <err> os: ***** BUS FAULT *****
[00:00:01.880,000] <err> os:   Precise data bus error
[00:00:01.880,000] <err> os:   BFAR Address: 0x81e0000
[00:00:01.880,000] <err> os: r0/a1:  0x24002550  r1/a2:  0x081e0000  r2/a3:  0x00000004
[00:00:01.880,000] <err> os: r3/a4:  0x24002550 r12/ip:  0x00000000 r14/lr:  0x08004329
[00:00:01.880,000] <err> os:  xpsr:  0x81000200
[00:00:01.880,000] <err> os: Faulting instruction address (r15/pc): 0x08009e08
[00:00:01.880,000] <err> os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:01.880,000] <err> os: Current thread: 0x240003b8 (shell_uart)
[00:00:01.924,000] <err> os: Halting system

The flash now has a fault that you can use to verify that the patch works. After applying the patch and flashing again (west flash does not erase the fault we created) the double ECC error is reported as -EIO and there is no bus fault:

uart:~$ *** Booting Zephyr OS build zephyr-v2.5.0-2336-g8c4ae05131be  ***
Hello World! nucleo_h745zi_q_m7


[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Flash initialized. BS: 32
[00:00:00.000,000] <dbg> flash_stm32h7.stm32h7_flash_init: Block 0: bs: 131072 count: 16
[00:00:00.000,000] <dbg> flash_stm32h7.flash_stm32h7_write_protection: Disable write protection
uart:~$ flash read 0x1e0000 4
Read ERROR!
[00:00:03.582,000] <dbg> flash_stm32h7.flash_stm32h7_read: Read offset: 1966080, len: 4
[00:00:03.582,000] <wrn> flash_stm32h7: Bank2 ECC error at 0x000e0000
[00:00:03.582,000] <err> flash_stm32h7: Status Bank2: 0x04000000

ghost pushed a commit to endiantechnologies/zephyr that referenced this issue May 4, 2021
The STM32H7x flash has an integrated ECC that can correct single
errors and detect double errors. When a double ECC error is detected,
the DBECCERR1/2 flag is raised and there is a bus fault.

We now mask this bus fault and check the error flags. ECC errors are
logged with the offset of the data. Single ECC errors cause a warning
to be logged and double ECC errors return -EIO.

Fixes zephyrproject-rtos#33140.

Signed-off-by: Göran Weinholt <[email protected]>
galak pushed a commit that referenced this issue May 4, 2021
The STM32H7x flash has an integrated ECC that can correct single
errors and detect double errors. When a double ECC error is detected,
the DBECCERR1/2 flag is raised and there is a bus fault.

We now mask this bus fault and check the error flags. ECC errors are
logged with the offset of the data. Single ECC errors cause a warning
to be logged and double ECC errors return -EIO.

Fixes #33140.

Signed-off-by: Göran Weinholt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Changes/Updates/Additions to existing features platform: STM32 ST Micro STM32
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants