Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add more locking for shared SPI devices #5595

Merged
merged 11 commits into from
Dec 31, 2024
Merged

add more locking for shared SPI devices #5595

merged 11 commits into from
Dec 31, 2024

Conversation

caveman99
Copy link
Member

No description provided.

@mverch67
Copy link
Collaborator

mverch67 commented Dec 17, 2024

Looks good. I like to try that code with the T-Deck to see if it fixes the easily reproducible crash at second 35 after startup.

@caveman99
Copy link
Member Author

It was not always easy to not lock twice. if we have a situation where the firmware freezes, i probably overlooked something.

@mverch67
Copy link
Collaborator

First try:

[18:19:14.500] Connected to /dev/ttyACM0
��@INFO  | ??:??:?? 1 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 1 Booted, wake cause 0 (boot count 1), reset_reason=reset

[18:19:14.889] Disconnected
[18:19:15.890] Warning: Could not open /dev/ttyACM0 (No such file or directory)
[18:19:15.890] Waiting for tty device..
[18:19:16.893] Connected to /dev/ttyACM0
��@INFO  | ??:??:?? 2 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 2 Booted, wake cause 0 (boot count 1), reset_reason=reset

[18:19:17.766] Disconnected
[18:19:18.768] Connected to /dev/ttyACM0
��@INFO  | ??:??:?? 1 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 1 Booted, wake cause 0 (boot count 1), reset_reason=reset

[18:19:19.155] Disconnected
[18:19:20.156] Warning: Could not open /dev/ttyACM0 (No such file or directory)
[18:19:20.157] Waiting for tty device..
[18:19:21.159] Connected to /dev/ttyACM0
��@INFO  | ??:??:?? 2 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 2 Booted, wake cause 0 (boot count 1), reset_reason=reset

[18:19:22.036] Disconnected
[18:19:23.038] Connected to /dev/ttyACM0
��@INFO  | ??:??:?? 1 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 1 Booted, wake cause 0 (boot count 1), reset_reason=reset

[18:19:23.425] Disconnected

@mverch67
Copy link
Collaborator

INFO  | ??:??:?? 0 Booted, wake cause 0 (boot count 1), reset_reason=reset
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x4209eba5  PS      : 0x00060630  A0      : 0x820a24f7  A1      : 0x3fcebee0  
A2      : 0x00000000  A3      : 0x00000002  A4      : 0x00000000  A5      : 0x3fced294  
A6      : 0x00000000  A7      : 0x00000000  A8      : 0x3fca8324  A9      : 0x00000001  
A10     : 0x3fcecdc0  A11     : 0xffffffff  A12     : 0x00000005  A13     : 0x0000ff00  
A14     : 0x00ff0000  A15     : 0xff000000  SAR     : 0x00000015  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x400556d5  LEND    : 0x400556e5  LCOUNT  : 0xfffffffe  


Backtrace: 0x4209eba2:0x3fcebee0 0x420a24f4:0x3fcebf00 0x42094bfa:0x3fcebf20 0x42007da2:0x3fcebff0

ELF file SHA256: b1bab3be7c1f3e96


0x4209eba2: concurrency::Lock::lock() at /home/manuel/Documents/PlatformIO/Projects/firmware/src/concurrency/Lock.cpp:16
0x420a24f4: fsInit() at /home/manuel/Documents/PlatformIO/Projects/firmware/src/FSCommon.cpp:345
0x42094bfa: setup() at /home/manuel/Documents/PlatformIO/Projects/firmware/src/main.cpp:368
0x42007da2: loopTask(void*) at /home/manuel/.platformio/packages/framework-arduinoespressif32/cores/esp32/main.cpp:42

@mverch67
Copy link
Collaborator

got worse :(

0x40377f5e: panic_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_system/panic.c:408
0x403819f5: esp_system_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_system/esp_system.c:137
0x40388f81: __assert_func at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/newlib/assert.c:85
0x4209ede6: initSPI() at /home/manuel/Documents/PlatformIO/Projects/firmware/src/SPILock.cpp:10
0x4209509a: setup() at /home/manuel/Documents/PlatformIO/Projects/firmware/src/main.cpp:650
0x42007da2: loopTask(void*) at /home/manuel/.platformio/packages/framework-arduinoespressif32/cores/esp32/main.cpp:42

@mverch67
Copy link
Collaborator

mverch67 commented Dec 17, 2024

uh... pressed CTRL-C instead of CTRL-V to insert, so after copy again initSPI was there twice... now fixed it and it hangs here:

DEBUG | ??:??:?? 1 No SD_MMC card detected
INFO  | ??:??:?? 1 S:B:50,2.5.17.a3bc0297
DEBUG | ??:??:?? 1 Total heap: 281992
DEBUG | ??:??:?? 1 Free heap: 246292
DEBUG | ??:??:?? 1 Total PSRAM: 8386295
DEBUG | ??:??:?? 1 Free PSRAM: 8386295
DEBUG | ??:??:?? 1 NVS: UsedEntries 203, FreeEntries 427, AllEntries 630, NameSpaces 8
DEBUG | ??:??:?? 1 Setup Preferences in Flash Storage
DEBUG | ??:??:?? 1 Number of Device Reboots: 19224
INFO  | ??:??:?? 1 No OTA firmware available
INFO  | ??:??:?? 1 Init NodeDB

classical double-lock checkmate situation

@GUVWAF
Copy link
Member

GUVWAF commented Dec 17, 2024

reproducible crash at second 35 after startup.

This sounds like it's caused by the initial NodeInfo transmit scheduled 30 seconds after booting.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 17, 2024

@mverch67 Just pushed a commit that might fix the shared SPI access issue when transmitting.

Nevermind, also results in double-lock, but there should also be locking around spiTransfer() as RadioLib calls that directly.

@caveman99
Copy link
Member Author

Nevermind, also results in double-lock, but there should also be locking around spiTransfer() as RadioLib calls that directly.

spiBeginTransaction() and spiEndTransaction() are locked/released in our own Locking HAL Wrapper underneath radiolib. Top of RadioLibInterface.cpp

@GUVWAF
Copy link
Member

GUVWAF commented Dec 17, 2024

Yes, indeed, and it looks like RadioLib is always calling spiTransfer() in between those calls, so that looks good already.

@caveman99
Copy link
Member Author

aha. found it. rmDIr is locking itself, no need to wrap it at the start of loadFromDisk

@mverch67
Copy link
Collaborator

no change, still hangs at Init NodeDB

@mverch67
Copy link
Collaborator

reproducible crash at second 35 after startup.

This sounds like it's caused by the initial NodeInfo transmit scheduled [30 seconds]

Yes, that is the one I mean't. It hangs when touching the screen at the same time in 50% of the cases. While touch is I2C based the display redraw uses SPI. I already replaced the touch driver (not using lovyanGFX) a while ago and it got a lot better. The UI thread is completely locking SPI during execution of its entire loop, but still something corrupts the SPI bus so that radiolib reads garbage and asserts.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 18, 2024

Just added various missing unlock()s for early returns.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 18, 2024

Picked these changes to master, but still no luck. It hangs at pb_decode() in loadProto().

@mverch67
Copy link
Collaborator

mverch67 commented Dec 18, 2024

Just added various missing unlock()s for early returns.

In cases were there's only one lock in a method just use the C++ version of the locking because then C++ does the unlock automatically.
Or can use C++ scopes { ... } if there are multiple locks required in one function.

@GUVWAF
Copy link
Member

GUVWAF commented Dec 19, 2024

The reason was that in readcb(), it tried to acquire a lock, while in loadProto() it has acquired it already, so I removed them from readcb() (which was also used in NAU802Sensor where it acquired it also already). Also added locks to SafeFile, so they can be removed from saveProto(), which acquired a lock before calling SafeFile’s close() that called renameFile(). Also replaced some of the lock/unlocks with the LockGuard.

Now it boots successfully (on master with a T-Beam), but please check my changes.

@fifieldt
Copy link
Contributor

Tested on a T114 and it seems happy.

@mverch67
Copy link
Collaborator

T-Deck still hangs:

INFO  | ??:??:?? 2 3 I2C devices found
DEBUG | ??:??:?? 2 acc_info = 0
DEBUG | ??:??:?? 2 No SD_MMC card detected
INFO  | ??:??:?? 2 S:B:50,2.5.17.05898017
DEBUG | ??:??:?? 2 Total heap: 270112
DEBUG | ??:??:?? 2 Free heap: 227596
DEBUG | ??:??:?? 2 Total PSRAM: 8386295
DEBUG | ??:??:?? 2 Free PSRAM: 8386295
DEBUG | ??:??:?? 2 NVS: UsedEntries 203, FreeEntries 427, AllEntries 630, NameSpaces 8
DEBUG | ??:??:?? 2 Setup Preferences in Flash Storage
DEBUG | ??:??:?? 2 Number of Device Reboots: 19339
INFO  | ??:??:?? 2 No OTA firmware available
INFO  | ??:??:?? 2 Init NodeDB

@mverch67
Copy link
Collaborator

It now works on T-Deck but does not solve the sporadic radioLib crash at second 34:

DEBUG | ??:??:?? 33 [RadioIf] Started Tx (id=0xa9a49d2b fr=0x1895ed48 to=0xffffffff, WantAck=0, HopLim=3 Ch=0x8 encrypted hopStart=3 priority=10)
DEBUG | ??:??:?? 33 [RadioIf] Packet TX: 968ms
DEBUG | ??:??:?? 34 [RadioIf] Completed sending (id=0xa9a49d2b fr=0x1895ed48 to=0xffffffff, WantAck=0, HopLim=3 Ch=0x8 encrypted hopStart=3 priority=10)
DEBUG | ??:??:?? 34 [RadioIf] SX126x standby RadioLib err=-2

or later

DEBUG | ??:??:?? 51 [RadioIf] Started Tx (id=0x0207ae0d fr=0x1895ed48 to=0xffffffff, WantAck=0, HopLim=3 Ch=0x8 encrypted hopStart=3 priority=10)
DEBUG | ??:??:?? 51 [RadioIf] Packet TX: 575ms
DEBUG | ??:??:?? 52 [RadioIf] Completed sending (id=0x0207ae0d fr=0x1895ed48 to=0xffffffff, WantAck=0, HopLim=3 Ch=0x8 encrypted hopStart=3 priority=10)
ERROR | ??:??:?? 52 [RadioIf] SX126X startReceiveDutyCycleAuto RadioLib err=-707

@GUVWAF
Copy link
Member

GUVWAF commented Dec 26, 2024

The SPI transfers for RadioLib are within the lock, but it seems in standby() (which is also called in startReceive()) it also pulls the CS pin low before that:
https://github.com/jgromes/RadioLib/blob/c37015ef9467404d47261f8f4e737aadc8049009/src/modules/SX126x/SX126x.cpp#L471

I don’t have a T-Deck to test, but @mverch67 maybe you could try to add locking around this call also (think it would work by calling this->hal->spiBeginTransaction(); and this->hal->spiEndTransaction() around it)?

@mverch67
Copy link
Collaborator

mverch67 commented Dec 26, 2024

The SPI transfers for RadioLib are within the lock, but it seems in standby() (which is also called in startReceive()) it also pulls the CS pin low before that: https://github.com/jgromes/RadioLib/blob/c37015ef9467404d47261f8f4e737aadc8049009/src/modules/SX126x/SX126x.cpp#L471

Great find! I think this is the spot that I was looking for all the time and which causes the trouble.
Setting CS pin and doing SPI transfer must be within one transaction so your proposal is likely not working as desired.

Now I removed the spi locks from spiBeginTransaction() and spiEndTransaction() and replaced these with locks around all calls to iface->XXX() in RadioLibInterface.cpp and lora.YYY() in SX126x.cpp.

Not crashing anymore at second 34!

@GUVWAF
Copy link
Member

GUVWAF commented Dec 26, 2024

Ah, yes, it should be locking around both the setting of the CS pin and the transfer.

Great to hear it's fixed!
Maybe it can be better fixed e.g. by an SPI lock callback from within RadioLib, but your proposal also works. Edit: opened a discussion for this: jgromes/RadioLib#1364

@GUVWAF
Copy link
Member

GUVWAF commented Dec 28, 2024

When we update to the latest commit of RadioLib, it should solve the sporadic SPI errors when transmitting.
This PR is still needed when using an SD card, so probably good to merge in.

Copy link
Collaborator

@mverch67 mverch67 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, though for ContentHandler.cpp I'd rather use the C++ variant of the locking as it is much easier to use and to maintain (think of others adding features in this file).

@caveman99 caveman99 merged commit b9e6de2 into tft-gui-work Dec 31, 2024
4 checks passed
@caveman99 caveman99 deleted the tft-spilock branch December 31, 2024 12:31
fifieldt pushed a commit to fifieldt/meshtastic-firmware that referenced this pull request Jan 3, 2025
* add more locking for shared SPI devices
* call initSPI before the lock is used
* remove old one
* don't double lock
* Add missing unlock
* More missing unlocks
* Add locks to SafeFile, remove from `readcb`, introduce some LockGuards
* fix lock in setupSDCard()
* pull radiolib trunk with SPI-CS fixes
* change ContentHandler to Constructor type locks, where applicable

---------

Co-authored-by: mverch67 <[email protected]>
Co-authored-by: GUVWAF <[email protected]>
Co-authored-by: Manuel <[email protected]>
fifieldt pushed a commit to fifieldt/meshtastic-firmware that referenced this pull request Jan 3, 2025
* add more locking for shared SPI devices
* call initSPI before the lock is used
* remove old one
* don't double lock
* Add missing unlock
* More missing unlocks
* Add locks to SafeFile, remove from `readcb`, introduce some LockGuards
* fix lock in setupSDCard()
* pull radiolib trunk with SPI-CS fixes
* change ContentHandler to Constructor type locks, where applicable

---------

Co-authored-by: mverch67 <[email protected]>
Co-authored-by: GUVWAF <[email protected]>
Co-authored-by: Manuel <[email protected]>
fifieldt added a commit that referenced this pull request Jan 3, 2025
* add more locking for shared SPI devices (#5595)

* add more locking for shared SPI devices
* call initSPI before the lock is used
* remove old one
* don't double lock
* Add missing unlock
* More missing unlocks
* Add locks to SafeFile, remove from `readcb`, introduce some LockGuards
* fix lock in setupSDCard()
* pull radiolib trunk with SPI-CS fixes
* change ContentHandler to Constructor type locks, where applicable

---------

Co-authored-by: mverch67 <[email protected]>
Co-authored-by: GUVWAF <[email protected]>
Co-authored-by: Manuel <[email protected]>

* mesh-tab: lower I2C touch frequency

---------

Co-authored-by: Thomas Göttgens <[email protected]>
Co-authored-by: mverch67 <[email protected]>
Co-authored-by: GUVWAF <[email protected]>
Co-authored-by: Manuel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants