
Indeterministic hard fault in _mutex_lock(), with nRF52 SoftDevice #10122

Closed
pekkanikander opened this issue Oct 7, 2018 · 6 comments
Labels
Area: BLE Area: Bluetooth Low Energy support Area: core Area: RIOT kernel. Handle PRs marked with this with care! Area: pkg Area: External package ports Platform: ARM Platform: This PR/issue effects ARM-based platforms Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@pekkanikander
Contributor

pekkanikander commented Oct 7, 2018

Description

With #9473, which updated the support for the nRF52 SoftDevice, the way interrupts are re-enabled in _mutex_lock() seems to cause indeterministic hard faults.

Steps to reproduce the issue

The problem is hard to reproduce. We saw it a few weeks ago while working on our own nRF52 board under relatively heavy I/O and Bluetooth load. Applying the change proposed below increased stability a lot, but not completely. (There also seems to be a thread-deadlock issue triggered in _xtimer_tsleep, likewise related to _mutex_lock, but it is not yet clear enough to me even to describe.)

I will try to reproduce the problem on an nRF52 DK so that others can debug it too, but I think it is better to document it now and start a discussion before that. Hence this issue.

Expected results

RIOT should not cause hard faults.

Actual results

Indeterministic crashes, with a hard fault triggered on line 62 of mutex.c:
https://github.com/RIOT-OS/RIOT/blob/master/core/mutex.c#L62

Versions

arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]

Suggested fix

Change the order in which interrupts are enabled again and higher priority threads are allowed to run:

   --- a/core/mutex.c
   +++ b/core/mutex.c
   @@ -58,8 +58,8 @@ int _mutex_lock(mutex_t *mutex, int blocking)
            else {
                thread_add_to_list(&mutex->queue, me);
            }
   -        irq_restore(irqstate);
            thread_yield_higher();
   +        irq_restore(irqstate);
            /* We were woken up by scheduler. Waker removed us from queue.
             * We have the mutex now. */
            return 1;

Discussion

For Cortex-M series, thread_yield_higher() simply triggers the PendSV interrupt:
https://github.com/RIOT-OS/RIOT/blob/master/cpu/cortexm_common/thread_arch.c#L291

In RIOT, if a thread calls _mutex_lock, the PendSV interrupt takes place immediately, since by default all interrupts are on the same priority level; see #3511.

Unless I understand something completely wrong, in the current version there is a time window, between when the IRQs are enabled and when isr_pendsv is triggered, during which other interrupts may be served. An interrupt routine may then change the state of this mutex. At that point, the scheduler has not yet switched away from the calling thread's context, but the thread's status has already been changed to STATUS_MUTEX_BLOCKED.

Now, if someone triggers SVC, e.g. calls cpu_switch_context_exit triggering isr_svc, the scheduler will be entered while the blocking thread's context has not been saved yet. I have no idea whether that ever happens. Note, though, that Nordic uses the SVC interrupt heavily, although they should catch all of those in their trampoline, I think.

If the order of the two lines is changed as suggested above, there are no functional effects on Cortex-M. However, the timing of when interrupts run relative to the scheduler does change. As a side effect, this fixes the hard faults on nRF52 when the SoftDevice is used, for reasons I don't understand.

I am not familiar with the other supported CPU architectures, and therefore I am opening this as an issue, as changing the order of the lines may have adverse effects on other CPU architectures.

@miri64 miri64 added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Platform: ARM Platform: This PR/issue effects ARM-based platforms Area: core Area: RIOT kernel. Handle PRs marked with this with care! Area: pkg Area: External package ports Area: BLE Area: Bluetooth Low Energy support labels Oct 8, 2018
@kaspar030
Contributor

Oh the joys of debugging system level blobs...

I'm a little bit reluctant to change this before it is fully understood why it makes something more stable.

The irq_disable/enable pair is used to protect some critical code path. After that, "thread_yield_higher()" is supposed to switch away from the calling thread, which has been marked as not being on a runqueue anymore. If that machinery doesn't work as expected, many other assumptions within RIOT's core will probably fail, too.

I don't really see why line 62 (thread_yield_higher()) should raise a hard fault. Maybe this is a stack-size problem? Can you try increasing the ISR stack size to something huge?

there is a time window, between when the IRQs are enabled and when isr_pendsv is triggered, during which other interrupts may be served. An interrupt routine may then change the state of this mutex. At that point, the scheduler has not yet switched away from the calling thread's context, but the thread's status has already been changed to STATUS_MUTEX_BLOCKED.

That is not a problem at all. The worst that could happen is that the scheduler runs multiple times, which doesn't hurt. The status "just" states that the thread should not be put on a runqueue anymore.

Now, if someone triggers SVC, e.g. calls cpu_switch_context_exit triggering isr_svc, the scheduler will be entered while the blocking thread's context has not been saved yet. I have no idea whether that ever happens. Note, though, that Nordic uses the SVC interrupt heavily, although they should catch all of those in their trampoline, I think.

Does your code use SVC / switch_context_exit? Usually that's used once to start threading, otherwise only when threads actually end.

To me this smells like either something unrelated (e.g., stack size) or an interrupt prioritisation issue.
Maybe it would make sense to check whether all IRQs actually have the correct one? All of them that are not hijacked by the SoftDevice need to be set to priority 6.
At some point, cortexm_init() in combination with a default priority of 6 did that. Apparently in current master cortexm_init is skipped, so maybe the IRQ priorities are wrong?

Can you get the exact hard fault reason?

@pekkanikander
Contributor Author

[It will take a few days before I'll have better time to focus on this (and all my other RIOT issues, including one PR on interrupt priorities that I'm still preparing.)]

I completely agree that it may be wise not to change this. And if the change is made, IMHO it should first be applied to Cortex-M, then (after a few months without complaints) to other ARM, and then, maybe and only maybe, the rest. For Cortex-M, the order should not really matter. But apparently it does. And, arguably, my suggested order would be slightly more efficient on Cortex-M, avoiding some context switches.

I don't really see why line 62 (thread_yield_higher()) should raise a hard fault. Maybe this is a stack-size problem? Can you try increasing the ISR stack size to something huge?

Once I get back to triggering the hard fault (and hopefully more often/deterministically), I'll have a closer look at it. From memory, it looked as if the scheduler tried to schedule a thread that didn't exist / was badly restored, causing a hard fault at the end of isr_pendsv / isr_svc, at the instruction following the final bx r0. This then looked to core_panic as if the hard fault had happened at line 62. But that impression may be completely wrong. And yes, this could also be a stack-size problem, as currently the SD callbacks run on whatever thread's stack happens to be active when they are triggered. However, this did happen with all stacks at the minimum of 1k.

(Changing the nRF-side code so that SD callbacks run on the ISR stack, and on the RIOT side on a separate thread defined in ble-core.c, is on my todo list.)

Does your code use SVC / switch_context_exit? Usually that's used once to start threading, otherwise only when threads actually end.

My code does not use SVC or cpu_switch_context_exit. But in nRF SDK 15 all the SD calls are made through SVC. One potential cause of the crashes might be that some of those SVCs "leak" from the SD trampoline to the RIOT side. I'll test this hypothesis, too, once I get there. However, I have a hard time seeing how changing the order of those two lines would help with that.

Furthermore, I'm preparing a separate PR on interrupt priorities, as moving PendSV to a lower subpriority was also needed to completely stop the crashes we were experiencing. However, enabling subpriorities is a clear case that a) can easily be made a configuration option and b) would also be useful for people other than nRF52 SD users. Hence the separate PR.

After these two changes (this one, i.e. the order of the lines, and the PendSV subpriority), our system has been free of crashes. However, we still experience thread lockups. That is, each of our three IO-bound threads locks up at some point, some more often than others. It usually takes tens of minutes before any of them locks up, sometimes hours, rarely just minutes. Everything else continues smoothly. The lockups are apparently related to the same issues. I'll get to those later, once this one and the PendSV priorities are in better shape.

@pekkanikander
Contributor Author

I don't think there is any realistic chance of getting this fixed for this release. We don't even have a reliable, easily available test case yet. Please drop it from this release.

@chrysn
Member

chrysn commented May 10, 2021

Is there any indication that this stems from anything other than the SoftDevice integration / interaction?

Otherwise, I'd close the bug given that softdevice package support was removed.

@pekkanikander
Contributor Author

If nobody else has seen anything even remotely like this during the last two and half years, it indeed may be best to close this.

@chrysn
Member

chrysn commented May 10, 2021

Closing under the assumption that this was most likely happening due to SD doing things in ISRs that were incompatible with RIOT's ISR conventions.

@chrysn chrysn closed this as completed May 10, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2020.10 milestone Jul 15, 2021
@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jul 15, 2021