
Indeterministic hard fault in _mutex_lock(), with nRF52 SoftDevice #10122

Closed
pekkanikander opened this issue Oct 7, 2018 · 6 comments
Labels
Area: BLE Area: Bluetooth Low Energy support Area: core Area: RIOT kernel. Handle PRs marked with this with care! Area: pkg Area: External package ports Platform: ARM Platform: This PR/issue effects ARM-based platforms Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@pekkanikander
Contributor

pekkanikander commented Oct 7, 2018

Description

With #9473, which updated the support for the nRF52 SoftDevice, the way interrupts are re-enabled in _mutex_lock() seems to cause indeterministic hard faults.

Steps to reproduce the issue

The problem is hard to reproduce. We saw it a few weeks ago while working on our own nRF52 board under relatively heavy I/O and Bluetooth load. Applying the change proposed below increased stability a lot, but not completely. (There also seems to be a thread-deadlock issue triggered in _xtimer_tsleep, likewise related to _mutex_lock, but it is not yet clear enough to me even to describe.)

I will try to reproduce the problem on an nRF52 DK so that others can debug it too, but I think it is better to document it now and start a discussion before that. Hence this issue.

Expected results

RIOT should not cause hard faults.

Actual results

Indeterministic crashes, with a hard fault triggered on line 62 of mutex.c:
https://github.com/RIOT-OS/RIOT/blob/master/core/mutex.c#L62

Versions

arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]

Suggested fix

Change the order in which interrupts are enabled again and higher priority threads are allowed to run:

   --- a/core/mutex.c
   +++ b/core/mutex.c
   @@ -58,8 +58,8 @@ int _mutex_lock(mutex_t *mutex, int blocking)
            else {
                thread_add_to_list(&mutex->queue, me);
            }
   -        irq_restore(irqstate);
            thread_yield_higher();
   +        irq_restore(irqstate);
            /* We were woken up by scheduler. Waker removed us from queue.
             * We have the mutex now. */
            return 1;

Discussion

For Cortex-M series, thread_yield_higher() simply triggers the PendSV interrupt:
https://github.com/RIOT-OS/RIOT/blob/master/cpu/cortexm_common/thread_arch.c#L291

In RIOT, if a thread calls _mutex_lock, the PendSV interrupt takes place immediately, since by default all interrupts are on the same priority level; see #3511.

Unless I understand something completely wrong, in the current version there is a time window, between when the IRQs are enabled and when isr_pendsv is triggered, during which other interrupts may be served. An interrupt routine may then change the state of this mutex. At that point, the scheduler has not yet switched away from the calling thread's context, but the thread's status has already been changed to STATUS_MUTEX_BLOCKED.

Now, if someone triggers SVC, e.g. calls cpu_switch_context_exit triggering isr_svc, the scheduler will be entered while the blocking thread's context has not been saved yet. I have no idea whether that ever happens. Note, though, that Nordic uses the SVC interrupt heavily, although they should catch all of those in their trampoline, I think.

If the order of the two lines is changed as suggested above, there are no functional effects on Cortex-M. However, the timing of when interrupts run relative to the scheduler does change. As a side effect, this fixes the hard faults on nRF52 when the SoftDevice is used, for reasons I don't understand.

I am not familiar with the other supported CPU architectures, and therefore I am opening this as an issue, as changing the order of the lines may have adverse effects on other CPU architectures.

@miri64 miri64 added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Platform: ARM Platform: This PR/issue effects ARM-based platforms Area: core Area: RIOT kernel. Handle PRs marked with this with care! Area: pkg Area: External package ports Area: BLE Area: Bluetooth Low Energy support labels Oct 8, 2018
@kaspar030
Contributor

Oh the joys of debugging system level blobs...

I'm a little bit reluctant to change this before it is fully understood why it makes something more stable.

The irq_disable/enable pair is used to protect some critical code path. After that, "thread_yield_higher()" is supposed to switch away from the calling thread, which has been marked as not being on a runqueue anymore. If that machinery doesn't work as expected, many other assumptions within RIOT's core will probably fail, too.

I don't really see why line 62 (thread_yield_higher()) should raise a hard fault. Maybe this is a stack-size problem? Can you try increasing the ISR stack size to something huge?

there is a time window, between when the IRQs are enabled and when isr_pendsv is triggered, during which other interrupts may be served. An interrupt routine may then change the state of this mutex. At that point, the scheduler has not yet switched away from the calling thread's context, but the thread's status has already been changed to STATUS_MUTEX_BLOCKED.

That is not a problem at all. The worst that could happen is that the scheduler runs multiple times, which doesn't hurt. The status "just" states that the thread should not be put on a runqueue anymore.

Now, if someone triggers SVC, e.g. calls cpu_switch_context_exit triggering isr_svc, the scheduler will be entered while the blocking thread's context has not been saved yet. I have no idea whether that ever happens. Note, though, that Nordic uses the SVC interrupt heavily, although they should catch all of those in their trampoline, I think.

Does your code use SVC / switch_context_exit? Usually that's used once to start threading, otherwise only when threads actually end.

To me this smells like either something unrelated (e.g., stack size) or an interrupt prioritisation issue.
Maybe it would make sense to check whether all IRQs actually have the correct one? All of them that are not hijacked by the SoftDevice need to be set to priority 6.
At some point, cortexm_init() in combination with a default priority of 6 did that. Apparently in current master cortexm_init is skipped, so maybe the IRQ priorities are wrong?

Can you get the exact hard fault reason?

@pekkanikander
Contributor Author

[It will take a few days before I'll have better time to focus on this (and all my other RIOT issues, including one PR on interrupt priorities that I'm still preparing.)]

I completely agree that it may be wise not to change this. And if the change is made, IMHO it should first be applied to Cortex-M, then (after a few months without complaints) to other ARM, and then, maybe and only maybe, the rest. For Cortex-M, the order should not really matter. But apparently it does. And, arguably, my suggested order would be slightly more efficient on Cortex-M, avoiding some context switches.

I don't really see why line 62 (thread_yield_higher()) should raise a hard fault. Maybe this is a stack-size problem? Can you try increasing the ISR stack size to something huge?

Once I get back to triggering the hard fault (and hopefully more often/deterministically), I'll have a closer look at it. From memory, it looked as if the scheduler tried to schedule a thread that didn't exist / was badly restored, causing a hard fault at the end of isr_pendsv / isr_svc, at the instruction following the final bx r0. This then looked to core_panic as if the hard fault had happened at line 62. But that impression may be completely wrong. And yes, this could also be a stack-size problem, as currently the SD callbacks run on whatever thread's stack happens to be active when they are triggered. However, this did happen with all stacks at the minimum of 1k.

(Changing the nRF-side code so that SD callbacks run on the ISR stack, and on the RIOT side on a separate thread defined in ble-core.c, is on my todo list.)

Does your code use SVC / switch_context_exit? Usually that's used once to start threading, otherwise only when threads actually end.

My code does not use SVC or cpu_switch_context_exit. But in nRF SDK 15 all the SD calls are made through SVC. One potential cause of the crashes might be that some of those SVCs "leak" from the SD trampoline to the RIOT side. I'll test this hypothesis, too, once I get there. However, I have a hard time seeing how changing the order of those two lines would help with that.

Furthermore, I'm preparing a separate PR on interrupt priorities, as moving PendSV to a lower subpriority was also needed to completely stop the crashes we were experiencing. However, enabling subpriorities is a clear case that a) can easily be made a configuration option and b) would also be useful for people other than nRF52 SD users. Hence the separate PR.

After these two changes (this one, i.e. the order of the lines, and the PendSV subpriority), our system has been free of crashes. However, we still experience thread lockups. That is, each of our three IO-bound threads locks up at some point, some more often than others. It usually takes tens of minutes before any of them locks up, sometimes hours, rarely just minutes. Everything else continues smoothly. The lockups are apparently related to the same issues. I'll get to those later, once this one and the PendSV priorities are in better shape.

@pekkanikander
Contributor Author

I don't think there is any realistic chance of getting this fixed for this release. We don't even have a reliable, easily available test case yet. Please drop it from this release.

@chrysn
Member

chrysn commented May 10, 2021

Is there any indication that this stems from anything other than the SoftDevice integration / interaction?

Otherwise, I'd close the bug given that softdevice package support was removed.

@pekkanikander
Contributor Author

If nobody else has seen anything even remotely like this during the last two and half years, it indeed may be best to close this.

@chrysn
Member

chrysn commented May 10, 2021

Closing under the assumption that this was most likely happening due to SD doing things in ISRs that were incompatible with RIOT's ISR conventions.

@chrysn chrysn closed this as completed May 10, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2020.10 milestone Jul 15, 2021
@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jul 15, 2021