Indeterministic hard fault in _mutex_lock(), with nRF52 SoftDevice #10122
Comments
Oh the joys of debugging system-level blobs... I'm a little bit reluctant to change this before it is fully understood why it makes something more stable. The irq_disable/enable pair is used to protect some critical code path. After that, "thread_yield_higher()" is supposed to switch away from the calling thread, which has been marked as not being on a runqueue anymore. If that machinery doesn't work as expected, many other assumptions within RIOT's core will probably fail, too. I don't really see why line 62 (thread_yield_higher()) should raise a hard fault. Maybe this is a stack size problem? Can you try to increase the ISR stack size to something huge?
That is not a problem at all. The most that could happen is that the scheduler is run multiple times, which doesn't hurt. The status "just" states that it should not be put on a runqueue anymore.
Does your code use SVC / switch_context_exit? Usually that's used once to start threading, otherwise only when threads actually end. To me this smells like either something unrelated (e.g., stack size) or an interrupt prioritisation issue. Can you get the exact hard fault reason?
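One way to get at the exact fault reason on a Cortex-M part is to dump the fault status registers, either from a debugger or from the hard fault handler. The following is a minimal sketch using only standard CMSIS register names; calling printf in fault context is an assumption, and RIOT's default hard fault handler may already report equivalent information:

```c
/* Minimal sketch for identifying the exact hard fault reason on Cortex-M,
 * using only standard CMSIS register definitions (SCB->HFSR, SCB->CFSR,
 * SCB->BFAR, SCB->MMFAR). The raw bit masks are the architectural
 * BFARVALID/MMARVALID bits of CFSR. Reading the values from a debugger
 * works just as well as printing them. */
#include <stdint.h>
#include <stdio.h>
#include "cpu.h"   /* pulls in the vendor CMSIS header on RIOT's Cortex-M ports */

void dump_fault_registers(void)
{
    uint32_t cfsr = SCB->CFSR;

    printf("HFSR  = 0x%08lx\n", (unsigned long)SCB->HFSR); /* FORCED / VECTTBL */
    printf("CFSR  = 0x%08lx\n", (unsigned long)cfsr);      /* usage/bus/memmanage fault bits */
    if (cfsr & (1UL << 15)) {                              /* BFARVALID */
        printf("BFAR  = 0x%08lx\n", (unsigned long)SCB->BFAR);
    }
    if (cfsr & (1UL << 7)) {                               /* MMARVALID */
        printf("MMFAR = 0x%08lx\n", (unsigned long)SCB->MMFAR);
    }
}
```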
[It will take a few days before I'll have better time to focus on this (and all my other RIOT issues, including one PR on interrupt priorities that I'm still preparing).] I completely agree that it may be wise not to change this. And if the change is made, IMHO it should first be applied to Cortex-M, then (after a few months without complaints) to other ARM, and then, maybe and only maybe, the rest. For Cortex-M, the order should not really matter. But apparently it does. And, arguably, my suggested order would be slightly more efficient on Cortex-M, avoiding some context switches.
Once I get back to triggering the hard fault (and hopefully more often/deterministically), I'll have a closer look at it. From memory, it looked like the scheduler tried to schedule a thread that didn't exist / was badly restored, causing a hard fault at the end of (Changing the SD callbacks' nRF-side code to run on the ISR stack, and then on the RIOT side on a separate thread defined in
My code does not use SVC or cpu_switch_context_exit. Furthermore, I'm preparing a separate PR on the interrupt priorities, as changing PendSV to a lower subpriority was also needed to completely stop the crashes we were experiencing. Enabling subpriorities is a clear case that a) can easily be made a configuration option and b) would also be useful for people other than nRF52 SD users; hence the separate PR. After these two changes (this one, i.e. the order of the lines, and the PendSV subpriority), our system has run without crashes. However, we still experience thread lockups. That is, all of our IO-bound threads (three of them) lock up at some point, some more often than others. It usually takes tens of minutes before any of them locks up, sometimes hours, rarely just minutes. Everything else continues smoothly. The lockup is apparently related to the same issues. I'll come back to that later, once this one and the PendSV priorities are in better shape.
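For context, a subpriority change of this kind can be expressed with plain CMSIS calls along the following lines; the grouping and priority values shown are illustrative assumptions, not the values from the upcoming PR:

```c
/* Illustrative sketch: enable priority subgrouping and give PendSV a lower
 * subpriority, using standard CMSIS calls. The nRF52 (Cortex-M4) implements
 * 3 priority bits; PRIGROUP = 5 splits them into 2 group bits + 1
 * subpriority bit. The concrete group/subpriority values are placeholders. */
#include "cpu.h"   /* CMSIS core definitions via RIOT's Cortex-M port */

void demote_pendsv_subpriority(void)
{
    NVIC_SetPriorityGrouping(5);

    /* PendSV gets group priority 1 and the lowest-urgency subpriority within
     * that group, so interrupts already pending in the same group are taken
     * before PendSV. The values are for illustration only. */
    NVIC_SetPriority(PendSV_IRQn, NVIC_EncodePriority(5, 1, 1));
}
```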
I don't think there is any realistic chance of getting this fixed for this release. We don't even have a reliable, easily available test case yet. Please drop from this release.
Is there any indication that this stems from anything other than the SoftDevice integration / interaction? Otherwise, I'd close the bug given that softdevice package support was removed. |
If nobody else has seen anything even remotely like this during the last two and a half years, it may indeed be best to close this.
Closing under the assumption that this was most likely happening due to SD doing things in ISRs that were incompatible with RIOT's ISR conventions. |
Description
With #9473, the updated support for the nRF52 SoftDevice, the way interrupts are re-enabled in _mutex_lock seems to cause indeterministic hard faults.
Steps to reproduce the issue
The problem is hard to reproduce. We saw it a few weeks ago when working on our own nRF52 board with relatively heavy I/O and Bluetooth load. Applying the change proposed below increased the stability a lot, but not completely. (There also seems to be a thread deadlock issue, triggered in _xtimer_tsleep and likewise related to _mutex_lock, but that is not clear enough to me even to describe.) I will try to reproduce this problem on an nRF52 DK so that others can debug it too, but I think it is better to document the problem now and start some discussion before that. Hence this issue.
Expected results
RIOT should not cause hard faults.
Actual results
Indeterministic crashes, with a hard fault triggered on line 62 in mutex.c: https://github.com/RIOT-OS/RIOT/blob/master/core/mutex.c#L62
Versions
arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]
Suggested fix
Change the order in which interrupts are enabled again and higher priority threads are allowed to run:
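Roughly, in the blocking path of _mutex_lock the idea is to swap the last two calls, so that the context switch is requested before interrupts are re-enabled. The sketch below approximates the relevant lines of core/mutex.c rather than quoting the file verbatim:

```c
/* Approximation of the blocking path in _mutex_lock() (core/mutex.c);
 * variable names and surrounding lines are illustrative, not a verbatim diff. */

/* Current order: interrupts come back on before the context switch has been
 * requested, so other IRQs can run in between. */
sched_set_status(me, STATUS_MUTEX_BLOCKED);
irq_restore(irqstate);   /* <-- other interrupts may be served here */
thread_yield_higher();   /* line 62: pends PendSV on Cortex-M */

/* Suggested order: request the context switch first, while interrupts are
 * still disabled. On Cortex-M this only pends PendSV, which is then taken as
 * soon as irq_restore() re-enables interrupts, closing the window above. */
sched_set_status(me, STATUS_MUTEX_BLOCKED);
thread_yield_higher();   /* pend PendSV; nothing switches yet */
irq_restore(irqstate);   /* PendSV (or a higher-priority IRQ, then PendSV) fires here */
```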
Discussion
For the Cortex-M series, thread_yield_higher() simply triggers the PendSV interrupt: https://github.com/RIOT-OS/RIOT/blob/master/cpu/cortexm_common/thread_arch.c#L291
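On Cortex-M, that amounts to setting the PendSV pending bit, roughly as follows (a sketch mirroring the linked thread_arch.c, using standard CMSIS names; see the link above for the exact code):

```c
/* Sketch of thread_yield_higher() on Cortex-M (cf. the linked
 * cpu/cortexm_common/thread_arch.c): it only sets the PendSV pending bit;
 * the actual context switch happens later, in the PendSV handler, once
 * PendSV is allowed to run. */
#include "cpu.h"   /* CMSIS core definitions via RIOT's Cortex-M port */

void thread_yield_higher(void)
{
    /* Request the PendSV exception; it is taken once no higher-priority
     * exception is active and interrupts are enabled. */
    SCB->ICSR |= SCB_ICSR_PENDSVSET_Msk;
}
```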
In RIOT, if a thread calls _mutex_lock, the PendSV interrupt will take place immediately, since by default all interrupts are on the same priority level; see #3511.
Unless I understand something completely wrong, with the current version there is a time window between when the IRQs are enabled and when isr_pendsv is triggered, during which other interrupts may be served. An interrupt routine may change the state of this mutex in that window. At that point, the scheduler has not yet switched away from the calling thread's context, but the status of the thread has already been changed to STATUS_MUTEX_BLOCKED.
Now, if someone triggers SVC, e.g. by calling cpu_switch_context_exit and thereby triggering isr_svc, the scheduler will be entered while the blocking thread's context hasn't been saved yet. I have no idea if that ever happens. Note that Nordic uses the SVC interrupt heavily, but they should eat them all at their trampoline, I think.
If the order of the two lines is changed, as suggested above, it has no functional effect under Cortex-M. However, the timing of when interrupts are run and when the scheduler is run will change. As a side effect, this will fix the hard faults on nRF52 when the SoftDevice is used, for reasons that I don't understand.
I am not familiar with the other supported CPU architectures, and therefore I am opening this as an issue, as changing the order of the lines may have adverse effects on them.