cpu/native: fix race in thread_yield_higher() #10891
Conversation
Error case:

1. thread_yield_higher() stores the thread's ucontext
2. creates an "isr ucontext" for isr_thread_yield and switches to it

Case 1: no signals are pending, execution continues in isr_thread_yield()

3a. sched_run() is called
4a. return to the sched_active_thread ucontext

Case 2: signals are pending (the crashing scenario), execution continues in native_irq_handler()

3b. handles the signals
4b. if sched_context_switch_request is set, calls sched_run()
5b. return to the sched_active_thread ucontext

Step 4b can miss the call to sched_run(), leading to a possible return into a non-ready thread.
This is pretty central in native, so I guess a couple of eyes need to take a look.
@LudwigKnuepfer maybe you could take a look?
This fixes #6123 for me!
I'm rather wondering if

```diff
diff --git a/cpu/native/native_cpu.c b/cpu/native/native_cpu.c
index 2629e55..79534e3 100644
--- a/cpu/native/native_cpu.c
+++ b/cpu/native/native_cpu.c
@@ -142,7 +142,7 @@ void isr_cpu_switch_context_exit(void)
     ucontext_t *ctx;

     DEBUG("isr_cpu_switch_context_exit\n");

-    if ((sched_context_switch_request == 1) || (sched_active_thread == NULL)) {
+    if (sched_active_thread == NULL) {
         sched_run();
     }
```

wouldn't be a more valid fix, because this is basically also creating the situation you are creating:
Some historic research:
I don't think so, as it would not call
The fix I proposed causes the scheduler to be called only when the thread exits, so this obviously was bullshit. Also, since the title of #229, where this check was introduced, mentions segfaults, we probably shouldn't mess with that.
Stable for 30 min now. This bug has haunted my benchmarking efforts on native for years!
Penny for your thoughts on this, though: #6660 (comment)
Next time use some elbow grease like I did ;D.
You mean I think the naming is a little off.
Ok my final assessment for today for this PR is that |
I tested most thread-related tests on
I was however not able to find an isolated test case yet :-(.
Maybe because I don't fully understand how and when
I was also wondering why not all platforms need to set it, e.g. in RIOT/cpu/atmega_common/thread_arch.c, lines 219 to 230 in 782b181.
(edit: but maybe this was just mapped after) All others just seem to be content to trigger a software interrupt, except for lines 21 to 34 in 782b181.
Anyway, tests pass and conceptually the change makes sense to me and is hard to argue with. So it gets my soft ACK (which may be counted as one approve, given the "Impact: major" label, if others more knowledgeable in the deep dungeons of RIOT's scheduling approve).
TL;DR: A context switch can be triggered either from the running task or from within an ISR. In an ISR, the context switch cannot be done right away. See e.g. this (pseudocode):
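The pseudocode block itself did not survive extraction; a plausible reconstruction from the surrounding description (the mutex example and the flag name are taken from the text, the rest is my assumption):

```c
/* pseudocode: an ISR that makes a higher-priority thread runnable */
void some_isr(void)
{
    /* ... handle the interrupt ... */
    mutex_unlock(&m);                 /* wakes a higher-priority thread */

    /* we cannot swap contexts here: the ISR has to run to completion,
     * so we only record that a switch will be needed */
    sched_context_switch_request = 1;
}
/* once all ISRs have finished, the port checks the flag and yields */
```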
In ISR context, the ISR must finish first. It is not possible to "pause" the ISR context, switch to the thread, then "continue" the ISR context: ISRs must run to completion. So the thread that would presumably be unblocked by the mutex_unlock() call is merely put on the runqueue, but we need to somehow pass along the information that a context switch might be necessary. This is what sched_context_switch_request is used for: it passes along the information that a context switch might be necessary after an ISR has finished.

Thing is, different platforms have different concepts. E.g., on Cortex-M there's PendSV, which will always be called last. Also, there are the banked ISR registers (a second set of registers), so for serving an ISR, no manual context saving is necessary. Only when, after all ISRs have run, a different user thread should continue to run, does the current context need to be saved, the next one selected, and then restored, through fiddling with the user-mode stack. Cortex-M checks the flag here: RIOT/cpu/cortexm_common/include/cpu.h, lines 185 to 190 in 782b181.

Other platforms do it differently, but all of them use the flag.
Ok, I think I was able to figure out most of that myself, so thanks for the affirmation and for putting it in words. However, I'm still not really able to track how the previous behaviour was wrong / led to invalid unblocking, so that is why I don't feel 100% confident in this bug fix.
I'll try to put it in different words. Usually, on thread_yield_higher() on native, when called outside of an ISR, an "isr context" is created with makecontext(), which executes isr_thread_yield(). swapcontext() saves the thread context and jumps into the newly created "isr context", launching isr_thread_yield(). Now, if there are no signals pending, the scheduler is run, then a context switch to the then-current sched_active_thread is initiated. All is good.

But it is possible that signals queue up (after irq_disable() in thread_yield_higher()), making isr_thread_yield() call native_irq_handler(). That function does not return; it returns to thread context by itself after handling ISRs. The jump back to thread context is almost identical to how isr_thread_yield() does it (had there been no signals pending). The main difference is that native_irq_handler() does not always call sched_run(), but only if sched_context_switch_request was set.

This led to the crash at hand: msg_receive() removes the current thread from the runqueue, then calls thread_yield_higher(), expecting to be scheduled away. But in the off chance that a signal pops up at the wrong time (and the ISR does not set sched_context_switch_request = 1), the scheduler doesn't get called, and msg_receive() returns without receiving. The fix works by always setting sched_context_switch_request to 1, so should isr_thread_yield() degrade to native_irq_handler(), the scheduler is still called.
Ok, so the tests I started to write yesterday weren't all that wrong (trigger a timer at very short intervals [which causes a SIGALRM on native] while a message exchange between two threads is happening); they just were not able to trigger the race condition.
Actually yes, if the timer does not trigger a context switch (e.g., no mutex unlock or msg send)!
Aha, that was what was missing then :-).
Ok, I tried that, but I'm still not really able to reproduce the issue. Probably the receiving thread needs to do a little more than just print something.
Can you try to make the timeouts of the timer random, and make sure it does not cause a context switch (remove the mutex_unlock())? Also, I think thread_1 can go.
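Putting the ingredients of that recipe together, a reproducer might be shaped roughly like this untested sketch against RIOT's thread/msg/xtimer APIs (the structure, names, and timeout values are my assumptions, not the actual test from #10908):

```c
#include "msg.h"
#include "random.h"
#include "thread.h"
#include "xtimer.h"

static char recv_stack[THREAD_STACKSIZE_DEFAULT];
static kernel_pid_t recv_pid;

static void timer_cb(void *arg)
{
    (void)arg;
    /* deliberately empty: the interrupt itself must NOT request a
     * context switch, so no mutex_unlock()/msg_send() here */
}

static void *recv_thread(void *arg)
{
    (void)arg;
    while (1) {
        msg_t m;
        msg_receive(&m);    /* before the fix: could return spuriously
                             * when a SIGALRM hit the race window */
    }
    return NULL;
}

int main(void)
{
    xtimer_t t = { .callback = timer_cb };

    recv_pid = thread_create(recv_stack, sizeof(recv_stack),
                             THREAD_PRIORITY_MAIN - 1, 0,
                             recv_thread, NULL, "recv");
    while (1) {
        /* re-arm the timer with a short random timeout ... */
        xtimer_set(&t, random_uint32_range(2, 200));
        /* ... while exchanging messages with the other thread */
        msg_t m;
        msg_try_send(&m, recv_pid);
    }
}
```

The idea is only to hammer the window between irq_disable() in thread_yield_higher() and the swap to the isr context with interrupts that set no context-switch request.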
Ah, sorry, I didn't get you. No context switch was right... |
Ok, I did that now and also replaced
I got it! Let me write a quick test script and I provide a PR. |
Works with #10908, and I did not encounter any problems in the other thread tests with native.
I would ACK this PR even though I have to admit that I haven't dug into all the consequences this change might possibly have. However, we have been carrying this bug with us for more than two years now, and apparently this PR fixes it and does not break any other standard test. Plus, it is basically a one-liner.
@kaspar030 so we're green?
Yes! Many many thanks to @miri64 for reducing #6123 to
Could someone? Isn't there a script?
Backport provided in #10921
Contribution description
Fixes #10881.
Testing procedure
See #10881 and #10908
Issues/PRs references
Fixes #10881.
Fixes #6123.