core: msg_receive() on native sometimes returns without msg being re-set #10881
Comments
I'm trying to reproduce. I got the extra case you've added once within 10 minutes. Can you re-run your test with this: https://gist.github.com/1b447978ac89cfc77d6be51911e8b305
Sorry, please use this: https://gist.github.com/92fa81d1ae60d5f134806027829ea736
I only got three hits so far; they were all with res==1.
@miri64 all the messages are supposed to be received from another thread, right? (not from an ISR?)
Hi @kaspar030 I was planning to work on an isolated test case today. Let me check your patches first.
At least in the scenario, most should be from the thread itself (containing ICMPv6 echo replies or sent NDP messages), from the
I've added a check to msg_receive(): at the top, write something to m->content.value; below, in the failing case, read it back and assert that it has been changed. That triggers just like @miri64's default case. I then enabled debug in msg.c, and it still triggers. It looks like a thread sitting in msg_receive() is being woken up before anyone has copied a message, which should not be possible. This must be timing sensitive, as enabling more debug output either postpones or "fixes" the problem. I'm investigating...
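A minimal sketch of the sentinel check described above, assuming a simplified stand-in for msg_t (the real RIOT type has more fields). The names demo_msg_t, DEMO_SENTINEL, demo_receive_fn and demo_receive_checked are hypothetical and not part of RIOT; the receive call is passed in as a function pointer so the sketch builds without the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for RIOT's msg_t (assumption). */
typedef struct {
    uint16_t type;
    struct {
        uint32_t value;
    } content;
} demo_msg_t;

/* Arbitrary marker value; assumed never to be sent as a real payload. */
#define DEMO_SENTINEL 0xDEADBEEFu

/* Stand-in for the blocking receive call under test. */
typedef void (*demo_receive_fn)(demo_msg_t *m);

/*
 * Write a sentinel into the out parameter, run the receive call, and
 * report whether the sentinel was overwritten. A return value of false
 * means the call returned without anyone copying a message into *m.
 */
static bool demo_receive_checked(demo_msg_t *m, demo_receive_fn receive)
{
    m->content.value = DEMO_SENTINEL;
    receive(m);
    return m->content.value != DEMO_SENTINEL;
}

/* Simulates a sender that copied a message before the receiver woke up. */
static void demo_receive_ok(demo_msg_t *m)
{
    m->type = 1;
    m->content.value = 42;
}

/* Simulates the faulty wake-up: returns without touching *m. */
static void demo_receive_spurious(demo_msg_t *m)
{
    (void)m;
}
```

A caller could wrap the result in assert() to stop at the first untouched message. One limitation of this approach: a legitimate message whose value happens to equal the sentinel would be reported as a false positive.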
Can you rephrase / fix this sentence (or provide a patch)? I don't know what you mean.
Sorry, patch is here. What I mean is that I've put your check (memset, then see if type=0) directly into msg_receive(), as seen in the patch. As expected, it also triggers.
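For reference, the caller-side form of that check (memset, then see if type is still 0) could look roughly like the sketch below. It assumes that no real sender uses message type 0; sketch_msg_t, sketch_receive_fn and sketch_msg_untouched are hypothetical names, not RIOT API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified message type (assumption, not RIOT's msg_t). */
typedef struct {
    uint16_t type;
    uint32_t value;
} sketch_msg_t;

/* Stand-in for the receive call under test. */
typedef void (*sketch_receive_fn)(sketch_msg_t *m);

/*
 * Zero the message, call receive, and report whether the type field is
 * still 0 afterwards. Assumes type 0 is never used by a real sender.
 */
static bool sketch_msg_untouched(sketch_msg_t *m, sketch_receive_fn receive)
{
    memset(m, 0, sizeof(*m));
    receive(m);
    return m->type == 0;
}

/* Simulates a normal delivery with a non-zero type. */
static void sketch_deliver(sketch_msg_t *m)
{
    m->type = 7;
    m->value = 1;
}

/* Simulates returning without a message having been copied. */
static void sketch_noop(sketch_msg_t *m)
{
    (void)m;
}
```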
Shall I still do that btw? You appear to be able to reproduce this yourself.
Ok for now. It did trigger, now it doesn't anymore. Maybe I was just lucky before. Anyhow I can investigate more now.
I boiled it down to this:
Confirmed
I'm not 100% sure this helps, but I have the following patch now:
diff --git a/core/msg.c b/core/msg.c
index a46875f..51348f4 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -74,8 +74,10 @@ int msg_try_send(msg_t *m, kernel_pid_t target_pid)
return msg_send_int(m, target_pid);
}
if (sched_active_pid == target_pid) {
+ puts("s");
return msg_send_to_self(m);
}
+ puts("f");
return _msg_send(m, target_pid, false, irq_disable());
}
@@ -184,6 +186,7 @@ int msg_send_to_self(msg_t *m)
int msg_send_int(msg_t *m, kernel_pid_t target_pid)
{
+ puts("i");
#ifdef DEVELHELP
if (!pid_is_valid(target_pid)) {
DEBUG("msg_send(): target_pid is invalid, continuing anyways\n");
@@ -330,6 +333,7 @@ static int _msg_receive(msg_t *m, int block)
thread_yield_higher();
/* sender copied message */
+ assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
}
else {
irq_restore(state);
(the Up until now I consistently have an
(would have really surprised me if it were an
Ok... Not really helpful :-/
diff --git a/core/msg.c b/core/msg.c
index a46875f..3d5d04e 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -76,6 +76,7 @@ int msg_try_send(msg_t *m, kernel_pid_t target_pid)
if (sched_active_pid == target_pid) {
return msg_send_to_self(m);
}
+ puts("f");
return _msg_send(m, target_pid, false, irq_disable());
}
@@ -99,6 +100,7 @@ static int _msg_send(msg_t *m, kernel_pid_t target_pid, bool block, unsigned sta
thread_t *me = (thread_t *) sched_active_thread;
+ puts("2");
DEBUG("msg_send() %s:%i: Sending from %" PRIkernel_pid " to %" PRIkernel_pid
". block=%i src->state=%i target->state=%i\n", RIOT_FILE_RELATIVE,
__LINE__, sched_active_pid, target_pid,
@@ -324,12 +326,14 @@ static int _msg_receive(msg_t *m, int block)
if (queue_index < 0) {
DEBUG("_msg_receive(): %" PRIkernel_pid ": No msg in queue. Going blocked.\n",
sched_active_thread->pid);
+ puts("1");
sched_set_status(me, STATUS_RECEIVE_BLOCKED);
irq_restore(state);
thread_yield_higher();
/* sender copied message */
+ assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
}
else {
irq_restore(state);
diff --git a/core/msg.c b/core/msg.c
index a46875f..55823ed 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -327,9 +327,12 @@ static int _msg_receive(msg_t *m, int block)
sched_set_status(me, STATUS_RECEIVE_BLOCKED);
irq_restore(state);
+ puts("1");
thread_yield_higher();
+ puts("2");
/* sender copied message */
+ assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
}
else {
irq_restore(state);
diff --git a/core/sched.c b/core/sched.c
index d78f1e0..4d63ddc 100644
--- a/core/sched.c
+++ b/core/sched.c
@@ -91,7 +91,7 @@ int __attribute__((used)) sched_run(void)
int nextrq = bitarithm_lsb(runqueue_bitcache);
thread_t *next_thread = container_of(sched_runqueues[nextrq].next->next, thread_t, rq_entry);
- DEBUG("sched_run: active thread: %" PRIkernel_pid ", next thread: %" PRIkernel_pid "\n",
+ printf("sched_run: active thread: %" PRIkernel_pid ", next thread: %" PRIkernel_pid "\n",
(kernel_pid_t)((active_thread == NULL) ? KERNEL_PID_UNDEF : active_thread->pid),
next_thread->pid);
Investigated deeper:
diff --git a/core/msg.c b/core/msg.c
index a46875f..797d8e5 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -328,8 +328,8 @@ static int _msg_receive(msg_t *m, int block)
irq_restore(state);
thread_yield_higher();
-
/* sender copied message */
+ assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
}
else {
irq_restore(state);
diff --git a/cpu/native/native_cpu.c b/cpu/native/native_cpu.c
index 2629e55..244228a 100644
--- a/cpu/native/native_cpu.c
+++ b/cpu/native/native_cpu.c
@@ -143,6 +143,7 @@ void isr_cpu_switch_context_exit(void)
DEBUG("isr_cpu_switch_context_exit\n");
if ((sched_context_switch_request == 1) || (sched_active_thread == NULL)) {
+ puts("SCHED");
sched_run();
}
@@ -168,6 +169,7 @@ void cpu_switch_context_exit(void)
#endif
if (_native_in_isr == 0) {
+ puts("NOISR");
irq_disable();
_native_in_isr = 1;
native_isr_context.uc_stack.ss_sp = __isr_stack;
@@ -180,6 +182,7 @@ void cpu_switch_context_exit(void)
errx(EXIT_FAILURE, "1 this should have never been reached!!");
}
else {
+ puts("ISR");
isr_cpu_switch_context_exit();
}
errx(EXIT_FAILURE, "3 this should have never been reached!!");
(this isn't the only time in the output though that
(fixed a copy-paste error in the output above)
Looks like, if an ISR (signal) occurs after
Could you try this:
I guess this will stabilize things, but I don't think that is a valid fix.
Agreed. Somehow the thread trying to yield away is cut off in the middle. I don't yet see where its PC is being saved.
If I understand
I think I got it now.
Case 1: no signals are pending, continues in
Case 2: signals pending (the crashing scenario), continues in
This explains perfectly why in case 2
It also makes me confident that moving
(cherry picked from commit 82b2362)
Address Kaspar's comments
Description
Coming from #6123 I was able to track down the issue to the fact that msg_receive() in some corner cases seems to return without the out parameter being rewritten. I wasn't able to pinpoint the exact issue yet and I'm unsure if it is only an issue on native.
Steps to reproduce the issue
I applied the following patch to current master (bdd2d52, might not apply to later versions, please then try to change manually):
and ran gnrc_networking. I then tried to ping the node as described in #10875:
Edit: Alternatively, just run the tests in #10908.
Expected results
should never show up.
Actual results
shows up with some regularity
Versions
bdd2d52 on a somewhat recent Arch as of writing this issue.