Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: msg_receive() on native sometimes returns without msg being re-set #10881

Closed
miri64 opened this issue Jan 27, 2019 · 23 comments
Closed

core: msg_receive() on native sometimes returns without msg being re-set #10881

miri64 opened this issue Jan 27, 2019 · 23 comments
Assignees
Labels
Area: core Area: RIOT kernel. Handle PRs marked with this with care! Platform: native Platform: This PR/issue effects the native platform Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@miri64
Copy link
Member

miri64 commented Jan 27, 2019

Description

Coming from #6123 I was able to track down the issue to the fact that msg_receive() in some corner-cases seems to return without the out parameter being rewritten. I wasn't able to pin-point the exact issue yet and I'm unsure if it is only an issue on native.

Steps to reproduce the issue

I applied the following patch to current master (bdd2d52, might not apply to later versions, please then try to change manually):

diff --git a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
index feb2e8f10..0c3fa97c4 100644
--- a/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
+++ b/sys/net/gnrc/network_layer/ipv6/gnrc_ipv6.c
@@ -180,6 +180,7 @@ static void *_event_loop(void *args)
     /* start event loop */
     while (1) {
         DEBUG("ipv6: waiting for incoming message.\n");
+        memset(&msg, 0, sizeof(msg));
         msg_receive(&msg);
 
         switch (msg.type) {
@@ -220,6 +221,7 @@ static void *_event_loop(void *args)
                 gnrc_ipv6_nib_handle_timer_event(msg.content.ptr, msg.type);
                 break;
             default:
+                printf("ipv6: unknown message type 0x%04x\n", msg.type);
                 break;
         }
     }

and ran gnrc_networking. I then tried to ping the node as described in #10875:

sudo true; for _ in $(seq 10); do sudo ping -s0 -f "fe80::...%tapbr0" & done

Edit: Alternatively just run the tests in #10908.

Expected results

ipv6: unknown message type 0x0000

should never show up.

Actual results

ipv6: unknown message type 0x0000

shows up with some regularity

Versions

bdd2d52 on a somewhat recent Arch as of writing this issue.

Operating System Environment
-----------------------------
       Operating System: "Arch Linux" 
                 Kernel: Linux 4.20.3-arch1-1-ARCH x86_64 unknown

Installed compiler toolchains
-----------------------------
             native gcc: gcc (GCC) 8.2.1 20181127
      arm-none-eabi-gcc: arm-none-eabi-gcc (Arch Repository) 8.2.0
                avr-gcc: avr-gcc (GCC) 8.2.0
       mips-mti-elf-gcc: missing
             msp430-gcc: missing
   riscv-none-embed-gcc: missing
   xtensa-esp32-elf-gcc: missing
   xtensa-lx106-elf-gcc: missing
                  clang: clang version 7.0.1 (tags/RELEASE_701/final)

Installed compiler libs
-----------------------
   arm-none-eabi-newlib: "3.0.0"
    mips-mti-elf-newlib: missing
riscv-none-embed-newlib: missing
xtensa-esp32-elf-newlib: missing
xtensa-lx106-elf-newlib: missing
               avr-libc: "2.0.0" ("20150208")

Installed development tools
---------------------------
                  cmake: cmake version 3.13.3
               cppcheck: missing
                doxygen: 1.8.15
                 flake8: 3.6.0 (mccabe: 0.6.1, pycodestyle: 2.4.0, pyflakes: 2.0.0) CPython 3.7.2 on Linux
                    git: git version 2.20.1
                   make: GNU Make 4.2.1
                openocd: missing
                 python: Python 3.7.2
                python2: Python 2.7.15
                python3: Python 3.7.2
             coccinelle: missing
@miri64 miri64 added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Platform: native Platform: This PR/issue effects the native platform Area: core Area: RIOT kernel. Handle PRs marked with this with care! labels Jan 27, 2019
@kaspar030
Copy link
Contributor

I'm trying to reproduce. I got the extra case you've added once within 10 minutes.

Can you re-run your test with this: https://gist.github.com/1b447978ac89cfc77d6be51911e8b305
It adds printout of the msg_receive() return value, and changes msg_receive() to return different values depending on which case was hit.

@kaspar030
Copy link
Contributor

@kaspar030
Copy link
Contributor

It adds printout of the msg_receive() return value, and changes msg_receive() to return different values depending on which case was hit.

I only got three hits, they were all with res==1 so far.

@kaspar030
Copy link
Contributor

@miri64 all the messages are supposed to be received from another thread, right? (not from isr?)

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

Hi @kaspar030 I was planning to work on an isolated test case today. Let me check your patches first.

@miri64 all the messages are supposed to be received from another thread, right? (not from isr?)

At least in the scenario, most should be from the thread itself (containing ICMPv6 echo replies or sent NDP messages), from the netif thread (ICMPv6 echo requests and received NDP messages), but also from the ISR (evtimer timer messages for NDP events).

@kaspar030
Copy link
Contributor

I've added an to msg_receive() (at the top, write something to m->content.value, below in the failing case, read back and assert it has been changed). That triggers as @miri64's default case.

I've then tried to enable debug in msg.c, and it still triggers. It looks like a thread sitting in "msg_receive()" is being woken up before anyone has copied a message, which should not be possible.

This must be something timing sensitive, as enabling more debug output either postpones or "fixes" the problem.

I'm investigating...

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

I've added an to msg_receive() (at the top, write something to m->content.value, below in the failing case, read back and assert it has been changed). That triggers as @miri64's default case.

Can you rephrase / fix this sentence (or provide a patch)? I don't know what you mean.

@kaspar030
Copy link
Contributor

Can you rephrase / fix this sentence (or provide a patch)? I don't know what you mean.

Sorry, patch is here. What I mean is that I've tried your check (memset, then see if type=0) directly into msg_receive(), as seen in the patch. As expected, it also triggers.

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

Can you re-run your test with this: https://gist.github.com/1b447978ac89cfc77d6be51911e8b305
It adds printout of the msg_receive() return value, and changes msg_receive() to return different values depending on which case was hit.

Shall I still do that btw? You appear to be able to reproduce this yourself.

@kaspar030
Copy link
Contributor

Shall I still do that btw? You appear to be able to reproduce this yourself.

Ok for now. It did trigger, now it doesn't anymore. Maybe I was just lucky before. Anyhow I can investigate more now.

@kaspar030
Copy link
Contributor

I boiled it down to this:

[kaspar@ng riot (master)]$ git diff core/msg.c
diff --git a/core/msg.c b/core/msg.c
index a46875f16e..86ee0dd287 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -330,6 +330,7 @@ static int _msg_receive(msg_t *m, int block)
             thread_yield_higher();
 
             /* sender copied message */
+            assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
         }
         else {
             irq_restore(state);
[kaspar@ng riot (master)]$

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

Confirmed

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

I'm not 100% sure this helps, but I have the following patch now:

diff --git a/core/msg.c b/core/msg.c
index a46875f..51348f4 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -74,8 +74,10 @@ int msg_try_send(msg_t *m, kernel_pid_t target_pid)
         return msg_send_int(m, target_pid);
     }
     if (sched_active_pid == target_pid) {
+        puts("s");
         return msg_send_to_self(m);
     }
+    puts("f");
     return _msg_send(m, target_pid, false, irq_disable());
 }
 
@@ -184,6 +186,7 @@ int msg_send_to_self(msg_t *m)
 
 int msg_send_int(msg_t *m, kernel_pid_t target_pid)
 {
+    puts("i");
 #ifdef DEVELHELP
     if (!pid_is_valid(target_pid)) {
         DEBUG("msg_send(): target_pid is invalid, continuing anyways\n");
@@ -330,6 +333,7 @@ static int _msg_receive(msg_t *m, int block)
             thread_yield_higher();
 
             /* sender copied message */
+            assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
         }
         else {
             irq_restore(state);

(the puts in msg_try_send to catch netapi messages to determine the origin of netapi messages and the puts in msg_send_int to determine evtimer_msg firings

Up until now I consistently have an f before it hits the assertion.

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

(would have really surprised me if it were an s ;-))

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

Ok... Not really helpful :-/

diff --git a/core/msg.c b/core/msg.c
index a46875f..3d5d04e 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -76,6 +76,7 @@ int msg_try_send(msg_t *m, kernel_pid_t target_pid)
     if (sched_active_pid == target_pid) {
         return msg_send_to_self(m);
     }
+    puts("f");
     return _msg_send(m, target_pid, false, irq_disable());
 }
 
@@ -99,6 +100,7 @@ static int _msg_send(msg_t *m, kernel_pid_t target_pid, bool block, unsigned sta
 
     thread_t *me = (thread_t *) sched_active_thread;
 
+    puts("2");
     DEBUG("msg_send() %s:%i: Sending from %" PRIkernel_pid " to %" PRIkernel_pid
           ". block=%i src->state=%i target->state=%i\n", RIOT_FILE_RELATIVE,
           __LINE__, sched_active_pid, target_pid,
@@ -324,12 +326,14 @@ static int _msg_receive(msg_t *m, int block)
         if (queue_index < 0) {
             DEBUG("_msg_receive(): %" PRIkernel_pid ": No msg in queue. Going blocked.\n",
                   sched_active_thread->pid);
+            puts("1");
             sched_set_status(me, STATUS_RECEIVE_BLOCKED);
 
             irq_restore(state);
             thread_yield_higher();
 
             /* sender copied message */
+            assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
         }
         else {
             irq_restore(state);
f
2
1
1
f
2
1
f
2
1
core/msg.c:336 => 0x8049e12
*** RIOT kernel panic:
FAILED ASSERTION.

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

sched_run is not called for some run :-/

diff --git a/core/msg.c b/core/msg.c
index a46875f..55823ed 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -327,9 +327,12 @@ static int _msg_receive(msg_t *m, int block)
             sched_set_status(me, STATUS_RECEIVE_BLOCKED);
 
             irq_restore(state);
+            puts("1");
             thread_yield_higher();
+            puts("2");
 
             /* sender copied message */
+            assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
         }
         else {
             irq_restore(state);
diff --git a/core/sched.c b/core/sched.c
index d78f1e0..4d63ddc 100644
--- a/core/sched.c
+++ b/core/sched.c
@@ -91,7 +91,7 @@ int __attribute__((used)) sched_run(void)
     int nextrq = bitarithm_lsb(runqueue_bitcache);
     thread_t *next_thread = container_of(sched_runqueues[nextrq].next->next, thread_t, rq_entry);
 
-    DEBUG("sched_run: active thread: %" PRIkernel_pid ", next thread: %" PRIkernel_pid "\n",
+    printf("sched_run: active thread: %" PRIkernel_pid ", next thread: %" PRIkernel_pid "\n",
           (kernel_pid_t)((active_thread == NULL) ? KERNEL_PID_UNDEF : active_thread->pid),
           next_thread->pid);
1
sched_run: active thread: 6, next thread: 4
sched_run: active thread: 4, next thread: 6
2
sched_run: active thread: 6, next thread: 4
sched_run: active thread: 4, next thread: 6
1
2
core/msg.c:335 => 0x8049e12
*** RIOT kernel panic:
FAILED ASSERTION.

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

Investigated deeper:

  • before the hit assertion _native_sigpend is 2, so isr_thread_yield steps into native_irq_handle()
    if (_native_sigpend > 0) {
    DEBUG("isr_thread_yield(): handling signals\n\n");
    native_irq_handler();
    }
  • this doesn't seem to return, but rather ends up in cpu_switch_context_exit() which has the chance to run sched_run in isr_cpu_switch_context_exit, but for some reason it doesn't
    void isr_cpu_switch_context_exit(void)
    {
    ucontext_t *ctx;
    DEBUG("isr_cpu_switch_context_exit\n");
    if ((sched_context_switch_request == 1) || (sched_active_thread == NULL)) {
    sched_run();
    }
diff --git a/core/msg.c b/core/msg.c
index a46875f..797d8e5 100644
--- a/core/msg.c
+++ b/core/msg.c
@@ -328,8 +328,8 @@ static int _msg_receive(msg_t *m, int block)
 
             irq_restore(state);
             thread_yield_higher();
-
             /* sender copied message */
+            assert(sched_active_thread->status != STATUS_RECEIVE_BLOCKED);
         }
         else {
             irq_restore(state);
diff --git a/cpu/native/native_cpu.c b/cpu/native/native_cpu.c
index 2629e55..244228a 100644
--- a/cpu/native/native_cpu.c
+++ b/cpu/native/native_cpu.c
@@ -143,6 +143,7 @@ void isr_cpu_switch_context_exit(void)
 
     DEBUG("isr_cpu_switch_context_exit\n");
     if ((sched_context_switch_request == 1) || (sched_active_thread == NULL)) {
+        puts("SCHED");
         sched_run();
     }
 
@@ -168,6 +169,7 @@ void cpu_switch_context_exit(void)
 #endif
 
     if (_native_in_isr == 0) {
+        puts("NOISR");
         irq_disable();
         _native_in_isr = 1;
         native_isr_context.uc_stack.ss_sp = __isr_stack;
@@ -180,6 +182,7 @@ void cpu_switch_context_exit(void)
         errx(EXIT_FAILURE, "1 this should have never been reached!!");
     }
     else {
+        puts("ISR");
         isr_cpu_switch_context_exit();
     }
     errx(EXIT_FAILURE, "3 this should have never been reached!!");
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
SCHED
ISR
ISR
ISR
ISR
core/msg.c:332 => 0x8049e12
*** RIOT kernel panic:
FAILED ASSERTION.

(this isn't the only time in the output though that ISR follows onto each other)

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

(fixed a copy-pasta error in output above)

@kaspar030
Copy link
Contributor

* this doesn't seem to return, but rather ends up in `cpu_switch_context_exit()` which has the chance to run `sched_run` in `isr_cpu_switch_context_exit`, but for some reason it doesn't

Looks like, if an ISR (signal) occurs after thread_yield_higher has disabled them makes the called isr_thread_yield() jump into native_irq_handler(), which calls sched_run() conditionally (on sched_context_switch_request), then jumps directly back to the user context. Seems like this can skip the sched_run() that the user context wanted to run.

Could you try this:

diff --git a/cpu/native/native_cpu.c b/cpu/native/native_cpu.c
index 2629e55df0..655562565e 100644
--- a/cpu/native/native_cpu.c
+++ b/cpu/native/native_cpu.c
@@ -208,6 +208,8 @@ void isr_thread_yield(void)
 
 void thread_yield_higher(void)
 {
+    sched_context_switch_request = 1;
+
     if (_native_in_isr == 0) {
         ucontext_t *ctx = (ucontext_t *)(sched_active_thread->sp);
         _native_in_isr = 1;
@@ -224,9 +226,6 @@ void thread_yield_higher(void)
         }
         irq_enable();
     }
-    else {
-        sched_context_switch_request = 1;
-    }
 }
 
 void native_cpu_init(void)

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

I guess this will stabilize things, but I don't think that is a valid fix.

@kaspar030
Copy link
Contributor

I guess this will stabilize things, but I don't think that is a valid fix.

Agreed. Somehow the thread trying to yield away is cut in the middle. I do not yet get where it's PC is being saved.

@miri64
Copy link
Member Author

miri64 commented Jan 28, 2019

If I understand man 3 getcontext and man 3 makecontext correctly, it is written machine dependent into ucp rewording edit: into the machine dependent part of ucontext_t in both these functions (getcontext takes the current pc, makecontext sets its to the entry of the function provided).

@kaspar030
Copy link
Contributor

I think I got it now.

  1. thread_yield_higher() stores the thread's ucontext
  2. creates an "isr ucontext" for isr_thread_yield, switches to it

Case 1: no signals are pending, continues in isr_thread_yield()
3a. sched_run is called
4a. return to sched_active_thread ucontext

Case 2: signals pending (the crashing scenario), continues in native_irq_handler()
3b. handles signals
4b if sched_context_switch_request is set, call sched_run
5b. return to sched_active_thread ucontext

This explains perfectly why in case 2 sched_run is just plain skipped, but the system does not completely fall apart.

It also makes me confident that moving sched_active_thread==1 is a proper fix.

@kaspar030 kaspar030 changed the title core: msg_receive() sometimes returns without msg being re-set core: msg_receive() on native sometimes returns without msg being re-set Jan 28, 2019
kaspar030 added a commit to kaspar030/RIOT that referenced this issue Jan 28, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Jan 30, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Jan 30, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Jan 30, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Jan 30, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Jan 30, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Jan 30, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Jan 30, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Jan 30, 2019
miri64 pushed a commit to miri64/RIOT that referenced this issue Feb 1, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Feb 1, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Feb 13, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 27, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 27, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 28, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 28, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 28, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 28, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 28, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 28, 2019
miri64 added a commit to miri64/RIOT that referenced this issue Mar 28, 2019
miri64 added a commit that referenced this issue Mar 28, 2019
geromueller pushed a commit to geromueller/RIOT that referenced this issue Jun 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: core Area: RIOT kernel. Handle PRs marked with this with care! Platform: native Platform: This PR/issue effects the native platform Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

No branches or pull requests

4 participants