Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix detached pthread failures #779

Open
vtikoo opened this issue Aug 13, 2020 · 7 comments
Open

Fix detached pthread failures #779

vtikoo opened this issue Aug 13, 2020 · 7 comments
Labels
p0 Blocking priority

Comments

@vtikoo
Copy link
Contributor

vtikoo commented Aug 13, 2020

PR #403 is disabling these two tests:
/tests/basic/pthread_detach
/tests/languages/nodejs

These are disabled due to issues in detached pthreads support.

@github-actions github-actions bot added the needs-triage Bug does not yet have a priority assigned label Aug 13, 2020
@vtikoo vtikoo added p0 Blocking priority and removed needs-triage Bug does not yet have a priority assigned labels Aug 13, 2020
@vtikoo vtikoo changed the title Fix nodejs and pthread_detach test failures Fix detached pthread failures Aug 17, 2020
@SeanTAllen
Copy link
Contributor

mmap11 LTP test hang may be related to this.

@prp prp mentioned this issue Sep 16, 2020
@vtikoo
Copy link
Contributor Author

vtikoo commented Sep 22, 2020

/tests/languages/nodejs

The failing nodejs detached thread is https://github.com/nodejs/node/blob/1be7bbdc3b533759072cfc3d13766511197e3d01/src/inspector_agent.cc#L77-L92. This is created here - https://github.com/nodejs/node/blob/1be7bbdc3b533759072cfc3d13766511197e3d01/src/inspector_agent.cc#L121.

Pretty much the first thing the thread does is a futex wait syscall, and in doing so segfaults.

inline void* StartIoThreadMain(void* unused) {
  for (;;) {
    uv_sem_wait(&start_io_thread_semaphore);
  ...

Depending on whether LKL_DEBUG is turned on this fails differently.

With LKL_DEBUG:

This fails at the first LKL_TRACE in lkl_syscall for SYS_futex here.

Thread 9 "ENCLAVE" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffa0e9d700 (LWP 3277)]
0x00007fe000504461 in _vprintf (out=0x7fe019414358, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:437
437                 if (!(p = _parse_placeholder(p, &ph, ap)))
(gdb) bt
#0  0x00007fe000504461 in _vprintf (out=0x7fe019414358, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:437
#1  0x00007fe00050413f in oe_vsnprintf (str=0x7fe019414470 "", size=256, fmt=0x7fe0006f57d0 "%.*s", ap=0x7fe019414450)
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/printf.c:634
#2  0x00007fe000503180 in oe_host_vfprintf (device=0, fmt=0x7fe0006f57d0 "%.*s", ap_=0x7fe019414750)
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/hostcalls.c:157
#3  0x00007fe00050363a in oe_host_printf (fmt=0x7fe0006f57d0 "%.*s")
    at /home/vitikoo/code/sgx-lkl/openenclave/enclave/core/hostcalls.c:184
#4  0x00007fe0005e7a4e in print (
    str=0x7fe000df8da0 "[[    LKL   ]] lkl_syscall(): enter (no=98 current=node host0->TIF host0->TIF_SIGPENDING=1)\n", 
    len=92) at lkl/posix-host.c:104
#5  0x00007fe000084ec7 in lkl_vprintf (
    fmt=0x7fe0006bb6b8 "[[    LKL   ]] %s(): enter (no=%li current=%s host0->TIF host0->TIF_SIGPENDING=%i)\n", 
    args=args@entry=0x7fe019414800) at lib/utils.c:181
#6  0x00007fe000084fc9 in lkl_printf (fmt=<optimized out>) at lib/utils.c:193
#7  0x00007fe00008bac4 in lkl_syscall (no=98, params=0x7fe019414970) at arch/lkl/kernel/syscalls.c:144

Failing instruction being -

=> 0x00007fe000504461 <+561>:   callq  0x7fe000505a20 <_parse_placeholder>

Without LKL_DEBUG:

The futex wait syscall is actually run, causing the thread to yield. While yielding back to the lthread scheduler it segfaults trying to write a rbp relative address.

Thread 6 "ENCLAVE" received signal SIGSEGV, Segmentation fault.
0x00007fe0005cf715 in __scheduler_self () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:81
81              __asm__ __volatile__ ("mov %%gs:48,%0" : "=r" (self) );
(gdb) bt
#0  0x00007fe0005cf715 in __scheduler_self () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:81
#1  0x00007fe0005cf731 in lthread_get_sched () at /home/vitikoo/code/sgx-lkl/src/include/enclave/lthread_int.h:89
#2  0x00007fe0005cff38 in _lthread_yield_cb (lt=0x7fe000df6c40, f=0x7fe0005c5932 <__do_futex_unlock>, arg=0x7fe000ab75b0 <futex_q_lock>) at sched/lthread.c:343
#3  0x00007fe0005c59cb in __do_futex_sleep (fq=0x7fe000df6d68, ts=0x0) at sched/futex.c:177
#4  0x00007fe0005c5a5b in futex_wait (uaddr=0x7fe000df6c20, val=0, ts=0x0) at sched/futex.c:215
#5  0x00007fe0005c5bdb in enclave_futex_timedwait (uaddr=0x7fe000df6c20, val=0, timeout=0x0) at sched/futex.c:276
#6  0x00007fe0005c5c25 in enclave_futex_wait (uaddr=0x7fe000df6c20, val=0) at sched/futex.c:300
#7  0x00007fe0005e7505 in sem_down (sem=0x7fe000df6c20) at lkl/posix-host.c:289
#8  0x00007fe00008a8c3 in __switch_to (prev=<optimized out>, next=<optimized out>) at arch/lkl/kernel/threads.c:194
#9  0x00007fe000424536 in context_switch (rf=<optimized out>, next=<optimized out>, prev=<optimized out>, rq=<optimized out>) at kernel/sched/core.c:3470
#10 __schedule (preempt=<optimized out>) at kernel/sched/core.c:4167
#11 0x00007fe0004247b2 in schedule () at kernel/sched/core.c:4234
#12 0x00007fe0000d87c3 in futex_wait_queue_me (hb=<optimized out>, q=0x7fe00093d000 <init_thread_info>, timeout=0x7fe03fd64000) at kernel/futex.c:2721
#13 0x00007fe0000d9343 in futex_wait (uaddr=0x0, flags=<optimized out>, val=<optimized out>, abs_time=0x0, bitset=<optimized out>) at kernel/futex.c:2827
#14 0xffffffff00080006 in ?? ()

The failing instruction is -

=> 0x00007fe0005cf715 <+13>:    mov    %rax,-0x8(%rbp)
(gdb) i r rbp
rbp            0x7fe019412000   0x7fe019412000

@vtikoo
Copy link
Contributor Author

vtikoo commented Sep 22, 2020

/tests/basic/pthread_detach

Here sgx-lkl goes into a context switch loop between ksoftirqd ==> swapper ==> idle_host_task [==> ksoftirqd], after the last detached pthread has exited.

[[    LKL   ]] free_thread_stack(): enter (task=host_clone99 task->TIF_HOST_THREAD=1 task->TIF_SIGPENDING=0 ti=0x7fa03fdc7000 current=ksoftirqd/0)
[[    LKL   ]] kill_thread(): enter (task=host_clone99 task->state=128 task->flags=32836ti=0x7fa03fdc7000 ti->flags=1282 ti->TIF_NO_TERMINATION=0 )
[[    LKL   ]] thread_join(): enter (tid=140325186146128)
[[    LKL   ]] __switch_to(): Cleaning up 0x7fa03fdc7000
[[    LKL   ]] tls_set(): enter (key=0x7fa000df7af0 data=0)
[[    LKL   ]] thread_exit(): enter
[[    LKL   ]] free_thread_stack(): Deallocating 0x7fa03fdc7000
[[    LKL   ]] __switch_to(): ksoftirqd/0=>swapper
[[    LKL   ]] __switch_to(): calling sem_down on (task=ksoftirqd/0 task->TIF_HOST_THREAD=0 task->TIF_SIGPENDING=0)
[[    LKL   ]] __switch_to(): swapper=>idle_host_task
[[    LKL   ]] __switch_to(): calling sem_down on (task=swapper task->TIF_HOST_THREAD=0 task->TIF_SIGPENDING=0)
[[    LKL   ]] __switch_to(): idle_host_task=>kworker/0:1
[[    LKL   ]] __switch_to(): kworker/0:1=>swapper

The app main thread in this case seems to be sleeping on a futex.

(gdb) call lthread_dump_all_threads(false)
[[  SGX-LKL ]] =============================================================
[[  SGX-LKL ]] Stack traces for all lthreads:
[[  SGX-LKL ]] -------------------------------------------------------------
[[  SGX-LKL ]] 1: tid=147 (7fe000dfb4f0) [kernel] ()
[[  SGX-LKL ]]     #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[  SGX-LKL ]]     #1: 7fe0005c608b in futex_wait(...)
[[  SGX-LKL ]]     #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[  SGX-LKL ]]     #3: 7fe0005c6255 in enclave_futex_wait(...)
[[  SGX-LKL ]]     #4: 7fe0005e7b57 in sem_down(...)
[[  SGX-LKL ]]     #5: 7fe00008abd2 in __switch_to(...)
[[  SGX-LKL ]]     #6: 7fe000424b66 in __schedule(...)
[[  SGX-LKL ]]     #7: 7fe000424de2 in schedule(...)
[[  SGX-LKL ]]     #8: 7fe0000a3ea3 in worker_thread(...)
[[  SGX-LKL ]]     #9: 7fe0000a9b54 in kthread(...)
[[  SGX-LKL ]]     #10: 7fe00008a88d in thread_bootstrap(...)
[[  SGX-LKL ]]     #11: 7fe0005cfe60 in _exec(...)
[[  SGX-LKL ]] -------------------------------------------------------------
[[  SGX-LKL ]] 2: tid=146 (7fe000dfb360) [cloned host task 99] (READY)
[[  SGX-LKL ]]     #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[  SGX-LKL ]]     #1: 7fe0005c608b in futex_wait(...)
[[  SGX-LKL ]]     #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[  SGX-LKL ]]     #3: 7fe0005c6255 in enclave_futex_wait(...)
[[  SGX-LKL ]]     #4: 7fe0005e7b57 in sem_down(...)
[[  SGX-LKL ]]     #5: 7fe00008abd2 in __switch_to(...)
[[  SGX-LKL ]]     #6: 7fe000424b66 in __schedule(...)
[[  SGX-LKL ]]     #7: 7fe000424de2 in schedule(...)
[[  SGX-LKL ]]     #8: 7fe0000d8df3 in futex_wait_queue_me(...)
[[  SGX-LKL ]]     #9: 7fe0000d9973 in futex_wait(...)
[[  SGX-LKL ]]     #10: 7fe0000db79b in do_futex(...)
[[  SGX-LKL ]]     #11: 7fe0000dc045 in sys_futex(...)
[[  SGX-LKL ]]     #12: 7fe00008bc54 in lkl_syscall(...)
[[  SGX-LKL ]]     #13: 7fe000b8fd76 in <unknown>(...)
[[  SGX-LKL ]]     #14: 7fe000b8fe90 in <unknown>(...)
[[  SGX-LKL ]]     #15: 7fe000b8ff87 in <unknown>(...)
[[  SGX-LKL ]]     #16: 7fe000b962ed in <unknown>(...)
[[  SGX-LKL ]]     #17: 7fe03ddd18bf in <unknown>(...)
[[  SGX-LKL ]]     #18: 7fe000ba70e4 in <unknown>(...)
[[  SGX-LKL ]]     #19: 7fe000bab511 in <unknown>(...)

// 27 other cloned host tasks have same stacktrace and state

[[  SGX-LKL ]] -------------------------------------------------------------
[[  SGX-LKL ]] 31: tid=44 (7fe000df8120) [app_name_thread] (APP_MAIN)
[[  SGX-LKL ]]     #0: 7fe0005c5ffb in __do_futex_sleep(...)
[[  SGX-LKL ]]     #1: 7fe0005c608b in futex_wait(...)
[[  SGX-LKL ]]     #2: 7fe0005c620b in enclave_futex_timedwait(...)
[[  SGX-LKL ]]     #3: 7fe0005c6255 in enclave_futex_wait(...)
[[  SGX-LKL ]]     #4: 7fe0005e7b57 in sem_down(...)
[[  SGX-LKL ]]     #5: 7fe00008abd2 in __switch_to(...)
[[  SGX-LKL ]]     #6: 7fe000424b66 in __schedule(...)
[[  SGX-LKL ]]     #7: 7fe000424de2 in schedule(...)
[[  SGX-LKL ]]     #8: 7fe0000d8df3 in futex_wait_queue_me(...)
[[  SGX-LKL ]]     #9: 7fe0000d9973 in futex_wait(...)
[[  SGX-LKL ]]     #10: 7fe0000db79b in do_futex(...)
[[  SGX-LKL ]]     #11: 7fe0000dc045 in sys_futex(...)
[[  SGX-LKL ]]     #12: 7fe00008bc54 in lkl_syscall(...)
[[  SGX-LKL ]]     #13: 7fe000b8fd76 in <unknown>(...)
[[  SGX-LKL ]]     #14: 7fe000b8fe90 in <unknown>(...)
[[  SGX-LKL ]]     #15: 7fe000b8ff87 in <unknown>(...)
[[  SGX-LKL ]]     #16: 7fe000b9a177 in <unknown>(...)
[[  SGX-LKL ]]     #17: 7fe000b95be3 in <unknown>(...)
[[  SGX-LKL ]]     #18: 7fe03ddd196d in <unknown>(...)
[[  SGX-LKL ]]     #19: 7fe000b3f180 in <unknown>(...)
[[  SGX-LKL ]]     #20: 7fe000b3f137 in <unknown>(...)
[[  SGX-LKL ]]     #21: 7fe03ddd1776 in <unknown>(...)
[[  SGX-LKL ]] -------------------------------------------------------------
...
[[  SGX-LKL ]] -------------------------------------------------------------
[[  SGX-LKL ]] 56: tid=1 (7fe000df33e0) [sgx-lkl-init] (EXITED)
[[  SGX-LKL ]]     #0: 7fe0005cfeb7 in _exec(...)
[[  SGX-LKL ]] =============================================================

@prp
Copy link
Member

prp commented Sep 22, 2020

@vtikoo This may mean that the LKL termination sequence hasn't been triggered despite the fact that all application threads have exited.

@vtikoo
Copy link
Contributor Author

vtikoo commented Sep 22, 2020

@prp is the LKL termination sequence triggered only by the app main thread exiting?

@prp
Copy link
Member

prp commented Sep 22, 2020

Hmm, termination should be triggered by any application thread exiting: https://github.com/lsds/lkl/blob/e041aa71e03a142ecef542c005b07d13e2a3b722/arch/lkl/kernel/threads.c#L89

@vtikoo
Copy link
Contributor Author

vtikoo commented Sep 22, 2020

Going by the stack traces above, it looks like the application main thread hasn't exited and is sleeping on a futex. This is causing LKL to have no runnable tasks after the last detached pthread exits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p0 Blocking priority
Projects
None yet
Development

No branches or pull requests

3 participants