Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to print stacktraces of all tasks, including unscheduled ones #46177

Closed
tveldhui opened this issue Jul 26, 2022 · 11 comments · Fixed by #46845 or #47402
Closed

Ability to print stacktraces of all tasks, including unscheduled ones #46177

tveldhui opened this issue Jul 26, 2022 · 11 comments · Fixed by #46845 or #47402
Assignees
Labels
feature Indicates new feature / enhancement requests multithreading Base.Threads and related functionality

Comments

@tveldhui
Copy link

tveldhui commented Jul 26, 2022

Detecting deadlocks is currently a bit difficult because if the tasks involved are unscheduled, we have no way of printing their backtraces.

Feature request: ability to print backtraces of all tasks, including all currently unscheduled tasks.

For us, deadlocks are not just priority inversions on locks, they can be byzantine scenarios involving many tasks, semaphores, channels, mutexes, conditions, futures, etc.

CC @NHDaly

@NHDaly NHDaly added feature Indicates new feature / enhancement requests multithreading Base.Threads and related functionality labels Jul 26, 2022
@NHDaly
Copy link
Member

NHDaly commented Jul 26, 2022

From my perspective, I think we could achieve this by keeping an array of WeakRefs to all Tasks, adding every task when its first constructed. And then we can iterate that to dump the tasks when requested.

Does that sound reasonable to others? Could that integrate well with the existing spontaneous profiling tools?
CC: @IanButterworth

@IanButterworth
Copy link
Member

Note that the spontaneous profiling tool (good name!) does group by thread and task already. It'd be good to understand how that's falling short

@NHDaly
Copy link
Member

NHDaly commented Jul 26, 2022

Ian, i think that it only shows the stack traces for tasks that are scheduled on threads. I.e. it's a profile of execution.

We're interested in something different, which is understanding the state of execution of all Tasks in the system, even the ones that aren't scheduled, because they're currently blocking.

This can help with identifying the cause of a deadlocked runtime, by dumping all Task stack traces and identifying which ones should be making progress but aren't able to. This cannot really be automated in general, and requires a careful eye reading the code together with all the stacktraces.

@IanButterworth
Copy link
Member

Sounds like a good topic for the profiling call this Friday @ 2pm ET, if you can make it!

@maleadt
Copy link
Member

maleadt commented Jul 27, 2022

jl_live_tasks from #32283?

@NHDaly
Copy link
Member

NHDaly commented Jul 27, 2022

Thanks @maleadt. That PR is actually what started this discussion, yeah.

I think we think that jl_live_tasks only returns the tasks that are currently on the run-queue for a given thread, but not all tasks anywhere in the system, including tasks blocking on a lock.

julia/src/gc-stacks.c

Lines 256 to 257 in b2bf56e

jl_ptls_t ptls = ct->ptls;
arraylist_t *live_tasks = &ptls->heap.live_tasks;

julia/src/julia_threads.h

Lines 138 to 140 in 05eb153

// live tasks started on this thread
// that are holding onto a stack from the pool
arraylist_t live_tasks;

And note that you need to loop through all the threads to get all live tasks, like we did in jl_wassup here: #44990


So I think jl_live_tasks isn't quite giving us what we wanted. it's close! But it's only for a single thread, and only showing the scheduled tasks. We want:

  • all Tasks that haven't been GC'd, and
  • aren't completed

I think the best way to do this would be with a global array of Weak References, which we add new Tasks to in their constructor, or when they're first scheduled.

Does that sound reasonable to you?

@NHDaly
Copy link
Member

NHDaly commented Jul 27, 2022

CC: @kpamnany, @jakebolewski

@maleadt
Copy link
Member

maleadt commented Jul 27, 2022

Hmm, I'm able to see the @async sleep(...) task that you mentioned in #44990. After compiling with JL_HAVE_UNW_CONTEXT:

$ gdb --args ./usr/bin/julia -t1

julia> t = @async sleep(30)

(gdb) print jl_live_tasks()->length
$2 = 5

(gdb) call jlbacktracet(jl_arrayref(jl_live_tasks(), 4))
jl_unw_swapcontext at /home/tim/Julia/src/julia/src/task.c:1014 [inlined]
jl_swap_fiber at /home/tim/Julia/src/julia/src/task.c:1026
ctx_switch at /home/tim/Julia/src/julia/src/task.c:477
ijl_switch at /home/tim/Julia/src/julia/src/task.c:551
try_yieldto at ./task.jl:861
wait at ./task.jl:931
wait at ./condition.jl:124
_trywait at ./asyncevent.jl:138
wait at ./asyncevent.jl:155 [inlined]
sleep at ./asyncevent.jl:240
#2 at ./task.jl:484
unknown function (ip: 0x7fffe3d3fb4f)
_jl_invoke at /home/tim/Julia/src/julia/src/gf.c:2339 [inlined]
ijl_apply_generic at /home/tim/Julia/src/julia/src/gf.c:2540
jl_apply at /home/tim/Julia/src/julia/src/julia.h:1838 [inlined]
start_task at /home/tim/Julia/src/julia/src/task.c:931

I'm probably missing something though, as I'm not familiar with the scheduler internals. @vtjnash should be able to elaborate.

@JeffBezanson
Copy link
Member

Correct, live_tasks should have all tasks that have not exited (and thus still have an execution stack). We would just need a version of jl_live_tasks that collects this over all threads; after all something like this does need to be thread-local.

@NHDaly
Copy link
Member

NHDaly commented Jul 29, 2022

Okay! Thanks both, that really helps me understand what's going on!

So I think we were just misunderstanding what live_tasks does -- we were under the impression it only iterated the tasks on the run queue, but I now understand (after looking at the code together with @vchuravy at JuliaCon) that it's indeed already doing what I was asking for: it's an array of all live tasks that isn't marked by GC. 👍 👍


@maleadt i'm not sure why i wasn't seeing the same backtraces that you are seeing... 🤔

@vchuravy pointed out that maybe for some reason my tasks weren't meeting some of these criteria?:

julia/src/stackwalk.c

Lines 866 to 870 in e2a8a4e

if (t->copy_stack || !t->started || t->stkbuf == NULL)
return;
int16_t old = -1;
if (!jl_atomic_cmpswap(&t->tid, &old, ptls->tid) && old != ptls->tid)
return;

Thanks @JeffBezanson, so maybe actually #44990 is what we wanted, and we can re-open that PR! :) Thanks

@NHDaly
Copy link
Member

NHDaly commented Oct 18, 2022

Reopening until the revert is reapplied :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Indicates new feature / enhancement requests multithreading Base.Threads and related functionality
Projects
None yet
6 participants