-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add jl_print_task_backtraces()
#46845
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent, this looks super helpful. Thank you @kpamnany!!
Can we also backport to 1.8? 🥺 I think this would be quite helpful for debugging issues on 1.8 too, if we can.
src/stackwalk.c
Outdated
size_t i, j, n; | ||
|
||
for (i = 0; i < jl_n_threads; i++) { | ||
ptls = jl_all_tls_states[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
traditionally, we call this ptls2
in GC, but you also aren't allowed to access that outside of GC, since it is thread-local state for a different thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will be racy, but will it simply fail on some platforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If, on some platforms, it is not possible to access another thread's state through jl_all_tls_states
, the only way around I can see is to do a GC stop-the-world and have each thread print its own live_tasks
which will no longer be racy, but make this a lot more complicated and slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in jl_gc_wait_for_the_world()
, the calling thread looks at other threads' TLS, so the problem is not that a thread cannot see another thread's TLS, but that it is racy and dangerous, correct?
There is the risk that since we haven't stopped the other threads, that a task's stack will change during this inspection. I'm not sure how likely it is that we'll crash doing this, but more elaborate deadlock detection will be much harder to do -- this can be useful as is, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A thread, in general, cannot print the stack of another thread's task either, so having it run behind the jl_gc_wait_for_the_world
logic (running on every thread) would be useful. Though deadlock reporting doesn't precisely need that, since that may be only essential to run if we have run out of all other useful work and halted the other threads anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The live_tasks
pointer may move from under you while you are looking at it, making this very dangerous. You are right that sometimes we examine values in the jl_all_tls_states
, but only if they are marked atomic or are protected by a lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know, we only intended to call this from gdb
, in which case the world will already be artificially stopped.
I agree with you both, that I don't think this is safe to call from inside the julia runtime.
I think we should make sure that the function comments make it clear that this isn't safe to call while the runtime is running.
It might make sense to provide another version of it, which force-pauses all the threads before printing, reusing the GC machinery, but that seems like a follow up PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the Go version of this stops the world so it can be safely called from userspace: https://github.com/golang/go/blob/6d8ec893039a39f495c8139012e47754e4518b70/src/runtime/mprof.go#L1193-L1196
This is sometimes put in an HTTP handler for easy access; would be nice for us to eventually be able to do the same. GDB is a good start ofc :)
a3b8428
to
777eb54
Compare
For this error:
Do I have to |
Also, I don't understand this?
|
We're experimenting with this to see if it's useful. As Nathan said, this is basically for use in I find that I cannot get a backtrace (Linux glibc x86-64) with the default |
This should be useful eventually for figuring out why the Sockets test stales on CI, such as https://buildkite.com/julialang/julia-master/builds/16200#01838101-13bb-48fd-8aff-7705c619cd66 |
Oh, interesting! I wonder if that's why I originally was thinking this didn't work. Thanks for tracking all of this down, @kpamnany! @vtjnash / @kpamnany: is this resolved? Is there anything I/we can do to help move this along? Thanks! 😊 |
777eb54
to
9b6a846
Compare
I added a comment warning that this is only intended for use when all threads are stopped (i.e. in We've verified that this can be useful. So, apart from the GC checker errors, I think this is good to go. |
9b6a846
to
b13b963
Compare
Iterates through `jl_all_tls_states` and through all `live_tasks` in `ptls->heap`, printing backtraces.
b13b963
to
1f6ba00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LGTM thanks @kpamnany!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I would like a finally comment by @vtjnash saying if he is okay with the current intended usage (only from GDB) and then a follow-up PR that would implement Go solution to briefly stop-the-world.
This is failing analyzegc on master. Not sure why it didn't fail on this PR:
|
This PR was reverted in #47182 because of the problem with analyzegc reported above. |
Iterates through
jl_all_tls_states
and through alllive_tasks
inptls->heap
, printing backtraces.The purpose is to help find deadlocks.
Replaces #44990.
Closes #46177.