Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add jl_print_task_backtraces() #46845

Merged
merged 1 commit into from
Oct 15, 2022
Merged

Add jl_print_task_backtraces() #46845

merged 1 commit into from
Oct 15, 2022

Conversation

kpamnany
Copy link
Contributor

@kpamnany kpamnany commented Sep 20, 2022

Iterates through jl_all_tls_states and through all live_tasks in ptls->heap, printing backtraces.

The purpose is to help find deadlocks.

Replaces #44990.
Closes #46177.

src/stackwalk.c Outdated Show resolved Hide resolved
Copy link
Contributor

@vilterp vilterp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

src/stackwalk.c Show resolved Hide resolved
src/stackwalk.c Outdated Show resolved Hide resolved
src/stackwalk.c Outdated Show resolved Hide resolved
Copy link
Member

@NHDaly NHDaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent, this looks super helpful. Thank you @kpamnany!!

Can we also backport to 1.8? 🥺 I think this would be quite helpful for debugging issues on 1.8 too, if we can.

src/stackwalk.c Outdated
size_t i, j, n;

for (i = 0; i < jl_n_threads; i++) {
ptls = jl_all_tls_states[i];
Copy link
Member

@vtjnash vtjnash Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

traditionally, we call this ptls2 in GC, but you also aren't allowed to access that outside of GC, since it is thread-local state for a different thread.

Copy link
Contributor Author

@kpamnany kpamnany Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will be racy, but will it simply fail on some platforms?

Copy link
Contributor Author

@kpamnany kpamnany Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If, on some platforms, it is not possible to access another thread's state through jl_all_tls_states, the only way around I can see is to do a GC stop-the-world and have each thread print its own live_tasks which will no longer be racy, but make this a lot more complicated and slow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in jl_gc_wait_for_the_world(), the calling thread looks at other threads' TLS, so the problem is not that a thread cannot see another thread's TLS, but that it is racy and dangerous, correct?

There is the risk that since we haven't stopped the other threads, that a task's stack will change during this inspection. I'm not sure how likely it is that we'll crash doing this, but more elaborate deadlock detection will be much harder to do -- this can be useful as is, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thread, in general, cannot print the stack of another thread's task either, so having it run behind the jl_gc_wait_for_the_world logic (running on every thread) would be useful. Though deadlock reporting doesn't precisely need that, since that may be only essential to run if we have run out of all other useful work and halted the other threads anyways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The live_tasks pointer may move from under you while you are looking at it, making this very dangerous. You are right that sometimes we examine values in the jl_all_tls_states, but only if they are marked atomic or are protected by a lock

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, we only intended to call this from gdb, in which case the world will already be artificially stopped.

I agree with you both, that I don't think this is safe to call from inside the julia runtime.

I think we should make sure that the function comments make it clear that this isn't safe to call while the runtime is running.
It might make sense to provide another version of it, which force-pauses all the threads before printing, reusing the GC machinery, but that seems like a follow up PR?

Copy link
Contributor

@vilterp vilterp Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW the Go version of this stops the world so it can be safely called from userspace: https://github.com/golang/go/blob/6d8ec893039a39f495c8139012e47754e4518b70/src/runtime/mprof.go#L1193-L1196

This is sometimes put in an HTTP handler for easy access; would be nice for us to eventually be able to do the same. GDB is a good start ofc :)

@kpamnany kpamnany force-pushed the kp/task-stacktraces branch 2 times, most recently from a3b8428 to 777eb54 Compare September 20, 2022 20:21
@kpamnany
Copy link
Contributor Author

For this error:

/cache/build/default-amdci5-6/julialang/julia-master/src/stackwalk.c:1124:9: error: Passing non-rooted value as argument to function that may GC [julia.GCChecker]
--
  | jlbacktracet(ptls2->root_task);
  | ^            ~~~~~~~~~~~~~~~~

Do I have to JL_GC_PUSH1(&ptls2->root_task) and JL_GC_POP() after?

@kpamnany
Copy link
Contributor Author

Also, I don't understand this?

/cache/build/default-amdci5-6/julialang/julia-master/src/stackwalk.c:1119:54: note: Argument value was derived global with untracked type. You may want to update the checker's type list
--
  | jl_safe_printf("     ---- Root task (%p)\n", ptls2->root_task);
  | ^~~~~~~~~~~~~~~~

@kpamnany
Copy link
Contributor Author

We're experimenting with this to see if it's useful. As Nathan said, this is basically for use in gdb, i.e. all threads will be stopped, so poking into other threads' local storage is safe.

I find that I cannot get a backtrace (Linux glibc x86-64) with the default JL_HAVE_ASM and have to turn on JL_HAVE_UNW_CONTEXT. Does #45110 fix that @vtjnash?

@vtjnash
Copy link
Member

vtjnash commented Sep 28, 2022

This should be useful eventually for figuring out why the Sockets test stales on CI, such as https://buildkite.com/julialang/julia-master/builds/16200#01838101-13bb-48fd-8aff-7705c619cd66

@NHDaly
Copy link
Member

NHDaly commented Oct 3, 2022

I find that I cannot get a backtrace (Linux glibc x86-64) with the default JL_HAVE_ASM and have to turn on JL_HAVE_UNW_CONTEXT. Does #45110 fix that @vtjnash?

Oh, interesting! I wonder if that's why I originally was thinking this didn't work. Thanks for tracking all of this down, @kpamnany!

@vtjnash / @kpamnany: is this resolved? Is there anything I/we can do to help move this along? Thanks! 😊

@kpamnany kpamnany force-pushed the kp/task-stacktraces branch from 777eb54 to 9b6a846 Compare October 4, 2022 22:13
@kpamnany
Copy link
Contributor Author

kpamnany commented Oct 4, 2022

I added a comment warning that this is only intended for use when all threads are stopped (i.e. in gdb). I also removed it from exported functions for that reason.

We've verified that this can be useful. So, apart from the GC checker errors, I think this is good to go.

@kpamnany kpamnany force-pushed the kp/task-stacktraces branch from 9b6a846 to b13b963 Compare October 5, 2022 14:22
Iterates through `jl_all_tls_states` and through all `live_tasks`
in `ptls->heap`, printing backtraces.
@kpamnany kpamnany force-pushed the kp/task-stacktraces branch from b13b963 to 1f6ba00 Compare October 5, 2022 16:38
Copy link
Member

@NHDaly NHDaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM thanks @kpamnany!

@vchuravy vchuravy requested a review from vtjnash October 11, 2022 14:43
Copy link
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I would like a finally comment by @vtjnash saying if he is okay with the current intended usage (only from GDB) and then a follow-up PR that would implement Go solution to briefly stop-the-world.

@vchuravy vchuravy merged commit d89b96b into master Oct 15, 2022
@vchuravy vchuravy deleted the kp/task-stacktraces branch October 15, 2022 20:43
@Keno
Copy link
Member

Keno commented Oct 15, 2022

This is failing analyzegc on master. Not sure why it didn't fail on this PR:

/cache/build/default-amdci5-5/julialang/julia-master/src/stackwalk.c:1131:28: error: Implicit Atomic seq_cst synchronization [concurrency-implicit-atomics,-warnings-as-errors]
--
  | for (size_t i = 0; i < jl_n_threads; i++) {
  | ^
  | /cache/build/default-amdci5-5/julialang/julia-master/src/stackwalk.c:1132:27: error: Implicit Atomic seq_cst synchronization [concurrency-implicit-atomics,-warnings-as-errors]
  | jl_ptls_t ptls2 = jl_all_tls_states[i];
  | ^

giordano added a commit that referenced this pull request Oct 16, 2022
@giordano
Copy link
Contributor

This PR was reverted in #47182 because of the problem with analyzegc reported above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ability to print stacktraces of all tasks, including unscheduled ones
7 participants