-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refetch PTLS via Task struct (preparation for task migration) #39220
Conversation
src/init.c
Outdated
// We need `gcstack` in `Task` to allocate Julia objects; *including* the `Task` type. | ||
// However, to allocate a `Task` via `jl_gc_alloc` as done in `jl_init_root_task`, | ||
// we need the `Task` type itself. We use stack-allocated "raw" `jl_task_t` struct to | ||
// workaround this chicken-and-egg problem. Note that this relies on GC to be turned | ||
// off just above as GC fails because we don't/can't allocate the type tag. | ||
jl_task_t bootstrap_task; | ||
bootstrap_task.gcstack = NULL; | ||
jl_current_task = &bootstrap_task; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I suggest for solving the bootstrapping issue I mentioned in the OP. It's a bit scary approach but I find it the most straightforward way to do it.
Also, do we want to zero out the struct? (done in e259cc0)
LoadInst *ptls_load = ctx.builder.CreateAlignedLoad( | ||
emit_bitcast(ctx, pptls, T_ppjlvalue), Align(sizeof(void *)), "ptls_load"); | ||
// Note: Corersponding store (`t->ptls = ptls`) happes in `ctx_switch` of tasks.c. | ||
ptls_load->setOrdering(AtomicOrdering::Monotonic); // TODO: what should we use? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the appropriate ordering here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Private-Thread-LocalS have no atomic ordering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I should've asked how do we prevent a load of PTLS from the field of the task struct to be reordered against a function call? I used atomic ordering so that b
in this function f
would not be CSE'ed with a
:
function f()
a = Threads.threadid()
yield()
b = Threads.threadid()
(a, b)
end
Is there other ways to do it? Also, since task struct would be shared between threads once we start migration, I thought some kind of ordering might make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an aliasing question and has nothing to do with atomics
Taken from test/compiler/codgen.jl, this function jl_string_ptr(s::String) = ccall(:jl_string_ptr, Ptr{UInt8}, (Any,), s) lowers to ( define i64 @julia_jl_string_ptr_179({}* nonnull %0) {
top:
%thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #5
%current_task_field3 = getelementptr i8, i8* %thread_ptr, i64 -26168
%1 = bitcast i8* %current_task_field3 to {}***
%current_task4 = load {}**, {}*** %1, align 8
%ptls_field5 = getelementptr inbounds {}*, {}** %current_task4, i64 38
%2 = bitcast {}** %ptls_field5 to {}**
%ptls_load = load atomic {}*, {}** %2 monotonic, align 8
%ptls_load2 = load atomic {}*, {}** %2 monotonic, align 8
%3 = bitcast {}* %0 to {}**
%4 = getelementptr inbounds {}*, {}** %3, i64 1
%5 = ptrtoint {}** %4 to i64
ret i64 %5
} and compiles to ( .text
movq %fs:0, %rax
movq -26168(%rax), %rax
movq 304(%rax), %rcx
movq 304(%rax), %rax
leaq 8(%rdi), %rax
retq
nopw %cs:(%rax,%rax) with this PR. But, in 1.7.0-DEV.266, we have define i64 @julia_jl_string_ptr_197({}* nonnull %0) {
top:
%1 = bitcast {}* %0 to {}**
%2 = getelementptr inbounds {}*, {}** %1, i64 1
%3 = ptrtoint {}** %2 to i64
ret i64 %3
} and .text
leaq 8(%rdi), %rax
retq
nopw %cs:(%rax,%rax) I'm puzzled why LLVM can't (is not allowed to?) eliminate unused loads. I understand memory ordering also imposes the constraints on the reordering of other instructions but didn't expect this result. I need to learn more about this stuff... Anyway, given this constraint (either theoretically or technologically), maybe we need to switch to placeholder function + pass-based solution akin to #39168 and try to place/eliminate load ourselves? Or is there a more clever approach? |
This is a start. The next step is to replace |
Re my comment #39220 (comment) just above: I just found this in LLVM's manual https://llvm.org/docs/Atomics.html#monotonic
So maybe what I'm seeing is expected? |
This is an alternative approach to #39168. In this patch, PTLS is refetched through
jl_task_t
. We can "cache" the pointer to thejl_task_t
as SSA variable since it is guaranteed that the task does not change within a function [1].Functionally, I think the first commit is sufficient. In the second commit, I removed
pgcstack
from PTLS as it's redundant. It turned out there is a bit of bootstrapping issue since we need toTask
type to allocate a task but we needgcstack
inTask
field to createTask
type.I'm not sure what memory ordering of load we should use for refetching PTLS so I just put
monotonic
as the weakest placeholder (sinceunordered
gets reordered w.r.t the function calls).As I mentioned in the last comment #39168 (comment), a major disadvantage of this approach is that LICM of
threadid()
would not work. However, it might be better to stop usingarray[threadid()]
pattern for "thread-local storage" anyway.There are some test failures in codegen test. I'll fix them once this direction is OK.
[1] This won't be the case if we start doing continuation stealing like Cilk. But, I think we can mix the approach taken in #39168 to refetch the pointer to the task struct at the beginning of the continuation conditionally and then propagate the task pointer via phi nodes.
Demo
@code_llvm debuginfo=:none optimize=false f()
You can see
%ptls_load2 = load ...
after@j_yield_224()
.@code_llvm debuginfo=:none f()