Refetch PTLS via Task struct (preparation for task migration) #39220

Closed
wants to merge 3 commits

Conversation

@tkf (Member) commented Jan 13, 2021

This is an alternative approach to #39168. In this patch, PTLS is refetched through jl_task_t. We can "cache" the pointer to the jl_task_t as an SSA variable, since it is guaranteed that the task does not change within a function [1].

Functionally, I think the first commit is sufficient. In the second commit, I removed pgcstack from PTLS as it's redundant. It turned out there is a bit of a bootstrapping issue, since we need the Task type to allocate a task, but we need the gcstack field in Task to create the Task type.

I'm not sure what memory ordering the load used for refetching PTLS should have, so I just put monotonic as the weakest placeholder (since unordered gets reordered w.r.t. the function calls).

As I mentioned in the last comment #39168 (comment), a major disadvantage of this approach is that LICM of threadid() would not work. However, it might be better to stop using the array[threadid()] pattern for "thread-local storage" anyway.

There are some failures in the codegen tests. I'll fix them once this direction is OK.

[1] This won't be the case if we start doing continuation stealing like Cilk. But I think we can mix in the approach taken in #39168: conditionally refetch the pointer to the task struct at the beginning of the continuation and then propagate the task pointer via phi nodes.
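
For concreteness, here is a rough C-level sketch of the lowering this patch performs (the struct layout and helper names below are illustrative assumptions, not the actual jl_task_t definition):

typedef struct illustrative_task { void *ptls; /* ... */ } task_t;  /* stand-in for jl_task_t */

extern task_t *jl_current_task;   /* fetched once per function and "cached" as an SSA value */
extern void maybe_switch(void);   /* e.g. yield(); the task may resume on another thread */

void demo(void **a, void **b)
{
    task_t *ct = jl_current_task;  /* the task does not change within a function */
    *a = ct->ptls;                 /* PTLS is refetched through the task struct */
    maybe_switch();
    *b = ct->ptls;                 /* ...and refetched again after the call */
}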

Demo

function f()
    a = Threads.threadid()
    yield()
    b = Threads.threadid()
    (a, b)
end

@code_llvm debuginfo=:none optimize=false f()

define void @julia_f_222([2 x i64]* noalias nocapture sret %0) {
top:
  %1 = alloca [2 x i64], align 8
  %2 = call {}*** @julia.ptls_states()
  %3 = bitcast {}*** %2 to {}**
  %current_task_field = getelementptr inbounds {}*, {}** %3, i64 825
  %current_task = load {}*, {}** %current_task_field, align 8
  %4 = bitcast {}* %current_task to {}**
  %ptls_field = getelementptr inbounds {}*, {}** %4, i64 38
  %5 = bitcast {}** %ptls_field to {}**
  %ptls_load = load atomic {}*, {}** %5 monotonic, align 8
  %6 = bitcast {}* %ptls_load to {}**
  %7 = bitcast {}** %6 to i16*
  %8 = getelementptr inbounds i16, i16* %7, i64 4
  %9 = load i16, i16* %8, align 2
  %10 = sext i16 %9 to i64
  %11 = add i64 %10, 1
  %12 = call nonnull {}* @j_yield_224()
  %13 = bitcast {}* %current_task to {}**
  %ptls_field1 = getelementptr inbounds {}*, {}** %13, i64 38
  %14 = bitcast {}** %ptls_field1 to {}**
  %ptls_load2 = load atomic {}*, {}** %14 monotonic, align 8
  %15 = bitcast {}* %ptls_load2 to {}**
  %16 = bitcast {}** %15 to i16*
  %17 = getelementptr inbounds i16, i16* %16, i64 4
  %18 = load i16, i16* %17, align 2
  %19 = sext i16 %18 to i64
  %20 = add i64 %19, 1
  %21 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i32 0, i32 0
  store i64 %11, i64* %21, align 8
  %22 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i32 0, i32 1
  store i64 %20, i64* %22, align 8
  %23 = bitcast [2 x i64]* %0 to i8*
  %24 = bitcast [2 x i64]* %1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %23, i8* %24, i64 16, i1 false)
  ret void
}

You can see %ptls_load2 = load ... after @j_yield_224().

@code_llvm debuginfo=:none f()

define void @julia_f_183([2 x i64]* noalias nocapture sret %0) {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #4
  %current_task_field4 = getelementptr i8, i8* %thread_ptr, i64 -26168
  %1 = bitcast i8* %current_task_field4 to {}***
  %current_task5 = load {}**, {}*** %1, align 8
  %ptls_field6 = getelementptr inbounds {}*, {}** %current_task5, i64 38
  %2 = bitcast {}** %ptls_field6 to {}**
  %ptls_load = load atomic {}*, {}** %2 monotonic, align 8
  %3 = bitcast {}* %ptls_load to i16*
  %4 = getelementptr inbounds i16, i16* %3, i64 4
  %5 = load i16, i16* %4, align 2
  %6 = sext i16 %5 to i64
  %7 = add nsw i64 %6, 1
  %8 = call nonnull {}* @j_yield_185()
  %ptls_load2 = load atomic {}*, {}** %2 monotonic, align 8
  %9 = bitcast {}* %ptls_load2 to i16*
  %10 = getelementptr inbounds i16, i16* %9, i64 4
  %11 = load i16, i16* %10, align 2
  %12 = sext i16 %11 to i64
  %13 = add nsw i64 %12, 1
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 0
  store i64 %7, i64* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx3 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 1
  store i64 %13, i64* %.sroa.2.0..sroa_idx3, align 8
  ret void
}

@tkf marked this pull request as draft January 13, 2021 01:22
src/init.c Outdated
Comment on lines 735 to 742
// We need `gcstack` in `Task` to allocate Julia objects; *including* the `Task` type.
// However, to allocate a `Task` via `jl_gc_alloc` as done in `jl_init_root_task`,
// we need the `Task` type itself. We use stack-allocated "raw" `jl_task_t` struct to
// workaround this chicken-and-egg problem. Note that this relies on GC to be turned
// off just above as GC fails because we don't/can't allocate the type tag.
jl_task_t bootstrap_task;
bootstrap_task.gcstack = NULL;
jl_current_task = &bootstrap_task;
@tkf (Member Author) commented Jan 13, 2021

This is what I suggest for solving the bootstrapping issue I mentioned in the OP. It's a somewhat scary approach, but I find it the most straightforward way to do it.

Also, do we want to zero out the struct? (done in e259cc0)
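
For illustration, a minimal sketch of the zeroed variant (not necessarily how that commit does it):

jl_task_t bootstrap_task;
memset(&bootstrap_task, 0, sizeof(jl_task_t));  /* zero every field (needs <string.h>) */
bootstrap_task.gcstack = NULL;                  /* now redundant, kept for clarity */
jl_current_task = &bootstrap_task;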

Comment on lines +4633 to +4636
LoadInst *ptls_load = ctx.builder.CreateAlignedLoad(
emit_bitcast(ctx, pptls, T_ppjlvalue), Align(sizeof(void *)), "ptls_load");
    // Note: Corresponding store (`t->ptls = ptls`) happens in `ctx_switch` of tasks.c.
ptls_load->setOrdering(AtomicOrdering::Monotonic); // TODO: what should we use?
@tkf (Member Author):

What's the appropriate ordering here?

@vtjnash (Member) commented Jan 13, 2021

Private-Thread-LocalS have no atomic ordering

@tkf (Member Author):

Maybe I should've asked: how do we prevent a load of PTLS from the task struct's field from being reordered across a function call? I used atomic ordering so that b in this function f would not be CSE'd with a:

function f()
    a = Threads.threadid()
    yield()
    b = Threads.threadid()
    (a, b)
end

Are there other ways to do it? Also, since the task struct would be shared between threads once we start migration, I thought some kind of ordering might make sense.

@vtjnash (Member):

That's an aliasing question and has nothing to do with atomics
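
For illustration, the aliasing framing looks roughly like this in plain C (types and names below are assumptions, not the actual runtime structs): with ordinary non-atomic loads, the optimizer must already assume an opaque call may clobber the field, so it cannot merge the two loads unless alias analysis proves otherwise.

typedef struct { void *ptls; } task_t;   /* illustrative stand-in for jl_task_t */

extern task_t *current_task;
extern void opaque_call(void);           /* e.g. a task switch such as yield */

void load_twice(void **a, void **b)
{
    *a = current_task->ptls;   /* plain (non-atomic) load */
    opaque_call();             /* may store to current_task->ptls */
    *b = current_task->ptls;   /* must be reloaded; cannot be CSE'd with the first load */
}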

tkf commented Jan 13, 2021

Taken from test/compiler/codegen.jl, this function

jl_string_ptr(s::String) = ccall(:jl_string_ptr, Ptr{UInt8}, (Any,), s)

lowers to (@code_llvm debuginfo=:none jl_string_ptr(""))

define i64 @julia_jl_string_ptr_179({}* nonnull %0) {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #5
  %current_task_field3 = getelementptr i8, i8* %thread_ptr, i64 -26168
  %1 = bitcast i8* %current_task_field3 to {}***
  %current_task4 = load {}**, {}*** %1, align 8
  %ptls_field5 = getelementptr inbounds {}*, {}** %current_task4, i64 38
  %2 = bitcast {}** %ptls_field5 to {}**
  %ptls_load = load atomic {}*, {}** %2 monotonic, align 8
  %ptls_load2 = load atomic {}*, {}** %2 monotonic, align 8
  %3 = bitcast {}* %0 to {}**
  %4 = getelementptr inbounds {}*, {}** %3, i64 1
  %5 = ptrtoint {}** %4 to i64
  ret i64 %5
}

and compiles to (julia> @code_native debuginfo=:none jl_string_ptr(""))

        .text
        movq    %fs:0, %rax
        movq    -26168(%rax), %rax
        movq    304(%rax), %rcx
        movq    304(%rax), %rax
        leaq    8(%rdi), %rax
        retq
        nopw    %cs:(%rax,%rax)

with this PR. But, in 1.7.0-DEV.266, we have

define i64 @julia_jl_string_ptr_197({}* nonnull %0) {
top:
  %1 = bitcast {}* %0 to {}**
  %2 = getelementptr inbounds {}*, {}** %1, i64 1
  %3 = ptrtoint {}** %2 to i64
  ret i64 %3
}

and

        .text
        leaq    8(%rdi), %rax
        retq
        nopw    %cs:(%rax,%rax)

I'm puzzled why LLVM can't (or is not allowed to?) eliminate the unused loads. I understand memory ordering also imposes constraints on the reordering of other instructions, but I didn't expect this result. I need to learn more about this stuff...

Anyway, given this constraint (whether theoretical or merely technological), maybe we need to switch to a placeholder-function + pass-based solution akin to #39168 and try to place/eliminate the loads ourselves? Or is there a more clever approach?

@tkf added the compiler:codegen (Generation of LLVM IR and native code) and multithreading (Base.Threads and related functionality) labels Jan 13, 2021
vtjnash commented Jan 13, 2021

This is a start. The next step is to replace julia.ptls_states with julia.pgcstack, rewrite jl_get_ptls_states to jl_current_task->ptls, and rewrite jl_current_task to container_of(..., jl_task_t, pgcstack), where ... was the current getter for ptls but now contains &current_task->pgcstack.
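
For reference, a minimal sketch of the container_of idiom being described (the struct and member names here are illustrative stand-ins, not the actual jl_task_t layout):

#include <stddef.h>

/* classic container_of: recover a struct pointer from a pointer to one of its members */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct task { void *pgcstack; /* ... */ } task_t;   /* stand-in for jl_task_t */

/* if the per-thread getter now returns &current_task->pgcstack,
   the current task can be recovered from that pointer: */
static task_t *task_from_pgcstack(void **pgcstack)
{
    return container_of(pgcstack, task_t, pgcstack);
}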

tkf commented Jan 14, 2021

Re my comment #39220 (comment) just above: I just found this in LLVM's manual (https://llvm.org/docs/Atomics.html#monotonic):

In terms of the optimizer, this can be treated as a read+write on the relevant memory location (and alias analysis will take advantage of that). In addition, it is legal to reorder non-atomic and Unordered loads around Monotonic loads. CSE/DSE and a few other optimizations are allowed, but Monotonic operations are unlikely to be used in ways which would make those optimizations useful.

So maybe what I'm seeing is expected?
