Refetch PTLS via Task struct (preparation for task migration) #39220

Closed
wants to merge 3 commits

Conversation

@tkf (Member) commented Jan 13, 2021

This is an alternative approach to #39168. In this patch, PTLS is refetched through jl_task_t. We can "cache" the pointer to the jl_task_t as an SSA variable, since it is guaranteed that the task does not change within a function [1].

Functionally, I think the first commit is sufficient. In the second commit, I removed pgcstack from PTLS as it's redundant. It turned out there is a bit of a bootstrapping issue, since we need the Task type to allocate a task, but we need the gcstack field in Task to create the Task type.

I'm not sure what memory ordering the load used for refetching PTLS should have, so I just put monotonic as the weakest placeholder (since unordered gets reordered w.r.t. the function calls).

As I mentioned in the last comment #39168 (comment), a major disadvantage of this approach is that LICM of threadid() would not work. However, it might be better to stop using the array[threadid()] pattern for "thread-local storage" anyway.

There are some failures in the codegen tests. I'll fix them once this direction is OK.

[1] This won't be the case if we start doing continuation stealing like Cilk. But I think we can mix in the approach taken in #39168: conditionally refetch the pointer to the task struct at the beginning of the continuation and then propagate the task pointer via phi nodes.
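
For concreteness, here is a rough C-level sketch of the lowering this patch performs (the struct layout and helper names below are illustrative assumptions, not the actual jl_task_t definition):

typedef struct illustrative_task { void *ptls; /* ... */ } task_t;  /* stand-in for jl_task_t */

extern task_t *jl_current_task;   /* fetched once per function and "cached" as an SSA value */
extern void maybe_switch(void);   /* e.g. yield(); the task may resume on another thread */

void demo(void **a, void **b)
{
    task_t *ct = jl_current_task;  /* the task does not change within a function */
    *a = ct->ptls;                 /* PTLS is refetched through the task struct */
    maybe_switch();
    *b = ct->ptls;                 /* ...and refetched again after the call */
}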

Demo

function f()
    a = Threads.threadid()
    yield()
    b = Threads.threadid()
    (a, b)
end

@code_llvm debuginfo=:none optimize=false f()

define void @julia_f_222([2 x i64]* noalias nocapture sret %0) {
top:
  %1 = alloca [2 x i64], align 8
  %2 = call {}*** @julia.ptls_states()
  %3 = bitcast {}*** %2 to {}**
  %current_task_field = getelementptr inbounds {}*, {}** %3, i64 825
  %current_task = load {}*, {}** %current_task_field, align 8
  %4 = bitcast {}* %current_task to {}**
  %ptls_field = getelementptr inbounds {}*, {}** %4, i64 38
  %5 = bitcast {}** %ptls_field to {}**
  %ptls_load = load atomic {}*, {}** %5 monotonic, align 8
  %6 = bitcast {}* %ptls_load to {}**
  %7 = bitcast {}** %6 to i16*
  %8 = getelementptr inbounds i16, i16* %7, i64 4
  %9 = load i16, i16* %8, align 2
  %10 = sext i16 %9 to i64
  %11 = add i64 %10, 1
  %12 = call nonnull {}* @j_yield_224()
  %13 = bitcast {}* %current_task to {}**
  %ptls_field1 = getelementptr inbounds {}*, {}** %13, i64 38
  %14 = bitcast {}** %ptls_field1 to {}**
  %ptls_load2 = load atomic {}*, {}** %14 monotonic, align 8
  %15 = bitcast {}* %ptls_load2 to {}**
  %16 = bitcast {}** %15 to i16*
  %17 = getelementptr inbounds i16, i16* %16, i64 4
  %18 = load i16, i16* %17, align 2
  %19 = sext i16 %18 to i64
  %20 = add i64 %19, 1
  %21 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i32 0, i32 0
  store i64 %11, i64* %21, align 8
  %22 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i32 0, i32 1
  store i64 %20, i64* %22, align 8
  %23 = bitcast [2 x i64]* %0 to i8*
  %24 = bitcast [2 x i64]* %1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %23, i8* %24, i64 16, i1 false)
  ret void
}

You can see %ptls_load2 = load ... after @j_yield_224().

@code_llvm debuginfo=:none f()

define void @julia_f_183([2 x i64]* noalias nocapture sret %0) {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #4
  %current_task_field4 = getelementptr i8, i8* %thread_ptr, i64 -26168
  %1 = bitcast i8* %current_task_field4 to {}***
  %current_task5 = load {}**, {}*** %1, align 8
  %ptls_field6 = getelementptr inbounds {}*, {}** %current_task5, i64 38
  %2 = bitcast {}** %ptls_field6 to {}**
  %ptls_load = load atomic {}*, {}** %2 monotonic, align 8
  %3 = bitcast {}* %ptls_load to i16*
  %4 = getelementptr inbounds i16, i16* %3, i64 4
  %5 = load i16, i16* %4, align 2
  %6 = sext i16 %5 to i64
  %7 = add nsw i64 %6, 1
  %8 = call nonnull {}* @j_yield_185()
  %ptls_load2 = load atomic {}*, {}** %2 monotonic, align 8
  %9 = bitcast {}* %ptls_load2 to i16*
  %10 = getelementptr inbounds i16, i16* %9, i64 4
  %11 = load i16, i16* %10, align 2
  %12 = sext i16 %11 to i64
  %13 = add nsw i64 %12, 1
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 0
  store i64 %7, i64* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx3 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 1
  store i64 %13, i64* %.sroa.2.0..sroa_idx3, align 8
  ret void
}

@tkf marked this pull request as draft January 13, 2021 01:22
src/init.c Outdated
Comment on lines 735 to 742
// We need `gcstack` in `Task` to allocate Julia objects; *including* the `Task` type.
// However, to allocate a `Task` via `jl_gc_alloc` as done in `jl_init_root_task`,
// we need the `Task` type itself. We use stack-allocated "raw" `jl_task_t` struct to
// workaround this chicken-and-egg problem. Note that this relies on GC to be turned
// off just above as GC fails because we don't/can't allocate the type tag.
jl_task_t bootstrap_task;
bootstrap_task.gcstack = NULL;
jl_current_task = &bootstrap_task;
@tkf (Member Author) commented Jan 13, 2021

This is what I suggest for solving the bootstrapping issue I mentioned in the OP. It's a somewhat scary approach, but I find it the most straightforward way to do it.

Also, do we want to zero out the struct? (done in e259cc0)
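
For illustration, a minimal sketch of the zeroed variant (not necessarily how that commit does it):

jl_task_t bootstrap_task;
memset(&bootstrap_task, 0, sizeof(jl_task_t));  /* zero every field (needs <string.h>) */
bootstrap_task.gcstack = NULL;                  /* now redundant, kept for clarity */
jl_current_task = &bootstrap_task;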

Comment on lines +4633 to +4636
LoadInst *ptls_load = ctx.builder.CreateAlignedLoad(
emit_bitcast(ctx, pptls, T_ppjlvalue), Align(sizeof(void *)), "ptls_load");
    // Note: Corresponding store (`t->ptls = ptls`) happens in `ctx_switch` of tasks.c.
ptls_load->setOrdering(AtomicOrdering::Monotonic); // TODO: what should we use?
@tkf (Member Author):

What's the appropriate ordering here?

@vtjnash (Member) commented Jan 13, 2021

Private-Thread-LocalS have no atomic ordering

@tkf (Member Author):

Maybe I should've asked: how do we prevent a load of PTLS from the task struct's field from being reordered across a function call? I used atomic ordering so that b in this function f would not be CSE'd with a:

function f()
    a = Threads.threadid()
    yield()
    b = Threads.threadid()
    (a, b)
end

Are there other ways to do it? Also, since the task struct would be shared between threads once we start migration, I thought some kind of ordering might make sense.

@vtjnash (Member):

That's an aliasing question and has nothing to do with atomics
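
For illustration, the aliasing framing looks roughly like this in plain C (types and names below are assumptions, not the actual runtime structs): with ordinary non-atomic loads, the optimizer must already assume an opaque call may clobber the field, so it cannot merge the two loads unless alias analysis proves otherwise.

typedef struct { void *ptls; } task_t;   /* illustrative stand-in for jl_task_t */

extern task_t *current_task;
extern void opaque_call(void);           /* e.g. a task switch such as yield */

void load_twice(void **a, void **b)
{
    *a = current_task->ptls;   /* plain (non-atomic) load */
    opaque_call();             /* may store to current_task->ptls */
    *b = current_task->ptls;   /* must be reloaded; cannot be CSE'd with the first load */
}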

tkf commented Jan 13, 2021

Taken from test/compiler/codegen.jl, this function

jl_string_ptr(s::String) = ccall(:jl_string_ptr, Ptr{UInt8}, (Any,), s)

lowers to (@code_llvm debuginfo=:none jl_string_ptr(""))

define i64 @julia_jl_string_ptr_179({}* nonnull %0) {
top:
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #5
  %current_task_field3 = getelementptr i8, i8* %thread_ptr, i64 -26168
  %1 = bitcast i8* %current_task_field3 to {}***
  %current_task4 = load {}**, {}*** %1, align 8
  %ptls_field5 = getelementptr inbounds {}*, {}** %current_task4, i64 38
  %2 = bitcast {}** %ptls_field5 to {}**
  %ptls_load = load atomic {}*, {}** %2 monotonic, align 8
  %ptls_load2 = load atomic {}*, {}** %2 monotonic, align 8
  %3 = bitcast {}* %0 to {}**
  %4 = getelementptr inbounds {}*, {}** %3, i64 1
  %5 = ptrtoint {}** %4 to i64
  ret i64 %5
}

and compiles to (julia> @code_native debuginfo=:none jl_string_ptr(""))

        .text
        movq    %fs:0, %rax
        movq    -26168(%rax), %rax
        movq    304(%rax), %rcx
        movq    304(%rax), %rax
        leaq    8(%rdi), %rax
        retq
        nopw    %cs:(%rax,%rax)

with this PR. But, in 1.7.0-DEV.266, we have

define i64 @julia_jl_string_ptr_197({}* nonnull %0) {
top:
  %1 = bitcast {}* %0 to {}**
  %2 = getelementptr inbounds {}*, {}** %1, i64 1
  %3 = ptrtoint {}** %2 to i64
  ret i64 %3
}

and

        .text
        leaq    8(%rdi), %rax
        retq
        nopw    %cs:(%rax,%rax)

I'm puzzled why LLVM can't (or is not allowed to?) eliminate the unused loads. I understand memory ordering also imposes constraints on the reordering of other instructions, but I didn't expect this result. I need to learn more about this stuff...

Anyway, given this constraint (whether theoretical or merely technological), maybe we need to switch to a placeholder-function + pass-based solution akin to #39168 and try to place/eliminate the loads ourselves? Or is there a more clever approach?

@tkf added the compiler:codegen (Generation of LLVM IR and native code) and multithreading (Base.Threads and related functionality) labels Jan 13, 2021
vtjnash commented Jan 13, 2021

This is a start. The next step is to replace julia.ptls_states with julia.pgcstack, rewrite jl_get_ptls_states to jl_current_task->ptls, and rewrite jl_current_task to container_of(..., jl_task_t, pgcstack), where ... was the current getter for ptls but now contains &current_task->pgcstack.
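
For reference, a minimal sketch of the container_of idiom being described (the struct and member names here are illustrative stand-ins, not the actual jl_task_t layout):

#include <stddef.h>

/* classic container_of: recover a struct pointer from a pointer to one of its members */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

typedef struct task { void *pgcstack; /* ... */ } task_t;   /* stand-in for jl_task_t */

/* if the per-thread getter now returns &current_task->pgcstack,
   the current task can be recovered from that pointer: */
static task_t *task_from_pgcstack(void **pgcstack)
{
    return container_of(pgcstack, task_t, pgcstack);
}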

tkf commented Jan 14, 2021

Re my comment #39220 (comment) just above: I just found this in LLVM's manual (https://llvm.org/docs/Atomics.html#monotonic):

In terms of the optimizer, this can be treated as a read+write on the relevant memory location (and alias analysis will take advantage of that). In addition, it is legal to reorder non-atomic and Unordered loads around Monotonic loads. CSE/DSE and a few other optimizations are allowed, but Monotonic operations are unlikely to be used in ways which would make those optimizations useful.

So maybe what I'm seeing is expected?
