jl_create_native generates bad code for *(Float16,Float16) #34993
Comments
This is generally what it means to be non-recursive in codegen. Now:
OK, I thought:
I've used #33955 for that kind of thing in experiments, which I suspect may be the correct solution eventually, but obviously it would be nice to have a workaround for now.
It's one thing to emit placeholder functions that will later be replaced with (a call to) the final function; it's another to have these placeholders allocate GC frames and perform non-specsig calls. There also seems to be some caching happening:
Why is there an additional pointer argument here? Anyway, after calling this function on the CPU,
Calling the MWE without CUDAnative:

```julia
using Core.Compiler: MethodInstance
using Base: _methods_by_ftype

function codegen(f, tt)
    # get the method instance
    world = typemax(UInt)
    sig = Base.signature_type(f, tt)
    mthds = _methods_by_ftype(sig, -1, world)
    Base.isdispatchtuple(tt) || return :(error("$tt is not a dispatch tuple"))
    length(mthds) == 1 || return :(throw(MethodError(f, tt)))
    mtypes, msp, m = mthds[1]
    method_instance = ccall(:jl_specializations_get_linfo, Ref{MethodInstance},
                            (Any, Any, Any), m, mtypes, msp)

    # generate IR
    params = Base.CodegenParams()
    native_code = ccall(:jl_create_native, Ptr{Cvoid},
                        (Vector{Core.MethodInstance}, Base.CodegenParams),
                        [method_instance], params)
    @assert native_code != C_NULL
    llvm_mod = ccall(:jl_get_llvm_module, Ptr{Cvoid},
                     (Ptr{Cvoid},), native_code)
    @assert llvm_mod != C_NULL

    # get the top-level code
    code = Core.Compiler.inf_for_methodinstance(method_instance, world, world)

    # get the top-level function index
    llvm_func_idx = Ref{Int32}(-1)
    llvm_specfunc_idx = Ref{Int32}(-1)
    ccall(:jl_breakpoint, Nothing, ())
    ccall(:jl_get_function_id, Nothing,
          (Ptr{Cvoid}, Any, Ptr{Int32}, Ptr{Int32}),
          native_code, code, llvm_func_idx, llvm_specfunc_idx)
    @assert llvm_func_idx[] != -1
    @assert llvm_specfunc_idx[] != -1

    # get the top-level function
    llvm_func = ccall(:jl_get_llvm_function, Ptr{Cvoid},
                      (Ptr{Cvoid}, UInt32), native_code, llvm_func_idx[] - 1)
    llvm_specfunc = ccall(:jl_get_llvm_function, Ptr{Cvoid},
                          (Ptr{Cvoid}, UInt32), native_code, llvm_specfunc_idx[] - 1)
    @assert llvm_specfunc != C_NULL

    # dump IR
    ccall(:jl_dump_function_ir, Ref{String},
          (Ptr{Cvoid}, Bool, Bool, Ptr{UInt8}),
          llvm_specfunc, true, true, :none)
end
```
The codegen restructuring has regressed some GPU code: where we used to get static code, we now get invokes and calls to jfptr functions:
I'm using CUDAnative.code_llvm here, which calls jl_create_native, because InteractiveUtils.code_llvm only shows the IR of the outer function even when dumping the entire module. That may be a red herring, though, since generating code for the constructor itself yields the expected IR (see below).
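For comparison, the stock reflection call looks like the sketch below. It uses the documented InteractiveUtils API; `dump_module=true` prints the entire LLVM module rather than just the function body, but the IR still comes from the standard codegen path rather than jl_create_native, which is why it can hide the problem:

```julia
using InteractiveUtils

# Print the LLVM IR for Float16 multiplication. With dump_module=true the
# whole module is emitted, not just the outer function's body.
code_llvm(stdout, *, Tuple{Float16, Float16}; dump_module=true)
```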
Before the refactor:
Generating code for the constructor directly:
Ref #25984, JuliaGPU/CUDAnative.jl#162 (comment)