fix test_utf8.j #9

StefanKarpinski · 2011-05-02T21:06:51Z

This non-default test has been failing for quite some time, I think.

update to 05c323d [ci skip]

- A_mul_B - Ac_mul_b_RFP - qrp - cholpfact, cholpfact! - qrfact, qrfact! - LinAlg.LUTridiagonal - LinAlg.solve Previously, callling any of these resulted in an `ERROR: ____ not defined`. HT astrieanna/TypeCheck.jl/#9

Add ＠compat as discussed in #8

Increase coverage

Display exceptions

The bug here is a bit subtle, but perhaps best illustrated with the included test case: ``` function f32579(x::Int64, b::Bool) if b x = nothing end if isa(x, Int64) y = x else y = x end if isa(y, Nothing) z = y else z = y end return z === nothing end ``` The code just after SSA conversion looks like: ``` 2 1 ─ goto #3 if not _3 3 2 ─ %2 = Main.nothing::Core.Compiler.Const(nothing, false) 5 3 ┄ %3 = φ (#2 => %2, #1 => _2)::Union{Nothing, Int64} │ %4 = (%3 isa Main.Int64)::Bool └── goto #5 if not %4 6 4 ─ %6 = π (%3, Int64) └── goto #6 8 5 ─ %8 = π (%3, Nothing) 10 6 ┄ %9 = φ (#4 => %6, #5 => %8)::Union{Nothing, Int64} │ %10 = (%9 isa Main.Nothing)::Bool └── goto #8 if not %10 11 7 ─ %12 = π (%9, Nothing) └── goto #9 13 8 ─ %14 = π (%9, Int64) 15 9 ┄ %15 = φ (#7 => %12, #8 => %14)::Union{Nothing, Int64} │ %16 = (%15 === Main.nothing)::Bool └── return %16 ``` Now, we have special code in SROA (despite it not really being an SROA transform) that looks at `===` and replaces it by a nest of phis of booleans. The reasoning for this transform is that it eliminates a use of a value where we only care about the type and not the content, thus making it more likely that the value will subsequently be eligible for SROA. In addition, while it goes along resolving which values feed into any particular phi, it accumulates and type conditions it encounters along the way. Thus in the example above, something like the following happens: - We look at %14, which πs to %9 with an Int64 constraint, so we only consider the #4 predecessor for %9 (due to the constraint), until we get to %3, where we again only consider the #1 predecessor, where we find the argument (of type Int64) and conclude the result is always false - Now we pop the next item of the stack from our original phi, look at %12, which πs to %9 with a Nothing constraint. At this point we used to terminate the search because we already looked at %9. However, crucially, we looked at %9 only with an Int64 constraint, so we missed the fact that `nothing` was in fact a possible value for this phi. The result was a missing entry in the generated phi node: ``` 1 ─ goto #3 if not b 2 ─ %2 = Main.nothing::Core.Compiler.Const(nothing, false) 3 ┄ %3 = φ (#1 => false)::Bool │ %4 = φ (#2 => %2, #1 => _2)::Union{Nothing, Int64} │ %5 = (%4 isa Main.Int64)::Bool └── goto #5 if not %5 4 ─ %7 = π (%4, Int64) └── goto #6 5 ─ %9 = π (%4, Nothing) 6 ┄ %10 = φ (#4 => %3, #5 => %3)::Bool │ %11 = φ (#4 => %7, #5 => %9)::Union{Nothing, Int64} │ %12 = (%11 isa Main.Nothing)::Bool └── goto #8 if not %12 7 ─ goto #9 8 ─ nothing::Nothing 9 ┄ %16 = φ (#7 => %10, #8 => %10)::Bool └── return %16 ``` (note the missing #2 predecessor in phi node %3), which would result in an undefined value at runtime, though in this case LLVM would have taken advantage of that to just return 0: ``` define i8 @julia_f32579_16051(i64, i8) { top: ; @ REPL[1]:15 within `f32579' ret i8 0 } ``` Compare this now to the optimized IR with this patch: ``` 1 ─ goto #3 if not b 2 ─ %2 = Main.nothing::Core.Compiler.Const(nothing, false) 3 ┄ %3 = φ (#2 => true, #1 => false)::Bool │ %4 = φ (#2 => %2, #1 => _2)::Union{Nothing, Int64} │ %5 = (%4 isa Main.Int64)::Bool └── goto #5 if not %5 4 ─ %7 = π (%4, Int64) └── goto #6 5 ─ %9 = π (%4, Nothing) 6 ┄ %10 = φ (#4 => %3, #5 => %3)::Bool │ %11 = φ (#4 => %7, #5 => %9)::Union{Nothing, Int64} │ %12 = (%11 isa Main.Nothing)::Bool └── goto #8 if not %12 7 ─ goto #9 8 ─ nothing::Nothing 9 ┄ %16 = φ (#7 => %10, #8 => %10)::Bool └── return %16 ``` The %3 phi node has its missing entry and the generated LLVM code correctly returns `b`: ``` define i8 @julia_f32579_16112(i64, i8) { top: %2 = and i8 %1, 1 ; @ REPL[1]:15 within `f32579' ret i8 %2 } ```

The bug here is a bit subtle, but perhaps best illustrated with the included test case: ``` function f32579(x::Int64, b::Bool) if b x = nothing end if isa(x, Int64) y = x else y = x end if isa(y, Nothing) z = y else z = y end return z === nothing end ``` The code just after SSA conversion looks like: ``` 2 1 ─ goto #3 if not _3 3 2 ─ %2 = Main.nothing::Core.Compiler.Const(nothing, false) 5 3 ┄ %3 = φ (#2 => %2, #1 => _2)::Union{Nothing, Int64} │ %4 = (%3 isa Main.Int64)::Bool └── goto #5 if not %4 6 4 ─ %6 = π (%3, Int64) └── goto #6 8 5 ─ %8 = π (%3, Nothing) 10 6 ┄ %9 = φ (#4 => %6, #5 => %8)::Union{Nothing, Int64} │ %10 = (%9 isa Main.Nothing)::Bool └── goto #8 if not %10 11 7 ─ %12 = π (%9, Nothing) └── goto #9 13 8 ─ %14 = π (%9, Int64) 15 9 ┄ %15 = φ (#7 => %12, #8 => %14)::Union{Nothing, Int64} │ %16 = (%15 === Main.nothing)::Bool └── return %16 ``` Now, we have special code in SROA (despite it not really being an SROA transform) that looks at `===` and replaces it by a nest of phis of booleans. The reasoning for this transform is that it eliminates a use of a value where we only care about the type and not the content, thus making it more likely that the value will subsequently be eligible for SROA. In addition, while it goes along resolving which values feed into any particular phi, it accumulates and type conditions it encounters along the way. Thus in the example above, something like the following happens: - We look at %14, which πs to %9 with an Int64 constraint, so we only consider the #4 predecessor for %9 (due to the constraint), until we get to %3, where we again only consider the #1 predecessor, where we find the argument (of type Int64) and conclude the result is always false - Now we pop the next item of the stack from our original phi, look at %12, which πs to %9 with a Nothing constraint. At this point we used to terminate the search because we already looked at %9. However, crucially, we looked at %9 only with an Int64 constraint, so we missed the fact that `nothing` was in fact a possible value for this phi. The result was a missing entry in the generated phi node: ``` 1 ─ goto #3 if not b 2 ─ %2 = Main.nothing::Core.Compiler.Const(nothing, false) 3 ┄ %3 = φ (#1 => false)::Bool │ %4 = φ (#2 => %2, #1 => _2)::Union{Nothing, Int64} │ %5 = (%4 isa Main.Int64)::Bool └── goto #5 if not %5 4 ─ %7 = π (%4, Int64) └── goto #6 5 ─ %9 = π (%4, Nothing) 6 ┄ %10 = φ (#4 => %3, #5 => %3)::Bool │ %11 = φ (#4 => %7, #5 => %9)::Union{Nothing, Int64} │ %12 = (%11 isa Main.Nothing)::Bool └── goto #8 if not %12 7 ─ goto #9 8 ─ nothing::Nothing 9 ┄ %16 = φ (#7 => %10, #8 => %10)::Bool └── return %16 ``` (note the missing #2 predecessor in phi node %3), which would result in an undefined value at runtime, though in this case LLVM would have taken advantage of that to just return 0: ``` define i8 @julia_f32579_16051(i64, i8) { top: ; @ REPL[1]:15 within `f32579' ret i8 0 } ``` Compare this now to the optimized IR with this patch: ``` 1 ─ goto #3 if not b 2 ─ %2 = Main.nothing::Core.Compiler.Const(nothing, false) 3 ┄ %3 = φ (#2 => true, #1 => false)::Bool │ %4 = φ (#2 => %2, #1 => _2)::Union{Nothing, Int64} │ %5 = (%4 isa Main.Int64)::Bool └── goto #5 if not %5 4 ─ %7 = π (%4, Int64) └── goto #6 5 ─ %9 = π (%4, Nothing) 6 ┄ %10 = φ (#4 => %3, #5 => %3)::Bool │ %11 = φ (#4 => %7, #5 => %9)::Union{Nothing, Int64} │ %12 = (%11 isa Main.Nothing)::Bool └── goto #8 if not %12 7 ─ goto #9 8 ─ nothing::Nothing 9 ┄ %16 = φ (#7 => %10, #8 => %10)::Bool └── return %16 ``` The %3 phi node has its missing entry and the generated LLVM code correctly returns `b`: ``` define i8 @julia_f32579_16112(i64, i8) { top: %2 = and i8 %1, 1 ; @ REPL[1]:15 within `f32579' ret i8 %2 } ``` (cherry picked from commit 0a12944)

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - add special SROA handling for `NamedTuple` generated by keyword sorter: With the change on `structdiff`, IR for a call with type-unstable keyword arguments after inlining would look like: ``` %1 = tuple(a, b, c)::Tuple{Any, Any, Any} %2 = NamedTuple{(:a, :b, :c)(%1)::NamedTuple{(:a, :b, :c), _A} where _A<:Tuple{Any, Any, Any} %3 = Core.getfield(%2, :a)::Any %4 = Core.getfield(%2, :b)::Any %5 = Core.getfield(%2, :c)::Any [... other body of the keyword func ...] ``` We can implement a bit hacky special handling within our SROA pass that checks if this definition (%2) is partly well-known `NamedTuple` construction, where its names are fully known, and also checks if its call argument (%1) is fully-known `tuple` call. In a case when the length of the `NamedTuple` names and the length of the arguments for the `tuple` call, we can safely replace those `getfield` calls with the corresponding `tuple` call argument, while letting the later DCE pass to delete the constructions of tuple and named-tuple altogether. With these changes, the IR for the example `NewInstruction` constructor is fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

@benchmark

This commit tries to fix and improve performance for calling keyword funcs whose arguments types are not fully known but `@nospecialize`-d. The final result would look like (this particular example is taken from our Julia-level compiler implementation): ```julia abstract type CallInfo end struct NoCallInfo <: CallInfo end struct NewInstruction stmt::Any type::Any info::CallInfo line::Union{Int32,Nothing} # if nothing, copy the line from previous statement in the insertion location flag::Union{UInt8,Nothing} # if nothing, IR flags will be recomputed on insertion function NewInstruction(@nospecialize(stmt), @nospecialize(type), @nospecialize(info::CallInfo), line::Union{Int32,Nothing}, flag::Union{UInt8,Nothing}) return new(stmt, type, info, line, flag) end end @nospecialize function NewInstruction(newinst::NewInstruction; stmt=newinst.stmt, type=newinst.type, info::CallInfo=newinst.info, line::Union{Int32,Nothing}=newinst.line, flag::Union{UInt8,Nothing}=newinst.flag) return NewInstruction(stmt, type, info, line, flag) end @Specialize using BenchmarkTools struct VirtualKwargs stmt::Any type::Any info::CallInfo end vkws = VirtualKwargs(nothing, Any, NoCallInfo()) newinst = NewInstruction(nothing, Any, NoCallInfo(), nothing, nothing) runner(newinst, vkws) = NewInstruction(newinst; vkws.stmt, vkws.type, vkws.info) @benchmark runner($newinst, $vkws) ``` > on master ``` BenchmarkTools.Trial: 10000 samples with 186 evaluations. Range (min … max): 559.898 ns … 4.173 μs ┊ GC (min … max): 0.00% … 85.29% Time (median): 605.608 ns ┊ GC (median): 0.00% Time (mean ± σ): 638.170 ns ± 125.080 ns ┊ GC (mean ± σ): 0.06% ± 0.85% █▇▂▆▄ ▁█▇▄▂ ▂ ██████▅██████▇▇▇██████▇▇▇▆▆▅▄▅▄▂▄▄▅▇▆▆▆▆▆▅▆▆▄▄▅▅▄▃▄▄▄▅▃▅▅▆▅▆▆ █ 560 ns Histogram: log(frequency) by time 1.23 μs < Memory estimate: 32 bytes, allocs estimate: 2. ``` > on this commit ```julia BenchmarkTools.Trial: 10000 samples with 1000 evaluations. Range (min … max): 3.080 ns … 83.177 ns ┊ GC (min … max): 0.00% … 0.00% Time (median): 3.098 ns ┊ GC (median): 0.00% Time (mean ± σ): 3.118 ns ± 0.885 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇█▆▅▄▂ ▂▄▆▆▇████████▆▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂ ▃ 3.08 ns Histogram: frequency by time 3.19 ns < Memory estimate: 0 bytes, allocs estimate: 0. ``` So for this particular case it achieves roughly 200x speed up. This is because this commit allows inlining of a call to keyword sorter as well as removal of `NamedTuple` call. Especially this commit is composed of the following improvements: - Add early return case for `structdiff`: This change improves the return type inference for a case when compared `NamedTuple`s are type unstable but there is no difference in their names, e.g. given two `NamedTuple{(:a,:b),T} where T<:Tuple{Any,Any}`s. And in such case the optimizer will remove `structdiff` and succeeding `pairs` calls, letting the keyword sorter to be inlined. - Tweak the core `NamedTuple{names}(args::Tuple)` constructor so that it directly forms `:splatnew` allocation rather than redirects to the general `NamedTuple` constructor, that could be confused for abstract input tuple type. - Improve `nfields_tfunc` accuracy as for abstract `NamedTuple` types. This improvement lets `inline_splatnew` to handle more abstract `NamedTuple`s, especially whose names are fully known but its fields tuple type is abstract. Those improvements are combined to allow our SROA pass to optimize away `NamedTuple` and `tuple` calls generated for keyword argument handling. E.g. the IR for the example `NewInstruction` constructor is now fairly optimized, like: ```julia julia> Base.code_ircode((NewInstruction,Any,Any,CallInfo)) do newinst, stmt, type, info NewInstruction(newinst; stmt, type, info) end |> only 2 1 ── %1 = Base.getfield(_2, :line)::Union{Nothing, Int32} │╻╷ Type##kw │ %2 = Base.getfield(_2, :flag)::Union{Nothing, UInt8} ││┃ getproperty │ %3 = (isa)(%1, Nothing)::Bool ││ │ %4 = (isa)(%2, Nothing)::Bool ││ │ %5 = (Core.Intrinsics.and_int)(%3, %4)::Bool ││ └─── goto #3 if not %5 ││ 2 ── %7 = %new(Main.NewInstruction, _3, _4, _5, nothing, nothing)::NewInstruction NewInstruction └─── goto #10 ││ 3 ── %9 = (isa)(%1, Int32)::Bool ││ │ %10 = (isa)(%2, Nothing)::Bool ││ │ %11 = (Core.Intrinsics.and_int)(%9, %10)::Bool ││ └─── goto #5 if not %11 ││ 4 ── %13 = π (%1, Int32) ││ │ %14 = %new(Main.NewInstruction, _3, _4, _5, %13, nothing)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 5 ── %16 = (isa)(%1, Nothing)::Bool ││ │ %17 = (isa)(%2, UInt8)::Bool ││ │ %18 = (Core.Intrinsics.and_int)(%16, %17)::Bool ││ └─── goto #7 if not %18 ││ 6 ── %20 = π (%2, UInt8) ││ │ %21 = %new(Main.NewInstruction, _3, _4, _5, nothing, %20)::NewInstruction│││╻ NewInstruction └─── goto #10 ││ 7 ── %23 = (isa)(%1, Int32)::Bool ││ │ %24 = (isa)(%2, UInt8)::Bool ││ │ %25 = (Core.Intrinsics.and_int)(%23, %24)::Bool ││ └─── goto #9 if not %25 ││ 8 ── %27 = π (%1, Int32) ││ │ %28 = π (%2, UInt8) ││ │ %29 = %new(Main.NewInstruction, _3, _4, _5, %27, %28)::NewInstruction │││╻ NewInstruction └─── goto #10 ││ 9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └─── unreachable ││ 10 ┄ %33 = φ (#2 => %7, #4 => %14, #6 => %21, #8 => %29)::NewInstruction ││ └─── goto #11 ││ 11 ─ return %33 │ => NewInstruction ```

…d and inlined) (#43322) A follow up attemp to fix #27988. (close #47493 close #50554) Examples: ```julia julia> using LazyArrays julia> bc = @~ @. 1*(1 + 1) + 1*1; julia> bc2 = @~ 1 .* 1 .- 1 .* 1 .^2 .+ 1 .* 1 .+ 1 .^ 3; ``` On master: <details><summary> click for details </summary> <p> ```julia julia> @code_typed Broadcast.flatten(bc).f(1,1,1,1,1) CodeInfo( 1 ─ %1 = Core.getfield(args, 1)::Int64 │ %2 = Core.getfield(args, 2)::Int64 │ %3 = Core.getfield(args, 3)::Int64 │ %4 = Core.getfield(args, 4)::Int64 │ %5 = Core.getfield(args, 5)::Int64 │ %6 = invoke Base.Broadcast.var"#13#14"{Base.Broadcast.var"#16#18"{Base.Broadcast.var"#15#17", Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}, typeof(+)}}(Base.Broadcast.var"#16#18"{Base.Broadcast.var"#15#17", Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}, typeof(+)}(Base.Broadcast.var"#15#17"(), Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}(Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}(Base.Broadcast.var"#15#17"())), Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"())), Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"())), +))(%1::Int64, %2::Int64, %3::Vararg{Int64}, %4, %5)::Tuple{Int64, Int64, Vararg{Int64}} │ %7 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"())), %6)::Tuple{Int64, Int64} │ %8 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"())), %6)::Tuple{Vararg{Int64}} │ %9 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#16#18"{Base.Broadcast.var"#9#11", Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}, typeof(*)}(Base.Broadcast.var"#9#11"(), Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}(Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}(Base.Broadcast.var"#15#17"())), Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"())), Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"())), *), %8)::Tuple{Int64} │ %10 = Core.getfield(%7, 1)::Int64 │ %11 = Core.getfield(%7, 2)::Int64 │ %12 = Base.mul_int(%10, %11)::Int64 │ %13 = Core.getfield(%9, 1)::Int64 │ %14 = Base.add_int(%12, %13)::Int64 └── return %14 ) => Int64 julia> @code_typed Broadcast.flatten(bc2).f(1,1,1,^,1,Val(2),1,1,^,1,Val(3)) CodeInfo( 1 ─ %1 = Core.getfield(args, 1)::Int64 │ %2 = Core.getfield(args, 2)::Int64 │ %3 = Core.getfield(args, 3)::Int64 │ %4 = Core.getfield(args, 5)::Int64 │ %5 = Core.getfield(args, 7)::Int64 │ %6 = Core.getfield(args, 8)::Int64 │ %7 = Core.getfield(args, 10)::Int64 │ %8 = invoke Base.Broadcast.var"#13#14"{Base.Broadcast.var"#16#18"{Base.Broadcast.var"#15#17", Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}}, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}}, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}}, typeof(Base.literal_pow)}}(Base.Broadcast.var"#16#18"{Base.Broadcast.var"#15#17", Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}}, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}}, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}}, typeof(Base.literal_pow)}(Base.Broadcast.var"#15#17"(), Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}}(Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}(Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}(Base.Broadcast.var"#15#17"()))), Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"()))), Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"()))), Base.literal_pow))(%3::Int64, ^::Function, %4::Vararg{Any}, $(QuoteNode(Val{2}())), %5, %6, ^, %7, $(QuoteNode(Val{3}())))::Tuple{Int64, Any, Vararg{Any}} │ %9 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"())), %8)::Tuple{Int64, Any} │ %10 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"())), %8)::Tuple │ %11 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#15#17"(), %10)::Tuple │ %12 = Core.getfield(%9, 1)::Int64 │ %13 = Core.getfield(%9, 2)::Any │ %14 = (*)(%12, %13)::Any │ %15 = Core.tuple(%14)::Tuple{Any} │ %16 = Core._apply_iterate(Base.iterate, Core.tuple, %15, %11)::Tuple{Any, Vararg{Any}} │ %17 = Base.mul_int(%1, %2)::Int64 │ %18 = Core.tuple(%17)::Tuple{Int64} │ %19 = Core._apply_iterate(Base.iterate, Core.tuple, %18, %16)::Tuple{Int64, Any, Vararg{Any}} │ %20 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"())), %19)::Tuple{Int64, Any} │ %21 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"())), %19)::Tuple │ %22 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#16#18"{Base.Broadcast.var"#15#17", Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}, typeof(*)}(Base.Broadcast.var"#15#17"(), Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}(Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}(Base.Broadcast.var"#15#17"())), Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"())), Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"())), *), %21)::Tuple{Any, Vararg{Any}} │ %23 = Core.getfield(%20, 1)::Int64 │ %24 = Core.getfield(%20, 2)::Any │ %25 = (-)(%23, %24)::Any │ %26 = Core.tuple(%25)::Tuple{Any} │ %27 = Core._apply_iterate(Base.iterate, Core.tuple, %26, %22)::Tuple{Any, Any, Vararg{Any}} │ %28 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"())), %27)::Tuple{Any, Any} │ %29 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"())), %27)::Tuple │ %30 = Core._apply_iterate(Base.iterate, Base.Broadcast.var"#16#18"{Base.Broadcast.var"#9#11", Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}}, Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}}, Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}}, typeof(Base.literal_pow)}(Base.Broadcast.var"#9#11"(), Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}}(Base.Broadcast.var"#13#14"{Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}}(Base.Broadcast.var"#13#14"{Base.Broadcast.var"#15#17"}(Base.Broadcast.var"#15#17"()))), Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}}(Base.Broadcast.var"#23#24"{Base.Broadcast.var"#25#26"}(Base.Broadcast.var"#25#26"()))), Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}}(Base.Broadcast.var"#19#20"{Base.Broadcast.var"#21#22"}(Base.Broadcast.var"#21#22"()))), Base.literal_pow), %29)::Tuple{Any} │ %31 = Core.getfield(%28, 1)::Any │ %32 = Core.getfield(%28, 2)::Any │ %33 = (+)(%31, %32)::Any │ %34 = Core.getfield(%30, 1)::Any │ %35 = (+)(%33, %34)::Any └── return %35 ) => Any ``` </p> </details> On this PR ```julia julia> @code_typed Broadcast.flatten(bc).f(1,1,1,1,1) CodeInfo( 1 ─ %1 = Core.getfield(args, 1)::Int64 │ %2 = Core.getfield(args, 2)::Int64 │ %3 = Core.getfield(args, 3)::Int64 │ %4 = Core.getfield(args, 4)::Int64 │ %5 = Core.getfield(args, 5)::Int64 │ %6 = Base.add_int(%2, %3)::Int64 │ %7 = Base.mul_int(%1, %6)::Int64 │ %8 = Base.mul_int(%4, %5)::Int64 │ %9 = Base.add_int(%7, %8)::Int64 └── return %9 ) => Int64 julia> @code_typed Broadcast.flatten(bc2).f(1,1,1,^,1,Val(2),1,1,^,1,Val(3)) CodeInfo( 1 ─ %1 = Core.getfield(args, 1)::Int64 │ %2 = Core.getfield(args, 2)::Int64 │ %3 = Core.getfield(args, 3)::Int64 │ %4 = Core.getfield(args, 5)::Int64 │ %5 = Core.getfield(args, 7)::Int64 │ %6 = Core.getfield(args, 8)::Int64 │ %7 = Core.getfield(args, 10)::Int64 │ %8 = Base.mul_int(%1, %2)::Int64 │ %9 = Base.mul_int(%4, %4)::Int64 │ %10 = Base.mul_int(%3, %9)::Int64 │ %11 = Base.sub_int(%8, %10)::Int64 │ %12 = Base.mul_int(%5, %6)::Int64 │ %13 = Base.add_int(%11, %12)::Int64 │ %14 = Base.mul_int(%7, %7)::Int64 │ %15 = Base.mul_int(%14, %7)::Int64 │ %16 = Base.add_int(%13, %15)::Int64 └── return %16 ) => Int64 ```

Handle :splatnew expressions on nightly

@something

`@something` eagerly unwraps any `Some` given to it, while keeping the variable between its arguments the same. This can be an issue if a previously unpacked value is used as input to `@something`, leading to a type instability on more than two arguments (e.g. because of a fallback to `Some(nothing)`). By using different variables for each argument, type inference has an easier time handling these cases that are isolated to single branches anyway. This also adds some comments to the macro, since it's non-obvious what it does. Benchmarking the specific case I encountered this in led to a ~2x performance improvement on multiple machines. 1.10-beta3/master: ``` [sukera@tower 01]$ jl1100 -q --project=. -L 01.jl -e 'bench()' v"1.10.0-beta3" BenchmarkTools.Trial: 10000 samples with 1 evaluation. Range (min … max): 38.670 μs … 70.350 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 43.340 μs ┊ GC (median): 0.00% Time (mean ± σ): 43.395 μs ± 1.518 μs ┊ GC (mean ± σ): 0.00% ± 0.00% ▆█▂ ▁▁ ▂▂▂▂▂▂▂▂▂▁▂▂▂▃▃▃▂▂▃▃▃▂▂▂▂▂▄▇███▆██▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃ 38.7 μs Histogram: frequency by time 48 μs < Memory estimate: 0 bytes, allocs estimate: 0. ``` This PR: ``` [sukera@tower 01]$ julia -q --project=. -L 01.jl -e 'bench()' v"1.11.0-DEV.970" BenchmarkTools.Trial: 10000 samples with 1 evaluation. Range (min … max): 22.820 μs … 44.980 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 24.300 μs ┊ GC (median): 0.00% Time (mean ± σ): 24.370 μs ± 832.239 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇██▇▆▅▁ ▂▂▂▂▂▂▂▂▃▃▄▅▇███████████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂ ▃ 22.8 μs Histogram: frequency by time 27.7 μs < Memory estimate: 0 bytes, allocs estimate: 0. ``` <details> <summary>Benchmarking code (spoilers for Advent Of Code 2023 Day 01, Part 01). Running this requires the input of that Advent Of Code day.</summary> ```julia using BenchmarkTools using InteractiveUtils isdigit(d::UInt8) = UInt8('0') <= d <= UInt8('9') someDigit(c::UInt8) = isdigit(c) ? Some(c - UInt8('0')) : nothing function part1(data) total = 0 may_a = nothing may_b = nothing for c in data digitRes = someDigit(c) may_a = @something may_a digitRes Some(nothing) may_b = @something digitRes may_b Some(nothing) if c == UInt8('\n') digit_a = may_a::UInt8 digit_b = may_b::UInt8 total += digit_a*0xa + digit_b may_a = nothing may_b = nothing end end return total end function bench() data = read("input.txt") display(VERSION) println() display(@benchmark part1($data)) nothing end ``` </details> <details> <summary>`@code_warntype` before</summary> ```julia julia> @code_warntype part1(data) MethodInstance for part1(::Vector{UInt8}) from part1(data) @ Main ~/Documents/projects/AOC/2023/01/01.jl:7 Arguments #self#::Core.Const(part1) data::Vector{UInt8} Locals @_3::Union{Nothing, Tuple{UInt8, Int64}} may_b::Union{Nothing, UInt8} may_a::Union{Nothing, UInt8} total::Int64 c::UInt8 digit_b::UInt8 digit_a::UInt8 val@_10::Any val@_11::Any digitRes::Union{Nothing, Some{UInt8}} @_13::Union{Some{Nothing}, Some{UInt8}, UInt8} @_14::Union{Some{Nothing}, Some{UInt8}} @_15::Some{Nothing} @_16::Union{Some{Nothing}, Some{UInt8}, UInt8} @_17::Union{Some{Nothing}, UInt8} @_18::Some{Nothing} Body::Int64 1 ── (total = 0) │ (may_a = Main.nothing) │ (may_b = Main.nothing) │ %4 = data::Vector{UInt8} │ (@_3 = Base.iterate(%4)) │ %6 = (@_3 === nothing)::Bool │ %7 = Base.not_int(%6)::Bool └─── goto #24 if not %7 2 ┄─ Core.NewvarNode(:(digit_b)) │ Core.NewvarNode(:(digit_a)) │ Core.NewvarNode(:(val@_10)) │ %12 = @_3::Tuple{UInt8, Int64} │ (c = Core.getfield(%12, 1)) │ %14 = Core.getfield(%12, 2)::Int64 │ (digitRes = Main.someDigit(c)) │ (val@_11 = may_a) │ %17 = (val@_11::Union{Nothing, UInt8} !== Base.nothing)::Bool └─── goto #4 if not %17 3 ── (@_13 = val@_11::UInt8) └─── goto #11 4 ── (val@_11 = digitRes) │ %22 = (val@_11::Union{Nothing, Some{UInt8}} !== Base.nothing)::Bool └─── goto #6 if not %22 5 ── (@_14 = val@_11::Some{UInt8}) └─── goto #10 6 ── (val@_11 = Main.Some(Main.nothing)) │ %27 = (val@_11::Core.Const(Some(nothing)) !== Base.nothing)::Core.Const(true) └─── goto #8 if not %27 7 ── (@_15 = val@_11::Core.Const(Some(nothing))) └─── goto #9 8 ── Core.Const(:(@_15 = Base.nothing)) 9 ┄─ (@_14 = @_15) 10 ┄ (@_13 = @_14) 11 ┄ %34 = @_13::Union{Some{Nothing}, Some{UInt8}, UInt8} │ (may_a = Base.something(%34)) │ (val@_10 = digitRes) │ %37 = (val@_10::Union{Nothing, Some{UInt8}} !== Base.nothing)::Bool └─── goto #13 if not %37 12 ─ (@_16 = val@_10::Some{UInt8}) └─── goto #20 13 ─ (val@_10 = may_b) │ %42 = (val@_10::Union{Nothing, UInt8} !== Base.nothing)::Bool └─── goto #15 if not %42 14 ─ (@_17 = val@_10::UInt8) └─── goto #19 15 ─ (val@_10 = Main.Some(Main.nothing)) │ %47 = (val@_10::Core.Const(Some(nothing)) !== Base.nothing)::Core.Const(true) └─── goto #17 if not %47 16 ─ (@_18 = val@_10::Core.Const(Some(nothing))) └─── goto #18 17 ─ Core.Const(:(@_18 = Base.nothing)) 18 ┄ (@_17 = @_18) 19 ┄ (@_16 = @_17) 20 ┄ %54 = @_16::Union{Some{Nothing}, Some{UInt8}, UInt8} │ (may_b = Base.something(%54)) │ %56 = c::UInt8 │ %57 = Main.UInt8('\n')::Core.Const(0x0a) │ %58 = (%56 == %57)::Bool └─── goto #22 if not %58 21 ─ (digit_a = Core.typeassert(may_a, Main.UInt8)) │ (digit_b = Core.typeassert(may_b, Main.UInt8)) │ %62 = total::Int64 │ %63 = (digit_a * 0x0a)::UInt8 │ %64 = (%63 + digit_b)::UInt8 │ (total = %62 + %64) │ (may_a = Main.nothing) └─── (may_b = Main.nothing) 22 ┄ (@_3 = Base.iterate(%4, %14)) │ %69 = (@_3 === nothing)::Bool │ %70 = Base.not_int(%69)::Bool └─── goto #24 if not %70 23 ─ goto #2 24 ┄ return total ``` </details> <details> <summary>`@code_native debuginfo=:none` Before </summary> ```julia julia> @code_native debuginfo=:none part1(data) .text .file "part1" .globl julia_part1_418 # -- Begin function julia_part1_418 .p2align 4, 0x90 .type julia_part1_418,@function julia_part1_418: # @julia_part1_418 # %bb.0: # %top push rbp mov rbp, rsp push r15 push r14 push r13 push r12 push rbx sub rsp, 40 mov rax, qword ptr [rdi + 8] test rax, rax je .LBB0_1 # %bb.2: # %L17 mov rcx, qword ptr [rdi] dec rax mov r10b, 1 xor r14d, r14d # implicit-def: $r12b # implicit-def: $r13b # implicit-def: $r9b # implicit-def: $sil mov qword ptr [rbp - 64], rax # 8-byte Spill mov al, 1 mov dword ptr [rbp - 48], eax # 4-byte Spill # implicit-def: $al # kill: killed $al xor eax, eax mov qword ptr [rbp - 56], rax # 8-byte Spill mov qword ptr [rbp - 72], rcx # 8-byte Spill # implicit-def: $cl jmp .LBB0_3 .p2align 4, 0x90 .LBB0_8: # in Loop: Header=BB0_3 Depth=1 mov dword ptr [rbp - 48], 0 # 4-byte Folded Spill .LBB0_24: # %post_union_move # in Loop: Header=BB0_3 Depth=1 movzx r13d, byte ptr [rbp - 41] # 1-byte Folded Reload mov r12d, r8d cmp qword ptr [rbp - 64], r14 # 8-byte Folded Reload je .LBB0_13 .LBB0_25: # %guard_exit113 # in Loop: Header=BB0_3 Depth=1 inc r14 mov r10d, ebx .LBB0_3: # %L19 # =>This Inner Loop Header: Depth=1 mov rax, qword ptr [rbp - 72] # 8-byte Reload xor ebx, ebx xor edi, edi movzx r15d, r9b movzx ecx, cl movzx esi, sil mov r11b, 1 # implicit-def: $r9b movzx edx, byte ptr [rax + r14] lea eax, [rdx - 58] lea r8d, [rdx - 48] cmp al, -10 setae bl setb dil test r10b, 1 cmovne r15d, edi mov edi, 0 cmovne ecx, ebx mov bl, 1 cmovne esi, edi test r15b, 1 jne .LBB0_7 # %bb.4: # %L76 # in Loop: Header=BB0_3 Depth=1 mov r11b, 2 test cl, 1 jne .LBB0_5 # %bb.6: # %L78 # in Loop: Header=BB0_3 Depth=1 mov ebx, r10d mov r9d, r15d mov byte ptr [rbp - 41], r13b # 1-byte Spill test sil, 1 je .LBB0_26 .LBB0_7: # %L82 # in Loop: Header=BB0_3 Depth=1 cmp al, -11 jbe .LBB0_9 jmp .LBB0_8 .p2align 4, 0x90 .LBB0_5: # in Loop: Header=BB0_3 Depth=1 mov ecx, r8d mov sil, 1 xor ebx, ebx mov byte ptr [rbp - 41], r8b # 1-byte Spill xor r9d, r9d xor ecx, ecx cmp al, -11 ja .LBB0_8 .LBB0_9: # %L90 # in Loop: Header=BB0_3 Depth=1 test byte ptr [rbp - 48], 1 # 1-byte Folded Reload jne .LBB0_23 # %bb.10: # %L115 # in Loop: Header=BB0_3 Depth=1 cmp dl, 10 jne .LBB0_11 # %bb.14: # %L122 # in Loop: Header=BB0_3 Depth=1 test r15b, 1 jne .LBB0_15 # %bb.12: # %L130.thread # in Loop: Header=BB0_3 Depth=1 movzx eax, byte ptr [rbp - 41] # 1-byte Folded Reload mov bl, 1 add eax, eax lea eax, [rax + 4*rax] add al, r12b movzx eax, al add qword ptr [rbp - 56], rax # 8-byte Folded Spill mov al, 1 mov dword ptr [rbp - 48], eax # 4-byte Spill cmp qword ptr [rbp - 64], r14 # 8-byte Folded Reload jne .LBB0_25 jmp .LBB0_13 .p2align 4, 0x90 .LBB0_23: # %L115.thread # in Loop: Header=BB0_3 Depth=1 mov al, 1 # implicit-def: $r8b mov dword ptr [rbp - 48], eax # 4-byte Spill cmp dl, 10 jne .LBB0_24 jmp .LBB0_21 .LBB0_11: # in Loop: Header=BB0_3 Depth=1 mov r8d, r12d jmp .LBB0_24 .LBB0_1: xor eax, eax mov qword ptr [rbp - 56], rax # 8-byte Spill .LBB0_13: # %L159 mov rax, qword ptr [rbp - 56] # 8-byte Reload add rsp, 40 pop rbx pop r12 pop r13 pop r14 pop r15 pop rbp ret .LBB0_21: # %L122.thread test r15b, 1 jne .LBB0_15 # %bb.22: # %post_box_union58 movabs rdi, offset .L_j_str1 movabs rax, offset ijl_type_error movabs rsi, 140008511215408 movabs rdx, 140008667209736 call rax .LBB0_15: # %fail cmp r11b, 1 je .LBB0_19 # %bb.16: # %fail movzx eax, r11b cmp eax, 2 jne .LBB0_17 # %bb.20: # %box_union54 movzx eax, byte ptr [rbp - 41] # 1-byte Folded Reload movabs rcx, offset jl_boxed_uint8_cache mov rdx, qword ptr [rcx + 8*rax] jmp .LBB0_18 .LBB0_26: # %L80 movabs rax, offset ijl_throw movabs rdi, 140008495049392 call rax .LBB0_19: # %box_union movabs rdx, 140008667209736 jmp .LBB0_18 .LBB0_17: xor edx, edx .LBB0_18: # %post_box_union movabs rdi, offset .L_j_str1 movabs rax, offset ijl_type_error movabs rsi, 140008511215408 call rax .Lfunc_end0: .size julia_part1_418, .Lfunc_end0-julia_part1_418 # -- End function .type .L_j_str1,@object # @_j_str1 .section .rodata.str1.1,"aMS",@progbits,1 .L_j_str1: .asciz "typeassert" .size .L_j_str1, 11 .section ".note.GNU-stack","",@progbits ``` </details> <details> <summary>`@code_warntype` After</summary> ```julia [sukera@tower 01]$ julia -q --project=. -L 01.jl julia> data = read("input.txt"); julia> @code_warntype part1(data) MethodInstance for part1(::Vector{UInt8}) from part1(data) @ Main ~/Documents/projects/AOC/2023/01/01.jl:7 Arguments #self#::Core.Const(part1) data::Vector{UInt8} Locals @_3::Union{Nothing, Tuple{UInt8, Int64}} may_b::Union{Nothing, UInt8} may_a::Union{Nothing, UInt8} total::Int64 val@_7::Union{} val@_8::Union{} c::UInt8 digit_b::UInt8 digit_a::UInt8 ##215::Some{Nothing} ##216::Union{Nothing, UInt8} ##217::Union{Nothing, Some{UInt8}} ##212::Some{Nothing} ##213::Union{Nothing, Some{UInt8}} ##214::Union{Nothing, UInt8} digitRes::Union{Nothing, Some{UInt8}} @_19::Union{Nothing, UInt8} @_20::Union{Nothing, UInt8} @_21::Nothing @_22::Union{Nothing, UInt8} @_23::Union{Nothing, UInt8} @_24::Nothing Body::Int64 1 ── (total = 0) │ (may_a = Main.nothing) │ (may_b = Main.nothing) │ %4 = data::Vector{UInt8} │ (@_3 = Base.iterate(%4)) │ %6 = @_3::Union{Nothing, Tuple{UInt8, Int64}} │ %7 = (%6 === nothing)::Bool │ %8 = Base.not_int(%7)::Bool └─── goto #24 if not %8 2 ┄─ Core.NewvarNode(:(val@_7)) │ Core.NewvarNode(:(val@_8)) │ Core.NewvarNode(:(digit_b)) │ Core.NewvarNode(:(digit_a)) │ Core.NewvarNode(:(##215)) │ Core.NewvarNode(:(##216)) │ Core.NewvarNode(:(##217)) │ Core.NewvarNode(:(##212)) │ Core.NewvarNode(:(##213)) │ %19 = @_3::Tuple{UInt8, Int64} │ (c = Core.getfield(%19, 1)) │ %21 = Core.getfield(%19, 2)::Int64 │ %22 = c::UInt8 │ (digitRes = Main.someDigit(%22)) │ %24 = may_a::Union{Nothing, UInt8} │ (##214 = %24) │ %26 = Base.:!::Core.Const(!) │ %27 = ##214::Union{Nothing, UInt8} │ %28 = Base.isnothing(%27)::Bool │ %29 = (%26)(%28)::Bool └─── goto #4 if not %29 3 ── %31 = ##214::UInt8 │ (@_19 = Base.something(%31)) └─── goto #11 4 ── %34 = digitRes::Union{Nothing, Some{UInt8}} │ (##213 = %34) │ %36 = Base.:!::Core.Const(!) │ %37 = ##213::Union{Nothing, Some{UInt8}} │ %38 = Base.isnothing(%37)::Bool │ %39 = (%36)(%38)::Bool └─── goto #6 if not %39 5 ── %41 = ##213::Some{UInt8} │ (@_20 = Base.something(%41)) └─── goto #10 6 ── %44 = Main.Some::Core.Const(Some) │ %45 = Main.nothing::Core.Const(nothing) │ (##212 = (%44)(%45)) │ %47 = Base.:!::Core.Const(!) │ %48 = ##212::Core.Const(Some(nothing)) │ %49 = Base.isnothing(%48)::Core.Const(false) │ %50 = (%47)(%49)::Core.Const(true) └─── goto #8 if not %50 7 ── %52 = ##212::Core.Const(Some(nothing)) │ (@_21 = Base.something(%52)) └─── goto #9 8 ── Core.Const(nothing) │ Core.Const(:(val@_8 = Base.something(Base.nothing))) │ Core.Const(nothing) │ Core.Const(:(val@_8)) └─── Core.Const(:(@_21 = %58)) 9 ┄─ %60 = @_21::Core.Const(nothing) └─── (@_20 = %60) 10 ┄ %62 = @_20::Union{Nothing, UInt8} └─── (@_19 = %62) 11 ┄ %64 = @_19::Union{Nothing, UInt8} │ (may_a = %64) │ %66 = digitRes::Union{Nothing, Some{UInt8}} │ (##217 = %66) │ %68 = Base.:!::Core.Const(!) │ %69 = ##217::Union{Nothing, Some{UInt8}} │ %70 = Base.isnothing(%69)::Bool │ %71 = (%68)(%70)::Bool └─── goto #13 if not %71 12 ─ %73 = ##217::Some{UInt8} │ (@_22 = Base.something(%73)) └─── goto #20 13 ─ %76 = may_b::Union{Nothing, UInt8} │ (##216 = %76) │ %78 = Base.:!::Core.Const(!) │ %79 = ##216::Union{Nothing, UInt8} │ %80 = Base.isnothing(%79)::Bool │ %81 = (%78)(%80)::Bool └─── goto #15 if not %81 14 ─ %83 = ##216::UInt8 │ (@_23 = Base.something(%83)) └─── goto #19 15 ─ %86 = Main.Some::Core.Const(Some) │ %87 = Main.nothing::Core.Const(nothing) │ (##215 = (%86)(%87)) │ %89 = Base.:!::Core.Const(!) │ %90 = ##215::Core.Const(Some(nothing)) │ %91 = Base.isnothing(%90)::Core.Const(false) │ %92 = (%89)(%91)::Core.Const(true) └─── goto #17 if not %92 16 ─ %94 = ##215::Core.Const(Some(nothing)) │ (@_24 = Base.something(%94)) └─── goto #18 17 ─ Core.Const(nothing) │ Core.Const(:(val@_7 = Base.something(Base.nothing))) │ Core.Const(nothing) │ Core.Const(:(val@_7)) └─── Core.Const(:(@_24 = %100)) 18 ┄ %102 = @_24::Core.Const(nothing) └─── (@_23 = %102) 19 ┄ %104 = @_23::Union{Nothing, UInt8} └─── (@_22 = %104) 20 ┄ %106 = @_22::Union{Nothing, UInt8} │ (may_b = %106) │ %108 = Main.:(==)::Core.Const(==) │ %109 = c::UInt8 │ %110 = Main.UInt8('\n')::Core.Const(0x0a) │ %111 = (%108)(%109, %110)::Bool └─── goto #22 if not %111 21 ─ %113 = may_a::Union{Nothing, UInt8} │ (digit_a = Core.typeassert(%113, Main.UInt8)) │ %115 = may_b::Union{Nothing, UInt8} │ (digit_b = Core.typeassert(%115, Main.UInt8)) │ %117 = Main.:+::Core.Const(+) │ %118 = total::Int64 │ %119 = Main.:+::Core.Const(+) │ %120 = Main.:*::Core.Const(*) │ %121 = digit_a::UInt8 │ %122 = (%120)(%121, 0x0a)::UInt8 │ %123 = digit_b::UInt8 │ %124 = (%119)(%122, %123)::UInt8 │ (total = (%117)(%118, %124)) │ (may_a = Main.nothing) └─── (may_b = Main.nothing) 22 ┄ (@_3 = Base.iterate(%4, %21)) │ %129 = @_3::Union{Nothing, Tuple{UInt8, Int64}} │ %130 = (%129 === nothing)::Bool │ %131 = Base.not_int(%130)::Bool └─── goto #24 if not %131 23 ─ goto #2 24 ┄ %134 = total::Int64 └─── return %134 ``` </details> <details> <summary>`@code_native debuginfo=:none` After </summary> ```julia julia> @code_native debuginfo=:none part1(data) .text .file "part1" .globl julia_part1_1203 # -- Begin function julia_part1_1203 .p2align 4, 0x90 .type julia_part1_1203,@function julia_part1_1203: # @julia_part1_1203 ; Function Signature: part1(Array{UInt8, 1}) # %bb.0: # %top #DEBUG_VALUE: part1:data <- [DW_OP_deref] $rdi push rbp mov rbp, rsp push r15 push r14 push r13 push r12 push rbx sub rsp, 40 vxorps xmm0, xmm0, xmm0 #APP mov rax, qword ptr fs:[0] #NO_APP lea rdx, [rbp - 64] vmovaps xmmword ptr [rbp - 64], xmm0 mov qword ptr [rbp - 48], 0 mov rcx, qword ptr [rax - 8] mov qword ptr [rbp - 64], 4 mov rax, qword ptr [rcx] mov qword ptr [rbp - 72], rcx # 8-byte Spill mov qword ptr [rbp - 56], rax mov qword ptr [rcx], rdx #DEBUG_VALUE: part1:data <- [DW_OP_deref] 0 mov r15, qword ptr [rdi + 16] test r15, r15 je .LBB0_1 # %bb.2: # %L34 mov r14, qword ptr [rdi] dec r15 mov r11b, 1 mov r13b, 1 # implicit-def: $r12b # implicit-def: $r10b xor eax, eax jmp .LBB0_3 .p2align 4, 0x90 .LBB0_4: # in Loop: Header=BB0_3 Depth=1 xor r11d, r11d mov ebx, edi mov r10d, r8d .LBB0_9: # %L114 # in Loop: Header=BB0_3 Depth=1 mov r12d, esi test r15, r15 je .LBB0_12 .LBB0_10: # %guard_exit126 # in Loop: Header=BB0_3 Depth=1 inc r14 dec r15 mov r13d, ebx .LBB0_3: # %L36 # =>This Inner Loop Header: Depth=1 movzx edx, byte ptr [r14] test r13b, 1 movzx edi, r13b mov ebx, 1 mov ecx, 0 cmove ebx, edi cmovne edi, ecx movzx ecx, r10b lea esi, [rdx - 48] lea r9d, [rdx - 58] movzx r8d, sil cmove r8d, ecx cmp r9b, -11 ja .LBB0_4 # %bb.5: # %L89 # in Loop: Header=BB0_3 Depth=1 test r11b, 1 jne .LBB0_8 # %bb.6: # %L102 # in Loop: Header=BB0_3 Depth=1 cmp dl, 10 jne .LBB0_7 # %bb.13: # %L106 # in Loop: Header=BB0_3 Depth=1 test r13b, 1 jne .LBB0_14 # %bb.11: # %L114.thread # in Loop: Header=BB0_3 Depth=1 add ecx, ecx mov bl, 1 mov r11b, 1 lea ecx, [rcx + 4*rcx] add cl, r12b movzx ecx, cl add rax, rcx test r15, r15 jne .LBB0_10 jmp .LBB0_12 .p2align 4, 0x90 .LBB0_8: # %L102.thread # in Loop: Header=BB0_3 Depth=1 mov r11b, 1 # implicit-def: $sil cmp dl, 10 jne .LBB0_9 jmp .LBB0_15 .LBB0_7: # in Loop: Header=BB0_3 Depth=1 mov esi, r12d jmp .LBB0_9 .LBB0_1: xor eax, eax .LBB0_12: # %L154 mov rcx, qword ptr [rbp - 56] mov rdx, qword ptr [rbp - 72] # 8-byte Reload mov qword ptr [rdx], rcx add rsp, 40 pop rbx pop r12 pop r13 pop r14 pop r15 pop rbp ret .LBB0_15: # %L106.thread test r13b, 1 jne .LBB0_14 # %bb.16: # %post_box_union47 movabs rax, offset jl_nothing movabs rcx, offset jl_small_typeof movabs rdi, offset ".L_j_str_typeassert#1" mov rdx, qword ptr [rax] mov rsi, qword ptr [rcx + 336] movabs rax, offset ijl_type_error mov qword ptr [rbp - 48], rsi call rax .LBB0_14: # %post_box_union movabs rax, offset jl_nothing movabs rcx, offset jl_small_typeof movabs rdi, offset ".L_j_str_typeassert#1" mov rdx, qword ptr [rax] mov rsi, qword ptr [rcx + 336] movabs rax, offset ijl_type_error mov qword ptr [rbp - 48], rsi call rax .Lfunc_end0: .size julia_part1_1203, .Lfunc_end0-julia_part1_1203 # -- End function .type ".L_j_str_typeassert#1",@object # @"_j_str_typeassert#1" .section .rodata.str1.1,"aMS",@progbits,1 ".L_j_str_typeassert#1": .asciz "typeassert" .size ".L_j_str_typeassert#1", 11 .section ".note.GNU-stack","",@progbits ``` </details> Co-authored-by: Sukera <[email protected]>

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@inbounds

`memoryref(mem, i)` will otherwise emit a boundscheck. ``` ; │ @ /home/vchuravy/WorkstealingQueues/src/CLL.jl:53 within `setindex_atomic!` @ genericmemory.jl:329 ; │┌ @ boot.jl:545 within `memoryref` %ptls_field = getelementptr inbounds i8, ptr %tls_pgcstack, i64 16 %ptls_load = load ptr, ptr %ptls_field, align 8 %"box::GenericMemoryRef" = call noalias nonnull align 8 dereferenceable(32) ptr @ijl_gc_small_alloc(ptr %ptls_load, i32 552, i32 32, i64 23456076646928) #9 %"box::GenericMemoryRef.tag_addr" = getelementptr inbounds i64, ptr %"box::GenericMemoryRef", i64 -1 store atomic i64 23456076646928, ptr %"box::GenericMemoryRef.tag_addr" unordered, align 8 store ptr %memoryref_data, ptr %"box::GenericMemoryRef", align 8 %.repack8 = getelementptr inbounds { ptr, ptr }, ptr %"box::GenericMemoryRef", i64 0, i32 1 store ptr %memoryref_mem, ptr %.repack8, align 8 call void @ijl_bounds_error_int(ptr nonnull %"box::GenericMemoryRef", i64 %7) unreachable ``` For the Julia code: ```julia function Base.setindex_atomic!(buf::WSBuffer{T}, order::Symbol, val::T, idx::Int64) where T @inbounds Base.setindex_atomic!(buf.buffer, order, val,((idx - 1) & buf.mask) + 1) end ``` from https://github.com/gbaraldi/WorkstealingQueues.jl/blob/0ebc57237cf0c90feedf99e4338577d04b67805b/src/CLL.jl#L41

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible.

@noinline

E.g. this allows `finalizer` inlining in the following case: ```julia mutable struct ForeignBuffer{T} const ptr::Ptr{T} end const foreign_buffer_finalized = Ref(false) function foreign_alloc(::Type{T}, length) where T ptr = Libc.malloc(sizeof(T) * length) ptr = Base.unsafe_convert(Ptr{T}, ptr) obj = ForeignBuffer{T}(ptr) return finalizer(obj) do obj Base.@assume_effects :notaskstate :nothrow foreign_buffer_finalized[] = true Libc.free(obj.ptr) end end function f_EA_finalizer(N::Int) workspace = foreign_alloc(Float64, N) GC.@preserve workspace begin (;ptr) = workspace Base.@assume_effects :nothrow @noinline println(devnull, "ptr = ", ptr) end end ``` ```julia julia> @code_typed f_EA_finalizer(42) CodeInfo( 1 ── %1 = Base.mul_int(8, N)::Int64 │ %2 = Core.lshr_int(%1, 63)::Int64 │ %3 = Core.trunc_int(Core.UInt8, %2)::UInt8 │ %4 = Core.eq_int(%3, 0x01)::Bool └─── goto #3 if not %4 2 ── invoke Core.throw_inexacterror(:convert::Symbol, UInt64::Type, %1::Int64)::Union{} └─── unreachable 3 ── goto #4 4 ── %9 = Core.bitcast(Core.UInt64, %1)::UInt64 └─── goto #5 5 ── goto #6 6 ── goto #7 7 ── goto #8 8 ── %14 = $(Expr(:foreigncall, :(:malloc), Ptr{Nothing}, svec(UInt64), 0, :(:ccall), :(%9), :(%9)))::Ptr{Nothing} └─── goto #9 9 ── %16 = Base.bitcast(Ptr{Float64}, %14)::Ptr{Float64} │ %17 = %new(ForeignBuffer{Float64}, %16)::ForeignBuffer{Float64} └─── goto #10 10 ─ %19 = $(Expr(:gc_preserve_begin, :(%17))) │ %20 = Base.getfield(%17, :ptr)::Ptr{Float64} │ invoke Main.println(Main.devnull::Base.DevNull, "ptr = "::String, %20::Ptr{Float64})::Nothing │ $(Expr(:gc_preserve_end, :(%19))) │ %23 = Main.foreign_buffer_finalized::Base.RefValue{Bool} │ Base.setfield!(%23, :x, true)::Bool │ %25 = Base.getfield(%17, :ptr)::Ptr{Float64} │ %26 = Base.bitcast(Ptr{Nothing}, %25)::Ptr{Nothing} │ $(Expr(:foreigncall, :(:free), Nothing, svec(Ptr{Nothing}), 0, :(:ccall), :(%26), :(%25)))::Nothing └─── return nothing ) => Nothing ``` However, this is still a WIP. Before merging, I want to improve EA's precision a bit and at least fix the test case that is currently marked as `broken`. I also need to check its impact on compiler performance. Additionally, I believe this feature is not yet practical. In particular, there is still significant room for improvement in the following areas: - EA's interprocedural capabilities: currently EA is performed ad-hoc for limited frames because of latency reasons, which significantly reduces its precision in the presence of interprocedural calls. - Relaxing the `:nothrow` check for finalizer inlining: the current algorithm requires `:nothrow`-ness on all paths from the allocation of the mutable struct to its last use, which is not practical for real-world cases. Even when `:nothrow` cannot be guaranteed, auxiliary optimizations such as inserting a `finalize` call after the last use might still be possible (#55990).

ghost assigned StefanKarpinski May 2, 2011

StefanKarpinski closed this as completed in 83bd7b1 May 2, 2011

ViralBShah mentioned this issue Mar 16, 2012

WIP: libuv rewrite and Windows port #521

Merged

vtjnash mentioned this issue Mar 17, 2012

build fails at sys.ji #600

Closed

ViralBShah mentioned this issue Mar 18, 2012

Building julia with clang: tests fail #602

Closed

forkandwait mentioned this issue May 12, 2012

Failed build on FreeBSD #829

Closed

6e441f9c mentioned this issue May 30, 2012

Julia segfaults while loading stage0.jl #901

Closed

dioptre mentioned this issue Jul 30, 2012

error during bootstrap on OS X #1089

Closed

fundamental mentioned this issue Apr 11, 2013

Regression in reload() Results in REPL Hanging #2829

Closed

railskipl mentioned this issue May 3, 2013

Tab-Expansion of Base in REPL causes SIGABRT #2064

Closed

staticfloat mentioned this issue Jan 18, 2014

Build failure on Fedora #5422

Closed

cmundi mentioned this issue Mar 18, 2014

RFC: Windows CI #6028

Merged

burrowsa pushed a commit to burrowsa/julia that referenced this issue Mar 24, 2014

Merge pull request JuliaLang#9 from JuliaLang/master

a320bad

update to 05c323d [ci skip]

svaksha mentioned this issue May 3, 2014

[openblas] make testall - error in spawn.jl #6728

Closed

yuyichao mentioned this issue May 26, 2015

RFC: Try to generate reusable jlcall wrapper #11439

Closed

MaheshWaidande mentioned this issue Jul 27, 2015

Julia binary goes in infinite wait on Ubuntu 14.10 (ppc64le) #12326

Closed

chipkent mentioned this issue Jul 19, 2016

Possible LLVM segfault #17495

Closed

vchuravy mentioned this issue Jan 1, 2017

Basic support for builtins from compiler-rt #18734

Closed

5 tasks

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018

Merge pull request #9 from JuliaLang/sjk/at-compat

1ed1f76

Add ＠compat as discussed in #8

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018

Merge pull request #9 from staticfloat/sf/coveragebaby

98f4e99

Increase coverage

StefanKarpinski added a commit that referenced this issue Feb 8, 2018

actually fix loading issues (#9)

4596b54

vchuravy mentioned this issue Feb 8, 2018

use git subtree to graft Pkg3, etc. into stdlib #25942

Closed

KristofferC pushed a commit that referenced this issue Feb 8, 2018

Merge pull request #9 from christopher-dG/cdg/exceptions

b773c55

Display exceptions

Keno pushed a commit that referenced this issue Oct 9, 2023

Merge pull request #9 from JuliaDebug/teh/splatnew

4f68437

Handle :splatnew expressions on nightly

gbaraldi mentioned this issue Jul 2, 2024

Bump LLVM to v18 #54848

Merged

dlfivefifty mentioned this issue Nov 26, 2018

Lazy addition? JuliaArrays/LazyArrays.jl#8

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix test_utf8.j #9

fix test_utf8.j #9

StefanKarpinski commented May 2, 2011

fix test_utf8.j #9

fix test_utf8.j #9

Comments

StefanKarpinski commented May 2, 2011