Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sort_int_range! for rangelen::UInt64 #2

Merged
merged 1 commit into from
Oct 20, 2021
Merged

Conversation

LilithHafner
Copy link
Owner

No description provided.

@LilithHafner LilithHafner merged commit 8b9c60c into patch-1 Oct 20, 2021
@LilithHafner LilithHafner deleted the patch-2 branch October 20, 2021 14:42
LilithHafner pushed a commit that referenced this pull request Feb 22, 2022
…43226)

In order to allow `Argument`s to be printed nicely.

> before
```julia
julia> code_typed((Float64,)) do x
           sin(x)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.sin(_2::Float64)::Float64
└──      return %1
) => Float64

julia> code_typed((Bool,Any,Any)) do c, x, y
           z = c ? x : y
           z
       end
1-element Vector{Any}:
 CodeInfo(
1 ─      goto #3 if not c
2 ─      goto #4
3 ─      nothing::Nothing
4 ┄ %4 = φ (#2 => _3, #3 => _4)::Any
└──      return %4
) => Any
```

> after
```julia
julia> code_typed((Float64,)) do x
           sin(x)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.sin(x::Float64)::Float64
└──      return %1
) => Float64

julia> code_typed((Bool,Any,Any)) do c, x, y
           z = c ? x : y
           z
       end
1-element Vector{Any}:
 CodeInfo(
1 ─      goto #3 if not c
2 ─      goto #4
3 ─      nothing::Nothing
4 ┄ %4 = φ (#2 => x, #3 => y)::Any
└──      return %4
) => Any
```
LilithHafner pushed a commit that referenced this pull request Mar 8, 2022
…43226)

In order to allow `Argument`s to be printed nicely.

> before
```julia
julia> code_typed((Float64,)) do x
           sin(x)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.sin(_2::Float64)::Float64
└──      return %1
) => Float64

julia> code_typed((Bool,Any,Any)) do c, x, y
           z = c ? x : y
           z
       end
1-element Vector{Any}:
 CodeInfo(
1 ─      goto #3 if not c
2 ─      goto #4
3 ─      nothing::Nothing
4 ┄ %4 = φ (#2 => _3, #3 => _4)::Any
└──      return %4
) => Any
```

> after
```julia
julia> code_typed((Float64,)) do x
           sin(x)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.sin(x::Float64)::Float64
└──      return %1
) => Float64

julia> code_typed((Bool,Any,Any)) do c, x, y
           z = c ? x : y
           z
       end
1-element Vector{Any}:
 CodeInfo(
1 ─      goto #3 if not c
2 ─      goto #4
3 ─      nothing::Nothing
4 ┄ %4 = φ (#2 => x, #3 => y)::Any
└──      return %4
) => Any
```
LilithHafner pushed a commit that referenced this pull request Mar 8, 2022
…43226)

In order to allow `Argument`s to be printed nicely.

> before
```julia
julia> code_typed((Float64,)) do x
           sin(x)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.sin(_2::Float64)::Float64
└──      return %1
) => Float64

julia> code_typed((Bool,Any,Any)) do c, x, y
           z = c ? x : y
           z
       end
1-element Vector{Any}:
 CodeInfo(
1 ─      goto #3 if not c
2 ─      goto #4
3 ─      nothing::Nothing
4 ┄ %4 = φ (#2 => _3, #3 => _4)::Any
└──      return %4
) => Any
```

> after
```julia
julia> code_typed((Float64,)) do x
           sin(x)
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = invoke Main.sin(x::Float64)::Float64
└──      return %1
) => Float64

julia> code_typed((Bool,Any,Any)) do c, x, y
           z = c ? x : y
           z
       end
1-element Vector{Any}:
 CodeInfo(
1 ─      goto #3 if not c
2 ─      goto #4
3 ─      nothing::Nothing
4 ┄ %4 = φ (#2 => x, #3 => y)::Any
└──      return %4
) => Any
```
LilithHafner pushed a commit that referenced this pull request Mar 31, 2022
Currently the optimizer handles abstract callsite only when there is a
single dispatch candidate (in most cases), and so inlining and static-dispatch
are prohibited when the callsite is union-split (in other word, union-split
happens only when all the dispatch candidates are concrete).

However, there are certain patterns of code (most notably our Julia-level compiler code)
that inherently need to deal with abstract callsite.
The following example is taken from `Core.Compiler` utility:
```julia
julia> @inline isType(@nospecialize t) = isa(t, DataType) && t.name === Type.body.name
isType (generic function with 1 method)

julia> code_typed((Any,)) do x # abstract, but no union-split, successful inlining
           isType(x)
       end |> only
CodeInfo(
1 ─ %1 = (x isa Main.DataType)::Bool
└──      goto #3 if not %1
2 ─ %3 = π (x, DataType)
│   %4 = Base.getfield(%3, :name)::Core.TypeName
│   %5 = Base.getfield(Type{T}, :name)::Core.TypeName
│   %6 = (%4 === %5)::Bool
└──      goto #4
3 ─      goto #4
4 ┄ %9 = φ (#2 => %6, #3 => false)::Bool
└──      return %9
) => Bool

julia> code_typed((Union{Type,Nothing},)) do x # abstract, union-split, unsuccessful inlining
           isType(x)
       end |> only
CodeInfo(
1 ─ %1 = (isa)(x, Nothing)::Bool
└──      goto #3 if not %1
2 ─      goto #4
3 ─ %4 = Main.isType(x)::Bool
└──      goto #4
4 ┄ %6 = φ (#2 => false, #3 => %4)::Bool
└──      return %6
) => Bool
```
(note that this is a limitation of the inlining algorithm, and so any
user-provided hints like callsite inlining annotation doesn't help here)

This commit enables inlining and static dispatch for abstract union-split callsite.
The core idea here is that we can simulate our dispatch semantics by
generating `isa` checks in order of the specialities of dispatch candidates:
```julia
julia> code_typed((Union{Type,Nothing},)) do x # union-split, unsuccessful inlining
                  isType(x)
              end |> only
CodeInfo(
1 ─ %1  = (isa)(x, Nothing)::Bool
└──       goto #3 if not %1
2 ─       goto #9
3 ─ %4  = (isa)(x, Type)::Bool
└──       goto #8 if not %4
4 ─ %6  = π (x, Type)
│   %7  = (%6 isa Main.DataType)::Bool
└──       goto #6 if not %7
5 ─ %9  = π (%6, DataType)
│   %10 = Base.getfield(%9, :name)::Core.TypeName
│   %11 = Base.getfield(Type{T}, :name)::Core.TypeName
│   %12 = (%10 === %11)::Bool
└──       goto #7
6 ─       goto #7
7 ┄ %15 = φ (#5 => %12, #6 => false)::Bool
└──       goto #9
8 ─       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
└──       unreachable
9 ┄ %19 = φ (#2 => false, #7 => %15)::Bool
└──       return %19
) => Bool
```

Inlining/static-dispatch of abstract union-split callsite will improve
the performance in such situations (and so this commit will improve the
latency of our JIT compilation). Especially, this commit helps us avoid
excessive specializations of `Core.Compiler` code by statically-resolving
`@nospecialize`d callsites, and as the result, the # of precompiled
statements is now reduced from  `2005` ([`master`](f782430)) to `1912` (this commit).

And also, as a side effect, the implementation of our inlining algorithm
gets much simplified now since we no longer need the previous special
handlings for abstract callsites.

One possible drawback would be increased code size.
This change seems to certainly increase the size of sysimage,
but I think these numbers are in an acceptable range:
> [`master`](f782430)
```
❯ du -shk usr/lib/julia/*
17604	usr/lib/julia/corecompiler.ji
194072	usr/lib/julia/sys-o.a
169424	usr/lib/julia/sys.dylib
23784	usr/lib/julia/sys.dylib.dSYM
103772	usr/lib/julia/sys.ji
```

> this commit
```
❯ du -shk usr/lib/julia/*
17512	usr/lib/julia/corecompiler.ji
195588	usr/lib/julia/sys-o.a
170908	usr/lib/julia/sys.dylib
23776	usr/lib/julia/sys.dylib.dSYM
105360	usr/lib/julia/sys.ji
```
LilithHafner pushed a commit that referenced this pull request Jun 30, 2022
…Lang#45790)

Currently the `@nospecialize`-d `push!(::Vector{Any}, ...)` can only
take a single item and we will end up with runtime dispatch when we try
to call it with multiple items:
```julia
julia> code_typed(push!, (Vector{Any}, Any))
1-element Vector{Any}:
 CodeInfo(
1 ─      $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000001, 0x0000000000000001))::Nothing
│   %2 = Base.arraylen(a)::Int64
│        Base.arrayset(true, a, item, %2)::Vector{Any}
└──      return a
) => Vector{Any}

julia> code_typed(push!, (Vector{Any}, Any, Any))
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = Base.append!(a, iter)::Vector{Any}
└──      return %1
) => Vector{Any}
```

This commit adds a new specialization that it can take arbitrary-length
items. Our compiler should still be able to optimize the single-input 
case as before via the dispatch mechanism.
```julia
julia> code_typed(push!, (Vector{Any}, Any))
1-element Vector{Any}:
 CodeInfo(
1 ─      $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000001, 0x0000000000000001))::Nothing
│   %2 = Base.arraylen(a)::Int64
│        Base.arrayset(true, a, item, %2)::Vector{Any}
└──      return a
) => Vector{Any}

julia> code_typed(push!, (Vector{Any}, Any, Any))
1-element Vector{Any}:
 CodeInfo(
1 ─ %1  = Base.arraylen(a)::Int64
│         $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000002, 0x0000000000000002))::Nothing
└──       goto #7 if not true
2 ┄ %4  = φ (#1 => 1, #6 => %14)::Int64
│   %5  = φ (#1 => 1, #6 => %15)::Int64
│   %6  = Base.getfield(x, %4, true)::Any
│   %7  = Base.add_int(%1, %4)::Int64
│         Base.arrayset(true, a, %6, %7)::Vector{Any}
│   %9  = (%5 === 2)::Bool
└──       goto #4 if not %9
3 ─       goto #5
4 ─ %12 = Base.add_int(%5, 1)::Int64
└──       goto #5
5 ┄ %14 = φ (#4 => %12)::Int64
│   %15 = φ (#4 => %12)::Int64
│   %16 = φ (#3 => true, #4 => false)::Bool
│   %17 = Base.not_int(%16)::Bool
└──       goto #7 if not %17
6 ─       goto #2
7 ┄       return a
) => Vector{Any}
```

This commit also adds the equivalent implementations for `pushfirst!`.
LilithHafner pushed a commit that referenced this pull request Jun 30, 2022
When calling `jl_error()` or `jl_errorf()`, we must check to see if we
are so early in the bringup process that it is dangerous to attempt to
construct a backtrace because the data structures used to provide line
information are not properly setup.

This can be easily triggered by running:

```
julia -C invalid
```

On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name"
branch in `jitlayers.cpp`, which calls `jl_errorf()`.  This in turn
calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part
of the backtrace printing process, which fails as the object maps are
not fully initialized.  See the below `gdb` stacktrace for details:

```
$ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid
...
fatal: error thrown and no exception handler available.
ErrorException("Invalid CPU name "invalid".")

Thread 1 "julia" received signal SIGSEGV, Segmentation fault.
0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277
1277    /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory.
 #0  0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277
 #1  std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258
 #2  jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181
 #3  0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210
 #4  0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636
 #5  0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657
 #6  jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090
 #7  0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605
 #8  0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638
 #9  0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654
 #10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77
 #11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823
 #12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044
 #13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585
 #14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648
 #15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131
 #16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184
 #17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424
 #18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157
 #19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741
 JuliaLang#20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728
 JuliaLang#21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705
 JuliaLang#22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59
```

To prevent this, we simply avoid calling `jl_errorf` this early in the
process, punting the problem to a later PR that can update guard
conditions within `jl_error*`.
LilithHafner pushed a commit that referenced this pull request Aug 12, 2023
This commit improves SROA pass by extending the `unswitchtupleunion`
optimization to handle the general parametric types, e.g.:
```julia
julia> struct A{T}
           x::T
       end;

julia> function foo(a1, a2, c)
           t = c ? A(a1) : A(a2)
           return getfield(t, :x)
       end;

julia> only(Base.code_ircode(foo, (Int,Float64,Bool); optimize_until="SROA"))
```

> Before
```
2 1 ─      goto #3 if not _4                                          │
  2 ─ %2 = %new(A{Int64}, _2)::A{Int64}                               │╻ A
  └──      goto #4                                                    │
  3 ─ %4 = %new(A{Float64}, _3)::A{Float64}                           │╻ A
  4 ┄ %5 = φ (#2 => %2, #3 => %4)::Union{A{Float64}, A{Int64}}        │
3 │   %6 = Main.getfield(%5, :x)::Union{Float64, Int64}               │
  └──      return %6                                                  │
   => Union{Float64, Int64}
```

> After
```
julia> only(Base.code_ircode(foo, (Int,Float64,Bool); optimize_until="SROA"))
2 1 ─      goto #3 if not _4                                           │
  2 ─      nothing::A{Int64}                                           │╻ A
  └──      goto #4                                                     │
  3 ─      nothing::A{Float64}                                         │╻ A
  4 ┄ %8 = φ (#2 => _2, #3 => _3)::Union{Float64, Int64}               │
  │        nothing::Union{A{Float64}, A{Int64}}
3 │   %6 = %8::Union{Float64, Int64}                                   │
  └──      return %6                                                   │
   => Union{Float64, Int64}
```
LilithHafner pushed a commit that referenced this pull request Dec 7, 2023
…#51489)

This exposes the GC "stop the world" API to the user, for causing a
thread to quickly stop executing Julia code. This adds two APIs (that
will need to be exported and documented later):
```
julia> @CCall jl_safepoint_suspend_thread(#=tid=#1::Cint, #=magicnumber=#2::Cint)::Cint # roughly tkill(1, SIGSTOP)

julia> @CCall jl_safepoint_resume_thread(#=tid=#1::Cint)::Cint # roughly tkill(1, SIGCONT)
```

You can even suspend yourself, if there is another task to resume you 10
seconds later:
```
julia> ccall(:jl_enter_threaded_region, Cvoid, ())

julia> t = @task let; Libc.systemsleep(10); print("\nhello from $(Threads.threadid())\n"); @CCall jl_safepoint_resume_thread(0::Cint)::Cint; end; ccall(:jl_set_task_tid, Cint, (Any, Cint), t, 1); schedule(t);

julia> @time @CCall jl_safepoint_suspend_thread(0::Cint, 2::Cint)::Cint

hello from 2
  10 seconds (6 allocations: 264 bytes)
1
```

The meaning of the magic number is actually the kind of stop that you
want:
```
// n.b. suspended threads may still run in the GC or GC safe regions
// but shouldn't be observable, depending on which enum the user picks (only 1 and 2 are typically recommended here)
// waitstate = 0 : do not wait for suspend to finish
// waitstate = 1 : wait for gc_state != 0 (JL_GC_STATE_WAITING or JL_GC_STATE_SAFE)
// waitstate = 2 : wait for gc_state != 0 (JL_GC_STATE_WAITING or JL_GC_STATE_SAFE) and that GC is not running on that thread
// waitstate = 3 : wait for full suspend (gc_state == JL_GC_STATE_WAITING) -- this may never happen if thread is sleeping currently
// if another thread comes along and calls jl_safepoint_resume, we also return early
// return new suspend count on success, 0 on failure
```
Only magic number 2 is currently meaningful to the user though. The
difference between waitstate 1 and 2 is only relevant in C code which is
calling this from JL_GC_STATE_SAFE, since otherwise it is a priori known
that GC isn't running, else we too would be running the GC. But the
distinction of those states might be useful if we have a concurrent
collector.

Very important warning: if the stopped thread is holding any locks
(e.g. for codegen or types) that you then attempt to acquire, your
thread will deadlock. This is very likely, unless you are very careful.
A future update to this API may try to change the waitstate to give the
option to wait for the thread to release internal or known locks.
LilithHafner pushed a commit that referenced this pull request Dec 21, 2023
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`

This results quite often in spectacular speedups for time to first X as
it reduces the time spent in LLVM optimization passes by 25 or even 30%.

Example 1:

```julia
using LoopVectorization
function f!(a, b)
    @turbo for i in eachindex(a)
        a[i] *= b[i]
    end
    return a
end
f!(rand(1), rand(1))
```

```console
$ time ./julia -O3 lv.jl
```

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)

Example 3 (taken from issue JuliaLang#45395, which is almost only LLVM):

```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```

Without PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 101.0130 seconds (98.6253 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  53.6961 ( 54.7%)   0.1050 (  3.8%)  53.8012 ( 53.3%)  53.8045 ( 54.6%)  Unroll loops
  25.5423 ( 26.0%)   0.0072 (  0.3%)  25.5495 ( 25.3%)  25.5444 ( 25.9%)  Global Value Numbering
   7.1995 (  7.3%)   0.0526 (  1.9%)   7.2521 (  7.2%)   7.2517 (  7.4%)  Induction Variable Simplification
   5.0541 (  5.1%)   0.0098 (  0.3%)   5.0639 (  5.0%)   5.0561 (  5.1%)
   Combine redundant instructions #2
```

Wit PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 72.6507 seconds (70.1337 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.0894 ( 51.7%)   0.0825 (  2.9%)  36.1719 ( 49.8%)  36.1738 ( 51.6%)  Unroll loops
  16.5713 ( 23.7%)   0.0129 (  0.5%)  16.5843 ( 22.8%)  16.5794 ( 23.6%)  Global Value Numbering
   5.9047 (  8.5%)   0.0395 (  1.4%)   5.9442 (  8.2%)   5.9438 (  8.5%)  Induction Variable Simplification
   4.7566 (  6.8%)   0.0078 (  0.3%)   4.7645 (  6.6%)   4.7575 (  6.8%)  Combine redundant instructions #2
```

Or -28% time spent in LLVM.

---

Finally there's a significant reduction in binary sizes. For libLLVM.so:

```
79M	usr/lib/libLLVM-13jl.so (before)
67M	usr/lib/libLLVM-13jl.so (after)
```

And it can be reduced by another 2MB with `--icf=safe` when using LLD as
a linker anways.

Turn into makefile

Newline

Use two out of source builds

Ignore profiles + build dirs

Add --icf=safe

stage0 setup prebuilt clang with [cd]tors->init/fini patch
LilithHafner pushed a commit that referenced this pull request Apr 18, 2024
`@something` eagerly unwraps any `Some` given to it, while keeping the
variable between its arguments the same. This can be an issue if a
previously unpacked value is used as input to `@something`, leading to a
type instability on more than two arguments (e.g. because of a fallback
to `Some(nothing)`). By using different variables for each argument,
type inference has an easier time handling these cases that are isolated
to single branches anyway.

This also adds some comments to the macro, since it's non-obvious what
it does.

Benchmarking the specific case I encountered this in led to a ~2x
performance improvement on multiple machines.

1.10-beta3/master:

```
[sukera@tower 01]$ jl1100 -q --project=. -L 01.jl -e 'bench()'
v"1.10.0-beta3"

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  38.670 μs … 70.350 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.340 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.395 μs ±  1.518 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▆█▂ ▁▁                           
  ▂▂▂▂▂▂▂▂▂▁▂▂▂▃▃▃▂▂▃▃▃▂▂▂▂▂▄▇███▆██▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  38.7 μs         Histogram: frequency by time          48 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

This PR:

```
[sukera@tower 01]$ julia -q --project=. -L 01.jl -e 'bench()'
v"1.11.0-DEV.970"

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  22.820 μs …  44.980 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.370 μs ± 832.239 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▂▅▇██▇▆▅▁                                       
  ▂▂▂▂▂▂▂▂▃▃▄▅▇███████████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂ ▃
  22.8 μs         Histogram: frequency by time         27.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
``` 


<details>
<summary>Benchmarking code (spoilers for Advent Of Code 2023 Day 01,
Part 01). Running this requires the input of that Advent Of Code
day.</summary>

```julia
using BenchmarkTools
using InteractiveUtils

isdigit(d::UInt8) = UInt8('0') <= d <= UInt8('9')
someDigit(c::UInt8) = isdigit(c) ? Some(c - UInt8('0')) : nothing

function part1(data)
    total = 0
    may_a = nothing
    may_b = nothing

    for c in data
        digitRes = someDigit(c)
        may_a = @something may_a digitRes Some(nothing)
        may_b = @something digitRes may_b Some(nothing)
        if c == UInt8('\n')
            digit_a = may_a::UInt8
            digit_b = may_b::UInt8
            total += digit_a*0xa + digit_b
            may_a = nothing
            may_b = nothing
        end
    end

    return total
end

function bench()
    data = read("input.txt")
    display(VERSION)
    println()
    display(@benchmark part1($data))
    nothing
end
```
</details>

<details>
<summary>`@code_warntype` before</summary>

```julia
julia> @code_warntype part1(data)
MethodInstance for part1(::Vector{UInt8})
  from part1(data) @ Main ~/Documents/projects/AOC/2023/01/01.jl:7
Arguments
  #self#::Core.Const(part1)
  data::Vector{UInt8}
Locals
  @_3::Union{Nothing, Tuple{UInt8, Int64}}
  may_b::Union{Nothing, UInt8}
  may_a::Union{Nothing, UInt8}
  total::Int64
  c::UInt8
  digit_b::UInt8
  digit_a::UInt8
  val@_10::Any
  val@_11::Any
  digitRes::Union{Nothing, Some{UInt8}}
  @_13::Union{Some{Nothing}, Some{UInt8}, UInt8}
  @_14::Union{Some{Nothing}, Some{UInt8}}
  @_15::Some{Nothing}
  @_16::Union{Some{Nothing}, Some{UInt8}, UInt8}
  @_17::Union{Some{Nothing}, UInt8}
  @_18::Some{Nothing}
Body::Int64
1 ──       (total = 0)
│          (may_a = Main.nothing)
│          (may_b = Main.nothing)
│    %4  = data::Vector{UInt8}
│          (@_3 = Base.iterate(%4))
│    %6  = (@_3 === nothing)::Bool
│    %7  = Base.not_int(%6)::Bool
└───       goto JuliaLang#24 if not %7
2 ┄─       Core.NewvarNode(:(digit_b))
│          Core.NewvarNode(:(digit_a))
│          Core.NewvarNode(:(val@_10))
│    %12 = @_3::Tuple{UInt8, Int64}
│          (c = Core.getfield(%12, 1))
│    %14 = Core.getfield(%12, 2)::Int64
│          (digitRes = Main.someDigit(c))
│          (val@_11 = may_a)
│    %17 = (val@_11::Union{Nothing, UInt8} !== Base.nothing)::Bool
└───       goto #4 if not %17
3 ──       (@_13 = val@_11::UInt8)
└───       goto #11
4 ──       (val@_11 = digitRes)
│    %22 = (val@_11::Union{Nothing, Some{UInt8}} !== Base.nothing)::Bool
└───       goto #6 if not %22
5 ──       (@_14 = val@_11::Some{UInt8})
└───       goto #10
6 ──       (val@_11 = Main.Some(Main.nothing))
│    %27 = (val@_11::Core.Const(Some(nothing)) !== Base.nothing)::Core.Const(true)
└───       goto #8 if not %27
7 ──       (@_15 = val@_11::Core.Const(Some(nothing)))
└───       goto #9
8 ──       Core.Const(:(@_15 = Base.nothing))
9 ┄─       (@_14 = @_15)
10 ┄       (@_13 = @_14)
11 ┄ %34 = @_13::Union{Some{Nothing}, Some{UInt8}, UInt8}
│          (may_a = Base.something(%34))
│          (val@_10 = digitRes)
│    %37 = (val@_10::Union{Nothing, Some{UInt8}} !== Base.nothing)::Bool
└───       goto #13 if not %37
12 ─       (@_16 = val@_10::Some{UInt8})
└───       goto JuliaLang#20
13 ─       (val@_10 = may_b)
│    %42 = (val@_10::Union{Nothing, UInt8} !== Base.nothing)::Bool
└───       goto #15 if not %42
14 ─       (@_17 = val@_10::UInt8)
└───       goto #19
15 ─       (val@_10 = Main.Some(Main.nothing))
│    %47 = (val@_10::Core.Const(Some(nothing)) !== Base.nothing)::Core.Const(true)
└───       goto #17 if not %47
16 ─       (@_18 = val@_10::Core.Const(Some(nothing)))
└───       goto #18
17 ─       Core.Const(:(@_18 = Base.nothing))
18 ┄       (@_17 = @_18)
19 ┄       (@_16 = @_17)
20 ┄ %54 = @_16::Union{Some{Nothing}, Some{UInt8}, UInt8}
│          (may_b = Base.something(%54))
│    %56 = c::UInt8
│    %57 = Main.UInt8('\n')::Core.Const(0x0a)
│    %58 = (%56 == %57)::Bool
└───       goto JuliaLang#22 if not %58
21 ─       (digit_a = Core.typeassert(may_a, Main.UInt8))
│          (digit_b = Core.typeassert(may_b, Main.UInt8))
│    %62 = total::Int64
│    %63 = (digit_a * 0x0a)::UInt8
│    %64 = (%63 + digit_b)::UInt8
│          (total = %62 + %64)
│          (may_a = Main.nothing)
└───       (may_b = Main.nothing)
22 ┄       (@_3 = Base.iterate(%4, %14))
│    %69 = (@_3 === nothing)::Bool
│    %70 = Base.not_int(%69)::Bool
└───       goto JuliaLang#24 if not %70
23 ─       goto #2
24 ┄       return total
```
</details>

<details>
<summary>`@code_native debuginfo=:none` Before </summary>

```julia
julia> @code_native debuginfo=:none part1(data)
	.text
	.file	"part1"
	.globl	julia_part1_418                 # -- Begin function julia_part1_418
	.p2align	4, 0x90
	.type	julia_part1_418,@function
julia_part1_418:                        # @julia_part1_418
# %bb.0:                                # %top
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	sub	rsp, 40
	mov	rax, qword ptr [rdi + 8]
	test	rax, rax
	je	.LBB0_1
# %bb.2:                                # %L17
	mov	rcx, qword ptr [rdi]
	dec	rax
	mov	r10b, 1
	xor	r14d, r14d
                                        # implicit-def: $r12b
                                        # implicit-def: $r13b
                                        # implicit-def: $r9b
                                        # implicit-def: $sil
	mov	qword ptr [rbp - 64], rax       # 8-byte Spill
	mov	al, 1
	mov	dword ptr [rbp - 48], eax       # 4-byte Spill
                                        # implicit-def: $al
                                        # kill: killed $al
	xor	eax, eax
	mov	qword ptr [rbp - 56], rax       # 8-byte Spill
	mov	qword ptr [rbp - 72], rcx       # 8-byte Spill
                                        # implicit-def: $cl
	jmp	.LBB0_3
	.p2align	4, 0x90
.LBB0_8:                                #   in Loop: Header=BB0_3 Depth=1
	mov	dword ptr [rbp - 48], 0         # 4-byte Folded Spill
.LBB0_24:                               # %post_union_move
                                        #   in Loop: Header=BB0_3 Depth=1
	movzx	r13d, byte ptr [rbp - 41]       # 1-byte Folded Reload
	mov	r12d, r8d
	cmp	qword ptr [rbp - 64], r14       # 8-byte Folded Reload
	je	.LBB0_13
.LBB0_25:                               # %guard_exit113
                                        #   in Loop: Header=BB0_3 Depth=1
	inc	r14
	mov	r10d, ebx
.LBB0_3:                                # %L19
                                        # =>This Inner Loop Header: Depth=1
	mov	rax, qword ptr [rbp - 72]       # 8-byte Reload
	xor	ebx, ebx
	xor	edi, edi
	movzx	r15d, r9b
	movzx	ecx, cl
	movzx	esi, sil
	mov	r11b, 1
                                        # implicit-def: $r9b
	movzx	edx, byte ptr [rax + r14]
	lea	eax, [rdx - 58]
	lea	r8d, [rdx - 48]
	cmp	al, -10
	setae	bl
	setb	dil
	test	r10b, 1
	cmovne	r15d, edi
	mov	edi, 0
	cmovne	ecx, ebx
	mov	bl, 1
	cmovne	esi, edi
	test	r15b, 1
	jne	.LBB0_7
# %bb.4:                                # %L76
                                        #   in Loop: Header=BB0_3 Depth=1
	mov	r11b, 2
	test	cl, 1
	jne	.LBB0_5
# %bb.6:                                # %L78
                                        #   in Loop: Header=BB0_3 Depth=1
	mov	ebx, r10d
	mov	r9d, r15d
	mov	byte ptr [rbp - 41], r13b       # 1-byte Spill
	test	sil, 1
	je	.LBB0_26
.LBB0_7:                                # %L82
                                        #   in Loop: Header=BB0_3 Depth=1
	cmp	al, -11
	jbe	.LBB0_9
	jmp	.LBB0_8
	.p2align	4, 0x90
.LBB0_5:                                #   in Loop: Header=BB0_3 Depth=1
	mov	ecx, r8d
	mov	sil, 1
	xor	ebx, ebx
	mov	byte ptr [rbp - 41], r8b        # 1-byte Spill
	xor	r9d, r9d
	xor	ecx, ecx
	cmp	al, -11
	ja	.LBB0_8
.LBB0_9:                                # %L90
                                        #   in Loop: Header=BB0_3 Depth=1
	test	byte ptr [rbp - 48], 1          # 1-byte Folded Reload
	jne	.LBB0_23
# %bb.10:                               # %L115
                                        #   in Loop: Header=BB0_3 Depth=1
	cmp	dl, 10
	jne	.LBB0_11
# %bb.14:                               # %L122
                                        #   in Loop: Header=BB0_3 Depth=1
	test	r15b, 1
	jne	.LBB0_15
# %bb.12:                               # %L130.thread
                                        #   in Loop: Header=BB0_3 Depth=1
	movzx	eax, byte ptr [rbp - 41]        # 1-byte Folded Reload
	mov	bl, 1
	add	eax, eax
	lea	eax, [rax + 4*rax]
	add	al, r12b
	movzx	eax, al
	add	qword ptr [rbp - 56], rax       # 8-byte Folded Spill
	mov	al, 1
	mov	dword ptr [rbp - 48], eax       # 4-byte Spill
	cmp	qword ptr [rbp - 64], r14       # 8-byte Folded Reload
	jne	.LBB0_25
	jmp	.LBB0_13
	.p2align	4, 0x90
.LBB0_23:                               # %L115.thread
                                        #   in Loop: Header=BB0_3 Depth=1
	mov	al, 1
                                        # implicit-def: $r8b
	mov	dword ptr [rbp - 48], eax       # 4-byte Spill
	cmp	dl, 10
	jne	.LBB0_24
	jmp	.LBB0_21
.LBB0_11:                               #   in Loop: Header=BB0_3 Depth=1
	mov	r8d, r12d
	jmp	.LBB0_24
.LBB0_1:
	xor	eax, eax
	mov	qword ptr [rbp - 56], rax       # 8-byte Spill
.LBB0_13:                               # %L159
	mov	rax, qword ptr [rbp - 56]       # 8-byte Reload
	add	rsp, 40
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret
.LBB0_21:                               # %L122.thread
	test	r15b, 1
	jne	.LBB0_15
# %bb.22:                               # %post_box_union58
	movabs	rdi, offset .L_j_str1
	movabs	rax, offset ijl_type_error
	movabs	rsi, 140008511215408
	movabs	rdx, 140008667209736
	call	rax
.LBB0_15:                               # %fail
	cmp	r11b, 1
	je	.LBB0_19
# %bb.16:                               # %fail
	movzx	eax, r11b
	cmp	eax, 2
	jne	.LBB0_17
# %bb.20:                               # %box_union54
	movzx	eax, byte ptr [rbp - 41]        # 1-byte Folded Reload
	movabs	rcx, offset jl_boxed_uint8_cache
	mov	rdx, qword ptr [rcx + 8*rax]
	jmp	.LBB0_18
.LBB0_26:                               # %L80
	movabs	rax, offset ijl_throw
	movabs	rdi, 140008495049392
	call	rax
.LBB0_19:                               # %box_union
	movabs	rdx, 140008667209736
	jmp	.LBB0_18
.LBB0_17:
	xor	edx, edx
.LBB0_18:                               # %post_box_union
	movabs	rdi, offset .L_j_str1
	movabs	rax, offset ijl_type_error
	movabs	rsi, 140008511215408
	call	rax
.Lfunc_end0:
	.size	julia_part1_418, .Lfunc_end0-julia_part1_418
                                        # -- End function
	.type	.L_j_str1,@object               # @_j_str1
	.section	.rodata.str1.1,"aMS",@progbits,1
.L_j_str1:
	.asciz	"typeassert"
	.size	.L_j_str1, 11

	.section	".note.GNU-stack","",@progbits
```
</details>

<details>
<summary>`@code_warntype` After</summary>

```julia

[sukera@tower 01]$ julia -q --project=. -L 01.jl
julia> data = read("input.txt");

julia> @code_warntype part1(data)
MethodInstance for part1(::Vector{UInt8})
  from part1(data) @ Main ~/Documents/projects/AOC/2023/01/01.jl:7
Arguments
  #self#::Core.Const(part1)
  data::Vector{UInt8}
Locals
  @_3::Union{Nothing, Tuple{UInt8, Int64}}
  may_b::Union{Nothing, UInt8}
  may_a::Union{Nothing, UInt8}
  total::Int64
  val@_7::Union{}
  val@_8::Union{}
  c::UInt8
  digit_b::UInt8
  digit_a::UInt8
  #JuliaLang#215::Some{Nothing}
  #JuliaLang#216::Union{Nothing, UInt8}
  #JuliaLang#217::Union{Nothing, Some{UInt8}}
  #JuliaLang#212::Some{Nothing}
  #JuliaLang#213::Union{Nothing, Some{UInt8}}
  #JuliaLang#214::Union{Nothing, UInt8}
  digitRes::Union{Nothing, Some{UInt8}}
  @_19::Union{Nothing, UInt8}
  @_20::Union{Nothing, UInt8}
  @_21::Nothing
  @_22::Union{Nothing, UInt8}
  @_23::Union{Nothing, UInt8}
  @_24::Nothing
Body::Int64
1 ──        (total = 0)
│           (may_a = Main.nothing)
│           (may_b = Main.nothing)
│    %4   = data::Vector{UInt8}
│           (@_3 = Base.iterate(%4))
│    %6   = @_3::Union{Nothing, Tuple{UInt8, Int64}}
│    %7   = (%6 === nothing)::Bool
│    %8   = Base.not_int(%7)::Bool
└───        goto JuliaLang#24 if not %8
2 ┄─        Core.NewvarNode(:(val@_7))
│           Core.NewvarNode(:(val@_8))
│           Core.NewvarNode(:(digit_b))
│           Core.NewvarNode(:(digit_a))
│           Core.NewvarNode(:(#JuliaLang#215))
│           Core.NewvarNode(:(#JuliaLang#216))
│           Core.NewvarNode(:(#JuliaLang#217))
│           Core.NewvarNode(:(#JuliaLang#212))
│           Core.NewvarNode(:(#JuliaLang#213))
│    %19  = @_3::Tuple{UInt8, Int64}
│           (c = Core.getfield(%19, 1))
│    %21  = Core.getfield(%19, 2)::Int64
│    %22  = c::UInt8
│           (digitRes = Main.someDigit(%22))
│    %24  = may_a::Union{Nothing, UInt8}
│           (#JuliaLang#214 = %24)
│    %26  = Base.:!::Core.Const(!)
│    %27  = #JuliaLang#214::Union{Nothing, UInt8}
│    %28  = Base.isnothing(%27)::Bool
│    %29  = (%26)(%28)::Bool
└───        goto #4 if not %29
3 ── %31  = #JuliaLang#214::UInt8
│           (@_19 = Base.something(%31))
└───        goto #11
4 ── %34  = digitRes::Union{Nothing, Some{UInt8}}
│           (#JuliaLang#213 = %34)
│    %36  = Base.:!::Core.Const(!)
│    %37  = #JuliaLang#213::Union{Nothing, Some{UInt8}}
│    %38  = Base.isnothing(%37)::Bool
│    %39  = (%36)(%38)::Bool
└───        goto #6 if not %39
5 ── %41  = #JuliaLang#213::Some{UInt8}
│           (@_20 = Base.something(%41))
└───        goto #10
6 ── %44  = Main.Some::Core.Const(Some)
│    %45  = Main.nothing::Core.Const(nothing)
│           (#JuliaLang#212 = (%44)(%45))
│    %47  = Base.:!::Core.Const(!)
│    %48  = #JuliaLang#212::Core.Const(Some(nothing))
│    %49  = Base.isnothing(%48)::Core.Const(false)
│    %50  = (%47)(%49)::Core.Const(true)
└───        goto #8 if not %50
7 ── %52  = #JuliaLang#212::Core.Const(Some(nothing))
│           (@_21 = Base.something(%52))
└───        goto #9
8 ──        Core.Const(nothing)
│           Core.Const(:(val@_8 = Base.something(Base.nothing)))
│           Core.Const(nothing)
│           Core.Const(:(val@_8))
└───        Core.Const(:(@_21 = %58))
9 ┄─ %60  = @_21::Core.Const(nothing)
└───        (@_20 = %60)
10 ┄ %62  = @_20::Union{Nothing, UInt8}
└───        (@_19 = %62)
11 ┄ %64  = @_19::Union{Nothing, UInt8}
│           (may_a = %64)
│    %66  = digitRes::Union{Nothing, Some{UInt8}}
│           (#JuliaLang#217 = %66)
│    %68  = Base.:!::Core.Const(!)
│    %69  = #JuliaLang#217::Union{Nothing, Some{UInt8}}
│    %70  = Base.isnothing(%69)::Bool
│    %71  = (%68)(%70)::Bool
└───        goto #13 if not %71
12 ─ %73  = #JuliaLang#217::Some{UInt8}
│           (@_22 = Base.something(%73))
└───        goto JuliaLang#20
13 ─ %76  = may_b::Union{Nothing, UInt8}
│           (#JuliaLang#216 = %76)
│    %78  = Base.:!::Core.Const(!)
│    %79  = #JuliaLang#216::Union{Nothing, UInt8}
│    %80  = Base.isnothing(%79)::Bool
│    %81  = (%78)(%80)::Bool
└───        goto #15 if not %81
14 ─ %83  = #JuliaLang#216::UInt8
│           (@_23 = Base.something(%83))
└───        goto #19
15 ─ %86  = Main.Some::Core.Const(Some)
│    %87  = Main.nothing::Core.Const(nothing)
│           (#JuliaLang#215 = (%86)(%87))
│    %89  = Base.:!::Core.Const(!)
│    %90  = #JuliaLang#215::Core.Const(Some(nothing))
│    %91  = Base.isnothing(%90)::Core.Const(false)
│    %92  = (%89)(%91)::Core.Const(true)
└───        goto #17 if not %92
16 ─ %94  = #JuliaLang#215::Core.Const(Some(nothing))
│           (@_24 = Base.something(%94))
└───        goto #18
17 ─        Core.Const(nothing)
│           Core.Const(:(val@_7 = Base.something(Base.nothing)))
│           Core.Const(nothing)
│           Core.Const(:(val@_7))
└───        Core.Const(:(@_24 = %100))
18 ┄ %102 = @_24::Core.Const(nothing)
└───        (@_23 = %102)
19 ┄ %104 = @_23::Union{Nothing, UInt8}
└───        (@_22 = %104)
20 ┄ %106 = @_22::Union{Nothing, UInt8}
│           (may_b = %106)
│    %108 = Main.:(==)::Core.Const(==)
│    %109 = c::UInt8
│    %110 = Main.UInt8('\n')::Core.Const(0x0a)
│    %111 = (%108)(%109, %110)::Bool
└───        goto JuliaLang#22 if not %111
21 ─ %113 = may_a::Union{Nothing, UInt8}
│           (digit_a = Core.typeassert(%113, Main.UInt8))
│    %115 = may_b::Union{Nothing, UInt8}
│           (digit_b = Core.typeassert(%115, Main.UInt8))
│    %117 = Main.:+::Core.Const(+)
│    %118 = total::Int64
│    %119 = Main.:+::Core.Const(+)
│    %120 = Main.:*::Core.Const(*)
│    %121 = digit_a::UInt8
│    %122 = (%120)(%121, 0x0a)::UInt8
│    %123 = digit_b::UInt8
│    %124 = (%119)(%122, %123)::UInt8
│           (total = (%117)(%118, %124))
│           (may_a = Main.nothing)
└───        (may_b = Main.nothing)
22 ┄        (@_3 = Base.iterate(%4, %21))
│    %129 = @_3::Union{Nothing, Tuple{UInt8, Int64}}
│    %130 = (%129 === nothing)::Bool
│    %131 = Base.not_int(%130)::Bool
└───        goto JuliaLang#24 if not %131
23 ─        goto #2
24 ┄ %134 = total::Int64
└───        return %134
```
</details>


<details>
<summary>`@code_native debuginfo=:none` After </summary>

```julia

julia> @code_native debuginfo=:none part1(data)
	.text
	.file	"part1"
	.globl	julia_part1_1203                # -- Begin function julia_part1_1203
	.p2align	4, 0x90
	.type	julia_part1_1203,@function
julia_part1_1203:                       # @julia_part1_1203
; Function Signature: part1(Array{UInt8, 1})
# %bb.0:                                # %top
	#DEBUG_VALUE: part1:data <- [DW_OP_deref] $rdi
	push	rbp
	mov	rbp, rsp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	sub	rsp, 40
	vxorps	xmm0, xmm0, xmm0
	#APP
	mov	rax, qword ptr fs:[0]
	#NO_APP
	lea	rdx, [rbp - 64]
	vmovaps	xmmword ptr [rbp - 64], xmm0
	mov	qword ptr [rbp - 48], 0
	mov	rcx, qword ptr [rax - 8]
	mov	qword ptr [rbp - 64], 4
	mov	rax, qword ptr [rcx]
	mov	qword ptr [rbp - 72], rcx       # 8-byte Spill
	mov	qword ptr [rbp - 56], rax
	mov	qword ptr [rcx], rdx
	#DEBUG_VALUE: part1:data <- [DW_OP_deref] 0
	mov	r15, qword ptr [rdi + 16]
	test	r15, r15
	je	.LBB0_1
# %bb.2:                                # %L34
	mov	r14, qword ptr [rdi]
	dec	r15
	mov	r11b, 1
	mov	r13b, 1
                                        # implicit-def: $r12b
                                        # implicit-def: $r10b
	xor	eax, eax
	jmp	.LBB0_3
	.p2align	4, 0x90
.LBB0_4:                                #   in Loop: Header=BB0_3 Depth=1
	xor	r11d, r11d
	mov	ebx, edi
	mov	r10d, r8d
.LBB0_9:                                # %L114
                                        #   in Loop: Header=BB0_3 Depth=1
	mov	r12d, esi
	test	r15, r15
	je	.LBB0_12
.LBB0_10:                               # %guard_exit126
                                        #   in Loop: Header=BB0_3 Depth=1
	inc	r14
	dec	r15
	mov	r13d, ebx
.LBB0_3:                                # %L36
                                        # =>This Inner Loop Header: Depth=1
	movzx	edx, byte ptr [r14]
	test	r13b, 1
	movzx	edi, r13b
	mov	ebx, 1
	mov	ecx, 0
	cmove	ebx, edi
	cmovne	edi, ecx
	movzx	ecx, r10b
	lea	esi, [rdx - 48]
	lea	r9d, [rdx - 58]
	movzx	r8d, sil
	cmove	r8d, ecx
	cmp	r9b, -11
	ja	.LBB0_4
# %bb.5:                                # %L89
                                        #   in Loop: Header=BB0_3 Depth=1
	test	r11b, 1
	jne	.LBB0_8
# %bb.6:                                # %L102
                                        #   in Loop: Header=BB0_3 Depth=1
	cmp	dl, 10
	jne	.LBB0_7
# %bb.13:                               # %L106
                                        #   in Loop: Header=BB0_3 Depth=1
	test	r13b, 1
	jne	.LBB0_14
# %bb.11:                               # %L114.thread
                                        #   in Loop: Header=BB0_3 Depth=1
	add	ecx, ecx
	mov	bl, 1
	mov	r11b, 1
	lea	ecx, [rcx + 4*rcx]
	add	cl, r12b
	movzx	ecx, cl
	add	rax, rcx
	test	r15, r15
	jne	.LBB0_10
	jmp	.LBB0_12
	.p2align	4, 0x90
.LBB0_8:                                # %L102.thread
                                        #   in Loop: Header=BB0_3 Depth=1
	mov	r11b, 1
                                        # implicit-def: $sil
	cmp	dl, 10
	jne	.LBB0_9
	jmp	.LBB0_15
.LBB0_7:                                #   in Loop: Header=BB0_3 Depth=1
	mov	esi, r12d
	jmp	.LBB0_9
.LBB0_1:
	xor	eax, eax
.LBB0_12:                               # %L154
	mov	rcx, qword ptr [rbp - 56]
	mov	rdx, qword ptr [rbp - 72]       # 8-byte Reload
	mov	qword ptr [rdx], rcx
	add	rsp, 40
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret
.LBB0_15:                               # %L106.thread
	test	r13b, 1
	jne	.LBB0_14
# %bb.16:                               # %post_box_union47
	movabs	rax, offset jl_nothing
	movabs	rcx, offset jl_small_typeof
	movabs	rdi, offset ".L_j_str_typeassert#1"
	mov	rdx, qword ptr [rax]
	mov	rsi, qword ptr [rcx + 336]
	movabs	rax, offset ijl_type_error
	mov	qword ptr [rbp - 48], rsi
	call	rax
.LBB0_14:                               # %post_box_union
	movabs	rax, offset jl_nothing
	movabs	rcx, offset jl_small_typeof
	movabs	rdi, offset ".L_j_str_typeassert#1"
	mov	rdx, qword ptr [rax]
	mov	rsi, qword ptr [rcx + 336]
	movabs	rax, offset ijl_type_error
	mov	qword ptr [rbp - 48], rsi
	call	rax
.Lfunc_end0:
	.size	julia_part1_1203, .Lfunc_end0-julia_part1_1203
                                        # -- End function
	.type	".L_j_str_typeassert#1",@object # @"_j_str_typeassert#1"
	.section	.rodata.str1.1,"aMS",@progbits,1
".L_j_str_typeassert#1":
	.asciz	"typeassert"
	.size	".L_j_str_typeassert#1", 11

	.section	".note.GNU-stack","",@progbits
```
</details>

Co-authored-by: Sukera <[email protected]>
LilithHafner added a commit that referenced this pull request Apr 18, 2024
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg;
Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`

<details>
<summary>* Output looks roughly like as follows</summary>

```c++
$ make -C contrib/pgo-lto top 
make: Entering directory '/dev/shm/julia/contrib/pgo-lto'
llvm-profdata show --topn=50 /dev/shm/julia/contrib/pgo-lto/profiles/merged.prof | c++filt
Instrumentation level: IR  entry_first = 0
Total functions: 85943
Maximum function count: 7867557260
Maximum internal block count: 3468437590
Top 50 functions with the largest internal block counts: 
  llvm::BitVector::operator|=(llvm::BitVector const&), max count = 7867557260
  LateLowerGCFrame::ComputeLiveness(State&), max count = 3468437590
  llvm::hashing::detail::hash_combine_recursive_helper::hash_combine_recursive_helper(), max count = 1742259834
  llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 511396575
  llvm::LiveRange::overlaps(llvm::LiveRange const&, llvm::CoalescerPair const&, llvm::SlotIndexes const&) const, max count = 508061762
  llvm::StringMapImpl::LookupBucketFor(llvm::StringRef), max count = 505682177
  std::map<llvm::BasicBlock*, BBState, std::less<llvm::BasicBlock*>, std::allocator<std::pair<llvm::BasicBlock* const, BBState> > >::operator[](llvm::BasicBlock* const&), max count = 395628888
  llvm::LiveRange::advanceTo(llvm::LiveRange::Segment const*, llvm::SlotIndex) const, max count = 384642728
  llvm::LiveRange::isLiveAtIndexes(llvm::ArrayRef<llvm::SlotIndex>) const, max count = 380291040
  llvm::PassRegistry::enumerateWith(llvm::PassRegistrationListener*), max count = 352313953
  ijl_method_instance_add_backedge, max count = 349608221
  llvm::SUnit::ComputeHeight(), max count = 336604330
  llvm::LiveRange::advanceTo(llvm::LiveRange::Segment*, llvm::SlotIndex), max count = 331030109
  llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 272966545
  llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&), max count = 257449540
  LateLowerGCFrame::ComputeLiveSets(State&), max count = 252096274
  /dev/shm/julia/src/jltypes.c:has_free_typevars, max count = 230879464
  ijl_get_pgcstack, max count = 216953592
  LateLowerGCFrame::RefineLiveSet(llvm::BitVector&, State&, std::vector<int, std::allocator<int> > const&), max count = 188013152
  /dev/shm/julia/src/flisp/flisp.c:apply_cl, max count = 174863813
  /dev/shm/julia/src/flisp/builtins.c:fl_memq, max count = 168621603
```
</details>


This results quite often in spectacular speedups for time to first X as
it reduces the time spent in LLVM optimization passes by 25 or even 30%.

Example 1:

```julia
using LoopVectorization
function f!(a, b)
    @turbo for i in eachindex(a)
        a[i] *= b[i]
    end
    return a
end
f!(rand(1), rand(1))
```

```console
$ time ./julia -O3 lv.jl
```

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)

Example 3 (taken from issue JuliaLang#45395, which is almost only LLVM):

```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```

Without PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 101.0130 seconds (98.6253 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  53.6961 ( 54.7%)   0.1050 (  3.8%)  53.8012 ( 53.3%)  53.8045 ( 54.6%)  Unroll loops
  25.5423 ( 26.0%)   0.0072 (  0.3%)  25.5495 ( 25.3%)  25.5444 ( 25.9%)  Global Value Numbering
   7.1995 (  7.3%)   0.0526 (  1.9%)   7.2521 (  7.2%)   7.2517 (  7.4%)  Induction Variable Simplification
   6.0541 (  5.1%)   0.0098 (  0.3%)   5.0639 (  5.0%)   5.0561 (  5.1%)  Combine redundant instructions #2
```

With PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 72.6507 seconds (70.1337 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.0894 ( 51.7%)   0.0825 (  2.9%)  36.1719 ( 49.8%)  36.1738 ( 51.6%)  Unroll loops
  16.5713 ( 23.7%)   0.0129 (  0.5%)  16.5843 ( 22.8%)  16.5794 ( 23.6%)  Global Value Numbering
   5.9047 (  8.5%)   0.0395 (  1.4%)   5.9442 (  8.2%)   5.9438 (  8.5%)  Induction Variable Simplification
   4.7566 (  6.8%)   0.0078 (  0.3%)   4.7645 (  6.6%)   4.7575 (  6.8%)  Combine redundant instructions #2
```

Or -28% time spent in LLVM.

`perf` reports show this is mostly fewer instructions and reduction in
icache misses.

---

Finally there's a significant reduction in binary sizes. For libLLVM.so:

```
79M	usr/lib/libLLVM-13jl.so (before)
67M	usr/lib/libLLVM-13jl.so (after)
```

And it can be reduced by another 2MB with `--icf=safe` when using LLD as
a linker anyways.

- [x] Two out-of-source builds would be better than a single in-source
build, so that it's easier to find good profile data

---------

Co-authored-by: Oscar Smith <[email protected]>
Co-authored-by: Lilith Orion Hafner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant