forked from JuliaLang/julia
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix sort_int_range! for rangelen::UInt64 #2
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
LilithHafner
pushed a commit
that referenced
this pull request
Feb 22, 2022
…43226) In order to allow `Argument`s to be printed nicely. > before ```julia julia> code_typed((Float64,)) do x sin(x) end 1-element Vector{Any}: CodeInfo( 1 ─ %1 = invoke Main.sin(_2::Float64)::Float64 └── return %1 ) => Float64 julia> code_typed((Bool,Any,Any)) do c, x, y z = c ? x : y z end 1-element Vector{Any}: CodeInfo( 1 ─ goto #3 if not c 2 ─ goto #4 3 ─ nothing::Nothing 4 ┄ %4 = φ (#2 => _3, #3 => _4)::Any └── return %4 ) => Any ``` > after ```julia julia> code_typed((Float64,)) do x sin(x) end 1-element Vector{Any}: CodeInfo( 1 ─ %1 = invoke Main.sin(x::Float64)::Float64 └── return %1 ) => Float64 julia> code_typed((Bool,Any,Any)) do c, x, y z = c ? x : y z end 1-element Vector{Any}: CodeInfo( 1 ─ goto #3 if not c 2 ─ goto #4 3 ─ nothing::Nothing 4 ┄ %4 = φ (#2 => x, #3 => y)::Any └── return %4 ) => Any ```
LilithHafner
pushed a commit
that referenced
this pull request
Mar 8, 2022
…43226) In order to allow `Argument`s to be printed nicely. > before ```julia julia> code_typed((Float64,)) do x sin(x) end 1-element Vector{Any}: CodeInfo( 1 ─ %1 = invoke Main.sin(_2::Float64)::Float64 └── return %1 ) => Float64 julia> code_typed((Bool,Any,Any)) do c, x, y z = c ? x : y z end 1-element Vector{Any}: CodeInfo( 1 ─ goto #3 if not c 2 ─ goto #4 3 ─ nothing::Nothing 4 ┄ %4 = φ (#2 => _3, #3 => _4)::Any └── return %4 ) => Any ``` > after ```julia julia> code_typed((Float64,)) do x sin(x) end 1-element Vector{Any}: CodeInfo( 1 ─ %1 = invoke Main.sin(x::Float64)::Float64 └── return %1 ) => Float64 julia> code_typed((Bool,Any,Any)) do c, x, y z = c ? x : y z end 1-element Vector{Any}: CodeInfo( 1 ─ goto #3 if not c 2 ─ goto #4 3 ─ nothing::Nothing 4 ┄ %4 = φ (#2 => x, #3 => y)::Any └── return %4 ) => Any ```
LilithHafner
pushed a commit
that referenced
this pull request
Mar 8, 2022
…43226) In order to allow `Argument`s to be printed nicely. > before ```julia julia> code_typed((Float64,)) do x sin(x) end 1-element Vector{Any}: CodeInfo( 1 ─ %1 = invoke Main.sin(_2::Float64)::Float64 └── return %1 ) => Float64 julia> code_typed((Bool,Any,Any)) do c, x, y z = c ? x : y z end 1-element Vector{Any}: CodeInfo( 1 ─ goto #3 if not c 2 ─ goto #4 3 ─ nothing::Nothing 4 ┄ %4 = φ (#2 => _3, #3 => _4)::Any └── return %4 ) => Any ``` > after ```julia julia> code_typed((Float64,)) do x sin(x) end 1-element Vector{Any}: CodeInfo( 1 ─ %1 = invoke Main.sin(x::Float64)::Float64 └── return %1 ) => Float64 julia> code_typed((Bool,Any,Any)) do c, x, y z = c ? x : y z end 1-element Vector{Any}: CodeInfo( 1 ─ goto #3 if not c 2 ─ goto #4 3 ─ nothing::Nothing 4 ┄ %4 = φ (#2 => x, #3 => y)::Any └── return %4 ) => Any ```
LilithHafner
pushed a commit
that referenced
this pull request
Mar 31, 2022
Currently the optimizer handles abstract callsite only when there is a single dispatch candidate (in most cases), and so inlining and static-dispatch are prohibited when the callsite is union-split (in other word, union-split happens only when all the dispatch candidates are concrete). However, there are certain patterns of code (most notably our Julia-level compiler code) that inherently need to deal with abstract callsite. The following example is taken from `Core.Compiler` utility: ```julia julia> @inline isType(@nospecialize t) = isa(t, DataType) && t.name === Type.body.name isType (generic function with 1 method) julia> code_typed((Any,)) do x # abstract, but no union-split, successful inlining isType(x) end |> only CodeInfo( 1 ─ %1 = (x isa Main.DataType)::Bool └── goto #3 if not %1 2 ─ %3 = π (x, DataType) │ %4 = Base.getfield(%3, :name)::Core.TypeName │ %5 = Base.getfield(Type{T}, :name)::Core.TypeName │ %6 = (%4 === %5)::Bool └── goto #4 3 ─ goto #4 4 ┄ %9 = φ (#2 => %6, #3 => false)::Bool └── return %9 ) => Bool julia> code_typed((Union{Type,Nothing},)) do x # abstract, union-split, unsuccessful inlining isType(x) end |> only CodeInfo( 1 ─ %1 = (isa)(x, Nothing)::Bool └── goto #3 if not %1 2 ─ goto #4 3 ─ %4 = Main.isType(x)::Bool └── goto #4 4 ┄ %6 = φ (#2 => false, #3 => %4)::Bool └── return %6 ) => Bool ``` (note that this is a limitation of the inlining algorithm, and so any user-provided hints like callsite inlining annotation doesn't help here) This commit enables inlining and static dispatch for abstract union-split callsite. The core idea here is that we can simulate our dispatch semantics by generating `isa` checks in order of the specialities of dispatch candidates: ```julia julia> code_typed((Union{Type,Nothing},)) do x # union-split, unsuccessful inlining isType(x) end |> only CodeInfo( 1 ─ %1 = (isa)(x, Nothing)::Bool └── goto #3 if not %1 2 ─ goto #9 3 ─ %4 = (isa)(x, Type)::Bool └── goto #8 if not %4 4 ─ %6 = π (x, Type) │ %7 = (%6 isa Main.DataType)::Bool └── goto #6 if not %7 5 ─ %9 = π (%6, DataType) │ %10 = Base.getfield(%9, :name)::Core.TypeName │ %11 = Base.getfield(Type{T}, :name)::Core.TypeName │ %12 = (%10 === %11)::Bool └── goto #7 6 ─ goto #7 7 ┄ %15 = φ (#5 => %12, #6 => false)::Bool └── goto #9 8 ─ Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{} └── unreachable 9 ┄ %19 = φ (#2 => false, #7 => %15)::Bool └── return %19 ) => Bool ``` Inlining/static-dispatch of abstract union-split callsite will improve the performance in such situations (and so this commit will improve the latency of our JIT compilation). Especially, this commit helps us avoid excessive specializations of `Core.Compiler` code by statically-resolving `@nospecialize`d callsites, and as the result, the # of precompiled statements is now reduced from `2005` ([`master`](f782430)) to `1912` (this commit). And also, as a side effect, the implementation of our inlining algorithm gets much simplified now since we no longer need the previous special handlings for abstract callsites. One possible drawback would be increased code size. This change seems to certainly increase the size of sysimage, but I think these numbers are in an acceptable range: > [`master`](f782430) ``` ❯ du -shk usr/lib/julia/* 17604 usr/lib/julia/corecompiler.ji 194072 usr/lib/julia/sys-o.a 169424 usr/lib/julia/sys.dylib 23784 usr/lib/julia/sys.dylib.dSYM 103772 usr/lib/julia/sys.ji ``` > this commit ``` ❯ du -shk usr/lib/julia/* 17512 usr/lib/julia/corecompiler.ji 195588 usr/lib/julia/sys-o.a 170908 usr/lib/julia/sys.dylib 23776 usr/lib/julia/sys.dylib.dSYM 105360 usr/lib/julia/sys.ji ```
LilithHafner
pushed a commit
that referenced
this pull request
Jun 30, 2022
…Lang#45790) Currently the `@nospecialize`-d `push!(::Vector{Any}, ...)` can only take a single item and we will end up with runtime dispatch when we try to call it with multiple items: ```julia julia> code_typed(push!, (Vector{Any}, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000001, 0x0000000000000001))::Nothing │ %2 = Base.arraylen(a)::Int64 │ Base.arrayset(true, a, item, %2)::Vector{Any} └── return a ) => Vector{Any} julia> code_typed(push!, (Vector{Any}, Any, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ %1 = Base.append!(a, iter)::Vector{Any} └── return %1 ) => Vector{Any} ``` This commit adds a new specialization that it can take arbitrary-length items. Our compiler should still be able to optimize the single-input case as before via the dispatch mechanism. ```julia julia> code_typed(push!, (Vector{Any}, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000001, 0x0000000000000001))::Nothing │ %2 = Base.arraylen(a)::Int64 │ Base.arrayset(true, a, item, %2)::Vector{Any} └── return a ) => Vector{Any} julia> code_typed(push!, (Vector{Any}, Any, Any)) 1-element Vector{Any}: CodeInfo( 1 ─ %1 = Base.arraylen(a)::Int64 │ $(Expr(:foreigncall, :(:jl_array_grow_end), Nothing, svec(Any, UInt64), 0, :(:ccall), Core.Argument(2), 0x0000000000000002, 0x0000000000000002))::Nothing └── goto #7 if not true 2 ┄ %4 = φ (#1 => 1, #6 => %14)::Int64 │ %5 = φ (#1 => 1, #6 => %15)::Int64 │ %6 = Base.getfield(x, %4, true)::Any │ %7 = Base.add_int(%1, %4)::Int64 │ Base.arrayset(true, a, %6, %7)::Vector{Any} │ %9 = (%5 === 2)::Bool └── goto #4 if not %9 3 ─ goto #5 4 ─ %12 = Base.add_int(%5, 1)::Int64 └── goto #5 5 ┄ %14 = φ (#4 => %12)::Int64 │ %15 = φ (#4 => %12)::Int64 │ %16 = φ (#3 => true, #4 => false)::Bool │ %17 = Base.not_int(%16)::Bool └── goto #7 if not %17 6 ─ goto #2 7 ┄ return a ) => Vector{Any} ``` This commit also adds the equivalent implementations for `pushfirst!`.
LilithHafner
pushed a commit
that referenced
this pull request
Jun 30, 2022
When calling `jl_error()` or `jl_errorf()`, we must check to see if we are so early in the bringup process that it is dangerous to attempt to construct a backtrace because the data structures used to provide line information are not properly setup. This can be easily triggered by running: ``` julia -C invalid ``` On an `i686-linux-gnu` build, this will hit the "Invalid CPU Name" branch in `jitlayers.cpp`, which calls `jl_errorf()`. This in turn calls `jl_throw()`, which will eventually call `jl_DI_for_fptr` as part of the backtrace printing process, which fails as the object maps are not fully initialized. See the below `gdb` stacktrace for details: ``` $ gdb -batch -ex 'r' -ex 'bt' --args ./julia -C invalid ... fatal: error thrown and no exception handler available. ErrorException("Invalid CPU name "invalid".") Thread 1 "julia" received signal SIGSEGV, Segmentation fault. 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 1277 /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h: No such file or directory. #0 0xf75bd665 in std::_Rb_tree<unsigned int, std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo>, std::_Select1st<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> >, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__k=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_tree.h:1277 #1 std::map<unsigned int, JITDebugInfoRegistry::ObjectInfo, std::greater<unsigned int>, std::allocator<std::pair<unsigned int const, JITDebugInfoRegistry::ObjectInfo> > >::lower_bound (__x=<optimized out>, this=0x248) at /usr/local/i686-linux-gnu/include/c++/9.1.0/bits/stl_map.h:1258 #2 jl_DI_for_fptr (fptr=4155049385, symsize=symsize@entry=0xffffcfa8, slide=slide@entry=0xffffcfa0, Section=Section@entry=0xffffcfb8, context=context@entry=0xffffcf94) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1181 #3 0xf75c056a in jl_getFunctionInfo_impl (frames_out=0xffffd03c, pointer=4155049385, skipC=0, noInline=0) at /cache/build/default-amdci5-4/julialang/julia-master/src/debuginfo.cpp:1210 #4 0xf7a6ca98 in jl_print_native_codeloc (ip=4155049385) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:636 #5 0xf7a6cd54 in jl_print_bt_entry_codeloc (bt_entry=0xf0798018) at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:657 #6 jlbacktrace () at /cache/build/default-amdci5-4/julialang/julia-master/src/stackwalk.c:1090 #7 0xf7a3cd2b in ijl_no_exc_handler (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:605 #8 0xf7a3d10a in throw_internal (ct=ct@entry=0xf070c010, exception=<optimized out>, exception@entry=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:638 #9 0xf7a3d330 in ijl_throw (e=0xf0794010) at /cache/build/default-amdci5-4/julialang/julia-master/src/task.c:654 #10 0xf7a905aa in ijl_errorf (fmt=fmt@entry=0xf7647cd4 "Invalid CPU name \"%s\".") at /cache/build/default-amdci5-4/julialang/julia-master/src/rtutils.c:77 #11 0xf75a4b22 in (anonymous namespace)::createTargetMachine () at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:823 #12 JuliaOJIT::JuliaOJIT (this=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/jitlayers.cpp:1044 #13 0xf7531793 in jl_init_llvm () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8585 #14 0xf75318a8 in jl_init_codegen_impl () at /cache/build/default-amdci5-4/julialang/julia-master/src/codegen.cpp:8648 #15 0xf7a51a52 in jl_restore_system_image_from_stream (f=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2131 #16 0xf7a55c03 in ijl_restore_system_image_data (buf=0xe859c1c0 <jl_system_image_data> "8'\031\003", len=125161105) at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2184 #17 0xf7a55cf9 in jl_load_sysimg_so () at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:424 #18 ijl_restore_system_image (fname=0x80a0900 "/build/bk_download/julia-d78fdad601/lib/julia/sys.so") at /cache/build/default-amdci5-4/julialang/julia-master/src/staticdata.c:2157 #19 0xf7a3bdfc in _finish_julia_init (rel=rel@entry=JL_IMAGE_JULIA_HOME, ct=<optimized out>, ptls=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:741 JuliaLang#20 0xf7a3c8ac in julia_init (rel=<optimized out>) at /cache/build/default-amdci5-4/julialang/julia-master/src/init.c:728 JuliaLang#21 0xf7a7f61d in jl_repl_entrypoint (argc=<optimized out>, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/src/jlapi.c:705 JuliaLang#22 0x080490a7 in main (argc=3, argv=0xffffddf4) at /cache/build/default-amdci5-4/julialang/julia-master/cli/loader_exe.c:59 ``` To prevent this, we simply avoid calling `jl_errorf` this early in the process, punting the problem to a later PR that can update guard conditions within `jl_error*`.
LilithHafner
pushed a commit
that referenced
this pull request
Aug 12, 2023
This commit improves SROA pass by extending the `unswitchtupleunion` optimization to handle the general parametric types, e.g.: ```julia julia> struct A{T} x::T end; julia> function foo(a1, a2, c) t = c ? A(a1) : A(a2) return getfield(t, :x) end; julia> only(Base.code_ircode(foo, (Int,Float64,Bool); optimize_until="SROA")) ``` > Before ``` 2 1 ─ goto #3 if not _4 │ 2 ─ %2 = %new(A{Int64}, _2)::A{Int64} │╻ A └── goto #4 │ 3 ─ %4 = %new(A{Float64}, _3)::A{Float64} │╻ A 4 ┄ %5 = φ (#2 => %2, #3 => %4)::Union{A{Float64}, A{Int64}} │ 3 │ %6 = Main.getfield(%5, :x)::Union{Float64, Int64} │ └── return %6 │ => Union{Float64, Int64} ``` > After ``` julia> only(Base.code_ircode(foo, (Int,Float64,Bool); optimize_until="SROA")) 2 1 ─ goto #3 if not _4 │ 2 ─ nothing::A{Int64} │╻ A └── goto #4 │ 3 ─ nothing::A{Float64} │╻ A 4 ┄ %8 = φ (#2 => _2, #3 => _3)::Union{Float64, Int64} │ │ nothing::Union{A{Float64}, A{Int64}} 3 │ %6 = %8::Union{Float64, Int64} │ └── return %6 │ => Union{Float64, Int64} ```
LilithHafner
pushed a commit
that referenced
this pull request
Dec 7, 2023
…#51489) This exposes the GC "stop the world" API to the user, for causing a thread to quickly stop executing Julia code. This adds two APIs (that will need to be exported and documented later): ``` julia> @CCall jl_safepoint_suspend_thread(#=tid=#1::Cint, #=magicnumber=#2::Cint)::Cint # roughly tkill(1, SIGSTOP) julia> @CCall jl_safepoint_resume_thread(#=tid=#1::Cint)::Cint # roughly tkill(1, SIGCONT) ``` You can even suspend yourself, if there is another task to resume you 10 seconds later: ``` julia> ccall(:jl_enter_threaded_region, Cvoid, ()) julia> t = @task let; Libc.systemsleep(10); print("\nhello from $(Threads.threadid())\n"); @CCall jl_safepoint_resume_thread(0::Cint)::Cint; end; ccall(:jl_set_task_tid, Cint, (Any, Cint), t, 1); schedule(t); julia> @time @CCall jl_safepoint_suspend_thread(0::Cint, 2::Cint)::Cint hello from 2 10 seconds (6 allocations: 264 bytes) 1 ``` The meaning of the magic number is actually the kind of stop that you want: ``` // n.b. suspended threads may still run in the GC or GC safe regions // but shouldn't be observable, depending on which enum the user picks (only 1 and 2 are typically recommended here) // waitstate = 0 : do not wait for suspend to finish // waitstate = 1 : wait for gc_state != 0 (JL_GC_STATE_WAITING or JL_GC_STATE_SAFE) // waitstate = 2 : wait for gc_state != 0 (JL_GC_STATE_WAITING or JL_GC_STATE_SAFE) and that GC is not running on that thread // waitstate = 3 : wait for full suspend (gc_state == JL_GC_STATE_WAITING) -- this may never happen if thread is sleeping currently // if another thread comes along and calls jl_safepoint_resume, we also return early // return new suspend count on success, 0 on failure ``` Only magic number 2 is currently meaningful to the user though. The difference between waitstate 1 and 2 is only relevant in C code which is calling this from JL_GC_STATE_SAFE, since otherwise it is a priori known that GC isn't running, else we too would be running the GC. But the distinction of those states might be useful if we have a concurrent collector. Very important warning: if the stopped thread is holding any locks (e.g. for codegen or types) that you then attempt to acquire, your thread will deadlock. This is very likely, unless you are very careful. A future update to this API may try to change the waitstate to give the option to wait for the thread to release internal or known locks.
LilithHafner
pushed a commit
that referenced
this pull request
Dec 21, 2023
Adds a convenient way to enable PGO+LTO on Julia and LLVM together: 1. `cd contrib/pgo-lto` 2. `make -j$(nproc) stage1` 3. `make clean-profiles` 4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'` 5. `make -j$(nproc) stage2` This results quite often in spectacular speedups for time to first X as it reduces the time spent in LLVM optimization passes by 25 or even 30%. Example 1: ```julia using LoopVectorization function f!(a, b) @turbo for i in eachindex(a) a[i] *= b[i] end return a end f!(rand(1), rand(1)) ``` ```console $ time ./julia -O3 lv.jl ``` Without PGO+LTO: 14.801s With PGO+LTO: 11.978s (-19%) Example 2: ```console $ time ./julia -e 'using Pkg; Pkg.test("Unitful");' ``` Without PGO+LTO: 1m47.688s With PGO+LTO: 1m35.704s (-11%) Example 3 (taken from issue JuliaLang#45395, which is almost only LLVM): ```console $ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl ``` Without PGO+LTO: ``` ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 101.0130 seconds (98.6253 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 53.6961 ( 54.7%) 0.1050 ( 3.8%) 53.8012 ( 53.3%) 53.8045 ( 54.6%) Unroll loops 25.5423 ( 26.0%) 0.0072 ( 0.3%) 25.5495 ( 25.3%) 25.5444 ( 25.9%) Global Value Numbering 7.1995 ( 7.3%) 0.0526 ( 1.9%) 7.2521 ( 7.2%) 7.2517 ( 7.4%) Induction Variable Simplification 5.0541 ( 5.1%) 0.0098 ( 0.3%) 5.0639 ( 5.0%) 5.0561 ( 5.1%) Combine redundant instructions #2 ``` Wit PGO+LTO: ``` ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 72.6507 seconds (70.1337 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 36.0894 ( 51.7%) 0.0825 ( 2.9%) 36.1719 ( 49.8%) 36.1738 ( 51.6%) Unroll loops 16.5713 ( 23.7%) 0.0129 ( 0.5%) 16.5843 ( 22.8%) 16.5794 ( 23.6%) Global Value Numbering 5.9047 ( 8.5%) 0.0395 ( 1.4%) 5.9442 ( 8.2%) 5.9438 ( 8.5%) Induction Variable Simplification 4.7566 ( 6.8%) 0.0078 ( 0.3%) 4.7645 ( 6.6%) 4.7575 ( 6.8%) Combine redundant instructions #2 ``` Or -28% time spent in LLVM. --- Finally there's a significant reduction in binary sizes. For libLLVM.so: ``` 79M usr/lib/libLLVM-13jl.so (before) 67M usr/lib/libLLVM-13jl.so (after) ``` And it can be reduced by another 2MB with `--icf=safe` when using LLD as a linker anways. Turn into makefile Newline Use two out of source builds Ignore profiles + build dirs Add --icf=safe stage0 setup prebuilt clang with [cd]tors->init/fini patch
LilithHafner
pushed a commit
that referenced
this pull request
Apr 18, 2024
`@something` eagerly unwraps any `Some` given to it, while keeping the variable between its arguments the same. This can be an issue if a previously unpacked value is used as input to `@something`, leading to a type instability on more than two arguments (e.g. because of a fallback to `Some(nothing)`). By using different variables for each argument, type inference has an easier time handling these cases that are isolated to single branches anyway. This also adds some comments to the macro, since it's non-obvious what it does. Benchmarking the specific case I encountered this in led to a ~2x performance improvement on multiple machines. 1.10-beta3/master: ``` [sukera@tower 01]$ jl1100 -q --project=. -L 01.jl -e 'bench()' v"1.10.0-beta3" BenchmarkTools.Trial: 10000 samples with 1 evaluation. Range (min … max): 38.670 μs … 70.350 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 43.340 μs ┊ GC (median): 0.00% Time (mean ± σ): 43.395 μs ± 1.518 μs ┊ GC (mean ± σ): 0.00% ± 0.00% ▆█▂ ▁▁ ▂▂▂▂▂▂▂▂▂▁▂▂▂▃▃▃▂▂▃▃▃▂▂▂▂▂▄▇███▆██▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃ 38.7 μs Histogram: frequency by time 48 μs < Memory estimate: 0 bytes, allocs estimate: 0. ``` This PR: ``` [sukera@tower 01]$ julia -q --project=. -L 01.jl -e 'bench()' v"1.11.0-DEV.970" BenchmarkTools.Trial: 10000 samples with 1 evaluation. Range (min … max): 22.820 μs … 44.980 μs ┊ GC (min … max): 0.00% … 0.00% Time (median): 24.300 μs ┊ GC (median): 0.00% Time (mean ± σ): 24.370 μs ± 832.239 ns ┊ GC (mean ± σ): 0.00% ± 0.00% ▂▅▇██▇▆▅▁ ▂▂▂▂▂▂▂▂▃▃▄▅▇███████████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂ ▃ 22.8 μs Histogram: frequency by time 27.7 μs < Memory estimate: 0 bytes, allocs estimate: 0. ``` <details> <summary>Benchmarking code (spoilers for Advent Of Code 2023 Day 01, Part 01). Running this requires the input of that Advent Of Code day.</summary> ```julia using BenchmarkTools using InteractiveUtils isdigit(d::UInt8) = UInt8('0') <= d <= UInt8('9') someDigit(c::UInt8) = isdigit(c) ? Some(c - UInt8('0')) : nothing function part1(data) total = 0 may_a = nothing may_b = nothing for c in data digitRes = someDigit(c) may_a = @something may_a digitRes Some(nothing) may_b = @something digitRes may_b Some(nothing) if c == UInt8('\n') digit_a = may_a::UInt8 digit_b = may_b::UInt8 total += digit_a*0xa + digit_b may_a = nothing may_b = nothing end end return total end function bench() data = read("input.txt") display(VERSION) println() display(@benchmark part1($data)) nothing end ``` </details> <details> <summary>`@code_warntype` before</summary> ```julia julia> @code_warntype part1(data) MethodInstance for part1(::Vector{UInt8}) from part1(data) @ Main ~/Documents/projects/AOC/2023/01/01.jl:7 Arguments #self#::Core.Const(part1) data::Vector{UInt8} Locals @_3::Union{Nothing, Tuple{UInt8, Int64}} may_b::Union{Nothing, UInt8} may_a::Union{Nothing, UInt8} total::Int64 c::UInt8 digit_b::UInt8 digit_a::UInt8 val@_10::Any val@_11::Any digitRes::Union{Nothing, Some{UInt8}} @_13::Union{Some{Nothing}, Some{UInt8}, UInt8} @_14::Union{Some{Nothing}, Some{UInt8}} @_15::Some{Nothing} @_16::Union{Some{Nothing}, Some{UInt8}, UInt8} @_17::Union{Some{Nothing}, UInt8} @_18::Some{Nothing} Body::Int64 1 ── (total = 0) │ (may_a = Main.nothing) │ (may_b = Main.nothing) │ %4 = data::Vector{UInt8} │ (@_3 = Base.iterate(%4)) │ %6 = (@_3 === nothing)::Bool │ %7 = Base.not_int(%6)::Bool └─── goto JuliaLang#24 if not %7 2 ┄─ Core.NewvarNode(:(digit_b)) │ Core.NewvarNode(:(digit_a)) │ Core.NewvarNode(:(val@_10)) │ %12 = @_3::Tuple{UInt8, Int64} │ (c = Core.getfield(%12, 1)) │ %14 = Core.getfield(%12, 2)::Int64 │ (digitRes = Main.someDigit(c)) │ (val@_11 = may_a) │ %17 = (val@_11::Union{Nothing, UInt8} !== Base.nothing)::Bool └─── goto #4 if not %17 3 ── (@_13 = val@_11::UInt8) └─── goto #11 4 ── (val@_11 = digitRes) │ %22 = (val@_11::Union{Nothing, Some{UInt8}} !== Base.nothing)::Bool └─── goto #6 if not %22 5 ── (@_14 = val@_11::Some{UInt8}) └─── goto #10 6 ── (val@_11 = Main.Some(Main.nothing)) │ %27 = (val@_11::Core.Const(Some(nothing)) !== Base.nothing)::Core.Const(true) └─── goto #8 if not %27 7 ── (@_15 = val@_11::Core.Const(Some(nothing))) └─── goto #9 8 ── Core.Const(:(@_15 = Base.nothing)) 9 ┄─ (@_14 = @_15) 10 ┄ (@_13 = @_14) 11 ┄ %34 = @_13::Union{Some{Nothing}, Some{UInt8}, UInt8} │ (may_a = Base.something(%34)) │ (val@_10 = digitRes) │ %37 = (val@_10::Union{Nothing, Some{UInt8}} !== Base.nothing)::Bool └─── goto #13 if not %37 12 ─ (@_16 = val@_10::Some{UInt8}) └─── goto JuliaLang#20 13 ─ (val@_10 = may_b) │ %42 = (val@_10::Union{Nothing, UInt8} !== Base.nothing)::Bool └─── goto #15 if not %42 14 ─ (@_17 = val@_10::UInt8) └─── goto #19 15 ─ (val@_10 = Main.Some(Main.nothing)) │ %47 = (val@_10::Core.Const(Some(nothing)) !== Base.nothing)::Core.Const(true) └─── goto #17 if not %47 16 ─ (@_18 = val@_10::Core.Const(Some(nothing))) └─── goto #18 17 ─ Core.Const(:(@_18 = Base.nothing)) 18 ┄ (@_17 = @_18) 19 ┄ (@_16 = @_17) 20 ┄ %54 = @_16::Union{Some{Nothing}, Some{UInt8}, UInt8} │ (may_b = Base.something(%54)) │ %56 = c::UInt8 │ %57 = Main.UInt8('\n')::Core.Const(0x0a) │ %58 = (%56 == %57)::Bool └─── goto JuliaLang#22 if not %58 21 ─ (digit_a = Core.typeassert(may_a, Main.UInt8)) │ (digit_b = Core.typeassert(may_b, Main.UInt8)) │ %62 = total::Int64 │ %63 = (digit_a * 0x0a)::UInt8 │ %64 = (%63 + digit_b)::UInt8 │ (total = %62 + %64) │ (may_a = Main.nothing) └─── (may_b = Main.nothing) 22 ┄ (@_3 = Base.iterate(%4, %14)) │ %69 = (@_3 === nothing)::Bool │ %70 = Base.not_int(%69)::Bool └─── goto JuliaLang#24 if not %70 23 ─ goto #2 24 ┄ return total ``` </details> <details> <summary>`@code_native debuginfo=:none` Before </summary> ```julia julia> @code_native debuginfo=:none part1(data) .text .file "part1" .globl julia_part1_418 # -- Begin function julia_part1_418 .p2align 4, 0x90 .type julia_part1_418,@function julia_part1_418: # @julia_part1_418 # %bb.0: # %top push rbp mov rbp, rsp push r15 push r14 push r13 push r12 push rbx sub rsp, 40 mov rax, qword ptr [rdi + 8] test rax, rax je .LBB0_1 # %bb.2: # %L17 mov rcx, qword ptr [rdi] dec rax mov r10b, 1 xor r14d, r14d # implicit-def: $r12b # implicit-def: $r13b # implicit-def: $r9b # implicit-def: $sil mov qword ptr [rbp - 64], rax # 8-byte Spill mov al, 1 mov dword ptr [rbp - 48], eax # 4-byte Spill # implicit-def: $al # kill: killed $al xor eax, eax mov qword ptr [rbp - 56], rax # 8-byte Spill mov qword ptr [rbp - 72], rcx # 8-byte Spill # implicit-def: $cl jmp .LBB0_3 .p2align 4, 0x90 .LBB0_8: # in Loop: Header=BB0_3 Depth=1 mov dword ptr [rbp - 48], 0 # 4-byte Folded Spill .LBB0_24: # %post_union_move # in Loop: Header=BB0_3 Depth=1 movzx r13d, byte ptr [rbp - 41] # 1-byte Folded Reload mov r12d, r8d cmp qword ptr [rbp - 64], r14 # 8-byte Folded Reload je .LBB0_13 .LBB0_25: # %guard_exit113 # in Loop: Header=BB0_3 Depth=1 inc r14 mov r10d, ebx .LBB0_3: # %L19 # =>This Inner Loop Header: Depth=1 mov rax, qword ptr [rbp - 72] # 8-byte Reload xor ebx, ebx xor edi, edi movzx r15d, r9b movzx ecx, cl movzx esi, sil mov r11b, 1 # implicit-def: $r9b movzx edx, byte ptr [rax + r14] lea eax, [rdx - 58] lea r8d, [rdx - 48] cmp al, -10 setae bl setb dil test r10b, 1 cmovne r15d, edi mov edi, 0 cmovne ecx, ebx mov bl, 1 cmovne esi, edi test r15b, 1 jne .LBB0_7 # %bb.4: # %L76 # in Loop: Header=BB0_3 Depth=1 mov r11b, 2 test cl, 1 jne .LBB0_5 # %bb.6: # %L78 # in Loop: Header=BB0_3 Depth=1 mov ebx, r10d mov r9d, r15d mov byte ptr [rbp - 41], r13b # 1-byte Spill test sil, 1 je .LBB0_26 .LBB0_7: # %L82 # in Loop: Header=BB0_3 Depth=1 cmp al, -11 jbe .LBB0_9 jmp .LBB0_8 .p2align 4, 0x90 .LBB0_5: # in Loop: Header=BB0_3 Depth=1 mov ecx, r8d mov sil, 1 xor ebx, ebx mov byte ptr [rbp - 41], r8b # 1-byte Spill xor r9d, r9d xor ecx, ecx cmp al, -11 ja .LBB0_8 .LBB0_9: # %L90 # in Loop: Header=BB0_3 Depth=1 test byte ptr [rbp - 48], 1 # 1-byte Folded Reload jne .LBB0_23 # %bb.10: # %L115 # in Loop: Header=BB0_3 Depth=1 cmp dl, 10 jne .LBB0_11 # %bb.14: # %L122 # in Loop: Header=BB0_3 Depth=1 test r15b, 1 jne .LBB0_15 # %bb.12: # %L130.thread # in Loop: Header=BB0_3 Depth=1 movzx eax, byte ptr [rbp - 41] # 1-byte Folded Reload mov bl, 1 add eax, eax lea eax, [rax + 4*rax] add al, r12b movzx eax, al add qword ptr [rbp - 56], rax # 8-byte Folded Spill mov al, 1 mov dword ptr [rbp - 48], eax # 4-byte Spill cmp qword ptr [rbp - 64], r14 # 8-byte Folded Reload jne .LBB0_25 jmp .LBB0_13 .p2align 4, 0x90 .LBB0_23: # %L115.thread # in Loop: Header=BB0_3 Depth=1 mov al, 1 # implicit-def: $r8b mov dword ptr [rbp - 48], eax # 4-byte Spill cmp dl, 10 jne .LBB0_24 jmp .LBB0_21 .LBB0_11: # in Loop: Header=BB0_3 Depth=1 mov r8d, r12d jmp .LBB0_24 .LBB0_1: xor eax, eax mov qword ptr [rbp - 56], rax # 8-byte Spill .LBB0_13: # %L159 mov rax, qword ptr [rbp - 56] # 8-byte Reload add rsp, 40 pop rbx pop r12 pop r13 pop r14 pop r15 pop rbp ret .LBB0_21: # %L122.thread test r15b, 1 jne .LBB0_15 # %bb.22: # %post_box_union58 movabs rdi, offset .L_j_str1 movabs rax, offset ijl_type_error movabs rsi, 140008511215408 movabs rdx, 140008667209736 call rax .LBB0_15: # %fail cmp r11b, 1 je .LBB0_19 # %bb.16: # %fail movzx eax, r11b cmp eax, 2 jne .LBB0_17 # %bb.20: # %box_union54 movzx eax, byte ptr [rbp - 41] # 1-byte Folded Reload movabs rcx, offset jl_boxed_uint8_cache mov rdx, qword ptr [rcx + 8*rax] jmp .LBB0_18 .LBB0_26: # %L80 movabs rax, offset ijl_throw movabs rdi, 140008495049392 call rax .LBB0_19: # %box_union movabs rdx, 140008667209736 jmp .LBB0_18 .LBB0_17: xor edx, edx .LBB0_18: # %post_box_union movabs rdi, offset .L_j_str1 movabs rax, offset ijl_type_error movabs rsi, 140008511215408 call rax .Lfunc_end0: .size julia_part1_418, .Lfunc_end0-julia_part1_418 # -- End function .type .L_j_str1,@object # @_j_str1 .section .rodata.str1.1,"aMS",@progbits,1 .L_j_str1: .asciz "typeassert" .size .L_j_str1, 11 .section ".note.GNU-stack","",@progbits ``` </details> <details> <summary>`@code_warntype` After</summary> ```julia [sukera@tower 01]$ julia -q --project=. -L 01.jl julia> data = read("input.txt"); julia> @code_warntype part1(data) MethodInstance for part1(::Vector{UInt8}) from part1(data) @ Main ~/Documents/projects/AOC/2023/01/01.jl:7 Arguments #self#::Core.Const(part1) data::Vector{UInt8} Locals @_3::Union{Nothing, Tuple{UInt8, Int64}} may_b::Union{Nothing, UInt8} may_a::Union{Nothing, UInt8} total::Int64 val@_7::Union{} val@_8::Union{} c::UInt8 digit_b::UInt8 digit_a::UInt8 #JuliaLang#215::Some{Nothing} #JuliaLang#216::Union{Nothing, UInt8} #JuliaLang#217::Union{Nothing, Some{UInt8}} #JuliaLang#212::Some{Nothing} #JuliaLang#213::Union{Nothing, Some{UInt8}} #JuliaLang#214::Union{Nothing, UInt8} digitRes::Union{Nothing, Some{UInt8}} @_19::Union{Nothing, UInt8} @_20::Union{Nothing, UInt8} @_21::Nothing @_22::Union{Nothing, UInt8} @_23::Union{Nothing, UInt8} @_24::Nothing Body::Int64 1 ── (total = 0) │ (may_a = Main.nothing) │ (may_b = Main.nothing) │ %4 = data::Vector{UInt8} │ (@_3 = Base.iterate(%4)) │ %6 = @_3::Union{Nothing, Tuple{UInt8, Int64}} │ %7 = (%6 === nothing)::Bool │ %8 = Base.not_int(%7)::Bool └─── goto JuliaLang#24 if not %8 2 ┄─ Core.NewvarNode(:(val@_7)) │ Core.NewvarNode(:(val@_8)) │ Core.NewvarNode(:(digit_b)) │ Core.NewvarNode(:(digit_a)) │ Core.NewvarNode(:(#JuliaLang#215)) │ Core.NewvarNode(:(#JuliaLang#216)) │ Core.NewvarNode(:(#JuliaLang#217)) │ Core.NewvarNode(:(#JuliaLang#212)) │ Core.NewvarNode(:(#JuliaLang#213)) │ %19 = @_3::Tuple{UInt8, Int64} │ (c = Core.getfield(%19, 1)) │ %21 = Core.getfield(%19, 2)::Int64 │ %22 = c::UInt8 │ (digitRes = Main.someDigit(%22)) │ %24 = may_a::Union{Nothing, UInt8} │ (#JuliaLang#214 = %24) │ %26 = Base.:!::Core.Const(!) │ %27 = #JuliaLang#214::Union{Nothing, UInt8} │ %28 = Base.isnothing(%27)::Bool │ %29 = (%26)(%28)::Bool └─── goto #4 if not %29 3 ── %31 = #JuliaLang#214::UInt8 │ (@_19 = Base.something(%31)) └─── goto #11 4 ── %34 = digitRes::Union{Nothing, Some{UInt8}} │ (#JuliaLang#213 = %34) │ %36 = Base.:!::Core.Const(!) │ %37 = #JuliaLang#213::Union{Nothing, Some{UInt8}} │ %38 = Base.isnothing(%37)::Bool │ %39 = (%36)(%38)::Bool └─── goto #6 if not %39 5 ── %41 = #JuliaLang#213::Some{UInt8} │ (@_20 = Base.something(%41)) └─── goto #10 6 ── %44 = Main.Some::Core.Const(Some) │ %45 = Main.nothing::Core.Const(nothing) │ (#JuliaLang#212 = (%44)(%45)) │ %47 = Base.:!::Core.Const(!) │ %48 = #JuliaLang#212::Core.Const(Some(nothing)) │ %49 = Base.isnothing(%48)::Core.Const(false) │ %50 = (%47)(%49)::Core.Const(true) └─── goto #8 if not %50 7 ── %52 = #JuliaLang#212::Core.Const(Some(nothing)) │ (@_21 = Base.something(%52)) └─── goto #9 8 ── Core.Const(nothing) │ Core.Const(:(val@_8 = Base.something(Base.nothing))) │ Core.Const(nothing) │ Core.Const(:(val@_8)) └─── Core.Const(:(@_21 = %58)) 9 ┄─ %60 = @_21::Core.Const(nothing) └─── (@_20 = %60) 10 ┄ %62 = @_20::Union{Nothing, UInt8} └─── (@_19 = %62) 11 ┄ %64 = @_19::Union{Nothing, UInt8} │ (may_a = %64) │ %66 = digitRes::Union{Nothing, Some{UInt8}} │ (#JuliaLang#217 = %66) │ %68 = Base.:!::Core.Const(!) │ %69 = #JuliaLang#217::Union{Nothing, Some{UInt8}} │ %70 = Base.isnothing(%69)::Bool │ %71 = (%68)(%70)::Bool └─── goto #13 if not %71 12 ─ %73 = #JuliaLang#217::Some{UInt8} │ (@_22 = Base.something(%73)) └─── goto JuliaLang#20 13 ─ %76 = may_b::Union{Nothing, UInt8} │ (#JuliaLang#216 = %76) │ %78 = Base.:!::Core.Const(!) │ %79 = #JuliaLang#216::Union{Nothing, UInt8} │ %80 = Base.isnothing(%79)::Bool │ %81 = (%78)(%80)::Bool └─── goto #15 if not %81 14 ─ %83 = #JuliaLang#216::UInt8 │ (@_23 = Base.something(%83)) └─── goto #19 15 ─ %86 = Main.Some::Core.Const(Some) │ %87 = Main.nothing::Core.Const(nothing) │ (#JuliaLang#215 = (%86)(%87)) │ %89 = Base.:!::Core.Const(!) │ %90 = #JuliaLang#215::Core.Const(Some(nothing)) │ %91 = Base.isnothing(%90)::Core.Const(false) │ %92 = (%89)(%91)::Core.Const(true) └─── goto #17 if not %92 16 ─ %94 = #JuliaLang#215::Core.Const(Some(nothing)) │ (@_24 = Base.something(%94)) └─── goto #18 17 ─ Core.Const(nothing) │ Core.Const(:(val@_7 = Base.something(Base.nothing))) │ Core.Const(nothing) │ Core.Const(:(val@_7)) └─── Core.Const(:(@_24 = %100)) 18 ┄ %102 = @_24::Core.Const(nothing) └─── (@_23 = %102) 19 ┄ %104 = @_23::Union{Nothing, UInt8} └─── (@_22 = %104) 20 ┄ %106 = @_22::Union{Nothing, UInt8} │ (may_b = %106) │ %108 = Main.:(==)::Core.Const(==) │ %109 = c::UInt8 │ %110 = Main.UInt8('\n')::Core.Const(0x0a) │ %111 = (%108)(%109, %110)::Bool └─── goto JuliaLang#22 if not %111 21 ─ %113 = may_a::Union{Nothing, UInt8} │ (digit_a = Core.typeassert(%113, Main.UInt8)) │ %115 = may_b::Union{Nothing, UInt8} │ (digit_b = Core.typeassert(%115, Main.UInt8)) │ %117 = Main.:+::Core.Const(+) │ %118 = total::Int64 │ %119 = Main.:+::Core.Const(+) │ %120 = Main.:*::Core.Const(*) │ %121 = digit_a::UInt8 │ %122 = (%120)(%121, 0x0a)::UInt8 │ %123 = digit_b::UInt8 │ %124 = (%119)(%122, %123)::UInt8 │ (total = (%117)(%118, %124)) │ (may_a = Main.nothing) └─── (may_b = Main.nothing) 22 ┄ (@_3 = Base.iterate(%4, %21)) │ %129 = @_3::Union{Nothing, Tuple{UInt8, Int64}} │ %130 = (%129 === nothing)::Bool │ %131 = Base.not_int(%130)::Bool └─── goto JuliaLang#24 if not %131 23 ─ goto #2 24 ┄ %134 = total::Int64 └─── return %134 ``` </details> <details> <summary>`@code_native debuginfo=:none` After </summary> ```julia julia> @code_native debuginfo=:none part1(data) .text .file "part1" .globl julia_part1_1203 # -- Begin function julia_part1_1203 .p2align 4, 0x90 .type julia_part1_1203,@function julia_part1_1203: # @julia_part1_1203 ; Function Signature: part1(Array{UInt8, 1}) # %bb.0: # %top #DEBUG_VALUE: part1:data <- [DW_OP_deref] $rdi push rbp mov rbp, rsp push r15 push r14 push r13 push r12 push rbx sub rsp, 40 vxorps xmm0, xmm0, xmm0 #APP mov rax, qword ptr fs:[0] #NO_APP lea rdx, [rbp - 64] vmovaps xmmword ptr [rbp - 64], xmm0 mov qword ptr [rbp - 48], 0 mov rcx, qword ptr [rax - 8] mov qword ptr [rbp - 64], 4 mov rax, qword ptr [rcx] mov qword ptr [rbp - 72], rcx # 8-byte Spill mov qword ptr [rbp - 56], rax mov qword ptr [rcx], rdx #DEBUG_VALUE: part1:data <- [DW_OP_deref] 0 mov r15, qword ptr [rdi + 16] test r15, r15 je .LBB0_1 # %bb.2: # %L34 mov r14, qword ptr [rdi] dec r15 mov r11b, 1 mov r13b, 1 # implicit-def: $r12b # implicit-def: $r10b xor eax, eax jmp .LBB0_3 .p2align 4, 0x90 .LBB0_4: # in Loop: Header=BB0_3 Depth=1 xor r11d, r11d mov ebx, edi mov r10d, r8d .LBB0_9: # %L114 # in Loop: Header=BB0_3 Depth=1 mov r12d, esi test r15, r15 je .LBB0_12 .LBB0_10: # %guard_exit126 # in Loop: Header=BB0_3 Depth=1 inc r14 dec r15 mov r13d, ebx .LBB0_3: # %L36 # =>This Inner Loop Header: Depth=1 movzx edx, byte ptr [r14] test r13b, 1 movzx edi, r13b mov ebx, 1 mov ecx, 0 cmove ebx, edi cmovne edi, ecx movzx ecx, r10b lea esi, [rdx - 48] lea r9d, [rdx - 58] movzx r8d, sil cmove r8d, ecx cmp r9b, -11 ja .LBB0_4 # %bb.5: # %L89 # in Loop: Header=BB0_3 Depth=1 test r11b, 1 jne .LBB0_8 # %bb.6: # %L102 # in Loop: Header=BB0_3 Depth=1 cmp dl, 10 jne .LBB0_7 # %bb.13: # %L106 # in Loop: Header=BB0_3 Depth=1 test r13b, 1 jne .LBB0_14 # %bb.11: # %L114.thread # in Loop: Header=BB0_3 Depth=1 add ecx, ecx mov bl, 1 mov r11b, 1 lea ecx, [rcx + 4*rcx] add cl, r12b movzx ecx, cl add rax, rcx test r15, r15 jne .LBB0_10 jmp .LBB0_12 .p2align 4, 0x90 .LBB0_8: # %L102.thread # in Loop: Header=BB0_3 Depth=1 mov r11b, 1 # implicit-def: $sil cmp dl, 10 jne .LBB0_9 jmp .LBB0_15 .LBB0_7: # in Loop: Header=BB0_3 Depth=1 mov esi, r12d jmp .LBB0_9 .LBB0_1: xor eax, eax .LBB0_12: # %L154 mov rcx, qword ptr [rbp - 56] mov rdx, qword ptr [rbp - 72] # 8-byte Reload mov qword ptr [rdx], rcx add rsp, 40 pop rbx pop r12 pop r13 pop r14 pop r15 pop rbp ret .LBB0_15: # %L106.thread test r13b, 1 jne .LBB0_14 # %bb.16: # %post_box_union47 movabs rax, offset jl_nothing movabs rcx, offset jl_small_typeof movabs rdi, offset ".L_j_str_typeassert#1" mov rdx, qword ptr [rax] mov rsi, qword ptr [rcx + 336] movabs rax, offset ijl_type_error mov qword ptr [rbp - 48], rsi call rax .LBB0_14: # %post_box_union movabs rax, offset jl_nothing movabs rcx, offset jl_small_typeof movabs rdi, offset ".L_j_str_typeassert#1" mov rdx, qword ptr [rax] mov rsi, qword ptr [rcx + 336] movabs rax, offset ijl_type_error mov qword ptr [rbp - 48], rsi call rax .Lfunc_end0: .size julia_part1_1203, .Lfunc_end0-julia_part1_1203 # -- End function .type ".L_j_str_typeassert#1",@object # @"_j_str_typeassert#1" .section .rodata.str1.1,"aMS",@progbits,1 ".L_j_str_typeassert#1": .asciz "typeassert" .size ".L_j_str_typeassert#1", 11 .section ".note.GNU-stack","",@progbits ``` </details> Co-authored-by: Sukera <[email protected]>
LilithHafner
added a commit
that referenced
this pull request
Apr 18, 2024
Adds a convenient way to enable PGO+LTO on Julia and LLVM together: 1. `cd contrib/pgo-lto` 2. `make -j$(nproc) stage1` 3. `make clean-profiles` 4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'` 5. `make -j$(nproc) stage2` <details> <summary>* Output looks roughly like as follows</summary> ```c++ $ make -C contrib/pgo-lto top make: Entering directory '/dev/shm/julia/contrib/pgo-lto' llvm-profdata show --topn=50 /dev/shm/julia/contrib/pgo-lto/profiles/merged.prof | c++filt Instrumentation level: IR entry_first = 0 Total functions: 85943 Maximum function count: 7867557260 Maximum internal block count: 3468437590 Top 50 functions with the largest internal block counts: llvm::BitVector::operator|=(llvm::BitVector const&), max count = 7867557260 LateLowerGCFrame::ComputeLiveness(State&), max count = 3468437590 llvm::hashing::detail::hash_combine_recursive_helper::hash_combine_recursive_helper(), max count = 1742259834 llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 511396575 llvm::LiveRange::overlaps(llvm::LiveRange const&, llvm::CoalescerPair const&, llvm::SlotIndexes const&) const, max count = 508061762 llvm::StringMapImpl::LookupBucketFor(llvm::StringRef), max count = 505682177 std::map<llvm::BasicBlock*, BBState, std::less<llvm::BasicBlock*>, std::allocator<std::pair<llvm::BasicBlock* const, BBState> > >::operator[](llvm::BasicBlock* const&), max count = 395628888 llvm::LiveRange::advanceTo(llvm::LiveRange::Segment const*, llvm::SlotIndex) const, max count = 384642728 llvm::LiveRange::isLiveAtIndexes(llvm::ArrayRef<llvm::SlotIndex>) const, max count = 380291040 llvm::PassRegistry::enumerateWith(llvm::PassRegistrationListener*), max count = 352313953 ijl_method_instance_add_backedge, max count = 349608221 llvm::SUnit::ComputeHeight(), max count = 336604330 llvm::LiveRange::advanceTo(llvm::LiveRange::Segment*, llvm::SlotIndex), max count = 331030109 llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 272966545 llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&), max count = 257449540 LateLowerGCFrame::ComputeLiveSets(State&), max count = 252096274 /dev/shm/julia/src/jltypes.c:has_free_typevars, max count = 230879464 ijl_get_pgcstack, max count = 216953592 LateLowerGCFrame::RefineLiveSet(llvm::BitVector&, State&, std::vector<int, std::allocator<int> > const&), max count = 188013152 /dev/shm/julia/src/flisp/flisp.c:apply_cl, max count = 174863813 /dev/shm/julia/src/flisp/builtins.c:fl_memq, max count = 168621603 ``` </details> This results quite often in spectacular speedups for time to first X as it reduces the time spent in LLVM optimization passes by 25 or even 30%. Example 1: ```julia using LoopVectorization function f!(a, b) @turbo for i in eachindex(a) a[i] *= b[i] end return a end f!(rand(1), rand(1)) ``` ```console $ time ./julia -O3 lv.jl ``` Without PGO+LTO: 14.801s With PGO+LTO: 11.978s (-19%) Example 2: ```console $ time ./julia -e 'using Pkg; Pkg.test("Unitful");' ``` Without PGO+LTO: 1m47.688s With PGO+LTO: 1m35.704s (-11%) Example 3 (taken from issue JuliaLang#45395, which is almost only LLVM): ```console $ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl ``` Without PGO+LTO: ``` ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 101.0130 seconds (98.6253 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 53.6961 ( 54.7%) 0.1050 ( 3.8%) 53.8012 ( 53.3%) 53.8045 ( 54.6%) Unroll loops 25.5423 ( 26.0%) 0.0072 ( 0.3%) 25.5495 ( 25.3%) 25.5444 ( 25.9%) Global Value Numbering 7.1995 ( 7.3%) 0.0526 ( 1.9%) 7.2521 ( 7.2%) 7.2517 ( 7.4%) Induction Variable Simplification 6.0541 ( 5.1%) 0.0098 ( 0.3%) 5.0639 ( 5.0%) 5.0561 ( 5.1%) Combine redundant instructions #2 ``` With PGO+LTO: ``` ===-------------------------------------------------------------------------=== ... Pass execution timing report ... ===-------------------------------------------------------------------------=== Total Execution Time: 72.6507 seconds (70.1337 wall clock) ---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 36.0894 ( 51.7%) 0.0825 ( 2.9%) 36.1719 ( 49.8%) 36.1738 ( 51.6%) Unroll loops 16.5713 ( 23.7%) 0.0129 ( 0.5%) 16.5843 ( 22.8%) 16.5794 ( 23.6%) Global Value Numbering 5.9047 ( 8.5%) 0.0395 ( 1.4%) 5.9442 ( 8.2%) 5.9438 ( 8.5%) Induction Variable Simplification 4.7566 ( 6.8%) 0.0078 ( 0.3%) 4.7645 ( 6.6%) 4.7575 ( 6.8%) Combine redundant instructions #2 ``` Or -28% time spent in LLVM. `perf` reports show this is mostly fewer instructions and reduction in icache misses. --- Finally there's a significant reduction in binary sizes. For libLLVM.so: ``` 79M usr/lib/libLLVM-13jl.so (before) 67M usr/lib/libLLVM-13jl.so (after) ``` And it can be reduced by another 2MB with `--icf=safe` when using LLD as a linker anyways. - [x] Two out-of-source builds would be better than a single in-source build, so that it's easier to find good profile data --------- Co-authored-by: Oscar Smith <[email protected]> Co-authored-by: Lilith Orion Hafner <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.