RFC: support external compiler passes #35015
Conversation
If I recall correctly, when Jarrett originally started working on Cassette, untyped IR was chosen because operating on typed IR requires maintaining more invariants, and because CodeInfo/IRCode are allowed to change at any time; building things on them outside the compiler was intentionally discouraged. Getting at the type information seems to be the primary motivation here: if you want to operate on untyped IR, you can already use a generated function, a Cassette pass, or an IRTools dynamo today. #33955 is definitely related, but it follows a more Cassette-style approach where the impact of arbitrary code transformation is limited to code compiled for a context. Wanting to do this through a compiler hook is, it seems to me, actually a sign that this probably needs to be done within the compiler, but that places the onerous contract of being the compiler upon you.
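For concreteness, typed IR is already reachable from outside the compiler via standard reflection; a minimal sketch:

# Post-inference IR for a concrete signature: this is the kind of
# CodeInfo a typed pass would transform. Mutating it here changes
# nothing about compiled code, which is the gap under discussion.
(ci, rettype) = first(code_typed(*, (Float64, Float64)))
@show rettype     # Float64
ci.code           # the type-annotated statement array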
I take your point about it being "onerous," though with some extra contributions to TypedCodeUtils it might become less so. My proximal motivation was supporting generic element types in LoopVectorization (JuliaSIMD/LoopVectorization.jl#65 (comment)), together with the observation that some bits of code seem like they might simplify if you knew the types. (Of course, handling typed code is, as you say, more complicated.) It does look like #33955 is related; I will have to check it out. Maybe a good strategy is the following: I'm tired of #9080 (it seems to be down to a 10% penalty for big arrays, but for small arrays it's as high as 40%). What if I play with this and/or #33955 to see whether I can write a compiler pass that fixes it, then move that into the actual compiler, gaining experience along the way with what this would be like in practice?
I (or someone else) could file an issue or PR at LoopVectorization if we want to discuss or plan this further.

Originally, the macro expanded directly into the transformed loop code:

julia> @macroexpand @_avx for i ∈ eachindex(x), j ∈ eachindex(y)
s += x[i] * A[i,j] * y[j]
end
quote
$(Expr(:meta, :inline))
begin
var"##loopeachindexi#270" = LoopVectorization.maybestaticrange(eachindex(x))
var"##i_loop_lower_bound#271" = LoopVectorization.staticm1(first(var"##loopeachindexi#270"))
var"##i_loop_upper_bound#272" = last(var"##loopeachindexi#270")
var"##loopeachindexj#273" = LoopVectorization.maybestaticrange(eachindex(y))
var"##j_loop_lower_bound#274" = LoopVectorization.staticm1(first(var"##loopeachindexj#273"))
var"##j_loop_upper_bound#275" = last(var"##loopeachindexj#273")
var"##vptr##_x" = LoopVectorization.stridedpointer(x)
var"##vptr##_A" = LoopVectorization.stridedpointer(A)
var"##vptr##_y" = LoopVectorization.stridedpointer(y)
var"##T#269" = promote_type(eltype(x), eltype(A), eltype(y))
var"##W#268" = LoopVectorization.pick_vector_width_val(eltype(x), eltype(A), eltype(y))
var"##s_" = s
var"##mask##" = LoopVectorization.masktable(var"##W#268", LoopVectorization.valrem(var"##W#268", var"##i_loop_upper_bound#272" - var"##i_loop_lower_bound#271"))
var"##s_0" = LoopVectorization.vzero(var"##W#268", typeof(var"##s_"))
var"##s_1" = LoopVectorization.vzero(var"##W#268", typeof(var"##s_"))
var"##s_2" = LoopVectorization.vzero(var"##W#268", typeof(var"##s_"))
var"##s_3" = LoopVectorization.vzero(var"##W#268", typeof(var"##s_"))
end
begin
$(Expr(:gc_preserve, :(begin
var"##outer##j##outer##" = LoopVectorization.unwrap(var"##j_loop_lower_bound#274")
while var"##outer##j##outer##" < var"##j_loop_upper_bound#275" - 3
i = LoopVectorization._MM(var"##W#268", var"##i_loop_lower_bound#271")
j = var"##outer##j##outer##"
var"####tempload#279_0_" = LoopVectorization.vload(var"##vptr##_y", (j,))
j += 1
var"####tempload#279_1_" = LoopVectorization.vload(var"##vptr##_y", (j,))
j += 1
var"####tempload#279_2_" = LoopVectorization.vload(var"##vptr##_y", (j,))
j += 1
var"####tempload#279_3_" = LoopVectorization.vload(var"##vptr##_y", (j,))
begin
while i < var"##i_loop_upper_bound#272" - LoopVectorization.valmuladd(var"##W#268", 2, -1)
var"####tempload#276_0" = LoopVectorization.vload(var"##vptr##_x", (i,))
var"####tempload#276_1" = LoopVectorization.vload(var"##vptr##_x", (i + LoopVectorization.valmul(var"##W#268", 1),))
j = var"##outer##j##outer##"
var"####tempload#278_0_0" = LoopVectorization.vload(var"##vptr##_A", (i, j))
var"####tempload#278_0_1" = LoopVectorization.vload(var"##vptr##_A", (i + LoopVectorization.valmul(var"##W#268", 1), j))
var"####temporary#277_0_0" = LoopVectorization.vmul(var"####tempload#278_0_0", var"####tempload#279_0_")
var"####temporary#277_0_1" = LoopVectorization.vmul(var"####tempload#278_0_1", var"####tempload#279_0_")
var"##s_0" = LoopVectorization.vfmadd231(var"####tempload#276_0", var"####temporary#277_0_0", var"##s_0")
var"##s_1" = LoopVectorization.vfmadd231(var"####tempload#276_1", var"####temporary#277_0_1", var"##s_1")
j += 1
var"####tempload#278_1_0" = LoopVectorization.vload(var"##vptr##_A", (i, j))
var"####tempload#278_1_1" = LoopVectorization.vload(var"##vptr##_A", (i + LoopVectorization.valmul(var"##W#268", 1), j))
var"####temporary#277_1_0" = LoopVectorization.vmul(var"####tempload#278_1_0", var"####tempload#279_1_")
var"####temporary#277_1_1" = LoopVectorization.vmul(var"####tempload#278_1_1", var"####tempload#279_1_")
var"##s_2" = LoopVectorization.vfmadd231(var"####tempload#276_0", var"####temporary#277_1_0", var"##s_2")
var"##s_3" = LoopVectorization.vfmadd231(var"####tempload#276_1", var"####temporary#277_1_1", var"##s_3")
j += 1
var"####tempload#278_2_0" = LoopVectorization.vload(var"##vptr##_A", (i, j))
var"####tempload#278_2_1" = LoopVectorization.vload(var"##vptr##_A", (i + LoopVectorization.valmul(var"##W#268", 1), j))
var"####temporary#277_2_0" = LoopVectorization.vmul(var"####tempload#278_2_0", var"####tempload#279_2_")
var"####temporary#277_2_1" = LoopVectorization.vmul(var"####tempload#278_2_1", var"####tempload#279_2_")
var"##s_0" = LoopVectorization.vfmadd231(var"####tempload#276_0", var"####temporary#277_2_0", var"##s_0")
var"##s_1" = LoopVectorization.vfmadd231(var"####tempload#276_1", var"####temporary#277_2_1", var"##s_1")
j += 1
var"####tempload#278_3_0" = LoopVectorization.vload(var"##vptr##_A", (i, j))
var"####tempload#278_3_1" = LoopVectorization.vload(var"##vptr##_A", (i + LoopVectorization.valmul(var"##W#268", 1), j))
var"####temporary#277_3_0" = LoopVectorization.vmul(var"####tempload#278_3_0", var"####tempload#279_3_")
var"####temporary#277_3_1" = LoopVectorization.vmul(var"####tempload#278_3_1", var"####tempload#279_3_")
var"##s_2" = LoopVectorization.vfmadd231(var"####tempload#276_0", var"####temporary#277_3_0", var"##s_2")
var"##s_3" = LoopVectorization.vfmadd231(var"####tempload#276_1", var"####temporary#277_3_1", var"##s_3")
i += LoopVectorization.valmul(var"##W#268", 2)
end
...

Later, a new macro was added and given the old name, as it is recommended over the old one. This one instead creates a single call to the generated function LoopVectorization._avx_!:

julia> @macroexpand @avx for i ∈ eachindex(x), j ∈ eachindex(y)
s += x[i] * A[i,j] * y[j]
end
quote
var"##loopeachindexi#282" = LoopVectorization.maybestaticrange(eachindex(x))
var"##i_loop_lower_bound#283" = LoopVectorization.staticm1(first(var"##loopeachindexi#282"))
var"##i_loop_upper_bound#284" = last(var"##loopeachindexi#282")
var"##loopeachindexj#285" = LoopVectorization.maybestaticrange(eachindex(y))
var"##j_loop_lower_bound#286" = LoopVectorization.staticm1(first(var"##loopeachindexj#285"))
var"##j_loop_upper_bound#287" = last(var"##loopeachindexj#285")
var"##vptr##_x" = LoopVectorization.stridedpointer(x)
var"##vptr##_A" = LoopVectorization.stridedpointer(A)
var"##vptr##_y" = LoopVectorization.stridedpointer(y)
local var"##s_0"
begin
$(Expr(:gc_preserve, :(var"##s_0" = LoopVectorization._avx_!(Val{(0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03), :LoopVectorization, :vmul, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000203, LoopVectorization.compute, 0x00, 0x04), :LoopVectorization, Symbol("##254"), LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000012, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000012, 0x0000000000000000, 0x0000000000010405, LoopVectorization.compute, 0x00, 0x05)}, Tuple{LoopVectorization.ArrayRefStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000030), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xffffffffffffb068), LoopVectorization.ArrayRefStruct(0x0000000000000001, 0x0000000000000002, 0xffffffffffffffb0)}, Tuple{0, Tuple{6}, Tuple{5}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, (var"##i_loop_lower_bound#283":var"##i_loop_upper_bound#284", var"##j_loop_lower_bound#286":var"##j_loop_upper_bound#287"), var"##vptr##_x", var"##vptr##_A", var"##vptr##_y", s)), :x, :A, :y))
end
s = LoopVectorization.reduced_add(var"##s_0", s)
end

The type parameters provide all the information needed to reconstruct the loop structure inside the generated function. Perhaps we could deprecate @_avx. While I think there's a lot of room for further taking advantage of that, it would be great to also be able to do something like this:

julia> x = rand(ComplexF64); y = rand(ComplexF64);
julia> @code_typed x * y
CodeInfo(
1 ─ %1 = Base.getfield(z, :re)::Float64
│ %2 = Base.getfield(w, :re)::Float64
│ %3 = Base.mul_float(%1, %2)::Float64
│ %4 = Base.getfield(z, :im)::Float64
│ %5 = Base.getfield(w, :im)::Float64
│ %6 = Base.mul_float(%4, %5)::Float64
│ %7 = Base.sub_float(%3, %6)::Float64
│ %8 = Base.getfield(z, :re)::Float64
│ %9 = Base.getfield(w, :im)::Float64
│ %10 = Base.mul_float(%8, %9)::Float64
│ %11 = Base.getfield(z, :im)::Float64
│ %12 = Base.getfield(w, :re)::Float64
│ %13 = Base.mul_float(%11, %12)::Float64
│ %14 = Base.add_float(%10, %13)::Float64
│ %15 = %new(Complex{Float64}, %7, %14)::Complex{Float64}
└── return %15
) => Complex{Float64}

which would allow supporting a much broader range of number types.
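By hand, the transformation such a pass could automate for this case might look like the following sketch (plain Julia, not LoopVectorization output; cdot and the field-splitting strategy are illustrative assumptions):

# By-hand version of what a typed-IR pass could do automatically: the
# Complex multiply is decomposed into Float64 field operations,
# mirroring the getfield/mul_float/sub_float statements above.
function cdot(x::Vector{ComplexF64}, y::Vector{ComplexF64})
    xr = reinterpret(Float64, x); yr = reinterpret(Float64, y)
    sre = 0.0; sim = 0.0
    @inbounds @simd for k in 1:length(x)
        a, b = xr[2k-1], xr[2k]    # re, im of x[k]
        c, d = yr[2k-1], yr[2k]    # re, im of y[k]
        sre += a*c - b*d           # real part of x[k]*y[k]
        sim += a*d + b*c           # imaginary part
    end
    return complex(sre, sim)
end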
This issue is way beyond my understanding, but regarding #9080, I think it should be pretty easy to make iteration over
I just realized my comment is rather off-topic here. I re-posted a similar comment in #9080 (comment) with an MWE.
@chriselrod, thanks for the detailed explanation! Very informative. I see the system you've built is more flexible than I credited. I think one of my motivations here is trying to provide a smooth path for this kind of transformation to happen automatically, without requiring an explicit annotation.

Relatedly, it seems to me that one possible advantage of writing this as a compiler pass rather than a generated function is reduced compile time. I may not be thinking about this properly, but the idea is that if you write it as a pass, then you only have to compile the pass itself once, and it can transform an unlimited number of functions. Conversely, if you do this via a generated function, then each instance of the generated function has to be compiled separately. If you try an experiment you'll see there's some (but not fully convincing) evidence for this viewpoint: if you redefine the functions in the tests here and then time the first execution (i.e., including the compile time), on my machine the ones using the macro tend to take longer.

Bottom line: now that I understand more about your approach I'm more sanguine about continuing to help advance it, though I still suspect that moving more of this into the compiler is the better long-term solution.
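A rough way to run that timing experiment (a sketch; f_plain is a stand-in for one of the test functions, not code from this PR):

# Time the first call (compilation included) vs. the second (already
# compiled); the difference approximates the per-instance compile cost.
f_plain(x) = sum(abs2, x)        # stand-in for a freshly (re)defined test function
x = rand(1000)
t_first  = @elapsed f_plain(x)   # includes compile time
t_second = @elapsed f_plain(x)   # steady-state
println("compile overhead ≈ ", t_first - t_second, " s")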
So, having used this for #35074, my feeling is: we want this or something like it (maybe #33955). BUT, if folks are concerned about it opening a back door to a bunch of poorly-written, crashy passes, I'd be just as happy deleting the tests and leaving this code in place, but commented out (and with a link to a gist or demo that shows how to use it). That way people who try new compiler passes can just build julia from source and uncomment the code to make it easier (much, much easier) to develop their pass. Then they can contribute their pass to Core.Compiler.
# External passes
const _opt = Ref{Any}(nothing)
avx_pass(opt) = (_opt[] = opt; opt)
macro avx(ex)
Because this doesn't actually have anything to do with avx, it would maybe be clearer to call this demo_pass and @with_demo_pass?
If we go with the "comment out" option this will be deleted anyway. But yes, for the demo we need better names.
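For illustration, a hypothetical end-to-end version using those names. This is only a sketch based on the PR description's :meta bracketing; the exact marker layout and the OptimizationState field access are assumptions, not this PR's actual code.

# Hypothetical demo pass: record how many statements the method had.
const stmt_count = Ref(0)

function demo_pass(opt)
    # Receives the Core.Compiler.OptimizationState; opt.src is assumed
    # to hold the method's CodeInfo (field name is an assumption).
    stmt_count[] = length(opt.src.code)
    return opt
end

macro with_demo_pass(ex)
    # Bracket the expression with :meta markers for the compiler to find;
    # :external_pass matches the marker the PR description mentions, while
    # :external_pass_end is a made-up closing marker.
    esc(quote
        $(Expr(:meta, :external_pass, demo_pass))
        $ex
        $(Expr(:meta, :external_pass_end))
    end)
end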
Yes! Pass ordering is challenging and I'd rather not make it more complex ;)
So the function of the pass in this PR gets an optimization state (which basically includes everything the compiler knows about the method at that point). One of the things I would like in general when writing Cassette passes is access to derived structures like the control-flow graph and dom-tree. However, making them available to external passes opens up the possibility that a pass will make them incorrect by updating the optimization state without also updating the control-flow graph / dom-tree.
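Those structures can be computed from lowered code with the compiler's internal helpers; a sketch against roughly Julia 1.4-era internals (unexported API, subject to change):

# Build the control-flow graph the comment above refers to from a
# method's lowered statement array.
ci = first(code_lowered(abs, (Int,)))
cfg = Core.Compiler.compute_basic_blocks(ci.code)
@show length(cfg.blocks)    # basic blocks, each with preds/succs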
We could put more such
There are some compiler features that I'm interested in playing with. I wonder if they're implementable with a pass mechanism like this (or something similar).

Auto type stabilization

Consider I have a function like this:

@autostabilize function f()
var1 = expr1
var2 = expr2 # cannot be inferred
var3 = expr3
var4 = expr4
end

Can I implement the compiler hint so that it automatically introduces a function barrier at the uninferrable statement, turning it into

function f()
var1 = expr1
var2 = expr2
g(var1, var2)
end
function g(var1, var2)
var3 = expr3
var4 = expr4
end

?

Tail-calls to finite-state machines (TC-to-FSM)

As I discussed in "Tail-call optimization and function-barrier-based accumulation in loops" (Internals & Design, JuliaLang Discourse), I think optimizing tail calls has a grounded motivation as a natural extension of the commonly used mutate-or-widen technique. Is it possible to implement a compiler pass that transforms tail calls (which may be dispatched to different methods) into a finite-state machine, using this mechanism? It'd also be nice to inject a limit on the number of times the function barrier is used in a loop, as mentioned on Discourse, to avoid stack overflow. (If I can make both

Edit: Actually, I guess what I want to do doesn't require typed IR. I guess it's already doable with something like IRTools.jl (or directly with the hacks it's using)?

Edit 2: Yep, there is already https://github.com/MikeInnes/IRTools.jl/blob/6d227c0edb828b7c761c97fe899dd33c03e69b56/examples/continuations.jl :)
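To make the TC-to-FSM idea concrete, a hand-written sketch of the transformation (illustrative only; is_even/is_odd are made-up examples, and a real pass would operate on IR rather than source):

# Two mutually tail-recursive functions...
is_even(n) = n == 0 ? true  : is_odd(n - 1)   # tail call
is_odd(n)  = n == 0 ? false : is_even(n - 1)  # tail call

# ...and the finite-state machine a pass might turn them into: each
# tail call becomes a state update in a single stack-free loop.
function is_even_fsm(n)
    state = :even
    while true
        if state === :even
            n == 0 && return true
            state = :odd
        else
            n == 0 && return false
            state = :even
        end
        n -= 1
    end
end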
We now have AbstractInterpreter.
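(For context, a minimal sketch of that mechanism, assuming a Julia 1.6+ build where reflection accepts a custom interpreter:)

# Inference-driven reflection can now be parameterized by a
# Core.Compiler.AbstractInterpreter, the hook that superseded this RFC.
interp = Core.Compiler.NativeInterpreter()
code_typed(sin, (Float64,); interp = interp)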
The idea here is to allow packages to define custom optimization passes starting from type-inferred code. The proximal motivation was to enable LoopVectorization to get more information than a macro allows about what types of objects it's working with. The overall design is that you can set :meta statements that optionally bracket the region of code you want to apply the optimization to, but the callback function receives the entire method (the Core.Compiler.OptimizationState). Here, for example, would be the @avx macro shown in the review excerpt above. In the CodeInfo, this just brackets ex with the corresponding meta expressions. The compiler looks for such meta expressions and then hands the code to avx_pass, which is obviously where all the magic needs to happen.

After the other passes, this does leave stray :external_pass meta expressions at the end of the CodeInfo. Not sure if I should remove those, but they seem likely to be harmless.

CC @chriselrod.

EDIT: Argh, just realized that I need to modify the iteration here: the pass is likely to change the number of lines, so this needs to start from scratch again after each pass. And the callback should be responsible for removing its :meta expression. But let's see what people think about the general idea before I fix this.
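A sketch of what that restart-on-change driver might look like (hypothetical helper names: is_external_pass_meta, pass_of, apply_pass!; none of this is the PR's actual code):

# Re-scan from the top after every pass, since a pass can change the
# number of statements; each callback removes its own :meta marker.
function run_external_passes!(opt)
    restart = true
    while restart
        restart = false
        for stmt in opt.src.code
            if is_external_pass_meta(stmt)        # hypothetical predicate
                apply_pass!(pass_of(stmt), opt)   # callback deletes its :meta
                restart = true
                break                             # statements changed: start over
            end
        end
    end
    return opt
end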