
feat: implement a separate TracedRNumber #161

Merged: 35 commits merged into main from ap/scalar on Oct 6, 2024
Conversation

@avik-pal (Collaborator) commented Oct 4, 2024

currently very WIP

@avik-pal changed the title from "feat: implement a separate TracedRScalar" to "feat: implement a separate TracedRNumber" on Oct 5, 2024
@avik-pal force-pushed the ap/scalar branch 2 times, most recently from 246cd00 to 40af781 on October 5, 2024 at 01:31
@avik-pal force-pushed the ap/scalar branch 3 times, most recently from 113c2b5 to 10495fc on October 5, 2024 at 02:56
@avik-pal marked this pull request as ready for review on October 5, 2024 at 03:30
@avik-pal requested review from mofeing and wsmoses on October 5, 2024 at 03:37
@github-actions (bot) left a comment

Reactant.jl Benchmarks

Benchmark suite Current: 8a9f06c Previous: f2c0e8a Ratio
ViT base (256 x 256 x 3 x 32)/forward/CUDA/Reactant 1318556698 ns 1315729546 ns 1.00
ViT base (256 x 256 x 3 x 32)/forward/CUDA/Lux 213965204 ns 212083499 ns 1.01
ViT base (256 x 256 x 3 x 32)/forward/CPU/Reactant 6804934419 ns 5286469750 ns 1.29
ViT base (256 x 256 x 3 x 32)/forward/CPU/Lux 18511487331 ns 23583347555 ns 0.78
ViT small (256 x 256 x 3 x 4)/forward/CUDA/Reactant 1255068856 ns 1254858296 ns 1.00
ViT small (256 x 256 x 3 x 4)/forward/CUDA/Lux 8751128 ns 8478570 ns 1.03
ViT small (256 x 256 x 3 x 4)/forward/CPU/Reactant 1643973338 ns 1636237670 ns 1.00
ViT small (256 x 256 x 3 x 4)/forward/CPU/Lux 2060290806 ns 2376437823 ns 0.87
ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Reactant 1304838533 ns 1266018905 ns 1.03
ViT tiny (256 x 256 x 3 x 32)/forward/CUDA/Lux 93074730.5 ns 84820407 ns 1.10
ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Reactant 2253189844 ns 2170879105 ns 1.04
ViT tiny (256 x 256 x 3 x 32)/forward/CPU/Lux 6064963189 ns 4675094299 ns 1.30
ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Reactant 1297977475 ns 1263496480 ns 1.03
ViT tiny (256 x 256 x 3 x 4)/forward/CUDA/Lux 7525710.5 ns 7782824 ns 0.97
ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Reactant 1466869310 ns 1467043032.5 ns 1.00
ViT tiny (256 x 256 x 3 x 4)/forward/CPU/Lux 1361212717 ns 1685775445 ns 0.81
ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Reactant 1314689632 ns 1306815930 ns 1.01
ViT tiny (256 x 256 x 3 x 16)/forward/CUDA/Lux 11418734.5 ns 11611908 ns 0.98
ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Reactant 1752787751.5 ns 1752808523 ns 1.00
ViT tiny (256 x 256 x 3 x 16)/forward/CPU/Lux 2629684857 ns 2463987825.5 ns 1.07
ViT small (256 x 256 x 3 x 16)/forward/CUDA/Reactant 1276847603.5 ns 1325877558.5 ns 0.96
ViT small (256 x 256 x 3 x 16)/forward/CUDA/Lux 86382264 ns 90330187 ns 0.96
ViT small (256 x 256 x 3 x 16)/forward/CPU/Reactant 2216207055 ns 2213119086 ns 1.00
ViT small (256 x 256 x 3 x 16)/forward/CPU/Lux 3548437058 ns 4023816395 ns 0.88
ViT small (256 x 256 x 3 x 32)/forward/CUDA/Reactant 1308980482.5 ns 1270812264 ns 1.03
ViT small (256 x 256 x 3 x 32)/forward/CUDA/Lux 116489754.5 ns 113097539 ns 1.03
ViT small (256 x 256 x 3 x 32)/forward/CPU/Reactant 3037687031 ns 3042643080 ns 1.00
ViT small (256 x 256 x 3 x 32)/forward/CPU/Lux 9576755622 ns 8210106924.5 ns 1.17
ViT base (256 x 256 x 3 x 16)/forward/CUDA/Reactant 1332370009 ns 1324054039 ns 1.01
ViT base (256 x 256 x 3 x 16)/forward/CUDA/Lux 126824571 ns 127669686.5 ns 0.99
ViT base (256 x 256 x 3 x 16)/forward/CPU/Reactant 3191878044 ns 3203794253 ns 1.00
ViT base (256 x 256 x 3 x 16)/forward/CPU/Lux 7124639562 ns 11004907984 ns 0.65
ViT base (256 x 256 x 3 x 4)/forward/CUDA/Reactant 1280201506 ns 1299288245 ns 0.99
ViT base (256 x 256 x 3 x 4)/forward/CUDA/Lux 83942324 ns 96277750 ns 0.87
ViT base (256 x 256 x 3 x 4)/forward/CPU/Reactant 1900998509 ns 2155333265.5 ns 0.88
ViT base (256 x 256 x 3 x 4)/forward/CPU/Lux 2374950424 ns 2863535293.5 ns 0.83

This comment was automatically generated by a workflow using github-action-benchmark.

@github-actions (bot) commented Oct 5, 2024

Benchmark Results

Benchmark main 4a81556... main/4a81556a8fc685... (ratio)
comptime/basics/2D sum 27.4 ± 2.3 ms 27.5 ± 1.2 ms 0.997
comptime/basics/Basic grad cos 0.0455 ± 0.00079 s 0.0457 ± 0.00073 s 0.995
comptime/basics/cos.(x) 0.0323 ± 0.0014 s 0.0321 ± 0.00059 s 1.01
comptime/lux neural networks/ViT base 5.88 s 5.83 s 1.01
comptime/lux neural networks/ViT tiny 6.04 s 5.85 s 1.03
comptime/lux neural networks/vgg11 bn=false 0.4 ± 0.017 s 0.409 ± 0.0068 s 0.977
comptime/lux neural networks/vgg13 bn=false 0.439 ± 0.017 s 0.47 ± 0.015 s 0.934
comptime/lux neural networks/vgg16 bn=false 0.534 ± 0.012 s 0.553 ± 0.0049 s 0.964
comptime/lux neural networks/vgg19 bn=false 0.616 ± 0.0099 s 0.635 ± 0.024 s 0.969
runtime/lux neural networks/ViT base (compiled) 6.38 s 6.37 s 1
runtime/lux neural networks/ViT tiny (compiled) 1.71 s 1.69 s 1.01
runtime/lux neural networks/vgg11 bn=false (compiled) 2.14 s 2.13 s 1
runtime/lux neural networks/vgg13 bn=false (compiled) 2.97 s 2.95 s 1.01
runtime/lux neural networks/vgg16 bn=false (compiled) 3.78 s 3.77 s 1
runtime/lux neural networks/vgg19 bn=false (compiled) 4.55 s 4.55 s 1
time_to_load 1.45 ± 0.0048 s 1.34 ± 0.006 s 1.08

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

@mofeing (Collaborator) commented Oct 5, 2024

Let me just merge #163 after CI checks pass and rebase this PR on top of it, because it adds some tests that TracedRNumber should pass too.

@wsmoses (Member) left a comment

LGTM, only minor comment: would defining TracedValue = Union{TracedRArray, TracedRNumber} simplify some things?
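
A minimal sketch of that suggestion (the alias name and exact form are assumptions, not something defined in this PR):

# Hypothetical alias covering both traced types:
const TracedValue{T} = Union{TracedRArray{T}, TracedRNumber{T}}

# Methods shared by arrays and numbers could then be written once:
is_traced(::TracedValue) = true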

@avik-pal (Collaborator, Author) commented Oct 5, 2024

Do we also need a ConcreteRScalar, if we want to extend to_rarray to all of the ReactantPrimitives?

@wsmoses (Member) commented Oct 5, 2024

For now we can leave the concrete types as is.

@mofeing (Collaborator) commented Oct 5, 2024

I don't think we need a concrete scalar type, because indexing on a concrete array is just a normal number.
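
A quick sketch of that point (assuming the indexing behavior of the existing concrete array type):

x = Reactant.to_rarray(ones(4, 4))  # a ConcreteRArray{Float64,2}
x[1, 1]                             # outside of tracing this is already a plain Float64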

@avik-pal (Collaborator, Author) commented Oct 5, 2024

I don't think we need a concrete scalar type, because indexing on a concrete array is just a normal number.

I meant more from a tracing perspective. What should to_rarray(::ReactantPrimitives) return? If it is a plain Number, then we can't convert it to a TracedRNumber. If we want to convert it, then we need something like a ConcreteRNumber, so that users are aware that it is going to be traced.
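
A sketch of the two options being weighed (illustrative only; ConcreteRNumber is the proposed wrapper, not something that exists yet):

x = 3.0f0                        # a ReactantPrimitive
# option 1: return it unchanged, so it is compiled in as a constant
to_rarray(x) === 3.0f0
# option 2: wrap it, so it becomes a traced argument like arrays are
to_rarray(x)::ConcreteRNumber{Float32}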

@wsmoses (Member) commented Oct 5, 2024

So I think these are two different questions.

  1. Should we upgrade scalars into traced values (like we do for arrays), or compile with respect to them (like we do for constants)?

For now I'm fine leaving them as is, which is to compile them as constants.

  2. Can we represent a ConcreteScalar, if desired? I think this will be necessary in the future.

For example, we may want to have a function with a user-defined index offset.

e.g.

A = ConcreteRArray(ones(4, 4))  # illustrative setup
v = ConcreteScalar(1)           # the proposed concrete scalar (hypothetical)
f = @compile getindex(A, v)

Without a concrete scalar, we would need to recompile for each index. I do think we need such a scalar, even if the default conversion behavior is to only convert arrays.

@wsmoses (Member) commented Oct 5, 2024

Even more simply though, we may need it as the return type of a function with array inputs and a scalar output.

i.e.

x = ConcreteRArray(ones(4, 4))
f = @compile sum(x)
f(x) # this should now return a scalar, but we cannot represent it as a regular Float64, since it will be the result of a future, etc.

@avik-pal (Collaborator, Author) commented Oct 5, 2024

Looking at the above example, I also realized the mapreduce semantics are incorrect at present. For example, sum(x) should return a TracedRNumber and not a TracedRArray{T,0}.

Even more simply though, we may need as the return type of a function with array inputs and a scalar output.

We could store it like an array in the ConcreteRArray but expose it to the end user as a scalar. JuliaGPU/GPUArrays.jl#550 is something similar in the GPUArrays world. But let's keep this PR simple and avoid this for now.
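
For reference, a minimal sketch of the corrected mapreduce semantics (illustrative, using the API from the examples above):

x = Reactant.to_rarray(ones(4, 4))
f = @compile sum(x)  # while tracing, sum(::TracedRArray) now yields a
                     # TracedRNumber{Float64}, not a TracedRArray{Float64,0}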

@codecov-commenter commented

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 60.13072% with 61 lines in your changes missing coverage. Please review.

Project coverage is 30.62%. Comparing base (9904590) to head (841376d).

Files with missing lines Patch % Lines
src/TracedRNumber.jl 50.81% 30 Missing ⚠️
src/TracedRArray.jl 62.79% 16 Missing ⚠️
ext/ReactantNNlibExt.jl 0.00% 7 Missing ⚠️
src/Tracing.jl 80.64% 6 Missing ⚠️
src/ConcreteRArray.jl 0.00% 1 Missing ⚠️
src/XLA.jl 0.00% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #161      +/-   ##
==========================================
- Coverage   33.09%   30.62%   -2.47%     
==========================================
  Files          37       38       +1     
  Lines        5107     5175      +68     
==========================================
- Hits         1690     1585     -105     
- Misses       3417     3590     +173     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wsmoses (Member) left a comment

A small comment, but otherwise LGTM.

@mofeing (Collaborator) commented Oct 6, 2024

Now that #163 has been merged, we should test concatenation on numbers.

@avik-pal should I do it, or do you prefer to do it yourself?

@mofeing (Collaborator) commented Oct 6, 2024

There's a bug in concatenation of TracedRNumber. In particular, if we run this kernel:

using Reactant

x = fill(true)
x_concrete = Reactant.to_rarray(x)

function traced_vcat(x)
	a = x[]
	[a; a; a]
end

f = @compile traced_vcat(x_concrete)

f(x_concrete)

It fails with the following error:

ERROR: BoundsError: attempt to access 3-element Vector{ConcreteRArray{Bool}} at index [1]
Stacktrace:
 [1] traced_getfield
   @ ~/Developer/Reactant.jl/src/Compiler.jl:18 [inlined]
 [2] macro expansion
   @ ~/Developer/Reactant.jl/src/Compiler.jl:649 [inlined]
 [3] (::Reactant.Compiler.Thunk{Symbol("##test_vcat_reactant#229")})(args::ConcreteRArray{Bool, 0})
   @ Reactant.Compiler ~/Developer/Reactant.jl/src/Compiler.jl:665
 [4] top-level scope
   @ REPL[10]:1

The problem seems to be that it doesn't make any XLA call and instead sets the result to 3 empty buffers. Check out the generated Julia code of f:

quote
    #= /Users/mofeing/Developer/Reactant.jl/src/Compiler.jl:639 =#
    #= /Users/mofeing/Developer/Reactant.jl/src/Compiler.jl:640 =#
    nothing
    #= /Users/mofeing/Developer/Reactant.jl/src/Compiler.jl:647 =#
    usbuf_1 = (getindex(args, 1)).data
    sbuf_1 = XLA.synced_buffer(usbuf_1)
    #= /Users/mofeing/Developer/Reactant.jl/src/Compiler.jl:648 =#
    ()
    #= /Users/mofeing/Developer/Reactant.jl/src/Compiler.jl:649 =#
    result = (ConcreteRArray{Bool})[ConcreteRArray{Bool, 0}(Reactant.XLA.AsyncBuffer(Reactant.XLA.Buffer(Ptr{Nothing} @0x0000000000000000), nothing), ()), ConcreteRArray{Bool, 0}(Reactant.XLA.AsyncBuffer(Reactant.XLA.Buffer(Ptr{Nothing} @0x0000000000000000), nothing), ()), ConcreteRArray{Bool, 0}(Reactant.XLA.AsyncBuffer(Reactant.XLA.Buffer(Ptr{Nothing} @0x0000000000000000), nothing), ())]
    (traced_getfield(result, $(Expr(:quote, 1)))).data = (args[1]).data
    (traced_getfield(result, $(Expr(:quote, 2)))).data = (args[1]).data
    (traced_getfield(result, $(Expr(:quote, 3)))).data = (args[1]).data
    #= /Users/mofeing/Developer/Reactant.jl/src/Compiler.jl:650 =#
    return result
end

@wsmoses (Member) commented Oct 6, 2024 via email

@mofeing (Collaborator) commented Oct 6, 2024

That actually makes sense here since it’s making an array of values all available as args. What’s the issue?

I don't get it. Its behavior should be like the one in the 0-dim array test: since we are passing a 0-dim array, it should return a vector.

From the MLIR point of view, it should be calling stablehlo.concatenate on a list of 0-dim tensors.

@wsmoses (Member) commented Oct 6, 2024

Ah, I see. I think this is a bug in the new number and cat, perhaps?

@avik-pal (Collaborator, Author) commented Oct 6, 2024

I think we won't do any XLA call here; see the Julia fallbacks for Number:

vcat(X::Number...) = hvcat_fill!(Vector{promote_typeof(X...)}(undef, length(X)), X)
hcat(X::Number...) = hvcat_fill!(Matrix{promote_typeof(X...)}(undef, 1,length(X)), X)

It is going to fill a regular Julia array, so the traced numbers never reach an XLA concatenate.
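
A minimal illustration of that fallback in plain Julia (no Reactant involved):

julia> vcat(1.0, 2.0, 3.0)  # hvcat_fill! allocates an ordinary Vector and copies the numbers in
3-element Vector{Float64}:
 1.0
 2.0
 3.0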

@avik-pal (Collaborator, Author) commented Oct 6, 2024

Should work now

julia> function traced_vcat(x)
               a = x[]; 
               Float64[a; a; a]
       end
traced_vcat (generic function with 1 method)

julia> @code_hlo optimize=false traced_vcat(x_concrete)
Module:
module {
  func.func @main(%arg0: tensor<i1>) -> (tensor<3xf64>, tensor<i1>) {
    %0 = stablehlo.transpose %arg0, dims = [] : (tensor<i1>) -> tensor<i1>
    %1 = stablehlo.broadcast_in_dim %0, dims = [] : (tensor<i1>) -> tensor<1xi1>
    %2 = stablehlo.broadcast_in_dim %0, dims = [] : (tensor<i1>) -> tensor<1xi1>
    %3 = stablehlo.broadcast_in_dim %0, dims = [] : (tensor<i1>) -> tensor<1xi1>
    %4 = stablehlo.convert %1 : (tensor<1xi1>) -> tensor<1xf64>
    %5 = stablehlo.convert %2 : (tensor<1xi1>) -> tensor<1xf64>
    %6 = stablehlo.convert %3 : (tensor<1xi1>) -> tensor<1xf64>
    %7 = stablehlo.concatenate %4, %5, %6, dim = 0 : (tensor<1xf64>, tensor<1xf64>, tensor<1xf64>) -> tensor<3xf64>
    %8 = stablehlo.transpose %7, dims = [0] : (tensor<3xf64>) -> tensor<3xf64>
    %9 = stablehlo.transpose %0, dims = [] : (tensor<i1>) -> tensor<i1>
    return %8, %9 : tensor<3xf64>, tensor<i1>
  }
}

julia> @code_hlo traced_vcat(x_concrete)
Module:
module attributes {transform.with_named_sequence} {
  func.func @main(%arg0: tensor<i1>) -> tensor<3xf64> {
    %0 = stablehlo.convert %arg0 : (tensor<i1>) -> tensor<f64>
    %1 = stablehlo.broadcast_in_dim %0, dims = [] : (tensor<f64>) -> tensor<3xf64>
    return %1 : tensor<3xf64>
  }
}

@mofeing merged commit 6e89952 into main on Oct 6, 2024
15 of 23 checks passed
@mofeing deleted the ap/scalar branch on October 6, 2024 at 18:41
Development

Successfully merging this pull request may close these issues.

Inconsistent semantics of 0-dim RArrays compared to Julia semantics