
Missing TBAA information on view #34847

Open
vchuravy opened this issue Feb 23, 2020 · 3 comments

Labels
performance  Must go faster
upstream  The issue is with an upstream dependency, e.g. LLVM

Comments

vchuravy (Member) commented Feb 23, 2020

In https://discourse.julialang.org/t/rfc-ann-restacker-jl-a-workaround-for-the-heap-allocated-immutable-problem/35037 @tkf reported that manually copying a view at the start of a function causes a sizable performance improvement. This seems different from the stack allocation of immutables that contain heap references (x-ref: #34126). From my brief investigation, the performance difference appears to be due to a vectorization failure. The example I looked at is:

@noinline function f!(ys, xs)
    @inbounds for i in eachindex(ys, xs)
        x = xs[i]
        if -0.5 < x < 0.5
            ys[i] = 2x
        end
    end
end

"Restacking" ys causes this function to be vectorizable.
Looking at the optimized LLVM IR of the original version, I noticed:

   %40 = add i64 %39, %28, !dbg !67
   %41 = getelementptr inbounds double, double addrspace(13)* %31, i64 %40, !dbg !67
   %42 = load double, double addrspace(13)* %41, align 8, !dbg !67, !tbaa !70
...
   %47 = add i64 %39, %36, !dbg !81
   %48 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %38, align 8, !dbg !81, !tbaa !59, !nonnull !4
   %49 = getelementptr inbounds double, double addrspace(13)* %48, i64 %47, !dbg !81
   store double %46, double addrspace(13)* %49, align 8, !dbg !81, !tbaa !70

Here the %42 load is the getindex from xs and the store double is the setindex! into ys. Notice how %48 reloads the base pointer of the array on every iteration.

So how are we deriving %38 and %48, and why don't we have to reload the pointer we are loading from?

The series of operations we perform to obtain the base pointer for the getindex:

  %24 = bitcast %jl_value_t addrspace(11)* %11 to %jl_value_t addrspace(10)* addrspace(11)*
  %25 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %24, align 8, !tbaa !41, !nonnull !4, !dereferenceable !57, !align !58
  %26 = getelementptr inbounds i8, i8 addrspace(11)* %12, i64 16
  %27 = bitcast i8 addrspace(11)* %26 to i64 addrspace(11)*
  %28 = load i64, i64 addrspace(11)* %27, align 8, !tbaa !41
  %29 = addrspacecast %jl_value_t addrspace(10)* %25 to %jl_value_t addrspace(11)*
  %30 = bitcast %jl_value_t addrspace(11)* %29 to double addrspace(13)* addrspace(11)*
  %31 = load double addrspace(13)*, double addrspace(13)* addrspace(11)* %30, align 8, !tbaa !59, !nonnull !4

And the series of operations we perform to obtain the pointer from which the base pointer is loaded:

  %32 = bitcast %jl_value_t addrspace(11)* %8 to %jl_value_t addrspace(10)* addrspace(11)*
  %33 = load %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(11)* %32, align 8
  %34 = getelementptr inbounds i8, i8 addrspace(11)* %9, i64 16
  %35 = bitcast i8 addrspace(11)* %34 to i64 addrspace(11)*
  %36 = load i64, i64 addrspace(11)* %35, align 8
  %37 = addrspacecast %jl_value_t addrspace(10)* %33 to %jl_value_t addrspace(11)*
  %38 = bitcast %jl_value_t addrspace(11)* %37 to double addrspace(13)* addrspace(11)*

Notice how in the latter case we are missing all TBAA information. The two interesting TBAA trees are below:

!41 = !{!42, !42, i64 0}
!42 = !{!"jtbaa_immut", !43, i64 0}
!43 = !{!"jtbaa_value", !44, i64 0}
!44 = !{!"jtbaa_data", !45, i64 0}
!45 = !{!"jtbaa", !46, i64 0}
!46 = !{!"jtbaa"}
!59 = !{!60, !60, i64 0}
!60 = !{!"jtbaa_arrayptr", !61, i64 0}
!61 = !{!"jtbaa_array", !45, i64 0}

I don't understand our TBAA annotations well enough to conclude that this is a bug, but shouldn't the view be marked immutable either way?
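For what it's worth, the view header itself is an immutable struct, which is why one would expect jtbaa_immut on loads of its fields. A quick REPL check:

julia> v = view(zeros(4), 2:3);

julia> isimmutable(v)
true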

longemen3000 (Contributor) commented

Bump

yuyichao (Contributor) commented Aug 23, 2020

This is the same issue as #36970. Also, AFAICT, the issue is not the absence of TBAA but the absence of dereferenceable. All the memory operations in the loop have all the metadata on them. However, LICM stripped the metadata from the load of the array out of the SubArray and therefore lost the information needed to hoist the load of the array pointer in ys out of the branch. Adding the dereferenceable metadata back to the post-LICM IR makes vectorization happen.

https://reviews.llvm.org/D85598 helps with #36970 but not this one since the load is not constant.
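For anyone who wants to reproduce the analysis, the IR before and after the optimizer (including LICM) can be compared with the standard reflection tools; the array sizes below are arbitrary:

xs = randn(1000)
ys = view(zeros(1001), 2:1001)   # a SubArray, as in the original report
f!(ys, xs)

# unoptimized IR: the loads from the SubArray still carry their metadata
@code_llvm optimize=false debuginfo=:none f!(ys, xs)

# optimized IR: compare with the snippets above, where the metadata is gone
@code_llvm optimize=true debuginfo=:none f!(ys, xs)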

yuyichao added the upstream label on Aug 23, 2020
pdeffebach (Contributor) commented

Bumping this as I think it's relevant; @chriselrod pointed me to this issue. Currently @view isn't taking full advantage of a transpose or a PermutedDimsArray; see below:

julia> function summatrows(x)
           N = size(x, 1)
           z = Vector{Float64}(undef, N)
           @inbounds for i in 1:N
               z[i] = sum(@view x[i, :])
           end
           z
       end;

julia> function summatcols(x)
           N = size(x, 2)
           z = Vector{Float64}(undef, N)
           @inbounds for i in 1:N
               z[i] = sum(@view x[:, i])
           end
           z
       end;

julia> x = rand(1000, 1000);

julia> y = transpose(permutedims(x));

julia> using BenchmarkTools;

julia> @btime summatcols($x);
  354.489 μs (1 allocation: 7.94 KiB)

julia> @btime summatrows($x);
  2.249 ms (1 allocation: 7.94 KiB)

julia> @btime summatcols($y);
  2.345 ms (1 allocation: 7.94 KiB)

julia> @btime summatrows($y);
  1.269 ms (1 allocation: 7.94 KiB)

The very last timing should be around 350 μs: summatrows on y traverses memory in the same column-major order as summatcols on x.
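As a hedged sketch of why: for y = transpose(permutedims(x)), the row view y[i, :] covers the same memory as the contiguous column parent(y)[:, i], so summing that column directly shows roughly what the last benchmark should cost (summatrows_parent is an illustrative name, not an existing function):

function summatrows_parent(y)
    p = parent(y)                    # the underlying permutedims(x) matrix
    N = size(y, 1)
    z = Vector{Float64}(undef, N)
    @inbounds for i in 1:N
        z[i] = sum(@view p[:, i])    # same elements as y[i, :], but contiguous
    end
    z
end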
