
Rewrite mul! to dispatch based on memory layout, not matrix type #25558

Closed
wants to merge 54 commits

Conversation

dlfivefifty
Contributor

This is in part to address #10385, based on the discussions there, and is the second stage of #25321.

The aim of this pull request is to use memory layout, instead of array type, to decide when to dispatch to BLAS/LAPACK routines. This will lead to simpler dispatch of linear algebra, without so many special cases for Adjoint, Transpose, etc. It also makes it easier for array types outside of Base (e.g. BandedMatrix) to use BLAS linear algebra routines.

I've only adapted mul! so far (with ldiv!, etc. yet to be adapted), but I wanted to get comments before finishing the changes. I'll finish implementing this pull request if it's agreed to be a good solution to the problem.

Note that I use the name CTransposeStridedLayout instead of AdjointStridedLayout, as it refers to memory layout, not mathematical structure. Alternatively, it could be RowMajorStridedLayout and ConjRowMajorStridedLayout.

@timholy @stevengj Any thoughts would be appreciated.
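
To make the proposal concrete, here is a minimal self-contained sketch of layout-based dispatch. The names MemoryLayout and DenseColumnMajor mirror this PR, but the surrounding scaffolding (mymul!, _mymul!, UnknownLayout) is illustrative, not the actual diff:

using LinearAlgebra

# Illustrative trait hierarchy (names mirror the PR; scaffolding is not from the diff)
abstract type AbstractLayout end
struct DenseColumnMajor <: AbstractLayout end  # contiguous, column-major storage
struct UnknownLayout    <: AbstractLayout end  # fallback for arbitrary AbstractArrays

MemoryLayout(::Type{<:Array}) = DenseColumnMajor()
MemoryLayout(::Type{<:AbstractArray}) = UnknownLayout()

# The entry point computes the trait once and re-dispatches on it
mymul!(y, A, x) = _mymul!(y, A, x, MemoryLayout(typeof(A)))

# Column-major storage with a BLAS element type goes straight to BLAS
_mymul!(y::Vector{Float64}, A::Matrix{Float64}, x::Vector{Float64}, ::DenseColumnMajor) =
    BLAS.gemv!('N', 1.0, A, x, 0.0, y)

# Everything else falls back to a generic loop
function _mymul!(y, A, x, ::AbstractLayout)
    fill!(y, zero(eltype(y)))
    for j in axes(A, 2), i in axes(A, 1)
        y[i] += A[i, j] * x[j]
    end
    return y
end

An external array type then opts in by defining one MemoryLayout method, instead of overloading mul! against every combination of wrapper types.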

@ararslan added the linear algebra label Jan 14, 2018
@dlfivefifty
Contributor Author

dlfivefifty commented Jan 16, 2018

Some notes to myself as I delete code to make sure to restore behaviour:

  • Should mul!(::Vector, ::Matrix, ::Matrix), etc., really be allowed?
  • Go through mul! returns and remove unnecessary ambiguity overrides
  • copyto!(A::Adjoint, B) should call adjoint!(parent(A), B)
  • copyto!(A::Transpose, B) should call transpose!(parent(A), B) (both sketched below)
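
For reference, a minimal sketch of those two copyto! methods (adjoint! and transpose! already exist in LinearAlgebra; these exact method definitions are illustrative, not taken from the diff):

using LinearAlgebra

# A = adjoint(P) equals B elementwise exactly when P == adjoint(B),
# so we can fill the wrapper by materializing into its parent
Base.copyto!(A::Adjoint{<:Any,<:AbstractMatrix}, B::AbstractMatrix) =
    (adjoint!(parent(A), B); A)
Base.copyto!(A::Transpose{<:Any,<:AbstractMatrix}, B::AbstractMatrix) =
    (transpose!(parent(A), B); A)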

@@ -488,6 +488,7 @@ mul2!(A::Tridiagonal, B::AbstractTriangular) = A*full!(B) # is this necessary?

mul!(C::AbstractMatrix, A::AbstractTriangular, B::Tridiagonal) = mul!(C, copyto!(similar(parent(A)), A), B)
mul!(C::AbstractMatrix, A::Tridiagonal, B::AbstractTriangular) = mul!(C, A, copyto!(similar(parent(B)), B))
<<<<<<< HEAD
Member


Merge/rebase error

@dlfivefifty
Contributor Author

Just pushed a fix.

@dlfivefifty
Contributor Author

dlfivefifty commented Jan 20, 2018

Some (but only some) of the builds are failing because of this:

Test Failed at C:\projects\julia\julia-d966fe96ce\share\julia\site\v0.7\IterativeEigensolvers\test\runtests.jl:104
  Expression: sort((eigs(A, B, nev=k, sigma=1.0))[1]) ≈ (sort(eigvals(A, B)))[2:4]
   Evaluated: [0.00646459, 0.0779555, 0.312055] ≈ [-Inf, 0.00646459, 0.0779555]
ERROR: LoadError: Test run finished with errors

Anyone have any idea why the changes here would affect generalized eigenvalues?

I suspect that on some architectures it's returning one -Inf and on others it's returning two -Infs, so I can work around it by taking absolute values. But I'm not sure if I've triggered different behaviour.

On my machine, it looked like LAPACK.geev is calling the same ccall routine with the same matrices, but perhaps there were small errors...

EDIT: Nevermind, it's because A'A is not dispatching correctly.

@dlfivefifty changed the title from "WIP/RFC: Rewrite mul! to dispatch based on memory layout, not matrix type" to "Rewrite mul! to dispatch based on memory layout, not matrix type" Jan 21, 2018
@stevengj
Member

stevengj commented Jan 24, 2018

I'm for this in principle, but I have two issues with the proposed DenseLayout type:

  • BLAS and LAPACK can directly handle layouts with contiguous columns but an arbitrary stride from one column to the next. This is not strictly "Dense".

  • "Dense" row-major (possibly with arbitrary strides between rows, again) is also common in many applications involving external libraries, and it is often possible to use BLAS and LAPACK with this format, without making copies, by treating the data as implicitly transposed and transforming the linear algebra accordingly. I guess this is your "transpose" layout, but in many applications this is the native layout and it seems weird to call it "transposed".

So, it seems like we would want at least four "dense" layouts:

  • DenseColumnMajor — the current DenseLayout type
  • DenseColumnsStridedRows — dense columns, but with an arbitrary stride from one column to the next (corresponding to the "leading dimension" arguments in BLAS/LAPACK). Maybe also a DenseColumns = Union{DenseColumnMajor,DenseColumnsStridedRows} type for convenience, as the layouts accepted directly by BLAS/LAPACK.
  • DenseRowMajor, DenseRowsStridedColumns, DenseRows — similarly, but for row-major.

And then I guess you also need conjugated versions of the above, but it seems more flexible to be able to conjugate any layout, i.e. have a ConjugatedLayout{T, L<:AbstractLayout{T}} <: AbstractLayout{T} for a conjugated version of any layout L.
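
A type-level sketch of that proposal (hypothetical code; the names are taken from the comment above):

abstract type AbstractLayout{T} end

struct DenseColumnMajor{T}        <: AbstractLayout{T} end
struct DenseColumnsStridedRows{T} <: AbstractLayout{T} end
struct DenseRowMajor{T}           <: AbstractLayout{T} end
struct DenseRowsStridedColumns{T} <: AbstractLayout{T} end

# The layouts BLAS/LAPACK accept directly (the "leading dimension" case)
const DenseColumns{T} = Union{DenseColumnMajor{T}, DenseColumnsStridedRows{T}}
const DenseRows{T}    = Union{DenseRowMajor{T}, DenseRowsStridedColumns{T}}

# Conjugation composes with any layout rather than duplicating the hierarchy
struct ConjugatedLayout{T, L<:AbstractLayout{T}} <: AbstractLayout{T}
    layout::L
end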

@dlfivefifty
Contributor Author

I like your suggestion.

I suppose we can keep the chkstride1 checks, which will become a no-op for DenseColumnsStridedRows.

@stevengj
Member

Also, to play devil's advocate here, what exactly is the advantage of making this a trait, rather than just a check of strides?

Pro: traits let these checks be done at compile-time, although the cost of a runtime strides check is almost certainly negligible for linear algebra. (And the cost of the runtime strides check could be eliminated if desired by adding a chkstride1(A) = true method.)

Con: now array types have to implement yet another method. And, for cases like NumPy arrays where strides are only known at runtime, this requires a type-unstable MemoryLayout(A) function and dynamic dispatch (which is probably slower than a runtime stride if-else).
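
For comparison, the runtime-check alternative being weighed here is just an ordinary branch on strides, along these lines (illustrative only; note that stride itself throws for arrays that are not strided, which is part of the con above):

using LinearAlgebra

# Runtime strides branch instead of a trait: the check costs a few
# comparisons, negligible next to the O(m·n) multiply itself
function mymul_strides!(y::Vector{Float64}, A::StridedMatrix{Float64}, x::Vector{Float64})
    if stride(A, 1) == 1 && stride(A, 2) >= max(1, size(A, 1))
        return BLAS.gemv!('N', 1.0, A, x, 0.0, y)  # BLAS-compatible layout
    end
    fill!(y, 0.0)  # otherwise a generic fallback loop
    for j in axes(A, 2), i in axes(A, 1)
        y[i] += A[i, j] * x[j]
    end
    return y
end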

@dlfivefifty
Contributor Author

I think the benefit of traits is it allows other formats like triangular, symmetric, and banded.

The main point of the pull request from my perspective is so that banded matrices can pipe into the mul! framework without hundreds of ambiguity overrides.
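
To make that concrete: under this design a banded package would define a layout for its type once, and a single layout-based method routes it to the banded BLAS kernel. A hedged sketch, reusing the trait names from the sketch in the opening post (MyBandedMatrix and BandedColumnMajor are hypothetical; the real BandedMatrices.jl types differ):

# Hypothetical banded type storing bands gbmv-style (rows of `data` are bands)
struct MyBandedMatrix{T} <: AbstractMatrix{T}
    data::Matrix{T}  # size (l + u + 1, n)
    m::Int           # number of rows
    l::Int           # number of sub-diagonals
    u::Int           # number of super-diagonals
end

struct BandedColumnMajor <: AbstractLayout end
MemoryLayout(::Type{<:MyBandedMatrix}) = BandedColumnMajor()

# One layout method replaces many type-based mul! (and ambiguity) overloads
_mymul!(y, A::MyBandedMatrix, x, ::BandedColumnMajor) =
    BLAS.gbmv!('N', A.m, A.l, A.u, one(eltype(y)), A.data, x, zero(eltype(y)), y)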

@dlfivefifty
Contributor Author

I'm not sure what proposed changes are left: @mbauman had suggestions for simplifying the type hierarchy and some renamings (e.g., IndexingImplementation instead of MemoryLayout), but I'm going to hold off on that until others pipe in with their support.

Adding linearstrided(A) would be fine but is not needed in this PR (I'd rather get this PR finalized and handle other compatible improvements in a subsequent PR).

I've also looked through the usage of IndexStyle and a serious option would be to combine the proposed MemoryLayout with IndexStyle: that is, the role played by IndexLinear() would be played by DenseColumnMajor() (or whatever it ends up being renamed to). This can be tried out in a separate PR (which would deprecate IndexStyle).

@mbauman
Member

mbauman commented Apr 9, 2018

Thanks for the update! I'm not too worried about the names and linearstrided could easily be added later, but I still feel quite strongly about simplifying the AbstractStrided hierarchy. This thread is long and cumbersome, but the diff comments here are a pretty good self-contained summary for anyone else interested in piping in.

No, IndexLinear could not be replaced by DenseColumnMajor. It says nothing about the storage of the array — simply that it is most efficient to use one linear index into the multidimensional structure instead of a cartesian one.

@dlfivefifty
Contributor Author

No, IndexLinear could not be replaced by DenseColumnMajor. It says nothing about the storage of the array — simply that it is most efficient to use one linear index into the multidimensional structure instead of a cartesian one.

Can you give an example of such an array that is not dense column major?

@mbauman
Member

mbauman commented Apr 9, 2018

struct OnesMatrix <: AbstractArray{Int,2} end
Base.getindex(::OnesMatrix, i::Int) = 1
Base.size(::OnesMatrix) = (2,2)
Base.IndexStyle(::Type{OnesMatrix}) = IndexLinear()

(Edit: or in the wild: FillArrays.jl)

@StefanKarpinski
Member

StefanKarpinski commented Apr 9, 2018

I'm only following this at a very high level, but here's my 2¢:

  • I really like the direction here, it seems very promising and I've been consistently encouraging @mbauman to work with you to try to get something along these lines to happen.

  • I agree with him, however, that this is too many different orthogonal concepts crammed into a single trait — this needs a more composable approach. Having 9+ different trait values seems to me to be a crystal clear sign that this needs to be expressed in a more compositional manner. We have, at the very least, the following mostly orthogonal concepts:

    1. Indexing mechanism: pointer + strides, etc.
    2. Preferred indexing scheme: linear or cartesian
    3. Preferred order of dimension iteration

Sure, these aren't perfectly orthogonal, but we can come up with examples to fill in most of the boxes if we put a grid of these together:

  • Plain dense arrays are strided and linear indexing is the most efficient way to go through all the elements; on the other hand, a subarray is strided but prefers cartesian indexing.
  • A matrix can be represented as a vector of vectors, which is not strided and prefers cartesian indexing. Depending on whether you represent it as a vector of rows or a vector of columns, either ordering of the dimensions could be preferable. This can be generalized to arbitrary dimensions with an arbitrary preferred ordering of dimension iteration.
  • Sparse matrix representations have similar properties to vectors of vectors but a very different implementation of indexing.
  • You might want to store a matrix in Z-order: this would not be strided but still linear indexing would be the best way to scan all the elements and would be much more efficient than using cartesian coordinates.
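
Read as code, that compositional approach might keep each concept as its own trait function, so types mix and match instead of multiplying out into a single 9+-way trait (purely illustrative names):

# Concept 1: indexing mechanism
abstract type StorageMechanism end
struct PointerPlusStrides <: StorageMechanism end
struct NonStrided         <: StorageMechanism end

# Concept 2: preferred indexing scheme
abstract type IndexingPreference end
struct PrefersLinear    <: IndexingPreference end
struct PrefersCartesian <: IndexingPreference end

storage(::Type{<:Array})  = PointerPlusStrides()
storage(::Type)           = NonStrided()
indexing(::Type{<:Array}) = PrefersLinear()
indexing(::Type)          = PrefersCartesian()

# Concept 3: preferred order of dimension iteration, e.g. (2, 1) for a
# vector-of-rows matrix; `nothing` when any order is equally good
dimorder(::Type{<:Array}) = nothing

A Z-order matrix would then be NonStrided plus PrefersLinear, and a strided SubArray PointerPlusStrides plus PrefersCartesian, with no combinatorial blow-up in trait types.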

@dlfivefifty
Contributor Author

If conflating traits is the issue, another option would be to move everything back to LinearAlgebra (now possible as DenseArray's definition has changed) and rename it to LinearAlgebraStyle, as that is the primary purpose from my perspective...

@dlfivefifty
Contributor Author

I think the main sticking point at this moment is whether ColumnMajor() is needed in addition to DenseColumnMajor(). Otherwise, what's currently ColumnMajor() would become StridedLayout().

Here are the arguments in favour of ColumnMajor():

  1. If A is a ColumnMajor matrix we know that view(A, 1:5, 1) is DenseColumnMajor().
  2. ColumnMajor checks that stride(A,1) == 1 at compile time, though this doesn't appear to make a difference in speed even for 1 × 1 matrices.
  3. Dispatch of mul! is significantly simplified: for strided matrices we would otherwise need code like the following:
function _mul!(y::AbstractVector{T}, A::AbstractMatrix{T}, x::AbstractVector{T}, ::AbstractStridedLayout, ::AbstractStridedLayout, ::AbstractStridedLayout) where T<:BlasFloat
    if stride(A,1) == 1 && stride(A,2) ≥ stride(A,1)
        gemv!(y, 'N', A, x)
    elseif stride(A,2) == 1 && stride(A,1) ≥ stride(A,2)
        gemv!(y, 'T', transpose(A), x)
    elseif stride(A,2) ≥ stride(A,1)
        generic_matvecmul!(y, 'T', transpose(A), x)
    else
        generic_matvecmul!(y, 'N', A, x)
    end
end
# repeat the above for ConjLayout{<:AbstractStridedLayout} with 'C' in place of 'T'

Compare this with the current version:

_mul!(y::AbstractVector{T}, A::AbstractMatrix{T}, x::AbstractVector{T}, ::AbstractStridedLayout, ::AbstractColumnMajor, ::AbstractStridedLayout) where {T<:BlasFloat} = gemv!(y, 'N', A, x)
_mul!(y::AbstractVector{T}, A::AbstractMatrix{T}, x::AbstractVector{T}, ::AbstractStridedLayout, ::AbstractRowMajor, ::AbstractStridedLayout) where {T<:BlasFloat} = gemv!(y, 'T', transpose(A), x)
_mul!(y::AbstractVector, A::AbstractMatrix, x::AbstractVector, _1, _2, _3) = generic_matvecmul!(y, 'N', A, x)

The first two points are very minor issues. The last point is really the sticking point: the code becomes complex handling every possible stride variant, and I won't have the time to make that change.

If @mbauman or someone else still feels strongly about removing ColumnMajor() and is willing to do the work, I have no objections with the change.

@mbauman
Member

mbauman commented Apr 11, 2018

Thanks for that great explanation and for narrowing our focus down to just that one type (well, two, including RowMajor). They do seem to be at the crux of the matter — the others you don't make use of here and would be much easier to remove.

The status quo on master is:

  • We have that hacky StridedArray union. For the purposes of matmul, we assume that these strided arrays are column major and pass them to the BLAS module with the 'N' flag. The BLAS module then just checks (by value) to see if they are actually supported column major strided arrays before deciding whether to bail back out to the generic implementation or to ccall BLAS.
  • Similarly, wrapped c/transposes of StridedArray are assumed to be 'C' or 'T'.
  • Subtypes of DenseArray are what is named DenseColumnMajor here. Lots of views and reshapes of DenseArrays are able to preserve strided-ness in cases where they wouldn't be able to otherwise. I agree that it is absolutely vital that we preserve our ability to make this distinction at the type/trait level.
  • Similarly, c/transposes of DenseArray are what is named DenseRowMajor here. For the same reasons that DenseArray is valuable, I would tend to agree that DenseRowMajor makes sense.

This leads us to the three types I proposed in my earlier comment.

Now, this PR — especially with the ColumnMajor and RowMajor types — gives us the great advantage that we're no longer optimistically guessing that Transpose{<:ColumnMajor} is a 'T'. We know it. But we're still just assuming that any AbstractStrided type is an 'N', and only doing that one if check when it shouldn't be much harder to just check it either way.

I'm still not sure how much work such a refactor would be, but I'd like to try and tackle it once I'm done with the broadcasting stuff.

@dlfivefifty
Contributor Author

👍 Though note keeping ColumnMajor and RowMajor makes implementing a "smart" mul! for AbstractStridedLayout slightly easier, in that we don't need to know anything about gemv!:

function _mul!(y::AbstractVector, A::AbstractMatrix, x::AbstractVector, ::AbstractStridedLayout, ::AbstractStridedLayout, ::AbstractStridedLayout)
    if stride(A,1) == 1 && stride(A,2) ≥ stride(A,1)
        _mul!(y, A, x, MemoryLayout(y), ColumnMajor(), MemoryLayout(x))
    elseif stride(A,2) == 1 && stride(A,1) ≥ stride(A,2)
        _mul!(y, A, x, MemoryLayout(y), RowMajor(), MemoryLayout(x))
    elseif stride(A,2) ≥ stride(A,1)
        generic_matvecmul!(y, 'T', transpose(A), x)
    else
        generic_matvecmul!(y, 'N', A, x)
    end
end

@dlfivefifty
Contributor Author

@mbauman I had a look at the new Broadcast interface, and I'm now wondering if this PR would be better rewritten to replicate that. What I mean is that mul! would pass through a Mul type:

struct Mul{StyleA, StyleX, AType, XType}
    style_A::StyleA
    style_x::StyleX
    A::AType
    x::XType
end

Mul(A::AbstractMatrix, x::AbstractVecOrMat) = Mul(MulStyle(A), MulStyle(x), A, x)

mul!(y::AbstractVector, A::AbstractMatrix, x::AbstractVector) = copyto!(y, Mul(A, x))

copyto!(y::AbstractVector{T}, M::Mul{DenseColumnMajor,<:AbstractStridedLayout,<:AbstractMatrix{T},<:AbstractVector{T}}) where T<:BlasFloat =
    BLAS.gemv!('N', one(T), M.A, M.x, zero(T), y)

One could imagine combining this with Broadcasted to automatically dispatch

z .= α .* A*x .+ β .* y

to gemv!('N', α, A, x, β, copyto!(z, y)), which has the benefit of being allocation-free. This might be possible if the above were first lowered to

materialize!(z, Broadcasted((α, Ax, β, y) -> α*Ax + β*y, (α, Mul(A,x), β, y)))

@mbauman
Member

mbauman commented May 25, 2018

That would indeed be a wonderful feature, but the trouble is that we need to know when to actually perform the multiplication. Broadcasting is able to figure it out because the dot-fusion happens at the parser level — it's a very clear syntax that has well-defined bounds. We materialize as soon as you hit a non-dotted function call. * doesn't get that special parser-level support, so we don't know when that materialize should happen. At the most basic level, we cannot have A*B return a lazy Mul object because we don't have a materialize if it appears outside a broadcasted expression.

@dlfivefifty
Contributor Author

The "one could imagine" line was more of a pipe-dream. But for near term I was proposing

*(A::AbstractMatrix, b::AbstractVector) = materialize(Mul(A, b))

which is certainly doable without parser changes.
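
i.e., roughly as follows (a sketch assuming the Mul and copyto! definitions above; a real implementation would need to choose the output eltype and shape more carefully):

# Eager wrapper over the lazy Mul: allocate a destination, then reuse the
# layout-dispatched copyto! methods
materialize(M::Mul) =
    copyto!(similar(M.x, promote_type(eltype(M.A), eltype(M.x)), size(M.A, 1)), M)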

@dlfivefifty
Contributor Author

I've done a mock up of the lazy Mul approach

https://github.com/dlfivefifty/LazyLinearAlgebra.jl

and it works surprisingly well with the new broadcast, to give nice versions of BLAS routines:

A = randn(5,5); b = randn(5); c = similar(b);

c .= Mul(A,b)
@test all(c .=== BLAS.gemv!('N', 1.0, A, b, 0.0, similar(c)))

c .= 2.0 .* Mul(A,b)
@test all(c .=== BLAS.gemv!('N', 2.0, A, b, 0.0, similar(c)))

c = copy(b)
c .= Mul(A,b) .+ c
@test all(c .=== BLAS.gemv!('N', 1.0, A, b, 1.0, copy(b)))

c = copy(b)
c .= Mul(A,b) .+ 2.0 .* c
@test all(c .=== BLAS.gemv!('N', 1.0, A, b, 2.0, copy(b)))

c = copy(b)
c .= 3.0 .* Mul(A,b) .+ 2.0 .* c
@test all(c .=== BLAS.gemv!('N', 3.0, A, b, 2.0, copy(b)))

d = similar(c)
c = copy(b)
d .= 3.0 .* Mul(A,b) .+ 2.0 .* c
@test all(d .=== BLAS.gemv!('N', 3.0, A, b, 2.0, copy(b)))

I think I'm going to focus on developing that approach as a package, instead of working on it as a PR, as it can exist outside of StdLib. Then the various linear algebra packages (BlockArrays.jl, BandedMatrices.jl, etc.) can use Mul directly instead of LinearAlgebra's mul! and avoid the ambiguity issues with mul!. We can think of merging it into Base in the future.

@dlfivefifty
Contributor Author

FYI I'm planning to make a new proposal for how to approach this based on

https://github.com/JuliaMatrices/ArrayLayouts.jl

This used to be in LazyArrays.jl, which became too heavy. Essentially the idea is very close to the previous PR, except for the following:

  1. MemoryLayout now only depends on the type of the array.
  2. We mimic broadcasting and make the fundamental override materialize!(::MulAdd{LayoutA,LayoutB,LayoutC}). This has more "structure" in the call than the previous version, which tacked the layouts onto mul!.
  3. The set-up is "battle tested" in BandedMatrices, BlockBandedMatrices, QuasiArrays, ContinuumArrays, and dependent packages.
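
For reference, a sketch of what that fundamental override might look like (modeled on the description above and reusing the trait names from the earlier sketches; the field names and the gemm! method shown are assumptions, not necessarily ArrayLayouts.jl's actual code):

# Lazy α*A*B + β*C with the three layouts carried in the type
struct MulAdd{LayoutA, LayoutB, LayoutC, T, AA, BB, CC}
    α::T
    A::AA
    B::BB
    β::T
    C::CC
end

MulAdd(α::T, A, B, β::T, C) where T =
    MulAdd{typeof(MemoryLayout(typeof(A))), typeof(MemoryLayout(typeof(B))),
           typeof(MemoryLayout(typeof(C))), T, typeof(A), typeof(B), typeof(C)}(α, A, B, β, C)

# Packages override materialize! on layouts; e.g. all-column-major hits gemm!
materialize!(M::MulAdd{DenseColumnMajor, DenseColumnMajor, DenseColumnMajor}) =
    BLAS.gemm!('N', 'N', M.α, M.A, M.B, M.β, M.C)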
