[DO NOT MERGE] thread adjoint sparse matrix vector multiply #29525

Closed
wants to merge 1 commit

Conversation

KristofferC
Member

@KristofferC KristofferC commented Oct 4, 2018

What is required for this to be mergeable?

@KristofferC KristofferC added the performance (Must go faster), sparse (Sparse arrays), and multithreading (Base.Threads and related functionality) labels Oct 4, 2018
@KristofferC KristofferC changed the title thread sparse matrix multiply thread sparse matrix vector multiply Oct 4, 2018
@KristofferC KristofferC changed the title thread sparse matrix vector multiply thread adjoint sparse matrix vector multiply Oct 4, 2018
@KristofferC
Member Author

KristofferC commented Oct 5, 2018

And to provide a benchmark: using e.g. the matrix from https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc2/bcsstk16.html, I get with 4 threads

julia> @btime mul!(c, M', x)
  75.789 μs (3 allocations: 96 bytes)

vs master

julia> @btime mul!(c, M', x)
  250.705 μs (2 allocations: 32 bytes)

As a reference, MKL Sparse does this in ~50 μs (with 4 threads).
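
(For context, in y = A'x over a CSC matrix each output element y[col] touches only column col of A, so the column loop parallelizes with no write conflicts. A minimal sketch of such a kernel, with a hypothetical name, written against the public SparseArrays API rather than this PR's actual diff:)

using SparseArrays, LinearAlgebra
using Base.Threads

# Hypothetical threaded kernel for y = A'x with A::SparseMatrixCSC.
# Each y[col] reads only column `col` of A, so iterations are independent.
function threaded_adjoint_mul!(y::AbstractVector, A::SparseMatrixCSC, x::AbstractVector)
    size(A, 1) == length(x) && size(A, 2) == length(y) || throw(DimensionMismatch())
    rv, nz = rowvals(A), nonzeros(A)
    @threads for col in 1:size(A, 2)
        acc = zero(eltype(y))
        for k in nzrange(A, col)
            acc += adjoint(nz[k]) * x[rv[k]]  # adjoint == conj on scalars
        end
        y[col] = acc
    end
    return y
end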

@JeffBezanson
Member

Great test case for #22631.

@StefanKarpinski
Member

Am I understanding correctly that this

(a) already works on master and is (only) 50% slower than MKL?
(b) would potentially be faster (maybe competitive with MKL) after #22631 is merged?

@KristofferC
Member Author

a) yes
b) probably no, at least not in this case where there is no nested threading

@StefanKarpinski
Member

StefanKarpinski commented Oct 5, 2018

Ok, that makes sense. I guess the test would be that #22631 is not making anything worse.

@haampie
Contributor

haampie commented Oct 5, 2018

I guess the most important thing to parallelize is the standard mv-product (not the adjoint). But that seems very tricky to get right. How is MKL's performance for this case? If they get similar performance, I would be very curious how they do it.

We could enable threading for Symmetric{Tv,SparseMatrixCSC{Tv,Ti}} and Hermitian{...} though, exploiting symmetry. And in the case of SpMM we can parallelize the loop over the columns of the dense matrix.
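
(A sketch of that last idea, threading C = A*B over the columns of the dense B; the function name is hypothetical and the code is written against the public SparseArrays API, not this PR's diff:)

using SparseArrays
using Base.Threads

# Hypothetical sketch: C = A * B with sparse A and dense B, threading over
# the columns of B. Column j of C depends only on column j of B, so the
# threads never write to the same memory.
function threaded_spmm!(C::Matrix, A::SparseMatrixCSC, B::Matrix)
    fill!(C, zero(eltype(C)))
    rv, nz = rowvals(A), nonzeros(A)
    @threads for j in 1:size(B, 2)
        for col in 1:size(A, 2)
            bval = B[col, j]
            for k in nzrange(A, col)
                C[rv[k], j] += nz[k] * bval
            end
        end
    end
    return C
end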

@KristofferC
Member Author

KristofferC commented Oct 5, 2018

I guess the most important thing to parallelize is the standard mv-product (not the adjoint).

Well, we should have used CSR then ;). But for algorithms dealing with symmetric matrices you could just call the adjoint one.

MKL does the non-transpose in 80 μs (4 threads) and 244.475 μs (1 thread), so pretty good speedup even there.

@tknopp
Contributor

tknopp commented Oct 6, 2018

While the speedup looks great, I wonder if there should first be a strategy for where in the code we want to apply multithreading. For instance, why here but not in broadcasting?

@KristofferC
Member Author

Because this is an isolated place with concrete Julia Base types. Broadcasting has user-defined types and functions, seems 100x more difficult, and opens up a lot more space for different decisions.

Anyway, I don't expect this to be merged right now, but we have this cool threading framework and I just wanted to show a concrete place where it could be used for a good performance boost, to get the discussion going.

@tknopp
Contributor

tknopp commented Oct 7, 2018

Broadcasting was just an example. Reductions could also benefit from multithreading. My concern was more that after #22631 the threading interface will certainly change.

@StefanKarpinski
Member

There will be other ways to write threaded code but this should still work.

@tknopp
Contributor

tknopp commented Oct 8, 2018

Yes, that's true. Do you want this as a hidden feature then, or should it be documented? (How to get a multithreaded sparse multiplication...)

@ViralBShah
Member

Can we close this? We can't multi-thread things in our libraries as they will interfere with the user's multi-threading. Once we have the partr stuff, we can start doing these things.

@ViralBShah ViralBShah changed the title thread adjoint sparse matrix vector multiply thread adjoint sparse matrix vector multiply - [Do not merge] Jan 28, 2019
@tknopp
Contributor

tknopp commented Jan 28, 2019

I have seen crashes when using nested threading. Is that what you mean by "interfere"? If yes, I am also for waiting for partr.

@KristofferC
Member Author

Merge this when partr is in then? :)

@tknopp
Contributor

tknopp commented Jan 28, 2019

Merge it if nested threading is supported, I would say. I had some real-world examples where I needed to remove the inner threading because of a crash (I do not remember if it was a segfault).

@ViralBShah ViralBShah changed the title thread adjoint sparse matrix vector multiply - [Do not merge] thread adjoint sparse matrix vector multiply Feb 20, 2019
@ViralBShah ViralBShah added the DO NOT MERGE Do not merge this PR! label Feb 20, 2019
@KristofferC
Member Author

Update with results from master using

julia> using MatrixMarket

julia> M = MatrixMarket.mmread("bcsstk16.mtx");

julia> x = rand(size(M, 2)); y = similar(x);

julia> using BenchmarkTools

julia> using LinearAlgebra

julia> @btime mul!(y, M', x);

On master

julia> @btime mul!(y, M', x);
  250.861 μs (1 allocation: 16 bytes)

With this PR:

1 thread:
julia> @btime mul!(y, M', x);
  281.209 μs (10 allocations: 816 bytes)

2 threads:
julia> @btime mul!(y, M', x);
  153.064 μs (17 allocations: 1.48 KiB)

4 threads:
julia> @btime mul!(y, M', x);
  88.616 μs (30 allocations: 2.86 KiB)

So there is a bit of overhead when using @threads and running with 1 thread.
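
(A sketch of one way to avoid that overhead, with hypothetical kernel names: fall back to the existing serial code when Julia runs with a single thread.)

# Hypothetical guard: skip the task-spawning machinery entirely when only
# one thread is available, so the 1-thread numbers would match master.
function adjoint_mul_dispatch!(y, A, x)
    if Threads.nthreads() == 1
        serial_adjoint_mul!(y, A, x)    # assumed name for the existing kernel
    else
        threaded_adjoint_mul!(y, A, x)  # assumed name for this PR's kernel
    end
end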

Anything blocking something like this getting merged now with the new threading system in place?

@ViralBShah
Member

ViralBShah commented Jul 18, 2019

In general, are we going to start threading other parts of stdlib or at least the sparse matrix implementation? The reason I ask is that it would otherwise be odd to have only one function multi-threaded - but then again, one has to start somewhere.

@tknopp
Contributor

tknopp commented Jul 18, 2019

@KristofferC is nested threading now supported by the new system (without crashing but also without overcommitting threads)?

@KristofferC
Member Author

In general, are we going to start threading other parts of stdlib or at least the sparse matrix implementation?

The point of the threading work has surely got to be that we want to start using it.

The reason I ask is that it would otherwise be odd to have only one function multi-threaded - but then again, one has to start somewhere

Indeed you have to start somewhere. This seems like a simple place.

@KristofferC is nested threading now supported by the new system (without crashing but also without overcommitting threads)?

Yes. That is the point of the partr scheduler work

@tknopp
Contributor

tknopp commented Jul 18, 2019

Yes. That is the point of the partr scheduler work

I know, but I was not sure whether the threading macro is already adapted to not create too many tasks when threading macros are nested.

@StefanKarpinski
Member

but I was not sure whether the threading macro is already adapted to not create too many tasks when threading macros are nested.

Tasks are cheap and are mapped onto hardware threads so it's ok to create lots of tasks and do so in a nested fashion—that's exactly how the system is supposed to work.
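
(A minimal illustration of that composability, assuming a Julia version with Threads.@spawn:)

using Base.Threads

# Nested parallelism: both levels of @spawn just enqueue tasks, and the
# scheduler maps however many tasks exist onto the fixed thread pool.
function nested_sum(xs::Vector{Vector{Float64}})
    tasks = map(xs) do x
        Threads.@spawn begin
            mid = length(x) ÷ 2
            inner = Threads.@spawn sum(view(x, 1:mid))  # nested task is fine
            sum(view(x, mid+1:length(x))) + fetch(inner)
        end
    end
    return sum(fetch, tasks)
end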

@ViralBShah ViralBShah removed the DO NOT MERGE Do not merge this PR! label Jul 26, 2019
@StefanKarpinski
Member

Yes, seems like it.

@jebej
Contributor

jebej commented Jul 31, 2019

What's the performance currently?

@KristofferC
Member Author

The slowdown for 1 thread, as noted in #29525 (comment), might be worth discussing a bit more, especially since we only run with 1 thread by default.

@JeffBezanson
Member

May want to hold off due to #32701.

@KristofferC
Member Author

Also, I guess a SparseVector where the nonzeros are backed by a BitArray or something interesting like that is potentially dangerous.
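
(The hazard: a BitArray packs 64 values per UInt64 chunk, so writes to distinct indices from different threads are read-modify-write operations on shared words. A minimal illustration:)

using Base.Threads

# Every index is written exactly once, yet updates can still be lost,
# because neighboring bits share the same underlying UInt64.
b = falses(10_000)
@threads for i in eachindex(b)
    b[i] = true
end
count(b)  # can come out < 10_000 with more than one thread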

@tkf
Member

tkf commented Aug 25, 2019

It would be nice if threaded mul! is defined for transpose as well. I think one possibility is to do a refactoring like I proposed in #33069.

@ViralBShah
Member

Merge this?

@KristofferC
Member Author

KristofferC commented Jan 10, 2020

We introduced this nice new AbstractSparseMatrixCSC, so we can no longer assume that anything using it is thread-safe at all (who knows how the implementation of such a matrix is made).

@ViralBShah
Member

It still has to be a CSC matrix, right? In which case, shouldn't it be safe to multi-thread the operations? We are not updating the matrix itself, so I am not sure how thread safety affects this operation.

@KristofferC
Member Author

Thread safety is a property of the implementation, and since it is Abstract, who knows what anything does. I agree that it is unlikely to cause problems in practice, though. However, #29525 (comment) might be worth thinking about.

@ViralBShah
Member

ViralBShah commented Jan 10, 2020

I still can't see how. I can imagine that if the output was an AbstractVector, it might matter.

Is the worry that a different thread may be updating the AbstractSparseMatrix? If so, that worry would be relevant even in SparseMatrixCSC.

Agree that the overhead of the single-threaded case with @threads is a real issue.

@tknopp
Contributor

tknopp commented Jan 10, 2020

In principle one could implement an AbstractSparseMatrix where the colptr (nonzeros/rowvals) function returns an object which internally changes the representation when getindex is called. But that is pretty hypothetical. I don't see a use case for that.

@KristofferC
Member Author

You could do some caching on getindex calls so that the next getindex in the same column could be faster than a full binary search in some cases etc.
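
(A hypothetical example of such an implementation, where even plain getindex reads mutate shared state and are therefore unsafe to call from multiple threads:)

using SparseArrays

# Hypothetical wrapper whose getindex memoizes the last hit, so repeated
# lookups walking down one column skip re-scanning it. The cache fields
# are shared mutable state, so concurrent reads can corrupt each other.
mutable struct CachingSparse{Tv,Ti} <: AbstractMatrix{Tv}
    parent::SparseMatrixCSC{Tv,Ti}
    last_col::Int  # column of the previous lookup (0 = no cache)
    last_k::Int    # position of the previous hit in the nonzero storage
end
CachingSparse(S::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti} = CachingSparse{Tv,Ti}(S, 0, 0)

Base.size(A::CachingSparse) = size(A.parent)

function Base.getindex(A::CachingSparse{Tv}, i::Int, j::Int) where {Tv}
    rv, nz = rowvals(A.parent), nonzeros(A.parent)
    r = nzrange(A.parent, j)
    # resume scanning from the cached position when still in the same column
    start = (j == A.last_col && A.last_k in r && rv[A.last_k] <= i) ? A.last_k : first(r)
    for k in start:last(r)
        rv[k] > i && break
        if rv[k] == i
            A.last_col, A.last_k = j, k  # racy writes on concurrent access
            return nz[k]
        end
    end
    A.last_col, A.last_k = 0, 0  # also a racy write, even on a miss
    return zero(Tv)
end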

@tknopp
Contributor

tknopp commented Jan 10, 2020

I don't see a binary search there. The question is what the contract of AbstractSparseMatrix is. If the contract is rowvals(A)::Vector and nonzeros(A)::Vector, then it is thread-safe. If that contract does not hold (which is debatable), then it might not be thread-safe.

@KristofferC
Member Author

I don't see a binary search there.

My example wasn't very good. On getindex for a SparseMatrixCSC there is a binary search but for this code, there isn't.

The question is what the contract of AbstractSparseMatrix is. If the contract is rowvals(A)::Vector and nonzeros(A)::Vector, then it is thread-safe. If that contract does not hold (which is debatable), then it might not be thread-safe.

Yeah, my point is mostly that this is something we are going to have to deal with on a larger scale when we want to start threading more things in Base. A lot of our code is quite generic, and I don't think handwaving like "oh, it will probably be thread-safe for most reasonable implementations" is really going to cut it, because people do and should be allowed to have "weird" implementations of e.g. AbstractArray.

@tkf
Member

tkf commented Jan 11, 2020

IIUC, this implementation of mul! was not "thread-safe" even before AbstractSparseMatrixCSC, since you can have an arbitrary element type. For example, * on the elements could ccall an external library that is not thread-safe. I think a safe option here is to just define a new method for SparseMatrixCSC{<:ThreadSafeNumber, <:ThreadSafeInteger} with appropriate unions ThreadSafeNumber and ThreadSafeInteger. (It would be nice to have a trait system to make this more extensible, but I guess that's another topic.)
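
(Something along those lines, with illustrative union members only, reusing the hypothetical kernel sketched earlier in the thread:)

using SparseArrays, LinearAlgebra

# Opt in to the threaded kernel only for element and index types known to
# be thread-safe; everything else keeps hitting the generic serial method.
const ThreadSafeNumber  = Union{Float32, Float64, ComplexF32, ComplexF64, Int32, Int64}
const ThreadSafeInteger = Union{Int32, Int64}
const ThreadableCSC = SparseMatrixCSC{<:ThreadSafeNumber, <:ThreadSafeInteger}

function LinearAlgebra.mul!(y::StridedVector, At::Adjoint{<:Any,<:ThreadableCSC},
                            x::StridedVector)
    return threaded_adjoint_mul!(y, parent(At), x)  # hypothetical kernel
end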

@pablosanjose
Contributor

I'm worried about how this would mix with pmap and distributed computation in general. I've tested this PR in some code where I distribute sparse matrix-vector products over several workers using pmap, and it led to severe oversubscription of my cores and much worse performance. How is such a problem meant to be handled if this is merged?

@ViralBShah ViralBShah changed the title thread adjoint sparse matrix vector multiply [DO NOT MERGE] thread adjoint sparse matrix vector multiply Mar 4, 2020
@ViralBShah ViralBShah marked this pull request as draft May 22, 2021 22:50
@giordano giordano deleted the kc/ohnohedidn branch February 13, 2022 11:23