[DO NOT MERGE] thread adjoint sparse matrix vector multiply #29525
Conversation
And to provide a benchmark, using e.g. the matrix in https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc2/bcsstk16.html, I get with 4 threads: …
vs master: …
As a reference, MKL Sparse does this in ~50 μs (with 4 threads).
Great test case for #22631.
Am I understanding correctly that this (a) already works on master and is (only) 50% slower than MKL?
a) yes
Ok, that makes sense. I guess the test would be that #22631 is not making anything worse.
I guess the most important thing to parallelize is the standard mv-product (not the adjoint), but that seems very tricky to get right. How is MKL performance for this case? If they get similar performance I would be very curious how they do it. We could enable threading for …
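For context, the serial column-oriented kernel that makes the non-adjoint case hard to thread looks roughly like this (a sketch under assumptions, not the Base implementation; `csc_mul!` is a made-up name):

```julia
using SparseArrays

# Sketch of the standard (non-adjoint) CSC product y = A * x. Each column
# scatters its nonzeros into y at the rows stored for that column, so two
# columns sharing a row write to the same entry of y; naively putting
# @threads on the column loop would therefore race on y.
function csc_mul!(y::AbstractVector, A::SparseMatrixCSC, x::AbstractVector)
    fill!(y, zero(eltype(y)))
    nzv = nonzeros(A)
    rv  = rowvals(A)
    for col in 1:size(A, 2)
        xc = x[col]
        for k in nzrange(A, col)
            y[rv[k]] += nzv[k] * xc   # scattered writes into y
        end
    end
    return y
end
```

The adjoint product avoids this problem because each output entry is a per-column reduction rather than a scatter, which is why it is the case threaded in this PR.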
Well, we should have used CSR then ;). But for algorithms dealing with symmetric matrices you could just call the adjoint one. MKL does the non-transpose in 80 μs (4 threads) and 244.475 μs (1 thread), so pretty good speedup even there.
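A minimal sketch of the symmetric-matrix workaround mentioned above (`S` here is just an illustrative random symmetric matrix, not the benchmark matrix):

```julia
using SparseArrays, LinearAlgebra

A = sprand(10_000, 10_000, 1e-3)
S = A + A'                      # symmetric sparse matrix (illustrative only)
x = rand(size(S, 2)); y = similar(x)

# Since S' == S for a symmetric matrix, the adjoint kernel (the one threaded
# in this PR) computes the same product as S * x.
mul!(y, S', x)
y ≈ S * x                       # true
```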
While the speedup looks great, I wonder if there should first be a strategy for where in the code we want to apply multithreading. For instance, why here but not in broadcasting?
Because this is an isolated place with concrete Julia Base types. Broadcasting has user-defined types and functions, seems 100x more difficult, and opens up a lot more space for different decisions. Anyway, I don't expect this to be merged right now, but we have this cool threading framework and I just wanted to show a concrete place where it could be used for a good performance boost, to get the discussion going.
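For concreteness, the kernel under discussion has roughly the following shape (a hedged sketch, not the PR's actual diff; `threaded_adjoint_mul!` is a made-up name). Each entry of `y` is a per-column reduction, so splitting the column loop across threads does not share writes between threads:

```julia
using SparseArrays, LinearAlgebra, Base.Threads

# Sketch of a threaded y = A' * x for a SparseMatrixCSC: output entry y[col]
# depends only on column `col` of A, so the column loop parallelizes without
# different threads writing to the same entry of y.
function threaded_adjoint_mul!(y::AbstractVector, A::SparseMatrixCSC, x::AbstractVector)
    nzv = nonzeros(A)
    rv  = rowvals(A)
    @threads for col in 1:size(A, 2)
        acc = zero(eltype(y))
        for k in nzrange(A, col)
            acc += adjoint(nzv[k]) * x[rv[k]]
        end
        y[col] = acc
    end
    return y
end
```

As the benchmarks later in the thread show, the task overhead makes the 1-thread case slightly slower than the serial kernel, while 2 and 4 threads give a clear speedup.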
Broadcasting was just an example. Reductions could also benefit from multithreading. My concern was more that after #22631 the threading interface will certainly change.
There will be other ways to write threaded code but this should still work.
Yes, that's true. Do you want this then as a hidden feature, or should it be documented? (How to get a multithreaded sparse multiplication...)
Can we close this? We can't multi-thread things in our libraries as they will interfere with the user's multi-threading. Once we have the partr stuff, we can start doing these things.
I have seen crashes when using nested threading. Is that what you mean by "interfere"? If yes, I am also for waiting for partr.
Merge this when partr is in then? :)
Merge if nested threading is supported, I would say. I had some real-world examples where I needed to remove the inner threading because of a crash (I do not remember if it was a segfault).
Update with results from master, using:

```julia
julia> using MatrixMarket

julia> M = MatrixMarket.mmread("bcsstk16.mtx");

julia> x = rand(size(M, 2)); y = similar(x);

julia> using BenchmarkTools

julia> using LinearAlgebra
```

On master:

```julia
julia> @btime mul!(y, M', x);
  250.861 μs (1 allocation: 16 bytes)
```

With this PR:

```julia
# 1 thread
julia> @btime mul!(y, M', x);
  281.209 μs (10 allocations: 816 bytes)

# 2 threads
julia> @btime mul!(y, M', x);
  153.064 μs (17 allocations: 1.48 KiB)

# 4 threads
julia> @btime mul!(y, M', x);
  88.616 μs (30 allocations: 2.86 KiB)
```

So there is a bit of overhead when using …

Anything blocking something like this getting merged now with the new threading system in place?
In general, are we going to start threading other parts of stdlib, or at least the sparse matrix implementation? The reason I ask is that it would otherwise be odd to have only one function multi-threaded - but then again, one has to start somewhere.
@KristofferC is nested threading now supported by the new system (without crashing but also without overcommitting threads)?
The point of the threading work surely has to be that we want to start using it.
Indeed you have to start somewhere. This seems like a simple place.
Yes. That is the point of the partr scheduler work.
I know, but I was not sure whether the threading macro is already adapted to not create too many tasks if threading macros are nested.
Tasks are cheap and are mapped onto hardware threads, so it's ok to create lots of tasks and do so in a nested fashion; that's exactly how the system is supposed to work.
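A tiny illustration of that model (hypothetical code; `nested_sum` and the data layout are made up): both loop levels spawn tasks, and the scheduler multiplexes all of them onto the fixed pool of hardware threads.

```julia
using Base.Threads

# Nested task spawning: the outer level spawns a task per block, each of which
# spawns a task per chunk. The scheduler maps all tasks onto the available
# hardware threads without oversubscribing them.
function nested_sum(blocks)
    outer = map(blocks) do block
        Threads.@spawn begin
            inner = map(block) do chunk
                Threads.@spawn sum(chunk)
            end
            sum(fetch, inner)
        end
    end
    return sum(fetch, outer)
end

blocks = [[rand(1_000) for _ in 1:4] for _ in 1:4]
nested_sum(blocks)
```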
Yes, seems like it.
What's the performance currently?
The slowdown for 1 thread, as mentioned in #29525 (comment), might be worth discussing a bit more, especially since we only run with 1 thread by default.
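One possible way to deal with that single-thread overhead (purely a sketch, not something proposed in the PR; `guarded_adjoint_mul!` is a made-up name) is to fall back to a plain serial loop when only one thread is available:

```julia
using SparseArrays, LinearAlgebra, Base.Threads

# Hypothetical guard: only pay the task-spawning overhead when more than one
# thread can actually run the work.
function guarded_adjoint_mul!(y::AbstractVector, A::SparseMatrixCSC, x::AbstractVector)
    nzv = nonzeros(A)
    rv  = rowvals(A)
    colsum = col -> begin
        acc = zero(eltype(y))
        for k in nzrange(A, col)
            acc += adjoint(nzv[k]) * x[rv[k]]
        end
        y[col] = acc
    end
    if Threads.nthreads() == 1
        foreach(colsum, 1:size(A, 2))      # serial path, no task allocations
    else
        @threads for col in 1:size(A, 2)   # threaded path
            colsum(col)
        end
    end
    return y
end
```

A size threshold could serve the same purpose for small matrices, where the task overhead dominates.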
May want to hold off due to #32701.
Also, I guess a …
It would be nice if threaded …
Merge this?
c5713c6 to ab91b87 (compare)
We introduced this nice new …
It still has to be a CSC matrix, right? In which case, shouldn't it be safe to multi-thread the operations? We are not updating the matrix itself, so I am not sure how thread safety affects this operation.
Thread safety is a property of the implementation, and since it is …
I still can't see how. I can imagine that if the output was an AbstractVector, it might matter. Is the worry that a different thread may be updating the AbstractSparseMatrix? If so, that worry would be relevant even for SparseMatrixCSC. Agreed that the overhead of the single-threaded case with @threads is a real issue.
In principle one could implement an …
You could do some caching on …
I don't see a binary search there. The question is what the contract of …
My example wasn't very good. On …
Yeah, my point is mostly that this is something we are going to have to deal with on a larger scale when we want to start threading things more in Base. A lot of our code is quite generic, and I don't think handwaving like "oh, it will probably be thread-safe for most reasonable implementations" is really going to cut it, because people do and should be allowed to have "weird" implementations of e.g. …
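As an entirely hypothetical illustration of that concern: an `AbstractVector` whose `getindex` caches the last element it returned is a legal implementation, yet reading it from several threads races on the cache, so "read-only" generic code is not automatically thread-safe.

```julia
# Hypothetical "weird" but valid AbstractVector: getindex memoizes the last
# accessed element. Concurrent reads from multiple threads race on the two
# Ref fields even though callers only ever "read" the vector.
struct CachingVector{T} <: AbstractVector{T}
    data::Vector{T}
    lastidx::Base.RefValue{Int}
    lastval::Base.RefValue{T}
end

CachingVector(data::Vector{T}) where {T} =
    CachingVector{T}(data, Ref(0), Ref(first(data)))

Base.size(v::CachingVector) = size(v.data)

function Base.getindex(v::CachingVector, i::Int)
    if v.lastidx[] == i
        return v.lastval[]      # may observe a stale value under concurrency
    end
    val = v.data[i]
    v.lastidx[] = i             # unsynchronized writes: a data race when
    v.lastval[] = val           # getindex runs on several threads at once
    return val
end
```

Feeding such a vector as `x` into a threaded `mul!` would be a data race even though no element is ever modified, which is the kind of implementation a generic threaded method would have to either rule out or document.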
IIUC, this implementation of …
I'm worried how this would mix with …
What is required for this to be mergeable?