[DO NOT MERGE] thread adjoint sparse matrix vector multiply #29525

Closed
wants to merge 1 commit

Conversation

KristofferC
Member

@KristofferC KristofferC commented Oct 4, 2018

What is required for this to be mergeable?

@KristofferC KristofferC added the performance (Must go faster), sparse (Sparse arrays), and multithreading (Base.Threads and related functionality) labels Oct 4, 2018
@KristofferC KristofferC changed the title thread sparse matrix multiply thread sparse matrix vector multiply Oct 4, 2018
@KristofferC KristofferC changed the title thread sparse matrix vector multiply thread adjoint sparse matrix vector multiply Oct 4, 2018
@KristofferC
Member Author

KristofferC commented Oct 5, 2018

And to provide a benchmark: using e.g. the matrix from https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc2/bcsstk16.html, I get with 4 threads

julia> @btime mul!(c, M', x)
  75.789 μs (3 allocations: 96 bytes)

vs master

julia> @btime mul!(c, M', x)
  250.705 μs (2 allocations: 32 bytes)

As a reference, MKL Sparse does this in ~50 μs (with 4 threads).
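
(For context, in y = A'x over a CSC matrix each output element y[col] touches only column col of A, so the column loop parallelizes with no write conflicts. A minimal sketch of such a kernel, with a hypothetical name, written against the public SparseArrays API rather than this PR's actual diff:)

using SparseArrays, LinearAlgebra
using Base.Threads

# Hypothetical threaded kernel for y = A'x with A::SparseMatrixCSC.
# Each y[col] reads only column `col` of A, so iterations are independent.
function threaded_adjoint_mul!(y::AbstractVector, A::SparseMatrixCSC, x::AbstractVector)
    size(A, 1) == length(x) && size(A, 2) == length(y) || throw(DimensionMismatch())
    rv, nz = rowvals(A), nonzeros(A)
    @threads for col in 1:size(A, 2)
        acc = zero(eltype(y))
        for k in nzrange(A, col)
            acc += adjoint(nz[k]) * x[rv[k]]  # adjoint == conj on scalars
        end
        y[col] = acc
    end
    return y
end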

@JeffBezanson
Member

Great test case for #22631.

@StefanKarpinski
Member

Am I understanding correctly that this

(a) already works on master and is (only) 50% slower than MKL?
(b) would potentially be faster (maybe competitive with MKL) after #22631 is merged?

@KristofferC
Member Author

a) yes
b) probably no, at least not in this case where there is no nested threading

@StefanKarpinski
Member

StefanKarpinski commented Oct 5, 2018

Ok, that makes sense. I guess the test would be that #22631 is not making anything worse.

@haampie
Contributor

haampie commented Oct 5, 2018

I guess the most important thing to parallelize is the standard mv-product (not the adjoint). But that seems very tricky to get right. How is MKL's performance for this case? If they get similar performance, I would be very curious how they do it.

We could enable threading for Symmetric{Tv,SparseMatrixCSC{Tv,Ti}} and Hermitian{...} though, exploiting symmetry. And in the case of SpMM we can parallelize the loop over the columns of the dense matrix.
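
(A sketch of that last idea, threading C = A*B over the columns of the dense B; the function name is hypothetical and the code is written against the public SparseArrays API, not this PR's diff:)

using SparseArrays
using Base.Threads

# Hypothetical sketch: C = A * B with sparse A and dense B, threading over
# the columns of B. Column j of C depends only on column j of B, so the
# threads never write to the same memory.
function threaded_spmm!(C::Matrix, A::SparseMatrixCSC, B::Matrix)
    fill!(C, zero(eltype(C)))
    rv, nz = rowvals(A), nonzeros(A)
    @threads for j in 1:size(B, 2)
        for col in 1:size(A, 2)
            bval = B[col, j]
            for k in nzrange(A, col)
                C[rv[k], j] += nz[k] * bval
            end
        end
    end
    return C
end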

@KristofferC
Member Author

KristofferC commented Oct 5, 2018

I guess the most important thing to parallelize is the standard mv-product (not the adjoint).

Well, we should have used CSR then ;). But for algorithms dealing with symmetric matrices you could just call the adjoint one.

MKL does the non-transpose in 80 μs (4 threads) and 244.475 μs (1 thread), so pretty good speedup even there.

@tknopp
Contributor

tknopp commented Oct 6, 2018

While the speedup looks great, I wonder if there should first be a strategy for where in the code we want to apply multithreading. For instance, why here but not in broadcasting?

@KristofferC
Member Author

Because this is an isolated place with concrete Julia Base types. Broadcasting has user-defined types and functions, seems 100x more difficult, and opens up a lot more space for different decisions.

Anyway, I don't expect this to be merged right now, but we have this cool threading framework and I just wanted to show a concrete place where it could be used for a good performance boost, to get the discussion going.

@tknopp
Contributor

tknopp commented Oct 7, 2018

Broadcasting was just an example. Reductions could also benefit from multithreading. My concern was more that after #22631 the threading interface will certainly change.

@StefanKarpinski
Member

There will be other ways to write threaded code but this should still work.

@tknopp
Contributor

tknopp commented Oct 8, 2018

Yes, that's true. Do you want this as a hidden feature then, or should it be documented? (How to get a multithreaded sparse multiplication...)

@ViralBShah
Member

Can we close this? We can't multi-thread things in our libraries as they will interfere with the user's multi-threading. Once we have the partr stuff, we can start doing these things.

@ViralBShah ViralBShah changed the title thread adjoint sparse matrix vector multiply thread adjoint sparse matrix vector multiply - [Do not merge] Jan 28, 2019
@tknopp
Contributor

tknopp commented Jan 28, 2019

I have seen crashes when using nested threading. Is that what you mean by "interfere"? If yes, I am also for waiting for partr.

@KristofferC
Member Author

Merge this when partr is in then? :)

@tknopp
Contributor

tknopp commented Jan 28, 2019

Merge it if nested threading is supported, I would say. I had some real-world examples where I needed to remove the inner threading because of a crash (I do not remember if it was a segfault).

@ViralBShah ViralBShah changed the title thread adjoint sparse matrix vector multiply - [Do not merge] thread adjoint sparse matrix vector multiply Feb 20, 2019
@ViralBShah ViralBShah added the DO NOT MERGE Do not merge this PR! label Feb 20, 2019
@KristofferC
Member Author

Update with results from master using

julia> using MatrixMarket

julia> M = MatrixMarket.mmread("bcsstk16.mtx");

julia> x = rand(size(M, 2)); y = similar(x);

julia> using BenchmarkTools

julia> using LinearAlgebra

julia> @btime mul!(y, M', x);

On master

julia> @btime mul!(y, M', x);
  250.861 μs (1 allocation: 16 bytes)

With this PR:

1 thread:
julia> @btime mul!(y, M', x);
  281.209 μs (10 allocations: 816 bytes)

2 threads:
julia> @btime mul!(y, M', x);
  153.064 μs (17 allocations: 1.48 KiB)

4 threads:
julia> @btime mul!(y, M', x);
  88.616 μs (30 allocations: 2.86 KiB)

So there is a bit of overhead when using @threads and running with 1 thread.
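
(A sketch of one way to avoid that overhead, with hypothetical kernel names: fall back to the existing serial code when Julia runs with a single thread.)

# Hypothetical guard: skip the task-spawning machinery entirely when only
# one thread is available, so the 1-thread numbers would match master.
function adjoint_mul_dispatch!(y, A, x)
    if Threads.nthreads() == 1
        serial_adjoint_mul!(y, A, x)    # assumed name for the existing kernel
    else
        threaded_adjoint_mul!(y, A, x)  # assumed name for this PR's kernel
    end
end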

Anything blocking something like this getting merged now with the new threading system in place?

@ViralBShah
Member

ViralBShah commented Jul 18, 2019

In general, are we going to start threading other parts of stdlib or at least the sparse matrix implementation? The reason I ask is that it would otherwise be odd to have only one function multi-threaded - but then again, one has to start somewhere.

@tknopp
Contributor

tknopp commented Jul 18, 2019

@KristofferC is nested threading now supported by the new system (without crashing but also without overcommitting threads)?

@KristofferC
Member Author

In general, are we going to start threading other parts of stdlib or at least the sparse matrix implementation?

The point of the threading work has surely got to be that we want to start using it.

The reason I ask is that it would otherwise be odd to have only one function multi-threaded - but then again, one has to start somewhere

Indeed you have to start somewhere. This seems like a simple place.

@KristofferC is nested threading now supported by the new system (without crashing but also without overcommitting threads)?

Yes. That is the point of the partr scheduler work

@tknopp
Contributor

tknopp commented Jul 18, 2019

Yes. That is the point of the partr scheduler work

I know, but I was not sure whether the threading macro is already adapted to not create too many tasks when threading macros are nested.

@StefanKarpinski
Member

but I was not sure whether the threading macro is already adapted to not create too many tasks when threading macros are nested.

Tasks are cheap and are mapped onto hardware threads so it's ok to create lots of tasks and do so in a nested fashion—that's exactly how the system is supposed to work.
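
(A minimal illustration of that composability, assuming a Julia version with Threads.@spawn:)

using Base.Threads

# Nested parallelism: both levels of @spawn just enqueue tasks, and the
# scheduler maps however many tasks exist onto the fixed thread pool.
function nested_sum(xs::Vector{Vector{Float64}})
    tasks = map(xs) do x
        Threads.@spawn begin
            mid = length(x) ÷ 2
            inner = Threads.@spawn sum(view(x, 1:mid))  # nested task is fine
            sum(view(x, mid+1:length(x))) + fetch(inner)
        end
    end
    return sum(fetch, tasks)
end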

@ViralBShah ViralBShah removed the DO NOT MERGE Do not merge this PR! label Jul 26, 2019
@StefanKarpinski
Member

Yes, seems like it.

@jebej
Contributor

jebej commented Jul 31, 2019

What's the performance currently?

@KristofferC
Member Author

The slowdown for 1 thread, as noted in #29525 (comment), might be worth discussing a bit more, especially since we only run with 1 thread by default.

@JeffBezanson
Member

May want to hold off due to #32701.

@KristofferC
Member Author

Also, I guess a SparseVector where the nonzeros are backed by a BitArray or something interesting like that is potentially dangerous.
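
(The hazard: a BitArray packs 64 values per UInt64 chunk, so writes to distinct indices from different threads are read-modify-write operations on shared words. A minimal illustration:)

using Base.Threads

# Every index is written exactly once, yet updates can still be lost,
# because neighboring bits share the same underlying UInt64.
b = falses(10_000)
@threads for i in eachindex(b)
    b[i] = true
end
count(b)  # can come out < 10_000 with more than one thread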

@tkf
Member

tkf commented Aug 25, 2019

It would be nice if threaded mul! is defined for transpose as well. I think one possibility is to do a refactoring like I proposed in #33069.

@ViralBShah
Member

Merge this?

@KristofferC
Member Author

KristofferC commented Jan 10, 2020

We introduced this nice new AbstractSparseMatrixCSC, so we can no longer assume that anything using it is thread-safe at all (who knows how the implementation of such a matrix is made).

@ViralBShah
Member

It still has to be a CSC matrix, right? In which case, shouldn't it be safe to multi-thread the operations? We are not updating the matrix itself, so I am not sure how thread safety affects this operation.

@KristofferC
Member Author

Thread safety is a property of the implementation, and since it is Abstract, who knows what anything does. I agree that it is unlikely to cause problems in practice, though. However, #29525 (comment) might be worth thinking about.

@ViralBShah
Member

ViralBShah commented Jan 10, 2020

I still can't see how. I can imagine that if the output was an AbstractVector, it might matter.

Is the worry that a different thread may be updating the AbstractSparseMatrix? If so, that worry would be relevant even in SparseMatrixCSC.

Agree that the overhead of the single-threaded case with @threads is a real issue.

@tknopp
Contributor

tknopp commented Jan 10, 2020

In principle one could implement an AbstractSparseMatrix where the colptr (nonzeros/rowvals) function returns an object which internally changes the representation when getindex is called. But that is pretty hypothetical. I don't see a use case for that.

@KristofferC
Member Author

You could do some caching on getindex calls so that the next getindex in the same column could be faster than a full binary search in some cases etc.
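
(A hypothetical example of such an implementation, where even plain getindex reads mutate shared state and are therefore unsafe to call from multiple threads:)

using SparseArrays

# Hypothetical wrapper whose getindex memoizes the last hit, so repeated
# lookups walking down one column skip re-scanning it. The cache fields
# are shared mutable state, so concurrent reads can corrupt each other.
mutable struct CachingSparse{Tv,Ti} <: AbstractMatrix{Tv}
    parent::SparseMatrixCSC{Tv,Ti}
    last_col::Int  # column of the previous lookup (0 = no cache)
    last_k::Int    # position of the previous hit in the nonzero storage
end
CachingSparse(S::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti} = CachingSparse{Tv,Ti}(S, 0, 0)

Base.size(A::CachingSparse) = size(A.parent)

function Base.getindex(A::CachingSparse{Tv}, i::Int, j::Int) where {Tv}
    rv, nz = rowvals(A.parent), nonzeros(A.parent)
    r = nzrange(A.parent, j)
    # resume scanning from the cached position when still in the same column
    start = (j == A.last_col && A.last_k in r && rv[A.last_k] <= i) ? A.last_k : first(r)
    for k in start:last(r)
        rv[k] > i && break
        if rv[k] == i
            A.last_col, A.last_k = j, k  # racy writes on concurrent access
            return nz[k]
        end
    end
    A.last_col, A.last_k = 0, 0  # also a racy write, even on a miss
    return zero(Tv)
end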

@tknopp
Contributor

tknopp commented Jan 10, 2020

I don't see a binary search there. The question is what the contract of AbstractSparseMatrix is. If the contract is rowvals(A)::Vector and nonzeros(A)::Vector, then it is thread-safe. If that contract does not hold (which is debatable), then it might not be thread-safe.

@KristofferC
Member Author

I don't see a binary search there.

My example wasn't very good. On getindex for a SparseMatrixCSC there is a binary search but for this code, there isn't.

The question is what the contract of AbstractSparseMatrix is. If the contract is rowvals(A)::Vector and nonzeros(A)::Vector, then it is thread-safe. If that contract does not hold (which is debatable), then it might not be thread-safe.

Yeah, my point is mostly that this is something we are going to have to deal with on a larger scale when we want to start threading more things in Base. A lot of our code is quite generic, and I don't think handwaving like "oh, it will probably be thread-safe for most reasonable implementations" is really going to cut it, because people do and should be allowed to have "weird" implementations of e.g. AbstractArray.

@tkf
Member

tkf commented Jan 11, 2020

IIUC, this implementation of mul! was not "thread-safe" even before AbstractSparseMatrixCSC, since you can have an arbitrary element type. For example, * on the elements could ccall an external library that is not thread-safe. I think a safe option here is to just define a new method for SparseMatrixCSC{<:ThreadSafeNumber, <:ThreadSafeInteger} with appropriate unions ThreadSafeNumber and ThreadSafeInteger. (It would be nice to have a trait system to make this more extensible, but I guess that's another topic.)
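
(Something along those lines, with illustrative union members only, reusing the hypothetical kernel sketched earlier in the thread:)

using SparseArrays, LinearAlgebra

# Opt in to the threaded kernel only for element and index types known to
# be thread-safe; everything else keeps hitting the generic serial method.
const ThreadSafeNumber  = Union{Float32, Float64, ComplexF32, ComplexF64, Int32, Int64}
const ThreadSafeInteger = Union{Int32, Int64}
const ThreadableCSC = SparseMatrixCSC{<:ThreadSafeNumber, <:ThreadSafeInteger}

function LinearAlgebra.mul!(y::StridedVector, At::Adjoint{<:Any,<:ThreadableCSC},
                            x::StridedVector)
    return threaded_adjoint_mul!(y, parent(At), x)  # hypothetical kernel
end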

@pablosanjose
Contributor

I'm worried about how this would mix with pmap and distributed computation in general. I've tested this PR in some code where I distribute sparse matrix-vector products over several workers using pmap, and it led to severe oversubscription of my cores and much worse performance. How is such a problem meant to be handled if this is merged?

@ViralBShah ViralBShah changed the title thread adjoint sparse matrix vector multiply [DO NOT MERGE] thread adjoint sparse matrix vector multiply Mar 4, 2020
@ViralBShah ViralBShah marked this pull request as draft May 22, 2021 22:50
@giordano giordano deleted the kc/ohnohedidn branch February 13, 2022 11:23