Special case modn_dense matrix operations to improve performance #15104
Comments
Author: Nils Bruin
comment:1
Straightforward special case implementation of
Various optimizations for modn_dense matrices (faster transpose etc.)
comment:2
Attachment: 15104_modn_dense.patch.gz
As it turns out, special casing
comment:3
(remnants of a comment that was erroneously posted here)
comment:5
Here's the git version based on 6.1, and I've made a few docstring touchups. However, this currently leads to a regression. Here are my timings:
Before:
New commits:
Branch: u/tscrim/ticket/15104
Commit:
comment:6
I can confirm that the branch attached here seems to lead to a performance regression relative to 6.0. I wonder what happened in between. Did empty matrix creation get better? Did Cython get better? It's pretty obvious that a couple of operations here should be way faster and have virtually no overhead, whereas the generic implementations definitely do have overhead. It just seems that the overhead is not as big as it used to be, making this ticket less essential.

The good news is that the performance quoted on #15113 can now be obtained on vanilla 6.0, whereas before I definitely needed the patch here. I think optimizations in the spirit of what is proposed here are still worth considering, but they need to be reevaluated in light of the present state, which is happily much better than 5 months ago!
comment:7
OK, staring at some profile information: it seems that in the transpose, stack, and submatrix cases, virtually all time is spent in creating the parent. It seems the
Changed branch from u/tscrim/ticket/15104 to u/nbruin/ticket/15104
Branch pushed to git repo; I updated commit sha1. New commits:
comment:10
OK, I think I changed all the places where a new matrix is created. I think there's room for even further optimization by someone who understands the intricacies of matrix creation a little better. I'm now getting significantly better timings again. With the new branch:
On vanilla 6.0:
(For instance, for one example on #15113, applying this ticket means a run time of 681ms instead of 731ms, so the difference is significant.)
comment:11
I tried a little experiment for matrix creation. Presently, in
leading to
If I change that line to
this becomes
so there are much lower-overhead ways of creating matrices (if you already have hold of the parent). Of course, in
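The effect described here can be modeled outside Sage with a toy example. All names below are hypothetical stand-ins: in Sage, the analogue of the fast path would be constructing elements through a `MatrixSpace` you already hold, rather than going through a top-level factory that re-resolves the parent on every call.

```python
class ModnMatrixSpace:
    """Toy 'parent' (hypothetical; stands in for Sage's MatrixSpace)."""
    _cache = {}

    def __new__(cls, p, nrows, ncols):
        # parents are expensive to build, so they are cached by key
        key = (p, nrows, ncols)
        if key not in cls._cache:
            obj = super().__new__(cls)
            obj.p, obj.nrows, obj.ncols = p, nrows, ncols
            cls._cache[key] = obj
        return cls._cache[key]

    def element(self, entries):
        # cheap path: the parent is already in hand, just wrap entries
        return (self, [e % self.p for e in entries])

def matrix(p, nrows, ncols, entries):
    # generic factory path: re-resolves the parent on every call
    return ModnMatrixSpace(p, nrows, ncols).element(entries)

space = ModnMatrixSpace(5, 2, 2)
a = space.element([1, 2, 3, 7])     # direct: parent already in hand
b = matrix(5, 2, 2, [1, 2, 3, 7])   # factory: parent lookup each call
assert a == b and a[0] is space
```

The point of the sketch is only structural: both paths produce the same result, but the direct one skips the parent lookup entirely, which is what dominates in the profiles quoted above.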
comment:13
In the ticket description, you say: "taking a transpose of a modn_dense matrix of smallish size is much more expensive than constructing the right kernel, because the method is just a generic implementation." In contrast, I believe that there should be a generic implementation of the transpose, or at least some generic helper methods for creating the transposed matrix. For example, it would make sense for a matrix space to have a method returning the transposed matrix space, and this method should perhaps even be a cached method. When using this helper in all custom implementations of
Moreover, all matrices are supposed to provide fast

    def transpose(self):
        ...
        for i from 0 <= i < ncols:
            for j from 0 <= j < nrows:
                M._entries[j+i*nrows] = self._entries[i+j*ncols]

And is this really the correct thing to do? Namely, in the unsafe set/get method of
Regardless of whether we create a fully fledged generic implementation of
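The index arithmetic in that inner loop can be sanity-checked with a small plain-Python model. This is only a sketch: Sage's `Matrix_modn_dense` stores its entries in a C array, which the flat list here merely imitates.

```python
def transpose_entries(entries, nrows, ncols):
    # entries: row-major flat list of length nrows*ncols.
    # Returns the row-major flat list of the transpose (nrows and
    # ncols swap roles), mirroring the Cython assignment
    #   M._entries[j + i*nrows] = self._entries[i + j*ncols]
    out = [0] * (nrows * ncols)
    for i in range(ncols):        # row index of the transpose
        for j in range(nrows):    # column index of the transpose
            out[j + i * nrows] = entries[i + j * ncols]
    return out

# 2x3 matrix [[1,2,3],[4,5,6]] -> 3x2 transpose [[1,4],[2,5],[3,6]]
assert transpose_entries([1, 2, 3, 4, 5, 6], 2, 3) == [1, 4, 2, 5, 3, 6]
```

So the indexing itself is correct for row-major storage; the open question in this comment is only whether bypassing the safe set/get methods is the right design.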
comment:14
Replying to @simon-king-jena:
I'm not arguing with that. I'm just saying that we should override it on something like matrices over small finite prime fields because the call overhead of the
That might very well be a good idea. It doesn't quite solve the problem for stack/submatrix operations, though.
Yes, for modn_dense it definitely is, because it boils down to reshuffling a C array of floats/doubles. I'm pretty sure the compiler isn't capable of inlining
And of course, in this line we could even save some multiplications, but I wasn't sure whether that makes a proper difference on modern CPUs.
We can reevaluate them with what we learn here, but I'm not sure that the same tradeoffs will apply. For instance, for an integer matrix we cannot expect the entries to be contiguously stored and of the same size, so the memory management just for the entries is already much more expensive.
comment:15
Some timings. I changed
I get:
As you can see, parent creation overhead is still the main thing. With this inner loop
I get:
as one of the better timings. It depends on the matrix creation, but with that fixed, I was also consistently seeing
I've also tried:
which was not really distinguishable from the first solution, but if anything, slightly slower. So my guess is that a multiplication is not something to worry about on modern CPUs. |
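The "save some multiplications" variant discussed in the previous comments is presumably a strength reduction: replacing the per-element products in `i + j*ncols` with running offsets. A plain-Python sketch of both inner loops (hypothetical function names; the real code is a Cython loop over a C array) shows they are equivalent, which matches the observation that the multiplication costs nothing measurable on modern CPUs:

```python
def transpose_mul(entries, nrows, ncols):
    # straightforward version: one multiply per element access
    out = [0] * (nrows * ncols)
    for i in range(ncols):
        for j in range(nrows):
            out[j + i * nrows] = entries[i + j * ncols]
    return out

def transpose_incr(entries, nrows, ncols):
    # strength-reduced version: keep running offsets instead;
    # dst walks the output sequentially, src strides by ncols
    out = [0] * (nrows * ncols)
    dst = 0
    for i in range(ncols):
        src = i
        for j in range(nrows):
            out[dst] = entries[src]
            dst += 1
            src += ncols
    return out

data = list(range(12))
assert transpose_mul(data, 3, 4) == transpose_incr(data, 3, 4)
```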
comment:16
If I understand correctly, you say that we should keep all the custom transpose methods, since the matrix classes should know best how to create one of their instances efficiently. Since you say that the creation of the parent has the most impact, I suggest adding a lazy attribute
Variation of this theme: We could instead say that
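The lazy-attribute idea can be sketched in plain Python with `functools.cached_property` (a hypothetical toy class; Sage's real `MatrixSpace` would use its own `lazy_attribute` machinery instead):

```python
from functools import cached_property

class ToyMatrixSpace:
    """Hypothetical stand-in for a Sage MatrixSpace."""
    def __init__(self, base, nrows, ncols):
        self.base, self.nrows, self.ncols = base, nrows, ncols

    @cached_property
    def transposed(self):
        # built at most once per space; every transpose() of a
        # matrix in this space can then reuse it as its parent
        if self.nrows == self.ncols:
            return self  # a square space is its own transpose
        return ToyMatrixSpace(self.base, self.ncols, self.nrows)

space = ToyMatrixSpace("GF(5)", 2, 3)
t = space.transposed
assert (t.nrows, t.ncols) == (3, 2)
assert space.transposed is t  # cached: no repeated parent creation
```

This is exactly the shape of the fix: the expensive parent creation happens once per space instead of once per transpose.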
Changed branch from u/nbruin/ticket/15104 to u/SimonKing/ticket/15104 |
comment:18
With your branch, I got
With the additional commits that I have just pushed, I get
So, looks like progress. My changes are: I added a lazy attribute to matrix spaces, returning the transposed matrix space, and I am using it in
New commits:
Changed branch from u/SimonKing/ticket/15104 to u/nbruin/ticket/15104 |
comment:45
rebased; ready for review. New commits:
Changed keywords from none to sd86.5 |
Changed work issues from regression in right_kernel_matrix to none |
Changed branch from u/nbruin/ticket/15104 to public/linear_algebra/modn_dense_speedup-15104 |
comment:46
I made some doc improvements and fixed a typo that caused this not to compile. Using the same test order as comment:5:
vs
So the right kernel matrices are faster, but not the
For the
For
Instead of overwriting
However, I don't have a reason why these direct implementations are slower than their generic counterparts.
Addendum: Actually, a good part of the
New commits:
Reviewer: Travis Scrimshaw |
Branch pushed to git repo; I updated commit sha1. New commits:
comment:48
Hm, somehow the changes for better construction of parents didn't make it in. Fixed now. At least nothing is worse now. However, without an improvement it's hard to argue for making the changes. Perhaps there are other cases where the differences are bigger.
Branch pushed to git repo; I updated commit sha1. New commits:
Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:
comment:51
By using
The most amazing thing is the speedup of the right kernel matrix. There was a lot of overhead there; removing it gives about a 10x speedup. If my last changes are good, then positive review.
comment:52
Thanks! This looks fine. It seems this is a uniform improvement, and further optimizations would likely need to consider some new ideas and areas, so this is definitely worth merging. |
Changed branch from public/linear_algebra/modn_dense_speedup-15104 to |
comment:54
For future reference, code like
is unsafe for two reasons:
Luckily, the function
Changed commit from |
Presently, taking a transpose of a modn_dense matrix of smallish size is much more expensive than constructing the right kernel, because the method is just a generic implementation. We can make this much faster. Same for submatrix and stack.
CC: @simon-king-jena @nthiery
Component: linear algebra
Keywords: sd86.5
Author: Nils Bruin, Simon King
Branch:
5c878c7
Reviewer: Travis Scrimshaw
Issue created by migration from https://trac.sagemath.org/ticket/15104