
Special case modn_dense matrix operations to improve performance #15104

Closed
nbruin opened this issue Aug 27, 2013 · 68 comments

Comments

@nbruin
Contributor

nbruin commented Aug 27, 2013

Presently, taking the transpose of a modn_dense matrix of smallish size is much more expensive than constructing the right kernel, because the method is just a generic implementation. We can make this much faster. The same holds for submatrix and stack.

CC: @simon-king-jena @nthiery

Component: linear algebra

Keywords: sd86.5

Author: Nils Bruin, Simon King

Branch: 5c878c7

Reviewer: Travis Scrimshaw

Issue created by migration from https://trac.sagemath.org/ticket/15104

@nbruin nbruin added this to the sage-6.1 milestone Aug 27, 2013
@nbruin
Contributor Author

nbruin commented Aug 27, 2013

Author: Nils Bruin

@nbruin
Contributor Author

nbruin commented Aug 27, 2013

comment:1

Straightforward special case implementation of transpose, stack, submatrix for Matrix_modn_dense_template, i.e., for Matrix_modn_dense_float and Matrix_modn_dense_double. The same optimization would probably benefit many other implementations.
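The specialized transpose being proposed can be illustrated with a plain-Python analogue (not the actual Cython code from the patch): a modn_dense matrix stores its entries in one flat row-major array, so transpose is a direct index shuffle with no per-entry method-call overhead.

```python
def transpose_flat(entries, nrows, ncols):
    """Transpose a matrix stored as a flat row-major list.

    Plain-Python sketch of the index arithmetic used in the patch:
    result entry (i, j) at index j + i*nrows is read from source
    entry (j, i) at index i + j*ncols.
    """
    out = [0] * (nrows * ncols)
    for i in range(ncols):        # row i of the result
        for j in range(nrows):    # column j of the result
            out[j + i * nrows] = entries[i + j * ncols]
    return out

# 2x3 matrix [[1, 2, 3], [4, 5, 6]] -> 3x2 [[1, 4], [2, 5], [3, 6]]
print(transpose_flat([1, 2, 3, 4, 5, 6], 2, 3))  # [1, 4, 2, 5, 3, 6]
```

In the Cython version this loop runs over a C array of floats/doubles, which is why it can beat a generic implementation that goes through `get_unsafe`/`set_unsafe` calls per entry.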

@nbruin
Contributor Author

nbruin commented Aug 27, 2013

Various optimizations for modn_dense matrices (faster transpose etc.)

@nbruin
Contributor Author

nbruin commented Aug 27, 2013

comment:2

Attachment: 15104_modn_dense.patch.gz

As it turns out, special casing right_kernel_matrix is also very worthwhile.

@nbruin
Contributor Author

nbruin commented Aug 28, 2013

comment:3

(remnants of a comment that was erroneously posted here)

@sagetrac-vbraun-spam sagetrac-vbraun-spam mannequin modified the milestones: sage-6.1, sage-6.2 Jan 30, 2014
@tscrim
Collaborator

tscrim commented Feb 2, 2014

comment:5

Here's the git version based on 6.1, and I've made a few docstring touchups. However, this currently leads to a regression. Here are my timings:

sage: M=matrix(GF(5),6,6,range(36))
sage: %timeit M.right_kernel_matrix(basis='computed')
1000 loops, best of 3: 536 us per loop
sage: %timeit M.right_kernel_matrix(basis='pivot')
1000 loops, best of 3: 556 us per loop
sage: %timeit M.transpose()
10000 loops, best of 3: 19.8 us per loop
sage: %timeit M.stack(M)
10000 loops, best of 3: 165 us per loop
sage: %timeit M.submatrix(1,1)
10000 loops, best of 3: 38.3 us per loop

Before:

sage: M=matrix(GF(5),6,6,range(36))
sage: %timeit M.right_kernel_matrix(basis='computed') 
1000 loops, best of 3: 250 us per loop
sage: %timeit M.right_kernel_matrix(basis='pivot')
10000 loops, best of 3: 114 us per loop
sage: %timeit M.transpose()
100000 loops, best of 3: 14 us per loop
sage: %timeit M.stack(M)
10000 loops, best of 3: 51.2 us per loop
sage: %timeit M.submatrix(1,1)
10000 loops, best of 3: 37.5 us per loop

New commits:

a84c802 Various optimizations for modn_dense matrices (faster transpose etc.)
b2d2f1b Some reviewer docstring tweaks.

@tscrim
Collaborator

tscrim commented Feb 2, 2014

Branch: u/tscrim/ticket/15104

@tscrim
Collaborator

tscrim commented Feb 2, 2014

Commit: b2d2f1b

@nbruin
Contributor Author

nbruin commented Feb 2, 2014

comment:6

I can confirm that the branch attached here seems to lead to a performance regression relative to 6.0. I wonder what happened in between. Did empty matrix creation get better? Did Cython get better? It's pretty obvious that a couple of operations here should be way faster and have virtually no overhead, whereas the generic implementations definitely do have overhead. It just seems that the overhead is not as big as it used to be, making this ticket less essential.

The good news is that the performance quoted on #15113 can now be obtained on vanilla 6.0, whereas before I definitely needed the patch here. I think optimizations in the spirit of what is proposed here are still worth considering, but they need to be reevaluated in light of the present state, which is happily much better than 5 months ago!

@nbruin
Contributor Author

nbruin commented Feb 2, 2014

comment:7

OK, staring at some profile information: it seems that in the transpose, stack, and submatrix cases, virtually all time is spent in creating the parent. It seems the self.new_matrix call is a much better way of getting a new matrix. When I change that, the methods proposed here are again considerably faster. Since new_matrix is such a generic routine itself, I think it should be possible to further save overhead on that -- but clearly it's better than the explicit parent creation the code here was using before.

@nbruin
Contributor Author

nbruin commented Feb 2, 2014

Changed branch from u/tscrim/ticket/15104 to u/nbruin/ticket/15104

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Feb 2, 2014

Branch pushed to git repo; I updated commit sha1. New commits:

a908e28 trac #15104: faster creation of new matrices (there should be even lower overhead solutions)

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Feb 2, 2014

Changed commit from b2d2f1b to a908e28

@nbruin
Contributor Author

nbruin commented Feb 2, 2014

comment:10

OK, I think I changed all the places where a new matrix is created. I think there's room for even further optimization by someone who understands the intricacies of matrix creation a little better. I'm now getting significantly better timings again. With the new branch:

sage: k=GF(17)
sage: n=20
sage: m=30
sage: M=matrix(k,n,m,[k.random_element() for j in range(n*m)])
sage: %timeit M.right_kernel_matrix(basis='computed')
10000 loops, best of 3: 88.2 us per loop
sage: %timeit M.right_kernel_matrix(basis='pivot')
10000 loops, best of 3: 91.5 us per loop
sage: %timeit M.transpose()
100000 loops, best of 3: 13.7 us per loop
sage: %timeit M.stack(M)
10000 loops, best of 3: 19.3 us per loop
sage: %timeit M.submatrix(1,1)
100000 loops, best of 3: 14.1 us per loop
sage: %timeit M.submatrix(3,nrows=5)
100000 loops, best of 3: 14 us per loop

On vanilla 6.0:

sage: k=GF(17)
sage: n=20
sage: m=30
sage: M=matrix(k,n,m,[k.random_element() for j in range(n*m)])
sage: %timeit M.right_kernel_matrix(basis='computed')
10000 loops, best of 3: 141 us per loop
sage: %timeit M.right_kernel_matrix(basis='pivot')
10000 loops, best of 3: 143 us per loop
sage: %timeit M.transpose()
10000 loops, best of 3: 22.2 us per loop
sage: %timeit M.stack(M)
10000 loops, best of 3: 50.3 us per loop
sage: %timeit M.submatrix(1,1)
10000 loops, best of 3: 37.1 us per loop
sage: %timeit M.submatrix(3,nrows=5)
10000 loops, best of 3: 22.3 us per loop

(For instance, for one example on #15113, applying this ticket means a run time of 681ms instead of 731ms, so the difference is significant.)

@nbruin
Contributor Author

nbruin commented Feb 4, 2014

comment:11

I tried a little experiment for matrix creation. Presently, in M.__neg__() the matrix gets created via

        M = self.__class__.__new__(self.__class__, self._parent,None,None,None)

leading to

sage: k=GF(17); M=matrix(20,30,[k.random_element() for i in range(600)])
sage: %timeit M.__neg__()
100000 loops, best of 3: 2.81 us per loop

If I change that line to

        M = self.new_matrix()

this becomes

sage: %timeit M.__neg__()
100000 loops, best of 3: 11.8 us per loop

so there are much lower overhead ways of creating matrices (if you already have a hold of the parent).

Of course, in submatrix and stack you don't have the parent in hand, and in transpose you only do if the matrix happens to be square (which may be worth special casing!). However, when you're doing frequent matrix computations, the matrices often end up having similar shapes, so the parents you're looking for are probably already available. This suggests that new_matrix should have access to a cache of parents (weakly referenced or rotating), so that it can do really quick (empty) matrix creation.
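The suggested cache of parents can be sketched in plain Python (hypothetical names; this is not Sage's actual API): matrix spaces are cached per (ring, nrows, ncols) in a weak-value dictionary, so creating another matrix of an already-seen shape skips the expensive parent construction, while unused spaces can still be garbage collected.

```python
import weakref

class MatrixSpace:
    """Stand-in for a Sage matrix space; only shape bookkeeping here."""
    def __init__(self, ring, nrows, ncols):
        self.ring, self.nrows, self.ncols = ring, nrows, ncols

# Weakly-referenced cache, as suggested: entries vanish once no matrix
# holds on to the space any more.
_space_cache = weakref.WeakValueDictionary()

def get_matrix_space(ring, nrows, ncols):
    key = (ring, nrows, ncols)
    space = _space_cache.get(key)
    if space is None:
        space = MatrixSpace(ring, nrows, ncols)
        _space_cache[key] = space
    return space

# Same shape -> same cached parent object, no reconstruction.
a = get_matrix_space("GF(17)", 20, 30)
b = get_matrix_space("GF(17)", 20, 30)
print(a is b)  # True
```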

@simon-king-jena
Member

comment:13

In the ticket description, you say: "taking a transpose of a modn_dense matrix of smallish size is much more expensive than constructing the right kernel, because the method is just a generic implementation."

In contrast, I believe that there should be a generic implementation of the transpose, or at least there should be some generic helper methods for creating the transposed matrix.

For example, it would make sense to have a method of a matrix space returning the transposed matrix space, and this method could perhaps even be a cached method. Using this helper in all custom implementations of .transpose() would reduce the time needed to create the transposed matrix's parent.

Moreover, all matrices are supposed to provide fast set/get_unsafe() methods. Are you really sure that a generic implementation relying on these fast "unsafe" methods would be slower than a custom implementation such as the one of Matrix_modn_dense that uses the underlying data structure directly, rather than using set/get_unsafe:

    def transpose(self):
        ...
        for i from 0 <= i < ncols:
            for j from 0 <= j < nrows:
                M._entries[j+i*nrows] = self._entries[i+j*ncols]    

And is this really the correct thing to do? Namely, in the unsafe set/get methods of Matrix_modn_dense_float, the operations are done on the attribute ._matrix, not ._entries. If I understand correctly, ._matrix is arranged in rows, whereas ._entries is the same chunk of memory, but not separated into rows. But then, wouldn't it be faster to use ._matrix as long as one stays in the same row (which, in the above loop over j, is the case for M)?
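The two access paths being compared can be modeled in plain Python (an analogue only: here the "rows" are list slices, whereas in the Cython class ._matrix holds pointers into the same buffer): ._entries is one flat row-major buffer, ._matrix is a per-row view, and both index the same element.

```python
def make_row_views(entries, nrows, ncols):
    """Analogue of _matrix: row j is entries[j*ncols : (j+1)*ncols].

    Note: these are copies in Python; in the Cython class they are
    pointers into the same memory as _entries.
    """
    return [entries[j * ncols:(j + 1) * ncols] for j in range(nrows)]

entries = [1, 2, 3, 4, 5, 6]          # 2x3 matrix, row-major
rows = make_row_views(entries, 2, 3)

# Flat access and row access agree for every (j, i):
# rows[j][i] is the same element as entries[i + j*ncols].
ok = all(rows[j][i] == entries[i + j * 3]
         for j in range(2) for i in range(3))
print(ok)  # True
```

Whether the per-row indirection (one extra memory access) beats the flat-index multiplication is exactly the trade-off debated in the following comments.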

Regardless whether we create a fully fledged generic implementation of .transpose() or just provide new helper methods: We need to address the following custom .transpose() methods:

matrix/matrix_integer_dense.pyx:4770:    def transpose(self):
matrix/matrix_mod2_dense.pyx:1430:    def transpose(self):
matrix/matrix_double_dense.pyx:2242:    def transpose(self):
matrix/matrix_sparse.pyx:365:    def transpose(self):
matrix/matrix_rational_dense.pyx:2356:    def transpose(self):
matrix/matrix_modn_sparse.pyx:684:    def transpose(self):
matrix/matrix_dense.pyx:131:    def transpose(self):
modules/free_module_element.pyx:1119:    def transpose(self):
misc/functional.py:1746:def transpose(x):
misc/table.py:366:    def transpose(self):
combinat/alternating_sign_matrix.py:311:    def transpose(self):
libs/ntl/ntl_mat_GF2E.pyx:538:    def transpose(ntl_mat_GF2E self):
libs/ntl/ntl_mat_GF2.pyx:559:    def transpose(ntl_mat_GF2 self):
matroids/lean_matrix.pyx:332:    cdef LeanMatrix transpose(self):
matroids/lean_matrix.pyx:695:    cdef LeanMatrix transpose(self):
matroids/lean_matrix.pyx:1114:    cdef LeanMatrix transpose(self):
matroids/lean_matrix.pyx:1734:    cdef LeanMatrix transpose(self):
matroids/lean_matrix.pyx:2250:    cdef LeanMatrix transpose(self):
matroids/lean_matrix.pyx:2706:    cdef LeanMatrix transpose(self):

@nbruin
Contributor Author

nbruin commented Feb 4, 2014

comment:14

Replying to @simon-king-jena:

In the ticket description, you say: "taking a transpose of a modn_dense matrix of smallish size is much more expensive than constructing the right kernel, because the method is just a generic implementation."

In contrast, I believe that there should be a generic implementation of the transpose, or at least there should be some generic helper methods for creating the transposed matrix.

I'm not arguing with that. I'm just saying that we should override it for something like matrices over small finite prime fields, because the call overhead of the get_unsafe and set_unsafe functions is really noticeable relative to the straight C-level reshuffling of data one can do in this specific case.

For example, it should make sense to have a method of a matrix space returning the transposed matrix space, and this method should perhaps even be a cached method. When using this helper in all custom implementations of .transpose(), then the time to create the transposed matrix' parent could be reduced.

That might very well be a good idea. It doesn't quite solve the problem for stack/submatrix operations, though.

Moreover, all matrices are supposed to provide fast set/get_unsafe() methods. Are you really sure that a generic implementation relying on these fast "unsafe" methods would be slower than a custom implementation such as the one of Matrix_modn_dense that uses the underlying data structure directly, rather than using set/get_unsafe

Yes, for modn_dense it definitely is, because it boils down to reshuffling a C array of floats/doubles. I'm pretty sure the compiler isn't capable of inlining get_unsafe and set_unsafe.

                M._entries[j+i*nrows] = self._entries[i+j*ncols]    

And of course, in this line we could even save some multiplications, but I wasn't sure whether that makes a proper difference on modern CPUs.

And is this really the correct thing to do? Namely, in the unsafe set/get method of Matrix_modn_dense_float, the operations are done on the attribute ._matrix, not ._entries. If I understand correctly, ._matrix is arranged in rows,

._matrix is an array of pointers into ._entries (and yes, in my code I assumed that Matrix_modn_dense_template always has its entries laid out in the same way. That assumption is made elsewhere in the file too). Looking up via ._matrix might save a multiplication at the expense of an extra memory access. My gut feeling was that the memory access was going to be worse, but I didn't extensively check. My guess would be that if the multiplications are costly, it's better to avoid them through repeated additions (we need the intermediate results) than by extra memory indirection. Perhaps replace everything by straight pointer arithmetic on a pointer starting at M._entries?

We need to address the following custom .transpose() methods:

We can reevaluate them with what we learn here, but I'm not sure that the same tradeoffs will apply. For instance, for an integer matrix we cannot expect the entries to be contiguously stored and of the same size, so the memory management just for the entries is already much more expensive.

@nbruin
Contributor Author

nbruin commented Feb 6, 2014

comment:15

Some timings. I changed dense_template.transpose to use the given parent for square matrices.
With this inner copy loop:

        for i from 0 <= i < ncols:
            for j from 0 <= j < nrows:
                M._entries[j+i*nrows] = self._entries[i+j*ncols]    

I get:

sage: k=GF(17)
sage: A=matrix(k,100,101,[k.random_element() for i in range(100*101)])
sage: B=matrix(k,100,100,[k.random_element() for i in range(100*100)])
sage: %timeit At=A.transpose()
10000 loops, best of 3: 30.3 us per loop
sage: %timeit Bt=B.transpose()
100000 loops, best of 3: 11 us per loop

As you can see, parent creation overhead is still the main thing.

With this inner loop

        for i from 0 <= i < ncols:
            for j from 0 <= j < nrows:
                M._matrix[i][j] = self._matrix[j][i]    

I get:

sage: %timeit At=A.transpose()
10000 loops, best of 3: 35.8 us per loop
sage: %timeit Bt=B.transpose()
100000 loops, best of 3: 15.8 us per loop

as one of the better timings. Depending on the matrix creation (but consistent once that was fixed), I was also seeing 23.4 us, which I guess happens when the ._matrix pointer array is unfortunately placed in memory relative to ._entries (cache thrashing, perhaps).

I've also tried:

        Midx=0
        for i from 0 <= i < ncols:
            selfidx=i
            for j from 0 <= j < nrows:
                M._entries[Midx]=self._entries[selfidx]
                Midx+=1
                selfidx+=ncols

which was not really distinguishable from the first solution, but if anything, slightly slower. So my guess is that a multiplication is not something to worry about on modern CPUs.

@simon-king-jena
Member

comment:16

If I understand correctly, you say that we should keep all the custom transpose methods, since the matrix classes should know best how to create one of their instances efficiently.

Since you say that the creation of the parent has the most impact, I suggest to add a lazy attribute transposed to matrix spaces, and this can be used to create a blank matrix of the correct dimension that can then be filled by whatever custom method we have.

Variation of this theme: We could instead say that self.new_matrix() should not only check whether the to-be-created matrix lives in the same parent as self, but make a second special case for the transposed dimensions.
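The lazy `transposed` attribute being proposed can be sketched in plain Python (hypothetical names, using `functools.cached_property` as a stand-in for Sage's lazy attributes): the transposed space is built at most once per matrix space, and a square space is its own transpose.

```python
from functools import cached_property

class MatrixSpace:
    """Stand-in for a Sage matrix space; shape bookkeeping only."""
    def __init__(self, ring, nrows, ncols):
        self.ring, self.nrows, self.ncols = ring, nrows, ncols

    @cached_property
    def transposed(self):
        # Square case: the transpose lives in the same space, so no
        # new parent is needed at all.
        if self.nrows == self.ncols:
            return self
        return MatrixSpace(self.ring, self.ncols, self.nrows)

S = MatrixSpace("GF(17)", 20, 30)
T = S.transposed
print((T.nrows, T.ncols))   # (30, 20)
print(S.transposed is T)    # True: computed once, then cached
```

Any custom .transpose() can then fetch `self._parent.transposed` instead of rebuilding the parent, which is where the profiles above say most of the time goes.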

@simon-king-jena
Member

Changed branch from u/nbruin/ticket/15104 to u/SimonKing/ticket/15104

@simon-king-jena
Member

comment:18

With your branch, I got

sage: k=GF(17)
sage: A=matrix(k,100,101,[k.random_element() for i in range(100*101)])
sage: B=matrix(k,100,100,[k.random_element() for i in range(100*100)])
sage: %timeit At=A.transpose()
10000 loops, best of 3: 68.5 us per loop
sage: %timeit Bt=B.transpose()
10000 loops, best of 3: 48.7 us per loop

With the additional commits that I have just pushed, I get

sage: k=GF(17)
sage: A=matrix(k,100,101,[k.random_element() for i in range(100*101)])
sage: B=matrix(k,100,100,[k.random_element() for i in range(100*100)])
sage: %timeit At=A.transpose()
10000 loops, best of 3: 49.9 us per loop
sage: %timeit Bt=B.transpose()
10000 loops, best of 3: 49.3 us per loop

So, looks like progress.

My changes are: I added a lazy attribute to matrix spaces, returning the transposed matrix space, and I am using it in .new_matrix(). This has the advantage that it is a generic method used by different custom implementations of .transpose(). Hence, there should be a speed-up for all types of matrices, not only for Matrix_modn_dense_float.


New commits:

5487c67 Trac 15104: Faster creation of transposed matrix' parent
0f008f9 Trac 15104: Add a test for the new lazy attribute

@nbruin
Contributor Author

nbruin commented Jun 7, 2017

Changed branch from u/SimonKing/ticket/15104 to u/nbruin/ticket/15104

@nbruin
Contributor Author

nbruin commented Jun 7, 2017

comment:45

rebased; ready for review.


New commits:

ce2cfdb trac15104: some special cased modn_dense matrix operations for improved performance

@nbruin
Contributor Author

nbruin commented Jun 7, 2017

Changed commit from 0f008f9 to ce2cfdb

@nbruin
Contributor Author

nbruin commented Jun 7, 2017

Changed keywords from none to sd86.5

@nbruin
Contributor Author

nbruin commented Jun 7, 2017

Changed work issues from regression in right_kernel_matrix to none

@tscrim
Collaborator

tscrim commented Jun 7, 2017

@tscrim
Collaborator

tscrim commented Jun 7, 2017

Changed commit from ce2cfdb to 10fa935

@tscrim
Collaborator

tscrim commented Jun 7, 2017

comment:46

I made some doc improvements and fixed a typo that caused this to not compile.

Using the same test order as comment:5:

10000 loops, best of 3: 93.2 µs per loop
10000 loops, best of 3: 95.5 µs per loop
100000 loops, best of 3: 3.95 µs per loop
10000 loops, best of 3: 89.3 µs per loop
100000 loops, best of 3: 8.08 µs per loop

vs develop:

10000 loops, best of 3: 132 µs per loop
10000 loops, best of 3: 133 µs per loop
100000 loops, best of 3: 2.73 µs per loop
100000 loops, best of 3: 6.69 µs per loop
100000 loops, best of 3: 8.38 µs per loop

So the right kernel matrices are faster, but not the transpose, stack (this one is really bad), and submatrix.

For the transpose, you are not using the self._parent.transposed lazy attribute.

For stack, the meat of the code should be in cdef _stack_impl(self, bottom): (see matrix1.pyx).

Instead of overriding submatrix, it would be better to define matrix_from_rows_and_columns.

However, I don't have an explanation for why these direct implementations are slower than their generic counterparts.

Addendum: Actually, a good part of the stack slowdown might be because you are doing the bulk of the work in a def function rather than a cdef.
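The stack/_stack_impl split described above can be sketched in plain Python (an analogue with hypothetical class names, not the actual matrix1.pyx code): the public stack() keeps the argument checks, while _stack_impl() does only the raw work. For flat row-major storage, stacking two matrices with equal ncols is just a concatenation of their entry buffers.

```python
class Mat:
    """Toy flat row-major matrix, to illustrate the stack/_stack_impl split."""
    def __init__(self, nrows, ncols, entries):
        assert len(entries) == nrows * ncols
        self.nrows, self.ncols, self.entries = nrows, ncols, entries

    def stack(self, bottom):
        # The checks stay in the public method...
        if self.ncols != bottom.ncols:
            raise TypeError("number of columns must agree")
        return self._stack_impl(bottom)

    def _stack_impl(self, bottom):
        # ...while the raw copy lives here (a cdef method in the real code,
        # avoiding the def-function call overhead mentioned in the addendum).
        return Mat(self.nrows + bottom.nrows, self.ncols,
                   self.entries + bottom.entries)

A = Mat(1, 2, [1, 2])
B = Mat(2, 2, [3, 4, 5, 6])
print(A.stack(B).entries)  # [1, 2, 3, 4, 5, 6]
```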


New commits:

10fa935 Cleanup doc and fix typo.

@tscrim
Collaborator

tscrim commented Jun 7, 2017

Reviewer: Travis Scrimshaw

@tscrim tscrim modified the milestones: sage-6.4, sage-8.0 Jun 7, 2017
@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jun 8, 2017

Changed commit from 10fa935 to 27f9467

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jun 8, 2017

Branch pushed to git repo; I updated commit sha1. New commits:

27f9467 faster construction of new parents

@nbruin
Contributor Author

nbruin commented Jun 8, 2017

comment:48

Hm, somehow the changes for better construction of parents didn't make it in. Fixed now. At least nothing is worse now. However, without improvement it's hard to argue why to make the changes. Perhaps there are other cases where differences are bigger.

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jun 8, 2017

Changed commit from 27f9467 to 81c7ef4

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jun 8, 2017

Branch pushed to git repo; I updated commit sha1. New commits:

81c7ef4 Use _stack_impl for the work and leave the checks for stack.

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jun 8, 2017

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

5c878c7 Use _stack_impl for the work and leave the checks for stack.

@sagetrac-git
Mannequin

sagetrac-git mannequin commented Jun 8, 2017

Changed commit from 81c7ef4 to 5c878c7

@tscrim
Collaborator

tscrim commented Jun 8, 2017

comment:51

By using _stack_impl, I could get some extra speed out of stack:

100000 loops, best of 3: 9.13 µs per loop
100000 loops, best of 3: 10.5 µs per loop
100000 loops, best of 3: 2.14 µs per loop
100000 loops, best of 3: 5.9 µs per loop
100000 loops, best of 3: 8.34 µs per loop

The most amazing thing is the speedup of the right kernel matrix. There was a lot of overhead there: about a 10x speedup.

If my last changes are good, then positive review.

@nbruin
Contributor Author

nbruin commented Jun 8, 2017

comment:52

Thanks! This looks fine. It seems this is a uniform improvement, and further optimizations would likely need to consider some new ideas and areas, so this is definitely worth merging.

@vbraun
Member

vbraun commented Jun 10, 2017

Changed branch from public/linear_algebra/modn_dense_speedup-15104 to 5c878c7

@jdemeyer

comment:54

For future reference, code like

cdef Py_ssize_t* nonpivots = <Py_ssize_t*>sig_malloc(sizeof(Py_ssize_t)*(ncols-r))

is unsafe for 2 reasons:

  1. sig_malloc might return NULL in the case that no memory is available.

  2. The multiplication sizeof(Py_ssize_t) * (ncols-r) can overflow (this is unlikely to happen in practice for matrices, but it's still a bug)

Luckily, the function check_allocarray from cysignals solves both these problems. The allocation could be written as

cdef Py_ssize_t* nonpivots = <Py_ssize_t*>check_allocarray(ncols - r, sizeof(Py_ssize_t))
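The overflow hazard that check_allocarray guards against can be modeled in plain Python (the real function is C, provided by cysignals; the names below are just an illustrative sketch): computing count*size bytes must fail cleanly when the product would exceed size_t, instead of silently wrapping around and allocating a too-small buffer.

```python
SIZE_MAX = 2**64 - 1   # assume a 64-bit size_t for illustration

def checked_alloc_size(count, size):
    """Return count*size, or raise if the C multiplication would overflow.

    This mirrors the overflow check an allocator like check_allocarray
    performs before calling malloc; in C, count*size would wrap silently.
    """
    if size != 0 and count > SIZE_MAX // size:
        raise OverflowError("allocation size overflows size_t")
    return count * size

print(checked_alloc_size(10, 8))   # 80 bytes, fine
try:
    checked_alloc_size(2**61, 8)   # 2**64 bytes: would wrap to 0 in C
except OverflowError as e:
    print("caught:", e)
```

(check_allocarray additionally raises MemoryError on a NULL return, covering the first problem as well.)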

@jdemeyer

Changed commit from 5c878c7 to none
