This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

optimization for dot(csr.T, dense) = rsp #8611

Merged
merged 10 commits into from
Nov 19, 2017

Conversation

ZiyueHuang
Member

@ZiyueHuang ZiyueHuang commented Nov 10, 2017

Description

Use a prefix sum to compute nnr (the number of nonzero rows) in order to allocate the row_sparse output.

Currently dot(csr.T, dense) = rsp allocates a dense output and then casts it to row_sparse, without freeing the unused memory.

I use run_benchmark(context, lhs="csr", rhs="default", lhs_trans=True, ...) in mxnet/benchmark/python/sparse/dot.py. Please correct me if I'm wrong.

Is dot(csr.T, dense) = rsp in master really this slow? It might be because others are using my machine at the same time.
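The prefix-sum idea can be sketched in plain NumPy (hypothetical helper names; the actual implementation is a C++ kernel in MXNet):

```python
import numpy as np

def dot_csrT_dense_rsp(indptr, indices, data, num_cols, dense):
    # Sketch of the PR's approach (hypothetical helper, not the MXNet kernel):
    # a prefix sum over marked csr columns yields nnr, so only nnr output
    # rows are allocated instead of a full dense output.
    mark = np.zeros(num_cols, dtype=np.int64)
    mark[indices] = 1                 # columns of csr = rows of csr.T
    prefix = np.cumsum(mark)          # prefix sum -> compact row mapping
    nnr = int(prefix[-1])             # nnr = number of nonzero output rows
    row_idx = np.flatnonzero(mark)    # row indices kept in the rsp output
    values = np.zeros((nnr, dense.shape[1]))   # allocate only nnr rows
    for r in range(len(indptr) - 1):  # scatter each nonzero (r, c, v):
        for j in range(indptr[r], indptr[r + 1]):
            # v * dense[r] accumulates into compact row prefix[c] - 1
            values[prefix[indices[j]] - 1] += data[j] * dense[r]
    return row_idx, values

# 2x5 csr with nonzeros (0,0)=1, (0,3)=2, (1,3)=3
indptr = np.array([0, 2, 3])
indices = np.array([0, 3, 3])
data = np.array([1.0, 2.0, 3.0])
dense = np.array([[1.0, 2.0], [4.0, 5.0]])
row_idx, values = dot_csrT_dense_rsp(indptr, indices, data, 5, dense)
```

Only output rows 0 and 3 are materialized; rows 1, 2, and 4 of the 5-row product are all zero and never allocated.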

Performance of the original dot(csr.T, dense) = rsp:

[hanfeng@model-gpu00:sparse]$ python dot.py --num-omp-threads 20
========================================================
  mxnet sparse dot benchmark: dot(csr, default) = default
  (matrix multiplication: (m x k)^T * (k x n) = m x n)
========================================================
 lhs_density(%)  rhs_density(%)    context        m        k        n  t_sparse(ms)   t_dense(ms)  speedup
            1.0           100.0     cpu(0)      128  1000000      256        366.19        135.76     0.37
            1.0           100.0     cpu(0)      128  1000000     1000       1327.12        503.92     0.38
            1.0           100.0     cpu(0)      128  1000000     1000       1237.33        454.01     0.37
            1.0           100.0     cpu(0)       64  1000000     1000        868.38        345.38     0.40
            1.0           100.0     cpu(0)      128  1000000     1000       1237.09        437.32     0.35

After this PR,

[hanfeng@model-gpu00:sparse]$ python dot.py --num-omp-threads 20
========================================================
  mxnet sparse dot benchmark: dot(csr, default) = default
  (matrix multiplication: (m x k)^T * (k x n) = m x n)
========================================================
 lhs_density(%)  rhs_density(%)    context        m        k        n  t_sparse(ms)   t_dense(ms)  speedup
            1.0           100.0     cpu(0)      128  1000000      256         83.90        137.18     1.64
            1.0           100.0     cpu(0)      128  1000000     1000        410.63        448.30     1.09
            1.0           100.0     cpu(0)      128  1000000     1000        467.91        492.87     1.05
            1.0           100.0     cpu(0)       64  1000000     1000        259.99        348.32     1.34
            1.0           100.0     cpu(0)      128  1000000     1000        481.77        416.20     0.86

This implements a feature requested in #8168.

cc @eric-haibin-lin

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • unittests already exist

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@@ -573,6 +577,18 @@ inline void DotCsrDnsDnsImpl(const OpContext& ctx,
});
}


struct MarkCsrColKernel {
row_idx_out, prefix_sum, num_rows);

num_threads = mxnet_op::get_num_threads<cpu>(ret->shape()[0]);
dim_t seg_len = (ret->shape()[0] + num_threads - 1) / num_threads;
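These two lines partition the output rows into one contiguous segment per thread via ceiling division; a minimal sketch of the same partitioning (a hypothetical helper, not MXNet code):

```python
def segments(num_rows, num_threads):
    # Ceiling division, mirroring the kernel's seg_len computation:
    # seg_len = (num_rows + num_threads - 1) / num_threads
    seg_len = (num_rows + num_threads - 1) // num_threads
    # Each thread t handles rows [t*seg_len, min((t+1)*seg_len, num_rows));
    # threads whose segment starts past the end get no work.
    return [(t * seg_len, min((t + 1) * seg_len, num_rows))
            for t in range(num_threads) if t * seg_len < num_rows]
```

With 10 rows and 3 threads this yields segments (0,4), (4,8), (8,10), so every row is covered exactly once.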

Member Author

I think not; this range should cover num_cols of the csr, i.e. num_rows of the output, so nnr is not helpful for this kernel.

@eric-haibin-lin
Copy link
Member

Yes, the original dot(csr.T, dense) performance is not good. The same is true for scipy: it's not always faster than the dense dot, even with very sparse data.
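A quick scipy check of the dot(csr.T, dense) result (a sketch only, assuming scipy is available; it verifies correctness, not the timing comparison):

```python
import numpy as np
from scipy import sparse

# Random 1%-dense csr, matching the benchmark's lhs density.
csr = sparse.random(1000, 64, density=0.01, format="csr", random_state=0)
dense = np.random.default_rng(0).standard_normal((1000, 8))

out_sparse = csr.T @ dense            # sparse-transpose-times-dense path
out_dense = csr.toarray().T @ dense   # dense reference computation
```

Both paths produce the same 64x8 result; whether the sparse path is faster depends on density, shapes, and threading, which is the point of the benchmarks above.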

@@ -375,28 +375,32 @@ struct DotCsrTransDnsRspByRowBlocks {
* \brief
Member


The documentation should be updated

@ZiyueHuang
Member Author

ZiyueHuang commented Nov 19, 2017

Added a benchmark for the n=2 case.

Before,

python dot.py --num-omp-threads 16
========================================================
  mxnet sparse dot benchmark: dot(csr, default) = default
  (matrix multiplication: (m x k)^T * (k x n) = m x n)
========================================================
 lhs_density(%)  rhs_density(%)    context        m        k        n  t_sparse(ms)   t_dense(ms)  speedup
            1.0           100.0     cpu(0)      256  1000000        2         41.61         36.01     0.87

After,

python dot.py --num-omp-threads 16
========================================================
  mxnet sparse dot benchmark: dot(csr, default) = default
  (matrix multiplication: (m x k)^T * (k x n) = m x n)
========================================================
 lhs_density(%)  rhs_density(%)    context        m        k        n  t_sparse(ms)   t_dense(ms)  speedup
            1.0           100.0     cpu(0)      256  1000000        2         14.44         32.46     2.25

@piiswrong piiswrong merged commit f79d22d into apache:master Nov 19, 2017
eric-haibin-lin added a commit to eric-haibin-lin/mxnet that referenced this pull request Nov 21, 2017
eric-haibin-lin pushed a commit to eric-haibin-lin/mxnet that referenced this pull request Dec 3, 2017
* optimization for dot(csr.T, dense) = rsp

* remove unneccessary headers

* load balance

* minor fix and update comments

* resolve

* trigger

* trigger
@moveforever
Contributor

Hi @eric-haibin-lin @ZiyueHuang, can the storage type of dot(csr.T, rsp) be rsp?

@eric-haibin-lin
Member

It's now indeed rsp:

>>> a = mx.nd.ones((2,2)).tostype('row_sparse')
>>> a

<RowSparseNDArray 2x2 @cpu(0)>
>>> b = mx.nd.ones((2,2)).tostype('csr')
>>> mx.nd.sparse.dot(b, a, transpose_a=True)

<RowSparseNDArray 2x2 @cpu(0)>

@moveforever
Contributor

Oh, I see. Thanks.

@ZiyueHuang ZiyueHuang deleted the dot branch January 30, 2018 11:34
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* optimization for dot(csr.T, dense) = rsp

* remove unneccessary headers

* load balance

* minor fix and update comments

* resolve

* trigger

* trigger