This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

optimization for dot(csr.T, dense) = rsp #8611

Merged
merged 10 commits into from
Nov 19, 2017

Conversation

ZiyueHuang
Member

@ZiyueHuang ZiyueHuang commented Nov 10, 2017

Description

Use a prefix sum to compute nnr (the number of nonzero rows) in order to allocate the row_sparse output.

Currently dot(csr.T, dense) = rsp allocates a dense output and then casts it to row_sparse, without freeing the unused memory.

I use run_benchmark(context, lhs="csr", rhs="default", lhs_trans=True, ...) in mxnet/benchmark/python/sparse/dot.py. Please correct me if I'm wrong.

Is dot(csr.T, dense) = rsp in master really this slow? It might be because others are using my machine at the same time.
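The prefix-sum idea can be sketched in plain NumPy (hypothetical helper names; the actual implementation is a C++ kernel in MXNet):

```python
import numpy as np

def dot_csrT_dense_rsp(indptr, indices, data, num_cols, dense):
    # Sketch of the PR's approach (hypothetical helper, not the MXNet kernel):
    # a prefix sum over marked csr columns yields nnr, so only nnr output
    # rows are allocated instead of a full dense output.
    mark = np.zeros(num_cols, dtype=np.int64)
    mark[indices] = 1                 # columns of csr = rows of csr.T
    prefix = np.cumsum(mark)          # prefix sum -> compact row mapping
    nnr = int(prefix[-1])             # nnr = number of nonzero output rows
    row_idx = np.flatnonzero(mark)    # row indices kept in the rsp output
    values = np.zeros((nnr, dense.shape[1]))   # allocate only nnr rows
    for r in range(len(indptr) - 1):  # scatter each nonzero (r, c, v):
        for j in range(indptr[r], indptr[r + 1]):
            # v * dense[r] accumulates into compact row prefix[c] - 1
            values[prefix[indices[j]] - 1] += data[j] * dense[r]
    return row_idx, values

# 2x5 csr with nonzeros (0,0)=1, (0,3)=2, (1,3)=3
indptr = np.array([0, 2, 3])
indices = np.array([0, 3, 3])
data = np.array([1.0, 2.0, 3.0])
dense = np.array([[1.0, 2.0], [4.0, 5.0]])
row_idx, values = dot_csrT_dense_rsp(indptr, indices, data, 5, dense)
```

Only output rows 0 and 3 are materialized; rows 1, 2, and 4 of the 5-row product are all zero and never allocated.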

Performance of the original dot(csr.T, dense) = rsp:

[hanfeng@model-gpu00:sparse]$ python dot.py --num-omp-threads 20
========================================================
  mxnet sparse dot benchmark: dot(csr, default) = default
  (matrix multiplication: (m x k)^T * (k x n) = m x n)
========================================================
 lhs_density(%)  rhs_density(%)    context        m        k        n  t_sparse(ms)   t_dense(ms)  speedup
            1.0           100.0     cpu(0)      128  1000000      256        366.19        135.76     0.37
            1.0           100.0     cpu(0)      128  1000000     1000       1327.12        503.92     0.38
            1.0           100.0     cpu(0)      128  1000000     1000       1237.33        454.01     0.37
            1.0           100.0     cpu(0)       64  1000000     1000        868.38        345.38     0.40
            1.0           100.0     cpu(0)      128  1000000     1000       1237.09        437.32     0.35

After this PR,

[hanfeng@model-gpu00:sparse]$ python dot.py --num-omp-threads 20
========================================================
  mxnet sparse dot benchmark: dot(csr, default) = default
  (matrix multiplication: (m x k)^T * (k x n) = m x n)
========================================================
 lhs_density(%)  rhs_density(%)    context        m        k        n  t_sparse(ms)   t_dense(ms)  speedup
            1.0           100.0     cpu(0)      128  1000000      256         83.90        137.18     1.64
            1.0           100.0     cpu(0)      128  1000000     1000        410.63        448.30     1.09
            1.0           100.0     cpu(0)      128  1000000     1000        467.91        492.87     1.05
            1.0           100.0     cpu(0)       64  1000000     1000        259.99        348.32     1.34
            1.0           100.0     cpu(0)      128  1000000     1000        481.77        416.20     0.86

This implements a feature requested in #8168.

cc @eric-haibin-lin

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • unittests already exist

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@@ -573,6 +577,18 @@ inline void DotCsrDnsDnsImpl(const OpContext& ctx,
});
}


struct MarkCsrColKernel {
row_idx_out, prefix_sum, num_rows);

num_threads = mxnet_op::get_num_threads<cpu>(ret->shape()[0]);
dim_t seg_len = (ret->shape()[0] + num_threads - 1) / num_threads;
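These two lines partition the output rows into one contiguous segment per thread via ceiling division; a minimal sketch of the same partitioning (a hypothetical helper, not MXNet code):

```python
def segments(num_rows, num_threads):
    # Ceiling division, mirroring the kernel's seg_len computation:
    # seg_len = (num_rows + num_threads - 1) / num_threads
    seg_len = (num_rows + num_threads - 1) // num_threads
    # Each thread t handles rows [t*seg_len, min((t+1)*seg_len, num_rows));
    # threads whose segment starts past the end get no work.
    return [(t * seg_len, min((t + 1) * seg_len, num_rows))
            for t in range(num_threads) if t * seg_len < num_rows]
```

With 10 rows and 3 threads this yields segments (0,4), (4,8), (8,10), so every row is covered exactly once.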

Member Author

I think not; this range should cover num_cols of the csr, i.e. num_rows of the output, so nnr is not helpful for this kernel.

@eric-haibin-lin
Copy link
Member

Yes, the original dot(csr.T, dense) performance is not good. The same is true for scipy: it's not always faster than the dense dot, even with very sparse data.
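A quick scipy check of the dot(csr.T, dense) result (a sketch only, assuming scipy is available; it verifies correctness, not the timing comparison):

```python
import numpy as np
from scipy import sparse

# Random 1%-dense csr, matching the benchmark's lhs density.
csr = sparse.random(1000, 64, density=0.01, format="csr", random_state=0)
dense = np.random.default_rng(0).standard_normal((1000, 8))

out_sparse = csr.T @ dense            # sparse-transpose-times-dense path
out_dense = csr.toarray().T @ dense   # dense reference computation
```

Both paths produce the same 64x8 result; whether the sparse path is faster depends on density, shapes, and threading, which is the point of the benchmarks above.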

@@ -375,28 +375,32 @@ struct DotCsrTransDnsRspByRowBlocks {
* \brief
Member


The documentation should be updated

@ZiyueHuang
Member Author

ZiyueHuang commented Nov 19, 2017

Added a benchmark for the n=2 case.

Before,

python dot.py --num-omp-threads 16
========================================================
  mxnet sparse dot benchmark: dot(csr, default) = default
  (matrix multiplication: (m x k)^T * (k x n) = m x n)
========================================================
 lhs_density(%)  rhs_density(%)    context        m        k        n  t_sparse(ms)   t_dense(ms)  speedup
            1.0           100.0     cpu(0)      256  1000000        2         41.61         36.01     0.87

After,

python dot.py --num-omp-threads 16
========================================================
  mxnet sparse dot benchmark: dot(csr, default) = default
  (matrix multiplication: (m x k)^T * (k x n) = m x n)
========================================================
 lhs_density(%)  rhs_density(%)    context        m        k        n  t_sparse(ms)   t_dense(ms)  speedup
            1.0           100.0     cpu(0)      256  1000000        2         14.44         32.46     2.25

@piiswrong piiswrong merged commit f79d22d into apache:master Nov 19, 2017
eric-haibin-lin added a commit to eric-haibin-lin/mxnet that referenced this pull request Nov 21, 2017
eric-haibin-lin pushed a commit to eric-haibin-lin/mxnet that referenced this pull request Dec 3, 2017
* optimization for dot(csr.T, dense) = rsp

* remove unneccessary headers

* load balance

* minor fix and update comments

* resolve

* trigger

* trigger
@moveforever
Contributor

Hi @eric-haibin-lin @ZiyueHuang, can the storage type of dot(csr.T, rsp) be rsp?

@eric-haibin-lin
Member

It's now indeed rsp:

>>> a = mx.nd.ones((2,2)).tostype('row_sparse')
>>> a

<RowSparseNDArray 2x2 @cpu(0)>
>>> b = mx.nd.ones((2,2)).tostype('csr')
>>> mx.nd.sparse.dot(b, a, transpose_a=True)

<RowSparseNDArray 2x2 @cpu(0)>

@moveforever
Contributor

Oh, I see. Thanks.

@ZiyueHuang ZiyueHuang deleted the dot branch January 30, 2018 11:34
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* optimization for dot(csr.T, dense) = rsp

* remove unneccessary headers

* load balance

* minor fix and update comments

* resolve

* trigger

* trigger