This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

cpu sparse embedding op #8460

Merged
merged 20 commits into from
Nov 7, 2017

Conversation

@eric-haibin-lin eric-haibin-lin commented Oct 28, 2017

Description

The SparseEmbedding op takes indices and a (rowsparse) weight as inputs and produces a dense result in the forward pass. In the backward pass it outputs a (rowsparse) gradient for the weight, which is useful for sparse gradient updates.
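To make the data flow concrete, here is a minimal NumPy sketch of the op's semantics (illustrative only, not the MXNet implementation; the function names are made up): the forward pass gathers rows of the weight, and the backward pass only touches the gathered rows, so the weight gradient fits naturally in row-sparse form.

```python
import numpy as np

def sparse_embedding_forward(indices, weight):
    # Gather the selected rows; the output is dense.
    return weight[indices]

def sparse_embedding_backward(indices, out_grad):
    # Only rows referenced by `indices` receive gradient, so the result
    # is returned as (row indices, values) of a row-sparse gradient.
    rows = np.unique(indices)                      # sorted unique row ids
    data = np.zeros((len(rows), out_grad.shape[1]))
    for i, idx in enumerate(indices):
        data[np.searchsorted(rows, idx)] += out_grad[i]
    return rows, data

weight = np.arange(20.0).reshape(4, 5)
out = sparse_embedding_forward(np.array([1, 3, 1]), weight)
rows, data = sparse_embedding_backward(np.array([1, 3, 1]), np.ones((3, 5)))
# only rows 1 and 3 appear in the gradient; row 1 accumulates twice
```

An optimizer can then apply a sparse update by touching only `rows` of the weight, which is where the speedup below comes from.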

~8x faster on a c4.8xlarge machine (36 cores) compared to the dense Embedding op:
With sparse embedding: OMP_NUM_THREADS=32 python matrix_factorization.py --print-every=1000

INFO:root:Epoch[0] Batch [1000] Speed: 81742.24 samples/sec     mse=1.143560
INFO:root:Epoch[0] Batch [2000] Speed: 84291.38 samples/sec     mse=0.980251
INFO:root:Epoch[0] Batch [3000] Speed: 82812.25 samples/sec     mse=0.950991
INFO:root:Epoch[0] Batch [4000] Speed: 82077.48 samples/sec     mse=0.911922
INFO:root:Epoch[0] Batch [5000] Speed: 80918.93 samples/sec     mse=0.898004
INFO:root:Epoch[0] Batch [6000] Speed: 73614.71 samples/sec     mse=0.872740
INFO:root:Epoch[0] Batch [7000] Speed: 83037.03 samples/sec     mse=0.863862
INFO:root:Epoch[0] Batch [8000] Speed: 83170.19 samples/sec     mse=0.852437
INFO:root:Epoch[0] Batch [9000] Speed: 83669.75 samples/sec     mse=0.858010

With embedding: OMP_NUM_THREADS=32 python matrix_factorization.py --print-every=1000 --use-dense

INFO:root:Epoch[0] Batch [1000] Speed: 9179.29 samples/sec      mse=1.101776
INFO:root:Epoch[0] Batch [2000] Speed: 10797.47 samples/sec     mse=0.928593
INFO:root:Epoch[0] Batch [3000] Speed: 11022.51 samples/sec     mse=0.903449
INFO:root:Epoch[0] Batch [4000] Speed: 9529.76 samples/sec      mse=0.885801
INFO:root:Epoch[0] Batch [5000] Speed: 10027.75 samples/sec     mse=0.876870
INFO:root:Epoch[0] Batch [6000] Speed: 9447.15 samples/sec      mse=0.857204
INFO:root:Epoch[0] Batch [7000] Speed: 9982.74 samples/sec      mse=0.847336
INFO:root:Epoch[0] Batch [8000] Speed: 10335.85 samples/sec     mse=0.840450
INFO:root:Epoch[0] Batch [9000] Speed: 9890.59 samples/sec      mse=0.840442

Note: SparseEmbedding checks whether any input index is out of bounds, and throws an exception if one is found.
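The bounds check described in the note can be sketched as follows (hypothetical Python, not the actual C++ kernel): scan the index array once and raise if any value falls outside `[0, num_rows)`.

```python
import numpy as np

def check_bounds(indices, num_rows):
    # Raise if any embedding index is outside the weight's row range.
    if ((indices < 0) | (indices >= num_rows)).any():
        raise ValueError("index out of bounds for embedding weight")

check_bounds(np.array([0, 3]), num_rows=4)   # passes silently
```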

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • For user-facing API changes, API doc string has been updated.
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@eric-haibin-lin eric-haibin-lin changed the title [WIP] Cpu sparse embedding op cpu sparse embedding op Oct 31, 2017
@formath formath mentioned this pull request Oct 31, 2017

@anirudh2290 anirudh2290 left a comment


Thank you for working on this op!

it->second = InitZeros(dest_arg_stype, dest_arg_shape, ctx, dest_arg_dtype);
return it->second;
} else {
// not shareable storage
return InitZeros(dest_arg_stype, dest_arg_shape, ctx, dest_arg_dtype);
} // arg_array.shape().Size() >= arg_shape.Size()
anirudh2290:

This comment here probably makes sense after the else if clause instead of here.

eric-haibin-lin (Author):

line 704 is still performing sharing by updating it->second; it's just that its size is too small, so a bigger ndarray is created for sharing. Maybe the name size_shareable is misleading?

anirudh2290:

Yes, the comment "// arg_array.shape().Size() >= arg_shape.Size()" probably makes sense after if instead of else. I don't think the name size_shareable is misleading.

eric-haibin-lin (Author):

Oh I see, you're referring to this comment. I'll move it.

in_arg_vec->emplace_back(ReshapeOrCreate(arg_name, inferred_shape, inferred_dtype,
inferred_stype, in_arg_ctxes[arg_top],
shared_buffer));
// gradient for model parameter
shared_buffer, true));
anirudh2290:

Using a bool variable and passing it as an argument to ReshapeOrCreate here and below would improve readability.

eric-haibin-lin (Author):

Yes sure

MSHADOW_XINLINE static void Map(int tid,
DType* row_flg,
const IType* row_idx) {
nnvm::dim_t idx = static_cast<nnvm::dim_t>(row_idx[tid]);
anirudh2290:

Can we use IType here instead of nnvm::dim_t?

eric-haibin-lin (Author):

I'm afraid not. For embedding, the data/row_idx could be float, which cannot be used with [], so it's always cast.
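The point about casting can be illustrated in Python (an analogy, not the C++ kernel): an Embedding input's index array may be float32, and floats cannot be used as array subscripts, so each value is cast to an integer row id first.

```python
import numpy as np

row_idx = np.array([2.0, 0.0, 3.0], dtype=np.float32)  # float indices
weight = np.arange(20.0).reshape(4, 5)

# weight[row_idx] would raise an IndexError: floats are not valid subscripts.
rows = row_idx.astype(np.int64)   # cast: float row id -> integer offset
out = weight[rows]
```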

const nnvm::dim_t* row_flg_sum,
const nnvm::dim_t num_rows) {
if (tid < num_rows) {
nnvm::dim_t prev = (tid == 0) ? 0 : row_flg_sum[tid-1];
anirudh2290:

Can RType be used here instead of nnvm::dim_t?

eric-haibin-lin (Author):

This kernel is used by other operators (such as dot and elemwise_sum), too. I'm moving it from its original place to this file so that it can be reused for embedding. Changing that interface is out of scope for this PR, since row_flg_sum comes from temp storage which is always allocated as nnvm::dim_t...
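What the snippet above computes can be sketched in Python (hypothetical, not the actual kernel): flag the retained rows, take an inclusive prefix sum of the flags, then use `prev = row_flg_sum[tid-1]` (0 for `tid == 0`) to place each retained row at its compacted position.

```python
import numpy as np

def compact_rows(row_flg):
    # Inclusive prefix sum of the 0/1 flags; the last entry is the
    # number of retained rows.
    row_flg_sum = np.cumsum(row_flg)
    out = np.empty(int(row_flg_sum[-1]), dtype=np.int64)
    for tid in range(len(row_flg)):
        prev = 0 if tid == 0 else row_flg_sum[tid - 1]
        if row_flg_sum[tid] > prev:   # flag was set: row tid is retained
            out[prev] = tid           # prev is its compacted position
    return out

retained = compact_rows(np.array([0, 1, 0, 1, 1]))  # rows 1, 3, 4 retained
```

Because `prev` differs between adjacent flagged rows, every retained row gets a unique slot, which is why the per-thread body needs no synchronization.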

[ 5., 6., 7., 8., 9.],
[ 10., 11., 12., 13., 14.],
[ 15., 16., 17., 18., 19.]]

anirudh2290:

Can we use a rsp weight in the example here?

eric-haibin-lin (Author):

Maybe I'll add one more example with rsp weight..

eric-haibin-lin (Author):

I'm actually not sure about that. Showing weights with 0s is kind of confusing.

@@ -95,6 +95,77 @@ Examples::
.add_argument("weight", "NDArray-or-Symbol", "The embedding weight matrix.")
.add_arguments(EmbeddingParam::__FIELDS__());

NNVM_REGISTER_OP(_contrib_SparseEmbedding)
anirudh2290:

Stupid question: why are we registering a new operator here instead of extending the existing Embedding op?

eric-haibin-lin (Author):

Good question! If we overrode the existing op, we wouldn't know whether to infer a rsp or dense gradient for the weight, because the weight ndarray is not visible in the backward pass.

const dim_t idx_offset = first - weight_idx;
const dim_t out_offset = i * row_length;
const dim_t weight_offset = idx_offset * row_length;
if (idx_offset >= nnr || *(weight_idx + idx_offset) > val) {
anirudh2290:

It is not obvious to me why the element is not found in the case where *(weight_idx + idx_offset) > val. Maybe add a comment here.

eric-haibin-lin (Author):

Sure. It's possible that weight.idx = [5,10] while data = [3,7], so no matching indices can be found in weight_idx.
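The miss case from this example can be sketched with Python's `bisect` (an illustration of the logic, not the C++ code): a lower-bound search for a data value can land past the end of the rsp index array (`idx_offset >= nnr`) or on a strictly larger index (`weight_idx[idx_offset] > val`); either way the row is absent from the rsp weight.

```python
import bisect

weight_idx = [5, 10]          # sorted row indices of the rsp weight
nnr = len(weight_idx)         # number of non-zero rows

def find_row(val):
    # Lower bound: first position whose index is >= val.
    idx_offset = bisect.bisect_left(weight_idx, val)
    if idx_offset >= nnr or weight_idx[idx_offset] > val:
        return None           # row not present in the rsp weight
    return idx_offset
```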

using namespace rowsparse;
using namespace mxnet_op;
// zeros weight
if (req == kWriteTo && !weight.storage_initialized()) {
anirudh2290:

What happens when the input data storage is not initialized?

eric-haibin-lin (Author):

input data is always dense.


[[ 0., 1., 2., 3., 4.],
[ 10., 11., 12., 13., 14.]]]

anirudh2290:

Also, a frontend Python example, similar to the one here: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/symbol_doc.py#L128 would be good to have.

eric-haibin-lin (Author):

I thought we stopped using SymbolDoc with doc tests. Is that still working?

anirudh2290:

I see the symbol doc example here: https://mxnet.incubator.apache.org/versions/master/api/python/symbol/symbol.html#mxnet.symbol.Embedding. I am not sure if it is deprecated.

eric-haibin-lin (Author):

Cool, will add this.

eric-haibin-lin (Author):

Looks like that doesn't work for contrib ops.. :(


Ubuntu and others added 2 commits November 7, 2017 04:26
@piiswrong piiswrong merged commit 4862c41 into apache:master Nov 7, 2017
cjolivier01 pushed a commit to cjolivier01/mxnet that referenced this pull request Nov 9, 2017
* cpu embedding draft

* clean up

* fix omp thread call

* add sparse embedding example

* check bound with signel thread

* add note

* add comments

* add operator note

* support rsp weight sharing for bucketing

* improve workload balance in take add grad rsp kernel

* use MSHADOW_CINLINE for cpu kernel

* review comments. add unit test for shared rsp weight

* remove indexing op-inl.h

* Trigger

* Trigger
@eric-haibin-lin eric-haibin-lin deleted the cpu-embed branch November 14, 2017 05:57
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018