[TOPI] Add embedding op and gradient #6794

Closed · wants to merge 4 commits

Conversation

tkonolige (Contributor)

This PR adds the embed op and its gradient. Embed is a specialization of take with a 2D lookup table.

@altanh
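
For illustration, a minimal NumPy sketch of the proposed semantics (not the TOPI code itself): an embedding lookup is just take along axis 0 of a 2D table.

import numpy as np

table = np.random.randn(10, 4).astype("float32")  # (vocab_size, embedding_dim)
indices = np.array([1, 5, 5, 0], dtype="int64")   # (flat_length,)

# Embedding lookup: select whole rows of the 2D table.
out = np.take(table, indices, axis=0)             # shape (4, 4)

# Equivalent to plain fancy indexing.
assert np.array_equal(out, table[indices])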

tqchen (Member) commented Oct 30, 2020

Thanks @tkonolige. Given that this is a new op, it would be great to do an API review per https://tvm.apache.org/docs/contribute/code_review.html#deliberate-on-api-and-data-structures

In particular, it would be great to check out the conventions of similar APIs in existing frameworks like numpy, PyTorch, Keras, and TensorFlow; we should ideally follow common conventions. See the previous related discussion on nms in #2535.

altanh (Contributor) commented Oct 30, 2020

re: API review

Naming

All of the above APIs call it Embedding, so we may want to rename embed to embedding (although grammatically I do feel like "embed" is more correct).

Arguments

I don't think we need to pass in the vocabulary size or embedding dimension like these examples do, since we can infer them from the weight/table matrix (I imagine they use them for bookkeeping during training, which is a separate matter). Likewise, we can ignore anything related to weight initialization.

PyTorch has the following additional arguments (a short usage sketch follows the list):

  • padding_idx: int, index into embedding table that will always have 0 gradient, generally used for padding
  • scale_grad_by_freq: boolean, "scale gradients by the inverse of frequency of the words in the mini-batch." I believe this means the gradient update for index j will be divided by sum(indices == j) (count of j in input indices).
  • sparse: boolean, "gradient w.r.t. weight matrix will be a sparse tensor."
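
For reference, a short usage sketch of these PyTorch arguments (assuming a recent torch version); this only shows the surface API and is not a proposal for the TOPI signature.

import torch

# Vocab of 10, embedding dim of 4; index 0 reserved for padding.
emb = torch.nn.Embedding(
    num_embeddings=10,
    embedding_dim=4,
    padding_idx=0,             # the row for index 0 always gets a zero gradient
    scale_grad_by_freq=False,  # optionally scale grads by inverse index frequency
    sparse=False,              # if True, weight.grad is a sparse tensor
)

idx = torch.tensor([[1, 5, 5, 0]])
out = emb(idx)                 # shape (1, 4, 4)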

mxnet has:

  • sparse_grad: boolean, gradient for weight is row-sparse (probably the same as PyTorch's sparse above?)

TF/Keras has:

  • mask_zero: boolean, "whether or not the input value 0 is a special 'padding' value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1)." I don't fully understand this, but it seems similar to padding_idx from PyTorch while requiring TF-specific masking support; I prefer PyTorch's approach if they are equivalent (see the sketch after this list).
  • input_length: int, "Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed)." Again, this sounds like a TF/Keras design quirk.
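
For what it's worth, a quick sketch of how mask_zero behaves in Keras (assuming TF 2.x). The mask is a separate boolean tensor that downstream layers must consume, which is why it feels TF-specific.

import numpy as np
import tensorflow as tf

# input_dim must be vocab_size + 1 because index 0 is reserved for padding.
emb = tf.keras.layers.Embedding(input_dim=11, output_dim=4, mask_zero=True)

x = np.array([[3, 5, 0, 0]])   # trailing zeros are padding positions
out = emb(x)                   # shape (1, 4, 4); padding positions still produce vectors
mask = emb.compute_mask(x)     # [[True, True, False, False]]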

In my opinion, we should aim for PyTorch's API over TF/Keras's, but perhaps others can give more insight. We are also thinking about adding sparse gradient support, so it may be best to accept it as an attr now but raise an error until it is implemented.

Shapes

PyTorch and mxnet support arbitrary input shapes. In particular, if our embedding dimension is dim, the shape relation is (d1, ..., dn) -> (d1, ..., dn, dim).

TF/Keras is strange in that it uses (batch_size, input_length) -> (batch_size, input_length, dim), which just seems like a restricted case of the PyTorch and mxnet behavior.

This PR currently proposes (flat_length,) -> (flat_length, dim). Note that we can easily support the PyTorch and mxnet approach by flattening the indices and then "reshaping the first n dimensions": (d1,...,dn) -> (d1 * ... * dn) -> (d1 * ... * dn, dim) -> (d1,...,dn,dim). I imagine this should be easy to implement but I'm not too familiar with TOPI.
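
A minimal NumPy sketch of that flatten/lookup/reshape trick (illustration only, not TOPI code):

import numpy as np

table = np.random.randn(10, 4)                       # (vocab_size, dim)
indices = np.random.randint(0, 10, size=(2, 3, 5))   # arbitrary (d1, ..., dn)

flat = indices.reshape(-1)                           # (d1 * ... * dn,)
looked_up = np.take(table, flat, axis=0)             # (d1 * ... * dn, dim)
out = looked_up.reshape(indices.shape + (table.shape[1],))  # (d1, ..., dn, dim)

assert out.shape == (2, 3, 5, 4)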

altanh (Contributor) commented Oct 30, 2020

cc @antinucleon

tkonolige (Contributor, Author) commented

For this PR, we are just going to do the dense gradient. The sparse gradient will take some work, so we will add it in a later PR.
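
For concreteness, a NumPy sketch of what the dense gradient looks like (a reading of the semantics, not the code in this PR): the upstream gradient rows are scatter-added into a weight-shaped matrix, and PyTorch's scale_grad_by_freq would then divide each touched row by the count of that index.

import numpy as np

vocab, dim = 10, 4
indices = np.array([1, 5, 5, 0])           # (flat_length,)
grad_out = np.ones((indices.size, dim))    # upstream gradient, (flat_length, dim)

# Dense gradient w.r.t. the table: scatter-add rows of grad_out into a dense matrix.
grad_table = np.zeros((vocab, dim))
np.add.at(grad_table, indices, grad_out)   # row 5 accumulates two contributions

# scale_grad_by_freq (PyTorch) would then divide each row by its occurrence count.
counts = np.bincount(indices, minlength=vocab).reshape(-1, 1)
grad_table_scaled = grad_table / np.maximum(counts, 1)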

"""
s = te.create_schedule([outs[0].op])

vec_size = 8 # should autotune this, but we can't with hybrid script
Review comment (Contributor):
May I ask why 8? I am just wondering if we could reuse this schedule for the ARM back-end as well.

tkonolige (Contributor, Author) replied:
We could reuse it. I just didn't have a good way to figure out the width of the vector instructions.
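
One possible way to avoid the hard-coded 8, sketched under the assumption that the schedule can inspect the current target (the attribute and key names used here are assumptions, not a settled API):

import tvm

def choose_vec_size(default=8):
    # Hypothetical helper: pick a vector width from the current target, if one is set.
    target = tvm.target.Target.current(allow_none=True)
    if target is None:
        return default
    mcpu = getattr(target, "mcpu", "") or ""
    if mcpu in ("skylake-avx512", "cascadelake"):  # AVX-512: 16 fp32 lanes
        return 16
    if "arm_cpu" in target.keys:                   # 128-bit NEON: 4 fp32 lanes
        return 4
    return default                                 # assume AVX2-class x86: 8 fp32 lanes

Even so, an autotuning knob would be preferable to any hard-coded table once the schedule moves off hybrid script.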
