[TOPI] Add embedding op and gradient #6794

Closed · wants to merge 4 commits

Conversation

tkonolige (Contributor)

This PR adds the embed op and its gradient. Embed is a specialization of take with a 2D lookup table.

@altanh
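
For illustration, a minimal NumPy sketch of the proposed semantics (not the TOPI code itself): an embedding lookup is just take along axis 0 of a 2D table.

import numpy as np

table = np.random.randn(10, 4).astype("float32")  # (vocab_size, embedding_dim)
indices = np.array([1, 5, 5, 0], dtype="int64")   # (flat_length,)

# Embedding lookup: select whole rows of the 2D table.
out = np.take(table, indices, axis=0)             # shape (4, 4)

# Equivalent to plain fancy indexing.
assert np.array_equal(out, table[indices])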

tqchen (Member) commented Oct 30, 2020

Thanks @tkonolige. Given that this is a new op, it would be great to do an API review per https://tvm.apache.org/docs/contribute/code_review.html#deliberate-on-api-and-data-structures

In particular, it would be great to check out the conventions of similar APIs in existing frameworks like numpy, PyTorch, Keras, and TensorFlow; we should ideally follow common conventions. See the previous related discussion on nms in #2535.

altanh (Contributor) commented Oct 30, 2020

re: API review

Naming

All of the above APIs call it Embedding, so we may want to rename embed to embedding (although grammatically I do feel like "embed" is more correct).

Arguments

I don't think we need to pass in the vocabulary size or embedding dimension like these examples do, since we can infer them from the weight/table matrix (I imagine they use them for bookkeeping during training, which is a separate matter). Likewise, we can ignore anything related to weight initialization.

PyTorch has the following additional arguments (a short usage sketch follows the list):

  • padding_idx: int, index into embedding table that will always have 0 gradient, generally used for padding
  • scale_grad_by_freq: boolean, "scale gradients by the inverse of frequency of the words in the mini-batch." I believe this means the gradient update for index j will be divided by sum(indices == j) (count of j in input indices).
  • sparse: boolean, "gradient w.r.t. weight matrix will be a sparse tensor."
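
For reference, a short usage sketch of these PyTorch arguments (assuming a recent torch version); this only shows the surface API and is not a proposal for the TOPI signature.

import torch

# Vocab of 10, embedding dim of 4; index 0 reserved for padding.
emb = torch.nn.Embedding(
    num_embeddings=10,
    embedding_dim=4,
    padding_idx=0,             # the row for index 0 always gets a zero gradient
    scale_grad_by_freq=False,  # optionally scale grads by inverse index frequency
    sparse=False,              # if True, weight.grad is a sparse tensor
)

idx = torch.tensor([[1, 5, 5, 0]])
out = emb(idx)                 # shape (1, 4, 4)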

mxnet has:

  • sparse_grad: boolean, gradient for weight is row-sparse (probably the same as PyTorch's sparse above?)

TF/Keras has:

  • mask_zero: boolean, "whether or not the input value 0 is a special 'padding' value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1)." I don't fully understand this, but it seems similar to padding_idx from PyTorch while requiring TF-specific masking support; I prefer PyTorch's approach if they are equivalent (see the sketch after this list).
  • input_length: int, "Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed)." Again, this sounds like a TF/Keras design quirk.
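
For what it's worth, a quick sketch of how mask_zero behaves in Keras (assuming TF 2.x). The mask is a separate boolean tensor that downstream layers must consume, which is why it feels TF-specific.

import numpy as np
import tensorflow as tf

# input_dim must be vocab_size + 1 because index 0 is reserved for padding.
emb = tf.keras.layers.Embedding(input_dim=11, output_dim=4, mask_zero=True)

x = np.array([[3, 5, 0, 0]])   # trailing zeros are padding positions
out = emb(x)                   # shape (1, 4, 4); padding positions still produce vectors
mask = emb.compute_mask(x)     # [[True, True, False, False]]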

In my opinion, we should aim for PyTorch's API over TF/Keras's, but perhaps others can give more insight. We are also thinking about adding sparse gradient support, so it may be best to accept it as an attr now but raise an error until it is implemented.

Shapes

PyTorch and mxnet support arbitrary input shapes. In particular, if our embedding dimension is dim, the shape relation is (d1, ..., dn) -> (d1, ..., dn, dim).

TF/Keras is strange in that it uses (batch_size, input_length) -> (batch_size, input_length, dim), which just seems like a restricted case of the PyTorch and mxnet behavior.

This PR currently proposes (flat_length,) -> (flat_length, dim). Note that we can easily support the PyTorch and mxnet approach by flattening the indices and then "reshaping the first n dimensions": (d1,...,dn) -> (d1 * ... * dn) -> (d1 * ... * dn, dim) -> (d1,...,dn,dim). I imagine this should be easy to implement but I'm not too familiar with TOPI.
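
A minimal NumPy sketch of that flatten/lookup/reshape trick (illustration only, not TOPI code):

import numpy as np

table = np.random.randn(10, 4)                       # (vocab_size, dim)
indices = np.random.randint(0, 10, size=(2, 3, 5))   # arbitrary (d1, ..., dn)

flat = indices.reshape(-1)                           # (d1 * ... * dn,)
looked_up = np.take(table, flat, axis=0)             # (d1 * ... * dn, dim)
out = looked_up.reshape(indices.shape + (table.shape[1],))  # (d1, ..., dn, dim)

assert out.shape == (2, 3, 5, 4)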

altanh (Contributor) commented Oct 30, 2020

cc @antinucleon

tkonolige (Contributor, Author) commented

For this PR, we are just going to do the dense gradient. The sparse gradient will take some work, so we will add it in a later PR.
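
For concreteness, a NumPy sketch of what the dense gradient looks like (a reading of the semantics, not the code in this PR): the upstream gradient rows are scatter-added into a weight-shaped matrix, and PyTorch's scale_grad_by_freq would then divide each touched row by the count of that index.

import numpy as np

vocab, dim = 10, 4
indices = np.array([1, 5, 5, 0])           # (flat_length,)
grad_out = np.ones((indices.size, dim))    # upstream gradient, (flat_length, dim)

# Dense gradient w.r.t. the table: scatter-add rows of grad_out into a dense matrix.
grad_table = np.zeros((vocab, dim))
np.add.at(grad_table, indices, grad_out)   # row 5 accumulates two contributions

# scale_grad_by_freq (PyTorch) would then divide each row by its occurrence count.
counts = np.bincount(indices, minlength=vocab).reshape(-1, 1)
grad_table_scaled = grad_table / np.maximum(counts, 1)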

"""
s = te.create_schedule([outs[0].op])

vec_size = 8 # should autotune this, but we can't with hybrid script
Review comment (Contributor):
May I ask why 8? I am just wondering if we could reuse this schedule for the ARM back-end as well.

tkonolige (Contributor, Author) replied:
We could reuse it. I just didn't have a good way to figure out the width of the vector instructions.
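
One possible way to avoid the hard-coded 8, sketched under the assumption that the schedule can inspect the current target (the attribute and key names used here are assumptions, not a settled API):

import tvm

def choose_vec_size(default=8):
    # Hypothetical helper: pick a vector width from the current target, if one is set.
    target = tvm.target.Target.current(allow_none=True)
    if target is None:
        return default
    mcpu = getattr(target, "mcpu", "") or ""
    if mcpu in ("skylake-avx512", "cascadelake"):  # AVX-512: 16 fp32 lanes
        return 16
    if "arm_cpu" in target.keys:                   # 128-bit NEON: 4 fp32 lanes
        return 4
    return default                                 # assume AVX2-class x86: 8 fp32 lanes

Even so, an autotuning knob would be preferable to any hard-coded table once the schedule moves off hybrid script.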
