The soft one-hot encoding method was originally proposed to encode the duration of an event [1], but we propose its usage to encode all continuous features. It represents a scalar as a weighted sum over all embeddings for that feature. More formally, the scalar value $x_n$ of a continuous feature $n$ is projected onto a vector space as $p_n = x_n W_n + b_n$, where $W_n \in \mathbb{R}^{1 \times P}$ is the weight matrix, $b_n \in \mathbb{R}^{P}$ is the bias vector, and $P$ is the number of embeddings in the embedding table of feature $n$. Then, a softmax function is applied to the projection vector $p_n$, as $s_n = \mathrm{softmax}(p_n)$. Finally, the probability distribution obtained from the softmax is used to compute a weighted sum over the embedding space, $g_n = s_n E_n$, where $E_n \in \mathbb{R}^{P \times D}$ is the embedding matrix for feature $n$ and $D$ is its embedding size.
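The sketch below illustrates this soft one-hot encoding in PyTorch (the framework is an assumption; the names `SoftEmbedding`, `num_embeddings`, and `embedding_dim` are illustrative, not from the original). It implements the projection, softmax, and weighted sum over the embedding table described above.

```python
import torch
import torch.nn as nn


class SoftEmbedding(nn.Module):
    """Encodes a scalar feature as a softmax-weighted sum of P embeddings."""

    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        # Projection p_n = x_n * W_n + b_n, with W_n in R^{1 x P} and b_n in R^P
        self.projection = nn.Linear(1, num_embeddings)
        # Embedding table E_n in R^{P x D}
        self.embedding_table = nn.Parameter(torch.randn(num_embeddings, embedding_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar values of the continuous feature
        p = self.projection(x.unsqueeze(-1))   # (batch, P) projection vector
        s = torch.softmax(p, dim=-1)           # weights over the P embeddings
        g = s @ self.embedding_table           # (batch, D) weighted sum g_n = s_n E_n
        return g


# Usage: encode a batch of continuous values (e.g., normalized prices)
soft_emb = SoftEmbedding(num_embeddings=10, embedding_dim=64)
values = torch.tensor([0.2, 1.5, -0.7])
print(soft_emb(values).shape)  # torch.Size([3, 64])
```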
The NLP community has proposed this technique to tie the input embedding matrix with the output projection layer matrix [2, 3]. The main motivation is that, for language models, both the input and the output of the model are words, so they should lie in the same vector space. Analogously, recommender system models have item ids as both input and output. Deep recommender system models are generally memory-bound, with most of their parameters concentrated in large embedding tables [4]. In such a scenario, tying embeddings significantly reduces memory requirements by holding only one projection matrix for both the item and output representations. Additionally, rare item embeddings can benefit more from the output layer updates at each training step.
A reduction in the number of parameters is not the only benefit of weight tying, though. Under the recommender systems taxonomy, the tying embeddings technique introduces a matrix factorization operation between the item embeddings and the final representation of the user or session, as we demonstrate below. Formally, let $n$ be the number of items, $d$ the dimension of the item embeddings, and $U \in \mathbb{R}^{n \times d}$ the item embedding matrix. Consider a neural network with arbitrary layers that takes the input features, including the item embeddings, and outputs a vector of activations $h \in \mathbb{R}^{s}$. The output projection matrix $V \in \mathbb{R}^{s \times n}$ then maps $h$ to the logits $l \in \mathbb{R}^{n}$ for all items by computing $l = hV$. In order to tie the embeddings, we set $V = U^{\top}$ (which requires $s = d$), so that the item embedding matrix is shared with the output layer and $l = hV = hU^{\top}$. It is worth noting that GRU4Rec used weight tying (referred to as constrained embeddings) in its publicly available source code, but did not mention the technique or its benefits in the paper. In this work, we propose adding a bias $b \in \mathbb{R}^{n}$ to the output projection layer, making $l = hU^{\top} + b$, which gives the output an additional degree of freedom with respect to the input embeddings. We found that adding the bias slightly improved model accuracy, so we use this variation of tying embeddings.
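The following is a minimal sketch of tied item embeddings with the output bias described above, again assuming PyTorch; the model name `TiedOutputModel` and the GRU body standing in for the arbitrary network are illustrative choices, not prescribed by this work.

```python
import torch
import torch.nn as nn


class TiedOutputModel(nn.Module):
    def __init__(self, num_items: int, dim: int):
        super().__init__()
        # Item embedding matrix U in R^{n x d}, reused as the output projection
        self.item_embeddings = nn.Embedding(num_items, dim)
        # Stand-in for an arbitrary network producing h in R^d (requires s = d)
        self.body = nn.GRU(dim, dim, batch_first=True)
        # Output bias b in R^n, the additional degree of freedom proposed above
        self.output_bias = nn.Parameter(torch.zeros(num_items))

    def forward(self, item_ids: torch.Tensor) -> torch.Tensor:
        # item_ids: (batch, seq_len) of item indices
        x = self.item_embeddings(item_ids)   # (batch, seq_len, d)
        _, h = self.body(x)                  # h: (1, batch, d) last hidden state
        h = h.squeeze(0)                     # (batch, d)
        # Tied projection with bias: l = h U^T + b, logits over all items
        logits = h @ self.item_embeddings.weight.t() + self.output_bias
        return logits                        # (batch, n)


model = TiedOutputModel(num_items=1000, dim=64)
batch = torch.randint(0, 1000, (4, 20))
print(model(batch).shape)  # torch.Size([4, 1000])
```

Because the same `nn.Embedding` weight is used for both input lookup and output projection, gradients from the output layer flow into every item embedding at each step, which is the mechanism behind the benefit to rare items noted earlier.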
References
[1] Li, Yang, Nan Du, and Samy Bengio. "Time-dependent representation for neural event sequence prediction." arXiv preprint arXiv:1708.00065 (2017).
[2] Inan, Hakan, Khashayar Khosravi, and Richard Socher. "Tying word vectors and word classifiers: A loss framework for language modeling." arXiv preprint arXiv:1611.01462 (2016).
[3] Press, Ofir, and Lior Wolf. "Using the output embedding to improve language models." arXiv preprint arXiv:1608.05859 (2016).
[4] Zhang, Jian, Jiyan Yang, and Hector Yuen. "Training with low-precision embedding tables." Systems for Machine Learning Workshop at NeurIPS, 2018.